Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data on the Web
Search
Will Farrington
June 08, 2011
Technology
3
190
Data on the Web
It's an intro to data on the web for some folks new to web development.
Will Farrington
June 08, 2011
Tweet
Share
More Decks by Will Farrington
See All by Will Farrington
test-queue makes your tests run fast
wfarr
0
440
Incident Response Done Right: From First Page to Postmortem
wfarr
0
550
Boxen: PuppetConf 2013
wfarr
6
860
Puppet at GitHub: PuppetConf 2013
wfarr
21
2.2k
Puppet at GitHub (PuppetCamp Raleigh 2013)
wfarr
1
450
Boxen: PuppetCamp SF 2013
wfarr
5
810
Boxen: MWRC
wfarr
5
220
Boxen: PuppetCamp ATL
wfarr
0
270
BOXEN
wfarr
43
5.5k
Other Decks in Technology
See All in Technology
強化されたAmazon Location Serviceによる新機能と開発者体験
dayjournal
2
210
セキュリティの民主化は何故必要なのか_AWS WAF 運用の 10 の苦悩から学ぶ
yoh
1
160
Welcome to the LLM Club
koic
0
170
Fabric + Databricks 2025.6 の最新情報ピックアップ
ryomaru0825
1
140
Github Copilot エージェントモードで試してみた
ochtum
0
110
25分で解説する「最小権限の原則」を実現するための AWS「ポリシー」大全 / 20250625-aws-summit-aws-policy
opelab
9
1.1k
Amazon Bedrockで実現する 新たな学習体験
kzkmaeda
2
560
Clineを含めたAIエージェントを 大規模組織に導入し、投資対効果を考える / Introducing AI agents into your organization
i35_267
4
1.6k
~宇宙最速~2025年AWS Summit レポート
satodesu
1
1.8k
Uniadex__公開版_20250617-AIxIoTビジネス共創ラボ_ツナガルチカラ_.pdf
iotcomjpadmin
0
160
HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation
spatial_ai_network
0
110
生成AI時代の開発組織・技術・プロセス 〜 ログラスの挑戦と考察 〜
itohiro73
0
200
Featured
See All Featured
Speed Design
sergeychernyshev
32
1k
Embracing the Ebb and Flow
colly
86
4.7k
BBQ
matthewcrist
89
9.7k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
35
2.3k
The Pragmatic Product Professional
lauravandoore
35
6.7k
Docker and Python
trallard
44
3.4k
Testing 201, or: Great Expectations
jmmastey
42
7.5k
[RailsConf 2023] Rails as a piece of cake
palkan
55
5.6k
The Cost Of JavaScript in 2023
addyosmani
51
8.5k
Why Our Code Smells
bkeepers
PRO
337
57k
How to Think Like a Performance Engineer
csswizardry
24
1.7k
Statistics for Hackers
jakevdp
799
220k
Transcript
Data on the Web Will Farrington
File I/O
File I/O • Minimal reusability • No "correct" format •
Hard to maintain • Prone to problems caused by encoding changes
Standardize
CSV
Comma Separated Values
CSV • Used for tabular data • Small footprint •
Widely recognized and supported format • Many different flavors • Support in database systems and spreadsheets
Example CSV Id,Name,Desc,Points,Due 1,Homework 1,Nothing special,15,6/7/2011 15,"Project, número uno",,100,6/21/2001
XML
Extensible Markup Language
XML • Open, standard specification • Unicode-friendly • Came to
prominence with Java and .NET • Widely used on the web • Good at representing tree-like data
Example XML <?xml version="1.0" encoding="UTF-8"?> <statuses type="array"> <status> <created_at>Tue Jun
07 21:30:50 +0000 2011</created_at> <id>78212343649140736</id> <text>@skalnik Looks good.</text> <source><a href="http://itunes.apple.com/us/app/twitter/id409789998?mt=12" rel="nofollow">Twitter for Mac</a></source> <truncated>false</truncated> <favorited>false</favorited> <in_reply_to_status_id>78211453777231872</in_reply_to_status_id> <in_reply_to_user_id>15878923</in_reply_to_user_id> <in_reply_to_screen_name>skalnik</in_reply_to_screen_name> <retweet_count>0</retweet_count> <retweeted>false</retweeted> <user> <id>10403812</id> </user> <geo/> <coordinates/> <place/> <contributors/> </status> </statuses>
Criticisms of XML • Very verbose • Parsers can be
extremely complicated • Does not map well to some type systems • Does not represent highly structured data well
JSON
JavaScript Object Notation
JSON • Based on a subset of JavaScript circa 2003
• Lightweight • Simple to parse • Designed to be human-readable • Well-suited to structured data as well as trees
Example JSON [{ "coordinates":null, "created_at":"Tue Jun 07 21:30:50 +0000 2011",
"truncated":false, "favorited":false, "contributors":null, "text":"@skalnik Looks good.", "id":78212343649140736, "retweet_count":0, "geo":null, "retweeted":false, "in_reply_to_user_id":15878923, "source":"<a href=\"http://itunes.apple.com/us/app/twitter/id409789998?mt=12\" rel= \"nofollow\">Twitter for Mac</a>", "place":null, "in_reply_to_screen_name":"skalnik", "user":{"id":10403812}, "in_reply_to_status_id":78211453777231872 }]
More on JSON • eval() (is bad) • JSON.parse() •
Built-in browser support • Popular for AJAX: both single-domain and cross-domain
JSONP • JSON with Padding • Used for cross-domain requests
• Alternative to Cross-Origin Resource Sharing • Only supports GET
BSON • Binary JSON • Superset of JSON • Used
by MongoDB for storage of binary data
YAML
YAML Ain't Markup Language
YAML • Not often used over the network • Popular
for configuration files • Human-readable • Data-oriented • No execution means no injection
Example YAML --- - coordinates: created_at: Tue Jun 07 21:30:50
+0000 2011 truncated: false favorited: false contributors: text: "@skalnik Looks good." id: 78212343649140736 retweet_count: 0 geo: retweeted: false in_reply_to_user_id: 15878923 source: <a href="http://itunes.apple.com/us/app/twitter/id409789998?mt=12" rel="nofollow">Twitter for Mac</a> place: in_reply_to_screen_name: skalnik user: id: 10403812 in_reply_to_status_id: 78211453777231872
What to do with all these formats?
APIs
Application Programming Interfaces
APIs • Websites tell you what formats they support •
Websites document their URL structure • Developers use these APIs to integrate products • You can even consume your own APIs
But...
Not everyone offers APIs
What do?
Screen-scraping
Screen-scraping • Requests the full HTML for a page •
Parses out the content you want • Slow • Website layout may change and break yours
Demo!
Questions?
Will Farrington
[email protected]
http://speakerdeck.com/u/wfarr