Data on the Web
Will Farrington
June 08, 2011
It's an intro to data on the web for some folks new to web development.
Transcript
Data on the Web Will Farrington
File I/O
File I/O
• Minimal reusability
• No "correct" format
• Hard to maintain
• Prone to problems caused by encoding changes
Standardize
CSV
Comma Separated Values
CSV
• Used for tabular data
• Small footprint
• Widely recognized and supported format
• Many different flavors
• Support in database systems and spreadsheets
Example CSV
Id,Name,Desc,Points,Due
1,Homework 1,Nothing special,15,6/7/2011
15,"Project, número uno",,100,6/21/2001
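As a rough sketch of reading the sample above, Python's standard `csv` module can parse it, including the quoted field that contains a comma. The names `SAMPLE` and `read_assignments` are illustrative and not from the talk.

```python
import csv
import io

# The sample rows from the slide, kept verbatim.
SAMPLE = '''Id,Name,Desc,Points,Due
1,Homework 1,Nothing special,15,6/7/2011
15,"Project, número uno",,100,6/21/2001
'''


def read_assignments(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_assignments(SAMPLE)
for row in rows:
    # Every value arrives as a string; CSV carries no type information.
    print(row["Id"], row["Name"], row["Points"])
```

Note that the empty `Desc` field in the second row comes back as an empty string, and that the comma inside the quoted name does not split the field.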
XML
Extensible Markup Language
XML
• Open, standard specification
• Unicode-friendly
• Came to prominence with Java and .NET
• Widely used on the web
• Good at representing tree-like data
Example XML
<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
  <status>
    <created_at>Tue Jun 07 21:30:50 +0000 2011</created_at>
    <id>78212343649140736</id>
    <text>@skalnik Looks good.</text>
    <source><a href="http://itunes.apple.com/us/app/twitter/id409789998?mt=12" rel="nofollow">Twitter for Mac</a></source>
    <truncated>false</truncated>
    <favorited>false</favorited>
    <in_reply_to_status_id>78211453777231872</in_reply_to_status_id>
    <in_reply_to_user_id>15878923</in_reply_to_user_id>
    <in_reply_to_screen_name>skalnik</in_reply_to_screen_name>
    <retweet_count>0</retweet_count>
    <retweeted>false</retweeted>
    <user>
      <id>10403812</id>
    </user>
    <geo/>
    <coordinates/>
    <place/>
    <contributors/>
  </status>
</statuses>
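A minimal sketch of walking a tree like the one above, using Python's standard `xml.etree.ElementTree`. The sample is a trimmed copy of the slide's document, and `parse_statuses` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's sample; only a few fields are kept.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
  <status>
    <created_at>Tue Jun 07 21:30:50 +0000 2011</created_at>
    <id>78212343649140736</id>
    <text>@skalnik Looks good.</text>
    <retweet_count>0</retweet_count>
    <user>
      <id>10403812</id>
    </user>
    <geo/>
  </status>
</statuses>
"""


def parse_statuses(xml_bytes):
    """Return one dict per <status> element, converting numbers by hand."""
    root = ET.fromstring(xml_bytes)
    statuses = []
    for status in root.findall("status"):
        statuses.append({
            # XML text is untyped, so numeric fields need explicit int().
            "id": int(status.findtext("id")),
            "text": status.findtext("text"),
            "user_id": int(status.findtext("user/id")),
            "retweet_count": int(status.findtext("retweet_count")),
        })
    return statuses


statuses = parse_statuses(SAMPLE.encode("utf-8"))
print(statuses[0]["text"])
```

The explicit `int()` calls illustrate one cost of the format: every value is text until the consumer decides otherwise.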
Criticisms of XML
• Very verbose
• Parsers can be extremely complicated
• Does not map well to some type systems
• Does not represent highly structured data well
JSON
JavaScript Object Notation
JSON
• Based on a subset of JavaScript circa 2003
• Lightweight
• Simple to parse
• Designed to be human-readable
• Well-suited to structured data as well as trees
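To give a sense of how little work parsing takes, here is a small sketch using Python's standard `json` module on a trimmed version of the status data used elsewhere in this deck. The variable names are illustrative.

```python
import json

# Trimmed status record; field names and values follow the deck's sample.
SAMPLE = '''[{
  "coordinates": null,
  "created_at": "Tue Jun 07 21:30:50 +0000 2011",
  "text": "@skalnik Looks good.",
  "id": 78212343649140736,
  "retweet_count": 0,
  "retweeted": false,
  "user": {"id": 10403812}
}]'''

statuses = json.loads(SAMPLE)
first = statuses[0]

# JSON types map directly onto native ones: null -> None, false -> False,
# numbers -> int, objects -> dict, arrays -> list.
print(first["text"], first["user"]["id"], first["coordinates"])
```

Unlike the CSV and XML cases, no manual type conversion is needed after parsing.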
Example JSON
[{
  "coordinates":null,
  "created_at":"Tue Jun 07 21:30:50 +0000 2011",
  "truncated":false,
  "favorited":false,
  "contributors":null,
  "text":"@skalnik Looks good.",
  "id":78212343649140736,
  "retweet_count":0,
  "geo":null,
  "retweeted":false,
  "in_reply_to_user_id":15878923,
  "source":"<a href=\"http://itunes.apple.com/us/app/twitter/id409789998?mt=12\" rel=\"nofollow\">Twitter for Mac</a>",
  "place":null,
  "in_reply_to_screen_name":"skalnik",
  "user":{"id":10403812},
  "in_reply_to_status_id":78211453777231872
}]
More on JSON
• eval() (is bad)
• JSON.parse()
• Built-in browser support
• Popular for AJAX: both single-domain and cross-domain
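The slide's point concerns JavaScript's `eval()` versus `JSON.parse()`. The same idea can be sketched in Python as an analogy: a real JSON parser accepts only data, so a string containing code is rejected rather than executed. `parse_untrusted` is a hypothetical helper name.

```python
import json


def parse_untrusted(text):
    """Parse text as JSON; return None if it is not valid JSON.

    A JSON parser never executes its input, which is why it is preferred
    over eval() for data received from elsewhere.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


good = parse_untrusted('{"retweeted": false, "retweet_count": 0}')
# This string is code, not JSON. eval() would run it; the parser refuses it.
bad = parse_untrusted('__import__("os").getcwd()')

print(good, bad)
```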
JSONP
• JSON with Padding
• Used for cross-domain requests
• Alternative to Cross-Origin Resource Sharing
• Only supports GET
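The "padding" is a function call wrapped around the JSON, so the response can be loaded through a script tag, which is also why only GET is available. Below is a sketch of the server side under that assumption; `jsonp_body` and the callback name `handleStatus` are hypothetical, and the callback check is a precaution added here, not something from the talk.

```python
import json
import re

# Accept only plain (optionally dotted) JavaScript identifiers as callbacks,
# so a caller cannot smuggle arbitrary script into the response.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*")


def jsonp_body(callback, data):
    """Wrap JSON-serialized data in a call to the named callback."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid callback name: %r" % callback)
    return "%s(%s);" % (callback, json.dumps(data))


body = jsonp_body("handleStatus", {"id": 10403812})
print(body)  # handleStatus({"id": 10403812});
```

The browser executes the returned script, which calls the page's own `handleStatus` function with the data as its argument.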
BSON
• Binary JSON
• Superset of JSON
• Used by MongoDB for storage of binary data
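To make "binary JSON" concrete, here is a hand-rolled sketch that encodes a flat document following the published BSON layout: a little-endian int32 total length, a run of typed elements, and a trailing null byte. It handles only strings and integers (always written as int64) and is an illustration; real applications would use a BSON library. The function names are hypothetical.

```python
import struct


def encode_element(name, value):
    """Encode one key/value pair as a BSON element (str and int only)."""
    key = name.encode("utf-8") + b"\x00"          # keys are C strings
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # 0x02 = UTF-8 string: int32 byte length (including the null), bytes
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    if isinstance(value, int) and not isinstance(value, bool):
        # 0x12 = 64-bit signed integer, little-endian
        return b"\x12" + key + struct.pack("<q", value)
    raise TypeError("unsupported type in this sketch: %r" % type(value))


def encode_document(doc):
    """Encode a flat dict as a BSON document."""
    body = b"".join(encode_element(k, v) for k, v in doc.items())
    # Total length counts the 4-byte length prefix and the trailing null.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


encoded = encode_document({"hello": "world"})
print(encoded)
```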
YAML
YAML Ain't Markup Language
YAML
• Not often used over the network
• Popular for configuration files
• Human-readable
• Data-oriented
• No execution means no injection
Example YAML
---
- coordinates:
  created_at: Tue Jun 07 21:30:50 +0000 2011
  truncated: false
  favorited: false
  contributors:
  text: "@skalnik Looks good."
  id: 78212343649140736
  retweet_count: 0
  geo:
  retweeted: false
  in_reply_to_user_id: 15878923
  source: <a href="http://itunes.apple.com/us/app/twitter/id409789998?mt=12" rel="nofollow">Twitter for Mac</a>
  place:
  in_reply_to_screen_name: skalnik
  user:
    id: 10403812
  in_reply_to_status_id: 78211453777231872
What to do with all these formats?
APIs
Application Programming Interfaces
APIs
• Websites tell you what formats they support
• Websites document their URL structure
• Developers use these APIs to integrate products
• You can even consume your own APIs
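A rough sketch of what consuming such an API can look like: build a URL from the documented structure, ask for a supported format, and parse the body. The host `api.example.com`, the resource path, and the `.json` suffix convention are all assumptions made for illustration, and only JSON is handled here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base, resource, fmt, **params):
    """Compose base/resource.fmt?query from a documented URL structure."""
    url = "%s/%s.%s" % (base.rstrip("/"), resource, fmt)
    if params:
        url += "?" + urlencode(params)
    return url


def parse_body(body, fmt):
    """Decode a response body in the requested format (JSON only here)."""
    if fmt == "json":
        return json.loads(body)
    raise ValueError("format not handled in this sketch: %s" % fmt)


def fetch(base, resource, fmt="json", **params):
    """Perform the GET request and return parsed data (needs network)."""
    with urlopen(build_url(base, resource, fmt, **params)) as response:
        return parse_body(response.read().decode("utf-8"), fmt)


# Hypothetical endpoint; nothing is requested over the network here.
url = build_url("https://api.example.com/1/", "statuses/user_timeline",
                "json", screen_name="wfarr", count=1)
print(url)
print(parse_body('[{"text": "@skalnik Looks good."}]', "json"))
```

Keeping URL construction and body parsing separate from the network call makes both easy to check without a live service.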
But...
Not everyone offers APIs
What do?
Screen-scraping
Screen-scraping
• Requests the full HTML for a page
• Parses out the content you want
• Slow
• Website layout may change and break yours
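The "parse out the content you want" step can be sketched with Python's standard `html.parser`. The page markup, the `status` class name, and the helper names below are invented for illustration; a real scraper would first download the HTML and would break whenever the site renames that class, which is the fragility the slide warns about.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every <tag class="css_class"> element."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self._depth = 0      # > 0 while inside a matching element
        self._buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self._depth:
            self._depth += 1  # nested element with the same tag name
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.results.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def scrape(html, tag, css_class):
    """Return the text of each matching element in the HTML string."""
    extractor = ClassTextExtractor(tag, css_class)
    extractor.feed(html)
    return extractor.results


# Invented page; in practice this string would come from an HTTP request.
PAGE = ('<html><body><div class="nav">Home</div>'
        '<p class="status">@skalnik Looks <b>good</b>.</p>'
        '<p class="footer">bye</p></body></html>')

found = scrape(PAGE, "p", "status")
print(found)
```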
Demo!
Questions?
Will Farrington
[email protected]
http://speakerdeck.com/u/wfarr