Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data on the Web
Search
Will Farrington
June 08, 2011
Technology
210
3
Share
Data on the Web
It's an intro to data on the web for some folks new to web development.
Will Farrington
June 08, 2011
More Decks by Will Farrington
See All by Will Farrington
test-queue makes your tests run fast
wfarr
0
480
Incident Response Done Right: From First Page to Postmortem
wfarr
0
590
Boxen: PuppetConf 2013
wfarr
6
910
Puppet at GitHub: PuppetConf 2013
wfarr
21
2.2k
Puppet at GitHub (PuppetCamp Raleigh 2013)
wfarr
1
490
Boxen: PuppetCamp SF 2013
wfarr
5
1.1k
Boxen: MWRC
wfarr
5
280
Boxen: PuppetCamp ATL
wfarr
0
320
BOXEN
wfarr
43
5.6k
Other Decks in Technology
See All in Technology
知ってた?JavaScriptの"正しさ"を検証するテストが5万以上もあること(Test262)
riyaamemiya
1
190
ブラウザの投機的読み込みと投機ルールAPIを理解し、Webサービスのパフォーマンスを最適化する
shuta13
3
300
PdM・Eng・QAで進めるAI駆動開発の現在地/aidd-with-pdm-eng-qa
shota_kusaba
0
210
そのSLO 99.9%、本当に必要ですか? 〜優先度付きSLOによる責任共有の設計思想〜 / Is that 99.9% SLO really necessary? Design philosophy of shared responsibility through prioritized SLOs
vtryo
0
600
全社統制を維持しながら現場負担をどう減らすか〜プラットフォームチームとセキュリティチームで進めたSecurity Hub活用によるAWS統制の見直し〜/secjaws-security-hub-custom-insights
mhrtech
1
410
Agent Skillsで実現する記憶領域の運用とその後
yamadashy
2
1.8k
なぜ、私がCommunity Builderに?〜活動期間1か月半でも選出されたワケ〜
yama3133
0
120
Oracle Cloud Infrastructure presents managed, serverless MCP Servers for Oracle AI Database
thatjeffsmith
0
230
AI飲み会幹事エージェントを作っただけなのに
ykimi
0
170
拝啓、あの夏の僕へ〜あなたも知っているApp Runnerの世界〜
news_it_enj
0
240
"うちにはまだ早い"は本当? ─ 小さく始めるPlatform Engineering入門
harukasakihara
5
510
「背中を見て育て」からの卒業 〜専門技術としてのテスト設計を軸に、品質保証のバトンを繋ぐ〜 #genda_tech_talk
nihonbuson
PRO
3
1.3k
Featured
See All Featured
Exploring anti-patterns in Rails
aemeredith
3
350
Avoiding the “Bad Training, Faster” Trap in the Age of AI
tmiket
0
140
Organizational Design Perspectives: An Ontology of Organizational Design Elements
kimpetersen
PRO
1
690
Tips & Tricks on How to Get Your First Job In Tech
honzajavorek
1
500
Imperfection Machines: The Place of Print at Facebook
scottboms
270
14k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
25
1.9k
Connecting the Dots Between Site Speed, User Experience & Your Business [WebExpo 2025]
tammyeverts
11
910
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
PRO
199
73k
How STYLIGHT went responsive
nonsquared
100
6.1k
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
Design of three-dimensional binary manipulators for pick-and-place task avoiding obstacles (IECON2024)
konakalab
0
420
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3.4k
Transcript
Data on the Web Will Farrington
File I/O
File I/O • Minimal reusability • No "correct" format •
Hard to maintain • Prone to problems caused by encoding changes
Standardize
CSV
Comma Separated Values
CSV • Used for tabular data • Small footprint •
Widely recognized and supported format • Many different flavors • Support in database systems and spreadsheets
Example CSV Id,Name,Desc,Points,Due 1,Homework 1,Nothing special,15,6/7/2011 15,"Project, número uno",,100,6/21/2001
XML
Extensible Markup Language
XML • Open, standard specification • Unicode-friendly • Came to
prominence with Java and .NET • Widely used on the web • Good at representing tree-like data
Example XML <?xml version="1.0" encoding="UTF-8"?> <statuses type="array"> <status> <created_at>Tue Jun
07 21:30:50 +0000 2011</created_at> <id>78212343649140736</id> <text>@skalnik Looks good.</text> <source><a href="http://itunes.apple.com/us/app/twitter/id409789998?mt=12" rel="nofollow">Twitter for Mac</a></source> <truncated>false</truncated> <favorited>false</favorited> <in_reply_to_status_id>78211453777231872</in_reply_to_status_id> <in_reply_to_user_id>15878923</in_reply_to_user_id> <in_reply_to_screen_name>skalnik</in_reply_to_screen_name> <retweet_count>0</retweet_count> <retweeted>false</retweeted> <user> <id>10403812</id> </user> <geo/> <coordinates/> <place/> <contributors/> </status> </statuses>
Criticisms of XML • Very verbose • Parsers can be
extremely complicated • Does not map well to some type systems • Does not represent highly structured data well
JSON
JavaScript Object Notation
JSON • Based on a subset of JavaScript circa 2003
• Lightweight • Simple to parse • Designed to be human-readable • Well-suited to structured data as well as trees
Example JSON [{ "coordinates":null, "created_at":"Tue Jun 07 21:30:50 +0000 2011",
"truncated":false, "favorited":false, "contributors":null, "text":"@skalnik Looks good.", "id":78212343649140736, "retweet_count":0, "geo":null, "retweeted":false, "in_reply_to_user_id":15878923, "source":"<a href=\"http://itunes.apple.com/us/app/twitter/id409789998?mt=12\" rel= \"nofollow\">Twitter for Mac</a>", "place":null, "in_reply_to_screen_name":"skalnik", "user":{"id":10403812}, "in_reply_to_status_id":78211453777231872 }]
More on JSON • eval() (is bad) • JSON.parse() •
Built-in browser support • Popular for AJAX: both single-domain and cross-domain
JSONP • JSON with Padding • Used for cross-domain requests
• Alternative to Cross-Origin Resource Sharing • Only supports GET
BSON • Binary JSON • Superset of JSON • Used
by MongoDB for storage of binary data
YAML
YAML Ain't Markup Language
YAML • Not often used over the network • Popular
for configuration files • Human-readable • Data-oriented • No execution means no injection
Example YAML --- - coordinates: created_at: Tue Jun 07 21:30:50
+0000 2011 truncated: false favorited: false contributors: text: "@skalnik Looks good." id: 78212343649140736 retweet_count: 0 geo: retweeted: false in_reply_to_user_id: 15878923 source: <a href="http://itunes.apple.com/us/app/twitter/id409789998?mt=12" rel="nofollow">Twitter for Mac</a> place: in_reply_to_screen_name: skalnik user: id: 10403812 in_reply_to_status_id: 78211453777231872
What to do with all these formats?
APIs
Application Programming Interfaces
APIs • Websites tell you what formats they support •
Websites document their URL structure • Developers use these APIs to integrate products • You can even consume your own APIs
But...
Not everyone offers APIs
What do?
Screen-scraping
Screen-scraping • Requests the full HTML for a page •
Parses out the content you want • Slow • Website layout may change and break yours
Demo!
Questions?
Will Farrington
[email protected]
http://speakerdeck.com/u/wfarr