Slide 1

Slide 1 text

Data on the Web
Will Farrington

Slide 2

Slide 2 text

File I/O

Slide 3

Slide 3 text

File I/O
• Minimal reusability
• No "correct" format
• Hard to maintain
• Prone to problems caused by encoding changes

Slide 4

Slide 4 text

Standardize

Slide 5

Slide 5 text

CSV

Slide 6

Slide 6 text

Comma Separated Values

Slide 7

Slide 7 text

CSV
• Used for tabular data
• Small footprint
• Widely recognized and supported format
• Many different flavors
• Support in database systems and spreadsheets

Slide 8

Slide 8 text

Example CSV
Id,Name,Desc,Points,Due
1,Homework 1,Nothing special,15,6/7/2011
15,"Project, número uno",,100,6/21/2001
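As a rough sketch of why a standard format helps, the rows on this slide can be read with Python's standard `csv` module, which already understands the quoted field containing a comma and the empty `Desc` field. The names `CSV_TEXT` and `read_assignments` are made up for this illustration.

```python
import csv
import io

# The rows from the slide, including the quoted field with an embedded comma.
CSV_TEXT = '''Id,Name,Desc,Points,Due
1,Homework 1,Nothing special,15,6/7/2011
15,"Project, número uno",,100,6/21/2001
'''

def read_assignments(text):
    """Return each data row as a dict keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = read_assignments(CSV_TEXT)
for row in rows:
    # Every value arrives as a string; CSV itself carries no type information.
    print(row["Id"], repr(row["Name"]), repr(row["Desc"]), row["Points"])
```

Splitting each line on "," by hand would break the second data row, which is the kind of problem the "many different flavors" bullet hints at.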

Slide 9

Slide 9 text

XML

Slide 10

Slide 10 text

Extensible Markup Language

Slide 11

Slide 11 text

XML
• Open, standard specification
• Unicode-friendly
• Came to prominence with Java and .NET
• Widely used on the web
• Good at representing tree-like data

Slide 12

Slide 12 text

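Example XML
<status>
  <created_at>Tue Jun 07 21:30:50 +0000 2011</created_at>
  <id>78212343649140736</id>
  <text>@skalnik Looks good.</text>
  <source>&lt;a href="http://itunes.apple.com/us/app/twitter/id409789998?mt=12" rel="nofollow"&gt;Twitter for Mac&lt;/a&gt;</source>
  <truncated>false</truncated>
  <favorited>false</favorited>
  <in_reply_to_status_id>78211453777231872</in_reply_to_status_id>
  <in_reply_to_user_id>15878923</in_reply_to_user_id>
  <in_reply_to_screen_name>skalnik</in_reply_to_screen_name>
  <retweet_count>0</retweet_count>
  <retweeted>false</retweeted>
  <user>
    <id>10403812</id>
  </user>
</status>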
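Reading values out of a status document like this one might look like the sketch below, using Python's standard `xml.etree.ElementTree`. The element names are assumptions that mirror the keys of the JSON example later in the deck, and the `status` root element is hypothetical; only a few fields are included.

```python
import xml.etree.ElementTree as ET

# Assumed layout: element names mirror the JSON example's keys,
# and the <status> root is a hypothetical wrapper.
XML_TEXT = """<status>
  <created_at>Tue Jun 07 21:30:50 +0000 2011</created_at>
  <id>78212343649140736</id>
  <text>@skalnik Looks good.</text>
  <retweet_count>0</retweet_count>
  <user>
    <id>10403812</id>
  </user>
</status>"""

status = ET.fromstring(XML_TEXT)

# XML values are text, so numbers have to be converted explicitly.
tweet_id = int(status.findtext("id"))
user_id = int(status.findtext("user/id"))  # path into the nested <user> element
text = status.findtext("text")

print(tweet_id, user_id, text)
```

The nested `user/id` lookup shows the tree-like shape XML is good at; the explicit `int(...)` calls show where it does not map directly onto a type system.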

Slide 13

Slide 13 text

Criticisms of XML
• Very verbose
• Parsers can be extremely complicated
• Does not map well to some type systems
• Does not represent highly structured data well

Slide 14

Slide 14 text

JSON

Slide 15

Slide 15 text

JavaScript Object Notation

Slide 16

Slide 16 text

JSON
• Based on a subset of JavaScript circa 2003
• Lightweight
• Simple to parse
• Designed to be human-readable
• Well-suited to structured data as well as trees

Slide 17

Slide 17 text

Example JSON
[{
  "coordinates":null,
  "created_at":"Tue Jun 07 21:30:50 +0000 2011",
  "truncated":false,
  "favorited":false,
  "contributors":null,
  "text":"@skalnik Looks good.",
  "id":78212343649140736,
  "retweet_count":0,
  "geo":null,
  "retweeted":false,
  "in_reply_to_user_id":15878923,
  "source":"Twitter for Mac",
  "place":null,
  "in_reply_to_screen_name":"skalnik",
  "user":{"id":10403812},
  "in_reply_to_status_id":78211453777231872
}]
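A minimal sketch of how directly JSON maps onto native data structures, using Python's standard `json` module on a trimmed copy of the slide's example. The variable names are illustrative only.

```python
import json

# A trimmed copy of the example above: one status inside an array.
JSON_TEXT = """[{
  "text": "@skalnik Looks good.",
  "id": 78212343649140736,
  "retweet_count": 0,
  "geo": null,
  "retweeted": false,
  "user": {"id": 10403812}
}]"""

statuses = json.loads(JSON_TEXT)
first = statuses[0]

# Arrays become lists, objects become dicts, and null/false/numbers
# become None/False/int without any manual conversion.
print(type(statuses).__name__, type(first).__name__)
print(first["geo"], first["retweeted"], first["user"]["id"])
```

Unlike the XML case, no explicit string-to-number conversion is needed, which is part of what "simple to parse" means in practice.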

Slide 18

Slide 18 text

More on JSON
• eval() (is bad)
• JSON.parse()
• Built-in browser support
• Popular for AJAX: both single-domain and cross-domain
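The eval() versus JSON.parse() point is about JavaScript, but the same idea can be sketched in Python: a dedicated JSON parser only ever produces data, so input that is really code is rejected instead of being run. `parse_untrusted` is a hypothetical helper name.

```python
import json

def parse_untrusted(text):
    """Parse text as JSON; return None if it is not valid JSON.

    Nothing in the input is ever executed, which is the property
    that evaluating the text as code does not have.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

safe = parse_untrusted('{"retweeted": false, "retweet_count": 0}')
hostile = parse_untrusted('__import__("os").getcwd()')  # code, not data

print(safe)
print(hostile)
```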

Slide 19

Slide 19 text

JSONP
• JSON with Padding
• Used for cross-domain requests
• Alternative to Cross-Origin Resource Sharing
• Only supports GET
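To make the "padding" concrete: a JSONP response is ordinary JSON wrapped in a call to a function whose name the client chose, so that a script tag can load and run it. The sketch below builds and unwraps such a response in Python; `wrap_jsonp`, `unwrap_jsonp`, and the callback name `handleTweets` are hypothetical.

```python
import json
import re

# callback name, "(", payload, ")", optional ";"
_JSONP = re.compile(r"^\s*([A-Za-z_$][\w$.]*)\s*\((.*)\)\s*;?\s*$", re.DOTALL)

def wrap_jsonp(callback, data):
    """What a server does: pad the JSON with a function call."""
    return f"{callback}({json.dumps(data)});"

def unwrap_jsonp(body):
    """Strip the padding and parse the JSON inside it."""
    match = _JSONP.match(body)
    if match is None:
        raise ValueError("not a JSONP response")
    callback, payload = match.groups()
    return callback, json.loads(payload)

body = wrap_jsonp("handleTweets", [{"id": 10403812}])
callback, data = unwrap_jsonp(body)
print(body)
print(callback, data)
```

Because the browser fetches the response by loading a script URL, the request can only be a GET, which is where the last bullet comes from.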

Slide 20

Slide 20 text

BSON
• Binary JSON
• Superset of JSON's data types (adds dates, binary data, and more)
• Used by MongoDB to store documents in binary form
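For a feel of what "binary JSON" means on the wire, here is a deliberately tiny encoder sketch covering only strings, booleans, 64-bit integers, raw bytes, and null, following the layout in the public BSON specification. It is an illustration, not MongoDB's implementation; the function names are made up, and every Python int is written as a 64-bit integer for simplicity.

```python
import struct

def _cstring(name):
    # Element names are stored as NUL-terminated UTF-8.
    return name.encode("utf-8") + b"\x00"

def encode_element(name, value):
    """Encode one key/value pair as a BSON element (small subset of types)."""
    if value is None:
        return b"\x0a" + _cstring(name)                       # null
    if isinstance(value, bool):                               # check before int
        return b"\x08" + _cstring(name) + (b"\x01" if value else b"\x00")
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        return b"\x02" + _cstring(name) + struct.pack("<i", len(data)) + data
    if isinstance(value, int):
        return b"\x12" + _cstring(name) + struct.pack("<q", value)  # int64
    if isinstance(value, bytes):
        # int32 length, subtype byte (0x00 = generic), then the raw bytes
        return b"\x05" + _cstring(name) + struct.pack("<i", len(value)) + b"\x00" + value
    raise TypeError(f"unsupported type: {type(value).__name__}")

def encode_document(doc):
    """Encode a flat dict: int32 total size, elements, trailing NUL."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

encoded = encode_document({"hello": "world"})
print(encoded)
```

The raw `bytes` case is something plain JSON cannot express without an extra encoding step such as Base64.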

Slide 21

Slide 21 text

YAML

Slide 22

Slide 22 text

YAML Ain't Markup Language

Slide 23

Slide 23 text

YAML
• Not often used over the network
• Popular for configuration files
• Human-readable
• Data-oriented
• No execution means no injection

Slide 24

Slide 24 text

Example YAML
---
- coordinates:
  created_at: Tue Jun 07 21:30:50 +0000 2011
  truncated: false
  favorited: false
  contributors:
  text: "@skalnik Looks good."
  id: 78212343649140736
  retweet_count: 0
  geo:
  retweeted: false
  in_reply_to_user_id: 15878923
  source: Twitter for Mac
  place:
  in_reply_to_screen_name: skalnik
  user:
    id: 10403812
  in_reply_to_status_id: 78211453777231872

Slide 25

Slide 25 text

What to do with all these formats?

Slide 26

Slide 26 text

APIs

Slide 27

Slide 27 text

Application Programming Interfaces

Slide 28

Slide 28 text

APIs
• Websites tell you what formats they support
• Websites document their URL structure
• Developers use these APIs to integrate products
• You can even consume your own APIs
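As a sketch of what "documented URL structure" and "supported formats" mean for a client, the helper below assembles a request URL from a base, a resource path, a format extension, and query parameters. The host `api.example.com`, the resource path, the parameter names, and `build_api_url` itself are all hypothetical; a real API's documentation defines the actual scheme.

```python
from urllib.parse import urlencode

def build_api_url(base_url, resource, fmt, **params):
    """Build '<base>/<resource>.<fmt>?<query>' for a hypothetical API.

    Many APIs of this style pick the response format (json, xml, ...)
    from the extension; that convention is assumed here.
    """
    path = f"{base_url.rstrip('/')}/{resource.strip('/')}.{fmt}"
    query = urlencode(sorted(params.items()))  # sorted for a stable URL
    return f"{path}?{query}" if query else path

url = build_api_url(
    "https://api.example.com/1",
    "statuses/user_timeline",
    "json",
    screen_name="skalnik",
    count=5,
)
print(url)
```

Switching the `fmt` argument from "json" to "xml" is all a client would change to ask such an API for a different representation of the same data.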

Slide 29

Slide 29 text

But...

Slide 30

Slide 30 text

Not everyone offers APIs

Slide 31

Slide 31 text

What do?

Slide 32

Slide 32 text

Screen-scraping

Slide 33

Slide 33 text

Screen-scraping
• Requests the full HTML for a page
• Parses out the content you want
• Slow
• Website layout may change and break yours
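The "parses out the content you want" step can be sketched with Python's standard `html.parser`. The page below is an inline stand-in for HTML that would normally be fetched over HTTP, and the choice of `<h2 class="headline">` as the target, like the `HeadlineScraper` name, is purely an assumption for the example.

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Collect the text of every <h2> whose class list contains 'headline'."""

    def __init__(self):
        super().__init__()
        self._in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "headline" in classes:
            self._in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_headline = False

    def handle_data(self, data):
        # Text inside nested tags (such as <em>) is still part of the headline.
        if self._in_headline:
            self.headlines[-1] += data

# Stand-in for a downloaded page.
PAGE = """<html><body>
<div id="nav">Home</div>
<h2 class="headline">First story</h2>
<p>Body text</p>
<h2 class="headline big">Second <em>story</em></h2>
<h2>Not a headline</h2>
</body></html>"""

scraper = HeadlineScraper()
scraper.feed(PAGE)
scraper.close()
headlines = [h.strip() for h in scraper.headlines]
print(headlines)
```

If the site renamed the `headline` class or switched to a different tag, this scraper would silently return nothing, which is the fragility the last bullet warns about.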

Slide 34

Slide 34 text

Demo!

Slide 35

Slide 35 text

Questions?

Slide 36

Slide 36 text

Will Farrington
will@railsmachine.com
http://speakerdeck.com/u/wfarr