Data on the Web

It's an intro to data on the web for some folks new to web development.

Will Farrington

June 08, 2011

Transcript

  1. File I/O • Minimal reusability • No "correct" format • Hard to maintain • Prone to problems caused by encoding changes
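
The encoding point on this slide can be made concrete with a small sketch. It assumes a writer that saves text as UTF-8 and a reader that guesses Latin-1; the sample string is illustrative and not from the deck.

```python
# The same bytes read back under two different encoding assumptions.
text = "caf\u00e9"              # "cafe" with an accented final letter
raw = text.encode("utf-8")      # bytes a UTF-8 writer would put in the file

right = raw.decode("utf-8")     # reader agrees with the writer
wrong = raw.decode("latin-1")   # reader assumes a different encoding

print(right == text)  # the round trip is lossless when encodings match
print(wrong == text)  # the text is silently garbled when they do not
```

Nothing in the raw bytes records which encoding produced them, which is one reason ad hoc file formats are fragile.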
  2. CSV

  3. CSV • Used for tabular data • Small footprint • Widely recognized and supported format • Many different flavors • Support in database systems and spreadsheets
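
As a rough illustration of the "many different flavors" point, the sketch below reads the same table in a comma-separated and a semicolon-separated flavor using Python's standard csv module. The helper name `parse_table` and the column names are assumptions made for this example; the row reuses a user id and screen name that appear in the deck's sample status.

```python
import csv
import io

def parse_table(text, delimiter=","):
    """Parse delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

# Two flavors of the same table: only the delimiter differs.
comma_flavor = "id,screen_name\n15878923,skalnik\n"
semicolon_flavor = "id;screen_name\n15878923;skalnik\n"

rows_a = parse_table(comma_flavor)
rows_b = parse_table(semicolon_flavor, delimiter=";")
print(rows_a == rows_b)
```

Every value comes back as a string; CSV itself carries no type information.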
  4. XML

  5. XML • Open, standard specification • Unicode-friendly • Came to prominence with Java and .NET • Widely used on the web • Good at representing tree-like data
  6. Example XML

    <?xml version="1.0" encoding="UTF-8"?>
    <statuses type="array">
      <status>
        <created_at>Tue Jun 07 21:30:50 +0000 2011</created_at>
        <id>78212343649140736</id>
        <text>@skalnik Looks good.</text>
        <source>&lt;a href="http://itunes.apple.com/us/app/twitter/id409789998?mt=12" rel="nofollow"&gt;Twitter for Mac&lt;/a&gt;</source>
        <truncated>false</truncated>
        <favorited>false</favorited>
        <in_reply_to_status_id>78211453777231872</in_reply_to_status_id>
        <in_reply_to_user_id>15878923</in_reply_to_user_id>
        <in_reply_to_screen_name>skalnik</in_reply_to_screen_name>
        <retweet_count>0</retweet_count>
        <retweeted>false</retweeted>
        <user>
          <id>10403812</id>
        </user>
        <geo/>
        <coordinates/>
        <place/>
        <contributors/>
      </status>
    </statuses>
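
A minimal sketch of reading a trimmed copy of this document with Python's standard xml.etree.ElementTree module. The function name `parse_statuses` and the choice of fields are assumptions for illustration. Note that every element's content arrives as text, so numbers, booleans, and empty values have to be converted by hand.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the status document shown on the slide.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
  <status>
    <id>78212343649140736</id>
    <text>@skalnik Looks good.</text>
    <truncated>false</truncated>
    <user>
      <id>10403812</id>
    </user>
    <geo/>
  </status>
</statuses>"""

def parse_statuses(xml_bytes):
    """Turn each <status> element into a dict with converted types."""
    root = ET.fromstring(xml_bytes)
    result = []
    for status in root.findall("status"):
        result.append({
            "id": int(status.findtext("id")),                    # text -> int
            "text": status.findtext("text"),
            "truncated": status.findtext("truncated") == "true",  # text -> bool
            "user_id": int(status.findtext("user/id")),
            "geo": status.findtext("geo") or None,                # empty -> None
        })
    return result

statuses = parse_statuses(XML)
print(statuses)
```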
  7. Criticisms of XML • Very verbose • Parsers can be extremely complicated • Does not map well to some type systems • Does not represent highly structured data well
  8. JSON • Based on a subset of JavaScript circa 2003 • Lightweight • Simple to parse • Designed to be human-readable • Well-suited to structured data as well as trees
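
To illustrate "simple to parse", here is a short sketch using Python's standard json module on a few fields taken from the deck's sample status. The variable names are assumptions for this example. Unlike text-only formats, JSON values map directly onto native types.

```python
import json

# A subset of the fields from the deck's sample status.
DOC = (
    '[{"coordinates": null, "truncated": false, '
    '"text": "@skalnik Looks good.", "id": 78212343649140736, '
    '"retweet_count": 0, "user": {"id": 10403812}}]'
)

statuses = json.loads(DOC)   # one call, no schema or handler classes
first = statuses[0]

print(type(first["id"]).__name__)   # JSON number -> int
print(first["coordinates"])         # JSON null -> None
print(first["user"]["id"])          # nested object -> nested dict
```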
  9. Example JSON

    [{
      "coordinates":null,
      "created_at":"Tue Jun 07 21:30:50 +0000 2011",
      "truncated":false,
      "favorited":false,
      "contributors":null,
      "text":"@skalnik Looks good.",
      "id":78212343649140736,
      "retweet_count":0,
      "geo":null,
      "retweeted":false,
      "in_reply_to_user_id":15878923,
      "source":"<a href=\"http://itunes.apple.com/us/app/twitter/id409789998?mt=12\" rel=\"nofollow\">Twitter for Mac</a>",
      "place":null,
      "in_reply_to_screen_name":"skalnik",
      "user":{"id":10403812},
      "in_reply_to_status_id":78211453777231872
    }]
  10. More on JSON • eval() (is bad) • JSON.parse() • Built-in browser support • Popular for AJAX: both single-domain and cross-domain
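
The slide refers to JavaScript's eval() and JSON.parse(). The sketch below shows the same idea in Python as an analogy, not as the browser API: a dedicated JSON parser accepts data only, so input that is really code gets rejected instead of executed. The helper name `parse_untrusted` is an assumption for this example.

```python
import json

def parse_untrusted(payload):
    """Return the parsed value, or None if the payload is not valid JSON."""
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None

good = parse_untrusted('{"retweeted": false}')
# This string is an expression, not data. eval() would run it;
# the JSON parser refuses it.
bad = parse_untrusted('__import__("os").getcwd()')

print(good)
print(bad)
```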
  11. JSONP • JSON with Padding • Used for cross-domain requests • Alternative to Cross-Origin Resource Sharing • Only supports GET
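
The "padding" is a function call wrapped around the JSON so the response can be loaded through a script element, which is also why only GET is available. Below is a minimal server-side sketch; the function name `to_jsonp`, the callback name, and the validation pattern are assumptions for illustration, not a prescribed implementation.

```python
import json
import re

# Only plain identifiers are accepted as callback names, so a caller
# cannot smuggle arbitrary script into the response.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")

def to_jsonp(data, callback):
    """Wrap JSON-serializable data in a call to the named callback."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("unsafe callback name")
    return f"{callback}({json.dumps(data)});"

body = to_jsonp({"id": 10403812}, "handleUser")
print(body)
```

The page that requested the script is expected to define the callback (here the hypothetical `handleUser`) before the response arrives.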
  12. BSON • Binary JSON • Superset of JSON • Used by MongoDB for storage of binary data
  13. YAML • Not often used over the network • Popular for configuration files • Human-readable • Data-oriented • No execution means no injection
  14. Example YAML

    ---
    - coordinates:
      created_at: Tue Jun 07 21:30:50 +0000 2011
      truncated: false
      favorited: false
      contributors:
      text: "@skalnik Looks good."
      id: 78212343649140736
      retweet_count: 0
      geo:
      retweeted: false
      in_reply_to_user_id: 15878923
      source: <a href="http://itunes.apple.com/us/app/twitter/id409789998?mt=12" rel="nofollow">Twitter for Mac</a>
      place:
      in_reply_to_screen_name: skalnik
      user:
        id: 10403812
      in_reply_to_status_id: 78211453777231872
  15. APIs • Websites tell you what formats they support • Websites document their URL structure • Developers use these APIs to integrate products • You can even consume your own APIs
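
A small sketch of what a documented URL structure with a selectable format can look like from the client side. The host `api.example.com`, the resource path, and the helper name `build_api_url` are all hypothetical; a real API's documentation defines the actual values.

```python
from urllib.parse import urlencode

def build_api_url(base, resource, fmt, **params):
    """Build '<base>/<resource>.<fmt>?<query>' for a hypothetical API."""
    url = f"{base.rstrip('/')}/{resource}.{fmt}"
    if params:
        url += "?" + urlencode(params)
    return url

# Same resource, two of the formats covered in this deck.
json_url = build_api_url("https://api.example.com/", "statuses/user_timeline",
                         "json", user_id=10403812)
xml_url = build_api_url("https://api.example.com/", "statuses/user_timeline",
                        "xml", user_id=10403812)
print(json_url)
print(xml_url)
```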
  16. Screen-scraping • Requests the full HTML for a page • Parses out the content you want • Slow • Website layout may change and break your scraper
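
The fragility described on this slide can be sketched with Python's standard html.parser module. The markup, the `status-text` class name, and the `scrape` helper are invented for this example; the page is an inline string rather than a real request.

```python
from html.parser import HTMLParser

class StatusScraper(HTMLParser):
    """Collect the text of every <p class="status-text"> element."""

    def __init__(self):
        super().__init__()
        self._in_status = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "status-text":
            self._in_status = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_status = False

    def handle_data(self, data):
        if self._in_status:
            self.texts[-1] += data

def scrape(html):
    scraper = StatusScraper()
    scraper.feed(html)
    return scraper.texts

# The layout the scraper was written against (hypothetical markup).
old_page = ('<div id="timeline"><p class="status-text">@skalnik Looks good.</p>'
            '<p class="meta">Twitter for Mac</p></div>')
# After a redesign: same content, different tag and class.
new_page = '<div id="timeline"><span class="tweet">@skalnik Looks good.</span></div>'

old_result = scrape(old_page)
new_result = scrape(new_page)
print(old_result)
print(new_result)
```

The scraper does not fail loudly after the redesign; it simply finds nothing, which makes this kind of breakage easy to miss.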