Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Taming JSON with SQL

Greg Rahn
October 26, 2015

Taming JSON with SQL

Talk for OakTable World 2015

Greg Rahn

October 26, 2015
Tweet

More Decks by Greg Rahn

Other Decks in Technology

Transcript

  1. Taming JSON with SQL: From Raw to Results OakTable World

    2015 | #otw15 2015-10-26 Greg Rahn | @GregRahn 1
  2. About Me { "Name": "Greg Rahn", "Twitter": "@GregRahn", "CompaniesWorked": [

    { "Company": "Snowflake Computing", "NumberOfYears": 2.0 }, { "Company": "Cloudera", "NumberOfYears": 1.5 }, { "Company": "Oracle", "NumberOfYears": 8.0 } ] } 2
  3. 3

  4. Why Is JSON So Popular? • JavaScript is (has been)

    eating the world • Atwood’s Law: any application that can be written in JavaScript, will eventually be written in JavaScript. • REST APIs produce/consume data in JSON format (IoT, Mobile) • Application logs (feeding into Logstash or Elasticsearch) • Have you ever tried to work with (or read) XML? • A self-describing text format makes data very portable • Open Datasets (data.gov, datasf.org, data.cityofnewyork.us, Yelp) 4
  5. MongoDB Aggregation vs. SQL db.zipcodes.aggregate( [ { $group: { _id:

    "$state", totalPop: { $sum: "$pop" } } }, { $match: { totalPop: { $gte: 10*1000*1000 } } } ] ) SELECT state, SUM(pop) AS totalPop FROM zipcodes GROUP BY state HAVING totalPop >= (10*1000*1000) 6
  6. Example Data Pipeline 9 Land Load Discovery & Apply Schema

    Refinement & Augmentation Publish Query
  7. Challenges • Get data in w/o defining a schema •

    Performing analytics on text files or string fields is slow • Variable schema data does not fit well with many formats • Need ANSI SQL support and useful extensions & functions for JSON 11
  8. Thinking Differently • Native ingest support for JSON, Avro, &

    XML • No schema required — discovered on ingest • VARIANT data type • Optimized storage for both relational and complex types • Optimizations used across both formats 13
  9. Self-Defined Redefined • We can land and ingest "raw" JSON

    • We can have data pipelines with both self-defined and pre-defined schemas • We can have SQL on self-defining data and complex types (JSON, etc.) • We can be agile without building fragile systems — taking the best of both worlds • Have your JSON and SQL too! 17