Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Taming JSON with SQL

4ecabc55db2222a77aac4e27f7fc31d9?s=47 Greg Rahn
October 26, 2015

Taming JSON with SQL

Talk for OakTable World 2015

4ecabc55db2222a77aac4e27f7fc31d9?s=128

Greg Rahn

October 26, 2015
Tweet

More Decks by Greg Rahn

Other Decks in Technology

Transcript

  1. Taming JSON with SQL: From Raw to Results OakTable World

    2015 | #otw15 2015-10-26 Greg Rahn | @GregRahn 1
  2. About Me { "Name": "Greg Rahn", "Twitter": "@GregRahn", "CompaniesWorked": [

    { "Company": "Snowflake Computing", "NumberOfYears": 2.0 }, { "Company": "Cloudera", "NumberOfYears": 1.5 }, { "Company": "Oracle", "NumberOfYears": 8.0 } ] } 2
  3. 3

  4. Why Is JSON So Popular? • JavaScript is (has been)

    eating the world • Atwood’s Law: any application that can be written in JavaScript, will eventually be written in JavaScript. • REST APIs produce/consume data in JSON format (IoT, Mobile) • Application logs (feeding into Logstash or Elasticsearch) • Have you ever tried to work with (or read) XML? • A self-describing text format makes data very portable • Open Datasets (data.gov, datasf.org, data.cityofnewyork.us, Yelp) 4
  5. Flexible Data Models 5

  6. MongoDB Aggregation vs. SQL db.zipcodes.aggregate( [ { $group: { _id:

    "$state", totalPop: { $sum: "$pop" } } }, { $match: { totalPop: { $gte: 10*1000*1000 } } } ] ) SELECT state, SUM(pop) AS totalPop FROM zipcodes GROUP BY state HAVING totalPop >= (10*1000*1000) 6
  7. None
  8. None
  9. Example Data Pipeline 9 Land Load Discovery & Apply Schema

    Refinement & Augmentation Publish Query
  10. If it were only that easy… 10

  11. Challenges • Get data in w/o defining a schema •

    Performing analytics on text files or string fields is slow • Variable schema data does not fit well with many formats • Need ANSI SQL support and useful extensions & functions for JSON 11
  12. Numerous Options 12

  13. Thinking Differently • Native ingest support for JSON, Avro, &

    XML • No schema required — discovered on ingest • VARIANT data type • Optimized storage for both relational and complex types • Optimizations used across both formats 13
  14. None
  15. SQL> _

  16. Summary 16

  17. Self-Defined Redefined • We can land and ingest "raw" JSON

    • We can have data pipelines with both self-defined and pre-defined schemas • We can have SQL on self-defining data and complex types (JSON, etc.) • We can be agile without building fragile systems — taking the best of both worlds • Have your JSON and SQL too! 17
  18. Thank you! Questions? @GregRahn https://speakerdeck.com/gregrahn @SnowflakeDB https://snowflake.net 18