Slide 1

Slide 1 text

Taming JSON with SQL: From Raw to Results OakTable World 2015 | #otw15 2015-10-26 Greg Rahn | @GregRahn 1

Slide 2

Slide 2 text

About Me { "Name": "Greg Rahn", "Twitter": "@GregRahn", "CompaniesWorked": [ { "Company": "Snowflake Computing", "NumberOfYears": 2.0 }, { "Company": "Cloudera", "NumberOfYears": 1.5 }, { "Company": "Oracle", "NumberOfYears": 8.0 } ] } 2

Slide 3

Slide 3 text

3

Slide 4

Slide 4 text

Why Is JSON So Popular? • JavaScript is (has been) eating the world • Atwood’s Law: any application that can be written in JavaScript, will eventually be written in JavaScript. • REST APIs produce/consume data in JSON format (IoT, Mobile) • Application logs (feeding into Logstash or Elasticsearch) • Have you ever tried to work with (or read) XML? • A self-describing text format makes data very portable • Open Datasets (data.gov, datasf.org, data.cityofnewyork.us, Yelp) 4

Slide 5

Slide 5 text

Flexible Data Models 5

Slide 6

Slide 6 text

MongoDB Aggregation vs. SQL db.zipcodes.aggregate( [ { $group: { _id: "$state", totalPop: { $sum: "$pop" } } }, { $match: { totalPop: { $gte: 10*1000*1000 } } } ] ) SELECT state, SUM(pop) AS totalPop FROM zipcodes GROUP BY state HAVING totalPop >= (10*1000*1000) 6

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Example Data Pipeline 9 Land Load Discovery & Apply Schema Refinement & Augmentation Publish Query

Slide 10

Slide 10 text

If it were only that easy… 10

Slide 11

Slide 11 text

Challenges • Get data in w/o defining a schema • Performing analytics on text files or string fields is slow • Variable schema data does not fit well with many formats • Need ANSI SQL support and useful extensions & functions for JSON 11

Slide 12

Slide 12 text

Numerous Options 12

Slide 13

Slide 13 text

Thinking Differently • Native ingest support for JSON, Avro, & XML • No schema required — discovered on ingest • VARIANT data type • Optimized storage for both relational and complex types • Optimizations used across both formats 13

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

SQL> _

Slide 16

Slide 16 text

Summary 16

Slide 17

Slide 17 text

Self-Defined Redefined • We can land and ingest "raw" JSON • We can have data pipelines with both self-defined and pre-defined schemas • We can have SQL on self-defining data and complex types (JSON, etc.) • We can be agile without building fragile systems — taking the best of both worlds • Have your JSON and SQL too! 17

Slide 18

Slide 18 text

Thank you! Questions? @GregRahn https://speakerdeck.com/gregrahn @SnowflakeDB https://snowflake.net 18