The Jedi Masters Guide to Wrangling JSON ... with SQL

The Jedi Masters Guide to Wrangling JSON … with SQL
Strata + Hadoop World NYC 2015-09-30 Greg Rahn | @GregRahn 1

About Me { "FirstName": "Greg", "LastName": "Rahn", "Twitter": "@GregRahn", "CompaniesWorked":
[ { "Company": "Snowflake Computing", "NumberOfYears": 2.0 }, { "Company": "Cloudera", "NumberOfYears": 1.5 }, { "Company": "Oracle", "NumberOfYears": 8.0 } ] } 2

Why Is JSON So Popular? • JavaScript is (has been)
eating the world • Atwood’s Law: any application that can be written in JavaScript, will eventually be written in JavaScript. • REST APIs produce/consume data in JSON format • Internet of Things (IoT) / Internet of Anything (IoAT) • Mobile devices • Application logs (feeding into Logstash or Elasticsearch) • Have you ever tried to work with (or read) XML? • A self-describing text format makes data very portable • Open Datasets (data.gov, datasf.org, data.cityofnewyork.us) 4

The Rise of Self-Deﬁned Data 5 https://www.mongodb.com/compare/mongodb-mysql

Produce / consume ﬁrst, organize / analyze later 6

SQL on JSON 7

Numerous Options 8

Hive 9

Hive • Files in HDFS or S3 containing JSON text
• Deﬁne a table with single STRING column • HiveQL • Access JSON via • GET_JSON_OBJECT() • LATERAL VIEW JSON_TUPLE() • Use a SerDe with matching table deﬁnition • EXPLODE() only takes an array or map, not a string 10

Hive Example create external table json_tab(json string); select lv1.* from
json_tab jt lateral view json_tuple(jt.json, 'FirstName', 'LastName', 'Twitter', 'CompaniesWorked') lv1 as FirstName, LastName, Twitter, CompaniesWorked; Greg Rahn @GregRahn [{"Company":"Snowflake Computing","NumberOfYears":2.0}, {"Company":"Cloudera","NumberOfYears":1.5}, {"Company":"Oracle","NumberOfYears":8.0}] 11

Drill 13

Drill • Can operate on JSON files directly without a
predefined schema • Uses from dfs.`/path/to/file/` • Currently no describe <table> • Inferred schema visible via select * from … limit N • but only see attributes from those N rows • ANSI SQL • JSON functions • FLATTEN() • KVGEN() • REPEATED_COUNT(array) • REPEATED_CONTAINS(array, keyword) 14

Spark 16

Spark • Can operate on JSON files directly without a
predefined schema • HDFS, S3, local filesystem • Schema inference provides column metadata from • SQLContext() • printSchema() • registerTempTable() • CREATE TEMPORARY TABLE t USING org.apache.spark.sql.json • Column names follow dot notation • Spark SQL does not currently have EXPLODE() 17

BigQuery 20

BigQuery • BigQuery - Google’s query engine (in the cloud)
• SQL-like language, not ANSI • Supports CSV or JSON data • Schema required either: • On table creation • Speciﬁed when loading data • JSON functions • JSON_EXTRACT() • JSON_EXTRACT_SCALAR() • FLATTEN() • WITHIN 21

BigQuery Schema Deﬁnition [ { "name": "FirstName","type": “string”, "mode": "nullable"
}, { "name": "LastName", "type": "string", "mode": "nullable" }, { "name": "Twitter", "type": "string", "mode": "nullable" }, { "name": "CompaniesWorked", "type": "record", "mode": "repeated", "fields": [ { "name": "Company", "type": "string", "mode": "nullable" }, { "name": "NumberOfYears", "type": "float", "mode": "nullable" } ] } ] 23

BigQuery Show Schema $ bq show strata.greg Last modified Schema
----------------- --------------------------------------- 29 Sep 12:26:50 |- FirstName: string |- LastName: string |- Twitter: string +- CompaniesWorked: record (repeated) | |- Company: string | |- NumberOfYears: float 24

Redshift 25

Redshift • Amazon’s hosted data warehouse solution • Supports JSON
as a VARCHAR data type • ANSI SQL • Can load scalar values or JSON strings from ﬁles with COPY command • JSON functions • JSON_ARRAY_LENGTH() • JSON_EXTRACT_ARRAY_ELEMENT_TEXT() • JSON_EXTRACT_PATH_TEXT() 26

Redshift Example (1/3) create table greg ( FirstName varchar, LastName
varchar, Twitter varchar, CompaniesWorked varchar ); copy greg from 's3://bucket/strata/greg.json' credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx' json 's3://bucket/strata/jsonpaths.json'; { "jsonpaths": [ "$.FirstName", "$.LastName", "$.Twitter", "$.CompaniesWorked" ] } 27

Redshift Example (2/3) select * greg; firstname | lastname |
twitter | ----------+----------+------------ Greg | Rahn | @GregRahn | companiesworked +----------------------------------------------------------------------------------------------------------- ------------------------------ [{"Company":"Snowflake Computing”,"NumberOfYears":2},{"Company":"Cloudera","NumberOfYears":1.5}, {"Company":"Oracle","NumberOfYears":8}] create table seq(i int); insert into seq (i) values (1),(2),(3); select firstname, lastname, twitter, json_extract_array_element_text(companiesworked, seq.i) from greg, seq where seq.i < json_array_length(companiesworked); firstname | lastname | twitter | json_extract_array_element_text -----------+----------+-----------+----------------------------------------------------- Greg | Rahn | @GregRahn | {"Company":"Cloudera","NumberOfYears":1.5} Greg | Rahn | @GregRahn | {"Company":"Oracle","NumberOfYears":8} Greg | Rahn | @GregRahn | {"Company":"Snowflake Computing","NumberOfYears":2} 28

Redshift Example (3/3) select firstname, lastname, twitter, json_extract_path_text(json_extract_array_element_text(companiesworked, seq.i),'Company') as
company, json_extract_path_text(json_extract_array_element_text(companiesworked, seq.i),'NumberOfYears') as NumberOfYears from greg, seq where seq.i < json_array_length(companiesworked); firstname | lastname | twitter | company | numberofyears ----------+----------+-----------+---------------------+--------------- Greg | Rahn | @GregRahn | Cloudera | 1.5 Greg | Rahn | @GregRahn | Oracle | 8 Greg | Rahn | @GregRahn | Snowflake Computing | 2 (3 rows) select firstname, lastname, twitter, json_extract_path_text(json_extract_array_element_text(companiesworked, seq.i),'Company') as company, json_extract_path_text(json_extract_array_element_text(companiesworked, seq.i),'NumberOfYears') as NumberOfYears from greg, generate_series(1,3) AS seq(i) where seq.i < json_array_length(companiesworked); INFO: Function "generate_series(integer,integer)" not supported. ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables. 29

Snowﬂake 30

Snowﬂake • Cloud based data warehouse (DW as a Service)
• Native support for JSON via VARIANT data type • JSON (or Avro) can be directly loaded via COPY into a VARIANT column • On load, frequently repeated attributes are shredded into columnar storage • ANSI SQL with extensions for dot notation paths • VARIANT:path.to.the.thing • Numerous JSON/VARIANT functions • Validation functions: CHECK_JSON(), PARSE_JSON() • Table functions: FLATTEN() • Array functions: ARRAY_AGG(), ARRAY_APPEND(), ARRAY_SIZE()… • Path extraction functions: GET(), GET_PATH() 31

Snowﬂake Example create or replace table greg (c1 variant); copy
into greg from 's3://bucket/strata/greg.json' credentials=(aws_key_id='xxx' aws_secret_key=‘xxx') file_format = (type=json); select c1:FirstName::string as FirstName, c1:LastName::string as LastName, c1:Twitter::string as Twitter, v1.value:Company::string as CompaniesWorked_Company, v1.value:NumberOfYears::float as CompaniesWorked_NumberOfYears from greg, table(flatten(c1:CompaniesWorked)) v1; -----------+----------+-----------+-------------------------+-------------------------------+ FIRSTNAME | LASTNAME | TWITTER | COMPANIESWORKED_COMPANY | COMPANIESWORKED_NUMBEROFYEARS | -----------+----------+-----------+-------------------------+-------------------------------+ Greg | Rahn | @GregRahn | Snowflake Computing | 2.0 | Greg | Rahn | @GregRahn | Cloudera | 1.5 | Greg | Rahn | @GregRahn | Oracle | 8.0 | -----------+----------+-----------+-------------------------+-------------------------------+ 32

Mastering SQL on JSON • Self-describing data is very prevalent
and a new norm • Even with self-describing data, schema definitions may be required • Different systems offer different levels of functionality • No standard for SQL on JSON • Analytics on text files is never efficient • Storing JSON in STRING fields isn't efficient either 33

Thank You • Stop by booth 952 to see Snowflake
Computing • Web: https://snowflake.net • Twitter: @SnowflakeDB • Twitter: @GregRahn • Slides: https://speakerdeck.com/gregrahn 34

The Jedi Masters Guide to Wrangling JSON ... wi...

The Jedi Masters Guide to Wrangling JSON ... with SQL

More Decks by Greg Rahn

Other Decks in Technology

Featured

Transcript