
The Jedi Masters Guide to Wrangling JSON ... with SQL

http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/46809

The flexibility and simplicity of JSON have made it one of the most common formats for data. Data engines need to be able to load, process, and query JSON and nested data types quickly and efficiently. There are multiple approaches to processing JSON data, each with trade-offs.

In this session we’ll discuss the reasons and ways that developers want to use flexible schema options and the challenges this creates for processing and querying that data. We’ll dive into the approaches taken by different technologies such as Hive, Drill, BigQuery, Spark, and others, and the performance and complexity trade-offs of each. Attendees will leave with an understanding of how to assess which system best fits their use case.

Greg Rahn

September 30, 2015
Transcript

  1. The Jedi Masters Guide to Wrangling JSON … with SQL
     Strata + Hadoop World NYC, 2015-09-30
     Greg Rahn | @GregRahn

  2. About Me
     {
       "FirstName": "Greg",
       "LastName": "Rahn",
       "Twitter": "@GregRahn",
       "CompaniesWorked": [
         { "Company": "Snowflake Computing", "NumberOfYears": 2.0 },
         { "Company": "Cloudera",            "NumberOfYears": 1.5 },
         { "Company": "Oracle",              "NumberOfYears": 8.0 }
       ]
     }
  3. (image slide, no text)

  4. Why Is JSON So Popular?
     • JavaScript is (has been) eating the world
     • Atwood’s Law: any application that can be written in JavaScript will eventually be written in JavaScript
     • REST APIs produce/consume data in JSON format
     • Internet of Things (IoT) / Internet of Anything (IoAT)
     • Mobile devices
     • Application logs (feeding into Logstash or Elasticsearch)
     • Have you ever tried to work with (or read) XML?
     • A self-describing text format makes data very portable
     • Open Datasets (data.gov, datasf.org, data.cityofnewyork.us)

  5. Hive
     • Files in HDFS or S3 containing JSON text
     • Define a table with a single STRING column
     • HiveQL
     • Access JSON via:
       • GET_JSON_OBJECT() (sketched after the next example)
       • LATERAL VIEW JSON_TUPLE()
       • a SerDe with a matching table definition
     • EXPLODE() only takes an array or map, not a string

  6. Hive Example
     create external table json_tab(json string);

     select lv1.*
     from json_tab jt
     lateral view json_tuple(jt.json,
       'FirstName', 'LastName', 'Twitter', 'CompaniesWorked') lv1
       as FirstName, LastName, Twitter, CompaniesWorked;

     Greg  Rahn  @GregRahn  [{"Company":"Snowflake Computing","NumberOfYears":2.0},
                             {"Company":"Cloudera","NumberOfYears":1.5},
                             {"Company":"Oracle","NumberOfYears":8.0}]
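
     Slide 5 also lists GET_JSON_OBJECT(); a minimal sketch against the
     same json_tab table (untested; paths follow Hive’s JSONPath subset):

     -- Each get_json_object() call re-parses the JSON string, which is
     -- why json_tuple() is preferred when extracting several keys at once.
     select get_json_object(jt.json, '$.FirstName') as FirstName,
            get_json_object(jt.json, '$.Twitter')   as Twitter,
            get_json_object(jt.json, '$.CompaniesWorked[0].Company') as FirstCompany
     from json_tab jt;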

  7. (image slide, no text)

  8. Drill
     • Can operate on JSON files directly, without a predefined schema
     • Uses from dfs.`/path/to/file/` to reference files
     • Currently no describe <table>
     • Inferred schema visible via select * from … limit N
       • but only shows attributes present in those N rows
     • ANSI SQL
     • JSON functions:
       • FLATTEN()
       • KVGEN()
       • REPEATED_COUNT(array) (sketched after the next example)
       • REPEATED_CONTAINS(array, keyword)

  9. Drill Example
     with t as (
       select FirstName, LastName, Twitter,
              flatten(CompaniesWorked) as CompaniesWorked
       from dfs.`/data/greg.json`
     )
     select FirstName, LastName, Twitter,
            t.CompaniesWorked.Company as Company,
            t.CompaniesWorked.NumberOfYears as NumberOfYears
     from t;

     +------------+-----------+------------+----------------------+----------------+
     | FirstName  | LastName  | Twitter    | Company              | NumberOfYears  |
     +------------+-----------+------------+----------------------+----------------+
     | Greg       | Rahn      | @GregRahn  | Snowflake Computing  | 2.0            |
     | Greg       | Rahn      | @GregRahn  | Cloudera             | 1.5            |
     | Greg       | Rahn      | @GregRahn  | Oracle               | 8.0            |
     +------------+-----------+------------+----------------------+----------------+
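
     A minimal sketch of REPEATED_COUNT() from the slide 8 feature list,
     against the same file (untested; REPEATED_CONTAINS() takes an array
     and a keyword, but its behavior on an array of maps is not shown here):

     -- REPEATED_COUNT() returns the number of elements in the array,
     -- i.e. 3 for the CompaniesWorked array in greg.json.
     select REPEATED_COUNT(CompaniesWorked) as NumCompanies
     from dfs.`/data/greg.json`;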

  10. Spark
     • Can operate on JSON files directly, without a predefined schema
       • HDFS, S3, local filesystem
     • Schema inference provides column metadata via:
       • SQLContext()
       • printSchema()
       • registerTempTable()
       • CREATE TEMPORARY TABLE t USING org.apache.spark.sql.json (sketched below)
     • Column names follow dot notation
     • Spark SQL does not currently have EXPLODE()

  11. Spark Example (1/2)
     import org.apache.spark.sql.functions._

     val df = sqlContext.read.json("/data/greg.json")
     df.withColumn("companyWorked", explode($"CompaniesWorked"))
       .select("FirstName", "LastName", "Twitter",
               "companyWorked.Company", "companyWorked.NumberOfYears")
       .show()

     +---------+--------+---------+-------------------+-------------+
     |FirstName|LastName|  Twitter|            Company|NumberOfYears|
     +---------+--------+---------+-------------------+-------------+
     |     Greg|    Rahn|@GregRahn|Snowflake Computing|          2.0|
     |     Greg|    Rahn|@GregRahn|           Cloudera|          1.5|
     |     Greg|    Rahn|@GregRahn|             Oracle|          8.0|
     +---------+--------+---------+-------------------+-------------+

  12. Spark Example (2/2)
     df.printSchema()

     root
      |-- CompaniesWorked: array (nullable = true)
      |    |-- element: struct (containsNull = true)
      |    |    |-- Company: string (nullable = true)
      |    |    |-- NumberOfYears: double (nullable = true)
      |-- FirstName: string (nullable = true)
      |-- LastName: string (nullable = true)
      |-- Twitter: string (nullable = true)
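
     A minimal sketch of the pure-SQL route named on slide 10, using the
     Spark 1.x temporary-table syntax; the table name greg and the path
     are carried over from the other examples (untested):

     -- Expose the JSON file as a temporary table via the built-in json
     -- data source, then query it with dot notation and array indexing.
     CREATE TEMPORARY TABLE greg
     USING org.apache.spark.sql.json
     OPTIONS (path "/data/greg.json");

     SELECT FirstName, LastName, Twitter, CompaniesWorked[0].Company
     FROM greg;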

  13. BigQuery
     • BigQuery - Google’s query engine (in the cloud)
     • SQL-like language, not ANSI
     • Supports CSV or JSON data
     • Schema required, either:
       • on table creation
       • specified when loading data
     • JSON functions:
       • JSON_EXTRACT()
       • JSON_EXTRACT_SCALAR()
       • FLATTEN() (sketched after the examples below)
       • WITHIN

  14. BigQuery Example
     $ bq mk strata
     $ bq load --source_format=NEWLINE_DELIMITED_JSON strata.greg greg.json gregSchema.json
     $ bq query "
     > select
     >   FirstName,
     >   LastName,
     >   Twitter,
     >   CompaniesWorked.Company,
     >   CompaniesWorked.NumberOfYears
     > from [strata.greg]
     > "
     +-----------+----------+-----------+-------------------------+-------------------------------+
     | FirstName | LastName | Twitter   | CompaniesWorked_Company | CompaniesWorked_NumberOfYears |
     +-----------+----------+-----------+-------------------------+-------------------------------+
     | Greg      | Rahn     | @GregRahn | Snowflake Computing     | 2.0                           |
     | Greg      | Rahn     | @GregRahn | Cloudera                | 1.5                           |
     | Greg      | Rahn     | @GregRahn | Oracle                  | 8.0                           |
     +-----------+----------+-----------+-------------------------+-------------------------------+

  15. BigQuery Schema Definition
     [
       { "name": "FirstName", "type": "string", "mode": "nullable" },
       { "name": "LastName",  "type": "string", "mode": "nullable" },
       { "name": "Twitter",   "type": "string", "mode": "nullable" },
       { "name": "CompaniesWorked", "type": "record", "mode": "repeated",
         "fields": [
           { "name": "Company",       "type": "string", "mode": "nullable" },
           { "name": "NumberOfYears", "type": "float",  "mode": "nullable" }
         ]
       }
     ]

  16. BigQuery Show Schema
     $ bq show strata.greg
        Last modified                 Schema
     ----------------- ---------------------------------------
      29 Sep 12:26:50   |- FirstName: string
                        |- LastName: string
                        |- Twitter: string
                        +- CompaniesWorked: record (repeated)
                        |  |- Company: string
                        |  |- NumberOfYears: float
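
     Minimal sketches of FLATTEN() and WITHIN from the slide 13 feature
     list, in BigQuery’s legacy SQL dialect against the same table
     (written from the documented syntax, not run here):

     -- FLATTEN() unnests the repeated record so each company becomes its
     -- own row, giving the same shape as the slide 14 result.
     select FirstName, CompaniesWorked.Company, CompaniesWorked.NumberOfYears
     from flatten([strata.greg], CompaniesWorked)

     -- WITHIN RECORD aggregates over the repeated field inside each row.
     select FirstName, count(CompaniesWorked.Company) within record as NumCompanies
     from [strata.greg]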

  17. Redshift
     • Amazon’s hosted data warehouse solution
     • Supports JSON as a VARCHAR data type
     • ANSI SQL
     • Can load scalar values or JSON strings from files with the COPY command (see the COPY sketch after the examples)
     • JSON functions:
       • JSON_ARRAY_LENGTH()
       • JSON_EXTRACT_ARRAY_ELEMENT_TEXT()
       • JSON_EXTRACT_PATH_TEXT()

  18. Redshift Example (1/3)
     create table greg (
       FirstName varchar,
       LastName varchar,
       Twitter varchar,
       CompaniesWorked varchar
     );

     copy greg from 's3://bucket/strata/greg.json'
     credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
     json 's3://bucket/strata/jsonpaths.json';

     The jsonpaths.json file referenced above:
     { "jsonpaths": [ "$.FirstName", "$.LastName", "$.Twitter", "$.CompaniesWorked" ] }

  19. Redshift Example (2/3)
     select * from greg;

     firstname | lastname | twitter   | companiesworked
     ----------+----------+-----------+------------------------------------------------------------
     Greg      | Rahn     | @GregRahn | [{"Company":"Snowflake Computing","NumberOfYears":2},
                                         {"Company":"Cloudera","NumberOfYears":1.5},
                                         {"Company":"Oracle","NumberOfYears":8}]

     create table seq(i int);
     insert into seq (i) values (0),(1),(2);

     select firstname, lastname, twitter,
            json_extract_array_element_text(companiesworked, seq.i)
     from greg, seq
     where seq.i < json_array_length(companiesworked);

     firstname | lastname | twitter   | json_extract_array_element_text
     ----------+----------+-----------+-----------------------------------------------------
     Greg      | Rahn     | @GregRahn | {"Company":"Cloudera","NumberOfYears":1.5}
     Greg      | Rahn     | @GregRahn | {"Company":"Oracle","NumberOfYears":8}
     Greg      | Rahn     | @GregRahn | {"Company":"Snowflake Computing","NumberOfYears":2}

  20. Redshift Example (3/3)
     select firstname, lastname, twitter,
            json_extract_path_text(
              json_extract_array_element_text(companiesworked, seq.i), 'Company') as company,
            json_extract_path_text(
              json_extract_array_element_text(companiesworked, seq.i), 'NumberOfYears') as NumberOfYears
     from greg, seq
     where seq.i < json_array_length(companiesworked);

     firstname | lastname | twitter   | company             | numberofyears
     ----------+----------+-----------+---------------------+---------------
     Greg      | Rahn     | @GregRahn | Cloudera            | 1.5
     Greg      | Rahn     | @GregRahn | Oracle              | 8
     Greg      | Rahn     | @GregRahn | Snowflake Computing | 2
     (3 rows)

     -- the same query with generate_series() in place of the helper table fails:
     select firstname, lastname, twitter,
            json_extract_path_text(
              json_extract_array_element_text(companiesworked, seq.i), 'Company') as company,
            json_extract_path_text(
              json_extract_array_element_text(companiesworked, seq.i), 'NumberOfYears') as NumberOfYears
     from greg, generate_series(1,3) as seq(i)
     where seq.i < json_array_length(companiesworked);

     INFO:  Function "generate_series(integer,integer)" not supported.
     ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
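
     A minimal sketch of the other loading path from slide 17: COPY with
     the 'auto' option maps top-level JSON keys to matching column names,
     so no jsonpaths file is needed (bucket and credentials are
     placeholders, as above):

     copy greg from 's3://bucket/strata/greg.json'
     credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
     json 'auto';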

  21. Snowflake
     • Cloud-based data warehouse (DW as a Service)
     • Native support for JSON via the VARIANT data type
     • JSON (or Avro) can be loaded directly via COPY into a VARIANT column
     • On load, frequently repeated attributes are shredded into columnar storage
     • ANSI SQL with extensions for dot-notation paths
       • VARIANT:path.to.the.thing
     • Numerous JSON/VARIANT functions:
       • Validation functions: CHECK_JSON(), PARSE_JSON()
       • Table functions: FLATTEN()
       • Array functions: ARRAY_AGG(), ARRAY_APPEND(), ARRAY_SIZE(), …
       • Path extraction functions: GET(), GET_PATH() (sketched after the example below)

  22. Snowflake Example
     create or replace table greg (c1 variant);

     copy into greg from 's3://bucket/strata/greg.json'
     credentials=(aws_key_id='xxx' aws_secret_key='xxx')
     file_format = (type=json);

     select c1:FirstName::string          as FirstName,
            c1:LastName::string           as LastName,
            c1:Twitter::string            as Twitter,
            v1.value:Company::string      as CompaniesWorked_Company,
            v1.value:NumberOfYears::float as CompaniesWorked_NumberOfYears
     from greg, table(flatten(c1:CompaniesWorked)) v1;

     FIRSTNAME | LASTNAME | TWITTER   | COMPANIESWORKED_COMPANY | COMPANIESWORKED_NUMBEROFYEARS
     ----------+----------+-----------+-------------------------+------------------------------
     Greg      | Rahn     | @GregRahn | Snowflake Computing     | 2.0
     Greg      | Rahn     | @GregRahn | Cloudera                | 1.5
     Greg      | Rahn     | @GregRahn | Oracle                  | 8.0
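
     A minimal sketch of the validation and path functions from slide 21,
     against the same table (untested):

     -- CHECK_JSON() returns NULL for valid JSON and an error description
     -- otherwise; GET_PATH() is the functional form of the colon/dot
     -- syntax used above.
     select check_json('{"a": 1')              as invalid_json,
            parse_json('{"a": 1}'):a::int      as a,
            get_path(c1, 'CompaniesWorked[0].Company')::string as first_company
     from greg;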

  23. Mastering SQL on JSON
     • Self-describing data is very prevalent and a new norm
     • Even with self-describing data, schema definitions may be required
     • Different systems offer different levels of functionality
     • No standard for SQL on JSON
     • Analytics on text files is never efficient
     • Storing JSON in STRING fields isn’t efficient either

  24. Thank You
     • Stop by booth 952 to see Snowflake Computing
     • Web: https://snowflake.net
     • Twitter: @SnowflakeDB
     • Twitter: @GregRahn
     • Slides: https://speakerdeck.com/gregrahn