Spark SQL of Hadoop Con 2015

Tutorial - Spark SQL Li-Hui Li @HadoopCon 2015

About Me ! • @Spark TW • Spent time with
a lot of fun • Doing data mining using Hadoop and Spark ! • http://goo.gl/WU9xdu Machu Picchu in Perú

Scala Spark! Core Spark SQL Streaming ML Data! Frame Kafka
MLlib ALS SQL

Thank you to our sponsor

Review of SQL SQL ! • Structured Query Language •
SELECT statement • SELECT • FROM • WHERE • GROUP BY • ORDER BY SELECT category, SUM(rev) AS total_rev FROM mytable WHERE rev > 2000 GROUP BY category http://www.w3schools.com/sql/sql_groupby.asp Category Product Rev cellphone thin 6000 tablet normal 1500 tablet mini 5500 cellphone ultra thin 5000 cellphone very thin 6000 tablet big 2500 cellphone bendale 3000 cellphone foldable 3000 tablet pro 4500 tablet pro2 6500

About Spark SQL Spark SQL ! • From Shark to
Spark SQL • Support relational processing both within native RDDs and data sources • High performance using established DBMS techniques • Enable extension for graph and machine learning Bye Bye MapReduce

This Lecture Spark SQL ! • Spark SQL concept -
v1.4.1 • Data sources • Input and output • User deﬁned functions • Performance tuning Spark RDD Spark SQL DataFrame Catalyst Optimizer JDBC User Program

Concept Spark SQL ! • SchemaRDD to DataFrame (since Spark
1.3) • SQLContext vs. HiveContext • Data sources API - external & built-in Parquet HDFS Hive JSON S3 MySQL PostgresQL HBASE CSV AVRO elasticsearch cassandra …more…

Query Planning M Armbrust et al. (2015) Spark SQL: Relational
Data Processing in Spark SQL Query DataFrame Unresolved! Logical Plan Logical Plan Optimized! Logical Plan Physical! Plan Cost Model Selected! Physical Plan RDDs Catalog Analysis Logical Optimization Physical! Planning Code Generation

DataFrames A distributed collection of data named columns df.map(lambda x:
(x[0], float(x[2])))\ .reduceByKey(lambda x, y: x + y)\ .collect() df.groupBy("category").sum("sales") Category Product Rev cellphone thin 6000 tablet normal 1500 tablet mini 5500 cellphone ultra thin 5000 cellphone very thin 6000 tablet big 2500 cellphone bendale 3000 cellphone foldable 3000 tablet pro 4500 tablet pro2 6500 • RDD API • DataFrame API

Abundant Classes Spark SQL ! • SQLContext • DataFrame •
Row • HiveContext • GroupedData • Window ! ! • DataFrameNaFunctions • DataFrameStatFunctions • Functions • Types • …

Spark SQL & DataFrames ! • Random data generation !
• Descriptive statistics ! • Selected summary statistics ! • Pearson correlation ! • Cross tabulation ! • Aggregation ! • Multi-dimensional cube df.withColumn('normal', randn(seed=123)) df.describe('col1', 'col2') df.select(mean('col1'),min('col2')) df.stat.corr('col1','col2') df.stat.crosstab('col1','col2') df.groupBy('col1').sum('col2') df.cube('col1').sum('col2')

Let’s Start it Spark SQL ! • Start with HiveContext
• SELECT, GROUP BY, ORDER BY, JOIN, etc. • Available API - Scala, Python, Java, R (1.4+)

Input & Output Spark SQL ! • Read, write, and
transform data • SELECT, GROUP BY, ORDER BY, JOIN, NESTED-QUERY • Default - Parquet, JSON sqlContext.read.json(filepath) df.write.json(filepath) sqlContext.table(tablename)

User Deﬁned Functions Spark SQL ! • Build your own
functions registerFunction() • Of course with Hive UDF, UDAF, make sure the jar sqlContext.registerFunction( \ 'strLength',\ lambda x: len(x)) UDAF in DF? https://issues.apache.org/jira/browse/SPARK-6802

Performance Tuning Spark SQL ! • Materialize hot data in
memory • How to speed up the performance of query execution? spark.sql.shufﬂe.partitions spark.sql.inMemoryColumnarStorage.batchSize sqlContext.cacheTable('df') sqlContext.sql('CACHE TABLE df') sqlContext.uncacheTable('df') sqlContext.setConf('spark.sql.shuffle.partitions','200')

Column Category Column Product Column Rev cellphone tablet tablet cellphone
thin normal mini ultra thin 6000 1500 5000 ultra thin Array ByteBuffer Columnar in-Memory Storage Partition 1

Q: Spark SQL vs. Impala Which one is better? !
• Or which one should be used for which use cases? Iterative data processing? Big volumes? Heavy ETL jobs?

Reference Spark SQL ! • Holden Karau et al. Learning
Spark, O'Reilly Media, 2015 • Spark SQL and DataFrame Guide • …

Spark SQL of Hadoop Con 2015

Spark SQL of Hadoop Con 2015

Erica Li

More Decks by Erica Li

Other Decks in Programming

Featured

Transcript

Tutorial - Spark SQL Li-Hui Li @HadoopCon 2015

About Me ! • @Spark TW • Spent time with

Scala Spark! Core Spark SQL Streaming ML Data! Frame Kafka

Thank you to our sponsor

Review of SQL SQL ! • Structured Query Language •

About Spark SQL Spark SQL ! • From Shark to

This Lecture Spark SQL ! • Spark SQL concept -

Concept Spark SQL ! • SchemaRDD to DataFrame (since Spark

Query Planning M Armbrust et al. (2015) Spark SQL: Relational

DataFrames A distributed collection of data named columns df.map(lambda x:

Abundant Classes Spark SQL ! • SQLContext • DataFrame •

Spark SQL & DataFrames ! • Random data generation !

Let’s Start it Spark SQL ! • Start with HiveContext

Input & Output Spark SQL ! • Read, write, and

User Deﬁned Functions Spark SQL ! • Build your own

Performance Tuning Spark SQL ! • Materialize hot data in

Column Category Column Product Column Rev cellphone tablet tablet cellphone

Q: Spark SQL vs. Impala Which one is better? !

Reference Spark SQL ! • Holden Karau et al. Learning