Slide 1

Slide 1 text

Tutorial - Spark SQL Li-Hui Li @HadoopCon 2015

Slide 2

Slide 2 text

About Me ! • @Spark TW • Spent time with a lot of fun • Doing data mining using Hadoop and Spark ! • http://goo.gl/WU9xdu Machu Picchu in Perú

Slide 3

Slide 3 text

Scala Spark! Core Spark SQL Streaming ML Data! Frame Kafka MLlib ALS SQL

Slide 4

Slide 4 text

Thank you to our sponsor

Slide 5

Slide 5 text

Review of SQL SQL ! • Structured Query Language • SELECT statement • SELECT • FROM • WHERE • GROUP BY • ORDER BY SELECT category, SUM(rev) AS total_rev FROM mytable WHERE rev > 2000 GROUP BY category http://www.w3schools.com/sql/sql_groupby.asp Category Product Rev cellphone thin 6000 tablet normal 1500 tablet mini 5500 cellphone ultra thin 5000 cellphone very thin 6000 tablet big 2500 cellphone bendale 3000 cellphone foldable 3000 tablet pro 4500 tablet pro2 6500

Slide 6

Slide 6 text

About Spark SQL Spark SQL ! • From Shark to Spark SQL • Support relational processing both within native RDDs and data sources • High performance using established DBMS techniques • Enable extension for graph and machine learning Bye Bye MapReduce

Slide 7

Slide 7 text

This Lecture Spark SQL ! • Spark SQL concept - v1.4.1 • Data sources • Input and output • User defined functions • Performance tuning Spark RDD Spark SQL DataFrame Catalyst Optimizer JDBC User Program

Slide 8

Slide 8 text

Concept Spark SQL ! • SchemaRDD to DataFrame (since Spark 1.3) • SQLContext vs. HiveContext • Data sources API - external & built-in Parquet HDFS Hive JSON S3 MySQL PostgresQL HBASE CSV AVRO elasticsearch cassandra …more…

Slide 9

Slide 9 text

Query Planning M Armbrust et al. (2015) Spark SQL: Relational Data Processing in Spark SQL Query DataFrame Unresolved! Logical Plan Logical Plan Optimized! Logical Plan Physical! Plan Cost Model Selected! Physical Plan RDDs Catalog Analysis Logical Optimization Physical! Planning Code Generation

Slide 10

Slide 10 text

DataFrames A distributed collection of data named columns df.map(lambda x: (x[0], float(x[2])))\ .reduceByKey(lambda x, y: x + y)\ .collect() df.groupBy("category").sum("sales") Category Product Rev cellphone thin 6000 tablet normal 1500 tablet mini 5500 cellphone ultra thin 5000 cellphone very thin 6000 tablet big 2500 cellphone bendale 3000 cellphone foldable 3000 tablet pro 4500 tablet pro2 6500 • RDD API • DataFrame API

Slide 11

Slide 11 text

Abundant Classes Spark SQL ! • SQLContext • DataFrame • Row • HiveContext • GroupedData • Window ! ! • DataFrameNaFunctions • DataFrameStatFunctions • Functions • Types • …

Slide 12

Slide 12 text

Spark SQL & DataFrames ! • Random data generation ! • Descriptive statistics ! • Selected summary statistics ! • Pearson correlation ! • Cross tabulation ! • Aggregation ! • Multi-dimensional cube df.withColumn('normal', randn(seed=123)) df.describe('col1', 'col2') df.select(mean('col1'),min('col2')) df.stat.corr('col1','col2') df.stat.crosstab('col1','col2') df.groupBy('col1').sum('col2') df.cube('col1').sum('col2')

Slide 13

Slide 13 text

Let’s Start it Spark SQL ! • Start with HiveContext • SELECT, GROUP BY, ORDER BY, JOIN, etc. • Available API - Scala, Python, Java, R (1.4+)

Slide 14

Slide 14 text

Input & Output Spark SQL ! • Read, write, and transform data • SELECT, GROUP BY, ORDER BY, JOIN, NESTED-QUERY • Default - Parquet, JSON sqlContext.read.json(filepath) df.write.json(filepath) sqlContext.table(tablename)

Slide 15

Slide 15 text

User Defined Functions Spark SQL ! • Build your own functions registerFunction() • Of course with Hive UDF, UDAF, make sure the jar sqlContext.registerFunction( \ 'strLength',\ lambda x: len(x)) UDAF in DF? https://issues.apache.org/jira/browse/SPARK-6802

Slide 16

Slide 16 text

Performance Tuning Spark SQL ! • Materialize hot data in memory • How to speed up the performance of query execution? spark.sql.shuffle.partitions spark.sql.inMemoryColumnarStorage.batchSize sqlContext.cacheTable('df') sqlContext.sql('CACHE TABLE df') sqlContext.uncacheTable('df') sqlContext.setConf('spark.sql.shuffle.partitions','200')

Slide 17

Slide 17 text

Column Category Column Product Column Rev cellphone tablet tablet cellphone thin normal mini ultra thin 6000 1500 5000 ultra thin Array ByteBuffer Columnar in-Memory Storage Partition 1

Slide 18

Slide 18 text

Q: Spark SQL vs. Impala Which one is better? ! • Or which one should be used for which use cases? Iterative data processing? Big volumes? Heavy ETL jobs?

Slide 19

Slide 19 text

Reference Spark SQL ! • Holden Karau et al. Learning Spark, O'Reilly Media, 2015 • Spark SQL and DataFrame Guide • …