About Me
!
• @Spark TW
• Spent time with a lot of fun
• Doing data mining using
Hadoop and Spark
!
• http://goo.gl/WU9xdu
Machu Picchu in Perú
Slide 3
Slide 3 text
Scala
Spark!
Core
Spark
SQL
Streaming
ML
Data!
Frame
Kafka
MLlib
ALS
SQL
Slide 4
Slide 4 text
Thank you to our sponsor
Slide 5
Slide 5 text
Review of SQL
SQL
!
• Structured Query Language
• SELECT statement
• SELECT
• FROM
• WHERE
• GROUP BY
• ORDER BY
SELECT category,
SUM(rev) AS total_rev
FROM mytable
WHERE rev > 2000
GROUP BY category
http://www.w3schools.com/sql/sql_groupby.asp
Category Product Rev
cellphone thin 6000
tablet normal 1500
tablet mini 5500
cellphone ultra thin 5000
cellphone very thin 6000
tablet big 2500
cellphone bendale 3000
cellphone foldable 3000
tablet pro 4500
tablet pro2 6500
Slide 6
Slide 6 text
About Spark SQL
Spark SQL
!
• From Shark to Spark SQL
• Support relational processing both within native
RDDs and data sources
• High performance using established DBMS techniques
• Enable extension for graph and machine learning
Bye Bye
MapReduce
Slide 7
Slide 7 text
This Lecture
Spark SQL
!
• Spark SQL concept - v1.4.1
• Data sources
• Input and output
• User defined functions
• Performance tuning
Spark RDD
Spark SQL
DataFrame
Catalyst Optimizer
JDBC User Program
Slide 8
Slide 8 text
Concept
Spark SQL
!
• SchemaRDD to DataFrame (since Spark 1.3)
• SQLContext vs. HiveContext
• Data sources API - external & built-in
Parquet
HDFS
Hive JSON
S3
MySQL
PostgresQL
HBASE
CSV AVRO
elasticsearch
cassandra
…more…
Slide 9
Slide 9 text
Query Planning
M Armbrust et al. (2015) Spark SQL: Relational Data Processing in Spark
SQL Query
DataFrame
Unresolved!
Logical Plan
Logical Plan
Optimized!
Logical Plan
Physical!
Plan
Cost Model
Selected!
Physical
Plan
RDDs
Catalog
Analysis Logical Optimization
Physical!
Planning
Code Generation
Slide 10
Slide 10 text
DataFrames
A distributed collection of data named columns
df.map(lambda x: (x[0], float(x[2])))\
.reduceByKey(lambda x, y: x + y)\
.collect()
df.groupBy("category").sum("sales")
Category Product Rev
cellphone thin 6000
tablet normal 1500
tablet mini 5500
cellphone ultra thin 5000
cellphone very thin 6000
tablet big 2500
cellphone bendale 3000
cellphone foldable 3000
tablet pro 4500
tablet pro2 6500
• RDD API
• DataFrame API
Let’s Start it
Spark SQL
!
• Start with HiveContext
• SELECT, GROUP BY, ORDER BY, JOIN, etc.
• Available API - Scala, Python, Java, R (1.4+)
Slide 14
Slide 14 text
Input & Output
Spark SQL
!
• Read, write, and transform data
• SELECT, GROUP BY, ORDER BY, JOIN, NESTED-QUERY
• Default - Parquet, JSON
sqlContext.read.json(filepath)
df.write.json(filepath)
sqlContext.table(tablename)
Slide 15
Slide 15 text
User Defined Functions
Spark SQL
!
• Build your own functions
registerFunction()
• Of course with Hive UDF, UDAF, make sure the jar
sqlContext.registerFunction( \
'strLength',\
lambda x: len(x))
UDAF in DF?
https://issues.apache.org/jira/browse/SPARK-6802
Slide 16
Slide 16 text
Performance Tuning
Spark SQL
!
• Materialize hot data in memory
• How to speed up the performance of query execution?
spark.sql.shuffle.partitions
spark.sql.inMemoryColumnarStorage.batchSize
sqlContext.cacheTable('df')
sqlContext.sql('CACHE TABLE df')
sqlContext.uncacheTable('df')
sqlContext.setConf('spark.sql.shuffle.partitions','200')
Q: Spark SQL vs. Impala
Which one is better?
!
• Or which one should be used for which use cases?
Iterative data processing?
Big volumes?
Heavy ETL jobs?