SparkSQL HadoopCon 2016

Erica Li
September 09, 2016

Spark tutorial slides:
Introduction to SparkSQL
Data sources
Generic data load & save mode
Cache in memory
User defined functions


Transcript

  1. About Me
     • Erica Li (ericalitw)
     • inBOUND CTO & Co-Founder
     • Taiwan Spark User Group Founder
  2. Agenda
     • Introduction to SparkSQL
     • Data sources
     • Generic data load & save mode
     • Cache in memory
     • User defined functions
  3. Challenges for Data Scientists
     • Lots of analytics experts are NOT Java or Scala experts
     • Steep learning curve for non-research specialists
     • Data scientists won't set up a large computer cluster just for research
  4. SparkSQL
     • What is SparkSQL for?
       ◦ Part of the Spark distribution since Spark 1.0
       ◦ Connects BI tools to Spark through JDBC
     • The history of SparkSQL
     • Initialize SparkSQL with SparkSession now (see the sketch below)
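
     A minimal sketch of starting a SparkSession in PySpark 2.0 (the
     application name is arbitrary):

       from pyspark.sql import SparkSession

       # SparkSession is the single entry point for SparkSQL in 2.0
       spark = SparkSession.builder \
           .appName("sparksql-demo") \
           .getOrCreate()
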
  5. Spark release timeline (source: http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf)
     • Spark 2.0.0: Jul 28, 2016
     • Spark 1.6.2: Jun 25, 2016
     • Spark 1.6.1: Mar 09, 2016
     • Spark 1.6.0: Jan 04, 2016
     • Spark 1.5.2: Nov 09, 2015
     • Spark 1.5.1: Oct 02, 2015
     • Spark 1.5.0: Sep 09, 2015
     • Spark 1.4.1: Jul 15, 2015
     • Spark 1.4.0: Jun 11, 2015
     • Spark 1.2.2 / 1.3.1: Apr 17, 2015
     • Spark 1.3.0: Mar 13, 2015
     • Spark 1.2.1: Feb 09, 2015
     • ...
  6. What's new in SparkSQL 2.0
     • SparkSQL
       ◦ Structured API improvements
       ◦ Spark 2.0 can run all 99 TPC-DS queries, which require many of the SQL:2003 features
       ◦ Visualization of the DataFrame query plan on the web UI
       ◦ Support for multiple versions of Hive (0.12 - 1.2.1)
       ◦ Reading several Parquet variants, e.g. Impala, Avro, Thrift, Hive, etc.
       ◦ ...
  7. Core of SparkSQL
     • SparkSession
       ◦ Start point: SQLContext
       ◦ Upper point: HiveContext
       ◦ What's the difference?
         ▪ HiveContext has a more complete SQL parser and is recommended for most use cases
     • Datasets (1.6+) & DataFrames
  8. DataFrames & Datasets
     • DataFrames
       ◦ A Dataset organized into named columns
       ◦ Built from structured data files, tables in Hive, external databases, or existing RDDs
     • Datasets
       ◦ A distributed collection of data
         ▪ Available in Scala & Java
         ▪ Not yet supported in Python & R
  9. From 1.6 to 2.0
     • SparkSession
       ◦ The Dataset API and DataFrame API are unified
       ◦ Scala & Java: Dataset
       ◦ Python & R: DataFrame
     • For example (see the sketch below)
       ◦ unionAll -> union
       ◦ registerTempTable -> createOrReplaceTempView
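
     A short sketch of the renamed calls in PySpark 2.0, assuming df and df2
     are existing DataFrames with the same schema:

       df3 = df.union(df2)                     # was df.unionAll(df2) in 1.6
       df.createOrReplaceTempView("people")    # was df.registerTempTable("people")
       spark.sql("SELECT * FROM people").show()
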
  10. (image-only slide)
  11. Input
     • Connected with the local file system
     • Run SQL on files directly
     • Connected with Hive
     • JDBC to other databases
     (diagram: HDFS File, Local File, Hive Table)
  12. Connected with Local
     • Create a DataFrame from a local JSON file

     Python:
       filepath = "examples/src/main/resources/people.json"
       df = spark.read.load(filepath, format="json")
       df.show()

     Scala:
       val df = spark.read.json("examples/src/main/resources/people.json")
  13. Create DF from RDD

     Python:
       from pyspark.sql import Row

       rdd = sc.textFile("$SPARK_HOME/examples/src/main/resources/data.json")
       # spark.read.load() expects a path; for an RDD of JSON strings
       # use spark.read.json(rdd) instead
       df = spark.read.json(rdd)

     Scala:
       val rdd = sparkContext.textFile("src/main/resources/data.txt")
       val ds = sparkSession.read.text("src/main/resources/data.txt").as[String]
  14. Connected with Local
     • Run SQL on a file directly (note the backticks around the path)

     Python:
       df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

     Scala:
       val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
  15. Connected with Hive
     • SparkSQL supports reading and writing data stored in Hive (see the sketch below)
       ◦ The Hive dependency is not included in the default Spark distribution
     • Check the Hive configuration files in conf/
       ◦ hive-site.xml
       ◦ core-site.xml
       ◦ hdfs-site.xml
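
     A minimal sketch of enabling Hive support when building the session; the
     table name "src" is hypothetical:

       from pyspark.sql import SparkSession

       # enableHiveSupport() picks up hive-site.xml from conf/
       spark = SparkSession.builder \
           .appName("hive-demo") \
           .enableHiveSupport() \
           .getOrCreate()
       spark.sql("SELECT * FROM src LIMIT 10").show()
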
  16. JDBC to Other Databases
     • Put the JDBC driver for the DB on the Spark classpath
     • Connect to the remote DB

       bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar

     Python:
       df = spark.read.format("jdbc") \
           .option("url", "jdbc:postgresql:dbserver") \
           .option("dbtable", "schema.tablename") \
           .load()
  17. Output
     • DataFrames can also be saved as persistent tables in the Hive metastore using the saveAsTable command

       df.write.csv('mycsv.csv')
       df.write.format('json').save(path)
       df.write.parquet("mytable.parquet")
       df.write.saveAsTable("mytable")   # persistent table in the Hive metastore
  18. Save Mode

     Scala/Java                Python       Description
     SaveMode.ErrorIfExists    "error"      If data already exists, an exception is expected to be thrown.
     SaveMode.Append           "append"     Contents of the DataFrame are expected to be appended to existing data.
     SaveMode.Overwrite        "overwrite"  Existing data is expected to be overwritten by the contents of the DataFrame.
     SaveMode.Ignore           "ignore"     Like 'CREATE TABLE IF NOT EXISTS' in SQL.
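
     A small sketch of picking a save mode when writing; the output paths are
     hypothetical:

       # overwrite any existing data at the target path
       df.write.mode("overwrite").parquet("data/output.parquet")
       # append to existing data instead of failing
       df.write.mode("append").format("json").save("data/output_json")
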
  19. From SchemaRDD to DataFrame
     • DataFrame (since Spark 1.3)
     • Idea from Python pandas and R data frames
     • Create a DataFrame in three ways
       ◦ from an existing RDD
       ◦ by running a SQL query
         ▪ spark.sql("SELECT * FROM table")
       ◦ from data sources
         ▪ spark.read.json("s3n://path/data.json")
  20. DataFrames
     • A distributed collection of data organized into named columns
     • RDD API

       df.rdd.map(lambda x: (x[0], float(x[2]))) \
           .reduceByKey(lambda x, y: x + y) \
           .collect()

     • DataFrame API

       df.groupBy("category").sum("sales")
  21. • Common operations
       ◦ Selecting columns
       ◦ Joining different data sources
       ◦ Filtering with specific statements
       ◦ Aggregation (sum, count, average, etc.)
  22. DataFrame Operations - basic
     • Basic DataFrame operations

       df.show()
       df.printSchema()
       df.select("name").show()
       df.filter(df.age > 20).show()
       df.groupBy("name").count().show()
       df.join(df2, df2.uid == df.uid, "left")
  23. • Pandas <-> DataFrame

       import pandas as pd

       # Convert a SparkSQL DataFrame to a pandas DataFrame
       pandas_df = df.toPandas()
       # Create a SparkSQL DataFrame from pandas
       spark_df = spark.createDataFrame(pd.DataFrame([[1, 2]]))
  24. • RDD <-> DataFrames - (1)

       from pyspark.sql import Row

       # Load a text file and convert each line to a Row.
       lines = sc.textFile("examples/src/main/resources/people.txt")
       parts = lines.map(lambda l: l.split(","))
       people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
       # Infer the schema, and register the DataFrame as a table.
       schemaPeople = spark.createDataFrame(people)
       schemaPeople.createOrReplaceTempView("people")
  25. • RDD <-> DataFrames - (2)

       from pyspark.sql.types import StructType, StructField, StringType

       parts = lines.map(lambda l: l.split(","))
       people = parts.map(lambda p: (p[0], p[1].strip()))
       # Build the schema from a string of column names
       schemaString = "name age"
       fields = [StructField(field_name, StringType(), True)
                 for field_name in schemaString.split()]
       schema = StructType(fields)
       # Apply the schema to the RDD.
       schemaPeople = spark.createDataFrame(people, schema)
       # Create a temporary view using the DataFrame
       schemaPeople.createOrReplaceTempView("people")
  26. • Schema merging

     Python:
       from pyspark.sql import Row

       df1 = spark.createDataFrame(
           sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i * 2)))
       df1.write.parquet("data/test_table/key=1")
       df2 = spark.createDataFrame(
           sc.parallelize(range(6, 11)).map(lambda i: Row(single=i, triple=i * 3)))
       df2.write.parquet("data/test_table/key=2")
       # mergeSchema reconciles the two partition schemas on read
       df3 = spark.read.option("mergeSchema", "true").parquet("data/test_table")
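
     The merged DataFrame ends up with the union of the two schemas plus the
     partition column; a quick check (column order may vary):

       df3.printSchema()
       # root
       #  |-- single: long (nullable = true)
       #  |-- double: long (nullable = true)
       #  |-- triple: long (nullable = true)
       #  |-- key: integer (nullable = true)
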
  27. DataFrame Operations - For Example
     • Random data generation: df.withColumn('normal', randn(seed=123))
     • Descriptive statistics: df.describe('col1', 'col2')
     • Pearson correlation: df.stat.corr('col1', 'col2')
     • Cross tabulation: df.stat.crosstab('col1', 'col2')
  28. # random data generation
       from pyspark.sql.functions import randn

       df1 = df.withColumn('age2', df.age * 100)
       # data from a standard normal distribution
       df2 = df.withColumn('var1', randn(seed=123))
       df3 = df1.join(df2, df1.name == df2.name) \
                .select(df1.name, df1.age, 'age2', 'var1')
       df3.describe('age').show()
       df3.stat.corr('age', 'var1')
  29. Caching in memory
     • How to cache a table in SparkSQL? (see the sketch below)

       spark.catalog.cacheTable("tableName")   # SparkSession API in 2.0
       dataFrame.cache()

     • Remove a table from memory

       spark.catalog.uncacheTable("tableName")
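
     A small end-to-end sketch, assuming a DataFrame df is already loaded (the
     view name "people" is arbitrary):

       df.createOrReplaceTempView("people")
       spark.catalog.cacheTable("people")    # later queries read from memory
       spark.sql("SELECT COUNT(*) FROM people").show()
       spark.catalog.uncacheTable("people")
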
  30. User Defined Functions
     • How to create a UDF in SparkSQL?
     • Concept: lambda
     • In Python

       sqlContext.registerFunction('strLength', lambda x: len(x))
       sqlContext.sql("SELECT col1, strLength(col2) AS lenAct FROM mytable")
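
     In Spark 2.0 the equivalent registration goes through spark.udf.register;
     a minimal sketch, with "mytable" assumed to be an existing temp view and
     the return type declared explicitly:

       from pyspark.sql.types import IntegerType

       spark.udf.register("strLength", lambda x: len(x), IntegerType())
       spark.sql("SELECT col1, strLength(col2) AS lenAct FROM mytable").show()
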
  31. Review
     • SparkSQL 1.6 vs. 2.0
     • SQL queries from Hive and local files
     • DataFrame operations
     • Key points for user defined functions
     • Caching tables for iteration
  32. ?
     • Faster than our relational databases?
     • Spark on Hadoop?
     • Performance tuning?
     • Which language API is the best?