SparkSQL HadoopCon 2016

Erica Li
September 09, 2016


Spark tutorial slide:
Introduction to SparkSQL
Data sources
Generic data load & save mode
Cache in memory
User defined functions


Transcript

  1. About Me
     • Erica Li (ericalitw)
     • inBOUND CTO & Co-Founder
     • Taiwan Spark User Group Founder
  2. Agenda
     • Introduction to SparkSQL
     • Data sources
     • Generic data load & save mode
     • Cache in memory
     • User defined functions
  3. Challenges for Data Scientists
     • Many analytics experts are NOT Java or Scala experts
     • Steep learning curve for non-research specialists
     • Data scientists won’t set up a large compute cluster just for research
  4. SparkSQL
     • What is SparkSQL for?
       ◦ Part of the Spark distribution since Spark 1.0
       ◦ Connects BI tools through JDBC
     • The history of SparkSQL
     • Initialize SparkSQL with SparkSession (the new 2.0 entry point)
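A minimal sketch of the Spark 2.0 entry point mentioned above; the application name is illustrative and the file path follows the Spark examples:

    # Create the unified Spark 2.0 entry point (replaces SQLContext/HiveContext)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("sparksql-demo") \
        .getOrCreate()

    # DataFrame reads hang off the session
    df = spark.read.json("examples/src/main/resources/people.json")
    df.show()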
  5. Spark release timeline (source: http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf)
     • Spark 2.0.0 (28 Jul 2016)
     • Spark 1.6.2 (25 Jun 2016)
     • Spark 1.6.1 (09 Mar 2016)
     • Spark 1.6.0 (04 Jan 2016)
     • Spark 1.5.2 (09 Nov 2015)
     • Spark 1.5.1 (02 Oct 2015)
     • Spark 1.5.0 (09 Sep 2015)
     • Spark 1.4.1 (15 Jul 2015)
     • Spark 1.4.0 (11 Jun 2015)
     • Spark 1.2.2 / 1.3.1 (17 Apr 2015)
     • Spark 1.3.0 (13 Mar 2015)
     • Spark 1.2.1 (09 Feb 2015)
     • ...
  6. What’s new in SparkSQL 2.0
     • SparkSQL
       ◦ Structured API improvements
       ◦ Spark 2.0 can run all 99 TPC-DS queries, which require many of the SQL:2003 features
       ◦ Visualization of the DataFrame query plan in the web UI
       ◦ Support for multiple versions of Hive (0.12 - 1.2.1)
       ◦ Reading several Parquet variants, e.g. Impala, Avro, Thrift, Hive, etc.
       ◦ ...
  7. Core of SparkSQL
     • SparkSession
       ◦ Starting point: SQLContext (now unified into SparkSession)
       ◦ Superset of SQLContext: HiveContext
       ◦ What’s the difference?
         ▪ HiveContext has a more complete SQL parser and is recommended for most use cases
     • Datasets (1.6+) & DataFrames
  8. DataFrames & Datasets
     • DataFrames
       ◦ A Dataset organized into named columns
       ◦ Built from structured data files, tables in Hive, external databases, or existing RDDs
     • Datasets
       ◦ A distributed collection of data
         ▪ Available in Scala & Java
         ▪ Not yet supported in Python & R
  9. From 1.6 to 2.0
     • SparkSession
       ◦ The Dataset API and the DataFrame API are unified
       ◦ Scala & Java - Dataset
       ◦ Python & R - DataFrame
     • For example (see the sketch below)
       ◦ unionAll -> union
       ◦ registerTempTable -> createOrReplaceTempView
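A short sketch of the renamed 2.0 calls; df1 and df2 are assumed to be DataFrames with the same schema:

    # Spark 2.0 style: union replaces the deprecated unionAll
    combined = df1.union(df2)

    # createOrReplaceTempView replaces registerTempTable
    combined.createOrReplaceTempView("people")
    spark.sql("SELECT count(*) FROM people").show()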
  10. (no text on this slide)

  11. Input
     • Connect to the local file system
     • Run SQL on files directly
     • Connect to Hive
     • JDBC to other databases
     (sources: HDFS files, local files, Hive tables)
  12. Connected with Local
     • Create a DataFrame from a local JSON file

     Python
       filepath = "examples/src/main/resources/people.json"
       df = spark.read.load(filepath, format="json")
       df.show()

     Scala
       val df = spark.read.json("examples/src/main/resources/people.json")
  13. Create DF from RDD

     Python
       rdd = sc.textFile("$SPARK_HOME/examples/src/main/resources/data.json")
       # spark.read.json accepts an RDD of JSON strings
       df = spark.read.json(rdd)

     Scala
       val rdd = sparkContext.textFile("src/main/resources/data.txt")
       val ds = sparkSession.read.text("src/main/resources/data.txt").as[String]
  14. Connected with Local
     • Run SQL on a file directly

     Python
       df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

     Scala
       val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
  15. Connected with Hive
     • SparkSQL supports reading and writing data stored in Hive
       ◦ The Hive dependencies are not included in the default Spark distribution
     • Check the Hive configuration files in conf/
       ◦ hive-site.xml
       ◦ core-site.xml
       ◦ hdfs-site.xml
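A hedged sketch of enabling Hive support on the SparkSession so spark.sql can reach the metastore; the table name is illustrative:

    from pyspark.sql import SparkSession

    # enableHiveSupport wires the session to the Hive metastore configured in conf/
    spark = SparkSession.builder \
        .appName("hive-demo") \
        .enableHiveSupport() \
        .getOrCreate()

    spark.sql("SHOW TABLES").show()
    spark.sql("SELECT * FROM my_hive_table LIMIT 10").show()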
  16. JDBC to Other Databases
     • Put the JDBC driver for the database on the Spark classpath
       bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
     • Read from the remote database

     Python
       df = spark.read \
           .format("jdbc") \
           .option("url", "jdbc:postgresql:dbserver") \
           .option("dbtable", "schema.tablename") \
           .load()
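An alternative hedged sketch using the read.jdbc shortcut, with credentials passed as connection properties (the user and password values are placeholders):

    df = spark.read.jdbc(
        url="jdbc:postgresql:dbserver",
        table="schema.tablename",
        properties={"user": "username", "password": "password"})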
  17. Output
     • DataFrames can also be saved as persistent tables in the Hive metastore using the saveAsTable command

       df.write.csv('mycsv.csv')
       df.write.format('json').save(path)
       df.write.parquet("mytable.parquet")
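A hedged sketch of the saveAsTable path mentioned above; the table name is illustrative:

    # Persist the DataFrame as a managed table in the Hive metastore
    df.write.mode("overwrite").saveAsTable("people_table")

    # A later session with Hive support can read it back by name
    spark.table("people_table").show()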
  18. Save Mode (Scala/Java name / Python string: description)
     • SaveMode.ErrorIfExists / "error": if data already exists, an exception is thrown
     • SaveMode.Append / "append": contents of the DataFrame are appended to the existing data
     • SaveMode.Overwrite / "overwrite": existing data is overwritten by the contents of the DataFrame
     • SaveMode.Ignore / "ignore": like CREATE TABLE IF NOT EXISTS; the save is a no-op if data already exists
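A brief hedged sketch of selecting a save mode when writing; the output paths are illustrative:

    df.write.mode("overwrite").parquet("output/mytable.parquet")
    df.write.mode("append").format("json").save("output/json_dir")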
  19. From SchemaRDD to DataFrame
     • DataFrame (since Spark 1.3)
     • Idea borrowed from Python pandas and R data frames
     • Create a DataFrame in three ways
       ◦ from an existing RDD
       ◦ by running a SQL query
         ▪ spark.sql("SELECT * FROM table")
       ◦ from data sources
         ▪ spark.read.json("s3n://path/data.json")
  20. DataFrames
     • A distributed collection of data organized into named columns
     • RDD API
       df.map(lambda x: (x[0], float(x[2]))) \
         .reduceByKey(lambda x, y: x + y) \
         .collect()
     • DataFrame API
       df.groupBy("category").sum("sales")
  21. Common operations
     • Selecting columns
     • Joining different data sources
     • Filtering with specific predicates
     • Aggregation (sum, count, average, etc.)
  22. DataFrame Operations - Basic
     • Basic DataFrame operations
       df.show()
       df.printSchema()
       df.select("name").show()
       df.filter(df.age > 20).show()
       df.groupBy("name").count().show()
       df.join(df2, df2.uid == df.uid, "left")
  23. Pandas <-> DataFrame
       import pandas

       # Convert a SparkSQL DataFrame to a pandas DataFrame
       pandas_df = df.toPandas()

       # Create a SparkSQL DataFrame from a pandas DataFrame
       spark_df = spark.createDataFrame(pandas.DataFrame([[1, 2]]))
  24. RDD <-> DataFrames (1)
       from pyspark.sql import Row

       # Load a text file and convert each line to a Row.
       lines = sc.textFile("examples/src/main/resources/people.txt")
       parts = lines.map(lambda l: l.split(","))
       people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

       # Infer the schema, and register the DataFrame as a table.
       schemaPeople = spark.createDataFrame(people)
       schemaPeople.createOrReplaceTempView("people")
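A short follow-up sketch: once the view is registered it can be queried with plain SQL.

    teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.show()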
  25. RDD <-> DataFrames (2)
       from pyspark.sql.types import StructType, StructField, StringType

       parts = lines.map(lambda l: l.split(","))
       people = parts.map(lambda p: (p[0], p[1].strip()))

       # Build the schema explicitly
       schemaString = "name age"
       fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
       schema = StructType(fields)

       # Apply the schema to the RDD.
       schemaPeople = spark.createDataFrame(people, schema)

       # Create a temporary view using the DataFrame
       schemaPeople.createOrReplaceTempView("people")
  26. Schema Merging

     Python
       df1 = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i * 2)))
       df1.write.parquet("data/test_table/key=1")

       df2 = spark.createDataFrame(sc.parallelize(range(6, 11)).map(lambda i: Row(single=i, triple=i * 3)))
       df2.write.parquet("data/test_table/key=2")

       df3 = spark.read.option("mergeSchema", "true").parquet("data/test_table")
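A small follow-up sketch: the merged DataFrame should expose the columns from both writes plus the partition column.

    # Expected columns: single, double, triple, and the partition column key
    df3.printSchema()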
  27. DataFrame Operations - For Example
     • Random data generation
       df.withColumn('normal', randn(seed=123))
     • Descriptive statistics
       df.describe('col1', 'col2')
     • Pearson correlation
       df.stat.corr('col1', 'col2')
     • Cross tabulation
       df.stat.crosstab('col1', 'col2')
  28.  # random data generation
       from pyspark.sql.functions import randn

       df1 = df.withColumn('age2', df.age * 100)

       # data from a standard normal distribution
       df2 = df.withColumn('var1', randn(seed=123))

       df3 = df1.join(df2, df1.name == df2.name).select(df1.name, df1.age, 'age2', 'var1')
       df3.describe('age')
       df3.stat.corr('age', 'var1')
  29. Caching in Memory
     • How to cache a table in Spark SQL?
       spark.catalog.cacheTable("tableName")
       dataFrame.cache()
     • Remove a table from memory
       spark.catalog.uncacheTable("tableName")
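A hedged end-to-end sketch: cache a registered view, query it, then release the memory.

    df.createOrReplaceTempView("people")
    spark.catalog.cacheTable("people")

    spark.sql("SELECT count(*) FROM people").show()   # first scan materializes the cache
    spark.sql("SELECT avg(age) FROM people").show()   # served from the in-memory columnar cache

    spark.catalog.uncacheTable("people")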
  30. User Defined Functions
     • How to create a UDF in SparkSQL?
     • Concept - lambda
     • In Python
       sqlContext.registerFunction('strLength', lambda x: len(x))
       sqlContext.sql("SELECT col1, strLength(col2) AS lenAct FROM mytable")
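A hedged sketch of the same UDF in the Spark 2.0 SparkSession style, assuming a temp view mytable with a string column col2:

    from pyspark.sql.types import IntegerType

    # spark.udf.register is the SparkSession counterpart of sqlContext.registerFunction
    spark.udf.register("strLength", lambda s: len(s), IntegerType())

    spark.sql("SELECT col1, strLength(col2) AS lenAct FROM mytable").show()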
  31. Review
     • SparkSQL 1.6 vs. 2.0
     • SQL queries on Hive tables and local files
     • DataFrame operations
     • Key points for user defined functions
     • Caching for iterative workloads
  32. ?
     • Faster than our relational databases?
     • Spark on Hadoop?
     • Performance tuning?
     • Which language API is the best?