Lock in $30 Savings on PRO—Offer Ends Soon! ⏳

Spark SQL of Hadoop Con 2015

Erica Li
September 19, 2015

Spark SQL of Hadoop Con 2015

In this lecture, you will learn how to use SQL command in Spark 1.4.1. For example, data sources, data input, output and udf functions. Enjoy it.

Erica Li

September 19, 2015
Tweet

More Decks by Erica Li

Other Decks in Programming

Transcript

  1. About Me ! • @Spark TW • Spent time with

    a lot of fun • Doing data mining using Hadoop and Spark ! • http://goo.gl/WU9xdu Machu Picchu in Perú
  2. Review of SQL SQL ! • Structured Query Language •

    SELECT statement • SELECT • FROM • WHERE • GROUP BY • ORDER BY SELECT category, SUM(rev) AS total_rev FROM mytable WHERE rev > 2000 GROUP BY category http://www.w3schools.com/sql/sql_groupby.asp Category Product Rev cellphone thin 6000 tablet normal 1500 tablet mini 5500 cellphone ultra thin 5000 cellphone very thin 6000 tablet big 2500 cellphone bendale 3000 cellphone foldable 3000 tablet pro 4500 tablet pro2 6500
  3. About Spark SQL Spark SQL ! • From Shark to

    Spark SQL • Support relational processing both within native RDDs and data sources • High performance using established DBMS techniques • Enable extension for graph and machine learning Bye Bye MapReduce
  4. This Lecture Spark SQL ! • Spark SQL concept -

    v1.4.1 • Data sources • Input and output • User defined functions • Performance tuning Spark RDD Spark SQL DataFrame Catalyst Optimizer JDBC User Program
  5. Concept Spark SQL ! • SchemaRDD to DataFrame (since Spark

    1.3) • SQLContext vs. HiveContext • Data sources API - external & built-in Parquet HDFS Hive JSON S3 MySQL PostgresQL HBASE CSV AVRO elasticsearch cassandra …more…
  6. Query Planning M Armbrust et al. (2015) Spark SQL: Relational

    Data Processing in Spark SQL Query DataFrame Unresolved! Logical Plan Logical Plan Optimized! Logical Plan Physical! Plan Cost Model Selected! Physical Plan RDDs Catalog Analysis Logical Optimization Physical! Planning Code Generation
  7. DataFrames A distributed collection of data named columns df.map(lambda x:

    (x[0], float(x[2])))\ .reduceByKey(lambda x, y: x + y)\ .collect() df.groupBy("category").sum("sales") Category Product Rev cellphone thin 6000 tablet normal 1500 tablet mini 5500 cellphone ultra thin 5000 cellphone very thin 6000 tablet big 2500 cellphone bendale 3000 cellphone foldable 3000 tablet pro 4500 tablet pro2 6500 • RDD API • DataFrame API
  8. Abundant Classes Spark SQL ! • SQLContext • DataFrame •

    Row • HiveContext • GroupedData • Window ! ! • DataFrameNaFunctions • DataFrameStatFunctions • Functions • Types • …
  9. Spark SQL & DataFrames ! • Random data generation !

    • Descriptive statistics ! • Selected summary statistics ! • Pearson correlation ! • Cross tabulation ! • Aggregation ! • Multi-dimensional cube df.withColumn('normal', randn(seed=123)) df.describe('col1', 'col2') df.select(mean('col1'),min('col2')) df.stat.corr('col1','col2') df.stat.crosstab('col1','col2') df.groupBy('col1').sum('col2') df.cube('col1').sum('col2')
  10. Let’s Start it Spark SQL ! • Start with HiveContext

    • SELECT, GROUP BY, ORDER BY, JOIN, etc. • Available API - Scala, Python, Java, R (1.4+)
  11. Input & Output Spark SQL ! • Read, write, and

    transform data • SELECT, GROUP BY, ORDER BY, JOIN, NESTED-QUERY • Default - Parquet, JSON sqlContext.read.json(filepath) df.write.json(filepath) sqlContext.table(tablename)
  12. User Defined Functions Spark SQL ! • Build your own

    functions registerFunction() • Of course with Hive UDF, UDAF, make sure the jar sqlContext.registerFunction( \ 'strLength',\ lambda x: len(x)) UDAF in DF? https://issues.apache.org/jira/browse/SPARK-6802
  13. Performance Tuning Spark SQL ! • Materialize hot data in

    memory • How to speed up the performance of query execution? spark.sql.shuffle.partitions spark.sql.inMemoryColumnarStorage.batchSize sqlContext.cacheTable('df') sqlContext.sql('CACHE TABLE df') sqlContext.uncacheTable('df') sqlContext.setConf('spark.sql.shuffle.partitions','200')
  14. Column Category Column Product Column Rev cellphone tablet tablet cellphone

    thin normal mini ultra thin 6000 1500 5000 ultra thin Array ByteBuffer Columnar in-Memory Storage Partition 1
  15. Q: Spark SQL vs. Impala Which one is better? !

    • Or which one should be used for which use cases? Iterative data processing? Big volumes? Heavy ETL jobs?
  16. Reference Spark SQL ! • Holden Karau et al. Learning

    Spark, O'Reilly Media, 2015 • Spark SQL and DataFrame Guide • …