Apache Spark Training [Spark Core, Spark SQL, Tungsten]

All code examples are available here https://github.com/zaleslaw/Spark-Tutorial

A Gitbook with notes for this training can be found here

Here you can find the slides of my Open Apache Spark Training and links to the video tutorial (in Russian).

Alexey Zinoviev

September 11, 2017

  About

    With IT since 2007 • With Java since 2009 • With Hadoop since 2012 • With Spark since 2014 • With EPAM since 2015
  Contacts

    E-mail: [email protected] • Twitter: @zaleslaw, @BigDataRussia • vk.com/big_data_russia (Big Data Russia) + Telegram @bigdatarussia • vk.com/java_jvm (Java & JVM langs) + Telegram @javajvmlangs
  Github

    Spark Tutorial: Core, Streaming, Machine Learning https://github.com/zaleslaw/Spark-Tutorial
  Gitbook

    Data Processing with Spark 2.2 and Kafka 0.10 (in Russian) www.gitbook.com/book/zaleslaw/data-processing-book
  It's hard to …

    .. store • .. handle • .. search in • .. visualize • .. send over the network
  Advantages

    native Python, Scala, R interface • interactive shells • in-memory caching of data, specified by the user • > 80 highly efficient distributed operations, any combination of them • capable of reusing Hadoop ecosystem, e.g. HDFS, YARN
  Say me R..say me D..

    Say me D again • Dataset • Distributed • Resilient
  Loading

    // from a local collection (parallelize needs a Seq, not a tuple)
    val localData = Seq(5, 7, 1, 12, 10, 25)
    val ourFirstRDD = sc.parallelize(localData)
    // from file
    val textFile = sc.textFile("hdfs://...")
  Loading

    // Wildcards, running on directories, text and archives
    sc.textFile("/my/directory")
    sc.textFile("/my/directory/*.txt")
    sc.textFile("/my/directory/*.gz")
    // Read directory and return as filename/content pairs
    sc.wholeTextFiles
    // Read sequence file
    sc.sequenceFile[TKey, TValue]
    // Takes an arbitrary JobConf and InputFormat class
    sc.hadoopRDD
    sc.newAPIHadoopRDD
    // SerDe
    rdd.saveAsObjectFile
    sc.objectFile
  Word Count

    val textFile = sc.textFile("hdfs://...")
    val counts = textFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
  Transformations

    map, flatMap, filter • groupByKey, reduceByKey, sortByKey • mapValues, distinct • join, union • sample
  Actions

    reduce • collect, first, take, foreach • count(), countByKey() • saveAsTextFile()
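The split between the two lists above can be sketched as follows: transformations only describe a new RDD, while actions trigger a job and return a result to the driver. This is a minimal sketch, not code from the training; the object name, the helper `evenSquares`, and the `local[*]` master are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

object TransformationsVsActions {
  // Hypothetical helper: builds a lazy pipeline and then runs two actions on it.
  def evenSquares(spark: SparkSession, numbers: Seq[Int]): (Long, Array[Int]) = {
    val rdd = spark.sparkContext.parallelize(numbers)
    // Transformations: nothing is computed here, Spark only records the lineage.
    val squares = rdd.filter(_ % 2 == 0).map(n => n * n)
    // Actions: each call triggers a job and brings a result back to the driver.
    (squares.count(), squares.collect())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("transformations-vs-actions")
      .master("local[*]")
      .getOrCreate()
    val (count, values) = evenSquares(spark, Seq(5, 7, 1, 12, 10, 25))
    println(s"count = $count, values = ${values.mkString(", ")}")
    spark.stop()
  }
}
```

Because `squares` is not cached, the two actions each recompute the filter and map steps.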
  Caching in Spark

    A frequently used RDD can be stored in memory • One method, one shortcut: persist(), cache() • SparkContext keeps track of cached RDDs • Stored as serialized or deserialized Java objects
  Full list of options

  Spark Core Storage Level

    MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER (default for Spark Streaming) • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
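A storage level from the list above is chosen per RDD. The following is a minimal sketch of `cache()` versus `persist()` with an explicit level; the object and helper names and the `local[*]` master are assumptions for illustration, not part of the training material.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
  def cachedSquares(sc: SparkContext): RDD[Long] =
    sc.parallelize(1 to 1000).map(n => n.toLong * n).cache()

  // An explicit level: serialized objects in memory, spilled to disk if they do not fit.
  def persistedEvens(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 1000).filter(_ % 2 == 0).persist(StorageLevel.MEMORY_AND_DISK_SER)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().appName("caching-sketch").master("local[*]").getOrCreate()
    val squares = cachedSquares(spark.sparkContext)
    val evens = persistedEvens(spark.sparkContext)
    // The first action computes and stores the partitions; the second one reuses them.
    println(squares.count())
    println(squares.first())
    println(evens.getStorageLevel.description)
    // Release the stored blocks once they are no longer needed.
    squares.unpersist()
    evens.unpersist()
    spark.stop()
  }
}
```

Marking an RDD as persisted is itself lazy: nothing is stored until the first action runs.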
  Development tools

    Console REPL ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin • IntelliJ IDEA Community + Scala Plugin • Don't forget about SBT or adding Spark's jars
  SBT build

    name := "Spark-app"
    version := "1.0"
    scalaVersion := "2.11.11"
    libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"
    libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0"
  SBT build

    name := "Spark-app"
    version := "1.0"
    scalaVersion := "2.11.11"
    // "provided": the cluster supplies the Spark jars at runtime
    libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0" % "provided"
    libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided"
  DAG Scheduler

    Builds stages of tasks • Submits them to the lower-level task scheduler • The lower-level scheduler schedules tasks based on data locality • Resubmits failed stages if their outputs are lost
  Task in Spark

    Unit of work executed in an executor thread • Unlike MR, there is no "map" vs "reduce" task • Each task applies a set of transformations to the same partition of the RDD • Each task either partitions its output for "shuffle" or sends the output back to the driver
  Cluster Modes

    Local mode • Stand-alone mode • Yarn • Mesos
  Spark Master URL

    local, local[n], local[*], local[K,F], local[*,F] • spark://host:port or spark://host1:port,host2:port • yarn-client or yarn-cluster • mesos://host:port
  Submit

    ./bin/spark-submit \
      --class com.epam.SparkJob1 \
      --master spark:// \
      --executor-memory 2G \
      --total-executor-cores 10 \
      /path/to/artifact.jar
  Deployment

    A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines
  Submit

    ./bin/spark-submit \
      --class com.epam.SparkJob1 \
      --master mesos:// \
      --executor-memory 2G \
      --deploy-mode cluster \
      --total-executor-cores 10 \
      /path/to/artifact.jar
  Every Spark application launches a web UI

    A list of scheduler stages and tasks • A summary of RDD sizes and memory usage • Environmental information • Information about the running executors
  RDD Lineage

    RDD lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD.
  toDebugString prints …

    The execution DAG or physical execution plan is the DAG of stages.
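To see a lineage in practice, `toDebugString` can be called on any RDD. The sketch below is a minimal illustration, not code from the training: the object name, the helper `wordCountLineage`, the sample strings, and the `local[*]` master are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  // Hypothetical helper: builds a small word-count pipeline and returns its lineage as text.
  def wordCountLineage(spark: SparkSession): String = {
    val counts = spark.sparkContext
      .parallelize(Seq("a b", "b c", "c a"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // toDebugString describes this RDD and its recursive dependencies.
    // No job is started by this call.
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().appName("lineage-sketch").master("local[*]").getOrCreate()
    println(wordCountLineage(spark))
    spark.stop()
  }
}
```

In the printed text, a change of indentation marks a shuffle boundary, which is where the DAG scheduler splits the job into stages.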
  spark.logLineage

    $ ./bin/spark-shell --conf spark.logLineage=true
    scala> sc.textFile("README.md", 4).count
    ...
    15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25
    15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies:
    (4) MapPartitionsRDD[1] at textFile at <console>:25 []
     |  README.md HadoopRDD[0] at textFile at <console>:25 []
  Case class for RDD

    User(height: Int (not null), name: String, age: Int)
  The main concept

    DataFrames are composed of Row objects, along with a schema that describes the data types of each column in the row
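The Row-plus-schema idea can be sketched by building a DataFrame by hand. This is a minimal sketch under stated assumptions: the column names follow the User slide above, while the object name, the sample rows, and the `local[*]` master are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RowsAndSchema {
  // Builds a DataFrame from untyped Row objects plus an explicit schema.
  def usersDataFrame(spark: SparkSession): DataFrame = {
    // The schema carries column names, data types and nullability.
    val schema = StructType(Seq(
      StructField("height", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    // Hypothetical sample data; a Row itself knows nothing about column names.
    val rows = spark.sparkContext.parallelize(Seq(
      Row(180, "Alice", 30),
      Row(175, "Bob", 19)))
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().appName("rows-and-schema").master("local[*]").getOrCreate()
    val df = usersDataFrame(spark)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```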
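  RDD->DF

    // spark.read.json replaces the deprecated sqlContext.jsonFile and already returns a DataFrame
    val df = spark.read.json("hdfs://localhost:9000/users.json")
    // An RDD of case class instances converts with rdd.toDF() after import spark.implicits._
    val newRDD = df.rdd // DataFrame -> RDD[Row]
    df.show()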
  DataFrame's nature

    Like an RDD with a schema, but it is not an RDD anymore • Distributed collection of data grouped into named columns • Domain-specific language designed for common tasks on structured data • Available in Python, Scala, Java, and R (via SparkR) • Evolved from SchemaRDD
  DataFrame as SQL

    Selecting columns and filtering • Joining different data sources • Aggregation (count, sum, average, etc) • Plotting results with Pandas (with PySpark)
  Run SQL

    val df = spark.read.json("/home/users.json")
    df.createOrReplaceTempView("users")
    val sqlDF = spark.sql("SELECT name FROM users")
    sqlDF.show()
  Spark SQL advantages

    Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark • Unifies Stack with Spark Core, Spark Streaming etc. • Hive compatibility • Standard connectivity (JDBC, ODBC)
  If you have Hive in a Spark application

    Support for writing queries in HQL • Catalog info from Hive MetaStore • Tablescan operator that uses Hive SerDes • Wrappers for Hive UDFs, UDAFs, UDTFs
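  Hive

    // HiveContext is deprecated since Spark 2.0; SparkSession.builder().enableHiveSupport() is the replacement
    val hive = new HiveContext(spark.sparkContext)
    hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hive.sql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")
    val results = hive.sql("FROM src SELECT key, value").collect()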
  RDD vs DataFrame vs Dataset

    rdd.filter(_.age > 21)             // RDD
    df.filter("age > 21")              // DataFrame SQL-style
    df.filter(df.col("age").gt(21))    // Expression style
    dataset.filter(_.age < 21)         // Dataset API
  DataSet = RDD's types + DataFrame's Catalyst

    RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  Unified API in Spark 2.0

    DataFrame = Dataset[Row] • A DataFrame is now an untyped Dataset: it still has a schema, but its columns are checked at runtime rather than at compile time
  Define case class

    case class User(email: String, footSize: Long, name: String)
    import spark.implicits._ // brings the Encoder for User into scope
    // DataFrame -> DataSet with Users
    val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User]
    userDS.map(_.name).collect()
    userDS.filter(_.footSize > 38).collect()
    userDS.rdd // IF YOU REALLY WANT
  DataSet.explain()

    == Physical Plan ==
    Project [avg(price)#43,carat#45]
    +- SortMergeJoin [color#21], [color#47]
       :- Sort [color#21 ASC], false, 0
       :  +- TungstenExchange hashpartitioning(color#21,200), None
       :     +- Project [avg(price)#43,color#21]
       :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
       :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
       :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
       :                 +- Scan CsvRelation(-----)
       +- Sort [color#47 ASC], false, 0
          +- TungstenExchange hashpartitioning(color#47,200), None
             +- ConvertToUnsafe
                +- Scan CsvRelation(----)
  How to be effective with CPU

    Runtime code generation (Whole Stage Code Generation) • Cache locality • Off-heap memory management
  Two choices to distribute data across cluster

    Java serialization: the default, based on ObjectOutputStream • Kryo serialization: classes should be registered in advance (it does not support all Serializable types)
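Switching from Java to Kryo serialization is a configuration change. The following is a minimal sketch, assuming a `User` case class like the one used earlier in this training; the object name, the application name, and the `local[*]` master are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Assumption: the same shape as the User case class from the Dataset slides.
case class User(email: String, footSize: Long, name: String)

object KryoConfigSketch {
  // Builds a configuration that uses Kryo and registers the classes to be serialized.
  def kryoConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-sketch")
      .setMaster("local[*]") // assumption: local run
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written as a small numeric id instead of a full class name.
      .registerKryoClasses(Array(classOf[User]))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(kryoConf()).getOrCreate()
    println(spark.conf.get("spark.serializer"))
    spark.stop()
  }
}
```

This setting affects shuffled and serialized-cached RDD data; Datasets use their own encoders for rows.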
  The main problem: overhead of serializing

    Each serialized object contains the class structure as well as the values • Don't forget about GC
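The overhead is easy to measure with plain JVM serialization, without Spark. The sketch below is a minimal illustration; the object name and the helper `javaSerializedSize` are invented for this example.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializationOverhead {
  // Serializes a value with default Java serialization and returns the number of bytes written.
  def javaSerializedSize(value: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    // The payload of an Int is 4 bytes; the stream also carries a header and the class description.
    val boxed = java.lang.Integer.valueOf(42)
    println(s"payload: 4 bytes, serialized: ${javaSerializedSize(boxed)} bytes")
  }
}
```

Running it shows a serialized size many times larger than the 4-byte payload, which is the cost the slide refers to.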
  UnsafeRowFormat

    Bit set for tracking null values • Small values are inlined • For variable-length values, a relative offset into the variable-length data section is stored • Rows are always 8-byte word aligned • Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation
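The layout described above can be imitated with a plain byte array. This is a simplified sketch of the idea, not Spark's UnsafeRow class: the two-field row (age, name), the object and method names, and the little-endian byte order are all assumptions made for illustration.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

object RowLayoutSketch {
  private val FixedSize = 8 + 2 * 8 // null bit set word + one 8-byte word per field

  private def roundUpToWord(n: Int): Int = (n + 7) / 8 * 8

  // Layout: [null bits: 8 bytes][age: 8 bytes][name offset and length: 8 bytes][name bytes, zero padded]
  def encode(age: Option[Int], name: Option[String]): Array[Byte] = {
    val nameBytes = name.map(_.getBytes(StandardCharsets.UTF_8)).getOrElse(Array.emptyByteArray)
    val buf = ByteBuffer.allocate(FixedSize + roundUpToWord(nameBytes.length))
      .order(ByteOrder.LITTLE_ENDIAN)
    var nullBits = 0L
    if (age.isEmpty) nullBits |= 1L << 0
    if (name.isEmpty) nullBits |= 1L << 1
    buf.putLong(0, nullBits)
    // Small fixed-width value: stored inline in its own word.
    buf.putLong(8, age.getOrElse(0).toLong)
    if (name.isDefined) {
      // Variable-length value: the word holds (offset from row start << 32) | length.
      buf.putLong(16, (FixedSize.toLong << 32) | nameBytes.length.toLong)
      System.arraycopy(nameBytes, 0, buf.array(), FixedSize, nameBytes.length)
    }
    buf.array()
  }

  // Reads the name back using only the bit set and the offset/length word.
  def decodeName(row: Array[Byte]): Option[String] = {
    val buf = ByteBuffer.wrap(row).order(ByteOrder.LITTLE_ENDIAN)
    if ((buf.getLong(0) & (1L << 1)) != 0) None
    else {
      val word = buf.getLong(16)
      Some(new String(row, (word >>> 32).toInt, word.toInt, StandardCharsets.UTF_8))
    }
  }

  def main(args: Array[String]): Unit = {
    val a = encode(Some(30), Some("Alice"))
    val b = encode(Some(30), Some("Alice"))
    println(s"row size: ${a.length} bytes, name: ${decodeName(a)}")
    // Equal rows produce identical bytes, so they can be compared without decoding.
    println(java.util.Arrays.equals(a, b))
  }
}
```

Zero padding of the variable-length section is what makes raw byte comparison safe in this sketch.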
  Encoder's concept

    Generate bytecode to interact with off-heap & Give access to attributes without ser/deser
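An encoder can be inspected directly. The sketch below derives one for a case class and prints the schema it maps to; the `User` class mirrors the one used earlier in this training, and the object name is an assumption.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Assumption: the same shape as the User case class from the Dataset slides.
case class User(email: String, footSize: Long, name: String)

object EncoderSketch {
  // Derives an encoder that maps User objects to and from Spark's internal row format.
  val userEncoder: Encoder[User] = Encoders.product[User]

  def main(args: Array[String]): Unit = {
    // The encoder knows the column layout, so single attributes can be read
    // without deserializing whole objects.
    println(userEncoder.schema.treeString)
  }
}
```

In spark-shell or after `import spark.implicits._`, the same encoder is picked up implicitly by calls such as `.as[User]`.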
  Special Tool from Databricks

    Benchmark Tool for SparkSQL https://github.com/databricks/spark-sql-perf
