
Apache Spark Training [Spark Core, Spark SQL, Tungsten]

All code examples are available here https://github.com/zaleslaw/Spark-Tutorial

A GitBook with notes for this training can be found here:
https://www.gitbook.com/book/zaleslaw/data-processing-book/details

Here you can find the slides of my open Apache Spark training and links to the video tutorial (in Russian).

Alexey Zinoviev

September 11, 2017

Transcript

  1. Spark Training
    Alexey Zinovyev, Java/BigData Trainer in EPAM


  2. About
    With IT since 2007
    With Java since 2009
    With Hadoop since 2012
    With Spark since 2014
    With EPAM since 2015

3.
    Training from Zinoviev Alexey
    Contacts
    E-mail : [email protected]
    Twitter : @zaleslaw @BigDataRussia
    vk.com/big_data_russia Big Data Russia
    + Telegram @bigdatarussia
    vk.com/java_jvm Java & JVM langs
    + Telegram @javajvmlangs

4.
    Github
    Spark Tutorial: Core, Streaming, Machine Learning
    https://github.com/zaleslaw/Spark-Tutorial

5.
    Gitbook
Data Processing with Spark 2.2 and Kafka 0.10
    www.gitbook.com/book/zaleslaw/data-processing-book

6.
    Spark
    Family

7.
    Spark
    Family

8.
    WHAT IS BIG DATA?

9.
    Joke about Excel

10.
    Every 60 seconds…

11.
    Is BigData about PBs?

12.
    Is BigData about PBs?

13.
It’s hard to …
• … store
• … handle
• … search in
• … visualize
• … send over the network

14.
How to handle
all this stuff?

15.
    Just do it … in parallel

16.
    Parallel Computing vs Distributed Computing

17.
    Modern Java in 2016
    Big Data in 2014

18.
    Batch jobs produce reports. More and more..

19.
But the customer can wait forever (OK, one hour)

20.
    Hadoop Architecture

21.
    HDFS Architecture

22.
    Daemons in YARN

23.
    Different
    scheduling
    algorithms

24.
    Hive Data Model

25.
    Machine Learning EVERYWHERE

26.
    Data Lake in promotional brochure

27.
    Data Lake in production

28.
    Simple Flow in Reporting/BI systems

29.
    WHY SHOULD WE USE
    SPARK?

30.
    Advantages
    • native Python, Scala, R interface
    • interactive shells
    • in-memory caching of data, specified by the user
    • > 80 highly efficient distributed operations, any
    combination of them
    • capable of reusing Hadoop ecosystem, e.g. HDFS, YARN

31.
    MapReduce vs Spark

32.
    MapReduce vs Spark

33.
    MapReduce vs Spark

34.
    Let’s use Spark. It’s fast!

35.
    SPARK INTRO

36.
Say R… say D… say D again
    • Dataset
    • Distributed
    • Resilient

37.
    Single
    Thread
    collection

38.
    No perf
    issues,
    right?

39.
    The main concept
    more partitions = more parallelism
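The rule above can be sketched in plain Scala (this is not the Spark API, just the idea): splitting a dataset into more partitions gives the scheduler more independent units of work that can run in parallel, the way `sc.parallelize(data, numPartitions)` would.

```scala
object PartitionSketch {
  // Split a dataset into roughly equal chunks, one per partition.
  def partition[T](data: Seq[T], numPartitions: Int): Seq[Seq[T]] = {
    val size = math.ceil(data.length.toDouble / numPartitions).toInt
    data.grouped(size).toSeq
  }
}
```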

40.
    Loading
val localData = Seq(5, 7, 1, 12, 10, 25)
    val ourFirstRDD = sc.parallelize(localData)
    val textFile = sc.textFile("hdfs://...")

41.
    Loading
val localData = Seq(5, 7, 1, 12, 10, 25)
    val ourFirstRDD = sc.parallelize(localData)
    val textFile = sc.textFile("hdfs://...")

42.
    Loading
val localData = Seq(5, 7, 1, 12, 10, 25)
    val ourFirstRDD = sc.parallelize(localData)
    // from file
    val textFile = sc.textFile("hdfs://...")

43.
    Loading
    // Wildcards, running on directories, text and archives
    sc.textFile("/my/directory")
    sc.textFile("/my/directory/*.txt")
    sc.textFile("/my/directory/*.gz")
    // Read directory and return as filename/content pairs
    sc.wholeTextFiles
    // Read sequence file
    sc.sequenceFile[TKey, TValue]
    // Takes an arbitrary JobConf and InputFormat class
    sc.hadoopRDD
    sc.newAPIHadoopRDD
    // SerDe
    rdd.saveAsObjectFile
    sc.objectFile

44.
    Spark
    Context

45.
    RDD OPERATIONS

46.
    Word
    Count
    val textFile = sc.textFile("hdfs://...")
    val counts = textFile
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

47.
    Word
    Count
    val textFile = sc.textFile("hdfs://...")
    val counts = textFile
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

48.
    Word
    Count
    val textFile = sc.textFile("hdfs://...")
    val counts = textFile
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

49.
    What’s the difference between
    actions and transformations?

50.
    RDD Chain

51.
    Transformations
    • map, flatMap, filter
    • groupByKey, reduceByKey, sortByKey
    • mapValues, distinct
    • join, union
    • sample

52.
    FlatMap explanation

53.
    Map explanation

54.
    ReduceByKey explanation

55.
    Actions
    • reduce
    • collect, first, take, foreach
    • count(), countByKey()
    • saveAsTextFile()
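The key difference: transformations are lazy and only build a recipe, while an action forces the whole chain to run. The same evaluation model can be shown on a plain Scala lazy view (an analogy, not the Spark API):

```scala
object LazySketch {
  var mapCalls = 0
  val data = Seq(1, 2, 3, 4)
  // "transformation": builds a recipe, computes nothing yet
  val recipe = data.view.map { x => mapCalls += 1; x * 2 }
  // "action": forces the computation and returns a result
  def run(): Seq[Int] = recipe.toList
}
```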

56.
    What’s the difference between
    PairRDD and usual RDD?

57.
    Pair RDD
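What the PairRDD operations actually compute, sketched on a plain Scala collection of (key, value) pairs (a minimal analogy, not the Spark API):

```scala
object PairSketch {
  // reduceByKey: combine all values that share a key
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

  // join: pair up values from two datasets that share a key
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))
}
```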

58.
    RDD Demo

59.
    PERSISTENCE

60.
    Caching in Spark
• Frequently used RDDs can be stored in memory
• One method and one shortcut: persist() and cache()
• SparkContext keeps track of cached RDDs
• Serialized or deserialized Java objects
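Why caching matters: without it, every action recomputes the whole lineage from scratch. A pure-Scala sketch of that difference (hypothetical names, not the Spark API):

```scala
object CacheSketch {
  var computations = 0
  private def expensiveLineage(): Seq[Int] = {
    computations += 1          // recomputed on every access
    (1 to 5).map(_ * 2)
  }
  // uncached: each "action" recomputes the lineage
  def uncachedCount(): Int = expensiveLineage().length
  // cached: computed once, then served from memory
  lazy val cached: Seq[Int] = expensiveLineage()
  def cachedCount(): Int = cached.length
}
```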

61.
    Full list of options
    • MEMORY_ONLY
    • MEMORY_AND_DISK
    • MEMORY_ONLY_SER
    • MEMORY_AND_DISK_SER
    • DISK_ONLY
    • MEMORY_ONLY_2, MEMORY_AND_DISK_2

62.
    Spark Core Storage Level
    • MEMORY_ONLY (default for Spark Core)
    • MEMORY_AND_DISK
    • MEMORY_ONLY_SER
    • MEMORY_AND_DISK_SER
    • DISK_ONLY
    • MEMORY_ONLY_2, MEMORY_AND_DISK_2

63.
    Spark Streaming Storage Level
    • MEMORY_ONLY (default for Spark Core)
    • MEMORY_AND_DISK
    • MEMORY_ONLY_SER (default for Spark Streaming)
    • MEMORY_AND_DISK_SER
    • DISK_ONLY
    • MEMORY_ONLY_2, MEMORY_AND_DISK_2

64.
    BUILDING

65.
    Development tools
• Console REPL ($SPARK_HOME/bin/spark-shell)

66.
    Development tools
• Console REPL ($SPARK_HOME/bin/spark-shell)
    • Apache Zeppelin

67.
    Run Zeppelin

68.
    Development tools
• Console REPL ($SPARK_HOME/bin/spark-shell)
    • Apache Zeppelin
    • IntelliJ IDEA Community + Scala Plugin

69.
    Development tools
• Console REPL ($SPARK_HOME/bin/spark-shell)
• Apache Zeppelin
• IntelliJ IDEA Community + Scala Plugin
• Don’t forget about SBT or adding Spark’s jars

70.
    SBT build
name := "Spark-app"
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0"

71.
    SBT build
name := "Spark-app"
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided"

72.
    SPARK ARCHITECTURE

73.
    YARN + Driver

74.
    Worker Nodes and Executors

75.
    Spark Application

76.
    Job Stages

77.
    DAG Scheduler
• Builds stages of tasks
• Submits them to the lower-level scheduler
• The lower-level scheduler schedules tasks based on data locality
• Resubmits failed stages if outputs are lost

78.
    Scheduler Optimizations

79.
    Task in Spark
• Unit of work executed in an executor thread
• Unlike MR, there is no "map" vs "reduce" task
• Each task applies a set of transformations to the same partition of the RDD
• Each task either partitions its output for a "shuffle", or sends the output back to the driver
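How a task decides which shuffle partition a record goes to: Spark's default hash partitioning takes the key's hashCode modulo the number of partitions. A self-contained sketch of that rule:

```scala
object ShuffleSketch {
  // Map a key to a shuffle partition index in [0, numPartitions)
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod // keep the index non-negative
  }
}
```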

80.
    CONFIGURING

81.
    Cluster Modes
    • Local mode

82.
    Cluster Modes
    • Local mode
    • Stand-alone mode

83.
    Cluster Modes
    • Local mode
    • Stand-alone mode
    • Yarn

84.
    Cluster Modes
    • Local mode
    • Stand-alone mode
    • Yarn
    • Mesos

85.
    Spark Master URL
    • local, local[n], local[*], local[K,F], local[*,F]
• spark://host:port or spark://host1:port1,host2:port2
• yarn (with client or cluster deploy mode)
    • mesos://host:port

86.
    SUBMIT

87.
    Submit
    ./bin/spark-submit \
    --class com.epam.SparkJob1 \
    --master spark://192.168.101.101:7077 \
    --executor-memory 2G \
    --total-executor-cores 10 \
/path/to/artifact.jar

88.
    A common deployment strategy
    is to submit your application from a gateway
    machine that is physically co-located with your
    worker machines

89.
    Submit
    ./bin/spark-submit \
    --class com.epam.SparkJob1 \
    --master mesos://192.168.101.101:7077 \
    --executor-memory 2G \
    --deploy-mode cluster \
    --total-executor-cores 10 \
/path/to/artifact.jar

90.
    STANDALONE CLUSTER

91.
    Start
    master
    ./sbin/start-master.sh
    spark://192.168.101.101:7077

92.
    Start
slave ./sbin/start-slave.sh spark://192.168.101.101:7077

93.
    Standalone Cluster Architecture

94.
    Standalone Cluster Architecture with Resources

95.
    EC2 Scripts for Spark 2.2
    https://github.com/amplab/spark-ec2

96.
    MONITORING

97.
    Start
    history-
    server
./sbin/start-history-server.sh
    open http://192.168.101.101:18080

98.
    Open web UI and enjoy

99.
    Every Spark application launches a web UI
    • A list of scheduler stages and tasks
    • A summary of RDD sizes and memory usage
    • Environmental information
    • Information about the running executors

100.
    Spark Training: Act 2
    Alexey Zinovyev, Java/BigData Trainer in EPAM

101.
    RDD INTERNALS

102.
    Do it
    parallel

103.
    I’d like
    NARROW

104.
    Map, filter, filter

105.
    GroupByKey, join

106.
    Is it a graph with tasks and dependencies?

107.
    RDD Lineage is …
    (aka RDD operator graph or RDD dependency graph)
a graph of all the parent RDDs of an RDD.

108.
    I’d like
    NARROW

109.
    toDebugString prints …
    The execution DAG or physical execution plan is
    the DAG of stages.

110.
    spark
    .logLineage
    $ ./bin/spark-shell --conf spark.logLineage=true
    scala> sc.textFile("README.md", 4).count
    ...
15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25
15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies:
(4) MapPartitionsRDD[1] at textFile at <console>:25 []
 |  README.md HadoopRDD[0] at textFile at <console>:25 []

111.
    Partitions
    Demo

112.
    Spark
    Family

113.
    SCHEMA + RDD

114.
    Data sources and formats

115.
    New RDD
    for each
    case

116.
    Define schema for data to extract with SQL

117.
    Case class for RDD
    User (height: Int (not null), name: String, age: Int)

118.
    Let’s think about tables

119.
    The main concept
    DataFrames are composed of Row objects, along
    with a schema that describes the data types of each
    column in the row

120.
    RDD->DF
    val usersRdd = sqlContext
    .jsonFile("hdfs://localhost:9000/users.json")
    val df = usersRdd.toDF()
    val newRDD = df.rdd
    df.show()

121.
    DATAFRAMES

122.
    DataFrame’s nature
• Like an RDD with a schema, but it’s not an RDD anymore
• Distributed collection of data grouped into named columns
• A domain-specific API designed for common tasks on structured data
• Available in Python, Scala, Java, and R (via SparkR)
• Evolved from SchemaRDD

123.
    DataFrame as SQL
    • Selecting columns and filtering
    • Joining different data sources
    • Aggregation (count, sum, average, etc)
    • Plotting results with Pandas (with PySpark)
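What the selection, filtering, and aggregation above boil down to, sketched on a plain Scala collection of case-class rows (hypothetical data, not the Spark API):

```scala
object DfSketch {
  case class User(name: String, country: String, age: Int)
  val users = Seq(User("Ann", "US", 30), User("Bob", "US", 40), User("Eva", "DE", 25))

  // selecting a column and filtering
  val adultNames: Seq[String] = users.filter(_.age >= 30).map(_.name)

  // aggregation: average age per country
  val avgAgeByCountry: Map[String, Double] =
    users.groupBy(_.country).map { case (c, us) => c -> us.map(_.age).sum.toDouble / us.size }
}
```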

124.
    Input & Output

125.
    Input & Output

126.
    Custom Data Sources

127.
    DataFrames
    Demo

128.
    SPARK SQL

129.
    Run SQL
val df = spark.read.json("/home/users.json")
df.createOrReplaceTempView("users")
val sqlDF = spark.sql("SELECT name FROM users")
sqlDF.show()

130.
    Spark SQL advantages
    • Spark SQL allows relational queries expressed in SQL,
    HiveQL, or Scala to be executed using Spark
    • Unifies Stack with Spark Core, Spark Streaming etc.
    • Hive compatibility
    • Standard connectivity (JDBC, ODBC)

131.
    Spark SQL
    Demo

132.
    HIVE INTEGRATION

133.
    Hive Support

134.
    If you have a Hive in Spark application
    • Support for writing queries in HQL
    • Catalog info from Hive MetaStore
    • Tablescan operator that uses Hive SerDes
    • Wrappers for Hive UDFs, UDAFs, UDTFs

135.
    Hive
val hive = new HiveContext(spark)
hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.hql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")
val results = hive.hql("FROM src SELECT key, value").collect()

136.
    Hive
val hive = new HiveContext(spark)
hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.hql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")
val results = hive.hql("FROM src SELECT key, value").collect()

137.
    Hive
val hive = new HiveContext(spark)
hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.hql("LOAD DATA LOCAL INPATH '…/WarAndPeace.txt' INTO TABLE src")
val results = hive.hql("FROM src SELECT key, value").collect()

138.
    Hive
val hive = new HiveContext(spark)
hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.hql("LOAD DATA LOCAL INPATH '…/WarAndPeace.txt' INTO TABLE src")
val results = hive.hql("FROM src SELECT key, value").collect()

139.
    How to cache in memory?

140.
    Easy to cache
sql.cacheTable("people")

141.
    The main problem of this approach

142.
    THORNY PATH TO
    DATASETS

143.
    History of Spark APIs

144.
    RDD
    rdd.filter(_.age > 21) // RDD
    df.filter("age > 21") // DataFrame SQL-style
    df.filter(df.col("age").gt(21)) // Expression style
    dataset.filter(_.age < 21); // Dataset API

145.
    History of Spark APIs

146.
    SQL
    rdd.filter(_.age > 21) // RDD
    df.filter("age > 21") // DataFrame SQL-style
    df.filter(df.col("age").gt(21)) // Expression style
    dataset.filter(_.age < 21); // Dataset API

147.
    Expression
    rdd.filter(_.age > 21) // RDD
    df.filter("age > 21") // DataFrame SQL-style
    df.filter(df.col("age").gt(21)) // Expression style
    dataset.filter(_.age < 21); // Dataset API

148.
    History of Spark APIs

149.
    DataSet
    rdd.filter(_.age > 21) // RDD
    df.filter("age > 21") // DataFrame SQL-style
    df.filter(df.col("age").gt(21)) // Expression style
    dataset.filter(_.age < 21); // Dataset API

150.
    DataSet = RDD’s types + DataFrame’s Catalyst
    • RDD API
    • compile-time type-safety
    • off-heap storage mechanism
    • performance benefits of the Catalyst query optimizer
    • Tungsten

151.
    DataSet = RDD’s types + DataFrame’s Catalyst
    • RDD API
    • compile-time type-safety
    • off-heap storage mechanism
    • performance benefits of the Catalyst query optimizer
    • Tungsten

152.
    Structured APIs in SPARK

153.
    Unified API in Spark 2.0
DataFrame = Dataset[Row]
A DataFrame is now an untyped Dataset of generic Row objects

154.
    Define
    case class
    case class User(email: String, footSize: Long, name: String)
    // DataFrame -> DataSet with Users
    val userDS =
    spark.read.json("/home/tmp/datasets/users.json").as[User]
    userDS.map(_.name).collect()
    userDS.filter(_.footSize > 38).collect()
userDS.rdd // IF YOU REALLY WANT

155.
    Read JSON
    case class User(email: String, footSize: Long, name: String)
    // DataFrame -> DataSet with Users
    val userDS =
    spark.read.json("/home/tmp/datasets/users.json").as[User]
    userDS.map(_.name).collect()
    userDS.filter(_.footSize > 38).collect()
userDS.rdd // IF YOU REALLY WANT

156.
    Filter by
    Field
    case class User(email: String, footSize: Long, name: String)
    // DataFrame -> DataSet with Users
    val userDS =
    spark.read.json("/home/tmp/datasets/users.json").as[User]
    userDS.map(_.name).collect()
    userDS.filter(_.footSize > 38).collect()
userDS.rdd // IF YOU REALLY WANT

157.
    DataSet API
    Demo

158.
    Spark
    Family

159.
    CATALYST OPTIMIZER

160.
    Job Stages in Spark

161.
    Scheduler Optimizations

162.
    What’s faster: SQL or DataSet API?

163.
    Unified Logical Plan

164.
    SQL String -> Execution

165.
    Catalyst Optimizer for DataFrames

166.
    Bytecode

167.
    How optimizer works

168.
    DataSet.explain()
    == Physical Plan ==
    Project [avg(price)#43,carat#45]
    +- SortMergeJoin [color#21], [color#47]
    :- Sort [color#21 ASC], false, 0
    : +- TungstenExchange hashpartitioning(color#21,200), None
    : +- Project [avg(price)#43,color#21]
    : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as
    bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
    : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
    : +- TungstenAggregate(key=[cut#20,color#21],
    functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)],
    output=[cut#20,color#21,sum#58,count#59L])
    : +- Scan CsvRelation(-----)
    +- Sort [color#47 ASC], false, 0
    +- TungstenExchange hashpartitioning(color#47,200), None
    +- ConvertToUnsafe
    +- Scan CsvRelation(----)

169.
    Why does explain() show
    so many Tungsten things?

170.
    Tungsten’s goal
    Push performance closer to the limits of modern
    hardware

171.
    How to be effective with CPU
    • Runtime code generation (Whole Stage Code Generation)
• Cache locality
    • Off-heap memory management

172.
    Cache Locality

173.
    Whole-Stage CodeGen

174.
    Tungsten
    Power

175.
    SERIALIZATION

176.
    Issue: Spark uses Java serialization A LOT

177.
    Two choices to distribute data across cluster
    • Java serialization
    By default with ObjectOutputStream
    • Kryo serialization
Requires registering classes (does not rely on Serializable)
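The overhead of Java serialization is easy to see on the JVM itself: ObjectOutputStream writes the class metadata alongside the values, so even a tiny object costs far more bytes than its payload. A self-contained sketch:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// 8 bytes of actual payload (two Ints); case classes are Serializable
case class Point(x: Int, y: Int)

object SerSketch {
  // Count how many bytes Java serialization produces for an object
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }
}
```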

178.
    The main problem: overhead of serializing
    Each serialized object contains the class structure as
    well as the values

179.
    The main problem: overhead of serializing
    Each serialized object contains the class structure as
    well as the values
    Don’t forget about GC

180.
    Tungsten Compact Encoding

181.
    Maybe something UNSAFE?

182.
    UnsafeRowFormat
    • Bit set for tracking null values
    • Small values are inlined
• Variable-length values store a relative offset into the variable-length data section
    • Rows are always 8-byte word aligned
    • Equality comparison and hashing can be performed on raw
    bytes without requiring additional interpretation
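The 8-byte word-alignment rule above is just arithmetic: round every length up to the next multiple of 8 so each value starts on a word boundary. A one-line sketch:

```scala
object AlignSketch {
  // Round a byte count up to the next multiple of 8
  def align8(numBytes: Int): Int = (numBytes + 7) & ~7
}
```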

183.
    Encoder’s concept
    Generate bytecode to interact with off-heap
    &
    Give access to attributes without ser/deser

184.
    Encoders

185.
    No custom encoders

186.
    PERFORMANCE

187.
    How to measure Spark performance?

188.
You should measure performance!

189.
    TPCDS
    99 Queries
    http://bit.ly/2dObMsH

190.

191.
    How to benchmark Spark

192.
    Special Tool from Databricks
    Benchmark Tool for SparkSQL
    https://github.com/databricks/spark-sql-perf

193.
    Spark 2 vs Spark 1.6

194.
    MEMORY MANAGEMENT

195.
    Can I influence
Memory Management in Spark?

196.
Should I tune the GC generations?

197.
    Cached
    Data

198.
    During
    operations

199.
    For your
    needs

200.
    For Dark
    Lord

201.
    IN CONCLUSION

202.
    Contacts
    E-mail : [email protected]
    Twitter : @zaleslaw @BigDataRussia
    vk.com/big_data_russia Big Data Russia
    + Telegram @bigdatarussia
    vk.com/java_jvm Java & JVM langs
    + Telegram @javajvmlangs

203.
    Github
    Spark Tutorial: Core, Streaming, Machine Learning
    https://github.com/zaleslaw/Spark-Tutorial

204.
    Gitbook
Data Processing with Spark 2.2 and Kafka 0.10
    www.gitbook.com/book/zaleslaw/data-processing-book

205.
    Any questions?
