
Apache Spark Training [Spark Core, Spark SQL, Tungsten]

All code examples are available here https://github.com/zaleslaw/Spark-Tutorial

A GitBook with notes for this training can be found here:
https://www.gitbook.com/book/zaleslaw/data-processing-book/details

Here you can find the slides from my open Apache Spark training and links to the video tutorial (in Russian).

Alexey Zinoviev

September 11, 2017

Transcript

  1. About • With IT since 2007 • With Java since 2009 •

    With Hadoop since 2012 • With Spark since 2014 • With EPAM since 2015
  2. 3 Training from Zinoviev Alexey Contacts E-mail : [email protected] Twitter

    : @zaleslaw @BigDataRussia vk.com/big_data_russia Big Data Russia + Telegram @bigdatarussia vk.com/java_jvm Java & JVM langs + Telegram @javajvmlangs
  3. 4 Training from Zinoviev Alexey Github Spark Tutorial: Core, Streaming,

    Machine Learning https://github.com/zaleslaw/Spark-Tutorial
  4. 5 Training from Zinoviev Alexey Gitbook: Data processing with Spark

    2.2 and Kafka 0.10 www.gitbook.com/book/zaleslaw/data-processing-book
  5. 13 Training from Zinoviev Alexey It’s hard to … •

    .. store • .. handle • .. search in • .. visualize • .. send over the network
  6. 30 Training from Zinoviev Alexey Advantages • native Python, Scala,

    R interface • interactive shells • in-memory caching of data, specified by the user • > 80 highly efficient distributed operations, any combination of them • capable of reusing Hadoop ecosystem, e.g. HDFS, YARN
  7. 36 Training from Zinoviev Alexey Say R.. say D..

    Say D again • Dataset • Distributed • Resilient
  8. 40 Training from Zinoviev Alexey Loading val localData = Seq(5, 7, 1, 12, 10, 25)

    val ourFirstRDD = sc.parallelize(localData) val textFile = sc.textFile("hdfs://...")
  9. 41 Training from Zinoviev Alexey Loading val localData = Seq(5, 7, 1, 12, 10, 25)

    val ourFirstRDD = sc.parallelize(localData) val textFile = sc.textFile("hdfs://...")
  10. 42 Training from Zinoviev Alexey Loading val localData = Seq(5, 7, 1, 12, 10, 25)

    val ourFirstRDD = sc.parallelize(localData) // from file val textFile = sc.textFile("hdfs://...")
  11. 43 Training from Zinoviev Alexey Loading // Wildcards, running on

    directories, text and archives sc.textFile("/my/directory") sc.textFile("/my/directory/*.txt") sc.textFile("/my/directory/*.gz") // Read directory and return as filename/content pairs sc.wholeTextFiles // Read sequence file sc.sequenceFile[TKey, TValue] // Takes an arbitrary JobConf and InputFormat class sc.hadoopRDD sc.newAPIHadoopRDD // SerDe rdd.saveAsObjectFile sc.objectFile
  12. 46 Training from Zinoviev Alexey Word Count val textFile =

    sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  13. 47 Training from Zinoviev Alexey Word Count val textFile =

    sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  14. 48 Training from Zinoviev Alexey Word Count val textFile =

    sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  15. 51 Training from Zinoviev Alexey Transformations • map, flatMap, filter

    • groupByKey, reduceByKey, sortByKey • mapValues, distinct • join, union • sample
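A minimal sketch (not from the original deck) of chaining a few of these transformations; it assumes an existing SparkContext sc and a small, made-up key/value dataset. Nothing runs yet: transformations only build the lineage.

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val result = pairs
      .filter { case (_, v) => v > 1 }   // keep values greater than 1
      .mapValues(_ * 10)                 // multiply each value by 10
      .reduceByKey(_ + _)                // sum values per key (still lazy)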
  16. 55 Training from Zinoviev Alexey Actions • reduce • collect,

    first, take, foreach • count(), countByKey() • saveAsTextFile()
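Continuing the sketch above, actions are what force the lazy lineage to execute; the output path below is only illustrative.

    result.count()                               // number of records
    result.take(2)                               // first two records, returned to the driver
    result.collect()                             // all records to the driver (careful with large data)
    result.saveAsTextFile("hdfs://.../output")   // write the RDD out as text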
  17. 60 Training from Zinoviev Alexey Caching in Spark • Frequently

    used RDDs can be stored in memory • One method, one shortcut: persist(), cache() • SparkContext keeps track of cached RDDs • Stored as serialized or deserialized Java objects
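A small sketch of the cache()/persist() shortcut mentioned above, assuming a SparkContext sc and a hypothetical log file path:

    val lines = sc.textFile("hdfs://.../access.log")
    val errors = lines.filter(_.contains("ERROR"))
    errors.cache()       // same as errors.persist(), default storage level MEMORY_ONLY
    errors.count()       // first action materializes the RDD and fills the cache
    errors.count()       // second action is served from memory, no re-read of the file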
  18. 61 Training from Zinoviev Alexey Full list of options •

    MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  19. 62 Training from Zinoviev Alexey Spark Core Storage Level •

    MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  20. 63 Training from Zinoviev Alexey Spark Streaming Storage Level •

    MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER (default for Spark Streaming) • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
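To pick one of the levels listed above explicitly, persist() accepts a StorageLevel. A minimal sketch, reusing the hypothetical errors RDD from the caching example:

    import org.apache.spark.storage.StorageLevel

    errors.persist(StorageLevel.MEMORY_AND_DISK_SER)  // keep serialized blocks, spill to disk if memory is tight
    // errors.unpersist() removes the RDD from the cache when it is no longer needed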
  21. 66 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin
  22. 68 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin • IntelliJ IDEA Community + Scala Plugin
  23. 69 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin • IntelliJ IDEA Community + Scala Plugin • Don’t forget about SBT or adding Spark’s jars
  24. 70 Training from Zinoviev Alexey SBT build name := "Spark-app"

    version := "1.0" scalaVersion := "2.11.11" libraryDependencies += "org.apache.spark" % "spark- core_2.11" % "2.2.0" libraryDependencies += "org.apache.spark" % "spark- sql_2.11" % "2.2.0"
  25. 71 Training from Zinoviev Alexey SBT build name := "Spark-app"

    version := "1.0" scalaVersion := "2.11.11" libraryDependencies += "org.apache.spark" % "spark- core_2.11" % "2.2.0“ % "provided" libraryDependencies += "org.apache.spark" % "spark- sql_2.11" % "2.2.0“ % "provided"
  26. 77 Training from Zinoviev Alexey DAG Scheduler • Builds stages

    of tasks • Submits them to the lower-level scheduler • The lower-level scheduler schedules tasks based on data locality • Resubmits failed stages if outputs are lost
  27. 79 Training from Zinoviev Alexey Task in Spark • A unit

    of work executed in an executor thread • Unlike MR, there is no "map" vs "reduce" task • Each task applies a set of transformations to the same partition of the RDD • Each task either partitions its output for a "shuffle" or sends the output back to the driver
  28. 84 Training from Zinoviev Alexey Cluster Modes • Local mode

    • Standalone mode • YARN • Mesos
  29. 85 Training from Zinoviev Alexey Spark Master URL • local,

    local[n], local[*], local[K,F], local[*,F] • spark://host:port or spark://host1:port, host2:port • yarn-client or yarn-cluster • mesos://host:port
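A hedged sketch of how a master URL is typically wired into an application through SparkConf (the app name and master value here are only illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("Spark-app")
      .setMaster("local[*]")   // or e.g. "spark://host:7077", "yarn", "mesos://host:port"
    val sc = new SparkContext(conf)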
  30. 87 Training from Zinoviev Alexey Submit ./bin/spark-submit \ --class com.epam.SparkJob1

    \ --master spark://192.168.101.101:7077 \ --executor-memory 2G \ --total-executor-cores 10 \ /path/to/artifact.jar \
  31. 88 Training from Zinoviev Alexey A common deployment strategy is

    to submit your application from a gateway machine that is physically co-located with your worker machines
  32. 89 Training from Zinoviev Alexey Submit ./bin/spark-submit \ --class com.epam.SparkJob1

    \ --master mesos://192.168.101.101:7077 \ --executor-memory 2G \ --deploy-mode cluster \ --total-executor-cores 10 \ /path/to/artifact.jar \
  33. 99 Training from Zinoviev Alexey Every Spark application launches a

    web UI • A list of scheduler stages and tasks • A summary of RDD sizes and memory usage • Environmental information • Information about the running executors
  34. 107 Training from Zinoviev Alexey RDD Lineage is … (aka

    RDD operator graph or RDD dependency graph) a graph of all the parent RDDs of an RDD.
  35. 109 Training from Zinoviev Alexey toDebugString prints … The execution

    DAG or physical execution plan is the DAG of stages.
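A minimal sketch of inspecting lineage from the shell with toDebugString; the commented output is indicative only and assumes a README.md in the working directory:

    val rdd = sc.textFile("README.md", 4)
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    println(rdd.toDebugString)
    // (4) ShuffledRDD[4] at reduceByKey ...
    //  +-(4) MapPartitionsRDD[3] at map ...
    //     |  MapPartitionsRDD[2] at flatMap ...
    //     |  README.md MapPartitionsRDD[1] at textFile ...
    //     |  README.md HadoopRDD[0] at textFile ...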
  36. 110 Training from Zinoviev Alexey spark.logLineage $ ./bin/spark-shell --conf

    spark.logLineage=true scala> sc.textFile("README.md", 4).count ... 15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25 15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies: (4) MapPartitionsRDD[1] at textFile at <console>:25 [] | README.md HadoopRDD[0] at textFile at <console>:25 []
  37. 117 Training from Zinoviev Alexey Case class for RDD User

    (height: Int (not null), name: String, age: Int)
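As a sketch, the case class from the slide could look like this in Scala (height is non-nullable simply because Int is a primitive); the sample values are made up, and an RDD of such objects is built with sc.parallelize:

    case class User(height: Int, name: String, age: Int)

    val users = sc.parallelize(Seq(
      User(180, "Alice", 30),
      User(175, "Bob", 25)
    ))
    val adults = users.filter(_.age > 21)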
  38. 119 Training from Zinoviev Alexey The main concept DataFrames are

    composed of Row objects, along with a schema that describes the data types of each column in the row
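A small illustration of the Row-plus-schema view, assuming a SparkSession named spark and the users.json path used on the next slide:

    val df = spark.read.json("hdfs://localhost:9000/users.json")
    df.printSchema()                          // the schema describes each column's type
    val firstRow = df.first()                 // a generic Row object
    val name = firstRow.getAs[String]("name") // access a column by name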
  39. 120 Training from Zinoviev Alexey RDD->DF val usersRdd = sqlContext

    .jsonFile("hdfs://localhost:9000/users.json") val df = usersRdd.toDF() val newRDD = df.rdd df.show()
  40. 122 Training from Zinoviev Alexey DataFrame’s nature • Like RDD

    with a schema, but it is not an RDD anymore • Distributed collection of data grouped into named columns • Domain-specific API designed for common tasks on structured data • Available in Python, Scala, Java, and R (via SparkR) • Evolved from SchemaRDD
  41. 123 Training from Zinoviev Alexey DataFrame as SQL • Selecting

    columns and filtering • Joining different data sources • Aggregation (count, sum, average, etc) • Plotting results with Pandas (with PySpark)
  42. 129 Training from Zinoviev Alexey Run SQL val df =

    spark.read.json("/home/users.json") df.createOrReplaceTempView("users") val sqlDF = spark.sql("SELECT name FROM users") sqlDF.show()
  43. 130 Training from Zinoviev Alexey Spark SQL advantages • Spark

    SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark • Unifies Stack with Spark Core, Spark Streaming etc. • Hive compatibility • Standard connectivity (JDBC, ODBC)
  44. 134 Training from Zinoviev Alexey If you have a Hive

    in Spark application • Support for writing queries in HQL • Catalog info from Hive MetaStore • Tablescan operator that uses Hive SerDes • Wrappers for Hive UDFs, UDAFs, UDTFs
  45. 135 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(sc) hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.sql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src") val results = hive.sql("FROM src SELECT key, value").collect()
  46. 136 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(sc) hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.sql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src") val results = hive.sql("FROM src SELECT key, value").collect()
  47. 137 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(sc) hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.sql("LOAD DATA LOCAL INPATH '…/WarAndPeace.txt' INTO TABLE src") val results = hive.sql("FROM src SELECT key, value").collect()
  48. 138 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(sc) hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.sql("LOAD DATA LOCAL INPATH '…/WarAndPeace.txt' INTO TABLE src") val results = hive.sql("FROM src SELECT key, value").collect()
  49. 144 Training from Zinoviev Alexey RDD rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  50. 146 Training from Zinoviev Alexey SQL rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  51. 147 Training from Zinoviev Alexey Expression rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  52. 149 Training from Zinoviev Alexey DataSet rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  53. 150 Training from Zinoviev Alexey DataSet = RDD’s types +

    DataFrame’s Catalyst • RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  54. 151 Training from Zinoviev Alexey DataSet = RDD’s types +

    DataFrame’s Catalyst • RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  55. 153 Training from Zinoviev Alexey Unified API in Spark 2.0

    DataFrame = Dataset[Row] A DataFrame is now an untyped Dataset of Row objects
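A minimal sketch of the unified API, assuming a SparkSession named spark and the User case class defined on the following slides:

    import spark.implicits._
    import org.apache.spark.sql.{DataFrame, Dataset}

    val df: DataFrame = spark.read.json("/home/tmp/datasets/users.json")  // untyped: Dataset[Row]
    val ds: Dataset[User] = df.as[User]                                   // typed view over the same data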
  56. 154 Training from Zinoviev Alexey Define case class case class

    User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
  57. 155 Training from Zinoviev Alexey Read JSON case class User(email:

    String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
  58. 156 Training from Zinoviev Alexey Filter by Field case class

    User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
  59. 168 Training from Zinoviev Alexey DataSet.explain() == Physical Plan ==

    Project [avg(price)#43,carat#45] +- SortMergeJoin [color#21], [color#47] :- Sort [color#21 ASC], false, 0 : +- TungstenExchange hashpartitioning(color#21,200), None : +- Project [avg(price)#43,color#21] : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43]) : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L]) : +- Scan CsvRelation(-----) +- Sort [color#47 ASC], false, 0 +- TungstenExchange hashpartitioning(color#47,200), None +- ConvertToUnsafe +- Scan CsvRelation(----)
  60. 171 Training from Zinoviev Alexey How to be effective with

    CPU • Runtime code generation (Whole Stage Code Generation) • Cache locality • Off-heap memory management
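Whole Stage Code Generation can be observed in the physical plan: operators fused into generated code are marked with an asterisk in explain() output. A hedged sketch (the exact plan text varies by Spark version), assuming a SparkSession named spark:

    val df = spark.range(1000).selectExpr("sum(id)")
    df.explain()
    // == Physical Plan ==
    // *HashAggregate(keys=[], functions=[sum(id)])        <- '*' marks whole-stage-codegen operators
    // +- Exchange SinglePartition
    //    +- *HashAggregate(keys=[], functions=[partial_sum(id)])
    //       +- *Range (0, 1000, step=1, ...)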
  61. 177 Training from Zinoviev Alexey Two choices to distribute data

    across the cluster • Java serialization: the default, uses ObjectOutputStream • Kryo serialization: classes should be registered (does not support all Serializable types)
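A sketch of switching to Kryo and registering classes; the User class here is illustrative (any of your own classes would be registered the same way):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("Spark-app")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[User], classOf[Array[User]]))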
  62. 178 Training from Zinoviev Alexey The main problem: overhead of

    serializing Each serialized object contains the class structure as well as the values
  63. 179 Training from Zinoviev Alexey The main problem: overhead of

    serializing Each serialized object contains the class structure as well as the values Don’t forget about GC
  64. 182 Training from Zinoviev Alexey UnsafeRowFormat • Bit set for

    tracking null values • Small values are inlined • Variable-length values store a relative offset into the variable-length data section • Rows are always 8-byte word aligned • Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation
  65. 183 Training from Zinoviev Alexey Encoder’s concept Generate bytecode to

    interact with off-heap memory & give access to attributes without ser/deser
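A brief sketch of encoders in user code, assuming the User case class from the earlier slides; product encoders come either implicitly from spark.implicits._ or explicitly from the Encoders factory:

    import org.apache.spark.sql.{Encoder, Encoders}

    val userEncoder: Encoder[User] = Encoders.product[User]   // generates the off-heap ser/deser code
    val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User](userEncoder)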
  66. 192 Training from Zinoviev Alexey Special Tool from Databricks Benchmark

    Tool for SparkSQL https://github.com/databricks/spark-sql-perf
  67. 202 Training from Zinoviev Alexey Contacts E-mail : [email protected] Twitter

    : @zaleslaw @BigDataRussia vk.com/big_data_russia Big Data Russia + Telegram @bigdatarussia vk.com/java_jvm Java & JVM langs + Telegram @javajvmlangs
  68. 203 Training from Zinoviev Alexey Github Spark Tutorial: Core, Streaming,

    Machine Learning https://github.com/zaleslaw/Spark-Tutorial
  69. 204 Training from Zinoviev Alexey Gitbook: Data processing with Spark

    2.2 and Kafka 0.10 www.gitbook.com/book/zaleslaw/data-processing-book