Apache Spark Training [Spark Core, Spark SQL, Tungsten]

All code examples are available here https://github.com/zaleslaw/Spark-Tutorial

A Gitbook with notes for this training can be found here:
https://www.gitbook.com/book/zaleslaw/data-processing-book/details

Here you can find the slides for my open Apache Spark training, along with links to the video tutorials (in Russian).

Alexey Zinoviev

September 11, 2017
Transcript

  1. Spark Training Alexey Zinovyev, Java/BigData Trainer in EPAM

  2. About With IT since 2007 With Java since 2009 With

    Hadoop since 2012 With Spark since 2014 With EPAM since 2015
  3. 3 Training from Zinoviev Alexey Contacts E-mail : Alexey_Zinovyev@epam.com Twitter

    : @zaleslaw @BigDataRussia vk.com/big_data_russia Big Data Russia + Telegram @bigdatarussia vk.com/java_jvm Java & JVM langs + Telegram @javajvmlangs
  4. 4 Training from Zinoviev Alexey Github Spark Tutorial: Core, Streaming,

    Machine Learning https://github.com/zaleslaw/Spark-Tutorial
  5. 5 Training from Zinoviev Alexey Gitbook Data processing with Spark

    2.2 and Kafka 0.10 (in Russian) www.gitbook.com/book/zaleslaw/data-processing-book
  6. 6 Training from Zinoviev Alexey Spark Family

  7. 7 Training from Zinoviev Alexey Spark Family

  8. 8 Training from Zinoviev Alexey WHAT IS BIG DATA?

  9. 9 Training from Zinoviev Alexey Joke about Excel

  10. 10 Training from Zinoviev Alexey Every 60 seconds…

  11. 11 Training from Zinoviev Alexey Is BigData about PBs?

  12. 12 Training from Zinoviev Alexey Is BigData about PBs?

  13. 13 Training from Zinoviev Alexey It’s hard to … •

    .. store • .. handle • .. search in • .. visualize • .. send over the network
  14. 14 Training from Zinoviev Alexey How to handle all this

    stuff?
  15. 15 Training from Zinoviev Alexey Just do it … in

    parallel
  16. 16 Training from Zinoviev Alexey Parallel Computing vs Distributed Computing

  17. 17 Training from Zinoviev Alexey Modern Java in 2016 Big

    Data in 2014
  18. 18 Training from Zinoviev Alexey Batch jobs produce reports. More

    and more..
  19. 19 Training from Zinoviev Alexey But customer can wait forever

    (ok, 1h)
  20. 20 Training from Zinoviev Alexey Hadoop Architecture

  21. 21 Training from Zinoviev Alexey HDFS Architecture

  22. 22 Training from Zinoviev Alexey Daemons in YARN

  23. 23 Training from Zinoviev Alexey Different scheduling algorithms

  24. 24 Training from Zinoviev Alexey Hive Data Model

  25. 25 Training from Zinoviev Alexey Machine Learning EVERYWHERE

  26. 26 Training from Zinoviev Alexey Data Lake in promotional brochure

  27. 27 Training from Zinoviev Alexey Data Lake in production

  28. 28 Training from Zinoviev Alexey Simple Flow in Reporting/BI systems

  29. 29 Training from Zinoviev Alexey WHY SHOULD WE USE SPARK?

  30. 30 Training from Zinoviev Alexey Advantages • native Python, Scala,

    R interface • interactive shells • in-memory caching of data, specified by the user • > 80 highly efficient distributed operations, any combination of them • capable of reusing Hadoop ecosystem, e.g. HDFS, YARN
  31. 31 Training from Zinoviev Alexey MapReduce vs Spark

  32. 32 Training from Zinoviev Alexey MapReduce vs Spark

  33. 33 Training from Zinoviev Alexey MapReduce vs Spark

  34. 34 Training from Zinoviev Alexey Let’s use Spark. It’s fast!

  35. 35 Training from Zinoviev Alexey SPARK INTRO

  36. 36 Training from Zinoviev Alexey Give me an R.. give me a D..

    give me another D • Dataset • Distributed • Resilient
  37. 37 Training from Zinoviev Alexey Single Thread collection

  38. 38 Training from Zinoviev Alexey No perf issues, right?

  39. 39 Training from Zinoviev Alexey The main concept more partitions

    = more parallelism
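The concept above can be sketched directly in the spark-shell. A minimal sketch, assuming a running SparkContext `sc` (as in the later slides); the sample values mirror the ones used on the "Loading" slides:

```scala
// Assumes a running SparkContext `sc` (e.g. inside spark-shell).
val data = Seq(5, 7, 1, 12, 10, 25)

// Let Spark choose the partition count (spark.default.parallelism)...
val rdd = sc.parallelize(data)
println(rdd.getNumPartitions)

// ...or request it explicitly: more partitions = more parallelism
val rdd4 = sc.parallelize(data, 4)
println(rdd4.getNumPartitions)               // 4

// Existing RDDs can be reshaped with repartition / coalesce
println(rdd4.coalesce(2).getNumPartitions)   // 2
```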
  40. 40 Training from Zinoviev Alexey Loading val localData = Seq(5,7,1,12,10,25)

    val ourFirstRDD = sc.parallelize(localData) val textFile = sc.textFile("hdfs://...")
  41. 41 Training from Zinoviev Alexey Loading val localData = Seq(5,7,1,12,10,25)

    val ourFirstRDD = sc.parallelize(localData) val textFile = sc.textFile("hdfs://...")
  42. 42 Training from Zinoviev Alexey Loading val localData = Seq(5,7,1,12,10,25)

    val ourFirstRDD = sc.parallelize(localData) // from file val textFile = sc.textFile("hdfs://...")
  43. 43 Training from Zinoviev Alexey Loading // Wildcards, running on

    directories, text and archives sc.textFile("/my/directory") sc.textFile("/my/directory/*.txt") sc.textFile("/my/directory/*.gz") // Read directory and return as filename/content pairs sc.wholeTextFiles // Read sequence file sc.sequenceFile[TKey, TValue] // Takes an arbitrary JobConf and InputFormat class sc.hadoopRDD sc.newAPIHadoopRDD // SerDe rdd.saveAsObjectFile sc.objectFile
  44. 44 Training from Zinoviev Alexey Spark Context

  45. 45 Training from Zinoviev Alexey RDD OPERATIONS

  46. 46 Training from Zinoviev Alexey Word Count val textFile =

    sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  47. 47 Training from Zinoviev Alexey Word Count val textFile =

    sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  48. 48 Training from Zinoviev Alexey Word Count val textFile =

    sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  49. 49 Training from Zinoviev Alexey What’s the difference between actions

    and transformations?
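The question on this slide fits in one snippet: transformations are lazy and only build the lineage, while actions trigger actual computation. A sketch assuming a running SparkContext `sc`:

```scala
// Assumes a running SparkContext `sc`.
val numbers = sc.parallelize(Seq(5, 7, 1, 12, 10, 25))

// Transformations are lazy: nothing runs on the cluster yet,
// Spark only records the lineage of the new RDDs.
val doubled = numbers.map(_ * 2)
val big     = doubled.filter(_ > 10)

// Actions force execution and return a result to the driver.
println(big.count())                    // 4
println(big.collect().mkString(", "))   // 14, 24, 20, 50
```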
  50. 50 Training from Zinoviev Alexey RDD Chain

  51. 51 Training from Zinoviev Alexey Transformations • map, flatMap, filter

    • groupByKey, reduceByKey, sortByKey • mapValues, distinct • join, union • sample
  52. 52 Training from Zinoviev Alexey FlatMap explanation

  53. 53 Training from Zinoviev Alexey Map explanation

  54. 54 Training from Zinoviev Alexey ReduceByKey explanation

  55. 55 Training from Zinoviev Alexey Actions • reduce • collect,

    first, take, foreach • count(), countByKey() • saveAsTextFile()
  56. 56 Training from Zinoviev Alexey What’s the difference between PairRDD

    and usual RDD?
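Short answer to the slide's question: a pair RDD is simply an RDD of key/value tuples, which Spark implicitly enriches with by-key operations (reduceByKey, mapValues, join, …) that a plain RDD does not have. A sketch with illustrative data, assuming a running SparkContext `sc`:

```scala
// Assumes a running SparkContext `sc`.
// Any RDD of 2-tuples is implicitly enriched with PairRDDFunctions.
val sales = sc.parallelize(Seq(("apple", 2), ("pear", 5), ("apple", 3)))

val totals = sales.reduceByKey(_ + _)   // combine values per key
val scaled = totals.mapValues(_ * 10)   // transform values, keep keys
println(scaled.collect().toMap)         // e.g. Map(apple -> 50, pear -> 50)

// A plain RDD[Int] has no reduceByKey: there is no key to group by.
val plain = sc.parallelize(Seq(1, 2, 3))
println(plain.reduce(_ + _))            // 6
```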
  57. 57 Training from Zinoviev Alexey Pair RDD

  58. 58 Training from Zinoviev Alexey RDD Demo

  59. 59 Training from Zinoviev Alexey PERSISTENCE

  60. 60 Training from Zinoviev Alexey Caching in Spark • Frequently

    used RDDs can be stored in memory • One method, persist(), and one shortcut, cache() • SparkContext keeps track of cached RDDs • Stored as serialized or deserialized Java objects
  61. 61 Training from Zinoviev Alexey Full list of options •

    MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  62. 62 Training from Zinoviev Alexey Spark Core Storage Level •

    MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  63. 63 Training from Zinoviev Alexey Spark Streaming Storage Level •

    MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER (default for Spark Streaming) • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
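The levels listed above are passed to persist(); cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY). A sketch assuming a running SparkContext `sc` (the HDFS path is a placeholder, as on the slides):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes a running SparkContext `sc`.
val lines = sc.textFile("hdfs://...")

// cache() == persist(StorageLevel.MEMORY_ONLY); a storage level can be
// set only once per RDD, so pick the one you need up front:
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

// The RDD is materialized on the first action and reused by later ones:
// lines.count(); lines.filter(_.nonEmpty).count()

// Release the storage when the RDD is no longer needed:
lines.unpersist()
```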
  64. 64 Training from Zinoviev Alexey BUILDING

  65. 65 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell)
  66. 66 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin
  67. 67 Training from Zinoviev Alexey Run Zeppelin

  68. 68 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin • IntelliJ IDEA Community + Scala Plugin
  69. 69 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin • IntelliJ IDEA Community + Scala Plugin • Don’t forget about SBT or adding Spark’s jars
  70. 70 Training from Zinoviev Alexey SBT build name := "Spark-app"

    version := "1.0" scalaVersion := "2.11.11" libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0" libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0"
  71. 71 Training from Zinoviev Alexey SBT build name := "Spark-app"

    version := "1.0" scalaVersion := "2.11.11" libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0" % "provided" libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided"
  72. 72 Training from Zinoviev Alexey SPARK ARCHITECTURE

  73. 73 Training from Zinoviev Alexey YARN + Driver

  74. 74 Training from Zinoviev Alexey Worker Nodes and Executors

  75. 75 Training from Zinoviev Alexey Spark Application

  76. 76 Training from Zinoviev Alexey Job Stages

  77. 77 Training from Zinoviev Alexey DAG Scheduler • Builds stages

    of tasks • Submits them to the lower-level scheduler • The lower-level scheduler schedules tasks based on data locality • Resubmits failed stages if outputs are lost
  78. 78 Training from Zinoviev Alexey Scheduler Optimizations

  79. 79 Training from Zinoviev Alexey Task in Spark • Unit

    of work executed in an executor thread • Unlike MR, there is no "map" vs "reduce" task • Each task applies a set of transformations to the same partition of the RDD • Each task either partitions its output for a "shuffle", or sends the output back to the driver
  80. 80 Training from Zinoviev Alexey CONFIGURING

  81. 81 Training from Zinoviev Alexey Cluster Modes • Local mode

  82. 82 Training from Zinoviev Alexey Cluster Modes • Local mode

    • Stand-alone mode
  83. 83 Training from Zinoviev Alexey Cluster Modes • Local mode

    • Stand-alone mode • Yarn
  84. 84 Training from Zinoviev Alexey Cluster Modes • Local mode

    • Stand-alone mode • Yarn • Mesos
  85. 85 Training from Zinoviev Alexey Spark Master URL • local,

    local[n], local[*], local[K,F], local[*,F] • spark://host:port or spark://host1:port, host2:port • yarn-client or yarn-cluster • mesos://host:port
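Any of the master URLs above can also be set programmatically when building the context. A minimal sketch (the app name is arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local[*] = one executor thread per local core; swap in
// spark://host:port, yarn, or mesos://host:port on a real cluster.
val conf = new SparkConf()
  .setAppName("Spark-app")
  .setMaster("local[*]")

val sc = new SparkContext(conf)
println(sc.master)   // local[*]
sc.stop()
```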
  86. 86 Training from Zinoviev Alexey SUBMIT

  87. 87 Training from Zinoviev Alexey Submit ./bin/spark-submit \ --class com.epam.SparkJob1

    \ --master spark://192.168.101.101:7077 \ --executor-memory 2G \ --total-executor-cores 10 \ /path/to/artifact.jar
  88. 88 Training from Zinoviev Alexey A common deployment strategy is

    to submit your application from a gateway machine that is physically co-located with your worker machines
  89. 89 Training from Zinoviev Alexey Submit ./bin/spark-submit \ --class com.epam.SparkJob1

    \ --master mesos://192.168.101.101:7077 \ --executor-memory 2G \ --deploy-mode cluster \ --total-executor-cores 10 \ /path/to/artifact.jar
  90. 90 Training from Zinoviev Alexey STANDALONE CLUSTER

  91. 91 Training from Zinoviev Alexey Start master ./sbin/start-master.sh spark://192.168.101.101:7077

  92. 92 Training from Zinoviev Alexey Start slave ./sbin/start-slave.sh spark://192.168.101.101:7077

  93. 93 Training from Zinoviev Alexey Standalone Cluster Architecture

  94. 94 Training from Zinoviev Alexey Standalone Cluster Architecture with Resources

  95. 95 Training from Zinoviev Alexey EC2 Scripts for Spark 2.2

    https://github.com/amplab/spark-ec2
  96. 96 Training from Zinoviev Alexey MONITORING

  97. 97 Training from Zinoviev Alexey Start history server ./sbin/start-history-server.sh open

    http://192.168.101.101:18080
  98. 98 Training from Zinoviev Alexey Open web UI and enjoy

  99. 99 Training from Zinoviev Alexey Every Spark application launches a

    web UI • A list of scheduler stages and tasks • A summary of RDD sizes and memory usage • Environmental information • Information about the running executors
  100. 100 Training from Zinoviev Alexey Spark Training: Act 2 Alexey

    Zinovyev, Java/BigData Trainer in EPAM
  101. 101 Training from Zinoviev Alexey RDD INTERNALS

  102. 102 Training from Zinoviev Alexey Do it parallel

  103. 103 Training from Zinoviev Alexey I’d like NARROW

  104. 104 Training from Zinoviev Alexey Map, filter, filter

  105. 105 Training from Zinoviev Alexey GroupByKey, join

  106. 106 Training from Zinoviev Alexey Is it a graph with

    tasks and dependencies?
  107. 107 Training from Zinoviev Alexey RDD Lineage is … (aka

    RDD operator graph or RDD dependency graph) a graph of all the parent RDDs of an RDD.
  108. 108 Training from Zinoviev Alexey I’d like NARROW

  109. 109 Training from Zinoviev Alexey toDebugString prints … The execution

    DAG or physical execution plan is the DAG of stages.
  110. 110 Training from Zinoviev Alexey spark.logLineage $ ./bin/spark-shell --conf

    spark.logLineage=true scala> sc.textFile("README.md", 4).count ... 15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25 15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies: (4) MapPartitionsRDD[1] at textFile at <console>:25 [] | README.md HadoopRDD[0] at textFile at <console>:25 []
  111. 111 Training from Zinoviev Alexey Partitions Demo

  112. 112 Training from Zinoviev Alexey Spark Family

  113. 113 Training from Zinoviev Alexey SCHEMA + RDD

  114. 114 Training from Zinoviev Alexey Data sources and formats

  115. 115 Training from Zinoviev Alexey New RDD for each case

  116. 116 Training from Zinoviev Alexey Define schema for data to

    extract with SQL
  117. 117 Training from Zinoviev Alexey Case class for RDD User

    (height: Int (not null), name: String, age: Int)
  118. 118 Training from Zinoviev Alexey Let’s think about tables

  119. 119 Training from Zinoviev Alexey The main concept DataFrames are

    composed of Row objects, along with a schema that describes the data types of each column in the row
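That concept can be made concrete by building a DataFrame from Row objects plus an explicit schema. A sketch (the SparkSession setup, column names, and values are illustrative):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("rows-plus-schema")
  .master("local[*]")
  .getOrCreate()

// The schema describes the data type (and nullability) of each column...
val schema = StructType(Seq(
  StructField("name", StringType,  nullable = false),
  StructField("age",  IntegerType, nullable = true)))

// ...and the DataFrame is distributed Row objects conforming to it.
val rows = spark.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
val df   = spark.createDataFrame(rows, schema)

df.printSchema()
df.show()
```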
  120. 120 Training from Zinoviev Alexey RDD->DF val usersRdd = sqlContext

    .jsonFile("hdfs://localhost:9000/users.json") val df = usersRdd.toDF() val newRDD = df.rdd df.show()
  121. 121 Training from Zinoviev Alexey DATAFRAMES

  122. 122 Training from Zinoviev Alexey DataFrame’s nature • Like an RDD

    with a schema, but it’s not an RDD anymore • Distributed collection of data grouped into named columns • A domain-specific API designed for common tasks on structured data • Available in Python, Scala, Java, and R (via SparkR) • Evolved from SchemaRDD
  123. 123 Training from Zinoviev Alexey DataFrame as SQL • Selecting

    columns and filtering • Joining different data sources • Aggregation (count, sum, average, etc) • Plotting results with Pandas (with PySpark)
  124. 124 Training from Zinoviev Alexey Input & Output

  125. 125 Training from Zinoviev Alexey Input & Output

  126. 126 Training from Zinoviev Alexey Custom Data Sources

  127. 127 Training from Zinoviev Alexey DataFrames Demo

  128. 128 Training from Zinoviev Alexey SPARK SQL

  129. 129 Training from Zinoviev Alexey Run SQL val df =

    spark.read.json("/home/users.json") df.createOrReplaceTempView("users") val sqlDF = spark.sql("SELECT name FROM users") sqlDF.show()
  130. 130 Training from Zinoviev Alexey Spark SQL advantages • Spark

    SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark • Unifies Stack with Spark Core, Spark Streaming etc. • Hive compatibility • Standard connectivity (JDBC, ODBC)
  131. 131 Training from Zinoviev Alexey Spark SQL Demo

  132. 132 Training from Zinoviev Alexey HIVE INTEGRATION

  133. 133 Training from Zinoviev Alexey Hive Support

  134. 134 Training from Zinoviev Alexey If you use Hive

    in a Spark application • Support for writing queries in HQL • Catalog info from the Hive MetaStore • Table scan operator that uses Hive SerDes • Wrappers for Hive UDFs, UDAFs, UDTFs
  135. 135 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(spark) hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.hql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src") val results = hive.hql("FROM src SELECT key, value").collect()
  136. 136 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(spark) hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.hql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src") val results = hive.hql("FROM src SELECT key, value").collect()
  137. 137 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(spark) hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.hql("LOAD DATA LOCAL INPATH '…/WarAndPeace.txt' INTO TABLE src") val results = hive.hql("FROM src SELECT key, value").collect()
  138. 138 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(spark) hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.hql("LOAD DATA LOCAL INPATH '…/WarAndPeace.txt' INTO TABLE src") val results = hive.hql("FROM src SELECT key, value").collect()
  139. 139 Training from Zinoviev Alexey How to cache in memory?

  140. 140 Training from Zinoviev Alexey Easy to cache sql.cacheTable("people")

  141. 141 Training from Zinoviev Alexey The main problem of this

    approach
  142. 142 Training from Zinoviev Alexey THORNY PATH TO DATASETS

  143. 143 Training from Zinoviev Alexey History of Spark APIs

  144. 144 Training from Zinoviev Alexey RDD rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  145. 145 Training from Zinoviev Alexey History of Spark APIs

  146. 146 Training from Zinoviev Alexey SQL rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  147. 147 Training from Zinoviev Alexey Expression rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  148. 148 Training from Zinoviev Alexey History of Spark APIs

  149. 149 Training from Zinoviev Alexey DataSet rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  150. 150 Training from Zinoviev Alexey DataSet = RDD’s types +

    DataFrame’s Catalyst • RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  151. 151 Training from Zinoviev Alexey DataSet = RDD’s types +

    DataFrame’s Catalyst • RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  152. 152 Training from Zinoviev Alexey Structured APIs in SPARK

  153. 153 Training from Zinoviev Alexey Unified API in Spark 2.0

    DataFrame = Dataset[Row] A DataFrame is now an untyped Dataset of Row objects
  154. 154 Training from Zinoviev Alexey Define case class case class

    User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() ds.rdd // IF YOU REALLY WANT
  155. 155 Training from Zinoviev Alexey Read JSON case class User(email:

    String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() ds.rdd // IF YOU REALLY WANT
  156. 156 Training from Zinoviev Alexey Filter by Field case class

    User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() ds.rdd // IF YOU REALLY WANT
  157. 157 Training from Zinoviev Alexey DataSet API Demo

  158. 158 Training from Zinoviev Alexey Spark Family

  159. 159 Training from Zinoviev Alexey CATALYST OPTIMIZER

  160. 160 Training from Zinoviev Alexey Job Stages in Spark

  161. 161 Training from Zinoviev Alexey Scheduler Optimizations

  162. 162 Training from Zinoviev Alexey What’s faster: SQL or DataSet

    API?
  163. 163 Training from Zinoviev Alexey Unified Logical Plan

  164. 164 Training from Zinoviev Alexey SQL String -> Execution

  165. 165 Training from Zinoviev Alexey Catalyst Optimizer for DataFrames

  166. 166 Training from Zinoviev Alexey Bytecode

  167. 167 Training from Zinoviev Alexey How optimizer works

  168. 168 Training from Zinoviev Alexey DataSet.explain() == Physical Plan ==

    Project [avg(price)#43,carat#45] +- SortMergeJoin [color#21], [color#47] :- Sort [color#21 ASC], false, 0 : +- TungstenExchange hashpartitioning(color#21,200), None : +- Project [avg(price)#43,color#21] : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43]) : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L]) : +- Scan CsvRelation(-----) +- Sort [color#47 ASC], false, 0 +- TungstenExchange hashpartitioning(color#47,200), None +- ConvertToUnsafe +- Scan CsvRelation(----)
  169. 169 Training from Zinoviev Alexey Why does explain() show so

    many Tungsten things?
  170. 170 Training from Zinoviev Alexey Tungsten’s goal Push performance closer

    to the limits of modern hardware
  171. 171 Training from Zinoviev Alexey How to be effective with

    CPU • Runtime code generation (Whole-Stage Code Generation) • Cache locality • Off-heap memory management
  172. 172 Training from Zinoviev Alexey Cache Locality

  173. 173 Training from Zinoviev Alexey Whole-Stage CodeGen

  174. 174 Training from Zinoviev Alexey Tungsten Power

  175. 175 Training from Zinoviev Alexey SERIALIZATION

  176. 176 Training from Zinoviev Alexey Issue: Spark uses Java serialization

    A LOT
  177. 177 Training from Zinoviev Alexey Two choices to distribute data

    across the cluster • Java serialization By default, via ObjectOutputStream • Kryo serialization Classes should be registered (no support for Serializable)
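Switching to Kryo and registering classes is done through the configuration. A sketch (the app name and the User class are illustrative):

```scala
import org.apache.spark.SparkConf

// Hypothetical class we want to ship across the cluster efficiently.
case class User(name: String, age: Int)

val conf = new SparkConf()
  .setAppName("kryo-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registration lets Kryo write a small class id instead of the
  // full class name into every serialized record.
  .registerKryoClasses(Array(classOf[User]))
```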
  178. 178 Training from Zinoviev Alexey The main problem: overhead of

    serializing Each serialized object contains the class structure as well as the values
  179. 179 Training from Zinoviev Alexey The main problem: overhead of

    serializing Each serialized object contains the class structure as well as the values Don’t forget about GC
  180. 180 Training from Zinoviev Alexey Tungsten Compact Encoding

  181. 181 Training from Zinoviev Alexey Maybe something UNSAFE?

  182. 182 Training from Zinoviev Alexey UnsafeRow format • Bit set for

    tracking null values • Small values are inlined • Variable-length values store a relative offset into the variable-length data section • Rows are always 8-byte word aligned • Equality comparison and hashing can be performed on raw bytes without additional interpretation
  183. 183 Training from Zinoviev Alexey Encoder’s concept Generate bytecode to

    interact with off-heap & Give access to attributes without ser/deser
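In the Dataset API this concept surfaces as Encoder[T]: for case classes one is derived automatically via spark.implicits._, and the Encoders object exposes them explicitly. A sketch (the User class mirrors the one used earlier in the deck; the SparkSession setup is illustrative):

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

case class User(email: String, footSize: Long, name: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._   // encoders for case classes and primitives

// The implicit encoder generates code that stores Users off-heap in
// Tungsten's binary format and reads fields without full deserialization.
val ds = Seq(User("a@b.com", 42L, "Alice")).toDS()
println(ds.map(_.name).collect().mkString)   // Alice

// Encoders can also be obtained explicitly:
val enc: Encoder[User] = Encoders.product[User]
println(enc.schema)

// For classes Catalyst cannot analyze, fall back to an opaque Kryo blob:
val blob = Encoders.kryo[java.util.ArrayList[String]]
```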
  184. 184 Training from Zinoviev Alexey Encoders

  185. 185 Training from Zinoviev Alexey No custom encoders

  186. 186 Training from Zinoviev Alexey PERFORMANCE

  187. 187 Training from Zinoviev Alexey How to measure Spark performance?

  188. 188 Training from Zinoviev Alexey You’d better measure performance!

  189. 189 Training from Zinoviev Alexey TPCDS 99 Queries http://bit.ly/2dObMsH

  190. 190 Training from Zinoviev Alexey

  191. 191 Training from Zinoviev Alexey How to benchmark Spark

  192. 192 Training from Zinoviev Alexey Special Tool from Databricks Benchmark

    Tool for SparkSQL https://github.com/databricks/spark-sql-perf
  193. 193 Training from Zinoviev Alexey Spark 2 vs Spark 1.6

  194. 194 Training from Zinoviev Alexey MEMORY MANAGEMENT

  195. 195 Training from Zinoviev Alexey Can I influence Memory

    Management in Spark?
  196. 196 Training from Zinoviev Alexey Should I tune generation’s stuff?

  197. 197 Training from Zinoviev Alexey Cached Data

  198. 198 Training from Zinoviev Alexey During operations

  199. 199 Training from Zinoviev Alexey For your needs

  200. 200 Training from Zinoviev Alexey For Dark Lord

  201. 201 Training from Zinoviev Alexey IN CONCLUSION

  202. 202 Training from Zinoviev Alexey Contacts E-mail : Alexey_Zinovyev@epam.com Twitter

    : @zaleslaw @BigDataRussia vk.com/big_data_russia Big Data Russia + Telegram @bigdatarussia vk.com/java_jvm Java & JVM langs + Telegram @javajvmlangs
  203. 203 Training from Zinoviev Alexey Github Spark Tutorial: Core, Streaming,

    Machine Learning https://github.com/zaleslaw/Spark-Tutorial
  204. 204 Training from Zinoviev Alexey Gitbook Data processing with Spark

    2.2 and Kafka 0.10 (in Russian) www.gitbook.com/book/zaleslaw/data-processing-book
  205. 205 Training from Zinoviev Alexey Any questions?