Apache Spark Training [Spark Core, Spark SQL, Tungsten]

All code examples are available here https://github.com/zaleslaw/Spark-Tutorial

A GitBook with notes for this training can be found here:
https://www.gitbook.com/book/zaleslaw/data-processing-book/details

Here you can find the slides from my open Apache Spark training and links to the video tutorials (in Russian).

Alexey Zinoviev

September 11, 2017

Transcript

  1. About In IT since 2007, with Java since 2009, with

    Hadoop since 2012, with Spark since 2014, at EPAM since 2015
  2. 3 Training from Zinoviev Alexey Contacts E-mail : [email protected] Twitter

    : @zaleslaw @BigDataRussia vk.com/big_data_russia Big Data Russia + Telegram @bigdatarussia vk.com/java_jvm Java & JVM langs + Telegram @javajvmlangs
  3. 4 Training from Zinoviev Alexey Github Spark Tutorial: Core, Streaming,

    Machine Learning https://github.com/zaleslaw/Spark-Tutorial
  4. 5 Training from Zinoviev Alexey Gitbook "Data Processing with Spark

    2.2 and Kafka 0.10" (in Russian) www.gitbook.com/book/zaleslaw/data-processing-book
  5. 13 Training from Zinoviev Alexey It’s hard to … •

    .. store • .. handle • .. search in • .. visualize • .. send over the network
  6. 30 Training from Zinoviev Alexey Advantages • native Python, Scala,

    and R interfaces • interactive shells • in-memory caching of data, specified by the user • > 80 highly efficient distributed operations that can be combined freely • can reuse the Hadoop ecosystem, e.g. HDFS and YARN
  7. 36 Training from Zinoviev Alexey Say R... say D...

    say D again • Dataset • Distributed • Resilient
  8. 40 Training from Zinoviev Alexey Loading val localData = Seq(5, 7, 1, 12, 10, 25)

    val ourFirstRDD = sc.parallelize(localData) val textFile = sc.textFile("hdfs://...")
  9. 41 Training from Zinoviev Alexey Loading val localData = Seq(5, 7, 1, 12, 10, 25)

    val ourFirstRDD = sc.parallelize(localData) val textFile = sc.textFile("hdfs://...")
  10. 42 Training from Zinoviev Alexey Loading val localData = Seq(5, 7, 1, 12, 10, 25)

    val ourFirstRDD = sc.parallelize(localData) // from file val textFile = sc.textFile("hdfs://...")
  11. 43 Training from Zinoviev Alexey Loading // Wildcards, running on

    directories, text and archives sc.textFile("/my/directory") sc.textFile("/my/directory/*.txt") sc.textFile("/my/directory/*.gz") // Read directory and return as filename/content pairs sc.wholeTextFiles // Read sequence file sc.sequenceFile[TKey, TValue] // Takes an arbitrary JobConf and InputFormat class sc.hadoopRDD sc.newAPIHadoopRDD // SerDe rdd.saveAsObjectFile sc.objectFile
  12. 46 Training from Zinoviev Alexey Word Count val textFile =

    sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  13. 47 Training from Zinoviev Alexey Word Count val textFile =

    sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  14. 48 Training from Zinoviev Alexey Word Count val textFile =

    sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  15. 51 Training from Zinoviev Alexey Transformations • map, flatMap, filter

    • groupByKey, reduceByKey, sortByKey • mapValues, distinct • join, union • sample
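
    A minimal sketch of chaining a few of these transformations (assuming the usual SparkContext sc from spark-shell; the sample data is made up):

    val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))  // toy key/value RDD
    val upper  = pairs.map { case (k, v) => (k.toUpperCase, v) }    // map
    val sums   = pairs.reduceByKey(_ + _)                           // ("a", 4), ("b", 2)
    val big    = sums.filter { case (_, v) => v > 2 }               // keep sums greater than 2
    val joined = sums.join(big)                                     // inner join on the key
    // Nothing runs yet: transformations are lazy until an action is called
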
  16. 55 Training from Zinoviev Alexey Actions • reduce • collect,

    first, take, foreach • count(), countByKey() • saveAsTextFile()
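
    A small sketch of these actions on a toy RDD (sample values are made up; results are shown in the comments):

    val nums = sc.parallelize(Seq(5, 7, 1, 12, 10, 25))
    nums.reduce(_ + _)        // 60
    nums.collect()            // Array(5, 7, 1, 12, 10, 25)
    nums.first()              // 5
    nums.take(3)              // Array(5, 7, 1)
    nums.count()              // 6
    nums.foreach(println)     // runs on the executors, not on the driver
    sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1))).countByKey()  // Map(a -> 2, b -> 1)
    nums.saveAsTextFile("hdfs://.../nums")                          // hypothetical output path
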
  17. 60 Training from Zinoviev Alexey Caching in Spark • Frequently

    used RDDs can be stored in memory • One method plus one shortcut: persist() and cache() • SparkContext keeps track of cached RDDs • Stored as serialized or deserialized Java objects
  18. 61 Training from Zinoviev Alexey Full list of options •

    MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  19. 62 Training from Zinoviev Alexey Spark Core Storage Level •

    MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  20. 63 Training from Zinoviev Alexey Spark Streaming Storage Level •

    MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER (default for Spark Streaming) • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
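
    A minimal sketch of picking one of these levels explicitly (the input path is hypothetical):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs://.../big.log")
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER)  // cache() is the shortcut for persist(MEMORY_ONLY)
    lines.count()                                    // the first action materializes the cache
    lines.unpersist()                                // free the space once the RDD is no longer needed
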
  21. 66 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin
  22. 68 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin • IntelliJ IDEA Community + Scala Plugin
  23. 69 Training from Zinoviev Alexey Development tools • Console REPL

    ($SPARK_HOME/bin/spark-shell) • Apache Zeppelin • IntelliJ IDEA Community + Scala Plugin • Don’t forget about SBT or adding Spark’s jars
  24. 70 Training from Zinoviev Alexey SBT build name := "Spark-app"

    version := "1.0" scalaVersion := "2.11.11" libraryDependencies += "org.apache.spark" % "spark- core_2.11" % "2.2.0" libraryDependencies += "org.apache.spark" % "spark- sql_2.11" % "2.2.0"
  25. 71 Training from Zinoviev Alexey SBT build name := "Spark-app"

    version := "1.0" scalaVersion := "2.11.11" libraryDependencies += "org.apache.spark" % "spark- core_2.11" % "2.2.0“ % "provided" libraryDependencies += "org.apache.spark" % "spark- sql_2.11" % "2.2.0“ % "provided"
  26. 77 Training from Zinoviev Alexey DAG Scheduler • Build stages

    of tasks • Submit them to the lower-level scheduler • The lower-level scheduler schedules tasks based on data locality • Resubmit failed stages if their outputs are lost
  27. 79 Training from Zinoviev Alexey Task in Spark • Unit

    of work executed in an executor thread • Unlike MR, there is no "map" vs "reduce" task • Each task applies a set of transformations to the same partition of the RDD • Each task either partitions its output for a "shuffle" or sends the output back to the driver
  28. 84 Training from Zinoviev Alexey Cluster Modes • Local mode

    • Standalone mode • YARN • Mesos
  29. 85 Training from Zinoviev Alexey Spark Master URL • local,

    local[n], local[*], local[K,F], local[*,F] • spark://host:port or spark://host1:port, host2:port • yarn-client or yarn-cluster • mesos://host:port
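
    For local experiments the master URL can also be hard-coded when the context is created; a sketch (the app name is arbitrary):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("master-url-demo")
      .setMaster("local[2]")  // or e.g. "spark://host:7077", "yarn", "mesos://host:port"
    val sc = new SparkContext(conf)
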
  30. 87 Training from Zinoviev Alexey Submit ./bin/spark-submit \ --class com.epam.SparkJob1

    \ --master spark://192.168.101.101:7077 \ --executor-memory 2G \ --total-executor-cores 10 \ /path/to/artifact.jar \
  31. 88 Training from Zinoviev Alexey A common deployment strategy is

    to submit your application from a gateway machine that is physically co-located with your worker machines
  32. 89 Training from Zinoviev Alexey Submit ./bin/spark-submit \ --class com.epam.SparkJob1

    \ --master mesos://192.168.101.101:7077 \ --executor-memory 2G \ --deploy-mode cluster \ --total-executor-cores 10 \ /path/to/artifact.jar \
  33. 99 Training from Zinoviev Alexey Every Spark application launches a

    web UI • A list of scheduler stages and tasks • A summary of RDD sizes and memory usage • Environmental information • Information about the running executors
  34. 107 Training from Zinoviev Alexey RDD Lineage is … (aka

    RDD operator graph or RDD dependency graph) a graph of all the parent RDDs of a RDD.
  35. 109 Training from Zinoviev Alexey toDebugString prints … The execution

    DAG or physical execution plan is the DAG of stages.
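
    The lineage can also be printed directly with toDebugString; a sketch using the word-count RDD from the earlier slides (output abbreviated in the comment):

    val counts = sc.textFile("README.md")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the lineage: ShuffledRDD <- MapPartitionsRDD <- ... <- HadoopRDD
    println(counts.toDebugString)
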
  36. 110 Training from Zinoviev Alexey spark.logLineage $ ./bin/spark-shell --conf

    spark.logLineage=true scala> sc.textFile("README.md", 4).count ... 15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25 15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies: (4) MapPartitionsRDD[1] at textFile at <console>:25 [] | README.md HadoopRDD[0] at textFile at <console>:25 []
  37. 117 Training from Zinoviev Alexey Case class for RDD User

    (height: Int (not null), name: String, age: Int)
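
    A sketch of that case class backing a plain RDD (the sample users are made up):

    case class User(height: Int, name: String, age: Int)

    val users = sc.parallelize(Seq(
      User(180, "Alice", 30),
      User(175, "Bob", 17)
    ))

    // Typed field access, but no schema and no query optimization at this level
    val adultNames = users.filter(_.age >= 21).map(_.name).collect()
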
  38. 119 Training from Zinoviev Alexey The main concept DataFrames are

    composed of Row objects, along with a schema that describes the data types of each column in the row
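
    A sketch showing both halves of that definition, the Row objects and the schema (assuming a SparkSession named spark and the users.json file used on the later slides):

    import org.apache.spark.sql.Row

    val df = spark.read.json("/home/users.json")
    df.printSchema()                    // the schema: column names and their data types
    val row: Row = df.first()           // the data itself travels as Row objects
    println(row.getAs[String]("name"))  // columns are accessed by name or position
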
  39. 120 Training from Zinoviev Alexey RDD->DF val usersRdd = sqlContext

    .jsonFile("hdfs://localhost:9000/users.json") val df = usersRdd.toDF() val newRDD = df.rdd df.show()
  40. 122 Training from Zinoviev Alexey DataFrame’s nature • Like an RDD

    with a schema, but it’s not an RDD anymore • Distributed collection of data grouped into named columns • Domain-specific API designed for common tasks on structured data • Available in Python, Scala, Java, and R (via SparkR) • Evolved from SchemaRDD
  41. 123 Training from Zinoviev Alexey DataFrame as SQL • Selecting

    columns and filtering • Joining different data sources • Aggregation (count, sum, average, etc) • Plotting results with Pandas (with PySpark)
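
    A sketch combining those operations in the DataFrame API (the orders.json source and its amount column are made up for illustration):

    import org.apache.spark.sql.functions._

    val users  = spark.read.json("/home/users.json")
    val orders = spark.read.json("/home/orders.json")  // hypothetical second source

    users
      .filter(col("age") > 21)                          // selecting columns and filtering
      .join(orders, "name")                             // joining different data sources
      .groupBy("name")
      .agg(count("*").as("orders"), avg("amount"))      // aggregation
      .show()
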
  42. 129 Training from Zinoviev Alexey Run SQL val df =

    spark.read.json("/home/users.json") df.createOrReplaceTempView("users") val sqlDF = spark.sql("SELECT name FROM users") sqlDF.show()
  43. 130 Training from Zinoviev Alexey Spark SQL advantages • Spark

    SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark • Unifies Stack with Spark Core, Spark Streaming etc. • Hive compatibility • Standard connectivity (JDBC, ODBC)
  44. 134 Training from Zinoviev Alexey If you have Hive

    in your Spark application • Support for writing queries in HQL • Catalog info from the Hive MetaStore • Table scan operator that uses Hive SerDes • Wrappers for Hive UDFs, UDAFs, UDTFs
  45. 135 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(sc) hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.sql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src") val results = hive.sql("FROM src SELECT key, value").collect()
  46. 136 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(sc) hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.sql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src") val results = hive.sql("FROM src SELECT key, value").collect()
  47. 137 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(sc) hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.sql("LOAD DATA LOCAL INPATH '…/WarAndPeace.txt' INTO TABLE src") val results = hive.sql("FROM src SELECT key, value").collect()
  48. 138 Training from Zinoviev Alexey Hive val hive = new

    HiveContext(sc) hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.sql("LOAD DATA LOCAL INPATH '…/WarAndPeace.txt' INTO TABLE src") val results = hive.sql("FROM src SELECT key, value").collect()
  49. 144 Training from Zinoviev Alexey RDD rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  50. 146 Training from Zinoviev Alexey SQL rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  51. 147 Training from Zinoviev Alexey Expression rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  52. 149 Training from Zinoviev Alexey DataSet rdd.filter(_.age > 21) //

    RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  53. 150 Training from Zinoviev Alexey DataSet = RDD’s types +

    DataFrame’s Catalyst • RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  54. 151 Training from Zinoviev Alexey DataSet = RDD’s types +

    DataFrame’s Catalyst • RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  55. 153 Training from Zinoviev Alexey Unified API in Spark 2.0

    DataFrame = Dataset[Row] A DataFrame is now simply an untyped Dataset of Rows
  56. 154 Training from Zinoviev Alexey Define case class case class

    User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
  57. 155 Training from Zinoviev Alexey Read JSON case class User(email:

    String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
  58. 156 Training from Zinoviev Alexey Filter by Field case class

    User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
  59. 168 Training from Zinoviev Alexey DataSet.explain() == Physical Plan ==

    Project [avg(price)#43,carat#45] +- SortMergeJoin [color#21], [color#47] :- Sort [color#21 ASC], false, 0 : +- TungstenExchange hashpartitioning(color#21,200), None : +- Project [avg(price)#43,color#21] : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43]) : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L]) : +- Scan CsvRelation(-----) +- Sort [color#47 ASC], false, 0 +- TungstenExchange hashpartitioning(color#47,200), None +- ConvertToUnsafe +- Scan CsvRelation(----)
  60. 171 Training from Zinoviev Alexey How to be effective with

    CPU • Runtime code generation (Whole Stage Code Generation) • Cache locality • Off-heap memory management
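
    A sketch of seeing whole-stage code generation in the plan (in Spark 2.x the fused operators are marked with a leading '*'):

    val df  = spark.range(0, 1000000).selectExpr("id % 10 AS key", "id AS value")
    val agg = df.groupBy("key").count()

    agg.explain()                                   // fused operators show up as *HashAggregate, *Project, ...
    spark.conf.get("spark.sql.codegen.wholeStage")  // the feature flag, "true" by default in Spark 2.x
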
  61. 177 Training from Zinoviev Alexey Two choices to distribute data

    across the cluster • Java serialization: the default, via ObjectOutputStream • Kryo serialization: classes should be registered (does not rely on Serializable)
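
    A sketch of switching to Kryo and registering the classes that cross the wire (Measurement is a made-up example class):

    import org.apache.spark.SparkConf

    case class Measurement(sensor: String, value: Double)  // hypothetical class shipped between nodes

    val conf = new SparkConf()
      .setAppName("kryo-demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Measurement]))    // keeps full class names out of every record
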
  62. 178 Training from Zinoviev Alexey The main problem: overhead of

    serializing Each serialized object contains the class structure as well as the values
  63. 179 Training from Zinoviev Alexey The main problem: overhead of

    serializing Each serialized object contains the class structure as well as the values Don’t forget about GC
  64. 182 Training from Zinoviev Alexey UnsafeRowFormat • Bit set for

    tracking null values • Small values are inlined • Variable-length values store a relative offset into the variable-length data section • Rows are always 8-byte word aligned • Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation
  65. 183 Training from Zinoviev Alexey Encoder’s concept Generate bytecode to

    interact with off-heap memory & give access to attributes without ser/deser
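
    A sketch of obtaining an encoder explicitly for the User case class from the Dataset slides above (normally it arrives implicitly via import spark.implicits._):

    import org.apache.spark.sql.{Encoder, Encoders}

    case class User(email: String, footSize: Long, name: String)

    val userEncoder: Encoder[User] = Encoders.product[User]  // generated code for (de)serializing User off-heap
    println(userEncoder.schema)                               // the columnar schema the fields are mapped onto

    import spark.implicits._                                  // the implicit route used elsewhere in the deck
    val ds = Seq(User("a@b.com", 42, "Alex")).toDS()          // sample values are made up
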
  66. 192 Training from Zinoviev Alexey Special Tool from Databricks Benchmark

    Tool for SparkSQL https://github.com/databricks/spark-sql-perf
  67. 202 Training from Zinoviev Alexey Contacts E-mail : [email protected] Twitter

    : @zaleslaw @BigDataRussia vk.com/big_data_russia Big Data Russia + Telegram @bigdatarussia vk.com/java_jvm Java & JVM langs + Telegram @javajvmlangs
  68. 203 Training from Zinoviev Alexey Github Spark Tutorial: Core, Streaming,

    Machine Learning https://github.com/zaleslaw/Spark-Tutorial
  69. 204 Training from Zinoviev Alexey Gitbook "Data Processing with Spark

    2.2 and Kafka 0.10" (in Russian) www.gitbook.com/book/zaleslaw/data-processing-book