Slide 1

Spark Training. Alexey Zinovyev, Java/BigData Trainer at EPAM

Slide 2

About: in IT since 2007, Java since 2009, Hadoop since 2012, Spark since 2014, EPAM since 2015

Slide 3

Contacts
E-mail: [email protected]
Twitter: @zaleslaw @BigDataRussia
vk.com/big_data_russia (Big Data Russia) + Telegram @bigdatarussia
vk.com/java_jvm (Java & JVM langs) + Telegram @javajvmlangs

Slide 4

Github
Spark Tutorial: Core, Streaming, Machine Learning
https://github.com/zaleslaw/Spark-Tutorial

Slide 5

Gitbook
Data Processing with Spark 2.2 and Kafka 0.10 (in Russian)
www.gitbook.com/book/zaleslaw/data-processing-book

Slide 6

Spark Family

Slide 8

WHAT IS BIG DATA?

Slide 9

Joke about Excel

Slide 10

Every 60 seconds…

Slide 11

Is BigData about PBs?

Slide 13

It’s hard to…
• store
• handle
• search in
• visualize
• send over the network

Slide 14

How to handle all this stuff?

Slide 15

Just do it… in parallel

Slide 16

Parallel Computing vs Distributed Computing

Slide 17

Modern Java in 2016, Big Data in 2014

Slide 18

Batch jobs produce reports. More and more of them…

Slide 19

But the customer can wait forever (OK, one hour)

Slide 20

Hadoop Architecture

Slide 21

HDFS Architecture

Slide 22

Daemons in YARN

Slide 23

Different scheduling algorithms

Slide 24

Hive Data Model

Slide 25

Machine Learning EVERYWHERE

Slide 26

Data Lake in the promotional brochure

Slide 27

Data Lake in production

Slide 28

Simple Flow in Reporting/BI systems

Slide 29

WHY SHOULD WE USE SPARK?

Slide 30

Advantages
• native Python, Scala, and R interfaces
• interactive shells
• in-memory caching of data, specified by the user
• more than 80 highly efficient distributed operations, combinable in any way
• reuses the Hadoop ecosystem, e.g. HDFS and YARN

Slide 31

MapReduce vs Spark

Slide 34

Let’s use Spark. It’s fast!

Slide 35

SPARK INTRO

Slide 36

Give me an R… give me a D… give me a D again
• Resilient
• Distributed
• Dataset

Slide 37

Single Thread collection

Slide 38

No perf issues, right?

Slide 39

The main concept: more partitions = more parallelism
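
You can see this directly in the shell. A minimal sketch (the collection and partition counts are arbitrary):

val rdd = sc.parallelize(1 to 100, 8) // ask for 8 partitions explicitly
rdd.getNumPartitions // 8: up to 8 tasks can run at once
// repartition() changes the parallelism of an existing RDD (at the cost of a shuffle)
rdd.repartition(16).getNumPartitions // 16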

Slide 40

Loading
// from a local collection (a Seq, not a tuple, so parallelize accepts it)
val localData = Seq(5, 7, 1, 12, 10, 25)
val ourFirstRDD = sc.parallelize(localData)
// from a file
val textFile = sc.textFile("hdfs://...")

Slide 43

Loading
// Wildcards work on directories, text files and archives
sc.textFile("/my/directory")
sc.textFile("/my/directory/*.txt")
sc.textFile("/my/directory/*.gz")
// Read a directory and return filename/content pairs
sc.wholeTextFiles
// Read a sequence file
sc.sequenceFile[TKey, TValue]
// Takes an arbitrary JobConf and InputFormat class
sc.hadoopRDD
sc.newAPIHadoopRDD
// SerDe
rdd.saveAsObjectFile
sc.objectFile

Slide 44

Spark Context

Slide 45

RDD OPERATIONS

Slide 46

Word Count
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Slide 49

What’s the difference between actions and transformations?
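
A minimal sketch of the distinction (transformations are lazy and only build the lineage; an action triggers the actual job):

val words = sc.textFile("hdfs://...").flatMap(_.split(" ")) // transformation: nothing runs yet
val pairs = words.map(word => (word, 1)) // still nothing runs
val total = pairs.count() // action: Spark schedules and executes the job now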

Slide 50

RDD Chain

Slide 51

Transformations
• map, flatMap, filter
• groupByKey, reduceByKey, sortByKey
• mapValues, distinct
• join, union
• sample

Slide 52

FlatMap explanation

Slide 53

Map explanation

Slide 54

ReduceByKey explanation

Slide 55

Actions
• reduce
• collect, first, take, foreach
• count, countByKey
• saveAsTextFile

Slide 56

What’s the difference between a PairRDD and a usual RDD?

Slide 57

Pair RDD
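
A pair RDD is just an RDD of 2-tuples; the key/value operations (reduceByKey, join, countByKey, …) come in implicitly via PairRDDFunctions. A small sketch:

val sales = sc.parallelize(Seq(("apple", 2), ("pear", 1), ("apple", 3)))
val totals = sales.reduceByKey(_ + _) // ("apple", 5), ("pear", 1)
totals.countByKey() // available only on pair RDDs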

Slide 58

RDD Demo

Slide 59

PERSISTENCE

Slide 60

Caching in Spark
• Frequently used RDDs can be stored in memory
• One method and one shortcut: persist() and cache()
• SparkContext keeps track of cached RDDs
• Stored as serialized or deserialized Java objects
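
A minimal sketch (the data source is arbitrary; note that a storage level can only be set before the RDD is first persisted):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://.../access.log")
logs.cache() // shortcut for persist(StorageLevel.MEMORY_ONLY)
logs.count() // the first action actually materializes the cache
// an explicit level would be chosen instead of cache(), e.g.:
// logs.persist(StorageLevel.MEMORY_AND_DISK_SER)
logs.unpersist() // remove from the cache when done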

Slide 61

Full list of options
• MEMORY_ONLY (default for Spark Core)
• MEMORY_AND_DISK
• MEMORY_ONLY_SER (default for Spark Streaming)
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2

Slide 64

BUILDING

Slide 65

Development tools
• Console REPL ($SPARK_HOME/bin/spark-shell)
• Apache Zeppelin
• IntelliJ IDEA Community + Scala Plugin
• Don’t forget about SBT or adding Spark’s jars

Slide 67

Run Zeppelin

Slide 70

SBT build
name := "Spark-app"
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0"

Slide 71

SBT build
name := "Spark-app"
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided"

Slide 72

SPARK ARCHITECTURE

Slide 73

YARN + Driver

Slide 74

Worker Nodes and Executors

Slide 75

Spark Application

Slide 76

Job Stages

Slide 77

DAG Scheduler
• Builds stages of tasks
• Submits them to the lower-level scheduler
• The lower-level scheduler schedules tasks based on data locality
• Resubmits failed stages if their outputs are lost

Slide 78

Scheduler Optimizations

Slide 79

Task in Spark
• Unit of work executed in an executor thread
• Unlike MR, there is no "map" vs "reduce" task
• Each task applies a set of transformations to the same partition of the RDD
• Each task either partitions its output for a "shuffle" or sends the output back to the driver

Slide 80

CONFIGURING

Slide 81

Cluster Modes
• Local mode
• Stand-alone mode
• YARN
• Mesos

Slide 85

Spark Master URL
• local, local[n], local[*], local[K,F], local[*,F]
• spark://host:port or spark://host1:port,host2:port
• yarn-client or yarn-cluster
• mesos://host:port
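
The master can also be set in code; a hedged sketch (the app name and master value are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-app")
  .master("local[*]") // use all local cores
  .getOrCreate()
// for cluster runs the master is usually passed via spark-submit --master instead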

Slide 86

SUBMIT

Slide 87

Submit
./bin/spark-submit \
  --class com.epam.SparkJob1 \
  --master spark://192.168.101.101:7077 \
  --executor-memory 2G \
  --total-executor-cores 10 \
  /path/to/artifact.jar

Slide 88

A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines.

Slide 89

Submit
./bin/spark-submit \
  --class com.epam.SparkJob1 \
  --master mesos://192.168.101.101:7077 \
  --executor-memory 2G \
  --deploy-mode cluster \
  --total-executor-cores 10 \
  /path/to/artifact.jar

Slide 90

STANDALONE CLUSTER

Slide 91

Start master
./sbin/start-master.sh
spark://192.168.101.101:7077

Slide 92

Start slave
./sbin/start-slave.sh spark://192.168.101.101:7077

Slide 93

Standalone Cluster Architecture

Slide 94

Standalone Cluster Architecture with Resources

Slide 95

EC2 Scripts for Spark 2.2
https://github.com/amplab/spark-ec2

Slide 96

MONITORING

Slide 97

Start history server
./sbin/start-history-server.sh
open http://192.168.101.101:18080

Slide 98

Open web UI and enjoy

Slide 99

Every Spark application launches a web UI
• A list of scheduler stages and tasks
• A summary of RDD sizes and memory usage
• Environmental information
• Information about the running executors

Slide 100

Spark Training: Act 2. Alexey Zinovyev, Java/BigData Trainer at EPAM

Slide 101

RDD INTERNALS

Slide 102

Do it in parallel

Slide 103

I’d like NARROW

Slide 104

Map, filter, filter

Slide 105

GroupByKey, join

Slide 106

Is it a graph with tasks and dependencies?

Slide 107

RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD.

Slide 108

I’d like NARROW

Slide 109

toDebugString prints the execution DAG (the physical execution plan), i.e. the DAG of stages.
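
For example (a sketch; the file name is arbitrary):

val counts = sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString) // prints the lineage with shuffle boundaries marked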

Slide 110

spark.logLineage
$ ./bin/spark-shell --conf spark.logLineage=true
scala> sc.textFile("README.md", 4).count
...
15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25
15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies:
(4) MapPartitionsRDD[1] at textFile at <console>:25 []
 |  README.md HadoopRDD[0] at textFile at <console>:25 []

Slide 111

Partitions Demo

Slide 112

Spark Family

Slide 113

SCHEMA + RDD

Slide 114

Data sources and formats

Slide 115

New RDD for each case

Slide 116

Define a schema for the data to extract it with SQL

Slide 117

Case class for the RDD: User(height: Int (not null), name: String, age: Int)
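
As a Scala sketch (the field set mirrors the slide; for primitive types like Int the resulting schema is non-nullable):

case class User(height: Int, name: String, age: Int)

val usersRDD = sc.parallelize(Seq(User(180, "Alex", 33), User(170, "Kate", 27)))
import spark.implicits._ // brings in the toDF() conversion
val usersDF = usersRDD.toDF()
usersDF.printSchema() // height: integer (nullable = false), name: string, age: integer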

Slide 118

Let’s think about tables

Slide 119

The main concept: DataFrames are composed of Row objects, along with a schema that describes the data types of each column in the row

Slide 120

RDD -> DF
val usersRdd = sqlContext.jsonFile("hdfs://localhost:9000/users.json")
val df = usersRdd.toDF()
val newRDD = df.rdd
df.show()

Slide 121

DATAFRAMES

Slide 122

DataFrame’s nature
• Like an RDD with a schema, but it’s not an RDD now
• Distributed collection of data grouped into named columns
• Domain-specific API designed for common tasks on structured data
• Available in Python, Scala, Java, and R (via SparkR)
• Evolved from SchemaRDD

Slide 123

DataFrame as SQL
• Selecting columns and filtering
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Plotting results with Pandas (via PySpark)
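
A short sketch of these operations (the column names are invented for illustration):

import spark.implicits._

val users = spark.read.json("/home/users.json")
users.select("name", "age")
  .filter($"age" > 21)
  .groupBy("age")
  .count()
  .show()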

Slide 124

Input & Output

Slide 126

Custom Data Sources

Slide 127

DataFrames Demo

Slide 128

SPARK SQL

Slide 129

Run SQL
val df = spark.read.json("/home/users.json")
df.createOrReplaceTempView("users")
val sqlDF = spark.sql("SELECT name FROM users")
sqlDF.show()

Slide 130

Spark SQL advantages
• allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark
• unifies the stack with Spark Core, Spark Streaming, etc.
• Hive compatibility
• standard connectivity (JDBC, ODBC)
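
Connectivity also works the other way: Spark can read any JDBC source. A hedged sketch (the URL, table, and credentials are placeholders):

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/shop")
  .option("dbtable", "public.users")
  .option("user", "spark")
  .option("password", "secret")
  .load()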

Slide 131

Spark SQL Demo

Slide 132

HIVE INTEGRATION

Slide 133

Hive Support

Slide 134

If you have Hive in a Spark application
• Support for writing queries in HQL
• Catalog info from the Hive MetaStore
• Table-scan operator that uses Hive SerDes
• Wrappers for Hive UDFs, UDAFs, UDTFs

Slide 135

Hive
val hive = new HiveContext(spark)
hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.hql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")
val results = hive.hql("FROM src SELECT key, value").collect()

Slide 137

Hive
val hive = new HiveContext(spark)
hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.hql("LOAD DATA LOCAL INPATH '…/WarAndPeace.txt' INTO TABLE src")
val results = hive.hql("FROM src SELECT key, value").collect()

Slide 139

How to cache in memory?

Slide 140

Easy to cache
sql.cacheTable("people")

Slide 141

The main problem of this approach

Slide 142

THORNY PATH TO DATASETS

Slide 143

History of Spark APIs

Slide 144

RDD, SQL, Expression, and Dataset styles
rdd.filter(_.age > 21) // RDD
df.filter("age > 21") // DataFrame, SQL style
df.filter(df.col("age").gt(21)) // DataFrame, expression style
dataset.filter(_.age > 21) // Dataset API

Slide 150

DataSet = RDD’s types + DataFrame’s Catalyst
• RDD API
• compile-time type safety
• off-heap storage mechanism
• performance benefits of the Catalyst query optimizer
• Tungsten

Slide 152

Structured APIs in SPARK

Slide 153

Unified API in Spark 2.0: DataFrame = Dataset[Row]. A DataFrame is an untyped Dataset now.

Slide 154

Define a case class, read JSON, filter by field
case class User(email: String, footSize: Long, name: String)
// DataFrame -> DataSet of Users
val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User]
userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()
userDS.rdd // IF YOU REALLY WANT

Slide 157

DataSet API Demo

Slide 158

Spark Family

Slide 159

CATALYST OPTIMIZER

Slide 160

Job Stages in Spark

Slide 161

Scheduler Optimizations

Slide 162

What’s faster: SQL or DataSet API?

Slide 163

Unified Logical Plan

Slide 164

SQL String -> Execution

Slide 165

Catalyst Optimizer for DataFrames

Slide 166

Bytecode

Slide 167

How the optimizer works

Slide 168

DataSet.explain()
== Physical Plan ==
Project [avg(price)#43,carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43,color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)
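
To get a similar plan for any query (a sketch; the CSV path and columns are placeholders):

val diamonds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/diamonds.csv")
val byColor = diamonds.groupBy("color").avg("price")
byColor.explain() // physical plan only
byColor.explain(true) // parsed, analyzed, optimized and physical plans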

Slide 169

Why does explain() show so many Tungsten things?

Slide 170

Tungsten’s goal: push performance closer to the limits of modern hardware

Slide 171

How to be effective with the CPU
• Runtime code generation (Whole-Stage Code Generation)
• Cache locality
• Off-heap memory management

Slide 172

Cache Locality

Slide 173

Whole-Stage CodeGen

Slide 174

Tungsten Power

Slide 175

SERIALIZATION

Slide 176

Issue: Spark uses Java serialization A LOT

Slide 177

Two choices to distribute data across the cluster
• Java serialization: the default, via ObjectOutputStream
• Kryo serialization: requires registering classes (does not rely on Serializable)
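
Switching to Kryo is a configuration change. A hedged sketch (the registered class is the hypothetical User from earlier):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[User])) // registration keeps the serialized form compact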

Slide 178

The main problem: the overhead of serializing
Each serialized object contains the class structure as well as the values.
Don’t forget about GC.

Slide 180

Tungsten Compact Encoding

Slide 181

Maybe something UNSAFE?

Slide 182

UnsafeRow format
• Bit set for tracking null values
• Small values are inlined
• Variable-length values store a relative offset into the variable-length data section
• Rows are always 8-byte word aligned
• Equality comparison and hashing can be performed on raw bytes without additional interpretation

Slide 183

Encoder’s concept: generate bytecode to interact with off-heap memory and give access to attributes without ser/deser
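
Encoders are derived implicitly for case classes and primitives; a sketch (User is the case class from the earlier slides):

import org.apache.spark.sql.{Encoder, Encoders}

val userEncoder: Encoder[User] = Encoders.product[User] // what spark.implicits._ derives
val kryoEncoder: Encoder[User] = Encoders.kryo[User] // generic fallback, opaque to Catalyst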

Slide 184

Encoders

Slide 185

No custom encoders

Slide 186

PERFORMANCE

Slide 187

How to measure Spark performance?

Slide 188

You’d better measure performance!

Slide 189

TPC-DS: 99 Queries
http://bit.ly/2dObMsH


Slide 191

How to benchmark Spark

Slide 192

Special tool from Databricks: a benchmark tool for Spark SQL
https://github.com/databricks/spark-sql-perf

Slide 193

Spark 2 vs Spark 1.6

Slide 194

MEMORY MANAGEMENT

Slide 195

Can I influence memory management in Spark?

Slide 196

Should I tune the GC generations?

Slide 197

Cached Data

Slide 198

During operations

Slide 199

For your needs

Slide 200

For the Dark Lord

Slide 201

IN CONCLUSION

Slide 202

Contacts
E-mail: [email protected]
Twitter: @zaleslaw @BigDataRussia
vk.com/big_data_russia (Big Data Russia) + Telegram @bigdatarussia
vk.com/java_jvm (Java & JVM langs) + Telegram @javajvmlangs

Slide 203

Github
Spark Tutorial: Core, Streaming, Machine Learning
https://github.com/zaleslaw/Spark-Tutorial

Slide 204

Gitbook
Data Processing with Spark 2.2 and Kafka 0.10 (in Russian)
www.gitbook.com/book/zaleslaw/data-processing-book

Slide 205

Any questions?