
Spark Streaming Tips for Devs & Ops

J On The Beach
21st May 2016 - La Térmica, Málaga, Spain

Fede

May 21, 2016

Transcript

  1. Spark Streaming Tips for Devs & Ops

  2. None
  3. WHO ARE WE? Fede Fernández: Scala Software Engineer at 47 Degrees, Spark Certified Developer, @fede_fdz. Fran Pérez: Scala Software Engineer at 47 Degrees, Spark Certified Developer, @FPerezP.
  4. Overview: Spark Streaming, Spark + Kafka, groupByKey vs reduceByKey, Table Joins, Serializers, Tuning
  5. Spark Streaming: real-time processing of a continuous data flow [diagram: a DStream is a sequence of RDDs that feeds the output data]
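
    A minimal sketch (not from the deck) of the DStream model described above, assuming Spark 1.x; the socket source, host and port are placeholders:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
      // A DStream is a continuous sequence of RDDs, one per batch interval (10 s here).
      val ssc = new StreamingContext(conf, Seconds(10))

      // Placeholder source: a plain text socket on localhost:9999.
      val lines  = ssc.socketTextStream("localhost", 9999)
      val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      counts.print() // output data, one micro-batch at a time

      ssc.start()
      ssc.awaitTermination()
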
  6. Spark + Kafka • Receiver-based Approach ◦ At least once

    (with Write Ahead Logs) • Direct API ◦ Exactly once
  7. Spark + Kafka • Receiver-based Approach

  8. Spark + Kafka • Direct API
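
    A hedged sketch of the Direct API for the Spark 1.x spark-streaming-kafka artifact; broker addresses and the topic name are placeholders, and exactly-once still depends on idempotent or transactional output:

      import kafka.serializer.StringDecoder
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.kafka.KafkaUtils
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct-sketch"), Seconds(10))

      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092") // placeholder brokers
      val topics      = Set("events")                                              // placeholder topic

      // Direct API: no receivers, one RDD partition per Kafka partition,
      // offsets tracked by Spark itself instead of ZooKeeper.
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)

      stream.map(_._2).count().print()
      ssc.start()
      ssc.awaitTermination()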

  9. groupByKey VS reduceByKey • groupByKey ◦ Groups pairs of data

    with the same key. • reduceByKey ◦ Groups and combines pairs of data based on a reduce operation.
  10. groupByKey VS reduceByKey
      sc.textFile("hdfs://…").flatMap(_.split(" ")).map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))
      sc.textFile("hdfs://…").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  11. groupByKey [diagram: fifteen (key, 1) pairs for the keys c, s, j, spread across three partitions]
  12. groupByKey [diagram: the same unaggregated (key, 1) pairs, before any shuffle]
  13. groupByKey [diagram: every single (key, 1) pair is sent over the network during the shuffle]
  14. groupByKey [diagram: after the shuffle, all pairs for a given key sit on one partition]
  15. groupByKey [diagram: values are summed only after the full shuffle, giving (j, 5), (s, 6), (c, 4)]
  16. reduceByKey [diagram: the same fifteen (key, 1) pairs spread across three partitions]
  17. reduceByKey [diagram: pairs are combined locally within each partition, e.g. (s, 2), (j, 2), (c, 1)]
  18. reduceByKey [diagram: per-partition partial sums, before the shuffle]
  19. reduceByKey [diagram: only the partial sums are sent over the network during the shuffle]
  20. reduceByKey [diagram: partial sums are merged after the shuffle, giving (j, 5), (s, 6), (c, 4)]
  21. reduce VS group • Improve performance • Can’t always be

    used • Out of Memory Exceptions • aggregateByKey, foldByKey, combineByKey
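
    A small sketch of the aggregateByKey alternative mentioned above, assuming an existing SparkContext sc; it computes per-key averages, where the accumulator type (sum, count) differs from the value type, something reduceByKey alone cannot express:

      val pairs = sc.parallelize(Seq(("s", 3.0), ("j", 1.0), ("s", 5.0), ("c", 2.0)))

      val sumCount = pairs.aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1),    // fold one value into the partition-local accumulator
        (a, b)   => (a._1 + b._1, a._2 + b._2))  // merge accumulators coming from different partitions

      val averages = sumCount.mapValues { case (sum, count) => sum / count }
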
  22. Table Joins • Typical operations that can be improved • They require prior analysis • There are no silver bullets
  23. Table Joins: Medium - Large

  24. Table Joins: Medium - Large [diagram labels: FILTER, No Shuffle]
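
    These two slides are diagrams only; one plausible reading of the FILTER / no-shuffle idea, assuming the medium-sized side's keys fit in driver memory and hypothetical pair RDDs mediumRdd and largeRdd, is to broadcast the medium keys and trim the large side before any shuffled join:

      val mediumKeys   = sc.broadcast(mediumRdd.keys.collect().toSet)
      val trimmedLarge = largeRdd.filter { case (k, _) => mediumKeys.value.contains(k) }
      // trimmedLarge can now be joined or processed with far less data moving across the network.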

  25. Table Joins: Small - Large ... Shuffled Hash Join
      sqlContext.sql("explain <select>").collect.mkString("\n")
      [== Physical Plan ==]
      [Project]
      [+- SortMergeJoin]
      [ :- Sort]
      [ :  +- TungstenExchange hashpartitioning]
      [ :     +- TungstenExchange RoundRobinPartitioning]
      [ :        +- ConvertToUnsafe]
      [ :           +- Scan ExistingRDD]
      [ +- Sort]
      [    +- TungstenExchange hashpartitioning]
      [       +- ConvertToUnsafe]
      [          +- Scan ExistingRDD]
  26. Table Joins: Small - Large Broadcast Hash Join
      sqlContext.sql("explain <select>").collect.mkString("\n")
      [== Physical Plan ==]
      [Project]
      [+- BroadcastHashJoin]
      [ :- TungstenExchange RoundRobinPartitioning]
      [ :  +- ConvertToUnsafe]
      [ :     +- Scan ExistingRDD]
      [ +- Scan ParquetRelation]
      No shuffle! The small side is broadcast to every executor.
      Chosen by default from Spark 1.4 onwards when using the DataFrame API.
      Prior to Spark 1.4, collect table statistics first: ANALYZE TABLE small_table COMPUTE STATISTICS noscan
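
    A hedged sketch of forcing the broadcast join from the DataFrame API (Spark 1.5+); smallDF, largeDF and the join column "id" are placeholders:

      import org.apache.spark.sql.functions.broadcast

      // The broadcast hint asks the planner for a BroadcastHashJoin: the small side is
      // shipped to every executor instead of shuffling the large one.
      val joined = largeDF.join(broadcast(smallDF), "id")
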
  27. Table Joins: Small - Large

  28. Serializers • Java's ObjectOutputStream framework (default). • Custom serializers: extend Serializable & Externalizable. • KryoSerializer: register your custom classes. • Mind where your code is actually being run (driver vs executors). • Take special care with JodaTime.
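
    A minimal sketch of switching to Kryo and registering application classes; Event is a hypothetical class standing in for whatever travels through your shuffles and broadcasts:

      import org.apache.spark.SparkConf

      case class Event(id: Long, payload: String) // hypothetical application class

      val conf = new SparkConf()
        .setAppName("kryo-sketch")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(Array(classOf[Event]))
      // Unregistered classes still work with Kryo, but pay for the full class name on every record.
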
  29. Tuning: Garbage Collector, blockInterval, Partitioning, Storage

  30. Tuning: Garbage Collector • Matters for applications that put heavy pressure on memory. • GC strategies: Concurrent Mark Sweep (CMS) GC, ParallelOld GC, Garbage-First (G1) GC. • Tuning steps: review your logic and object management, try Garbage-First, activate and inspect the GC logs. Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
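
    One possible way, following the Databricks post referenced above, to enable Garbage-First and GC logging on the executors; the exact flags depend on your JVM version and heap size:

      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
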
  31. Tuning: blockInterval
      blockInterval = (bi * consumers) / (pf * sc)
      • bi: batch interval time in milliseconds.
      • consumers: number of streaming consumers (receivers).
      • CAT: total number of cores available to the application.
      • pf (partitionFactor): number of partitions per core.
      • sc (sparkCores): CAT - consumers.
  32. blockInterval: example • batchIntervalMillis = 600,000 • consumers = 20

    • CAT = 120 • sparkCores = 120 - 20 = 100 • partitionFactor = 3 blockInterval = (bi * consumers) / (pf * sc) = (600,000 * 20) / (3 * 100) = 40,000
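
    Applying the value computed above (40,000 ms) via the standard spark.streaming.blockInterval property (default 200 ms):

      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .setAppName("block-interval-sketch")
        .set("spark.streaming.blockInterval", "40000ms")
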
  33. Tuning: Partitioning
      partitions = consumers * bi / blockInterval
      • consumers: number of streaming consumers (receivers).
      • bi: batch interval time in milliseconds.
      • blockInterval: interval used to split received data into blocks before they are stored in Spark.
  34. Partitioning: example • batchIntervalMillis = 600,000 • consumers = 20 • blockInterval = 40,000
      partitions = consumers * bi / blockInterval = 20 * 600,000 / 40,000 = 300
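
    A quick, hedged way to verify the partition count per micro-batch, assuming stream is any of the DStreams from the sketches above; for receiver-based streams it should match consumers * bi / blockInterval:

      stream.foreachRDD { rdd =>
        println(s"partitions in this batch: ${rdd.partitions.length}")
      }
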
  35. Tuning: Storage • Default (MEMORY_ONLY) • MEMORY_ONLY_SER with Serialization Library

    • MEMORY_AND_DISK & DISK_ONLY • Replicated _2 • OFF_HEAP (Tachyon/Alluxio)
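
    A small sketch of overriding the storage level on a DStream (stream is again a placeholder from the earlier sketches):

      import org.apache.spark.storage.StorageLevel

      // Serialized storage trades some CPU for a much smaller memory footprint;
      // the _2 variants keep a second replica, MEMORY_AND_DISK spills to disk when memory fills up.
      stream.persist(StorageLevel.MEMORY_ONLY_SER)
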
  36. Where to find more information? Spark Official Documentation Databricks Blog

    Databricks Spark Knowledge Base Spark Notebook - By Andy Petrella Databricks YouTube Channel
  37. QUESTIONS Fede Fernández @fede_fdz fede.f@47deg.com Fran Pérez @FPerezP fran.p@47deg.com Thanks!