
Spark Streaming Tips for Devs & Ops

J On The Beach
21st May 2016 - La Térmica, Málaga, Spain

Fede


Transcript

  1. WHO ARE WE? Fede Fernández Scala Software Engineer at 47

    Degrees Spark Certified Developer @fede_fdz Fran Pérez Scala Software Engineer at 47 Degrees Spark Certified Developer @FPerezP
  2. Spark + Kafka • Receiver-based Approach ◦ At least once

    (with Write Ahead Logs) • Direct API ◦ Exactly once
  3. groupByKey VS reduceByKey • groupByKey ◦ Groups pairs of data

    with the same key. • reduceByKey ◦ Groups and combines pairs of data based on a reduce operation.
  4. groupByKey VS reduceByKey sc.textFile("hdfs://….") .flatMap(_.split(" ")) .map((_, 1)).groupByKey.map(t => (t._1,

    t._2.sum)) sc.textFile("hdfs://….") .flatMap(_.split(" ")) .map((_, 1)).reduceByKey(_ + _)
  5. groupByKey (diagram, slides 5-9): three partitions of (word, 1) pairs,

    e.g. (c, 1), (s, 1), (j, 1), are shuffled across the network in full; every single pair travels. Only after the shuffle are the values summed per key, yielding (j, 5), (s, 6), (c, 4).
  10. reduceByKey (diagram, slides 10-14): the same three partitions first combine

    pairs locally, e.g. (s, 1) and (s, 1) become (s, 2), so at most one pair per key per partition crosses the shuffle. The shuffled partial counts are then merged into the same final result: (j, 5), (s, 6), (c, 4).
  15. reduce VS group • reduceByKey improves performance, but it can't always

    be used. • groupByKey can cause OutOfMemory exceptions when a key's group is too large to fit in memory. • Alternatives: aggregateByKey, foldByKey, combineByKey.
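The shuffle-volume difference behind these bullets can be sketched without Spark, using plain Scala collections to emulate reduceByKey's map-side combine (all names here are illustrative, not the speakers' code):

```scala
// Plain-Scala emulation of the two shuffle strategies; no Spark required.
object CombineSketch {
  // groupByKey-style: every (word, 1) pair would cross the shuffle.
  def groupStyle(words: Seq[String]): Map[String, Int] =
    words.map((_, 1)).groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  // reduceByKey-style: each "partition" pre-aggregates locally, so at most
  // one pair per key per partition would cross the shuffle.
  def reduceStyle(partitions: Seq[Seq[String]]): Map[String, Int] = {
    val combined = partitions.map(_.groupBy(identity).map { case (w, ws) => (w, ws.size) })
    combined.flatten.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("c", "s", "j", "s", "c"),
                    Seq("c", "j", "j", "s", "s"),
                    Seq("s", "j", "j", "c", "s"))
    println(groupStyle(parts.flatten)) // in Spark terms, all 15 raw pairs shuffled
    println(reduceStyle(parts))        // only 9 pre-combined pairs shuffled
  }
}
```

Both produce Map(c -> 4, s -> 6, j -> 5); only the amount of data crossing the shuffle boundary differs.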
  16. Table Joins • Typical operations that can be improved •

    They require prior analysis • There are no silver bullets
  17. Table Joins: Small - Large ... Shuffled Hash Join sqlContext.sql("explain

    <select>").collect.mkString("\n") [== Physical Plan ==] [Project] [+- SortMergeJoin] [ :- Sort] [ : +- TungstenExchange hashpartitioning] [ : +- TungstenExchange RoundRobinPartitioning] [ : +- ConvertToUnsafe] [ : +- Scan ExistingRDD] [ +- Sort] [ +- TungstenExchange hashpartitioning] [ +- ConvertToUnsafe] [ +- Scan ExistingRDD]
  18. Table Joins: Small - Large Broadcast Hash Join sqlContext.sql("explain <select>").collect.mkString("\n")

    [== Physical Plan ==] [Project] [+- BroadcastHashJoin] [ :- TungstenExchange RoundRobinPartitioning] [ : +- ConvertToUnsafe] [ : +- Scan ExistingRDD] [ +- Scan ParquetRelation] No shuffle! Enabled by default from Spark 1.4 when using the DataFrame API. Prior to Spark 1.4: ANALYZE TABLE small_table COMPUTE STATISTICS noscan
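The broadcast decision is driven by a size threshold on the smaller table; a hedged sketch of the relevant Spark 1.x knobs (sqlContext, smallDF, and largeDF are placeholders, and the threshold value is illustrative):

```scala
// Spark SQL broadcasts the smaller side of a join when its estimated size
// falls under this threshold (the default is around 10 MB).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

// From Spark 1.5 onwards, a broadcast can also be requested explicitly:
import org.apache.spark.sql.functions.broadcast
largeDF.join(broadcast(smallDF), "id")
```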
  19. Serializers • Java's ObjectOutputStream framework (default). • Custom serializers: extend

    Serializable & Externalizable. • KryoSerializer: register your custom classes. • Where is our code being run? • Take special care with JodaTime.
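For the Kryo bullet above, a minimal configuration sketch on the SparkConf (the domain classes are placeholders standing in for whatever actually crosses the wire):

```scala
import org.apache.spark.SparkConf

// Placeholder domain classes, for illustration only.
case class MyEvent(id: Long, payload: String)
case class MyResult(id: Long, count: Int)

// Switch from Java serialization to Kryo and register the shuffled classes,
// which avoids Kryo writing full class names into every record.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent], classOf[MyResult]))
```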
  20. Tuning: Garbage Collector • Matters for applications that rely heavily on memory

    consumption. • GC strategies: • Concurrent Mark Sweep (CMS) GC • ParallelOld GC • Garbage-First (G1) GC • Tuning steps: • Review your logic and object management • Try Garbage-First • Activate and inspect the GC logs Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
  21. Tuning: blockInterval blockInterval = (bi * consumers) / (pf *

    sc) • CAT: total cores allocated to the application. • bi: Batch Interval time in milliseconds. • consumers: number of streaming consumers. • pf (partitionFactor): number of partitions per core. • sc (sparkCores): CAT - consumers.
  22. blockInterval: example • batchIntervalMillis = 600,000 • consumers = 20

    • CAT = 120 • sparkCores = 120 - 20 = 100 • partitionFactor = 3 blockInterval = (bi * consumers) / (pf * sc) = (600,000 * 20) / (3 * 100) = 40,000
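The example above is a direct evaluation of the formula from slide 21; a plain-Scala check, no Spark required:

```scala
object BlockIntervalExample {
  // blockInterval = (bi * consumers) / (pf * sc), as defined on slide 21.
  def blockInterval(bi: Long, consumers: Int, totalCores: Int, partitionFactor: Int): Long = {
    val sparkCores = totalCores - consumers // cores left after the receivers
    (bi * consumers) / (partitionFactor.toLong * sparkCores)
  }

  def main(args: Array[String]): Unit =
    println(blockInterval(600000L, 20, 120, 3)) // 40000 ms
}
```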
  23. Tuning: Partitioning partitions = consumers * bi / blockInterval •

    consumers: number of streaming consumers. • bi: Batch Interval time in milliseconds. • blockInterval: interval at which received data is split into blocks before being stored in Spark.
  24. Partitioning: example • batchIntervalMillis = 600,000 • consumers = 20

    • blockInterval = 40,000 partitions = consumers * bi / blockInterval = 20 * 600,000 / 40,000 = 300
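A direct evaluation of the partitioning formula from slide 23 with these numbers, in plain Scala:

```scala
object PartitioningExample {
  // partitions = consumers * bi / blockInterval, as defined on slide 23.
  def partitions(consumers: Int, bi: Long, blockInterval: Long): Long =
    consumers * bi / blockInterval

  def main(args: Array[String]): Unit =
    println(partitions(20, 600000L, 40000L)) // 300, i.e. partitionFactor * sparkCores
}
```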
  25. Tuning: Storage • Default (MEMORY_ONLY) • MEMORY_ONLY_SER with Serialization Library

    • MEMORY_AND_DISK & DISK_ONLY • Replicated variants (_2 suffix) • OFF_HEAP (Tachyon/Alluxio)
  26. Where to find more information? • Spark Official Documentation •

    Databricks Blog • Databricks Spark Knowledge Base • Spark Notebook, by Andy Petrella • Databricks YouTube Channel