
Spark Streaming Tips for Devs & Ops

J On The Beach
21st May 2016 - La Térmica, Málaga, Spain

Fede

May 21, 2016

Transcript

  1. Spark Streaming Tips for Devs & Ops

  2. None
  3. WHO ARE WE? Fede Fernández: Scala Software Engineer at 47 Degrees, Spark Certified Developer, @fede_fdz. Fran Pérez: Scala Software Engineer at 47 Degrees, Spark Certified Developer, @FPerezP.
  4. Overview: Spark Streaming, Spark + Kafka, groupByKey vs reduceByKey, Table Joins, Serializers, Tuning
  5. Spark Streaming: real-time processing of a continuous data flow [diagram: a DStream is a sequence of RDDs that feeds the output data]
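
    A minimal sketch (not from the deck) of the DStream model described above, assuming Spark 1.x; the socket source, host and port are placeholders:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
      // A DStream is a continuous sequence of RDDs, one per batch interval (10 s here).
      val ssc = new StreamingContext(conf, Seconds(10))

      // Placeholder source: a plain text socket on localhost:9999.
      val lines  = ssc.socketTextStream("localhost", 9999)
      val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      counts.print() // output data, one micro-batch at a time

      ssc.start()
      ssc.awaitTermination()
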
  6. Spark + Kafka • Receiver-based Approach ◦ At least once

    (with Write Ahead Logs) • Direct API ◦ Exactly once
  7. Spark + Kafka • Receiver-based Approach

  8. Spark + Kafka • Direct API
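
    A hedged sketch of the Direct API for the Spark 1.x spark-streaming-kafka artifact; broker addresses and the topic name are placeholders, and exactly-once still depends on idempotent or transactional output:

      import kafka.serializer.StringDecoder
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.kafka.KafkaUtils
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct-sketch"), Seconds(10))

      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092") // placeholder brokers
      val topics      = Set("events")                                              // placeholder topic

      // Direct API: no receivers, one RDD partition per Kafka partition,
      // offsets tracked by Spark itself instead of ZooKeeper.
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)

      stream.map(_._2).count().print()
      ssc.start()
      ssc.awaitTermination()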

  9. groupByKey VS reduceByKey • groupByKey ◦ Groups pairs of data

    with the same key. • reduceByKey ◦ Groups and combines pairs of data based on a reduce operation.
  10. groupByKey VS reduceByKey
      sc.textFile("hdfs://…").flatMap(_.split(" ")).map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))
      sc.textFile("hdfs://…").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  11. groupByKey [diagram: fifteen (key, 1) pairs for the keys c, s, j, spread across three partitions]
  12. groupByKey [diagram: the same unaggregated (key, 1) pairs, before any shuffle]
  13. groupByKey [diagram: every single (key, 1) pair is sent over the network during the shuffle]
  14. groupByKey [diagram: after the shuffle, all pairs for a given key sit on one partition]
  15. groupByKey [diagram: values are summed only after the full shuffle, giving (j, 5), (s, 6), (c, 4)]
  16. reduceByKey [diagram: the same fifteen (key, 1) pairs spread across three partitions]
  17. reduceByKey [diagram: pairs are combined locally within each partition, e.g. (s, 2), (j, 2), (c, 1)]
  18. reduceByKey [diagram: per-partition partial sums, before the shuffle]
  19. reduceByKey [diagram: only the partial sums are sent over the network during the shuffle]
  20. reduceByKey [diagram: partial sums are merged after the shuffle, giving (j, 5), (s, 6), (c, 4)]
  21. reduce VS group • Improve performance • Can’t always be

    used • Out of Memory Exceptions • aggregateByKey, foldByKey, combineByKey
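
    A small sketch of the aggregateByKey alternative mentioned above, assuming an existing SparkContext sc; it computes per-key averages, where the accumulator type (sum, count) differs from the value type, something reduceByKey alone cannot express:

      val pairs = sc.parallelize(Seq(("s", 3.0), ("j", 1.0), ("s", 5.0), ("c", 2.0)))

      val sumCount = pairs.aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1),    // fold one value into the partition-local accumulator
        (a, b)   => (a._1 + b._1, a._2 + b._2))  // merge accumulators coming from different partitions

      val averages = sumCount.mapValues { case (sum, count) => sum / count }
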
  22. Table Joins • Typical operations that can be improved • They require prior analysis • There are no silver bullets
  23. Table Joins: Medium - Large

  24. Table Joins: Medium - Large [diagram labels: FILTER, No Shuffle]
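
    These two slides are diagrams only; one plausible reading of the FILTER / no-shuffle idea, assuming the medium-sized side's keys fit in driver memory and hypothetical pair RDDs mediumRdd and largeRdd, is to broadcast the medium keys and trim the large side before any shuffled join:

      val mediumKeys   = sc.broadcast(mediumRdd.keys.collect().toSet)
      val trimmedLarge = largeRdd.filter { case (k, _) => mediumKeys.value.contains(k) }
      // trimmedLarge can now be joined or processed with far less data moving across the network.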

  25. Table Joins: Small - Large ... Shuffled Hash Join
      sqlContext.sql("explain <select>").collect.mkString("\n")
      [== Physical Plan ==]
      [Project]
      [+- SortMergeJoin]
      [ :- Sort]
      [ :  +- TungstenExchange hashpartitioning]
      [ :     +- TungstenExchange RoundRobinPartitioning]
      [ :        +- ConvertToUnsafe]
      [ :           +- Scan ExistingRDD]
      [ +- Sort]
      [    +- TungstenExchange hashpartitioning]
      [       +- ConvertToUnsafe]
      [          +- Scan ExistingRDD]
  26. Table Joins: Small - Large Broadcast Hash Join
      sqlContext.sql("explain <select>").collect.mkString("\n")
      [== Physical Plan ==]
      [Project]
      [+- BroadcastHashJoin]
      [ :- TungstenExchange RoundRobinPartitioning]
      [ :  +- ConvertToUnsafe]
      [ :     +- Scan ExistingRDD]
      [ +- Scan ParquetRelation]
      No shuffle! The small side is broadcast to every executor.
      Chosen by default from Spark 1.4 onwards when using the DataFrame API.
      Prior to Spark 1.4, collect table statistics first: ANALYZE TABLE small_table COMPUTE STATISTICS noscan
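
    A hedged sketch of forcing the broadcast join from the DataFrame API (Spark 1.5+); smallDF, largeDF and the join column "id" are placeholders:

      import org.apache.spark.sql.functions.broadcast

      // The broadcast hint asks the planner for a BroadcastHashJoin: the small side is
      // shipped to every executor instead of shuffling the large one.
      val joined = largeDF.join(broadcast(smallDF), "id")
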
  27. Table Joins: Small - Large

  28. Serializers • Java's ObjectOutputStream framework (default). • Custom serializers: extend Serializable & Externalizable. • KryoSerializer: register your custom classes. • Mind where your code is actually being run (driver vs executors). • Take special care with JodaTime.
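
    A minimal sketch of switching to Kryo and registering application classes; Event is a hypothetical class standing in for whatever travels through your shuffles and broadcasts:

      import org.apache.spark.SparkConf

      case class Event(id: Long, payload: String) // hypothetical application class

      val conf = new SparkConf()
        .setAppName("kryo-sketch")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(Array(classOf[Event]))
      // Unregistered classes still work with Kryo, but pay for the full class name on every record.
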
  29. Tuning: Garbage Collector, blockInterval, Partitioning, Storage

  30. Tuning: Garbage Collector • Matters for applications that put heavy pressure on memory. • GC strategies: Concurrent Mark Sweep (CMS) GC, ParallelOld GC, Garbage-First (G1) GC. • Tuning steps: review your logic and object management, try Garbage-First, activate and inspect the GC logs. Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
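
    One possible way, following the Databricks post referenced above, to enable Garbage-First and GC logging on the executors; the exact flags depend on your JVM version and heap size:

      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
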
  31. Tuning: blockInterval
      blockInterval = (bi * consumers) / (pf * sc)
      • bi: batch interval time in milliseconds.
      • consumers: number of streaming consumers (receivers).
      • CAT: total number of cores available to the application.
      • pf (partitionFactor): number of partitions per core.
      • sc (sparkCores): CAT - consumers.
  32. blockInterval: example • batchIntervalMillis = 600,000 • consumers = 20

    • CAT = 120 • sparkCores = 120 - 20 = 100 • partitionFactor = 3 blockInterval = (bi * consumers) / (pf * sc) = (600,000 * 20) / (3 * 100) = 40,000
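
    Applying the value computed above (40,000 ms) via the standard spark.streaming.blockInterval property (default 200 ms):

      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .setAppName("block-interval-sketch")
        .set("spark.streaming.blockInterval", "40000ms")
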
  33. Tuning: Partitioning
      partitions = consumers * bi / blockInterval
      • consumers: number of streaming consumers (receivers).
      • bi: batch interval time in milliseconds.
      • blockInterval: interval used to split received data into blocks before they are stored in Spark.
  34. Partitioning: example • batchIntervalMillis = 600,000 • consumers = 20 • blockInterval = 40,000
      partitions = consumers * bi / blockInterval = 20 * 600,000 / 40,000 = 300
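
    A quick, hedged way to verify the partition count per micro-batch, assuming stream is any of the DStreams from the sketches above; for receiver-based streams it should match consumers * bi / blockInterval:

      stream.foreachRDD { rdd =>
        println(s"partitions in this batch: ${rdd.partitions.length}")
      }
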
  35. Tuning: Storage • Default (MEMORY_ONLY) • MEMORY_ONLY_SER with Serialization Library

    • MEMORY_AND_DISK & DISK_ONLY • Replicated _2 • OFF_HEAP (Tachyon/Alluxio)
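
    A small sketch of overriding the storage level on a DStream (stream is again a placeholder from the earlier sketches):

      import org.apache.spark.storage.StorageLevel

      // Serialized storage trades some CPU for a much smaller memory footprint;
      // the _2 variants keep a second replica, MEMORY_AND_DISK spills to disk when memory fills up.
      stream.persist(StorageLevel.MEMORY_ONLY_SER)
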
  36. Where to find more information? Spark Official Documentation Databricks Blog

    Databricks Spark Knowledge Base Spark Notebook - By Andy Petrella Databricks YouTube Channel
  37. QUESTIONS Fede Fernández @fede_fdz fede.f@47deg.com Fran Pérez @FPerezP fran.p@47deg.com Thanks!