Michael Spector

February 10, 2016

Slides from the meetup:
Spark Streaming and "Exactly Once" Semantics Revisited

Link: http://www.meetup.com/Big-Data-Israel/events/228080600/



  1. AppsFlyer as Marketing Platform
    • Attribution
    • Statistics: clicks, installs, in-app events, launches, uninstalls, etc.
    • Lifetime value
    • Retargeting
    • Fraud detection
    • Prediction
    • A/B testing
    • etc.
  2. AppsFlyer Technology
    • ~7B events / day
    • Hundreds of machines in Amazon
    • Tens of micro-services
    [Diagram labels: Apache Kafka, services, DB, Amazon S3, MongoDB, Redshift, Druid]
  3. Stream Processing
    Minimize latency between data ingestion and insights
    Usages
    • Real-time dashboard
    • Fraud prevention
    • Ad bidding
    • etc.
  4. Stream Processing Frameworks
    Key Differences
    • Latency
    • Windowing support
    • Delivery semantics
    • State management
    • Ease of use of the API
    • Programming language support
    • Community support
    • etc.
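  5. Apache Spark
    Spark Driver:
        // Word count; `sc` is the SparkContext (predefined in spark-shell)
        val textFile = sc.textFile("hdfs://...")
        val counts = textFile
          .flatMap(line => line.split(" "))   // split each line into words
          .map(word => (word, 1))             // pair each word with a count of 1
          .reduceByKey(_ + _)                 // sum the counts per word
        counts.saveAsTextFile("hdfs://...")
    [Diagram labels: Spark Driver, Cluster Manager, two Worker Nodes, each with an Executor running Tasks]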
  6. Apache Spark
    [Diagram: External World → read → RDD → transform → RDD → action → External World]
  7. Streaming in Spark
    Advantages
    • Reuse existing infrastructure
    • Rich API
    • Straightforward windowing
    • It’s easier to implement “exactly once”
    Disadvantages
    • Latency
    [Diagram labels: Input Stream, Micro batches, Spark Engine, Processed data]
  8. Windowing in Spark Streaming
    • Window length and sliding interval must be multiples of the batch interval
    • Possible usages
      ◦ Finding the top N elements during the last M period of time
      ◦ Pre-aggregation of data prior to inserting into the DB
      ◦ etc.
    [Diagram labels: DStream, Window length, Sliding interval]
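
The multiple-of-batch-interval rule follows from windows being assembled out of whole micro-batches. The following is a minimal plain-Scala sketch of that idea, with no Spark dependency; `WindowSketch`, `slidingWindows` and the parameter names are hypothetical and only mimic the behaviour, they are not Spark's API.

```scala
object WindowSketch {
  // Groups consecutive micro-batches into sliding windows.
  // All durations are in seconds; window and slide must be whole multiples of the batch interval.
  def slidingWindows[A](batches: Seq[A], batchSec: Int, windowSec: Int, slideSec: Int): Seq[Seq[A]] = {
    require(windowSec % batchSec == 0, "window length must be a multiple of the batch interval")
    require(slideSec % batchSec == 0, "sliding interval must be a multiple of the batch interval")
    val batchesPerWindow = windowSec / batchSec
    val batchesPerSlide = slideSec / batchSec
    // sliding(size, step) may yield a trailing partial group; keep only full windows
    batches.sliding(batchesPerWindow, batchesPerSlide).filter(_.size == batchesPerWindow).toList
  }
}
```

For example, with a 1-second batch interval, a 3-second window and a 2-second slide, each window holds three batches and consecutive windows share one batch.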
  9. Achieving “Exactly once”
    • Producer
      ◦ Doesn’t duplicate messages
    • Stream processor
      ◦ Tracking state (checkpointing)
      ◦ Resilient components
    • Consumer
      ◦ Reads only new messages
    “Easy” way:
    • Message deduplication based on some ID
    • Idempotent output destination
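
The two “easy” options can be illustrated in plain Scala. This is only a sketch under assumptions: the `Message` shape, the in-memory `IdempotentSink` and the `dedupe` helper are hypothetical names, and a real destination would be a database or key-value store rather than a map.

```scala
object DedupSketch {
  // Hypothetical message shape: `id` is assumed to be unique per logical event.
  final case class Message(id: String, payload: String)

  // Idempotent sink: writes are keyed by message ID, so replaying a message
  // overwrites the same entry and leaves the store unchanged.
  final class IdempotentSink {
    private val store = scala.collection.mutable.Map.empty[String, String]
    def write(m: Message): Unit = store.update(m.id, m.payload)
    def size: Int = store.size
  }

  // Deduplication: keep the first occurrence of each ID, preserving arrival order.
  def dedupe(messages: Seq[Message]): Seq[Message] = {
    val seen = scala.collection.mutable.Set.empty[String]
    messages.filter(m => seen.add(m.id)) // add returns false if the ID was already present
  }
}
```

Either approach tolerates redelivery: the first drops repeated messages before processing, the second makes repeated writes harmless.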
  10. Stream Checkpointing
    https://en.wikipedia.org/wiki/Snapshot_algorithm
    • Barriers are injected into the data stream.
    • Once an intermediate step sees barriers from all of its input streams, it outputs a barrier to all of its outgoing streams.
    • Once all sink operators see the barrier for a snapshot, they acknowledge the snapshot, and it is considered committed.
    • Multiple barriers can be present in the stream at once.
    • Operators store their state in external storage.
    • On failure, every operator's state falls back to the latest complete snapshot, and the data source falls back to the position recorded with that snapshot.
    [Diagram labels: storage, checkpoint]
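
The barrier rule for an intermediate step can be sketched as a small state machine. This is a simplified plain-Scala illustration, not the code of any particular framework: `BarrierAligner` and `onBarrier` are hypothetical names, and the sketch assumes only one snapshot is being aligned at a time.

```scala
object BarrierSketch {
  // Tracks which input channels have delivered the barrier for the current snapshot.
  final class BarrierAligner(numInputs: Int) {
    private val seen = scala.collection.mutable.Set.empty[Int]

    // Returns true once the barrier has arrived on every input, i.e. the point at which
    // the operator would store its state and emit the barrier on all outgoing streams.
    def onBarrier(input: Int): Boolean = {
      require(input >= 0 && input < numInputs, "unknown input channel")
      seen += input
      if (seen.size == numInputs) { seen.clear(); true } else false
    }
  }
}
```

An operator with two inputs therefore stays silent on the first barrier and forwards only after the second input has caught up.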
  11. Micro-batch Checkpointing
    [Diagram: receive → process → state, repeated for each micro-batch; label: Unit of fault tolerance]
        while (true) {
          // 1. receive next batch of data
          // 2. compute next stream and state
        }
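
The loop above can be modelled as a fold in which each micro-batch turns the previous state into the next one. The sketch below is plain Scala with hypothetical names (`MicroBatchSketch`, `step`, `run`) and uses a running count per key as an example state; it is meant to show why a micro-batch is a natural unit of recovery, not how Spark implements it.

```scala
object MicroBatchSketch {
  type State = Map[String, Long] // running count per key

  // One micro-batch step: the new state depends only on the previous state and the batch.
  def step(state: State, batch: Seq[String]): State =
    batch.foldLeft(state)((s, key) => s.updated(key, s.getOrElse(key, 0L) + 1L))

  // Runs all batches and returns the state after each one (the "checkpoints").
  def run(initial: State, batches: Seq[Seq[String]]): Seq[State] =
    batches.scanLeft(initial)(step).tail
}
```

Because `step` is deterministic, recomputing a failed batch from the previous checkpoint yields the same state as the original run would have.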
  12. Resilience in Spark Streaming
    All Spark components must be resilient!
    • Driver application process
    • Master process
    • Worker process
    • Executor process
    • Receiver thread
    • Worker node
    [Diagram labels: Driver, Master, two Worker Nodes, each with an Executor running Tasks]
  13. Driver Resilience
    • Client mode
      ◦ The driver application runs inside the “spark-submit” process.
      ◦ If this process dies, the entire application is killed.
    • Cluster mode
      ◦ The driver application runs on one of the worker nodes.
      ◦ The “--supervise” option makes the driver restart on a different worker node.
    • Running through Marathon
      ◦ Marathon can restart failed applications automatically.
    [Diagram labels: Driver, Master, two Worker Nodes, each with an Executor running Tasks]
  14. Master Resilience
    • Single master
      ◦ On master failure, the entire application is killed.
    • Multi-master mode
      ◦ A standby master is elected active.
      ◦ Worker nodes automatically register with the new master.
      ◦ Leader election is done via ZooKeeper.
    [Diagram labels: Driver, Master, two Worker Nodes, each with an Executor running Tasks]
  15. Worker Resilience
    • Worker process
      ◦ When it fails, all child processes (driver or executor) are killed.
      ◦ A new worker process is launched automatically.
    • Executor process
      ◦ Restarted on failure by the parent worker process.
    • Receiver thread
      ◦ Runs inside the Executor process, so it behaves the same as the Executor.
    • Worker node
      ◦ Failure of a worker node behaves the same as killing all its components individually.
    [Diagram labels: Driver, Master, two Worker Nodes, each with an Executor running Tasks]
  16. Checkpointing
    • Checkpointing helps recover from driver failure.
    • Stores the computation graph in fault-tolerant storage (like HDFS or S3).
    • What is saved as metadata
      ◦ Metadata of queued but not yet processed batches
      ◦ Stream operations (code)
      ◦ Configuration
    • Disadvantages
      ◦ Frequent checkpointing reduces throughput.
      ◦ As the code itself is saved, an upgrade is not possible without removing the checkpoints.
  17. Write Ahead Log
    • Synchronously saves received data to fault-tolerant storage.
    • Helps recover blocks that were received but not yet committed.
    • Disadvantages
      ◦ Additional storage is required.
      ◦ Reduced throughput.
    [Diagram labels: input stream, Receiver, Executor]
  18. Problems with Checkpointing and WAL
    • Data can be lost even when using checkpointing (batches held in memory will be lost on driver failure).
    • Checkpointing and WAL prevent data loss, but do not provide “exactly once” semantics.
    • If the receiver fails before updating offsets in ZooKeeper, we are in trouble.
    • In this case data will be re-read both from Kafka and from the WAL.
    • Still not exactly once!
  19. The Solution
    • Don’t use receivers; read directly from the input stream instead.
    • The driver instructs executors what range to read from a stream (the stream must be rewindable).
    • The read range is attached to the batch itself.
    • Example (Kafka direct stream):
      Driver (Application, Streaming Context)
      1. Periodically query the latest offsets for topics & partitions
      2. Calculate offset ranges for the next batch
      3. Schedule the next micro-batch job
      Executor
      4. Consume data for the calculated offsets
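
Step 2 of the direct approach, calculating offset ranges, can be sketched in plain Scala. The `ReadRange` case class and `nextRanges` function below are hypothetical stand-ins written for illustration (they are not the Spark Kafka integration classes), and the sketch assumes that a partition never seen before starts at offset 0.

```scala
object OffsetRangeSketch {
  // Hypothetical per-partition read range covering offsets [fromOffset, untilOffset).
  final case class ReadRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  // Builds the ranges for the next batch from the last consumed offsets and the latest
  // offsets reported by the stream, both keyed by (topic, partition).
  // ASSUMPTION: partitions without a consumed offset are read from offset 0.
  def nextRanges(consumed: Map[(String, Int), Long], latest: Map[(String, Int), Long]): Seq[ReadRange] =
    latest.toSeq.sortBy(_._1).collect {
      case ((topic, partition), until) if until > consumed.getOrElse((topic, partition), 0L) =>
        ReadRange(topic, partition, consumed.getOrElse((topic, partition), 0L), until)
    }
}
```

Since each range is fully determined by two offsets, a failed batch can be replayed by reading exactly the same range again, which is why the stream has to be rewindable.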
  20. The Problem
    • Events counting.
    • Group by different sets of dimensions.
    • Have a pre-aggregation layer that reduces load on the DB during spikes.
    DB table:
    app_id       event_name       country  count
    com.app.bla  FIRST_LAUNCH     US       152
    com.app.bla  purchase         IL       10
    com.app.jo   custom_inapp_20  US       45
  21. Transactional Events Aggregator
    • Based on an SQL database
    • Store Kafka partition offsets in the DB
    • Increment event counters in a transaction based on current and stored offsets.
    Flow (Driver, Executors, SQL DB):
    1. Read last Kafka partitions and their offsets from the DB
    2. Create direct Kafka stream based on read partitions and offsets
    3. Consume events from Kafka
    4. Aggregate events
    5. Upsert event counters along with current offsets in a transaction
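
The effect of step 5 can be modelled in plain Scala. In this sketch an immutable `Db` value stands in for the SQL database, and returning a new `Db` stands in for committing one transaction; `Db`, `Batch` and `commit` are hypothetical names, and the rule that a batch is applied only when it starts at the stored offset is an assumption about how "current and stored offsets" are compared.

```scala
object TransactionalSketch {
  type Key = (String, String, String) // (app_id, event_name, country)

  // Counters and Kafka offsets live in the same store so they can change together.
  final case class Db(counters: Map[Key, Long], offsets: Map[Int, Long])

  // A pre-aggregated batch for one partition covering offsets [fromOffset, untilOffset).
  final case class Batch(partition: Int, fromOffset: Long, untilOffset: Long, counts: Map[Key, Long])

  // ASSUMPTION: a batch is applied only if it starts exactly where the stored offset ends;
  // a replayed batch is ignored. Counters and offsets are updated in one step.
  def commit(db: Db, b: Batch): Db =
    if (db.offsets.getOrElse(b.partition, 0L) != b.fromOffset) db
    else Db(
      b.counts.foldLeft(db.counters) { case (acc, (k, n)) => acc.updated(k, acc.getOrElse(k, 0L) + n) },
      db.offsets.updated(b.partition, b.untilOffset)
    )
}
```

Committing the same batch twice leaves the counters unchanged, because the second attempt no longer matches the stored offset.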
  22. Snapshotting Events Aggregator
    Aggregator Application (Driver, Executors, S3):
    1. Read last Kafka partitions and their offsets from S3
    2. Create direct Kafka stream based on read partitions and offsets
    3. Consume events from Kafka
    4. Aggregate events
    5. Store processed data and Kafka offsets under /data/ts=<timestamp> and /offsets/ts=<timestamp> respectively
  23. Snapshotting Events Aggregator
    Loader Application (Driver, Executors, S3, Cassandra):
    1. Find last committed timestamp
    2. Read data for the last timestamp from /data/ts=<timestamp>
    4. Aggregate events by different dimensions, and split to cubes
    5. Increment counters in different cubes
    6. Delete offsets and data for the timestamp (/offsets/ts=<timestamp>, /data/ts=<timestamp>)
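
The slides do not say how the loader decides that a timestamp is committed. One possible reading, shown here purely as an assumption, is that a timestamp counts as committed once both its data and its offsets have been written. The plain-Scala sketch below works on a list of path strings; `SnapshotSketch` and `lastCommitted` are hypothetical names and no S3 client is involved.

```scala
object SnapshotSketch {
  private val Data = "/data/ts=(\\d+)".r
  private val Offsets = "/offsets/ts=(\\d+)".r

  // ASSUMPTION: a timestamp is committed once both /data/ts=<timestamp> and
  // /offsets/ts=<timestamp> exist. Returns the largest such timestamp, if any.
  def lastCommitted(paths: Seq[String]): Option[Long] = {
    val data = paths.collect { case Data(ts) => ts.toLong }.toSet
    val offsets = paths.collect { case Offsets(ts) => ts.toLong }.toSet
    val both = data.intersect(offsets)
    if (both.isEmpty) None else Some(both.max)
  }
}
```

Under this reading, a snapshot whose data was written but whose offsets are missing is skipped, so a half-written snapshot is never loaded.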
  24. Deployment
    • We use Mesos
      ◦ Master HA for free.
      ◦ Marathon keeps the Spark Streaming application alive.
    Tips
    • Read carefully
      ◦ http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
    • Inspect, re-configure, retry
    • Turn off Spark dynamicity
    • Preserve data locality
    • Find balance between cores/batch interval/block interval
    • Processing time must be less than batch interval
  25. Right Now
    • Processes ~50M events a day
    • Reduces the stream in two sliding windows:
      1. Last 5 seconds (“now”)
      2. Last 10 minutes (“recent”)
    • At most once semantics
  26. Right Now
    Why Spark?
    • Experienced with Spark
    • Convenient Clojure wrappers (Sparkling, Flambo)
    • Documentation and community
  27. Right Now
    In Production
    • 3 m3.xlarge machines for the workers (4 cores each)
      spark.default.parallelism=10
    • Lesson learned: foreachRDD and foreachPartition