
Spark Streaming and "Exactly Once" Semantics Revisited

Michael Spector, AppsFlyer
February 10, 2016

Slides from the meetup: http://www.meetup.com/Big-Data-Israel/events/228080600/

Transcript

  1. Appsflyer as a Marketing Platform
     • Attribution
     • Statistics: clicks, installs, in-app events, launches, uninstalls, etc.
     • Lifetime value
     • Retargeting
     • Fraud detection
     • Prediction
     • A/B testing
     • etc.
  2. Appsflyer Technology
     • ~7B events / day
     • Hundreds of machines in Amazon
     • Tens of micro-services
     (Diagram: micro-services communicating through Apache Kafka, persisting to a DB, Amazon S3, MongoDB, Redshift and Druid.)
  3. Stream Processing
     Minimize latency between data ingestion and insights.
     Usages:
     • Real-time dashboard
     • Fraud prevention
     • Ad bidding
     • etc.
  4. Stream Processing Frameworks
     Key differences:
     • Latency
     • Windowing support
     • Delivery semantics
     • State management
     • Ease of use of the API
     • Programming language support
     • Community support
     • etc.
  5. Apache Spark
     Driver program (word count):

         val textFile = sc.textFile("hdfs://...")
         val counts = textFile
           .flatMap(line => line.split(" "))
           .map(word => (word, 1))
           .reduceByKey(_ + _)
         counts.saveAsTextFile("hdfs://...")

     (Diagram: the driver talks to the cluster manager, which schedules tasks on executors running on worker nodes.)
  6. Apache Spark
     (Diagram: data is read from the external world into an RDD, flows through a chain of RDD transformations, and is written back to the external world by an action.)
  7. Streaming in Spark
     Advantages:
     • Reuse of existing infrastructure
     • Rich API
     • Straightforward windowing
     • It's easier to implement "exactly once"
     Disadvantages:
     • Latency
     (Diagram: the input stream is split into micro-batches that the Spark engine turns into processed data.)
  8. Windowing in Spark Streaming
     • Window length and sliding interval must be multiples of the batch interval (see the sketch below).
     • Possible usages:
       ◦ Finding the top N elements during the last M period of time
       ◦ Pre-aggregation of data prior to inserting it into a DB
       ◦ etc.
     (Diagram: a window of a given length slides along the DStream by the sliding interval.)
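
     A minimal windowed-count sketch, assuming a 5-second batch interval and a hypothetical socket source; the 10-minute window and 5-second slide are both multiples of the batch interval:

         import org.apache.spark.SparkConf
         import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

         val conf = new SparkConf().setAppName("WindowedCounts")
         val ssc  = new StreamingContext(conf, Seconds(5))       // batch interval

         // Hypothetical source; any DStream works the same way.
         val events = ssc.socketTextStream("localhost", 9999)

         val counts = events
           .map(event => (event, 1))
           .reduceByKeyAndWindow(_ + _, Minutes(10), Seconds(5)) // window length, sliding interval

         counts.print()
         ssc.start()
         ssc.awaitTermination()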
  9. Achieving "Exactly Once"
     • Producer
       ◦ Doesn't duplicate messages
     • Stream processor
       ◦ Tracks state (checkpointing)
       ◦ Resilient components
     • Consumer
       ◦ Reads only new messages
     The "easy" way (see the sketch below):
     • Message deduplication based on some ID
     • Idempotent output destination
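
     For illustration, one way to get an idempotent destination is to key writes by a message ID so that replays become no-ops. A hedged sketch using a PostgreSQL-style upsert over JDBC (table and column names are hypothetical):

         import java.sql.DriverManager

         // Hypothetical table: events(id TEXT PRIMARY KEY, payload TEXT).
         // A replayed message with the same id is silently ignored.
         val conn = DriverManager.getConnection("jdbc:postgresql://db/events")
         val stmt = conn.prepareStatement(
           "INSERT INTO events (id, payload) VALUES (?, ?) ON CONFLICT (id) DO NOTHING")

         def writeIdempotent(id: String, payload: String): Unit = {
           stmt.setString(1, id)
           stmt.setString(2, payload)
           stmt.executeUpdate()
         }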
 10. Stream Checkpointing
     https://en.wikipedia.org/wiki/Snapshot_algorithm
     • Barriers are injected into the data stream.
     • Once an intermediate step sees barriers from all of its input streams, it emits a barrier on all of its outgoing streams.
     • Once all sink operators have seen the barrier for a snapshot, they acknowledge it, and the snapshot is considered committed.
     • Multiple barriers can be in flight in the stream at once.
     • Operators store their state in external storage.
     • On failure, all operators fall back to the latest complete snapshot, and the data source rewinds to the position recorded with that snapshot.
 11. Micro-batch Checkpointing
     The micro-batch is the unit of fault tolerance:

         while (true) {
           // 1. Receive the next batch of data.
           // 2. Compute the next stream and state.
         }

     (Diagram: receive → process → state, repeated for each micro-batch.)
 12. Resilience in Spark Streaming
     All Spark components must be resilient!
     • Driver application process
     • Master process
     • Worker process
     • Executor process
     • Receiver thread
     • Worker node
     (Diagram: driver and master above worker nodes, each node running an executor with tasks.)
 13. Driver Resilience
     • Client mode
       ◦ The driver application runs inside the "spark-submit" process.
       ◦ If this process dies, the entire application is killed.
     • Cluster mode
       ◦ The driver application runs on one of the worker nodes.
       ◦ The "--supervise" option makes the driver restart on a different worker node (see the sketch below).
     • Running through Marathon
       ◦ Marathon can restart failed applications automatically.
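
     For reference, a minimal cluster-mode submission with supervision; the master URL, class and jar names are placeholders:

         spark-submit \
           --master spark://master:7077 \
           --deploy-mode cluster \
           --supervise \
           --class com.example.StreamingApp \
           streaming-app.jar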
 14. Master Resilience
     • Single master
       ◦ The entire application is killed.
     • Multi-master mode (see the sketch below)
       ◦ A standby master is elected active.
       ◦ Worker nodes automatically register with the new master.
       ◦ Leader election via ZooKeeper.
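
     In standalone deployments, multi-master recovery through ZooKeeper is typically enabled with settings along these lines (the ZooKeeper hosts are placeholders):

         # spark-env.sh on each master:
         SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
           -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"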
 15. Worker Resilience
     • Worker process
       ◦ When it fails, all of its child processes (driver or executor) are killed.
       ◦ A new worker process is launched automatically.
     • Executor process
       ◦ Restarted on failure by the parent worker process.
     • Receiver thread
       ◦ Runs inside the executor process, so it behaves the same as the executor.
     • Worker node
       ◦ Failure of a worker node behaves the same as killing all of its components individually.
 16. Checkpointing
     • Checkpointing helps recover from driver failure (see the sketch below).
     • Stores the computation graph in a fault-tolerant place (like HDFS or S3).
     • What is saved as metadata:
       ◦ Metadata of queued but not yet processed batches
       ◦ Stream operations (code)
       ◦ Configuration
     • Disadvantages:
       ◦ Frequent checkpointing reduces throughput.
       ◦ As the code itself is saved, upgrading the application is not possible without removing the checkpoints.
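
     A minimal recovery sketch, assuming a hypothetical S3 checkpoint path; on restart, getOrCreate rebuilds the context (operations, configuration, pending batches) from the checkpoint instead of calling the factory function:

         import org.apache.spark.SparkConf
         import org.apache.spark.streaming.{Seconds, StreamingContext}

         val checkpointDir = "s3://bucket/checkpoints"  // hypothetical path

         def createContext(): StreamingContext = {
           val conf = new SparkConf().setAppName("CheckpointedApp")
           val ssc  = new StreamingContext(conf, Seconds(5))
           ssc.checkpoint(checkpointDir)
           // ... define the streaming computation here ...
           ssc
         }

         // Recovers from the checkpoint if one exists; otherwise builds a fresh context.
         val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
         ssc.start()
         ssc.awaitTermination()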
 17. Write-Ahead Log
     • Synchronously saves received data to fault-tolerant storage (see the sketch below).
     • Helps recover received but not yet committed blocks.
     • Disadvantages:
       ◦ Additional storage is required.
       ◦ Reduced throughput.
     (Diagram: the receiver inside the executor persists the input stream to the log.)
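
     Enabling the WAL is a single configuration switch; checkpointing must also be set up so the log has a fault-tolerant home:

         val conf = new SparkConf()
           .setAppName("WalApp")
           .set("spark.streaming.receiver.writeAheadLog.enable", "true")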
 18. Problems with Checkpointing and WAL
     • Data can be lost even when using checkpointing (batches held in memory are lost on driver failure).
     • Checkpointing and the WAL prevent data loss, but do not provide "exactly once" semantics.
     • If the receiver fails before updating offsets in ZooKeeper, we are in trouble: the data will be re-read from both Kafka and the WAL.
     • Still not exactly once!
 19. The Solution
     • Don't use receivers; read directly from the input stream instead.
     • The driver instructs the executors which range to read from the stream (the stream must be rewindable).
     • The read range is attached to the batch itself.
     • Example (Kafka direct stream; see the sketch below):
       1. The streaming context periodically queries the latest offsets for the topics & partitions.
       2. It calculates the offset ranges for the next batch.
       3. The driver schedules the next micro-batch job.
       4. The executors consume the data for the calculated offsets.
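
     A minimal direct-stream sketch against the spark-streaming-kafka 0.8 API that was current at the time (broker addresses and the topic name are placeholders); note that each batch carries its own offset ranges:

         import kafka.serializer.StringDecoder
         import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

         val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
         val topics      = Set("events")

         val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
           ssc, kafkaParams, topics)

         stream.foreachRDD { rdd =>
           // The exact offsets this batch was read from, available for committing downstream.
           val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
           // ... process the batch and store offsetRanges transactionally ...
         }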
 20. The Problem
     • Event counting.
     • Grouping by different sets of dimensions.
     • A pre-aggregation layer that reduces the load on the DB during spikes.

     app_id      | event_name      | country | count
     ------------+-----------------+---------+------
     com.app.bla | FIRST_LAUNCH    | US      | 152
     com.app.bla | purchase        | IL      | 10
     com.app.jo  | custom_inapp_20 | US      | 45
 21. Transactional Events Aggregator
     • Based on a SQL database.
     • Stores Kafka partition offsets in the DB.
     • Increments event counters in a transaction, based on the current and stored offsets.
     Flow (see the sketch below):
     1. Read the last Kafka partitions and their offsets from the DB.
     2. Create a direct Kafka stream based on the read partitions and offsets.
     3. Consume events from Kafka.
     4. Aggregate events.
     5. Upsert the event counters along with the current offsets in one transaction.
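
     A hedged sketch of step 5 (the schema, table and column names are hypothetical, PostgreSQL-style SQL). Because counters and offsets commit atomically, a failed batch rolls back entirely and is simply re-read from the stored offsets, so nothing is counted twice:

         import java.sql.Connection

         // Hypothetical schema:
         //   counters(app_id, event_name, country, count)
         //   offsets(topic, part, last_offset)
         def commitBatch(conn: Connection,
                         counts: Seq[((String, String, String), Long)],
                         offsets: Seq[(String, Int, Long)]): Unit = {
           conn.setAutoCommit(false)
           try {
             val upsert = conn.prepareStatement(
               "INSERT INTO counters (app_id, event_name, country, count) VALUES (?, ?, ?, ?) " +
               "ON CONFLICT (app_id, event_name, country) " +
               "DO UPDATE SET count = counters.count + EXCLUDED.count")
             for (((appId, event, country), n) <- counts) {
               upsert.setString(1, appId); upsert.setString(2, event)
               upsert.setString(3, country); upsert.setLong(4, n)
               upsert.executeUpdate()
             }
             val save = conn.prepareStatement(
               "INSERT INTO offsets (topic, part, last_offset) VALUES (?, ?, ?) " +
               "ON CONFLICT (topic, part) DO UPDATE SET last_offset = EXCLUDED.last_offset")
             for ((topic, part, offset) <- offsets) {
               save.setString(1, topic); save.setInt(2, part); save.setLong(3, offset)
               save.executeUpdate()
             }
             conn.commit()  // counters and offsets become visible atomically
           } catch {
             case e: Exception => conn.rollback(); throw e
           }
         }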
 22. Snapshotting Events Aggregator
     Aggregator application (see the sketch below):
     1. Read the last Kafka partitions and their offsets from S3.
     2. Create a direct Kafka stream based on the read partitions and offsets.
     3. Consume events from Kafka.
     4. Aggregate events.
     5. Store the processed data and the Kafka offsets under /data/ts=<timestamp> and /offsets/ts=<timestamp> respectively.
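
     A rough sketch of step 5, with a hypothetical bucket and helper; keying both outputs by the same timestamp lets the loader treat them as one unit of work (S3 access assumes a suitably configured Hadoop S3 filesystem):

         import org.apache.spark.rdd.RDD

         // Hypothetical helper: persists one micro-batch snapshot.
         def snapshot(ts: Long, aggregated: RDD[String], offsets: Seq[(String, Int, Long)]): Unit = {
           aggregated.saveAsTextFile(s"s3://bucket/data/ts=$ts")
           aggregated.sparkContext
             .parallelize(offsets.map { case (t, p, o) => s"$t,$p,$o" }, 1)
             .saveAsTextFile(s"s3://bucket/offsets/ts=$ts")
         }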
 23. Snapshotting Events Aggregator (cont.)
     Loader application:
     1. Find the last committed timestamp.
     2. Read the data for that timestamp from /data/ts=<timestamp>.
     3. Aggregate the events by different dimensions and split them into cubes.
     4. Increment the counters in the different Cassandra cubes.
     5. Delete the offsets and data for the timestamp (/offsets/ts=<timestamp>, /data/ts=<timestamp>).
 24. Deployment
     • We use Mesos:
       ◦ Master HA for free.
       ◦ Marathon keeps the Spark streaming application alive.
     Tips (see the sketch below):
     • Read carefully:
       http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
     • Inspect, re-configure, retry.
     • Turn off Spark dynamic allocation.
     • Preserve data locality.
     • Find the balance between cores, batch interval and block interval.
     • Processing time must be less than the batch interval.
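
     For illustration, the knobs behind these tips map to settings along these lines (the values are placeholders to tune, not recommendations):

         val conf = new SparkConf()
           .set("spark.dynamicAllocation.enabled", "false") // no dynamic resizing for streaming
           .set("spark.locality.wait", "1s")                // how long to wait for a data-local task slot
           .set("spark.streaming.blockInterval", "200ms")   // tasks per batch ≈ batch interval / block interval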
 25. Right Now
     • Processes ~50M events a day.
     • Reduces the stream in two sliding windows:
       1. Last 5 seconds ("now")
       2. Last 10 minutes ("recent")
     • At-most-once semantics.
 26. Right Now: Why Spark?
     • Experienced with Spark
     • Convenient Clojure wrappers (Sparkling, Flambo)
     • Documentation and community
 27. Right Now: In Production
     • 3 m3.xlarge machines for the workers (4 cores each)
     • spark.default.parallelism=10
     • Lesson learned: foreachRDD and foreachPartition (see the sketch below)
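
     The usual shape of that lesson: create expensive resources (e.g. DB connections) once per partition instead of once per record. A sketch where stream is any DStream, and createConnection and write are hypothetical helpers, not Spark APIs:

         stream.foreachRDD { rdd =>
           rdd.foreachPartition { records =>
             // Runs on the executor: one connection per partition, not per record.
             val conn = createConnection() // hypothetical helper
             try {
               records.foreach(record => write(conn, record)) // hypothetical write
             } finally {
               conn.close()
             }
           }
         }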