Slide 1

Stream Processing
Dharmesh Kakadia

Slide 2

whoami
• Applied Scientist/Software Engineer at MobileDataLabs (an acquired team inside Microsoft), building an AI/analytics platform
• Spent a couple of years with Microsoft Research
• Spent a couple of years with Azure HDInsight
• Among other things, author of the book “Apache Mesos Essentials”
• Opinions are mine and biased
• You can find me as @dharmeshkakadia everywhere

Slide 3

Concepts
• Events – a changelog of the real world
• Event time – the time when the event happened
• Processing time – the time at which the event is seen by the streaming system
• Kinds of processing:
  • Transformations
  • Aggregations
• Window – a spread of events across time

Image source: https://www.slideshare.net/ConfluentInc/fundamentals-of-stream-processing-with-apache-beam-tyler-akidau-frances-perry
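To make the event-time/processing-time distinction concrete, here is a minimal sketch in plain Python (not from the slides; the event data and the 10-second tumbling window size are illustrative assumptions):

```python
# Each event carries its own event time: the moment it happened in the real
# world. Processing time is whenever the streaming system happens to see it,
# so events can arrive in a different order than they occurred.
events = [
    {"user": "a", "event_time": 3},   # seconds since some epoch (illustrative)
    {"user": "b", "event_time": 14},
    {"user": "a", "event_time": 17},
]

WINDOW_SIZE = 10  # tumbling windows of 10 seconds, an assumed configuration


def window_of(event_time, size=WINDOW_SIZE):
    """Assign an event to a tumbling window [start, end) based on event time."""
    start = (event_time // size) * size
    return (start, start + size)


# Group events into windows by their event time, regardless of arrival order.
windows = {}
for e in events:
    windows.setdefault(window_of(e["event_time"]), []).append(e["user"])

# → {(0, 10): ['a'], (10, 20): ['b', 'a']}
```

Windowing by event time (rather than processing time) is what makes the result stable even when events arrive late or out of order.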

Slide 4

Operational Concepts
• Checkpointing – tracks the progress of:
  • Data consumption, aka offset tracking
  • Intermediate state
• Watermarking – defines how long to wait for late events. Events arriving after the watermark are deemed too late to be processed/important and are discarded.
• Watermark delay is the trailing gap between the maximum event time seen so far and the watermark: watermark = max event time − watermark delay
• Windows older than the watermark are automatically dropped, which allows the state to stay bounded.
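The watermark arithmetic above can be sketched in a few lines of plain Python (the function names and the 5-second delay are illustrative, not Spark's API):

```python
WATERMARK_DELAY = 5  # the trailing gap, an assumed configuration value


def update_watermark(max_event_time_seen, new_event_time):
    """Advance the max event time and derive the watermark from it."""
    max_event_time = max(max_event_time_seen, new_event_time)
    watermark = max_event_time - WATERMARK_DELAY
    return max_event_time, watermark


max_seen, wm = 0, -WATERMARK_DELAY
accepted, dropped = [], []
for t in [10, 12, 4, 11, 3]:       # event times, possibly out of order
    if t < wm:
        dropped.append(t)          # older than the watermark: too late
    else:
        accepted.append(t)
        max_seen, wm = update_watermark(max_seen, t)

# After seeing t=12 the watermark is 12-5=7, so the late events at t=4
# and t=3 are discarded, and state for windows below 7 can be freed.
```

This is exactly why the watermark bounds state: once the watermark passes a window's end, no event can ever be added to it, so the window can be finalized and forgotten.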

Slide 5

Event-time watermarks: how late is too late? Image source : https://www.slideshare.net/databricks/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-with-tathagata-das

Slide 6

Tradeoffs
• Latency – how much delay is acceptable?
• Semantics:
  • At-most-once – some events might be lost, i.e., never processed.
  • Exactly-once
  • At-least-once – duplicates are possible.
• Scale*
  • Data size/speed
• Cost
  • Resource utilization
• *Scale is not really an option in practice
• Tradeoffs in normal operation vs. the worst case
• If you had to choose between higher throughput and lower latency, which would you choose?
• How about between lower latency and at-least-once?
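A common way to live with at-least-once delivery is to make the consumer idempotent. A toy sketch in plain Python (the event IDs and amounts are hypothetical):

```python
def process_at_least_once(deliveries):
    """Deduplicate by event ID so redelivered events have no extra effect.

    The upstream may redeliver events (at-least-once semantics), so we track
    the IDs we have already applied; duplicates become no-ops, giving
    effectively-exactly-once results at the cost of keeping dedup state.
    """
    seen_ids = set()
    total = 0
    for event_id, amount in deliveries:
        if event_id in seen_ids:
            continue               # duplicate delivery: skip it
        seen_ids.add(event_id)
        total += amount
    return total


# The event with ID "e2" is delivered twice but counted only once.
deliveries = [("e1", 5), ("e2", 7), ("e2", 7), ("e3", 1)]
# process_at_least_once(deliveries) → 13
```

The tradeoff reappears here: the dedup state itself must be bounded (e.g., expired with a watermark-like rule), otherwise it grows forever.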

Slide 7

Spark Streaming
• Micro-batching
• Proven at scale
• Exactly-once processing
• Support for multiple languages
• Support for multiple cluster managers – Mesos, Kubernetes, YARN, Standalone
• A wide range of data sources
• Unified API (batch and streaming)
• Schema support
• …

Image source: https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html

Slide 8

Spark Streaming Continuous Processing mode
• Introduced in Spark 2.3. Enabled with .trigger(continuous = "5 seconds")
• An at-least-once, end-to-end low-latency execution mode.

Image source: https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
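As a sketch, a continuous-mode pipeline in PySpark might look like the following (the broker address, topic names, and checkpoint path are placeholders; this assumes a running Spark ≥ 2.3 with the Kafka connector and an actual Kafka cluster, so it is a configuration sketch rather than something runnable standalone):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

# Continuous mode supports map-like operations with Kafka sources/sinks;
# aggregations are not supported in this mode (see the next slide).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
          .option("subscribe", "input-topic")                # placeholder
          .load())

query = (events.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
         .option("topic", "output-topic")                    # placeholder
         .option("checkpointLocation", "/tmp/ckpt")          # placeholder
         .trigger(continuous="5 seconds")                    # epoch interval
         .start())
```

Note that `continuous="5 seconds"` is the checkpoint epoch interval, not a batch interval — records flow through continuously, which is what gives the millisecond-level latency.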

Slide 9

Event markers
• Uses a Chandy-Lamport-style algorithm for low-overhead checkpointing. Special “epoch marker” records are injected into the input data stream of every task. When a task encounters a marker, it asynchronously reports the last offset it processed to the driver. Once the driver receives the offsets from all the tasks writing to the sink, it writes them to the write-ahead log. Since checkpointing is completely asynchronous, the tasks can continue uninterrupted and provide consistent millisecond-level latencies.
• Also used in Kafka.
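A toy simulation of the marker flow in plain Python (the task/driver names are illustrative, not Spark internals):

```python
MARKER = object()  # sentinel standing in for an injected epoch marker


def run_task(partition, reported_offsets, task_id):
    """Process one task's records; on a marker, report the last offset seen.

    In the real system the report to the driver is asynchronous and
    processing never pauses; here we just record it and keep going, which
    preserves the key property: no stop-the-world checkpointing.
    """
    offset = -1
    out = []
    for record in partition:
        if record is MARKER:
            reported_offsets[task_id] = offset  # "report to the driver"
            continue
        offset += 1
        out.append(record)
    return out


reported = {}
run_task(["a", "b", MARKER, "c"], reported, task_id=0)
run_task(["x", MARKER, "y", "z"], reported, task_id=1)

# Once the driver has an offset from every task, it writes {0: 1, 1: 0}
# to the write-ahead log as a consistent epoch boundary.
```

The marker cuts a consistent snapshot across all partitions without ever synchronizing the tasks, which is the point of the Chandy-Lamport construction.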

Slide 10

Spark Streaming operational aspects
• Minimal code change between batch and streaming code
• No code change between micro-batch and continuous mode
• Checkpoints are stored as JSON – good for manual debugging and/or recovery
• Streaming statistics are available on the Spark UI: timelines and histograms (input rate, scheduling delay, processing time)
• Continuous mode is experimental:
  • Aggregations are not supported yet.
  • No retry on task failures; manual restart from a checkpoint.
  • Uses DataSourceV2, which operates at the record level with the new get() and next() APIs.
  • Mainly designed around Kafka currently.

Image source: https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
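Since the slide notes that checkpoints are plain JSON, inspecting one is straightforward. A sketch (the payload layout shown is an illustrative stand-in, not Spark's exact on-disk format):

```python
import json

# Hypothetical checkpoint payload: topic -> partition -> committed offset.
# Spark's real checkpoint files differ in layout, but they are also plain
# JSON, which is what makes manual inspection and recovery practical.
checkpoint_text = '{"input-topic": {"0": 1042, "1": 998}}'


def latest_offsets(text):
    """Parse a checkpoint and return per-(topic, partition) offsets."""
    data = json.loads(text)
    return {(topic, int(part)): off
            for topic, parts in data.items()
            for part, off in parts.items()}

# latest_offsets(checkpoint_text)
# → {('input-topic', 0): 1042, ('input-topic', 1): 998}
```

Manual recovery then amounts to editing the offsets in the checkpoint and restarting the query from it — exactly the workflow a binary checkpoint format would make painful.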

Slide 11

Larger picture
• A streaming system does not run in isolation. Be careful not to generalize the guarantees of one component into guarantees for the end-to-end system.
• Consider:
  • Community
  • Hiring & familiarity
  • Next year’s problems
  • Ops aspects:
    • Isolation
    • Password rotation
    • Secret management
    • Monitoring
    • Alerting
    • CI/CD
    • Upgrades
    • Audit

Slide 12

Our Data Platform

Slide 13

Thanks! Questions/Discussion