Stream Processing

Invited talk at @twitter on stream processing.

Other talks at http://dharmeshkakadia.github.io/talks

dharmeshkakadia

June 29, 2018

Transcript

  1. 2.

    whoami
    • Applied Scientist/Software Engineer, MobileDataLabs (acquired team inside Microsoft), team building an AI/Analytics platform
    • Spent a couple of years with Microsoft Research
    • Spent a couple of years with Azure HDInsight
    • Among other things, author of the book “Apache Mesos Essentials”
    • Opinions are mine and biased
    • You can find me as @dharmeshkakadia everywhere
  2. 3.

    Concepts
    • Events – changelog of the real world
    • EventTime – time when the event happened
    • ProcessingTime – time at which the event is seen by the streaming system
    • Kinds of processing
      • Transformations
      • Aggregations
    • Window - spread of the events across time
    Image source: https://www.slideshare.net/ConfluentInc/fundamentals-of-stream-processing-with-apache-beam-tyler-akidau-frances-perry
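The difference between event time and processing time can be shown with a toy example. The sketch below is plain Python rather than any streaming engine's API; the function name `tumbling_window_counts` and the sample events are made up for illustration. It groups events into fixed (tumbling) windows by their event time, so an event that arrives out of order still lands in the window where it happened.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per tumbling window, keyed by event time (not arrival order)."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Each event belongs to the window [window_start, window_start + window_size).
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)


# Hypothetical events as (event_time_seconds, payload), listed in the order the
# system sees them (processing-time order). The event with event time 7 arrives
# after the event with event time 12, i.e. it is late.
events = [(1, "a"), (4, "b"), (12, "c"), (7, "d"), (15, "e")]

counts = tumbling_window_counts(events, window_size=10)
print(counts)
```

Because the grouping key is derived from the event's own timestamp, the late event `(7, "d")` is counted in the window starting at 0, not in the window that was "current" when it arrived.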
  3. 4.

    Operational Concepts
    • Checkpointing - tracks progress of
      • Data consumption, aka offset tracking
      • Intermediate state
    • Watermarking - defines how long to wait for late events. Events that arrive later than that are deemed too late to be processed and are discarded.
    • Watermark delay = trailing gap between the max event time seen so far and the watermark
    • Windows that are older than the watermark are automatically dropped, which allows the state to be bounded.
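A minimal sketch of the watermark idea, in plain Python. This is a toy model, not Spark's implementation: the class name `WatermarkedWindowCounter` is hypothetical, and it assumes the watermark is recomputed on every event (a real engine may only advance it at batch or epoch boundaries). The watermark trails the maximum event time by a fixed delay; windows that end at or before the watermark are finalized and removed from state, and events for such windows are dropped.

```python
from collections import defaultdict


class WatermarkedWindowCounter:
    """Toy tumbling-window counter whose state is bounded by a watermark."""

    def __init__(self, window_size, watermark_delay):
        self.window_size = window_size
        self.watermark_delay = watermark_delay
        self.max_event_time = None
        self.open_windows = defaultdict(int)  # window_start -> count (live state)
        self.finalized = {}                   # window_start -> final count
        self.dropped = []                     # event times rejected as too late

    @property
    def watermark(self):
        # watermark = max event time seen so far - watermark delay
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.watermark_delay

    def process(self, event_time):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        wm = self.watermark
        window_start = (event_time // self.window_size) * self.window_size
        if window_start + self.window_size <= wm:
            # The window is already behind the watermark: too late, discard.
            self.dropped.append(event_time)
        else:
            self.open_windows[window_start] += 1
        # Finalize every window that is now entirely behind the watermark,
        # which keeps the amount of live state bounded.
        for start in sorted(self.open_windows):
            if start + self.window_size <= wm:
                self.finalized[start] = self.open_windows.pop(start)


counter = WatermarkedWindowCounter(window_size=10, watermark_delay=5)
for t in [1, 4, 12, 7, 16, 3, 21]:
    counter.process(t)

print(counter.finalized, counter.dropped, dict(counter.open_windows))
```

In this run the event at time 7 is late but still within the delay, so it is counted; the event at time 3 arrives after the watermark has passed its window and is dropped.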
  4. 5.

    Event-time watermarks: how late is too late?
    Image source: https://www.slideshare.net/databricks/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-with-tathagata-das
  5. 6.

    Tradeoffs
    • Latency – how much delay is acceptable?
    • Semantics
      • At most once – some events might be lost, i.e. not processed.
      • Exactly once
      • At least once – duplicates are possible.
    • Scale*
    • Data size/speed
    • Cost
    • Resource utilization
    • *Scale is not really an option in practice
    • Tradeoffs in normal operations vs worst case
    • If you have to choose between higher throughput and lower latency, which would you choose?
    • How about between lower latency and at least once?
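One common way the delivery semantics above arise is from the order in which a consumer processes a record and commits its offset. The sketch below is a toy simulation in plain Python, not any real client API; `run_consumer` and its parameters are hypothetical names, and it assumes a single crash that happens between the two steps, followed by a restart from the last committed offset.

```python
def run_consumer(records, commit_first, crash_at):
    """Replay records through a toy consumer that crashes once at index crash_at.

    commit_first=True  -> commit the offset, then process (at most once)
    commit_first=False -> process, then commit the offset (at least once)
    """
    committed = 0      # next offset to read after a (re)start
    processed = []     # side effects that reached the sink
    crashed = False
    while committed < len(records):
        offset = committed
        record = records[offset]
        if commit_first:
            committed = offset + 1
            if offset == crash_at and not crashed:
                crashed = True
                continue  # crash before processing: the record is lost
            processed.append(record)
        else:
            processed.append(record)
            if offset == crash_at and not crashed:
                crashed = True
                continue  # crash before commit: the record is re-read
            committed = offset + 1
    return processed


records = ["a", "b", "c"]
at_most_once = run_consumer(records, commit_first=True, crash_at=1)
at_least_once = run_consumer(records, commit_first=False, crash_at=1)

# Assuming every record carries a unique id (here, its value), an idempotent
# sink can drop duplicates to get an exactly-once effect on top of at-least-once.
deduplicated = list(dict.fromkeys(at_least_once))

print(at_most_once, at_least_once, deduplicated)
```

The deduplication step only works under the stated assumption of unique record ids; without it, at-least-once delivery leaves duplicates visible downstream.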
  6. 7.

    Spark Streaming
    • Micro batching
    • Proven at scale
    • Exactly once processing
    • Support for multiple languages
    • Support for multiple systems – Mesos, Kubernetes, YARN, Standalone
    • All data sources
    • Unified API
    • Schema support
    • …
    Image source: https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
  7. 8.

    Spark Streaming Continuous Processing mode
    • Introduced in 2.3. Enabled with .trigger(continuous = "5 seconds")
    • At least once, end-to-end low latency execution mode.
    Image source: https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
  8. 9.

    Event markers
    • Chandy-Lamport algorithm for optimizing checkpointing
    Special “epoch marker” records are injected into the input data stream of every task. When a marker is encountered by a task, the task asynchronously reports the last offset processed to the driver. Once the driver receives the offsets from all the tasks writing to the sink, it writes them to the write-ahead log. Since the checkpointing is completely asynchronous, the tasks can continue uninterrupted and provide consistent millisecond-level latencies.
    • Also used in Kafka.
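The epoch-marker flow described above can be sketched as a toy simulation. This is plain Python, not Spark's code: `EPOCH_MARKER`, `Driver` and `Task` are hypothetical names, and the report to the driver is a direct synchronous call here, whereas the slide describes it as asynchronous. The point is only the bookkeeping: a task reports its last processed offset when it sees a marker, and the driver writes an epoch to the write-ahead log once every task has reported for that epoch.

```python
EPOCH_MARKER = object()  # hypothetical sentinel injected into each task's input


class Driver:
    """Collects per-task offsets and logs an epoch once all tasks have reported."""

    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.pending = {}          # epoch -> {task_id: last_offset}
        self.write_ahead_log = []  # list of (epoch, {task_id: last_offset})

    def report(self, epoch, task_id, offset):
        reports = self.pending.setdefault(epoch, {})
        reports[task_id] = offset
        if len(reports) == self.num_tasks:
            # Every task has reached this epoch's marker: commit the epoch.
            self.write_ahead_log.append((epoch, dict(self.pending.pop(epoch))))


class Task:
    """Processes records and reports progress whenever it meets a marker."""

    def __init__(self, task_id, driver):
        self.task_id = task_id
        self.driver = driver
        self.epoch = 0
        self.last_offset = -1
        self.output = []

    def handle(self, item):
        if item is EPOCH_MARKER:
            self.driver.report(self.epoch, self.task_id, self.last_offset)
            self.epoch += 1
        else:
            offset, value = item
            self.output.append(value)  # stand-in for writing to the sink
            self.last_offset = offset


driver = Driver(num_tasks=2)
tasks = [Task(0, driver), Task(1, driver)]
stream0 = [(0, "a"), (1, "b"), EPOCH_MARKER, (2, "c"), EPOCH_MARKER]
stream1 = [(0, "x"), EPOCH_MARKER, (1, "y"), (2, "z")]  # second marker not seen yet

for task, stream in zip(tasks, (stream0, stream1)):
    for item in stream:
        task.handle(item)

print(driver.write_ahead_log, driver.pending)
```

Epoch 0 is committed because both tasks reported it; epoch 1 stays pending because the second task has not yet seen its marker, even though it has kept processing records in the meantime.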
  9. 10.

    Spark Streaming operational aspects
    • Minimal code change between batch and streaming code
    • No code change between micro batch and continuous mode
    • Checkpointing in JSON. Good for manual debugging and/or recovery
    • Streaming statistics, available on the Spark UI: timeline and histograms (input rate, scheduling delay, processing time)
    • Continuous mode is experimental:
      • Aggregations are not supported yet.
      • No retry on task failures. Manual restart from checkpoint.
      • Uses DataSourceV2. Operates at a record level. New get() and next() API.
      • Mainly designed around Kafka currently.
    Image source: https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
  10. 11.

    Larger picture
    • A streaming system is not run in isolation. Be careful not to generalize guarantees that hold for one component in isolation into guarantees for the end-to-end system.
    • Consider:
      • Community
      • Hiring & familiarity
      • Next year's problems
      • Ops aspects
      • Isolation
      • Password rotation
      • Secret management
      • Monitoring
      • Alerting
      • CI/CD
      • Upgrades
      • Audit