Stream Processing

Invited talk at @twitter on stream processing.

Other talks at http://dharmeshkakadia.github.io/talks

dharmeshkakadia

June 29, 2018

Transcript

  1. Stream Processing
    Dharmesh Kakadia

  2. whoami
    • Applied Scientist/Software Engineer at MobileDataLabs (an acquired team
    inside Microsoft), building an AI/Analytics platform
    • Spent a couple of years with Microsoft Research
    • Spent a couple of years with Azure HDInsight
    • Among other things, author of “Apache Mesos Essentials” book
    • Opinions are mine and biased
    • You can find me as @dharmeshkakadia everywhere

  3. Concepts
    • Events – changelog of the real world
    • EventTime – time when the event
    happened
    • ProcessingTime – time at which the events
    are seen by streaming system
    • Kinds of processing: transformations and aggregations
    • Window – a span of time over which events are grouped
    Image source : https://www.slideshare.net/ConfluentInc/fundamentals-of-stream-processing-with-apache-beam-tyler-akidau-frances-perry
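
The difference between event time and processing time can be sketched with a small tumbling-window count. This is an illustrative sketch only: the 60-second window size, the sample events, and the helper names `window_start` and `count_per_window` are assumptions, not taken from the talk.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size, chosen only for illustration


def window_start(event_time, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % size)


def count_per_window(events, size=WINDOW_SECONDS):
    """Aggregate (event_time, payload) pairs into per-window counts.

    Events are grouped by when they happened (event time), so the result
    does not depend on the order in which they arrive (processing time).
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        counts[window_start(event_time, size)] += 1
    return dict(counts)


# Events listed in arrival order; the third one arrives out of order.
arrivals = [(5, "a"), (65, "b"), (50, "c"), (130, "d")]
window_counts = count_per_window(arrivals)
print(window_counts)
```

The event with event time 50 arrives after the one with event time 65, yet it is still counted in the window starting at 0, which is the behaviour event-time windowing is meant to provide.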

  4. Operational Concepts
    • Checkpointing - tracks progress of data consumption (aka offset
    tracking) and of intermediate state
    • Watermarking - defines how long to wait for late events. Events that
    arrive later than that are deemed too late to be processed and are
    discarded.
    • Watermark delay = trailing gap between the max event time seen and
    the watermark
    • Windows that are older than the watermark are automatically dropped,
    which allows the state to be bounded.
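
A minimal sketch of the watermark idea, assuming the watermark is simply the maximum event time seen minus a fixed delay and that it is updated on every event (real engines usually advance it less often). The class name `WatermarkedCounter` and all the numbers are hypothetical.

```python
from collections import defaultdict


class WatermarkedCounter:
    """Counts events per tumbling window and bounds state with a watermark."""

    def __init__(self, window_size, watermark_delay):
        self.window_size = window_size
        self.watermark_delay = watermark_delay
        self.max_event_time = None
        self.open_windows = defaultdict(int)  # window start -> running count
        self.closed_windows = {}              # window start -> final count
        self.dropped = 0                      # events that were too late

    @property
    def watermark(self):
        """Watermark = max event time seen - watermark delay."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.watermark_delay

    def process(self, event_time):
        # Advance the max event time first; the watermark trails behind it.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        start = event_time - (event_time % self.window_size)
        if start + self.window_size <= self.watermark:
            # The window ended before the watermark: too late, discard.
            self.dropped += 1
            return
        self.open_windows[start] += 1
        self._evict()

    def _evict(self):
        # Finalize every window that is now entirely older than the watermark.
        for start in sorted(self.open_windows):
            if start + self.window_size <= self.watermark:
                self.closed_windows[start] = self.open_windows.pop(start)


counter = WatermarkedCounter(window_size=10, watermark_delay=5)
for t in [1, 4, 12, 8, 21, 3]:
    counter.process(t)
print(counter.closed_windows, dict(counter.open_windows), counter.dropped)
```

In this run the event at time 8 is late but still within the watermark delay, so it is counted; the event at time 3 arrives after its window has been finalized and is dropped.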

  5. Event-time watermarks: how late is too late?
    Image source : https://www.slideshare.net/databricks/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-with-tathagata-das

  6. Tradeoffs
    • Latency – how much delay is acceptable?
    • Semantics
    • At most once – some events might be lost aka not processed.
    • Exactly once
    • At least once – duplicates are possible.
    • Scale*
    • Data size/speed
    • Cost
    • Resource utilization
    • *Scale is not really an option in practice
    • Tradeoffs in normal operations vs worst case
    • If you have to choose between higher throughput and lower latency which would you choose?
    • How about between lower latency and at least once?
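
One way to see where at-most-once and at-least-once come from is the order in which a consumer commits its offset and processes the event. The simulation below is a sketch under that assumption; `run`, `run_with_restart`, and `SimulatedCrash` are hypothetical names, and the crash is injected between the two steps.

```python
class SimulatedCrash(Exception):
    """Raised to simulate a failure between committing and processing."""


def run(events, committed_offset, sink, commit_first, crash_at=None):
    """Consume events starting at committed_offset; return the new committed offset.

    commit_first=True  commits the offset before processing (at most once).
    commit_first=False processes before committing (at least once).
    crash_at is the offset at which a crash happens between the two steps.
    """
    offset = committed_offset
    try:
        while offset < len(events):
            if commit_first:
                committed_offset = offset + 1
                if offset == crash_at:
                    raise SimulatedCrash
                sink.append(events[offset])
            else:
                sink.append(events[offset])
                if offset == crash_at:
                    raise SimulatedCrash
                committed_offset = offset + 1
            offset += 1
    except SimulatedCrash:
        pass  # the process dies; only the committed offset survives
    return committed_offset


def run_with_restart(events, commit_first, crash_at):
    """Run once with a crash, then restart from the last committed offset."""
    sink = []
    committed = run(events, 0, sink, commit_first, crash_at)
    run(events, committed, sink, commit_first)
    return sink


events = ["a", "b", "c"]
at_most_once = run_with_restart(events, commit_first=True, crash_at=1)
at_least_once = run_with_restart(events, commit_first=False, crash_at=1)
print(at_most_once, at_least_once)
```

Committing first loses "b" after the crash, while processing first writes "b" twice. Getting exactly-once results out of the second ordering typically needs an idempotent or transactional sink, which this sketch does not model.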

  7. Spark Streaming
    • Micro-batching
    • Proven at scale
    • Exactly-once processing
    • Support for multiple languages
    • Support for multiple systems – Mesos, Kubernetes, YARN, Standalone
    • All data sources
    • Unified API
    • Schema support
    Image source : https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html

  8. Spark Streaming Continuous Processing mode
    • Introduced in Spark 2.3. Enabled with .trigger(continuous = "5 seconds"),
    where the interval is the checkpoint interval.
    • An end-to-end low-latency execution mode with at-least-once guarantees.
    Image source : https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
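
A hedged PySpark sketch of switching a Structured Streaming query between micro-batch and continuous mode by changing only the trigger. The helper names `trigger_options` and `start_rate_to_console` are hypothetical, the `rate` source and `console` sink are picked only because they are convenient for experiments, and `spark` is assumed to be an existing `SparkSession` on Spark 2.3 or later.

```python
def trigger_options(mode, interval):
    """Map a mode name to keyword arguments for DataStreamWriter.trigger()."""
    if mode == "continuous":
        # In continuous mode the interval is the checkpoint interval.
        return {"continuous": interval}
    if mode == "micro-batch":
        # In micro-batch mode the interval is the batch trigger interval.
        return {"processingTime": interval}
    raise ValueError("unknown mode: %r" % (mode,))


def start_rate_to_console(spark, mode="continuous", interval="5 seconds"):
    """Start a test query; `spark` is an existing pyspark.sql.SparkSession."""
    stream = (
        spark.readStream
        .format("rate")                 # built-in test source
        .option("rowsPerSecond", 5)
        .load()
    )
    return (
        stream.writeStream
        .format("console")              # prints rows, for debugging only
        .trigger(**trigger_options(mode, interval))
        .start()
    )


print(trigger_options("continuous", "5 seconds"))
```

Calling `start_rate_to_console(spark, mode="micro-batch")` would run the same query with `processingTime` instead of `continuous`; nothing else in the query definition changes.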

  9. Event markers
    • Chandy-Lamport algorithm for optimizing checkpointing
    Special “epoch markers” records are injected into the input data
    stream of every task. When a marker is encountered by a task, the
    task asynchronously reports the last offset processed to the driver.
    Once the driver receives the offsets from all the tasks writing to the
    sink, it writes them to the write-ahead-log. Since the checkpointing is
    completely asynchronous, the tasks can continue uninterrupted and
    provide consistent millisecond-level latencies.
    • Also used in Kafka.
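
The epoch-marker flow described above can be sketched as a toy simulation. This is not Spark's implementation: the `Driver` class, `run_task`, and `EPOCH_MARKER` are hypothetical names, the tasks run one after another in a single thread, and the report to the driver is a plain method call rather than an asynchronous message.

```python
EPOCH_MARKER = object()  # stand-in for the special marker record


class Driver:
    """Collects per-task offsets and commits an epoch once all tasks report."""

    def __init__(self, task_ids):
        self.task_ids = set(task_ids)
        self.reports = {}          # epoch -> {task_id: last offset processed}
        self.write_ahead_log = []  # committed (epoch, offsets) entries

    def report(self, epoch, task_id, offset):
        offsets = self.reports.setdefault(epoch, {})
        offsets[task_id] = offset
        if set(offsets) == self.task_ids:
            # Every task has reached this marker: the epoch can be logged.
            self.write_ahead_log.append((epoch, dict(offsets)))
            del self.reports[epoch]


def run_task(task_id, records, driver):
    """Process records; on each marker, report the last processed offset."""
    epoch = 0
    last_offset = -1
    processed = []
    for record in records:
        if record is EPOCH_MARKER:
            driver.report(epoch, task_id, last_offset)
            epoch += 1
            continue  # the task keeps going without waiting for the commit
        last_offset += 1
        processed.append(record)
    return processed


driver = Driver(task_ids=[0, 1])
out0 = run_task(0, ["a", "b", EPOCH_MARKER, "c", EPOCH_MARKER], driver)
out1 = run_task(1, ["x", EPOCH_MARKER, "y", "z", EPOCH_MARKER], driver)
print(driver.write_ahead_log)
```

An epoch is written to the log only after both tasks have reported for it, so the log always holds a consistent set of offsets to restart from.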

  10. Spark Streaming operational aspects
    • Minimal code change between batch and
    streaming code
    • No code change between micro batch and
    continuous mode
    • Checkpointing in JSON. Good for manual
    debugging and/or recovery
    • Streaming statistics. Available on Spark UI.
    Timeline and histograms (Input rate,
    Scheduling delay, Processing time)
    • Continuous mode is experimental:
    • Aggregations are not supported yet.
    • No retry on task failures. Manual restart
    from checkpoint.
    • Uses DataSourceV2. Operates at a record
    level. New get() and next() API.
    • Mainly designed around Kafka currently.
    Image source : https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html

  11. Larger picture
    • A streaming system does not run in isolation. Be careful not to treat
    guarantees that hold for one component in isolation as guarantees for the
    end-to-end system.
    • Consider:
    • Community
    • Hiring & familiarity
    • Next year problems
    • Ops aspects
    • Isolation
    • Password rotation
    • Secret management
    • Monitoring
    • Alerting
    • CI/CD
    • Upgrades
    • Audit

  12. Our Data Platform

  13. Thanks!
    Questions/Discussion
