team building AI/Analytics platform • Spent a couple of years with Microsoft Research • Spent a couple of years with Azure HDInsight • Among other things, author of the “Apache Mesos Essentials” book • Opinions are mine and biased • You can find me as @dharmeshkakadia everywhere
– time when the event happened • ProcessingTime – time at which the events are seen by the streaming system • Kind of processing • Transformations • Aggregations • Window – spread of the events across time Image source : https://www.slideshare.net/ConfluentInc/fundamentals-of-stream-processing-with-apache-beam-tyler-akidau-frances-perry
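A minimal sketch of windowing by event time, assuming fixed-size (tumbling) windows and integer timestamps in seconds. The function name `tumbling_window_counts` and the sample events are made up for illustration. The events are listed in the order the system sees them (processing time), yet each one lands in the window its own timestamp belongs to.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per tumbling window, keyed by event time rather than arrival order."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Round the event time down to the start of its window.
        window_start = (event_time // window_size) * window_size
        counts[(window_start, window_start + window_size)] += 1
    return dict(counts)


# (event_time_seconds, payload), in arrival order; note 3 arrives after 12.
events = [(1, "a"), (12, "b"), (3, "c"), (11, "d"), (25, "e")]
counts = tumbling_window_counts(events, window_size=10)
```

The out-of-order event with timestamp 3 is still counted in the window starting at 0, which is the point of aggregating on event time.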
consumption aka offset tracking • Intermediate state • Watermarking – defines how long to wait for late events. Events that arrive later than that are deemed too late to be processed and are discarded. • Watermark delay = trailing gap between the max event time seen so far and the watermark • Windows that are older than the watermark are automatically dropped, which allows the state to be bounded.
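The watermark rule above can be sketched in plain Python. This is a simplification under stated assumptions: the watermark is recomputed on every event, whereas a real engine typically advances it per batch or trigger, and `apply_watermark` is a hypothetical helper, not an API of any streaming system.

```python
def apply_watermark(event_times, delay):
    """Split events into accepted and dropped, with watermark = max event time - delay."""
    max_event_time = None
    accepted, dropped = [], []
    for event_time in event_times:  # iteration order is arrival (processing-time) order
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - delay
        if event_time < watermark:
            dropped.append(event_time)   # too late: older than the watermark
        else:
            accepted.append(event_time)
    return accepted, dropped


accepted, dropped = apply_watermark([10, 12, 11, 20, 9, 16, 30, 15], delay=5)
```

With a delay of 5, the event at time 9 arrives after the max event time has reached 20 (watermark 15) and is dropped, while 16 is late but still within the trailing gap and is kept.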
Semantics • At most once – some events might be lost aka not processed. • Exactly once • At least once – duplicates are possible. • Scale* • Data size/speed • Cost • Resource utilization • *Scale is not really an option in practice • Tradeoffs in normal operations vs worst case • If you have to choose between higher throughput and lower latency, which would you choose? • How about between lower latency and at least once?
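One common way the first and third semantics arise is the order of "commit offset" and "process record". The sketch below is a toy simulation, not any system's actual consumer API: `run`, `crash_at`, and the single simulated crash are all assumptions made for illustration.

```python
def run(records, commit_first, crash_at):
    """Replay `records` with one simulated crash at offset `crash_at`; return what got processed."""
    committed = 0      # next offset to read after a restart
    processed = []
    crashed = False
    while committed < len(records):
        offset = committed
        try:
            if commit_first:
                committed = offset + 1             # commit, then process
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash after commit, before processing")
                processed.append(records[offset])
            else:
                processed.append(records[offset])  # process, then commit
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash after processing, before commit")
                committed = offset + 1
        except RuntimeError:
            continue  # restart from the last committed offset
    return processed


records = ["a", "b", "c"]
at_most_once = run(records, commit_first=True, crash_at=1)
at_least_once = run(records, commit_first=False, crash_at=1)
```

Committing first loses record "b" when the crash hits (at most once); processing first replays it after the restart, so it appears twice (at least once).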
Exactly once processing • Support for multiple languages • Support for multiple systems – Mesos, Kubernetes, YARN, Standalone • All data sources • Unified API • Schema support • … Image source : https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
markers” records are injected into the input data stream of every task. When a marker is encountered by a task, the task asynchronously reports the last offset processed to the driver. Once the driver receives the offsets from all the tasks writing to the sink, it writes them to the write-ahead-log. Since the checkpointing is completely asynchronous, the tasks can continue uninterrupted and provide consistent millisecond-level latencies. • Also used in Kafka.
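The marker-based checkpointing described above can be sketched roughly as follows. This is a toy model with assumed names (`MARKER`, `Driver`, `run_task`), and the tasks run one after another here instead of asynchronously; it only shows the bookkeeping: each task reports its last processed offset when it sees a marker, and the driver writes an epoch to its log once every task has reported.

```python
MARKER = object()  # stands in for a marker record injected into the stream


class Driver:
    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.pending = {}          # epoch -> {task_id: last offset processed}
        self.write_ahead_log = []  # committed (epoch, offsets) entries

    def report(self, epoch, task_id, offset):
        offsets = self.pending.setdefault(epoch, {})
        offsets[task_id] = offset
        if len(offsets) == self.num_tasks:  # every task has reported this epoch
            self.write_ahead_log.append((epoch, dict(offsets)))
            del self.pending[epoch]


def run_task(task_id, stream, driver):
    epoch, offset = 0, -1  # offset = index of the last data record processed
    for record in stream:
        if record is MARKER:
            driver.report(epoch, task_id, offset)
            epoch += 1
        else:
            offset += 1    # "process" the record


driver = Driver(num_tasks=2)
run_task(0, ["a", "b", MARKER, "c", MARKER], driver)
run_task(1, ["x", MARKER, "y", "z", MARKER], driver)
```

After both tasks finish, the log holds one entry per epoch with the offsets of both tasks, which is what a restart would resume from.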
and streaming code • No code change between micro-batch and continuous mode • Checkpointing in JSON. Good for manual debugging and/or recovery • Streaming statistics. Available on Spark UI. Timeline and histograms (Input rate, Scheduling delay, Processing time) • Continuous mode is experimental: • Aggregations are not supported yet. • No retry on task failures. Manual restart from checkpoint. • Uses DataSourceV2. Operates at a record level. New get() and next() API. • Mainly designed around Kafka currently. Image source : https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
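To illustrate "no code change between micro-batch and continuous mode": in PySpark the only difference is the keyword passed to `DataStreamWriter.trigger()`. The helpers `trigger_options` and `start_console_query` below are hypothetical wrappers written for this sketch, and `df` is assumed to be a streaming DataFrame created elsewhere (for example from a Kafka source); the console sink and the one-second interval are arbitrary choices.

```python
def trigger_options(mode, interval="1 second"):
    """Return keyword arguments for DataStreamWriter.trigger() for the given mode."""
    if mode == "micro-batch":
        return {"processingTime": interval}  # batch interval
    if mode == "continuous":
        return {"continuous": interval}      # checkpoint interval
    raise ValueError(f"unknown mode: {mode}")


def start_console_query(df, mode, checkpoint_dir):
    """Start a streaming query on `df`; only the trigger differs between modes."""
    return (
        df.writeStream
        .format("console")
        .option("checkpointLocation", checkpoint_dir)
        .trigger(**trigger_options(mode))
        .start()
    )
```

Calling `start_console_query(df, "continuous", "/tmp/ckpt")` instead of passing `"micro-batch"` would switch modes without touching the query logic; the checkpoint path here is a placeholder.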
Be careful not to mistake guarantees that a component provides in isolation for guarantees of the end-to-end system. • Consider: • Community • Hiring & familiarity • Next year's problems • Ops aspects • Isolation • Password rotation • Secret management • Monitoring • Alerting • CI/CD • Upgrades • Audit