Streaming
- An unbounded flow of data that can be processed one record at a time, or a few records at a time.
- Basically:
    while (isRunning) {
        // Process data ...
        // Store and/or send it somewhere else ...
    }
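The two processing styles named on this slide (one record at a time, or a few at a time) can be sketched as below. This is a minimal illustration, not any engine's API: `process_one_at_a_time`, `micro_batches`, the doubling "processing" step, and the list used as a sink are all hypothetical names chosen for the example.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def process_one_at_a_time(source: Iterable[int], sink: Callable[[int], None]) -> None:
    """Handle each record as soon as it arrives, then forward the result."""
    for record in source:
        sink(record * 2)  # "process data", then "send it somewhere else"


def micro_batches(source: Iterable[int], size: int) -> Iterator[List[int]]:
    """Group a (possibly unbounded) source into small batches of at most `size`."""
    it = iter(source)
    while True:
        batch = list(islice(it, size))
        if not batch:  # only reached when a finite source is exhausted
            return
        yield batch


# A finite range stands in for the stream so the example terminates.
out: List[int] = []
process_one_at_a_time(range(5), out.append)
batches = list(micro_batches(range(5), size=2))
```

With a truly unbounded source, both functions simply never return; the finite `range(5)` here only stands in for the stream.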
Lambda Architecture
- Good for its time, but comes with pain points
- Pros: keeps the source data unchanged, emphasizes the need to reprocess data, forces operations to run on materialized views
- Cons: two separate codebases in two distributed systems, each with its own complexity, and painful to manage
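A toy sketch of the idea, assuming a simple count-per-key workload: an immutable log feeds a batch view (recomputed from scratch) and a speed view (covering records the batch has not reached yet), and a query merges both. The names `master_log`, `batch_view`, `speed_view`, `query`, and `batch_horizon` are hypothetical; in a real deployment the two views would live in two different systems, which is exactly the pain point listed above.

```python
from collections import Counter
from typing import List

# Immutable source of truth: records are only ever appended.
master_log: List[str] = ["a", "b", "a", "c", "a"]


def batch_view(log: List[str], upto: int) -> Counter:
    """Batch layer: recompute the view from scratch over log[:upto]."""
    return Counter(log[:upto])


def speed_view(log: List[str], since: int) -> Counter:
    """Speed layer: cover only the records the batch layer has not seen yet."""
    return Counter(log[since:])


def query(key: str, batch: Counter, speed: Counter) -> int:
    """Serving layer: merge the two materialized views at query time."""
    return batch[key] + speed[key]


batch_horizon = 3  # the last batch run covered the first three records
batch = batch_view(master_log, batch_horizon)
speed = speed_view(master_log, batch_horizon)
count_a = query("a", batch, speed)
```

Note that the same counting logic is written twice, once per layer; keeping those two implementations in agreement is the maintenance cost the slide refers to.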
Properties of Effective Streaming
- Delivery guarantees
  - At least once: ensures that every operator sees every record, replaying the stream in case of failure.
  - Exactly once: ensures that operators do not process duplicate updates.
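The difference between the two guarantees can be illustrated with a small simulation. This is a sketch under stated assumptions, not how any particular engine works: `run_with_replay` simulates a single crash followed by a replay from the last committed offset (at least once), and `apply_exactly_once` gets an exactly-once effect by dropping offsets it has already seen. Real engines may reach the same result through checkpointed state or transactions instead of deduplication. All function and parameter names are hypothetical.

```python
from typing import List, Tuple


def run_with_replay(records: List[str], crash_after: int,
                    commit_every: int) -> List[Tuple[int, str]]:
    """At least once: after a simulated crash, replay from the last committed offset."""
    delivered: List[Tuple[int, str]] = []
    committed = 0
    crashed = False
    offset = 0
    while offset < len(records):
        delivered.append((offset, records[offset]))
        offset += 1
        if offset % commit_every == 0:
            committed = offset  # progress is durable up to here
        if not crashed and offset == crash_after:
            crashed = True
            offset = committed  # restart: uncommitted records are sent again
    return delivered


def apply_exactly_once(delivered: List[Tuple[int, str]]) -> List[str]:
    """Exactly-once effect: ignore any offset that was already applied."""
    seen = set()
    result: List[str] = []
    for offset, value in delivered:
        if offset in seen:
            continue  # duplicate caused by the replay
        seen.add(offset)
        result.append(value)
    return result


delivered = run_with_replay(["a", "b", "c", "d", "e"], crash_after=3, commit_every=2)
applied = apply_exactly_once(delivered)
```

In this run the record at offset 2 is delivered twice, because the crash happened after it was processed but before its offset was committed; the deduplicating consumer applies it only once.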
- Stateful dataflow operators
- State access patterns:
  - Local state: the current state of a specific operator
  - Partitioned state: state maintained across partitions
  - Direct Stream API: mapWithState(), flatMapWithState(), etc.
- Checkpointing and savepoints
- Exactly-once semantics (at least, that is the claim)
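To make the bullets above concrete, here is a minimal sketch of a keyed stateful operator with checkpoint and restore. It only mimics the shape of a `mapWithState`-style call, where the user function receives a value plus the previous state for that key and returns an output plus the new state; it is not the engine's actual API. `KeyedStatefulMap`, `running_sum`, `checkpoint`, and `restore` are hypothetical names.

```python
import copy
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class KeyedStatefulMap:
    """Toy operator that keeps one state value per key (partitioned state)."""

    def __init__(self, fn: Callable[[Any, Optional[Any]], Tuple[Any, Any]]) -> None:
        self.fn = fn
        self.state: Dict[Hashable, Any] = {}

    def process(self, key: Hashable, value: Any) -> Any:
        # The user function sees the value and the previous state for this key.
        output, new_state = self.fn(value, self.state.get(key))
        self.state[key] = new_state
        return output

    def checkpoint(self) -> Dict[Hashable, Any]:
        """Take a consistent copy of all state."""
        return copy.deepcopy(self.state)

    def restore(self, snapshot: Dict[Hashable, Any]) -> None:
        """Roll the operator back to a previously taken snapshot."""
        self.state = copy.deepcopy(snapshot)


def running_sum(value: int, state: Optional[int]) -> Tuple[int, int]:
    total = (state or 0) + value
    return total, total  # (output, new state)


op = KeyedStatefulMap(running_sum)
outputs = [op.process(k, v) for k, v in [("a", 1), ("b", 10), ("a", 2)]]
snapshot = op.checkpoint()
op.process("a", 100)       # work done after the checkpoint ...
op.restore(snapshot)       # ... is discarded on recovery
after_restore = op.process("a", 3)
```

Restoring the snapshot and replaying input from the matching stream position is what lets state and input stay consistent after a failure.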
Kafka Streams
- Not just for streaming analytics
- Goes beyond Big Data, bridging the analytics world and the transactional, operational, and services world
- Performant; uses the best of [...] to remain lightweight and simple
Kafka Streams
- Streams and tables are duals of each other
- A stream is the changelog of a table
- A table is a materialized view of a stream
- Same idea as Change Data Capture in databases
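The duality can be shown in a few lines of plain Python, without the Kafka Streams library. In this sketch, `table_from_stream` materializes a table by replaying a changelog, and `stream_from_table_updates` emits a change record for every effective table update. Both function names are hypothetical, and treating a `None` value as a delete (a tombstone) is an assumption made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Change = Tuple[str, Optional[int]]


def table_from_stream(changelog: Iterable[Change]) -> Dict[str, int]:
    """A table is a materialized view of a stream: replay changes, keep the latest."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # tombstone: the key was deleted
        else:
            table[key] = value
    return table


def stream_from_table_updates(table: Dict[str, int],
                              updates: Iterable[Tuple[str, int]]) -> Iterator[Change]:
    """A stream is the changelog of a table: emit one record per effective update."""
    for key, value in updates:
        if table.get(key) != value:
            table[key] = value
            yield (key, value)


changelog: List[Change] = [("alice", 1), ("bob", 5), ("alice", 3), ("bob", None)]
view = table_from_stream(changelog)

live_table: Dict[str, int] = {}
changes = list(stream_from_table_updates(live_table, [("x", 1), ("x", 1), ("y", 2)]))
rebuilt = table_from_stream(changes)
```

Replaying the emitted changes rebuilds exactly the table they came from, which is the same round trip that Change Data Capture relies on.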
Apache Beam
- An attempt to provide a unified batch + streaming programming model for portable data processing pipelines
- Provides a Java SDK, plus DSLs in other languages
- Supports a handful of streaming engines as runners: Spark, Flink, Dataflow, etc.
More about Beam / Google Dataflow
- The World Beyond Batch: Streaming 101
  https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
- The World Beyond Batch: Streaming 102
  https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
- Dataflow/Beam and Spark Comparison
  https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison#logistics