Streaming Platforms

Streaming Platforms in the Big Data Zoo
Sam BESSALAH - @samklr

Three assumed paradigms
- Request Response
- Batch Processing
- Streaming

Streaming
An unbounded flow of data that can be processed one record at a time, or a few records at a time.
Basically:
while (isRunning) {
    // process data ..
    // store and/or send it somewhere else ..
}
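As a minimal sketch of the loop above, the following Python function reads records, processes them one at a time or in small groups, and stores the results. The names `run_stream`, `process`, `sink` and `batch_size` are hypothetical and not from the talk; a finite iterable stands in for the unbounded source so the example terminates.

```python
from typing import Callable, Iterable, List


def run_stream(
    source: Iterable[int],
    process: Callable[[int], int],
    sink: List[int],
    batch_size: int = 1,
) -> None:
    """Read records from source, process them, and append results to sink."""
    batch: List[int] = []
    for record in source:  # stands in for `while (isRunning)`
        batch.append(record)
        if len(batch) >= batch_size:
            # Process the buffered records, then store them "somewhere else".
            sink.extend(process(r) for r in batch)
            batch = []
    if batch:
        # Flush the remainder; only reachable because this demo source is finite.
        sink.extend(process(r) for r in batch)


results: List[int] = []
run_stream(range(5), lambda x: x * 2, results, batch_size=2)
```

With `batch_size=1` this is record-at-a-time processing; larger values give the "some more at a time" variant.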

Examples of Streaming Apps

Stream Processing Legends
- Not precise, need to accept approximate results
- Lossy
- Unstable
- Does not match batch processing
- Transient

Solution : Lambda Architecture, Big Data, Circa 2013

Lambda Architecture

Lambda Architecture
Good for its time, but comes with pain points.
- Pros: keeps the source data unchanged, emphasizes the issue of reprocessing the data, forces operations onto materialized views
- Cons: two separate codebases in two distributed systems, each with its own complexity, and painful to manage
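To make the "two separated code" pain point concrete, here is a small, hypothetical sketch of a Lambda-style page-view counter. The function names, the event shape and the in-memory views are assumptions for illustration only; a real deployment would run the two layers on separate distributed systems.

```python
from collections import Counter
from typing import Dict, Iterable

Event = Dict[str, str]


def batch_page_views(master_dataset: Iterable[Event]) -> Counter:
    """Batch layer: recompute the view from the full, unchanged dataset."""
    return Counter(event["page"] for event in master_dataset)


def speed_page_views(view: Counter, event: Event) -> Counter:
    """Speed layer: the same counting logic, written a second time, incrementally."""
    view[event["page"]] += 1
    return view


def query(page: str, batch_view: Counter, speed_view: Counter) -> int:
    """Serving layer: merge both materialized views at query time."""
    return batch_view[page] + speed_view[page]


master = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
recent = [{"page": "/home"}]  # events not yet absorbed by the batch layer

batch_view = batch_page_views(master)
speed_view: Counter = Counter()
for event in recent:
    speed_page_views(speed_view, event)
```

The counting rule lives in both `batch_page_views` and `speed_page_views`; keeping those two implementations in agreement is the maintenance cost listed under cons.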

Lambda Architecture
Extend streaming platforms to handle the whole operation. After all, batch and stream processing can be interchangeable.

Welcome to the Zoo

Properties of Effective Streaming - Stream Replay

Properties of Effective Streaming - Lineage Tracking

Properties of Effective Streaming - State Checkpointing

Properties of Effective Streaming - State Management
Non-trivial: real-world apps need state: f(input, state) => (output, state)
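The signature f(input, state) => (output, state) can be sketched as a pure step function that is folded over the stream. The example below is an assumption-laden illustration (a running count per key; `count_step` and `run` are hypothetical names), not the API of any particular engine.

```python
from typing import Callable, Dict, Iterable, List, Tuple

State = Dict[str, int]
Output = Tuple[str, int]


def count_step(event: str, state: State) -> Tuple[Output, State]:
    """f(input, state) => (output, state): emit the running count for the event's key."""
    new_state = dict(state)  # copy, so the previous state stays untouched
    new_state[event] = new_state.get(event, 0) + 1
    return (event, new_state[event]), new_state


def run(
    events: Iterable[str],
    step: Callable[[str, State], Tuple[Output, State]],
    state: State,
) -> Tuple[List[Output], State]:
    """Thread the state through the stream, collecting one output per input."""
    outputs: List[Output] = []
    for event in events:
        output, state = step(event, state)
        outputs.append(output)
    return outputs, state


outputs, final_state = run(["a", "b", "a"], count_step, {})
```

Because the step function returns the new state instead of hiding it, the engine (here, `run`) is the one place that holds state, which is what makes checkpointing it possible.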

Properties of Effective Streaming - Delivery Guarantees
- At least once: ensure every operator sees every record, replaying the stream in case of failure.
- Exactly once: ensure that operators do not process duplicate updates.
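The difference between the two guarantees can be illustrated with a small simulation. Everything here is hypothetical (`replay_after_crash`, `DedupSum`, the offsets chosen): a crash forces a replay from the last committed offset, so a naive operator counts some records twice, while an operator that remembers which offsets it has applied does not. Real engines reach exactly-once effects by other means as well, such as checkpoints or transactions; the in-memory `seen` set is only a stand-in.

```python
from typing import List, Set, Tuple


def replay_after_crash(
    log: List[int], committed_offset: int, crash_offset: int
) -> List[Tuple[int, int]]:
    """Return (offset, value) pairs as an operator sees them under at-least-once.

    The first attempt processes offsets [0, crash_offset) and then crashes;
    the restart replays everything from the last committed offset.
    """
    indexed = list(enumerate(log))
    return indexed[:crash_offset] + indexed[committed_offset:]


class DedupSum:
    """Sums values but ignores offsets it has already applied."""

    def __init__(self) -> None:
        self.total = 0
        self.seen: Set[int] = set()

    def apply(self, offset: int, value: int) -> None:
        if offset in self.seen:
            return  # duplicate delivery caused by the replay
        self.seen.add(offset)
        self.total += value


log = [10, 20, 30, 40]
deliveries = replay_after_crash(log, committed_offset=1, crash_offset=3)

naive_total = sum(value for _, value in deliveries)  # double counts offsets 1 and 2

operator = DedupSum()
for offset, value in deliveries:
    operator.apply(offset, value)
```

In this run the naive sum is inflated by the two replayed records, while `operator.total` matches the sum of the log.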

Properties of Effective Streaming
- Delivery Guarantees
- Fault Tolerance
- Latency
- Throughput
- Scalability

- Stateful dataflow operators
- State access patterns:
  - Local state: current state of a specific operator
  - Partitioned state: maintains state across partitions
- Direct Stream API: mapWithState(), flatMapWithState(), etc.
- Checkpointing and savepoints
- Exactly once semantics (at least, that is the claim)

Checkpointing recovery

Remember: you can only claim exactly once if your source lets you rewind the stream. Hence Kafka … again.
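Why the rewindable source matters can be sketched as follows. This is an illustrative toy, not how any specific engine implements recovery: `RewindableSource` and `run_with_checkpoints` are hypothetical names, the "state" is a running sum, and the crash is simulated once at a chosen offset. The key assumption is that the operator state and the source offset are saved together, so recovery restores the state and rewinds the source to the matching position.

```python
from typing import Iterator, List, Optional, Tuple


class RewindableSource:
    """A source that can be re-read from any offset (as a log-based source allows)."""

    def __init__(self, records: List[int]) -> None:
        self.records = list(records)

    def read_from(self, offset: int) -> Iterator[Tuple[int, int]]:
        return enumerate(self.records[offset:], start=offset)


def run_with_checkpoints(
    source: RewindableSource, checkpoint_every: int, fail_at: Optional[int] = None
) -> int:
    """Sum the stream, surviving one simulated crash at offset fail_at."""
    checkpoint = (0, 0)  # (next offset to read, running sum), saved together
    while True:
        offset, total = checkpoint  # restore state from the last checkpoint
        try:
            for off, value in source.read_from(offset):  # rewind the source
                if fail_at is not None and off == fail_at:
                    fail_at = None  # crash only once
                    raise RuntimeError("simulated crash")
                total += value
                if (off + 1) % checkpoint_every == 0:
                    checkpoint = (off + 1, total)
            return total
        except RuntimeError:
            continue  # work done since the last checkpoint is discarded and redone


source = RewindableSource([1, 2, 3, 4, 5])
total_with_crash = run_with_checkpoints(source, checkpoint_every=2, fail_at=3)
total_without_crash = run_with_checkpoints(source, checkpoint_every=2)
```

Records processed after the last checkpoint are read a second time, but their first effect was thrown away with the lost state, so each record contributes to the result exactly once. Without `read_from(offset)` there would be no way to recover those records.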

Streams

Spark, Flink and Storm
- Distributed
- Cluster managers
- Huge overhead
- Come on top of another platform
Kafka Streams is just a library

Kafka Streams
- Not just for streaming analytics
- Beyond Big Data: bridging the analytics and transactional, operational and services worlds
- Performant, uses the best of Kafka to remain lightweight and simple

Kafka streams

Kafka streams
- Streams are the dual of tables
- A stream is a changelog of a table
- A table is a materialized view of a stream
- Same as Change Data Capture in databases
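The duality above can be shown in a few lines of plain Python. This is a conceptual sketch, not the Kafka Streams API: `ChangelogTable` and `materialize` are hypothetical names, and a list of (key, value) pairs stands in for a topic.

```python
from typing import Dict, Iterable, List, Tuple

Change = Tuple[str, int]


class ChangelogTable:
    """A table that records every update it receives as a changelog stream."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = {}
        self.changelog: List[Change] = []

    def put(self, key: str, value: int) -> None:
        self.rows[key] = value
        self.changelog.append((key, value))  # table -> stream


def materialize(changelog: Iterable[Change]) -> Dict[str, int]:
    """Stream -> table: replay the changelog, keeping the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        table[key] = value
    return table


table = ChangelogTable()
table.put("alice", 1)
table.put("bob", 5)
table.put("alice", 3)

rebuilt = materialize(table.changelog)
```

Replaying the changelog rebuilds the same rows, which is the Change Data Capture idea: the stream of changes is enough to reconstruct the table elsewhere.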

Kafka streams

Kafka streams
Fault tolerance through Kafka

Apache Beam
An attempt to provide a unified batch+streaming programming model for portable data processing pipelines. Provides a Java SDK and DSLs in other languages, plus a handful of streaming engines as runners: Spark, Flink, Dataflow, etc.

Principles of the Beam Model

More about Beam/Google Dataflow
- The World Beyond Batch and Streaming 1: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
- The World Beyond Batch and Streaming 2: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
- Dataflow Beam and Spark Comparison: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison#logistics

When to use them?
