Streaming Platforms

Sam Bessalah
September 28, 2016

Criteo Labs. Paris / Paris Data Meetup
Paris 09-28-2016


Transcript

  1. Streaming

    An unbounded flow of data that can be processed one item at a time, or a few items at a time. Basically: while (isRunning) { process data ..; store and/or send it somewhere else .. }
  2. Stream Processing Legends

    - Not precise, need to accept approximate results
    - Lossy
    - Unstable
    - Does not match batch processing
    - Transient
  3. Lambda Architecture

    - Good for its time, but comes with pain points
    - Pros: keeps the source unchanged, emphasizes the issue of reprocessing the data, forces the operations onto materialized views
    - Cons: two separate codebases in two distributed systems, each with its own complexity, and painful to manage
  4. Lambda Architecture

    Extend streaming platforms to handle the whole operation; after all, batch and stream can be interchangeable.
  5. Properties of Effective Streaming - State Management

    Non-trivial: real-world apps need state: f(input, state) => (output, state)
  6. Properties of Effective Streaming - Delivery Guarantees

    - At least once: ensure all operators see every record, and replay the stream in case of failure.
    - Exactly once: ensure that operators do not process duplicate updates.
  7. Stateful dataflow operators

    - State access patterns:
    - Local state: current state of a specific operator
    - Partitioned state: maintains state across partitions
    - Direct Stream API: mapWithState(), flatMapWithState(), etc.
    - Checkpointing and savepoints
    - Exactly-once semantics (at least, that is what they claim)
  8. Remember

    You can only claim exactly once if your source enables you to rewind the stream. Hence Kafka … again.
  9. Spark, Flink and Storm

    - Distributed
    - Cluster managers
    - Huge overhead
    - Come on top of another platform
  10. Spark, Flink and Storm

    - Distributed
    - Cluster managers
    - Huge overhead
    - Come on top of another platform

    Kafka Streams is just a library.
  11. Kafka Streams

    Not just for streaming analytics. Beyond Big Data: bridging the analytics and the transactional, operational and services worlds. Performant, uses the best of to remain lightweight and simple.
  12. Kafka Streams

    - Streams are the dual of tables
    - A stream is a changelog of a table
    - A table is a materialized view of a stream
    - Same as Change Data Capture in databases
  13. Apache Beam

    An attempt to provide a unified batch + streaming programming model for portable data processing pipelines. Provides a Java SDK and DSLs in other languages, and a handful of streaming engines as runners: Spark, Flink, Dataflow, etc.
  14. More about Beam/Google Dataflow

    - The World Beyond Batch and Streaming 1: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
    - The World Beyond Batch and Streaming 2: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
    - Dataflow/Beam and Spark Comparison: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison#logistics
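
As an illustration of two ideas from the slides, the state signature f(input, state) => (output, state) and the stream/table duality, here is a minimal Python sketch. It is not taken from the deck and does not use Kafka Streams, Flink or Spark; the names `count_step`, `run` and `materialize` are hypothetical, and a running count per key is simply assumed as the example state.

```python
from typing import Dict, Iterable, List, Tuple

State = Dict[str, int]          # assumed state shape: key -> running count
Update = Tuple[str, int]        # one changelog record: (key, new value)


def count_step(event: str, state: State) -> Tuple[Update, State]:
    """f(input, state) => (output, state): count occurrences per key."""
    new_count = state.get(event, 0) + 1
    new_state = {**state, event: new_count}  # copy, so the step stays pure
    return (event, new_count), new_state


def run(events: Iterable[str]) -> Tuple[List[Update], State]:
    """Thread the state through every event and collect the emitted updates."""
    state: State = {}
    changelog: List[Update] = []
    for event in events:
        output, state = count_step(event, state)
        changelog.append(output)
    return changelog, state


def materialize(changelog: Iterable[Update]) -> State:
    """Replay a changelog into a table: the latest value per key wins."""
    table: State = {}
    for key, value in changelog:
        table[key] = value
    return table


changelog, final_state = run(["a", "b", "a"])
table = materialize(changelog)
```

Replaying the changelog rebuilds the same table that the operator held as state, which is one way to read "a table is a materialized view of a stream". It also hints at why a rewindable source matters for recovery: state can be reconstructed by reprocessing the log.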