Streaming Platforms

Sam Bessalah
September 28, 2016

Criteo Labs. Paris / Paris Data Meetup
Paris 09-28-2016

Transcript

  1. Streaming
    Platforms
    in the Big Data Zoo
    Sam BESSALAH - @samklr

  4. Three assumed paradigms
    - Request Response
    - Batch Processing
    - Streaming

  6. Streaming
    An unbounded flow of data that can be processed one record at a time,
    or a few records at a time.
    Basically:
    while (isRunning) {
      // process data ...
      // store it and/or send it somewhere else ...
    }
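
A minimal Python sketch of the loop above, assuming an in-memory source and sink. The names `run_stream`, `micro_batches`, `source`, `sink` and `process` are illustrative and not from the talk. The first function handles one record at a time; the second groups records into small batches ("some more at a time").

```python
from typing import Callable, Iterable, Iterator, List


def run_stream(source: Iterable[int],
               process: Callable[[int], int],
               sink: List[int]) -> None:
    """Consume records one at a time until the source is exhausted."""
    for record in source:          # a real streaming source would never end
        result = process(record)   # process data
        sink.append(result)        # store and/or send it somewhere else


def micro_batches(source: Iterable[int], size: int) -> Iterator[List[int]]:
    """Group a stream into small batches of at most `size` records."""
    batch: List[int] = []
    for record in source:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:                      # flush the last, possibly partial, batch
        yield batch


sink: List[int] = []
run_stream(iter([1, 2, 3]), lambda x: x * 10, sink)
batches = list(micro_batches(iter([1, 2, 3, 4, 5]), size=2))
```

Here the finite lists only stand in for an unbounded source so that the sketch terminates.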

  7. Examples of Streaming Apps

  8. Stream Processing Legends
    - Not precise: you have to accept approximate results
    - Lossy
    - Unstable
    - Does not match batch processing
    - Transient

  9. Solution: Lambda Architecture,
    Big Data, Circa 2013

  12. Lambda
    Architecture
    - Good for its time, but comes with pain points
    - Pros: keeps the source data unchanged, emphasizes
    reprocessing of the data, and forces operations
    onto materialized views
    - Cons: two separate codebases in two distributed
    systems, each with its own complexity, and
    painful to manage
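
To make the trade-off concrete, here is a small Python sketch of the Lambda idea, assuming a toy word-count workload. The function names (`batch_view`, `speed_view`, `query`) are hypothetical and only illustrate the three layers; they are not taken from any framework.

```python
from collections import Counter
from typing import Dict, Iterable


def batch_view(master_dataset: Iterable[str]) -> Dict[str, int]:
    """Batch layer: recompute counts from the immutable master dataset."""
    return dict(Counter(master_dataset))


def speed_view(recent_events: Iterable[str]) -> Dict[str, int]:
    """Speed layer: counts for events not yet absorbed by a batch run.

    The counting logic is written a second time here, which is the
    duplication listed under the cons.
    """
    view: Dict[str, int] = {}
    for key in recent_events:
        view[key] = view.get(key, 0) + 1
    return view


def query(key: str, batch: Dict[str, int], speed: Dict[str, int]) -> int:
    """Serving layer: merge both materialized views at query time."""
    return batch.get(key, 0) + speed.get(key, 0)


batch = batch_view(["a", "b", "a"])   # events up to the last batch run
speed = speed_view(["a", "c"])        # events since the last batch run
```

With these inputs, `query("a", batch, speed)` combines two occurrences from the batch view with one from the speed view.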

  13. Lambda Architecture
    Extend streaming platforms to handle the whole operation:
    after all, batch and stream can be interchangeable

  14. Welcome to the Zoo

  15. Properties of
    Effective Streaming
    - Stream Replay

  16. Properties of
    Effective Streaming
    - Lineage Tracking

  17. Properties of Effective
    Streaming
    - State Checkpointing

  18. Properties of Effective
    Streaming
    - State Management
    Non-trivial, and real-world apps need state:
    f(input, state) => (output, state)
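
As an illustration of the signature `f(input, state) => (output, state)`, here is a Python sketch of a running average. The choice of a running average and the names `running_average` and `run` are assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple

State = Tuple[float, int]  # (sum of values so far, number of values so far)


def running_average(value: float, state: State) -> Tuple[float, State]:
    """f(input, state) => (output, state) for a running average."""
    total, count = state
    new_state = (total + value, count + 1)
    return new_state[0] / new_state[1], new_state


def run(inputs: Iterable[float], state: State = (0.0, 0)) -> Iterator[float]:
    """Thread the state through the stream, emitting one output per input."""
    for value in inputs:
        output, state = running_average(value, state)
        yield output


averages = list(run([2.0, 4.0, 6.0]))
```

Because the function itself is pure, the engine is free to decide where the state lives and when to snapshot it.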

  19. Properties of Effective
    Streaming
    - Delivery Guarantees
    At least once: ensure all operators see every record,
    and replay the stream in case of failure.
    Exactly once: ensure that operators do not process
    duplicate updates.
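
The difference between the two guarantees can be sketched in Python, assuming each record carries a unique offset. The replay simulation and the function names are hypothetical, and the de-duplication by a set of seen offsets is a stand-in: real engines typically rely on checkpoints or transactions instead.

```python
from typing import List, Set, Tuple

Record = Tuple[int, int]  # (offset, value); offsets are unique per record


def deliver_with_replay(records: List[Record], fail_after: int) -> List[Record]:
    """Simulate at-least-once delivery: fail, then replay from the start."""
    return records[:fail_after] + records


def sum_at_least_once(delivered: List[Record]) -> int:
    """Every record is seen, but replayed records are counted again."""
    return sum(value for _, value in delivered)


def sum_exactly_once(delivered: List[Record]) -> int:
    """Skip offsets already applied, so each update counts exactly once."""
    seen: Set[int] = set()
    total = 0
    for offset, value in delivered:
        if offset in seen:
            continue
        seen.add(offset)
        total += value
    return total


records = [(0, 5), (1, 7), (2, 1)]
delivered = deliver_with_replay(records, fail_after=2)
```

Here the first two records are delivered twice, so the at-least-once sum overcounts while the de-duplicated sum matches the true total of 13.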

  20. Properties of Effective
    Streaming
    - Delivery Guarantees
    - Fault Tolerance
    - Latency
    - Throughput
    - Scalability

  46. - Stateful dataflow operators
    - State access patterns:
      - Local state: current state of a specific operator
      - Partitioned state: maintains state across partitions
    - Direct Stream API: mapWithState(), flatMapWithState(), etc.
    - Checkpointing and savepoints
    - Exactly-once semantics (or at least that is the claim)
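
A rough Python sketch of what partitioned (per-key) state looks like from the user's side. The helper `map_with_state` below only mimics the shape of such an operator in plain Python; it is a hypothetical stand-in, not the API of any engine, and `count_per_key` is an illustrative user function.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def map_with_state(
    records: Iterable[Tuple[str, str]],
    fn: Callable[[str, Optional[int]], Tuple[int, Optional[int]]],
) -> List[Tuple[str, int]]:
    """Apply fn to each record, keeping one independent state slot per key."""
    state: Dict[str, int] = {}
    out: List[Tuple[str, int]] = []
    for key, value in records:
        result, new_state = fn(value, state.get(key))
        if new_state is None:
            state.pop(key, None)   # returning None clears the key's state
        else:
            state[key] = new_state
        out.append((key, result))
    return out


def count_per_key(value: str, seen: Optional[int]) -> Tuple[int, Optional[int]]:
    """Running count of records observed for the current key."""
    count = (seen or 0) + 1
    return count, count


counts = map_with_state([("a", "x"), ("b", "y"), ("a", "z")], count_per_key)
```

The user function only ever sees the state of the key it is processing, which is what lets an engine spread keys across partitions.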

  49. Checkpointing recovery

  51. Remember
    You can only claim exactly-once
    if your source lets you
    rewind the stream.
    Hence Kafka …
    again
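
Why rewinding matters can be sketched in Python, assuming a checkpoint that stores the source offset together with the operator state. `ReplayableLog` is a hypothetical in-memory stand-in for a rewindable source such as a Kafka partition, and the checkpoint interval of two records is arbitrary.

```python
from typing import Dict, List


class ReplayableLog:
    """Hypothetical rewindable source: records can be re-read from any offset."""

    def __init__(self, records: List[int]) -> None:
        self._records = records

    def read_from(self, offset: int) -> List[int]:
        return self._records[offset:]


def process(log: ReplayableLog, checkpoint: Dict[str, int],
            crash_at: int = -1) -> Dict[str, int]:
    """Sum records starting at the checkpointed offset.

    A checkpoint is taken every 2 records. On a simulated crash the last
    checkpoint is returned and all work done after it is discarded.
    """
    offset, total = checkpoint["offset"], checkpoint["total"]
    for value in log.read_from(offset):
        if offset == crash_at:
            return checkpoint          # crash: uncheckpointed work is lost
        total += value
        offset += 1
        if offset % 2 == 0:            # snapshot offset and state together
            checkpoint = {"offset": offset, "total": total}
    return {"offset": offset, "total": total}


log = ReplayableLog([1, 2, 3, 4, 5])
start = {"offset": 0, "total": 0}
after_crash = process(log, start, crash_at=3)   # fails before reading offset 3
recovered = process(log, after_crash)           # rewind to offset 2 and resume
```

The record at offset 2 is read twice, but because state and offset are restored together it contributes to the final total only once.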

  52. Streams

  55. Spark, Flink and Storm
    - Distributed
    - Cluster Managers
    - Huge overhead
    - Come on top of another platform
    Kafka Streams is just a library

  56. Kafka Streams
    Not just for Streaming Analytics
    Beyond Big Data: bridges the analytics, transactional,
    operational and services worlds.
    Performant: uses the best of Kafka to remain lightweight and simple

  64. Kafka Streams
    - Streams are the dual of tables
    - A stream is a changelog of a table
    - A table is a materialized view of a stream
    - Same idea as Change Data Capture in databases
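
The duality can be sketched in a few lines of Python, assuming a changelog of `(key, value)` upserts where a `None` value marks a delete. The functions `materialize` and `to_changelog` are illustrative names, not part of the Kafka Streams API.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[int]]  # (key, new value); None means delete


def materialize(changelog: List[Change]) -> Dict[str, int]:
    """Stream -> table: replay the changelog, keeping the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def to_changelog(table: Dict[str, int]) -> List[Change]:
    """Table -> stream: emit one upsert per row (a compacted changelog)."""
    return [(key, value) for key, value in table.items()]


changelog: List[Change] = [("alice", 1), ("bob", 4), ("alice", 3), ("bob", None)]
table = materialize(changelog)
```

Replaying `to_changelog(table)` through `materialize` gives back the same table, which is the sense in which the two representations are interchangeable.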

  67. Kafka Streams
    Fault tolerance through Kafka

  68. Apache Beam
    An attempt to provide a unified batch + streaming
    programming model for portable data processing
    pipelines.
    Provides a Java SDK, plus DSLs in other
    languages.
    And a handful of streaming engines as runners:
    Spark, Flink, Dataflow, etc.
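
The idea of one pipeline definition executed by interchangeable runners can be sketched in plain Python. This deliberately does not use the Beam SDK: `pipeline`, `run_batch` and `run_streaming` are hypothetical names, and the two "runners" are toy functions that only show how the same transforms can serve a bounded and an unbounded input.

```python
from itertools import count, islice
from typing import Callable, Iterable, Iterator, List

Transform = Callable[[Iterable[int]], Iterator[int]]


def pipeline() -> List[Transform]:
    """One pipeline definition, independent of how it is executed."""
    return [
        lambda xs: (x for x in xs if x % 2 == 0),   # keep even values
        lambda xs: (x * x for x in xs),             # square them
    ]


def run_batch(transforms: List[Transform], data: List[int]) -> List[int]:
    """Toy batch runner: bounded input, fully materialized output."""
    stream: Iterable[int] = data
    for transform in transforms:
        stream = transform(stream)
    return list(stream)


def run_streaming(transforms: List[Transform],
                  source: Iterator[int]) -> Iterator[int]:
    """Toy streaming runner: unbounded input, lazily produced output."""
    stream: Iterable[int] = source
    for transform in transforms:
        stream = transform(stream)
    return iter(stream)


batch_result = run_batch(pipeline(), [1, 2, 3, 4])
stream_head = list(islice(run_streaming(pipeline(), count(1)), 3))
```

The streaming run reads from an endless counter, so only the first three outputs are taken.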

  70. Principles of the Beam Model

  73. More about Beam/Google Dataflow
    - The World Beyond Batch: Streaming 101
    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
    - The World Beyond Batch: Streaming 102
    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
    - Dataflow Beam and Spark Comparison
    https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison#logistics

  74. When to use them?

  78. Bibliography and links
    http://www.slideshare.net/stephanewen1/continuous-processing-with-apache-flink-strata-london-2016
    http://www.slideshare.net/stephanewen1/apache-flink-overview-and-use-cases-at-prehadoop-summit-meetups
    http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming?qid=fb518816-18bd-4771-8e76-2e6ee58661de&v=&b=&from_search=1
    http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
    http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
    http://www.confluent.io/blog/elastic-scaling-in-kafka-streams/
