
Streaming Platforms

Sam Bessalah
September 28, 2016

Criteo Labs. Paris / Paris Data Meetup
Paris 09-28-2016


Transcript

  1. Streaming Platforms in the Big Data Zoo - Sam BESSALAH - @samklr
  2. Three assumed paradigms - Request Response

  3. Three assumed paradigms - Request Response - Batch Processing

  4. Three assumed paradigms - Request Response - Batch Processing - Streaming
  5. Streaming - An unbounded flow of data that can be processed one item at a time, or a few at a time.
  6. Streaming - An unbounded flow of data that can be processed one item at a time, or a few at a time. Basically: while (isRunning) { process data ... store and/or send it somewhere else ... }
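
The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not taken from the talk: `run_stream`, `micro_batches`, the in-memory `sink` list, and the `is_running` callback are hypothetical names standing in for a real source, a real sink, and a real shutdown signal.

```python
from typing import Callable, Iterable, Iterator, List


def run_stream(source: Iterable[int],
               process: Callable[[int], int],
               sink: List[int],
               is_running: Callable[[], bool] = lambda: True) -> None:
    """One-at-a-time processing: pull a record, process it, forward the result."""
    for record in source:
        if not is_running():
            break
        sink.append(process(record))


def micro_batches(source: Iterable[int], size: int) -> Iterator[List[int]]:
    """'A few at a time': group the same flow into small batches of at most `size`."""
    batch: List[int] = []
    for record in source:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


out: List[int] = []
run_stream(iter([1, 2, 3]), lambda x: x * 10, out)
batches = list(micro_batches([1, 2, 3, 4, 5], 2))
```

Both functions consume the same kind of source; only the unit of work differs, which is the distinction the slide draws between record-at-a-time and micro-batch processing.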
  7. Examples of Streaming Apps

  8. Stream Processing Legends - Not precise, need to accept approximate results - Lossy - Unstable - Does not match batch processing - Transient
  9. Solution : Lambda Architecture, Big Data, Circa 2013

  10. None
  11. Lambda Architecture

  12. Lambda Architecture - Good for its time, but comes with pain points
    - Pros: keeps the source data unchanged, emphasizes the issue of reprocessing the data, forces operations onto materialized views
    - Cons: two separate codebases in two distributed systems, each with its own complexity, and painful to manage
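
To make the "two separate codebases" pain point concrete, here is a toy sketch of the Lambda pattern, assuming a page-view counting job. The event shape and the function names (`batch_view`, `speed_view`, `query`) are hypothetical and chosen only for illustration; in a real deployment the two views would be computed by two different distributed systems.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[str, int]  # (page, timestamp): a hypothetical event shape


def batch_view(master_log: List[Event], cutoff: int) -> Counter:
    """Batch layer: recompute counts from the immutable log up to the cutoff."""
    return Counter(page for page, ts in master_log if ts <= cutoff)


def speed_view(recent: List[Event], cutoff: int) -> Counter:
    """Speed layer: count only events newer than the last batch run."""
    return Counter(page for page, ts in recent if ts > cutoff)


def query(page: str, batch: Counter, speed: Counter) -> int:
    """Serving layer: merge both materialized views at read time."""
    return batch[page] + speed[page]


log = [("home", 1), ("home", 2), ("cart", 3), ("home", 4)]
cutoff = 2
batch = batch_view(log, cutoff)
speed = speed_view(log, cutoff)
home_views = query("home", batch, speed)
cart_views = query("cart", batch, speed)
```

The source log is never mutated and any view can be rebuilt from it, which is the "pro" on the slide. The same counting logic is written twice, once per layer, which is the "con".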
  13. Lambda Architecture - Extend streaming platforms to handle the whole operation; after all, batch and stream can be interchangeable
  14. Welcome to the Zoo

  15. Properties of Effective Streaming - Stream Replay

  16. Properties of Effective Streaming - Lineage Tracking

  17. Properties of Effective Streaming - State Checkpointing

  18. Properties of Effective Streaming - State Management: non-trivial, real-world apps need state: f(input, state) => (output, state)
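
The signature f(input, state) => (output, state) can be illustrated with a running word count. This is a sketch under assumptions: `count_step` and `run` are hypothetical names, and the state is a plain in-memory dict rather than a managed state store.

```python
from typing import Callable, Dict, Iterable, List, Tuple

State = Dict[str, int]
Output = Tuple[str, int]


def count_step(word: str, state: State) -> Tuple[Output, State]:
    """f(input, state) => (output, state): emit the running count for a word."""
    new_state = dict(state)  # copy, so the previous state stays untouched
    new_state[word] = new_state.get(word, 0) + 1
    return (word, new_state[word]), new_state


def run(inputs: Iterable[str],
        step: Callable[[str, State], Tuple[Output, State]],
        state: State) -> Tuple[List[Output], State]:
    """Thread the state through every call of the step function."""
    outputs: List[Output] = []
    for item in inputs:
        out, state = step(item, state)
        outputs.append(out)
    return outputs, state


outputs, final_state = run(["a", "b", "a"], count_step, {})
```

Because the step function is pure, the state returned after any input is a natural candidate for a checkpoint: restoring it and replaying the later inputs yields the same outputs.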
  19. Properties of Effective Streaming - Delivery Guarantees
    - At least once: ensure all operators see every record, and replay the stream in case of failure.
    - Exactly once: ensure that operators do not process duplicate updates.
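
The difference between the two guarantees can be shown with a small simulation. It assumes records carry monotonically increasing offsets within a single partition; `deliver_at_least_once` and `apply_exactly_once` are hypothetical helpers, not the API of any streaming engine.

```python
from typing import List, Tuple

Record = Tuple[int, int]  # (offset, value)


def deliver_at_least_once(log: List[Record], crash_after: int) -> List[Record]:
    """Simulate a consumer that crashes before committing, then replays from offset 0."""
    seen: List[Record] = []
    # First attempt: some records are processed, but nothing is committed.
    seen.extend(log[:crash_after])
    # Recovery: the whole stream is replayed, so early records arrive twice.
    seen.extend(log)
    return seen


def apply_exactly_once(deliveries: List[Record]) -> int:
    """Skip offsets that were already applied, so each update counts once."""
    total = 0
    last_applied = -1
    for offset, value in deliveries:
        if offset <= last_applied:
            continue  # duplicate produced by the replay
        total += value
        last_applied = offset
    return total


log = [(0, 5), (1, 7), (2, 11)]
deliveries = deliver_at_least_once(log, crash_after=2)
naive_total = sum(value for _, value in deliveries)
exact_total = apply_exactly_once(deliveries)
```

With at-least-once delivery alone the naive sum double-counts the replayed records. Tracking the last applied offset alongside the state turns the same delivery sequence into an exactly-once result.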
  20. Properties of Effective Streaming - Delivery Guarantees - Fault Tolerance - Latency - Throughput - Scalability
  21. None
  22. None
  23. None
  24. None
  25. None
  26. None
  27. None
  28. None
  29. None
  30. None
  31. None
  32. None
  33. None
  34. None
  35. None
  36. None
  37. None
  38. None
  39. None
  40. None
  41. None
  42. None
  43. None
  44. None
  45. None
  46. - Stateful dataflow operators
    - State access patterns:
    - Local state: the current state of a specific operator
    - Partitioned state: state maintained across partitions
    - Direct Stream API: mapWithState(), flatMapWithState(), etc.
    - Checkpointing and savepoints
    - Exactly-once semantics (at least, that is what they claim)
  47. None
  48. None
  49. Checkpointing recovery

  50. Checkpointing recovery

  51. Remember: you can only claim exactly once if your source enables you to rewind the stream. Hence Kafka … again
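
Why a rewindable source matters can be sketched as follows. This is an illustrative simulation, not the recovery protocol of any specific engine: `RewindableSource`, `run_with_checkpoints`, and the single injected failure at `fail_at` are all hypothetical. The key assumption is that the source offset and the operator state are saved together in one checkpoint.

```python
from typing import List, Optional, Tuple


class RewindableSource:
    """A toy source that, like a log, can be re-read from any offset."""

    def __init__(self, records: List[int]) -> None:
        self._records = list(records)
        self.position = 0

    def seek(self, offset: int) -> None:
        self.position = offset

    def poll(self) -> Optional[int]:
        if self.position >= len(self._records):
            return None
        value = self._records[self.position]
        self.position += 1
        return value


def run_with_checkpoints(source: RewindableSource,
                         checkpoint_every: int,
                         fail_at: Optional[int] = None) -> int:
    """Sum the stream, snapshotting (offset, state) together and recovering once."""
    total = 0
    checkpoint: Tuple[int, int] = (0, 0)  # (source offset, operator state)
    failed = False
    while True:
        if fail_at is not None and not failed and source.position == fail_at:
            failed = True
            offset, total = checkpoint  # restore the operator state ...
            source.seek(offset)         # ... and rewind the source to match it
            continue
        value = source.poll()
        if value is None:
            return total
        total += value
        if source.position % checkpoint_every == 0:
            checkpoint = (source.position, total)


with_failure = run_with_checkpoints(RewindableSource([1, 2, 3, 4, 5]), 2, fail_at=3)
without_failure = run_with_checkpoints(RewindableSource([1, 2, 3, 4, 5]), 2)
```

If the source could not seek back to the checkpointed offset, the record processed between the checkpoint and the failure would be either lost or counted twice after the state is restored.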
  52. Streams

  53. None
  54. Spark, Flink and Storm - Distributed - Cluster managers - Huge overhead - Come on top of another platform
  55. Spark, Flink and Storm - Distributed - Cluster managers - Huge overhead - Come on top of another platform. Kafka Streams is just a library
  56. Kafka Streams - Not just for streaming analytics. Beyond Big Data, bridging the analytics and transactional, operational and services worlds. Performant, uses the best of to remain lightweight and simple
  57. None
  58. Kafka streams

  59. Kafka streams

  60. Kafka streams

  61. Kafka streams

  62. Kafka streams

  63. Kafka streams

  64. Kafka streams
    - Streams are the dual of tables
    - A stream is a changelog of a table
    - A table is a materialized view of a stream
    - Same as Change Data Capture in databases
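
The stream/table duality can be shown without any Kafka dependency. The sketch below uses plain Python structures: `materialize` and `to_changelog` are hypothetical helper names, and treating a `None` value as a delete marker is an assumption made for this example.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[int]]  # (key, new value); None is assumed to mean "delete"


def materialize(changelog: List[Change]) -> Dict[str, int]:
    """Table as a materialized view: replay the changelog, keeping the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def to_changelog(table: Dict[str, int]) -> List[Change]:
    """Stream as a changelog: emit one change record per current row."""
    return [(key, value) for key, value in table.items()]


changelog: List[Change] = [("alice", 1), ("bob", 4), ("alice", 3), ("bob", None)]
table = materialize(changelog)
round_trip = materialize(to_changelog(table))
```

Replaying the changelog rebuilds the table, and dumping the table yields a shorter changelog that rebuilds the same table. This is the same idea as Change Data Capture, where a database's row changes are exposed as a stream.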
  65. Kafka streams

  66. Kafka streams

  67. Kafka streams Fault tolerance through Kafka

  68. Apache Beam - An attempt to provide a unified batch + streaming programming model for portable data processing pipelines. Provides a Java SDK and DSLs in other languages, and a handful of streaming engines as runners: Spark, Flink, Dataflow, etc.
  69. None
  70. Principles of the Beam Model

  71. None
  72. None
  73. More about Beam/Google Dataflow
    - The World Beyond Batch and Streaming 1: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
    - The World Beyond Batch and Streaming 2: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
    - Dataflow Beam and Spark Comparison: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison#logistics
  74. When to use them ?

  75. When to use them ?

  76. None
  77. None
  78. Bibliography and links
    - http://www.slideshare.net/stephanewen1/continuous-processing-with-apache-flink-strata-london-2016
    - http://www.slideshare.net/stephanewen1/apache-flink-overview-and-use-cases-at-prehadoop-summit-meetups
    - http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming?qid=fb518816-18bd-4771-8e76-2e6ee58661de&v=&b=&from_search=1
    - http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
    - http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
    - www.confluent.io/blog/elastic-scaling-in-kafka-streams/