Staging reactive data pipelines using Kafka as the backbone

At Cake Solutions, we build highly distributed and scalable systems using Kafka as our core data pipeline.

Kafka has become the de facto platform for reliable and scalable distribution of high volumes of data. However, as a developer, it can be challenging to figure out the best architecture and consumption patterns for interacting with Kafka while delivering quality-of-service properties such as high availability and delivery guarantees. It can also be difficult to understand the various streaming patterns and messaging topologies available in Kafka.

In this talk, we present the patterns we've successfully employed in production and provide the tools and guidelines for other developers to choose the most appropriate fit for a given data processing problem. The key points of the presentation are: patterns for building reactive data pipelines, high availability and message delivery guarantees, clustering of application consumers, topic partition topology, offset commit patterns, performance benchmarks, and a custom reactive, asynchronous, non-blocking Kafka driver.

https://github.com/cakesolutions/scala-kafka-client

Jaakko Pallari

October 04, 2016

Transcript

  1. None
  2. Jaakko Pallari (@lepovirta), Simon Souter (@simonsouter) • Staging Reactive data pipelines using Kafka as the backbone • cakesolutions/scala-kafka-client
  3. Reactive Solutions at Cake • MANCHESTER, LONDON, NEW YORK
  4. Contents • 1. Reactive Data Pipelines • 2. Kafka as a Reactive Message Queue • 3. Architecture & Consumer Patterns • 4. Streaming Application Development
  5. Stream Processing • Big Data • Processing in Real-time • Event Throughput vs Number of Queries • IoT [Diagram: Source, Service, Sink]
  6. Distributed Streaming Engines • Server Applications • Stream topologies deployed to cluster • Framework design
  7. Streaming from ground-up • Custom Streaming Applications • Leverage existing tool stack [Diagram: Source, Application, Sink]
  8. Staged data pipelines • Staged Event Driven Architecture • Processes separated by a queue • Processing in stages [Diagram: Process, Queue, Process, Queue, Queue]
  9. Reactive data pipelines • Responsive • Resilient • Elastic • Message Driven [Diagram: Source, Process, Queue, Process, Sink]
  10. Streaming from ground-up • Microservices as processing components [Diagram: Source, Microservice 1 (x3), Queue, Microservice 2 (x3), Sink]
  11. Streaming from ground-up • Deployment via cluster orchestration services [Diagram: Source, Microservice 1 (x3), Queue, Microservice 2 (x3), Sink, Orchestration Service, Scale up]
  12. Streaming from ground-up • Messaging middleware for resilient data distribution between microservices [Diagram: Source, Microservice 1 (x3), Queue, Microservice 2 (x3), Sink]
  13. What is Kafka? • Distributed Message Broker • Supports Parallel Streaming • Kafka as a Reactive MQ [Diagram: Source, Microservice 1 (x3), Kafka, Microservice 2 (x3), Sink]
  14. Kafka: topic and message anatomy • Message Driven [Diagram: Kafka Topic “Electric_Readings”, Key: “meter1”, Value: 1.34, Electric Bill Calculation, Auditing]
  15. Kafka: at-least-once delivery • Resilient [Diagram: Electric meter, Kafka Topic “Electric_Readings”, Consumption Aggregator, Deliver, ACK, Deliver, ACK]
  16. Kafka: clustering - arrangement • Elastic [Diagram: Kafka node 1, Kafka node 2, Kafka Topic, Partition 1, Partition 2]
  17. Kafka: clustering - replication • Resilient [Diagram: Kafka node 1, Kafka node 2, Kafka Topic, Partition 1, Partition 2, Partition 2 Replica, Partition 1 Replica]
  18. Kafka: clustering - consumer • Responsive • Same consumer group [Diagram: Kafka Topic, Partition #1, Partition #2, Partition #3, Consumer #1, Consumer #2, Consumer #3]
  19. Kafka: clustering - consumer • Responsive [Diagram: Kafka Topic, Partition #1, Partition #2, Partition #3, Consumer #1, Consumer #2, Consumer #3]
  20. Kafka: clustering - consumer • Responsive [Diagram: Kafka Topic, Partition #1, Partition #2, Partition #3, Consumer #1, Consumer #2, Consumer #3]
  21. Kafka: clustering - consumer • Responsive [Diagram: Kafka Topic, Partition #1, Partition #2, Partition #3, Consumer #1, Consumer #2, Consumer #3]
  22. Kafka: clustering - consumer • Responsive [Diagram: Kafka Topic, Partition #1, Partition #2, Partition #3, Consumer #1, Consumer #2, Consumer #3]
  23. Kafka: clustering - consumer • Responsive [Diagram: Kafka Topic, Partition #1, Partition #2, Partition #3, Consumer #1, Consumer #2, Consumer #3, Consumer #4, No Data]
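
The consumer clustering slides above can be illustrated with a small, self-contained sketch. It is not Kafka's own assignment logic (the real assignors are more involved); it only assumes a plain round-robin rule to show that each partition has exactly one owner inside a consumer group, so a fourth consumer on a three-partition topic ends up with no data. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of spreading a topic's partitions over the members of one consumer group. */
final class GroupAssignmentSketch {

    /**
     * Assigns partitions 1..partitionCount to the consumers in round-robin order.
     * Every partition gets exactly one owner; consumers beyond the partition count stay idle.
     */
    static Map<String, List<Integer>> assign(int partitionCount, List<String> consumers) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int p = 0; p < partitionCount; p++) {
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p + 1);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Four consumers, three partitions: the fourth consumer is assigned nothing.
        System.out.println(assign(3, Arrays.asList("consumer-1", "consumer-2", "consumer-3", "consumer-4")));
        // Two consumers, three partitions: one consumer handles two partitions.
        System.out.println(assign(3, Arrays.asList("consumer-1", "consumer-2")));
    }
}
```

Scaling a consumer group beyond the partition count therefore adds no throughput in this model, which is why the partition count bounds the parallelism of a single group.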
  24. Kafka: high throughput • Single partition consumer: 20-90 Mb/sec • Responsive
  25. Kafka the Reactive MQ • Message Driven: Key-value messages • Responsive: Consumer clustering, High throughput • Resilient: At-least-once delivery, Replication • Elastic: Linear scalability
  26. Kafka consumer patterns [Diagram: Source, Microservice 1 (x3), Kafka, Microservice 2 (x3), Sink]
  27. Simple message queue • Kafka Terminology: Partition Count: 1 [Diagram: Electric Meter, Electric Readings, Partition, Partition replica, Partition replica, Auditing]
  28. Simple message queue - fanout • Kafka Terminology: Partition Count: 1, Multiple Consumer Groups [Diagram: Electric Meter, Electric Readings, Partition, Partition replica, Partition replica, Auditing, Billing]
  29. Simple message queue - consumer • 1. Consume a batch of messages from Kafka • 2. Process messages and send results to wherever necessary (e.g. another Kafka topic) • 3. Confirm delivery to Kafka • Kafka Terminology: Commit Mode: Manual [Diagram: Kafka Partition, Auditing Service, Consumer Client, App logic, DB]
  30. Kafka: message confirmation • Messages confirmed by offset (not individually) • Kafka Terminology: Commit Mode: Manual [Diagram: Partition, Commit point, Consumer, Consumed]
  31. Kafka: message confirmation • Messages confirmed by offset (not individually) • Kafka Terminology: Commit Mode: Manual [Diagram: Partition, Commit point, Consumer, Commit, Consumed]
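
The consume, process, confirm loop and the commit point from the slides above can be sketched without a broker. The following is a toy in-memory model, not the Kafka client API: `ManualCommitSketch`, `poll`, `commit` and the crash switch are hypothetical names introduced only for illustration. It shows why committing by offset after processing gives at-least-once delivery: a crash between processing and commit causes the uncommitted batch to be delivered again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy in-memory partition with a single commit point, illustrating commit-by-offset. */
final class ManualCommitSketch {
    private final List<String> log;   // messages; list index == offset
    private int committed = 0;        // offset a restarted consumer resumes from

    ManualCommitSketch(List<String> messages) {
        this.log = new ArrayList<>(messages);
    }

    /** Returns up to maxBatch messages starting at fromOffset. */
    List<String> poll(int fromOffset, int maxBatch) {
        int end = Math.min(log.size(), fromOffset + maxBatch);
        return new ArrayList<>(log.subList(Math.min(fromOffset, end), end));
    }

    /** Confirms everything below nextOffset in one step; there is no per-message ACK. */
    void commit(int nextOffset) {
        committed = Math.max(committed, nextOffset);
    }

    int committedOffset() {
        return committed;
    }

    /**
     * Runs the loop from the committed offset. If crashAfterBatches is positive, the consumer
     * "crashes" after processing that many batches, before committing the last one.
     */
    static List<String> run(ManualCommitSketch partition, int batchSize, int crashAfterBatches) {
        List<String> processed = new ArrayList<>();
        int position = partition.committedOffset();
        int batches = 0;
        while (true) {
            List<String> batch = partition.poll(position, batchSize);  // 1. consume a batch
            if (batch.isEmpty()) {
                return processed;
            }
            processed.addAll(batch);                                    // 2. process
            position += batch.size();
            batches++;
            if (batches == crashAfterBatches) {
                return processed;                                       // crash before commit
            }
            partition.commit(position);                                 // 3. confirm by offset
        }
    }

    public static void main(String[] args) {
        ManualCommitSketch partition =
            new ManualCommitSketch(Arrays.asList("m0", "m1", "m2", "m3", "m4"));
        System.out.println("first run:  " + run(partition, 2, 2));
        System.out.println("after restart: " + run(partition, 2, 0));
    }
}
```

In this model the second run starts from offset 2, so `m2` and `m3` are processed twice. Downstream processing therefore has to tolerate duplicates, or the offsets have to be managed together with the processing state.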
  32. Parallel workers • Kafka Terminology: Partition Count: >1, Single Consumer Group [Diagram: Electric Meter (x3), Electric Readings, Partition #1, Partition #2, Partition #N, Auditing node #1, Auditing node #2, Auditing node #N]
  33. Consumer for parallel processing • Same arrangement from consumer perspective • Kafka Terminology: Partition Count: >1, Commit Mode: Manual [Diagram: Kafka Partition (x3), Auditing Service, Consumer Client, App logic, DB]
  34. Orchestration • Provide Scaling Capability • Restart or replace failed nodes [Diagram: Electric Meter (x3), Electric Readings, Partition #1, Partition #2, Partition #N, Auditing node #1, Auditing node #2, Auditing node #N, Mesos/Marathon, New node]
  35. Stateful Processing • Example: Average electricity consumption per meter for the last hour [Diagram: Electric Meter (x3), Electric Readings, Partition (x3), Aggregation]
  36. Stream and state • Data locality [Diagram: Electric Readings, Aggregator for Partition #1, Aggregator for Partition #2]
  37. Stream and state • Data locality [Diagram: Electric Readings, Aggregator for Partition #1, Aggregator for Partition #2, Key: "meter 1" Value: 9.2, Key: "meter 2" Value: 2.7]
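
Data locality in the slides above relies on all readings for one key landing in the same partition, so a single aggregator sees every value for that meter. A minimal sketch of the idea follows. It assumes a simple `hashCode`-based mapping, which is not the hash function Kafka's own partitioner uses, and it keeps an all-time running average rather than the one-hour window from the example. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of key-based routing and per-partition aggregation state. */
final class KeyLocalitySketch {

    /** Maps a message key to a partition index in [0, partitionCount). */
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Running average per key, as one aggregator instance would hold for its partition. */
    static final class Averages {
        private final Map<String, double[]> sumAndCount = new HashMap<>();

        void add(String key, double value) {
            double[] acc = sumAndCount.computeIfAbsent(key, k -> new double[2]);
            acc[0] += value;  // sum of readings
            acc[1] += 1;      // number of readings
        }

        double average(String key) {
            double[] acc = sumAndCount.get(key);
            return acc == null ? Double.NaN : acc[0] / acc[1];
        }
    }

    public static void main(String[] args) {
        int partitions = 2;
        Averages[] aggregators = { new Averages(), new Averages() };
        String[] keys = { "meter 1", "meter 2", "meter 1" };
        double[] values = { 9.0, 2.7, 10.0 };
        for (int i = 0; i < keys.length; i++) {
            // The same key always routes to the same aggregator.
            aggregators[partitionFor(keys[i], partitions)].add(keys[i], values[i]);
        }
        Averages owner = aggregators[partitionFor("meter 1", partitions)];
        System.out.println("meter 1 average: " + owner.average("meter 1"));
    }
}
```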
  38. Fault tolerance • State persistence and recovery [Diagram: Electric Readings, Aggregator for Partition #1, Aggregator for Partition #2]
  39. Fault tolerance • State persistence and recovery [Diagram: Electric Readings, Aggregator for Partition #1, Aggregator for Partition #2, Persistence]
  40. Stateful Processing app [Diagram: Kafka Partition (x3), Aggregation Service, Consumer Client, Aggregation logic, Persistence, Kafka/DB/?]
  41. Stateful Processing app • Duplicated message processing after recovery. [Diagram: Kafka Partition (x3), Aggregation Service (x2), Consumer Client (x2), Aggregation logic (x2), Persistence, Kafka/DB/?]
  42. Stateful Processing app • Persist state with partition offsets • Don't commit! Just fetch more data • Kafka Terminology: Commit Mode: Self Managed Offsets [Diagram: Kafka Partition (x3), Aggregation Service, Consumer Client, Aggregation logic, Persistence, Kafka/DB/?]
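
The "persist state with partition offsets" idea above can be sketched as follows. This is a toy model under stated assumptions, not the talk's implementation: the `Store` class stands in for the persistence layer (the slide's "Kafka/DB/?"), its `save` is assumed to be atomic, and all names are hypothetical. Because the aggregate and the offset it corresponds to are written as one unit, a restarted instance resumes exactly where the snapshot left off and no reading is counted twice.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of self-managed offsets: state and offset are persisted together. */
final class SelfManagedOffsetsSketch {

    /** Aggregation state plus the next offset to read, stored as one unit. */
    static final class Snapshot {
        final double sum;
        final int nextOffset;

        Snapshot(double sum, int nextOffset) {
            this.sum = sum;
            this.nextOffset = nextOffset;
        }
    }

    /** Stand-in for the persistence layer; save is assumed to be atomic. */
    static final class Store {
        private Snapshot latest = new Snapshot(0.0, 0);

        void save(Snapshot snapshot) {
            latest = snapshot;
        }

        Snapshot load() {
            return latest;
        }
    }

    /**
     * Resumes from the stored snapshot and sums readings until the partition ends or
     * stopAtOffset is reached (simulating a crash). Saves a snapshot every `interval` messages.
     */
    static double run(List<Double> partition, Store store, int interval, int stopAtOffset) {
        Snapshot snapshot = store.load();
        double sum = snapshot.sum;
        int offset = snapshot.nextOffset;  // taken from the store, not from a Kafka commit
        while (offset < partition.size() && offset < stopAtOffset) {
            sum += partition.get(offset);
            offset++;
            if (offset % interval == 0) {
                store.save(new Snapshot(sum, offset));
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Double> readings = Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0);
        Store store = new Store();
        run(readings, store, 2, 3);  // crashes after three messages; snapshot is at offset 2
        System.out.println("total after recovery: " + run(readings, store, 2, Integer.MAX_VALUE));
    }
}
```

The message at offset 2 is read again after the crash, but the state it was first added to was never persisted, so the recovered total still counts it once.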
  43. Stateful Processing architecture • Dynamic partition assignment • Shared Persistence for State [Diagram: Orchestration Service, Aggregator 1, Aggregator 2, Aggregator 3, Persistence, Kafka/DB/?, Partition #1, Partition #1, Partition #4, Partition #6, Partition #1, Partition #2]
  44. Stateful Processing architecture • Dynamic partition assignment • Shared Persistence for State [Diagram: Orchestration Service, Aggregator 1, Aggregator 2, Aggregator 3, Persistence, Kafka/DB/?, Partition #1, Partition #1, Partition #4, Partition #6, Partition #1, Partition #2]
  45. Streaming Patterns • Stateful Processing: Self-managed processing state • Single Partition Topic: Strong ordering guarantees, Limited failure recovery, Scalability is limited • Multi Partition Topic: Parallel processing, Limited ordering guarantees, Kafka managed processing state • Fanout: Independent consumer groups
  46. Kafka libraries • Kafka client support in many languages • Scala, Java, C • C bindings -> Haskell, OCaml, Python etc. [Diagram: Source, Microservice 1 (x3), Kafka, Microservice 2 (x3), Sink]
  47. Reactive Streaming APIs • Similar paradigm as in real-time streaming platforms • Reactive Kafka ◦ Based on Akka Reactive Streams API ◦ Scala + Java ◦ Developed by Akka team • Kafka Streams ◦ Official streaming API for Kafka ◦ Java ◦ Developed by Confluent
  48. scala-kafka-client • Kafka client developed for Scala • Async and non-blocking • Built on top of the official Java driver • Easy API with high performance • cakesolutions/scala-kafka-client
  49. scala-kafka-client • Leverage extensive Akka feature set • Processing logic implemented using Actor Model • cakesolutions/scala-kafka-client [Diagram: Kafka, Kafka Consumer Actor, Receiver Actor, Kafka Producer Actor, Kafka]
  50. Summary • Leverage Microservice based techniques. • Streaming topologies can be varied and complex ◦ Many use-cases fall under a small set of consumer patterns. • Challenges around scalable and reactive data pipelines • Kafka provides first-class support for reactive streaming to your applications. • Stateful processing remains a challenging area.
  51. We didn’t discuss... • Data serialisation • Application rolling updates • Complex streaming topologies
  52. Questions? • MANCHESTER, LONDON, NEW YORK • cakesolutions/scala-kafka-client • @cakesolutions • +44 845 617 1200 • enquiries@cakesolutions.net