Staging reactive data pipelines using Kafka as the backbone

At Cake Solutions, we build highly distributed and scalable systems using Kafka as our core data pipeline.

Kafka has become the de facto platform for reliable and scalable distribution of high volumes of data. However, as a developer, it can be challenging to figure out the best architecture and consumption patterns for interacting with Kafka while still delivering quality-of-service properties such as high availability and delivery guarantees. It can also be difficult to understand the various streaming patterns and messaging topologies available in Kafka.

In this talk, we present the patterns we've successfully employed in production and provide tools and guidelines that help other developers choose the most appropriate fit for a given data processing problem. The key points of the presentation are: patterns for building reactive data pipelines, high availability and message delivery guarantees, clustering of application consumers, topic partition topology, offset commit patterns, performance benchmarks, and a custom reactive, asynchronous, non-blocking Kafka driver.

https://github.com/cakesolutions/scala-kafka-client

Jaakko Pallari

October 04, 2016
Transcript

  1. Jaakko Pallari (@lepovirta) Simon Souter (@simonsouter) Staging Reactive data pipelines

    using Kafka as the backbone /cakesolutions /scala-kafka-client
  2. Contents

    1. Reactive Data Pipelines
    2. Kafka as a Reactive Message Queue
    3. Architecture & Consumer Patterns
    4. Streaming Application Development
  3. Stream Processing • Big Data • Processing in Real-time •

    Event Throughput vs Number of Queries • IoT Source Service Sink
  4. Staged data pipelines • Staged Event Driven Architecture • Processes

    separated by a queue • Processing in stages Process Queue Process Queue Queue
  5. Reactive data pipelines • Responsive • Resilient • Elastic •

    Message Driven Process Queue Process Source Sink
  6. Streaming from ground-up • Microservices as processing components Source Microservice

    1 Microservice 2 Microservice 1 Microservice 1 Microservice 2 Microservice 2 Sink Queue
  7. • Deployment via cluster orchestration services Streaming from ground-up Source

    Microservice 1 Queue Microservice 2 Microservice 1 Microservice 1 Microservice 2 Microservice 2 Sink Orchestration Service Scale up
  8. Streaming from ground-up • Messaging middleware for resilient data distribution

    between microservices Source Microservice 1 Queue Microservice 2 Microservice 1 Microservice 1 Microservice 2 Microservice 2 Sink
  9. What is Kafka? • Distributed Message Broker • Supports Parallel

    Streaming • Kafka as a Reactive MQ Source Microservice 1 Kafka Microservice 2 Microservice 1 Microservice 1 Microservice 2 Microservice 2 Sink
  10. Kafka Topic: “Electric_Readings” Kafka: topic and message anatomy Key: “meter1”

    Value: 1.34 Electric Bill Calculation Auditing Message Driven
  11. Kafka node 2 Kafka node 1 Kafka: clustering - arrangement

    Kafka Topic Partition 1 Partition 2 Elastic
  12. Kafka: clustering - replication Resilient Kafka node 2 Kafka node

    1 Kafka Topic Partition 1 Partition 2 Partition 2 Replica Partition 1 Replica
  13. Kafka: clustering - consumer Partition #1 Partition #2 Partition #3

    Consumer #1 Consumer #2 Consumer #3 Kafka Topic Responsive Same consumer group
  14. Kafka: clustering - consumer Partition #1 Partition #2 Partition #3

    Consumer #1 Consumer #2 Consumer #3 Kafka Topic Responsive
  18. Kafka: clustering - consumer Partition #1 Partition #2 Partition #3

    Consumer #1 Consumer #2 Kafka Topic Responsive Consumer #3 Consumer #4 No Data
  19. Kafka the Reactive MQ

    • Message Driven: key-value messages
    • Responsive: consumer clustering, high throughput
    • Resilient: at-least-once delivery, replication
    • Elastic: linear scalability
  20. Kafka consumer patterns Source Microservice 1 Kafka Microservice 2 Microservice

    1 Microservice 1 Microservice 2 Microservice 2 Sink
  21. Simple message queue Partition Electric Meter Auditing Electric Readings Partition

    replica Partition replica Kafka Terminology: - Partition Count: 1
  22. Simple message queue - fanout Partition Electric Meter Auditing Electric

    Readings Partition replica Partition replica Billing Kafka Terminology: - Partition Count: 1 - Multiple Consumer Groups
  23. Simple message queue - consumer

    Diagram labels: DB, Auditing Service, Consumer Client, App logic, Kafka Partition
    1. Consume a batch of messages from Kafka
    2. Process messages and send results to wherever necessary (e.g. another Kafka topic)
    3. Confirm delivery to Kafka
    Kafka Terminology: - Commit Mode: Manual
  24. Partition Kafka: message confirmation • Messages confirmed by offset (not

    individually) Commit point Consumer Consumed: Kafka Terminology: - Commit Mode: Manual
  25. Partition Kafka: message confirmation • Messages confirmed by offset (not

    individually) Commit point Consumer Commit Consumed: Kafka Terminology: - Commit Mode: Manual
  26. Parallel workers Partition #1 Partition #2 Partition #N Electric Meter

    Auditing node #1 Auditing node #2 Auditing node #N Electric Readings Electric Meter Electric Meter Kafka Terminology: - Partition Count: >1 - Single Consumer Group
  27. Kafka Partition Kafka Partition Consumer for parallel processing DB Auditing

    Service Consumer Client App logic Kafka Partition • Same arrangement from consumer perspective Kafka Terminology: - Partition Count: >1 - Commit Mode: Manual
  28. Orchestration • Provide Scaling Capability • Restart or replace failed

    nodes Partition #1 Partition #2 Partition #N Electric Meter Auditing node #1 Auditing node #2 Auditing node #N Electric Readings Electric Meter Electric Meter Mesos/ Marathon New node
  29. Stateful Processing • Example: Average electricity consumption per meter for

    the last hour Electric Meter Aggregation Electric Readings Partition Partition Partition Electric Meter Electric Meter
  30. Aggregator for Stream and state Partition #1 Partition #2 Aggregator

    for Key: "meter 1" Value: 9.2 Key: "meter 2" Value: 2.7 Electric Readings • Data locality
  31. Aggregator for Fault tolerance Partition #1 Partition #2 Aggregator for

    Electric Readings • State persistence and recovery
  32. Aggregator for Fault tolerance Partition #1 Partition #2 Aggregator for

    Electric Readings Persistence • State persistence and recovery
  33. Persistence Stateful Processing app Persistence Kafka Partition Kafka Partition Kafka/DB/?

    Aggregation Service Consumer Client Aggregation logic Kafka Partition
  34. Aggregation Service Consumer Client Aggregation logic Aggregation Service Consumer Client

    Aggregation logic Stateful Processing app Persistence Kafka Partition Kafka Partition Kafka/DB/? Kafka Partition Duplicated message processing after recovery.
  35. Stateful Processing app Persistence Persist state with partition offsets Don't

    commit! Just fetch more data Kafka Partition Kafka Partition Kafka/DB/? Aggregation Service Consumer Client Aggregation logic Kafka Partition Kafka Terminology: - Commit Mode: Self Managed Offsets
  36. Partition #1 Partition #1 Stateful Processing architecture • Dynamic partition

    assignment • Shared Persistence for State Aggregator 2 Aggregator 1 Persistence Kafka/DB/? Partition #4 Partition #6 Partition #1 Partition #2 Orchestration Service Aggregator 3
  38. Streaming Patterns

    Single Partition Topic • Strong ordering guarantees • Limited failure recovery • Scalability is limited
    Multi Partition Topic • Parallel processing • Limited ordering guarantees • Kafka managed processing state
    Fanout • Independent consumer groups
    Stateful Processing • Self-managed processing state
  39. Kafka libraries • Kafka client support in many languages •

    Scala, Java, C • C bindings -> Haskell, OCaml, Python etc. Source Microservice 1 Kafka Microservice 2 Microservice 1 Microservice 1 Microservice 2 Microservice 2 Sink
  40. Reactive Streaming APIs

    • Similar paradigm as in real-time streaming platforms
    • Reactive Kafka ◦ Based on Akka Reactive Streams API ◦ Scala + Java ◦ Developed by Akka team
    • Kafka Streams ◦ Official streaming API for Kafka ◦ Java ◦ Developed by Confluent
  41. scala-kafka-client

    • Kafka client developed for Scala
    • Async and non-blocking
    • Built on top of the official Java driver
    • Easy API with high performance
    /cakesolutions /scala-kafka-client
  42. scala-kafka-client • Leverage extensive Akka feature set • Processing logic

    implemented using Actor Model Kafka Consumer Actor Kafka Producer Actor Receiver Actor Kafka Kafka /cakesolutions /scala-kafka-client
  43. Summary

    • Leverage Microservice based techniques.
    • Streaming topologies can be varied and complex ◦ Many use-cases fall under a small set of consumer patterns.
    • Challenges around scalable and reactive data pipelines
    • Kafka provides first-class support for reactive streaming to your applications.
    • Stateful processing remains a challenging area.
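
The two commit modes named in the slides (Manual on slides 23 to 25, Self Managed Offsets on slide 35) can be illustrated with a small in-memory sketch. This is not the scala-kafka-client API and it does not talk to Kafka at all: `Partition`, `Store`, `run_manual_commit`, `run_self_managed` and the crash parameters are hypothetical names invented for this illustration, and the "atomic" save is simply a single Python assignment standing in for a transactional write to the persistence layer.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """In-memory stand-in for one Kafka partition (hypothetical, not a client)."""
    log: list            # (key, value) records; list index doubles as the offset
    committed: int = 0   # next offset to read after a restart

    def poll(self, start, max_records):
        """Return (offset, record) pairs beginning at offset `start`."""
        end = min(start + max_records, len(self.log))
        return [(o, self.log[o]) for o in range(start, end)]

    def commit(self, next_offset):
        """Confirm every record below `next_offset` in one step."""
        self.committed = next_offset


@dataclass
class Store:
    """Stand-in for the persistence layer: state and offset are saved together."""
    totals: dict = field(default_factory=dict)
    offset: int = 0


def run_manual_commit(partition, sink, batch_size=2, crash_before_commit_at=None):
    """Manual commit mode: consume a batch, process it, then commit by offset."""
    position = partition.committed            # resume from the last commit point
    batch_no = 0
    while True:
        batch = partition.poll(position, batch_size)    # 1. consume a batch
        if not batch:
            return
        for _, record in batch:
            sink.append(record)                         # 2. process and emit
        if crash_before_commit_at == batch_no:
            return                                      # simulated crash, commit lost
        position = batch[-1][0] + 1
        partition.commit(position)                      # 3. confirm delivery
        batch_no += 1


def run_self_managed(partition, store, batch_size=2, crash_before_save_at=None):
    """Self-managed offsets: never commit to the broker; persist state + offset."""
    position = store.offset                   # resume from the persisted offset
    batch_no = 0
    while True:
        batch = partition.poll(position, batch_size)
        if not batch:
            return
        totals = dict(store.totals)           # work on a copy until the save
        for _, (key, value) in batch:
            totals[key] = totals.get(key, 0.0) + value
        if crash_before_save_at == batch_no:
            return                            # simulated crash, nothing was saved
        position = batch[-1][0] + 1
        store.totals, store.offset = totals, position   # one combined write
        batch_no += 1
```

In the manual mode, a crash between processing and commit means the uncommitted batch is delivered again after restart, which is the at-least-once behaviour and the duplicated processing mentioned on slide 34. In the self-managed mode, the aggregate and the offset are written together, so a restarted worker continues from a state that matches its read position and does not count a reading twice.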