Staging reactive data pipelines using Kafka as the backbone

At Cake Solutions, we build highly distributed and scalable systems using Kafka as our core data pipeline.

Kafka has become the de facto platform for reliable and scalable distribution of high volumes of data. However, as a developer, it can be challenging to figure out the best architecture and consumption patterns for interacting with Kafka while delivering quality of service such as high availability and delivery guarantees. It can also be difficult to understand the various streaming patterns and messaging topologies available in Kafka.

In this talk, we present the patterns we've successfully employed in production and provide the tools and guidelines for other developers to choose the most appropriate fit for a given data processing problem. The key points of the presentation are: patterns for building reactive data pipelines, high availability and message delivery guarantees, clustering of application consumers, topic partition topology, offset commit patterns, performance benchmarks, and a custom reactive, asynchronous, non-blocking Kafka driver.

https://github.com/cakesolutions/scala-kafka-client

Jaakko Pallari

October 04, 2016

Transcript

  1. Jaakko Pallari (@lepovirta)
    Simon Souter (@simonsouter)
    Staging Reactive data pipelines using
    Kafka as the backbone
    /cakesolutions /scala-kafka-client

  2. MANCHESTER LONDON NEW YORK
    Reactive Solutions at Cake

  3. Contents
    1. Reactive Data Pipelines
    2. Kafka as a Reactive Message Queue
    3. Architecture & Consumer Patterns
    4. Streaming Application Development

  4. Stream Processing
    ● Big Data
    ● Processing in Real-time
    ● Event Throughput vs Number of Queries
    ● IoT
    Source Service Sink

  5. Distributed Streaming Engines
    ● Server Applications
    ● Stream topologies deployed to cluster
    ● Framework design

  6. Streaming from ground-up
    ● Custom Streaming Applications
    ● Leverage existing tool stack
    Source Application Sink

  7. Staged data pipelines
    ● Staged Event Driven Architecture
    ● Processes separated by a queue
    ● Processing in stages
    [Diagram: alternating Process and Queue stages]

  8. Reactive data pipelines
    ● Responsive
    ● Resilient
    ● Elastic
    ● Message Driven
    [Diagram: Source, Process, Queue, Process, Sink]

  9. Streaming from ground-up
    ● Microservices as processing components
    [Diagram: Source, three Microservice 1 instances, Queue, three Microservice 2 instances, Sink]

  10. Streaming from ground-up
    ● Deployment via cluster orchestration services
    [Diagram: Source, three Microservice 1 instances, Queue, three Microservice 2 instances, Sink; an Orchestration Service labelled "Scale up"]

  11. Streaming from ground-up
    ● Messaging middleware for resilient data distribution between microservices
    [Diagram: Source, three Microservice 1 instances, Queue, three Microservice 2 instances, Sink]

  12. What is Kafka?
    ● Distributed Message Broker
    ● Supports Parallel Streaming
    ● Kafka as a Reactive MQ
    [Diagram: Source, three Microservice 1 instances, Kafka, three Microservice 2 instances, Sink]

  13. Kafka: topic and message anatomy
    Message Driven
    [Diagram: Kafka topic “Electric_Readings” carrying a message with key “meter1” and value 1.34, with Electric Bill Calculation and Auditing services]

  14. Kafka: at-least-once delivery
    [Diagram: Electric meter, Kafka topic “Electric_Readings”, Consumption Aggregator; each hop is labelled Deliver and ACK]
    Resilient
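
The deliver/ACK exchange on this slide can be sketched in plain Python with no Kafka dependency. The function `deliver_at_least_once`, its `lose_first_ack_for` parameter and the `receiver` callback are hypothetical names used for illustration only; the sketch assumes a sender that simply retries until it sees an ACK, which is why the same message may arrive more than once.

```python
def deliver_at_least_once(messages, receiver, lose_first_ack_for=(), max_attempts=5):
    """Send each message until it is acknowledged.

    `lose_first_ack_for` simulates a fault: the first ACK for those messages
    is dropped, so the sender retries (hypothetical test hook).
    """
    ack_already_lost = set()
    for message in messages:
        for _attempt in range(max_attempts):
            receiver(message)  # Deliver
            if message in lose_first_ack_for and message not in ack_already_lost:
                ack_already_lost.add(message)
                continue  # ACK lost: the sender cannot tell, so it redelivers
            break  # ACK received, move on to the next message
        else:
            raise RuntimeError(f"no ACK for {message!r} after {max_attempts} attempts")


received = []
deliver_at_least_once(["r1", "r2", "r3"], received.append, lose_first_ack_for={"r2"})
print(received)
```

Nothing is lost, but `r2` is delivered twice, so the receiving side has to tolerate duplicates.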

  15. Kafka: clustering - arrangement
    Elastic
    [Diagram: Kafka node 1 and Kafka node 2 hosting Partition 1 and Partition 2 of a Kafka topic]

  16. Kafka: clustering - replication
    Resilient
    [Diagram: Kafka node 1 and Kafka node 2 hosting Partition 1 and Partition 2 of a Kafka topic, plus a replica of each partition]

  17. Kafka: clustering - consumer
    Responsive
    [Diagram: a Kafka topic with Partitions #1 to #3 and Consumers #1 to #3 in the same consumer group]

  18. Kafka: clustering - consumer
    Responsive
    [Diagram: a Kafka topic with Partitions #1 to #3 and Consumers #1 to #3]

  19. Kafka: clustering - consumer
    Responsive
    [Diagram: a Kafka topic with Partitions #1 to #3 and Consumers #1 to #3]

  20. Kafka: clustering - consumer
    Responsive
    [Diagram: a Kafka topic with Partitions #1 to #3 and Consumers #1 to #3]

  21. Kafka: clustering - consumer
    Responsive
    [Diagram: a Kafka topic with Partitions #1 to #3 and Consumers #1 to #3]

  22. Kafka: clustering - consumer
    Responsive
    [Diagram: a Kafka topic with Partitions #1 to #3 and Consumers #1 to #3]
    Consumer #4 No Data
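
Why a fourth consumer receives no data can be seen from a small sketch of partition assignment inside one consumer group. This is a simplified round-robin stand-in written for illustration, not Kafka's actual assignment algorithm; `assign_partitions` and the consumer and partition names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Each partition is owned by exactly one consumer in the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["partition-1", "partition-2", "partition-3"]
two = assign_partitions(partitions, ["consumer-1", "consumer-2"])
three = assign_partitions(partitions, ["consumer-1", "consumer-2", "consumer-3"])
four = assign_partitions(
    partitions, ["consumer-1", "consumer-2", "consumer-3", "consumer-4"]
)
print(four["consumer-4"])  # empty: there are more consumers than partitions
```

With two consumers one of them owns two partitions; with four, the extra consumer stays idle, so the partition count caps the useful parallelism of a group.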

  23. Kafka: high throughput
    ● Single partition consumer: 20-90 Mb/sec
    Responsive

  24. Kafka the Reactive MQ
    ● Message Driven
    ○ Key-value messages
    ● Responsive
    ○ Consumer clustering
    ○ High throughput
    ● Resilient
    ○ At-least-once delivery
    ○ Replication
    ● Elastic
    ○ Linear scalability

  25. Kafka consumer patterns
    [Diagram: Source, three Microservice 1 instances, Kafka, three Microservice 2 instances, Sink]

  26. Simple message queue
    [Diagram: Electric Meter, Electric Readings topic with one partition and two partition replicas, Auditing]
    Kafka Terminology:
    - Partition Count: 1

  27. Simple message queue - fanout
    [Diagram: Electric Meter, Electric Readings topic with one partition and two partition replicas, Auditing and Billing]
    Kafka Terminology:
    - Partition Count: 1
    - Multiple Consumer Groups
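
Fanout with multiple consumer groups can be illustrated with an in-memory stand-in for a single-partition topic. The `Topic` class below is hypothetical and is not a Kafka API; it assumes only that each consumer group tracks its own read position, and for brevity it advances that position as soon as a batch is handed out.

```python
class Topic:
    """In-memory stand-in for a single-partition topic (illustrative only)."""

    def __init__(self):
        self.log = []        # append-only list of (key, value) messages
        self.positions = {}  # consumer group name -> next offset to read

    def publish(self, key, value):
        self.log.append((key, value))

    def poll(self, group, max_messages=10):
        """Return unread messages for `group`, advancing only that group's position."""
        start = self.positions.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.positions[group] = start + len(batch)
        return batch


readings = Topic()
readings.publish("meter1", 1.34)
readings.publish("meter2", 0.80)  # made-up reading

auditing_batch = readings.poll("auditing")
billing_batch = readings.poll("billing")
print(auditing_batch == billing_batch)
```

Both groups see every message, and one group reading ahead does not affect the other.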

  28. Simple message queue - consumer
    [Diagram: Kafka Partition, Auditing Service (Consumer Client, App logic), DB]
    1. Consume a batch of messages from Kafka
    2. Process messages and send results to wherever necessary (e.g. another Kafka topic)
    3. Confirm delivery to Kafka
    Kafka Terminology:
    - Commit Mode: Manual
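
The three steps above can be sketched as a loop over an in-memory partition. `PartitionLog`, `run_consumer` and `flaky_process` are hypothetical names and this is not the API of any Kafka client; the sketch assumes that the commit happens only after a whole batch has been processed.

```python
class PartitionLog:
    """In-memory stand-in for one Kafka partition (illustrative only)."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed_offset = 0  # offset of the next unconfirmed message

    def fetch(self, from_offset, max_messages):
        return self.messages[from_offset:from_offset + max_messages]

    def commit(self, offset):
        self.committed_offset = offset


def run_consumer(log, process, batch_size=2):
    """Consume from the last commit point until the partition is drained."""
    position = log.committed_offset
    while True:
        batch = log.fetch(position, batch_size)  # 1. consume a batch
        if not batch:
            return
        for message in batch:                    # 2. process the messages
            process(message)
        position += len(batch)
        log.commit(position)                     # 3. confirm after processing


log = PartitionLog(["a", "b", "c", "d", "e"])
audited = []


def flaky_process(message):
    """Hypothetical app logic that crashes the first time it sees 'd'."""
    if message == "d" and not getattr(flaky_process, "crashed", False):
        flaky_process.crashed = True
        raise RuntimeError("simulated crash")
    audited.append(message)


try:
    run_consumer(log, flaky_process)
except RuntimeError:
    pass  # the service dies mid-batch; the last commit was at offset 2

run_consumer(log, flaky_process)  # a restart resumes from the commit point
print(audited)
```

Because the crash happens before the second batch is confirmed, `c` is processed again after the restart: manual commit after processing gives at-least-once behaviour.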

  29. Kafka: message confirmation
    ● Messages confirmed by offset (not individually)
    [Diagram: a Partition with a Commit point, and a Consumer with a "Consumed:" marker]
    Kafka Terminology:
    - Commit Mode: Manual

  30. Kafka: message confirmation
    ● Messages confirmed by offset (not individually)
    [Diagram: a Partition with a Commit point, and a Consumer issuing a Commit for the messages it has consumed]
    Kafka Terminology:
    - Commit Mode: Manual
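
One consequence of confirming by offset is that a commit covers every message below it. The helper below is a hypothetical sketch (not part of any Kafka client) that works out how far the commit point may move when messages finish processing out of order. It assumes the convention that the committed offset is the offset of the next message to read.

```python
def next_commit_point(committed, processed_offsets):
    """Return the highest offset that can be committed safely.

    `committed` is the offset of the next unconfirmed message. A commit
    confirms everything below it, so the commit point may only advance
    across a gap-free run of processed offsets.
    """
    done = set(processed_offsets)
    while committed in done:
        committed += 1
    return committed


print(next_commit_point(0, [0, 1, 2]))     # 3
print(next_commit_point(0, [0, 1, 3, 4]))  # 2, because offset 2 is still pending
print(next_commit_point(2, [3, 4]))        # 2, nothing contiguous yet
```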

  31. Parallel workers
    [Diagram: three Electric Meters, Electric Readings topic with Partitions #1 to #N, Auditing nodes #1 to #N]
    Kafka Terminology:
    - Partition Count: >1
    - Single Consumer Group

  32. Consumer for parallel processing
    ● Same arrangement from consumer perspective
    [Diagram: three Kafka Partitions, Auditing Service (Consumer Client, App logic), DB]
    Kafka Terminology:
    - Partition Count: >1
    - Commit Mode: Manual

  33. Orchestration
    ● Provide Scaling Capability
    ● Restart or replace failed nodes
    [Diagram: three Electric Meters, Electric Readings topic with Partitions #1 to #N, Auditing nodes #1 to #N, with Mesos/Marathon and a New node]

  34. Stateful Processing
    ● Example:
    Average electricity consumption per meter for the last hour
    Electric
    Meter
    Aggregation
    Electric
    Readings
    Partition
    Partition
    Partition
    Electric
    Meter
    Electric
    Meter
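
The example on this slide, an average per meter over the last hour, might be kept in memory roughly as follows. `HourlyAverage` is a hypothetical class written for illustration; it assumes timestamps in seconds that never decrease for a given meter, and the readings used are made up.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # "the last hour"


class HourlyAverage:
    """Sliding-window average per meter key (illustrative sketch)."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.readings = defaultdict(deque)  # meter -> deque of (timestamp, value)

    def _evict(self, window, now):
        # Drop readings that are a full window old or older.
        while window and window[0][0] <= now - self.window_seconds:
            window.popleft()

    def add(self, meter, timestamp, value):
        window = self.readings[meter]
        window.append((timestamp, value))
        self._evict(window, timestamp)

    def average(self, meter, now):
        window = self.readings[meter]
        self._evict(window, now)
        if not window:
            return None
        return sum(value for _, value in window) / len(window)


aggregator = HourlyAverage()
aggregator.add("meter1", 0, 1.0)
aggregator.add("meter1", 1800, 3.0)
aggregator.add("meter2", 1800, 5.0)
first = aggregator.average("meter1", 1800)   # both readings are in the window
aggregator.add("meter1", 3600, 5.0)          # the reading from t=0 drops out
second = aggregator.average("meter1", 3600)
print(first, second)
```

The per-meter deques are the processing state that the following slides are concerned with keeping local and recoverable.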

  35. Stream and state
    ● Data locality
    [Diagram: Electric Readings topic with Partition #1 and Partition #2, each with its own Aggregator]

  36. Stream and state
    [Diagram: Electric Readings topic with Partition #1 and Partition #2, each with its own Aggregator; messages with key "meter 1" (value 9.2) and key "meter 2" (value 2.7)]
    ● Data locality
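
Data locality follows from routing by key: every message with the same key lands in the same partition, so a single aggregator sees all readings for a meter. The sketch below uses CRC32 only because it ships with Python; it is a stand-in, not the hash that Kafka's own partitioner uses, and `partition_for`, `route` and the extra readings are hypothetical.

```python
import zlib


def partition_for(key, partition_count):
    """Map a message key to a partition with a stable hash (stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


def route(messages, partition_count):
    """Place each (key, value) message into the partition chosen by its key."""
    partitions = [[] for _ in range(partition_count)]
    for key, value in messages:
        partitions[partition_for(key, partition_count)].append((key, value))
    return partitions


messages = [("meter 1", 9.2), ("meter 2", 2.7), ("meter 1", 8.9), ("meter 2", 3.1)]
partitions = route(messages, partition_count=2)

for key in ("meter 1", "meter 2"):
    holders = [i for i, p in enumerate(partitions) if any(k == key for k, _ in p)]
    print(key, "is held by partition(s)", holders)
```

Each key is reported under exactly one partition. Which partition that is depends on the hash, so two keys may also share one.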

  37. Fault tolerance
    ● State persistence and recovery
    [Diagram: Electric Readings topic with Partition #1 and Partition #2, each with its own Aggregator]

  38. Fault tolerance
    ● State persistence and recovery
    [Diagram: Electric Readings topic with Partition #1 and Partition #2, each with its own Aggregator, plus Persistence]

  39. Stateful Processing app
    [Diagram: three Kafka Partitions, Aggregation Service (Consumer Client, Aggregation logic), Persistence (Kafka/DB/?)]

  40. Stateful Processing app
    Duplicated message processing after recovery.
    [Diagram: three Kafka Partitions, two Aggregation Service instances (Consumer Client, Aggregation logic), Persistence (Kafka/DB/?)]

  41. Stateful Processing app
    ● Persist state with partition offsets
    ● Don't commit! Just fetch more data
    [Diagram: three Kafka Partitions, Aggregation Service (Consumer Client, Aggregation logic), Persistence (Kafka/DB/?)]
    Kafka Terminology:
    - Commit Mode: Self Managed
    Offsets
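
A minimal sketch of self-managed offsets, assuming the state and the partition offset are written to the persistence layer as one unit. `SnapshotStore` stands in for whatever "Kafka/DB/?" turns out to be, and `aggregate`, `save_every` and `crash_after` are hypothetical names; nothing is ever committed back to the partition.

```python
import copy


class SnapshotStore:
    """Stand-in for the shared persistence layer (illustrative only)."""

    def __init__(self):
        self._snapshot = {"offset": 0, "totals": {}}

    def save(self, offset, totals):
        # State and offset are stored together, so they can never disagree.
        self._snapshot = {"offset": offset, "totals": copy.deepcopy(totals)}

    def load(self):
        return copy.deepcopy(self._snapshot)


def aggregate(partition, store, save_every=2, crash_after=None):
    """Sum values per key, resuming from the offset stored with the state."""
    snapshot = store.load()
    offset, totals = snapshot["offset"], snapshot["totals"]
    for handled, (key, value) in enumerate(partition[offset:]):
        if crash_after is not None and handled == crash_after:
            raise RuntimeError("simulated crash")  # unsaved work is lost
        totals[key] = totals.get(key, 0) + value
        offset += 1
        if offset % save_every == 0:
            store.save(offset, totals)
    store.save(offset, totals)
    return totals


partition = [("meter1", 2), ("meter2", 5), ("meter1", 3), ("meter2", 1), ("meter1", 4)]
store = SnapshotStore()

try:
    aggregate(partition, store, crash_after=3)
except RuntimeError:
    pass  # the last snapshot holds offset 2 and the totals up to that point

totals = aggregate(partition, store)  # recovery restarts from the snapshot
print(totals)
```

The third message is read twice, once before the crash and once after recovery, yet it is counted once, because the totals roll back together with the offset.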

  42. Stateful Processing architecture
    ● Dynamic partition assignment
    ● Shared Persistence for State
    [Diagram: Aggregators 1 to 3 with assigned partitions (labels: Partition #1, #2, #4, #6), shared Persistence (Kafka/DB/?), Orchestration Service]

  43. Stateful Processing architecture
    ● Dynamic partition assignment
    ● Shared Persistence for State
    [Diagram: Aggregators 1 to 3 with assigned partitions (labels: Partition #1, #2, #4, #6), shared Persistence (Kafka/DB/?), Orchestration Service]

  44. Streaming Patterns
    ● Stateful Processing
    ○ Self-managed processing state
    ● Single Partition Topic
    ○ Strong ordering guarantees
    ○ Limited failure recovery
    ○ Scalability is limited
    ● Multi Partition Topic
    ○ Parallel processing
    ○ Limited ordering guarantees
    ○ Kafka managed processing state
    ● Fanout
    ○ Independent consumer groups

  45. Kafka libraries
    ● Kafka client support in many languages
    ● Scala, Java, C
    ● C bindings -> Haskell, OCaml, Python etc.
    [Diagram: Source, three Microservice 1 instances, Kafka, three Microservice 2 instances, Sink]

  46. Reactive Streaming APIs
    ● Similar paradigm as in real-time streaming platforms
    ● Reactive Kafka
    ○ Based on Akka Reactive Streams API
    ○ Scala + Java
    ○ Developed by Akka team
    ● Kafka Streams
    ○ Official streaming API for Kafka
    ○ Java
    ○ Developed by Confluent

  47. scala-kafka-client
    ● Kafka client developed for Scala
    ● Async and non-blocking
    ● Built on top of the official Java driver
    ● Easy API with high performance
    /cakesolutions /scala-kafka-client

  48. scala-kafka-client
    ● Leverage extensive Akka feature set
    ● Processing logic implemented using Actor Model
    [Diagram: Kafka, Kafka Consumer Actor, Receiver Actor, Kafka Producer Actor, Kafka]
    /cakesolutions /scala-kafka-client

  49. Summary
    ● Leverage Microservice based techniques.
    ● Streaming topologies can be varied and complex
    ○ Many use-cases fall under a small set of consumer patterns.
    ● Challenges around scalable and reactive data pipelines
    ● Kafka provides first-class support for reactive streaming to your applications.
    ● Stateful processing remains a challenging area.

  50. We didn’t discuss...
    ● Data serialisation
    ● Application rolling updates
    ● Complex streaming topologies

  51. Questions?
    MANCHESTER LONDON NEW YORK
    /cakesolutions /scala-kafka-client
    @cakesolutions
    +44 845 617 1200
    [email protected]
