Staging reactive data pipelines using Kafka as the backbone

At Cake Solutions, we build highly distributed and scalable systems using Kafka as our core data pipeline.

Kafka has become the de facto platform for reliable and scalable distribution of high volumes of data. However, as a developer, it can be challenging to figure out the best architecture and consumption patterns for interacting with Kafka while delivering quality-of-service properties such as high availability and delivery guarantees. It can also be difficult to understand the various streaming patterns and messaging topologies available in Kafka.

In this talk, we present the patterns we've successfully employed in production and provide the tools and guidelines for other developers to choose the most appropriate fit for a given data processing problem. The key points for the presentation are: patterns for building reactive data pipelines, high availability and message delivery guarantees, clustering of application consumers, topic partition topology, offset commit patterns, performance benchmarks, and a custom reactive, asynchronous, non-blocking Kafka driver.

https://github.com/cakesolutions/scala-kafka-client

Jaakko Pallari

October 04, 2016

Transcript

  2. Jaakko Pallari (@lepovirta)
    Simon Souter (@simonsouter)
    Staging Reactive data pipelines using Kafka as the backbone
    /cakesolutions /scala-kafka-client

  3. MANCHESTER LONDON NEW YORK
    Reactive Solutions at Cake

  4. Contents
    1. Reactive Data Pipelines
    2. Kafka as a Reactive Message Queue
    3. Architecture & Consumer Patterns
    4. Streaming Application Development

  5. Stream Processing
    ● Big Data
    ● Processing in Real-time
    ● Event Throughput vs Number of Queries
    ● IoT
    Source Service Sink

  6. Distributed Streaming Engines
    ● Server Applications
    ● Stream topologies deployed to cluster
    ● Framework design

  7. Streaming from ground-up
    ● Custom Streaming Applications
    ● Leverage existing tool stack
    Source Application Sink

  8. Staged data pipelines
    ● Staged Event Driven Architecture
    ● Processes separated by a queue
    ● Processing in stages
    Process Queue Process Queue
    Queue

  9. Reactive data pipelines
    ● Responsive
    ● Resilient
    ● Elastic
    ● Message Driven
    Process Queue Process
    Source Sink

  10. Streaming from ground-up
    ● Microservices as processing components
    Source Microservice 1 Microservice 2
    Microservice 1
    Microservice 1
    Microservice 2
    Microservice 2
    Sink
    Queue

  11. Streaming from ground-up
    ● Deployment via cluster orchestration services
    Source Microservice 1 Queue Microservice 2
    Microservice 1
    Microservice 1
    Microservice 2
    Microservice 2
    Sink
    Orchestration Service
    Scale up

  12. Streaming from ground-up
    ● Messaging middleware for resilient data distribution between microservices
    Source Microservice 1 Queue Microservice 2
    Microservice 1
    Microservice 1
    Microservice 2
    Microservice 2
    Sink

  13. What is Kafka?
    ● Distributed Message Broker
    ● Supports Parallel Streaming
    ● Kafka as a Reactive MQ
    Source Microservice 1 Kafka Microservice 2
    Microservice 1
    Microservice 1
    Microservice 2
    Microservice 2
    Sink

  14. Kafka: topic and message anatomy
    Kafka Topic: “Electric_Readings”
    Key: “meter1”
    Value: 1.34
    Electric Bill Calculation
    Auditing
    Message Driven

  15. Kafka: at-least-once delivery
    Kafka Topic:
    “Electric_Readings”
    Electric meter
    Consumption
    Aggregator
    Deliver
    ACK
    Deliver
    ACK
    Resilient
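
The Deliver/ACK exchange on this slide can be sketched as a retry loop. This is a minimal in-memory simulation rather than Kafka client code: `LossyTopic` and `send_at_least_once` are hypothetical names, and the lost ACK is scripted so the effect is repeatable. It illustrates why at-least-once delivery means a message can end up in the log more than once.

```python
class LossyTopic:
    """In-memory stand-in for a topic that drops a scripted number of ACKs."""

    def __init__(self, acks_to_drop):
        self.log = []
        self._acks_to_drop = acks_to_drop

    def append(self, key, value):
        # The message is always stored; only the ACK may get lost on the way back.
        self.log.append((key, value))
        if self._acks_to_drop > 0:
            self._acks_to_drop -= 1
            return False
        return True


def send_at_least_once(topic, key, value, max_attempts=5):
    """Resend until an ACK arrives; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if topic.append(key, value):
            return attempt
    raise RuntimeError("no ACK received")


topic = LossyTopic(acks_to_drop=1)
attempts = send_at_least_once(topic, "meter1", 1.34)
# The first ACK was lost, so the reading was sent twice and stored twice.
```

Downstream stages therefore need to tolerate duplicates, for example by making their processing idempotent.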

  16. Kafka: clustering - arrangement
    Kafka node 1
    Kafka node 2
    Kafka Topic
    Partition 1
    Partition 2
    Elastic
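
How a keyed message ends up in one of the partitions can be sketched as a hash of the key modulo the partition count. The function below is a simplified illustration: `partition_for` is a hypothetical name and `zlib.crc32` is only a stable stand-in, since Kafka's own default partitioner uses a different hash function.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, partition_count: int) -> int:
    """Map a message key to a partition index in [0, partition_count)."""
    # crc32 is a stand-in hash chosen because it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % partition_count


readings = [("meter1", 1.34), ("meter2", 0.80), ("meter1", 1.40), ("meter3", 2.10)]
partitions = defaultdict(list)
for key, value in readings:
    partitions[partition_for(key, 2)].append((key, value))
```

Because the mapping depends only on the key, all readings from one meter land in the same partition and keep their relative order there.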

  17. Kafka: clustering - replication
    Resilient
    Kafka node 2
    Kafka node 1
    Kafka
    Topic
    Partition 1
    Partition 2
    Partition 2
    Replica
    Partition 1
    Replica

  18. Kafka: clustering - consumer
    Partition #1
    Partition #2
    Partition #3
    Consumer #1
    Consumer #2
    Consumer #3
    Kafka
    Topic
    Responsive
    Same consumer group

  19. Kafka: clustering - consumer
    Partition #1
    Partition #2
    Partition #3
    Consumer #1
    Consumer #2
    Consumer #3
    Kafka
    Topic
    Responsive

  20. Kafka: clustering - consumer
    Partition #1
    Partition #2
    Partition #3
    Consumer #1
    Consumer #2
    Consumer #3
    Kafka
    Topic
    Responsive

  21. Kafka: clustering - consumer
    Partition #1
    Partition #2
    Partition #3
    Consumer #1
    Consumer #2
    Consumer #3
    Kafka
    Topic
    Responsive

  22. Kafka: clustering - consumer
    Partition #1
    Partition #2
    Partition #3
    Consumer #1
    Consumer #2
    Consumer #3
    Kafka
    Topic
    Responsive

  23. Kafka: clustering - consumer
    Partition #1
    Partition #2
    Partition #3
    Consumer #1
    Consumer #2
    Kafka
    Topic
    Responsive
    Consumer #3
    Consumer #4 No Data
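
The consumer clustering slides can be summarised with a small assignment sketch. It assumes a plain round-robin strategy and a hypothetical `assign_partitions` helper; in a real deployment the assignment is done by Kafka's group coordination and the strategy is configurable.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [1, 2, 3]
three = assign_partitions(partitions, ["consumer1", "consumer2", "consumer3"])
two = assign_partitions(partitions, ["consumer1", "consumer2"])
four = assign_partitions(
    partitions, ["consumer1", "consumer2", "consumer3", "consumer4"]
)
# With four consumers and three partitions, consumer4 is assigned nothing.
```

Each partition has exactly one owner within the group, so adding consumers beyond the partition count adds no throughput.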

  24. Kafka: high throughput
    ● Single partition consumer: 20-90 Mb/sec
    Responsive

  25. Kafka the Reactive MQ
    Message Driven
    ● Key-value messages
    Responsive
    ● Consumer clustering
    ● High throughput
    Resilient
    ● At-least-once delivery
    ● Replication
    Elastic
    ● Linear scalability

  26. Kafka consumer patterns
    Source Microservice 1 Kafka Microservice 2
    Microservice 1
    Microservice 1
    Microservice 2
    Microservice 2
    Sink

  27. Simple message queue
    Partition
    Electric
    Meter
    Auditing
    Electric
    Readings
    Partition replica
    Partition replica
    Kafka Terminology:
    - Partition Count: 1

  28. Simple message queue - fanout
    Partition
    Electric
    Meter
    Auditing
    Electric
    Readings
    Partition replica
    Partition replica
    Billing
    Kafka Terminology:
    - Partition Count: 1
    - Multiple Consumer Groups
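
Fanout with multiple consumer groups can be pictured as one shared log with a separate read position per group. The sketch below is an in-memory model under that assumption; the `Topic` class and its `poll` method are hypothetical and, for brevity, advance the position as soon as messages are read.

```python
class Topic:
    """In-memory log with one read position per consumer group."""

    def __init__(self):
        self.log = []
        self.positions = {}  # group name -> offset of the next message to read

    def publish(self, message):
        self.log.append(message)

    def poll(self, group):
        # Return everything this group has not seen yet and move its position.
        start = self.positions.get(group, 0)
        batch = self.log[start:]
        self.positions[group] = len(self.log)
        return batch


readings = Topic()
readings.publish(("meter1", 1.34))
readings.publish(("meter1", 1.40))
auditing_first = readings.poll("auditing")
readings.publish(("meter1", 1.52))
auditing_second = readings.poll("auditing")
billing_all = readings.poll("billing")  # billing starts from the beginning
```

Auditing and Billing each receive every reading, and neither group's progress affects the other.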

  29. Simple message queue - consumer
    DB
    Auditing
    Service
    Consumer
    Client
    App logic
    Kafka Partition
    1. Consume a batch of messages from Kafka
    2. Process messages and send results to wherever necessary (e.g. another Kafka topic)
    3. Confirm delivery to Kafka
    Kafka Terminology:
    - Commit Mode: Manual
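
The three numbered steps can be sketched as a loop. This is an in-memory simulation, not real client code: `PartitionLog` and `run_consumer` are hypothetical names, and the committed value is assumed to be the offset of the next message to read.

```python
class PartitionLog:
    """In-memory stand-in for one Kafka partition plus its committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to read after a restart

    def fetch(self, offset, max_batch):
        return self.messages[offset:offset + max_batch]

    def commit(self, offset):
        self.committed = offset


def run_consumer(source, sink, process, max_batch=2):
    """Consume from the committed offset until the partition is drained."""
    position = source.committed
    while True:
        batch = source.fetch(position, max_batch)  # 1. consume a batch
        if not batch:
            return
        for message in batch:
            sink.append(process(message))          # 2. process and forward
        position += len(batch)
        source.commit(position)                    # 3. confirm only after processing


readings = PartitionLog([1.0, 2.0, 3.0])
audited = []
run_consumer(readings, audited, lambda kwh: ("audited", kwh))
```

Committing only after step 2 means a crash during processing leaves the commit point behind the failed batch, so that batch is read again on restart rather than lost.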

  30. Kafka: message confirmation
    Partition
    ● Messages confirmed by offset (not individually)
    Commit point
    Consumer
    Consumed:
    Kafka Terminology:
    - Commit Mode: Manual

  31. Kafka: message confirmation
    Partition
    ● Messages confirmed by offset (not individually)
    Commit point
    Consumer
    Commit
    Consumed:
    Kafka Terminology:
    - Commit Mode: Manual
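
Because a commit is a single offset rather than a set of individual acknowledgements, a consumer that finishes messages out of order can only commit up to the first gap. The helper below is a small sketch of that rule; `safe_commit_offset` is a hypothetical name, and offsets follow the convention that the committed value is the next offset to read.

```python
def safe_commit_offset(committed, processed):
    """Return the furthest offset that can be committed without skipping work.

    committed: current commit point (offset of the next message to read)
    processed: set of offsets whose processing has finished
    """
    next_offset = committed
    while next_offset in processed:
        next_offset += 1
    return next_offset


# Offsets 0, 1 and 3 are done, 2 is still in flight: only 0 and 1 are covered.
commit_point = safe_commit_offset(0, {0, 1, 3})
```

Committing past the gap would mark offset 2 as confirmed even though it was never processed.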

  32. Parallel workers
    Partition #1
    Partition #2
    Partition #N
    Electric
    Meter
    Auditing node #1
    Auditing node #2
    Auditing node #N
    Electric
    Readings
    Electric
    Meter
    Electric
    Meter
    Kafka Terminology:
    - Partition Count: >1
    - Single Consumer Group

  33. Consumer for parallel processing
    Kafka Partition
    Kafka Partition
    DB
    Auditing
    Service
    Consumer
    Client
    App logic
    Kafka Partition
    ● Same arrangement from consumer perspective
    Kafka Terminology:
    - Partition Count: >1
    - Commit Mode: Manual

  34. Orchestration
    ● Provide Scaling Capability
    ● Restart or replace failed nodes
    Partition #1
    Partition #2
    Partition #N
    Electric
    Meter
    Auditing node #1
    Auditing node #2
    Auditing node #N
    Electric
    Readings
    Electric
    Meter
    Electric
    Meter
    Mesos/Marathon
    New node

  35. Stateful Processing
    ● Example: average electricity consumption per meter for the last hour
    Electric
    Meter
    Aggregation
    Electric
    Readings
    Partition
    Partition
    Partition
    Electric
    Meter
    Electric
    Meter
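
The example on this slide, an average per meter over the last hour, needs state that outlives a single message. A minimal sketch of that state follows; `HourlyAverage` is a hypothetical class, timestamps are plain seconds, and readings for a given meter are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class HourlyAverage:
    """Keeps a sliding window of readings per meter and averages them."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.readings = defaultdict(deque)  # meter id -> deque of (timestamp, value)

    def add(self, meter, timestamp, value):
        window = self.readings[meter]
        window.append((timestamp, value))
        # Drop readings that are a full window (or more) older than the newest one.
        while window and window[0][0] <= timestamp - self.window:
            window.popleft()

    def average(self, meter):
        window = self.readings.get(meter)
        if not window:
            return None
        return sum(value for _, value in window) / len(window)


state = HourlyAverage()
state.add("meter1", 0, 1.0)
state.add("meter1", 1800, 3.0)
first = state.average("meter1")   # both readings are inside the window
state.add("meter1", 3600, 5.0)
second = state.average("meter1")  # the reading at t=0 has expired
```

All readings for one meter must reach the same aggregator instance for this to work, which is what keying messages by meter provides.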

  36. Stream and state
    Aggregator for Partition #1
    Aggregator for Partition #2
    Electric
    Readings
    ● Data locality

  37. Stream and state
    Aggregator for Partition #1
    Aggregator for Partition #2
    Key: "meter 1"
    Value: 9.2
    Key: "meter 2"
    Value: 2.7
    Electric
    Readings
    ● Data locality

  38. Fault tolerance
    Aggregator for Partition #1
    Aggregator for Partition #2
    Electric
    Readings
    ● State persistence and recovery

  39. Fault tolerance
    Aggregator for Partition #1
    Aggregator for Partition #2
    Electric
    Readings
    Persistence
    ● State persistence and recovery

  40. Stateful Processing app
    Persistence
    Persistence
    Kafka Partition
    Kafka Partition
    Kafka/DB/?
    Aggregation
    Service
    Consumer
    Client
    Aggregation
    logic
    Kafka Partition

  41. Stateful Processing app
    Aggregation
    Service
    Consumer
    Client
    Aggregation
    logic
    Aggregation
    Service
    Consumer
    Client
    Aggregation
    logic
    Persistence
    Kafka Partition
    Kafka Partition
    Kafka/DB/?
    Kafka Partition
    Duplicated message processing after recovery.

  42. Stateful Processing app
    Persistence
    Persist state with partition offsets
    Don't commit! Just fetch more data
    Kafka Partition
    Kafka Partition
    Kafka/DB/?
    Aggregation
    Service
    Consumer
    Client
    Aggregation
    logic
    Kafka Partition
    Kafka Terminology:
    - Commit Mode: Self Managed
    Offsets
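
Persisting the state together with the partition offset can be sketched as follows. This is an in-memory model under stated assumptions: `SnapshotStore` stands in for the "Kafka/DB/?" box, `aggregate` and its `crash_at` parameter are hypothetical, and a snapshot is taken every two messages.

```python
import copy


class SnapshotStore:
    """Stand-in for the persistence layer; saves state and offset as one unit."""

    def __init__(self):
        self._snapshot = ({}, 0)

    def save(self, state, next_offset):
        self._snapshot = (copy.deepcopy(state), next_offset)

    def load(self):
        state, next_offset = self._snapshot
        return copy.deepcopy(state), next_offset


def aggregate(partition, store, snapshot_every=2, crash_at=None):
    """Sum readings per meter, resuming from the offset stored with the state."""
    totals, offset = store.load()  # state and read position come back together
    while offset < len(partition):
        if offset == crash_at:
            raise RuntimeError("simulated crash")
        key, value = partition[offset]
        totals[key] = totals.get(key, 0.0) + value
        offset += 1
        if offset % snapshot_every == 0 or offset == len(partition):
            store.save(totals, offset)  # no commit to Kafka, just this one write
    return totals


partition = [("meter1", 1.0), ("meter2", 2.0), ("meter1", 3.0),
             ("meter1", 4.0), ("meter2", 5.0)]
store = SnapshotStore()
try:
    aggregate(partition, store, crash_at=3)  # dies after handling offset 2
except RuntimeError:
    pass
recovered = aggregate(partition, store)      # resumes from the snapshot at offset 2
```

Offset 2 is handled twice, but the second time it is applied to the state as it was at the snapshot, so the totals match a run without a crash.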

  43. Stateful Processing architecture
    Partition #1
    Partition #1
    ● Dynamic partition assignment
    ● Shared Persistence for State
    Aggregator 2
    Aggregator 1
    Persistence
    Kafka/DB/?
    Partition #4
    Partition #6
    Partition #1
    Partition #2
    Orchestration
    Service
    Aggregator 3

  44. Stateful Processing architecture
    Partition #1
    Partition #1
    ● Dynamic partition assignment
    ● Shared Persistence for State
    Aggregator 2
    Aggregator 1
    Persistence
    Kafka/DB/?
    Partition #4
    Partition #6
    Partition #1
    Partition #2
    Orchestration
    Service
    Aggregator 3

  45. Streaming Patterns
    Stateful Processing
    ● Self-managed processing state
    Single Partition Topic
    ● Strong ordering guarantees
    ● Limited failure recovery
    ● Scalability is limited
    Multi Partition Topic
    ● Parallel processing
    ● Limited ordering guarantees
    ● Kafka managed processing state
    Fanout
    ● Independent consumer groups

  46. Kafka libraries
    ● Kafka client support in many languages
    ● Scala, Java, C
    ● C bindings -> Haskell, OCaml, Python etc.
    Source Microservice 1 Kafka Microservice 2
    Microservice 1
    Microservice 1
    Microservice 2
    Microservice 2
    Sink

  47. Reactive Streaming APIs
    ● Similar paradigm as in real-time streaming platforms
    ● Reactive Kafka
    ○ Based on Akka Reactive Streams API
    ○ Scala + Java
    ○ Developed by Akka team
    ● Kafka Streams
    ○ Official streaming API for Kafka
    ○ Java
    ○ Developed by Confluent

  48. scala-kafka-client
    ● Kafka client developed for Scala
    ● Async and non-blocking
    ● Built on top of the official Java driver
    ● Easy API with high performance
    /cakesolutions /scala-kafka-client

  49. scala-kafka-client
    ● Leverage extensive Akka feature set
    ● Processing logic implemented using Actor Model
    Kafka Consumer Actor
    Kafka Producer Actor
    Receiver Actor
    Kafka Kafka
    /cakesolutions /scala-kafka-client

  50. Summary
    ● Leverage microservice-based techniques.
    ● Streaming topologies can be varied and complex
    ○ Many use-cases fall under a small set of consumer patterns.
    ● Challenges around scalable and reactive data pipelines
    ● Kafka provides first-class support for reactive streaming to your applications.
    ● Stateful processing remains a challenging area.

  51. We didn’t discuss...
    ● Data serialisation
    ● Application rolling updates
    ● Complex streaming topologies

  52. Questions?
    MANCHESTER LONDON NEW YORK
    /cakesolutions /scala-kafka-client
    @cakesolutions
    +44 845 617 1200
    [email protected]
