Staging reactive data pipelines using Kafka as the backbone

Jaakko Pallari (@lepovirta), Simon Souter (@simonsouter)
/cakesolutions /scala-kafka-client

MANCHESTER LONDON NEW YORK Reactive Solutions at Cake

Contents
1. Reactive Data Pipelines
2. Kafka as a Reactive Message Queue
3. Architecture & Consumer Patterns
4. Streaming Application Development

Stream Processing
• Big Data
• Processing in Real-time
• Event Throughput vs Number of Queries
• IoT
[Diagram: Source -> Service -> Sink]

Distributed Streaming Engines
• Server Applications
• Stream topologies deployed to cluster
• Framework design

Streaming from ground-up
• Custom Streaming Applications
• Leverage existing tool stack
[Diagram: Source -> Application -> Sink]

Staged data pipelines
• Staged Event Driven Architecture
• Processes separated by a queue
• Processing in stages
[Diagram: Processes connected by Queues]

Reactive data pipelines
• Responsive
• Resilient
• Elastic
• Message Driven
[Diagram: Source, Process, Queue, Process, Sink]

Streaming from ground-up
• Microservices as processing components
[Diagram: Source -> Microservice 1 (three instances) -> Queue -> Microservice 2 (three instances) -> Sink]

Streaming from ground-up
• Deployment via cluster orchestration services
[Diagram: Source -> Microservice 1 (three instances) -> Queue -> Microservice 2 (three instances) -> Sink; Orchestration Service: Scale up]

Streaming from ground-up
• Messaging middleware for resilient data distribution between microservices
[Diagram: Source -> Microservice 1 (three instances) -> Queue -> Microservice 2 (three instances) -> Sink]

What is Kafka?
• Distributed Message Broker
• Supports Parallel Streaming
• Kafka as a Reactive MQ
[Diagram: Source -> Microservice 1 (three instances) -> Kafka -> Microservice 2 (three instances) -> Sink]

Kafka: topic and message anatomy
Kafka Topic: “Electric_Readings”
Key: “meter1”
Value: 1.34
[Diagram: Electric Bill Calculation, Auditing]
Reactive trait: Message Driven
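
The message anatomy on this slide can be sketched as a small record type. This is a plain-Python model, not a Kafka client API: the `Record` class, the `bill` and `audit` functions, and the `price_per_unit` tariff are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    topic: str    # the named stream the message belongs to
    key: str      # identifies the sender, here the meter
    value: float  # the payload, here one reading


def bill(record, price_per_unit):
    # price_per_unit is a hypothetical tariff, not a figure from the slides
    return record.value * price_per_unit


def audit(record):
    return f"{record.topic}: {record.key} reported {record.value}"


# The message shown on the slide
reading = Record(topic="Electric_Readings", key="meter1", value=1.34)

# Two independent pieces of logic react to the same message
audit_line = audit(reading)
charge = bill(reading, price_per_unit=0.5)
```

Both functions react to the same immutable message, which is the sense in which the pipeline is message driven.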

Kafka: at-least-once delivery
Kafka Topic: “Electric_Readings”
[Diagram: Electric meter, Consumption Aggregator; Deliver, ACK, Deliver, ACK]
Reactive trait: Resilient
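
A minimal sketch of what at-least-once delivery implies for a consumer, assuming a toy in-memory `Broker` class (hypothetical, not a Kafka API): a message stays deliverable until it is acknowledged, so a crash between processing and ACK leads to a second delivery.

```python
class Broker:
    def __init__(self, messages):
        self.messages = list(messages)
        self.acked = 0  # index of the next unacknowledged message

    def deliver(self):
        if self.acked < len(self.messages):
            return self.messages[self.acked]
        return None

    def ack(self):
        self.acked += 1


def run(broker, fail_on):
    """Consume everything; simulate one crash before the ACK for each value in fail_on."""
    processed = []
    pending_failures = set(fail_on)
    while (msg := broker.deliver()) is not None:
        processed.append(msg)  # the message is processed...
        if msg in pending_failures:
            pending_failures.discard(msg)
            continue           # ...but the ACK is lost, so it will be delivered again
        broker.ack()
    return processed


processed = run(Broker([1.2, 3.4, 5.6]), fail_on={3.4})
# The reading 3.4 is processed twice; no reading is lost.
```

The practical consequence is that processing logic should tolerate duplicates, for example by being idempotent.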

Kafka: clustering - arrangement
[Diagram: Kafka Topic with Partition 1 and Partition 2 spread across Kafka node 1 and Kafka node 2]
Reactive trait: Elastic

Kafka: clustering - replication
[Diagram: Kafka Topic with Partition 1 and Partition 2 on Kafka node 1 and Kafka node 2, plus Partition 1 Replica and Partition 2 Replica]
Reactive trait: Resilient
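
The arrangement and replication slides can be modelled in a few lines. This is a sketch under stated assumptions: `partition_for` uses CRC32 as a stand-in for the client's real hash function, and `place` puts each replica on the next node in the list, which is one plausible layout rather than Kafka's actual placement algorithm. Partitions are numbered from 0 here.

```python
import zlib


def partition_for(key, num_partitions):
    # Assumption: CRC32 stands in for the hash used by a real client.
    # The point is only that the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def place(num_partitions, nodes, replication_factor):
    """Map each partition to a list of nodes; the first entry holds the primary copy."""
    layout = {}
    for p in range(num_partitions):
        layout[p] = [nodes[(p + r) % len(nodes)] for r in range(replication_factor)]
    return layout


layout = place(num_partitions=2, nodes=["node1", "node2"], replication_factor=2)
# layout == {0: ["node1", "node2"], 1: ["node2", "node1"]}

meter1_partition = partition_for("meter1", 2)
```

With two nodes and two copies of each partition, either node can fail and every partition still has a surviving copy.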

Kafka: clustering - consumer
[Diagram: Kafka Topic with Partition #1, Partition #2, Partition #3; Consumer #1, Consumer #2, Consumer #3 in the same consumer group]
Reactive trait: Responsive

[Diagram: Kafka Topic; Consumer #1, Consumer #2, Consumer #3]
Reactive trait: Responsive

[Diagram: Kafka Topic; Consumer #1, Consumer #2, Consumer #3, Consumer #4; Consumer #4 receives No Data]
Reactive trait: Responsive
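
The consumer clustering slides can be illustrated with a simple round-robin assignment. This is a simplified model: the `assign` function is hypothetical, and Kafka's real assignment strategies are configurable and differ in detail. It shows the constraint that each partition goes to exactly one consumer in a group, so extra consumers stay idle.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer of the group, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [1, 2, 3]

balanced = assign(partitions, ["c1", "c2", "c3"])
# One partition per consumer.

fewer = assign(partitions, ["c1", "c2"])
# c1 takes two partitions, c2 takes one.

extra = assign(partitions, ["c1", "c2", "c3", "c4"])
# c4 gets no partition, hence "No Data" on the slide.
```

The partition count therefore caps the useful parallelism of a single consumer group.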

Kafka: high throughput
• Single partition consumer: 20-90 Mb/sec
Reactive trait: Responsive

Kafka the Reactive MQ
Message Driven
• Key-value messages
Responsive
• Consumer clustering
• High throughput
Resilient
• At-least-once delivery
• Replication
Elastic
• Linear scalability

Kafka consumer patterns
[Diagram: Source -> Microservice 1 (three instances) -> Kafka -> Microservice 2 (three instances) -> Sink]

Simple message queue
[Diagram: Electric Meter -> Electric Readings topic (one Partition, two Partition replicas) -> Auditing]
Kafka Terminology:
- Partition Count: 1

Simple message queue - fanout
[Diagram: Electric Meter -> Electric Readings topic (one Partition, two Partition replicas) -> Auditing and Billing]
Kafka Terminology:
- Partition Count: 1
- Multiple Consumer Groups
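
Fanout with multiple consumer groups can be sketched as one shared log with a separate read position per group. The `Topic` class below is a toy in-memory model with hypothetical names, not a Kafka API.

```python
class Topic:
    def __init__(self):
        self.log = []      # messages are appended and never removed by reading
        self.offsets = {}  # consumer group name -> next offset to read

    def publish(self, message):
        self.log.append(message)

    def poll(self, group):
        """Return everything this group has not seen yet and advance its position."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:]
        self.offsets[group] = len(self.log)
        return batch


readings = Topic()
readings.publish(("meter1", 1.34))
readings.publish(("meter1", 1.40))

auditing_batch = readings.poll("auditing")  # both readings
billing_batch = readings.poll("billing")    # the same two readings again

readings.publish(("meter1", 1.52))
auditing_next = readings.poll("auditing")   # only the new reading
```

Because each group tracks its own position, Auditing and Billing see every message without interfering with each other.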

Simple message queue - consumer
[Diagram: Kafka Partition -> Auditing Service (Consumer Client, App logic) -> DB]
1. Consume a batch of messages from Kafka
2. Process messages and send results to wherever necessary (e.g. another Kafka topic)
3. Confirm delivery to Kafka
Kafka Terminology:
- Commit Mode: Manual
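
The three steps above can be sketched as a loop. This is a minimal in-memory model, assuming a hypothetical `Partition` class with `poll` and `commit` methods; it mirrors the order of operations for manual commits, not any real client's signatures.

```python
class Partition:
    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # commit point: where a restarted consumer resumes

    def poll(self, position, max_records):
        return self.messages[position:position + max_records]

    def commit(self, offset):
        self.committed = offset


def run_consumer(partition, process, batch_size=2):
    position = partition.committed
    # 1. Consume a batch of messages
    while batch := partition.poll(position, batch_size):
        # 2. Process the messages and send results wherever necessary
        for message in batch:
            process(message)
        position += len(batch)
        # 3. Confirm delivery only after processing has succeeded
        partition.commit(position)


audit_log = []  # stands in for the DB on the slide
partition = Partition(["r1", "r2", "r3", "r4", "r5"])
run_consumer(partition, audit_log.append)
```

Committing after processing is what gives at-least-once behaviour: a crash in step 2 leaves the commit point behind the failed batch, so that batch is read again.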

Kafka: message confirmation
• Messages confirmed by offset (not individually)
[Diagram: Partition, Consumer, Consumed, Commit point; the second step adds Commit]
Kafka Terminology:
- Commit Mode: Manual
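
A small sketch of confirmation by offset, using a plain list as the partition. The helper `consume` is hypothetical. It follows the Kafka convention that the committed offset is the offset of the next message to read, so a single commit confirms every message before it.

```python
log = [f"reading-{i}" for i in range(6)]


def consume(log, committed, count):
    """Return the messages a consumer sees when it starts from the commit point."""
    return log[committed:committed + count]


committed = 0
first_run = consume(log, committed, 4)      # offsets 0..3 are consumed

committed = 3                               # one commit confirms offsets 0, 1 and 2

# The consumer crashes here and restarts from the commit point.
after_restart = consume(log, committed, 3)  # offsets 3..5
```

Offset 3 was consumed but not yet committed, so it shows up in both runs.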

Parallel workers
[Diagram: Electric Meters -> Electric Readings topic (Partition #1, Partition #2, ... Partition #N) -> Auditing node #1, Auditing node #2, ... Auditing node #N]
Kafka Terminology:
- Partition Count: >1
- Single Consumer Group

Consumer for parallel processing
• Same arrangement from consumer perspective
[Diagram: Kafka Partitions -> Auditing Service (Consumer Client, App logic) -> DB]
Kafka Terminology:
- Partition Count: >1
- Commit Mode: Manual

Orchestration
• Provide Scaling Capability
• Restart or replace failed nodes
[Diagram: Electric Meters -> Electric Readings topic (Partition #1, Partition #2, ... Partition #N) -> Auditing node #1, Auditing node #2, ... Auditing node #N; Mesos/Marathon adds a New node]

Stateful Processing
• Example: Average electricity consumption per meter for the last hour
[Diagram: Electric Meters -> Electric Readings topic (three Partitions) -> Aggregation]
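
The example on this slide can be sketched as a sliding-window aggregator. The `HourlyAverage` class is hypothetical, timestamps are plain seconds, and readings for a given meter are assumed to arrive in timestamp order, which holds when one meter's readings all land in the same partition.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # "the last hour"


class HourlyAverage:
    def __init__(self):
        # meter key -> deque of (timestamp, value), oldest first
        self.readings = defaultdict(deque)

    def add(self, meter, timestamp, value):
        window = self.readings[meter]
        window.append((timestamp, value))
        # Drop readings that are an hour old or older relative to the newest one
        while window and window[0][0] <= timestamp - WINDOW_SECONDS:
            window.popleft()

    def average(self, meter):
        window = self.readings[meter]
        if not window:
            return None
        return sum(value for _, value in window) / len(window)


aggregator = HourlyAverage()
aggregator.add("meter1", timestamp=0, value=10.0)
aggregator.add("meter1", timestamp=1800, value=20.0)
aggregator.add("meter1", timestamp=3600, value=30.0)  # evicts the reading from t=0
meter1_average = aggregator.average("meter1")
```

The per-meter window is the processing state that the following slides are concerned with keeping local and recoverable.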

Stream and state
• Data locality
[Diagram: Electric Readings topic; Aggregator for Partition #1, Aggregator for Partition #2; aggregator state, e.g. Key: "meter 1" Value: 9.2, Key: "meter 2" Value: 2.7]

Fault tolerance
• State persistence and recovery
[Diagram: Electric Readings topic; Aggregator for Partition #1, Aggregator for Partition #2; Persistence]

Stateful Processing app
[Diagram: Kafka Partitions -> Aggregation Service (Consumer Client, Aggregation logic) -> Persistence (Kafka/DB/?)]

Stateful Processing app
• Duplicated message processing after recovery.
[Diagram: Kafka Partitions -> two Aggregation Service instances (Consumer Client, Aggregation logic) -> Persistence (Kafka/DB/?)]

Stateful Processing app
• Persist state with partition offsets
• Don't commit! Just fetch more data
[Diagram: Kafka Partitions -> Aggregation Service (Consumer Client, Aggregation logic) -> Persistence (Kafka/DB/?)]
Kafka Terminology:
- Commit Mode: Self Managed Offsets
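
A sketch of why persisting state together with the partition offset avoids duplicated processing after recovery. The `Store` class is a hypothetical stand-in for the persistence layer (Kafka/DB/?) and `aggregate` is a toy aggregation loop; the key assumption is that state and offset are saved in one atomic step.

```python
import copy


class Store:
    """Stand-in for the persistence layer: state and offset are saved together."""

    def __init__(self):
        self.snapshot = {"offset": 0, "totals": {}}

    def save(self, offset, totals):
        self.snapshot = {"offset": offset, "totals": copy.deepcopy(totals)}

    def load(self):
        snapshot = copy.deepcopy(self.snapshot)
        return snapshot["offset"], snapshot["totals"]


def aggregate(partition, store, crash_after=None):
    """Sum values per key, resuming from the persisted offset instead of a Kafka commit."""
    offset, totals = store.load()
    handled = 0
    for key, value in partition[offset:]:
        if crash_after is not None and handled == crash_after:
            return None  # simulated crash: in-memory state is lost
        totals[key] = totals.get(key, 0.0) + value
        offset += 1
        handled += 1
        store.save(offset, totals)  # state and offset move forward together
    return totals


partition = [("meter1", 1.0), ("meter2", 2.0), ("meter1", 3.0)]
store = Store()

aggregate(partition, store, crash_after=2)  # crashes before the third message
recovered = aggregate(partition, store)     # resumes at offset 2, no double counting
```

Had the offset been committed separately from the state, a crash between the two writes could replay messages into state that already includes them.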

Stateful Processing architecture
• Dynamic partition assignment
• Shared Persistence for State
[Diagram: Orchestration Service; Aggregator 1, Aggregator 2, Aggregator 3; Partition #1, Partition #2, Partition #4, Partition #6; Persistence (Kafka/DB/?)]

Streaming Patterns
Stateful Processing
• Self-managed processing state
Single Partition Topic
• Strong ordering guarantees
• Limited failure recovery
• Scalability is limited
Multi Partition Topic
• Parallel processing
• Limited ordering guarantees
• Kafka managed processing state
Fanout
• Independent consumer groups

Kafka libraries
• Kafka client support in many languages
• Scala, Java, C
• C bindings -> Haskell, OCaml, Python etc.
[Diagram: Source -> Microservice 1 (three instances) -> Kafka -> Microservice 2 (three instances) -> Sink]

Reactive Streaming APIs
• Similar paradigm as in real-time streaming platforms
• Reactive Kafka
  ◦ Based on Akka Reactive Streams API
  ◦ Scala + Java
  ◦ Developed by Akka team
• Kafka Streams
  ◦ Official streaming API for Kafka
  ◦ Java
  ◦ Developed by Confluent

scala-kafka-client
• Kafka client developed for Scala
• Async and non-blocking
• Built on top of the official Java driver
• Easy API with high performance
/cakesolutions /scala-kafka-client

scala-kafka-client
• Leverage extensive Akka feature set
• Processing logic implemented using Actor Model
[Diagram: Kafka, Kafka Consumer Actor, Receiver Actor, Kafka Producer Actor, Kafka]
/cakesolutions /scala-kafka-client

Summary
• Leverage Microservice based techniques.
• Streaming topologies can be varied and complex
  ◦ Many use-cases fall under a small set of consumer patterns.
• Challenges around scalable and reactive data pipelines
• Kafka provides first-class support for reactive streaming to your applications.
• Stateful processing remains a challenging area.

We didn’t discuss...
• Data serialisation
• Application rolling updates
• Complex streaming topologies

Questions?
MANCHESTER LONDON NEW YORK
/cakesolutions /scala-kafka-client
@cakesolutions
+44 845 617 1200
