
The journey from queues to data pipeline streams

Modern data pipelines have come a long way from traditional publisher-subscriber queues and asynchronous job execution.
Nowadays, tools like Kafka serve as an organization's data backbone, processing terabytes of data per day across real-time microservices and batch jobs built on a variety of data stores and tools.
In this talk we will discuss the key differences between Kafka and traditional queues, and how data pipelines have transformed the backend architecture of many big-data companies, providing better resiliency through concepts like back pressure, distributed logs, and stream processing.

Shlomi Shemesh

May 19, 2018

Transcript

  1. 90% of the data in the world today has been created in the last two
     years alone, at 2.5 quintillion bytes of data a day!
     -- https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=WRL12345USEN
  2. Hi, I'm Shlomi. Currently Head of R&D, providing a serverless
     environment to build and run web apps. Previously R&D Manager on
     real-time data flows, processing billions of daily events in real
     time for mobile app analytics.
  3. AGENDA
     ▪ Lots of data
     ▪ Using queues to process data
     ▪ Using data pipelines to process data at scale
     ▪ Data pipeline patterns
     ▪ Log compaction
  4. Phase 1: Synchronous Microservices Architecture
     [diagram: ELBs → web servers → services 1-4 and A-C calling each
     other synchronously]
  5. Synchronous microservices are coupled
     [same diagram as above, highlighting the direct dependencies
     between services]
  6. Should we process in batches? Data is bounded, with a start and an
     end, within a job.
     Pros: ▪ Large amounts of data ▪ Scalability
     Cons: ▪ Intervals = latency
  7. Low-latency actionable decision making
     ▪ Mobile ads ▪ Cyber security ▪ Fraud detection ▪ Ride sharing
     ▪ Healthcare monitoring ▪ IoT
  8. Processing batches of data: data is bounded, with a start and an
     end, in a job. Processing streams of data: unbounded data coming in
     continuously, in real time.
  9. "Employing explicit message-passing enables load management,
     elasticity, and flow control by ... the message queues."
     -- https://www.reactivemanifesto.org/
  10. Phase 2: Pub-Sub
      [diagram: publisher → queue → three subscribers]
      We gain: ▪ Decoupling ▪ Parallelism
      Still missing: ▪ Fault tolerance ▪ Data is pushed, not pulled
      ▪ Concurrency
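
A minimal sketch of this push-style pub-sub, assuming RabbitMQ and the
pika Python client (the slide names no concrete stack): a fanout
exchange pushes every published event to all bound subscriber queues.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A fanout exchange pushes every message to all bound subscriber queues:
# this is the "data is pushed, not pulled" model of phase 2.
channel.exchange_declare(exchange="events", exchange_type="fanout")
channel.basic_publish(exchange="events", routing_key="", body=b"user-signed-up")
connection.close()
```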
  11. Queue properties
      Performance: ▪ Latency ▪ Throughput ▪ Scalability
      Fault tolerance: ▪ Recover in case of failures ▪ Consumer keeps
      going from where it stopped
      FIFO message order
  12. Queue properties
      Delivery guarantees -- incoming data in a streaming engine will be
      processed: ▪ At-least-once ▪ At-most-once ▪ Exactly-once
      Concurrency vs. parallelism. Let's see an example.
  13. Phase 3: messages removed when consumed
      [diagram: publisher → queue → three subscribers]
      We gain: ▪ Pull rather than push ▪ Better delivery guarantees
      ▪ Better fault tolerance ▪ Concurrency
      Still missing: ▪ Message order ▪ How do we scale the rabbit broker?
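
A sketch of the pull-plus-ack model this phase gains, again assuming
RabbitMQ and pika: basic_get pulls a single message, and the broker
removes it only once the consumer acks; process is a hypothetical
handler.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="events", durable=True)

# basic_get pulls one message; it returns (None, None, None) when empty.
method, properties, body = channel.basic_get(queue="events")
if method is not None:
    process(body)  # hypothetical processing function
    # Acking removes the message from the queue. An unacked message is
    # redelivered if this consumer dies -- the better fault tolerance.
    channel.basic_ack(delivery_tag=method.delivery_tag)
connection.close()
```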
  14. Scaling publishers -- parallelism vs. concurrency
      [diagram: publisher → exchange → queues 1-3, each with its own
      subscriber group (A1-A2, B1-B3, C1-C2)]
  15. Rabbit delivery guarantees
      At-least-once: perform the side effect (1), then ack (2).
      At-most-once: ack (1), then perform the side effect (2).
      [diagram: producer → queue → subscriber → side effect, in both
      orderings]
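
The two orderings from the slide, as consumer callbacks (pika-style
signatures; apply_side_effect is a hypothetical handler). The guarantee
falls out of whether the ack happens before or after the side effect.

```python
def at_least_once(channel, method, properties, body):
    apply_side_effect(body)                               # 1. do the work
    channel.basic_ack(delivery_tag=method.delivery_tag)   # 2. then ack
    # Crash between 1 and 2 => redelivery => possible duplicate side effect.

def at_most_once(channel, method, properties, body):
    channel.basic_ack(delivery_tag=method.delivery_tag)   # 1. ack first
    apply_side_effect(body)                               # 2. then do the work
    # Crash between 1 and 2 => message is gone => the side effect is lost.
```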
  16. Exactly once -- "all or nothing". Not supported by RabbitMQ:
      ▪ Queue should support retry idempotency
      ▪ Producer should be idempotent
      ▪ Consumer should be idempotent
      [diagram: idempotent producer → queue → idempotent subscriber →
      side effect]
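
A common way to approximate the consumer-idempotency bullet, sketched in
plain Python; the message id and the in-memory seen-set are illustrative
assumptions (a real system would keep the set in durable storage).

```python
processed_ids = set()  # in production this would live in durable storage

def handle(message_id: str, body: bytes) -> None:
    if message_id in processed_ids:
        return                  # duplicate redelivery: safe to skip
    apply_side_effect(body)     # hypothetical side effect
    processed_ids.add(message_id)
```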
  17. RabbitMQ -- what's next?
      ▪ Complex to scale out, requires DevOps overhead
      ▪ Does not support exactly-once
      ▪ No order guarantees with concurrency
  18. Apache Kafka: publish-subscribe messaging rethought as a
      distributed commit log.
      ▪ Writes are append-only
      ▪ Reads are a single seek to an offset, then a scan
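
A toy model of the commit-log abstraction just described (not Kafka's
actual implementation): appends go to the end and return an offset; a
read is one seek plus a forward scan.

```python
class CommitLog:
    """Minimal in-memory sketch of an append-only log."""

    def __init__(self):
        self._records = []

    def append(self, record: bytes) -> int:
        """Append-only write; returns the new record's offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int):
        """A single seek to `offset`, then a sequential scan forward."""
        return iter(self._records[offset:])
```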
  19. Apache Kafka Cluster
      ▪ Distributed by design
      ▪ Scalable: elastically and transparently expanded without
        downtime; data streams are partitioned and spread over a cluster
        of machines
      ▪ Durable & fault tolerant: data is persisted to disk and
        replicated across following brokers; producers can wait on
        acknowledgement from replicas -- eventual/strong consistency
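
The replica-acknowledgement trade-off in the last bullet, shown with the
kafka-python client (a library choice of this write-up, not the talk's):

```python
from kafka import KafkaProducer

# acks=1 waits for the leader only (faster, weaker); acks="all" waits for
# the in-sync replicas -- the "wait on acknowledgement from replicas" option.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("events", b"payload")
producer.flush()  # block until the broker has acknowledged the write
```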
  20. Anatomy of a topic
      ▪ Writes are append-only per partition
      ▪ Messages are partitioned by key
      ▪ Ordered within a partition
      [diagram: Service A cluster publishes to topic foo (partitions
      0-2); Service B and Service C clusters consume as groups I and II,
      each tracking its own per-partition offsets -- e.g. group I at
      1254/1235/1398 and group II at 698/680/701]
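
A sketch of keyed writes with kafka-python (an assumed client): the key
is hashed to pick the partition, which is what gives the per-partition
ordering above.

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Every message with key b"user-42" lands on the same partition of "foo",
# so a consumer sees this user's events in write order.
producer.send("foo", key=b"user-42", value=b"clicked-checkout")
producer.send("foo", key=b"user-42", value=b"paid")
producer.flush()
```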
  21. Apache Kafka vs. traditional queues
      ▪ The offset is the consumer's responsibility
      ▪ Messages are kept even if consumed (retention policy)
      ▪ Stronger ordering guarantees via partitions
      ▪ Concurrency via partitions
      ▪ Parallelism via different consumer groups
      ▪ At-least-once guaranteed, exactly-once supported
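
A consumer-group sketch with kafka-python (assumed client) showing the
first bullet, offsets as the consumer's responsibility; handle is a
hypothetical processing function.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "foo",
    bootstrap_servers="localhost:9092",
    group_id="service-b",          # groups consume in parallel
    enable_auto_commit=False,      # the offset is our responsibility
)
for record in consumer:
    handle(record.key, record.value)  # hypothetical handler
    consumer.commit()                 # checkpoint our position afterwards
```

Committing only after processing gives at-least-once semantics, and
since Kafka keeps the messages per its retention policy, a crashed
consumer simply resumes from its last committed offset.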
  22. It's a paradigm shift in the way we think about event-driven
      microservices architecture.
  23. Event-Driven Microservices Architecture
      [diagram: ELBs → web servers → services 1-4 and A-C, now exchanging
      events through the pipeline instead of calling each other directly]
  24. Event-Driven Microservices Architecture
      [same diagram, next build]
  25. Event-Driven Microservices Architecture
      [same diagram: a new product requirement arrives]
  26. Event-Driven Microservices Architecture
      [same diagram: Service 5 is added by subscribing to the existing
      event streams]
  27. Comparing the alternatives (pub-sub / RabbitMQ / Kafka):
      Fault tolerant:      -    +     ++
      Concurrency:         -    +     ++
      Delivery guarantee:  no / at-least-once / supports exactly-once
      Scale:               +    +     ++
      Latency:             +    ++    +
      Order:               -    -     +
  28. "Back-pressure allows systems to gracefully respond to load rather
      than collapse under it… will ensure that the system is resilient
      under load."
      -- http://www.reactivemanifesto.org/glossary#Back-Pressure
  29. Kafka has built-in back-pressure: queue lag = log size - offset
      [diagram: CPU-bound Service A and IO-bound Service B each consume
      at their own pace]
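
The slide's lag formula, computed per partition with kafka-python
(assumed client): lag = log end offset - committed consumer offset.

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="service-a")
tp = TopicPartition("foo", 0)
end = consumer.end_offsets([tp])[tp]      # current log size
committed = consumer.committed(tp) or 0   # where this group got to
print("lag:", end - committed)            # slide's queue-lag formula
```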
  30. Monitoring event-driven microservices
      [diagram: ELBs → web servers → services 1-5 and A-C, including
      calls to 3rd-party APIs]
  31. Just spin up another consumer group
      ▪ Easily add more microservices that consume from the same topic
      ▪ Data migration from US- to EU-located servers
      ▪ Replace a legacy microservice with new code -- and run both in
        parallel
      [diagram: Service A and Service A' as separate consumer groups;
      US → EU]
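
"Spinning up another consumer group" is, in client terms, just a new
group_id; a sketch with kafka-python (assumed client) for the Service A'
case, replaying the full retained topic independently of Service A:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "foo",
    bootstrap_servers="localhost:9092",
    group_id="service-a-prime",       # new group => its own offsets
    auto_offset_reset="earliest",     # start from the oldest retained data
)
```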
  32. Time traveling: start consuming from a past date
      ▪ Recalculate data for a customer (Service A → Service A')
      ▪ Introduce a new analytics service (a new Service B starts
        consuming from a past offset)
      ▪ Side effects run in parallel to production
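
A time-travel sketch with kafka-python (assumed client): resolve the
offset nearest a past timestamp, then seek the consumer to it.

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="replayer")
tp = TopicPartition("foo", 0)
consumer.assign([tp])

ts_ms = 1514764800000  # e.g. 2018-01-01 UTC, in milliseconds
offsets = consumer.offsets_for_times({tp: ts_ms})
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)  # resume from that point in time
```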
  33. Data used by an event-driven microservice
      [diagram: My Service consumes a real-time, high-throughput data
      pipeline fed by Wix Store, Wix Payments, Wix CRM, and Wix user
      serverless functions; online data that rarely changes -- how do we
      serve it?]
  34. Apache Kafka log compaction
      ▪ Single partition
      ▪ We use it with infinite retention, as a DB
      ▪ The head of the compacted log is identical to a traditional
        Kafka log
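
One way to create such a compacted, infinite-retention topic, using
kafka-python's admin client (an assumed tool; any CLI works too -- the
broker configs cleanup.policy and retention.ms are standard Kafka):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="service-state",
        num_partitions=1,                  # single partition, as above
        replication_factor=1,
        topic_configs={
            "cleanup.policy": "compact",   # compact instead of delete
            "retention.ms": "-1",          # infinite retention: use as a DB
        },
    )
])
```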
  35. Apache Kafka log compaction -- only the latest version of each key
      is kept.
      Before compaction:
        Offset: 13  17  19  30  31  32  33  34  35  36  37  38
        Key:    k1  k5  k2  k7  k8  k4  k1  k1  k1  k9  k8  k2
        Value:  v5  v2  v7  v1  v4  v6  v1  v2  v9  v6  v22 v25
      After compaction:
        Offset: 17  30  32  35  36  37  38
        Key:    k5  k7  k4  k1  k9  k8  k2
        Value:  v2  v1  v6  v9  v6  v22 v25
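
What the log cleaner does to the table above, in miniature (plain
Python, using the slide's own data): keep only the highest-offset value
per key.

```python
log = [(13, "k1", "v5"), (17, "k5", "v2"), (19, "k2", "v7"), (30, "k7", "v1"),
       (31, "k8", "v4"), (32, "k4", "v6"), (33, "k1", "v1"), (34, "k1", "v2"),
       (35, "k1", "v9"), (36, "k9", "v6"), (37, "k8", "v22"), (38, "k2", "v25")]

latest = {}
for offset, key, value in log:   # later offsets overwrite earlier ones
    latest[key] = (offset, value)

compacted = sorted((off, k, v) for k, (off, v) in latest.items())
print(compacted)
# [(17, 'k5', 'v2'), (30, 'k7', 'v1'), (32, 'k4', 'v6'), (35, 'k1', 'v9'),
#  (36, 'k9', 'v6'), (37, 'k8', 'v22'), (38, 'k2', 'v25')]
```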
  36. Apache Kafka log compaction -- only the latest version of each key
      is kept (the compacted log above).
      ▪ Used as an event-sourcing storage
      ▪ Built-in compaction to a key/value storage
      ▪ CQRS
      ▪ Subscribers are notified using Kafka consumers
  37. [same diagram as slide 33, with the answer filled in: the
      rarely-changing online data is pushed to the real-time microservice
      from a log-compacted topic, alongside the real-time,
      high-throughput pipeline from Wix Store, Wix Payments, Wix CRM, and
      Wix user serverless functions]
  38. Takeaways
      ▪ Event-driven architecture: better decoupling, resilience, scale,
        production monitoring
      ▪ Engineering impact ▪ Culture impact ▪ Innovation impact
      ▪ Open discussion: log compaction, CQRS, Kafka Streams
  39. Apache Kafka Cluster
      ▪ Durable & fault tolerant: data is persisted to disk and
        replicated across following brokers; producers can wait on
        acknowledgement
      ▪ Scalable: elastically and transparently expanded without
        downtime; data streams are partitioned and spread over a cluster
        of machines
      ▪ Distributed by design
      [diagram: brokers 1-4, each holding leader and follower replicas
      of topic A, partitions 1-4]