
The journey from queues to data pipeline streams

Modern data pipelines have come a long way from traditional publisher-subscriber queues and asynchronous job execution.
Nowadays, tools like Kafka serve as an organization's data backbone, processing terabytes of data per day across real-time microservices and batch jobs built on a variety of data stores and tools.
In this talk we will discuss the key differences between Kafka and traditional queues, and how data pipelines have transformed the backend architecture of many big-data companies, providing better resiliency through concepts like back pressure, distributed logs, and stream processing.

Shlomi Shemesh

May 19, 2018

Transcript

  1. 90% of the data in the world today has been created in the last two
     years alone, at 2.5 quintillion bytes of data a day!
     -- https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=WRL12345USEN
  2. Hi, I'm Shlomi. Currently Head of R&D, providing a serverless
     environment to build and run web apps. Previously R&D Manager on
     real-time data flows, processing billions of daily events in real
     time for mobile app analytics.
  3. AGENDA
     ▪ Lots of data
     ▪ Using queues to process data
     ▪ Using data pipelines to process data at scale
     ▪ Data pipeline patterns
     ▪ Log compaction
  4. Phase 1: Synchronous Microservices Architecture
     [diagram: ELBs → web servers → services 1-4 and A-C calling each
     other synchronously]
  5. Synchronous microservices are coupled
     [same diagram as above, highlighting the direct dependencies
     between services]
  6. Should we process in batches? Data is bounded, with a start and an
     end, within a job.
     Pros: ▪ Large amounts of data ▪ Scalability
     Cons: ▪ Intervals = latency
  7. Low-latency actionable decision making
     ▪ Mobile ads ▪ Cyber security ▪ Fraud detection ▪ Ride sharing
     ▪ Healthcare monitoring ▪ IoT
  8. Processing batches of data: data is bounded, with a start and an
     end, in a job. Processing streams of data: unbounded data coming in
     continuously, in real time.
  9. "Employing explicit message-passing enables load management,
     elasticity, and flow control by ... the message queues."
     -- https://www.reactivemanifesto.org/
  10. Phase 2: Pub-Sub
      [diagram: publisher → queue → three subscribers]
      We gain: ▪ Decoupling ▪ Parallelism
      Still missing: ▪ Fault tolerance ▪ Data is pushed, not pulled
      ▪ Concurrency
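
A minimal sketch of this push-style pub-sub, assuming RabbitMQ and the
pika Python client (the slide names no concrete stack): a fanout
exchange pushes every published event to all bound subscriber queues.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A fanout exchange pushes every message to all bound subscriber queues:
# this is the "data is pushed, not pulled" model of phase 2.
channel.exchange_declare(exchange="events", exchange_type="fanout")
channel.basic_publish(exchange="events", routing_key="", body=b"user-signed-up")
connection.close()
```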
  11. Queue properties
      Performance: ▪ Latency ▪ Throughput ▪ Scalability
      Fault tolerance: ▪ Recover in case of failures ▪ Consumer keeps
      going from where it stopped
      FIFO message order
  12. Queue properties
      Delivery guarantees -- incoming data in a streaming engine will be
      processed: ▪ At-least-once ▪ At-most-once ▪ Exactly-once
      Concurrency vs. parallelism. Let's see an example.
  13. Phase 3: messages removed when consumed
      [diagram: publisher → queue → three subscribers]
      We gain: ▪ Pull rather than push ▪ Better delivery guarantees
      ▪ Better fault tolerance ▪ Concurrency
      Still missing: ▪ Message order ▪ How do we scale the rabbit broker?
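
A sketch of the pull-plus-ack model this phase gains, again assuming
RabbitMQ and pika: basic_get pulls a single message, and the broker
removes it only once the consumer acks; process is a hypothetical
handler.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="events", durable=True)

# basic_get pulls one message; it returns (None, None, None) when empty.
method, properties, body = channel.basic_get(queue="events")
if method is not None:
    process(body)  # hypothetical processing function
    # Acking removes the message from the queue. An unacked message is
    # redelivered if this consumer dies -- the better fault tolerance.
    channel.basic_ack(delivery_tag=method.delivery_tag)
connection.close()
```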
  14. Scaling publishers -- parallelism vs. concurrency
      [diagram: publisher → exchange → queues 1-3, each with its own
      subscriber group (A1-A2, B1-B3, C1-C2)]
  15. Rabbit delivery guarantees
      At-least-once: perform the side effect (1), then ack (2).
      At-most-once: ack (1), then perform the side effect (2).
      [diagram: producer → queue → subscriber → side effect, in both
      orderings]
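
The two orderings from the slide, as consumer callbacks (pika-style
signatures; apply_side_effect is a hypothetical handler). The guarantee
falls out of whether the ack happens before or after the side effect.

```python
def at_least_once(channel, method, properties, body):
    apply_side_effect(body)                               # 1. do the work
    channel.basic_ack(delivery_tag=method.delivery_tag)   # 2. then ack
    # Crash between 1 and 2 => redelivery => possible duplicate side effect.

def at_most_once(channel, method, properties, body):
    channel.basic_ack(delivery_tag=method.delivery_tag)   # 1. ack first
    apply_side_effect(body)                               # 2. then do the work
    # Crash between 1 and 2 => message is gone => the side effect is lost.
```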
  16. Exactly once -- "all or nothing". Not supported by RabbitMQ:
      ▪ Queue should support retry idempotency
      ▪ Producer should be idempotent
      ▪ Consumer should be idempotent
      [diagram: idempotent producer → queue → idempotent subscriber →
      side effect]
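
A common way to approximate the consumer-idempotency bullet, sketched in
plain Python; the message id and the in-memory seen-set are illustrative
assumptions (a real system would keep the set in durable storage).

```python
processed_ids = set()  # in production this would live in durable storage

def handle(message_id: str, body: bytes) -> None:
    if message_id in processed_ids:
        return                  # duplicate redelivery: safe to skip
    apply_side_effect(body)     # hypothetical side effect
    processed_ids.add(message_id)
```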
  17. RabbitMQ -- what's next?
      ▪ Complex to scale out, requires DevOps overhead
      ▪ Does not support exactly-once
      ▪ No order guarantees with concurrency
  18. Apache Kafka: publish-subscribe messaging rethought as a
      distributed commit log.
      ▪ Writes are append-only
      ▪ Reads are a single seek to an offset, then a scan
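
A toy model of the commit-log abstraction just described (not Kafka's
actual implementation): appends go to the end and return an offset; a
read is one seek plus a forward scan.

```python
class CommitLog:
    """Minimal in-memory sketch of an append-only log."""

    def __init__(self):
        self._records = []

    def append(self, record: bytes) -> int:
        """Append-only write; returns the new record's offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int):
        """A single seek to `offset`, then a sequential scan forward."""
        return iter(self._records[offset:])
```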
  19. Apache Kafka Cluster
      ▪ Distributed by design
      ▪ Scalable: elastically and transparently expanded without
        downtime; data streams are partitioned and spread over a cluster
        of machines
      ▪ Durable & fault tolerant: data is persisted to disk and
        replicated across following brokers; producers can wait on
        acknowledgement from replicas -- eventual/strong consistency
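
The replica-acknowledgement trade-off in the last bullet, shown with the
kafka-python client (a library choice of this write-up, not the talk's):

```python
from kafka import KafkaProducer

# acks=1 waits for the leader only (faster, weaker); acks="all" waits for
# the in-sync replicas -- the "wait on acknowledgement from replicas" option.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("events", b"payload")
producer.flush()  # block until the broker has acknowledged the write
```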
  20. Anatomy of a topic
      ▪ Writes are append-only per partition
      ▪ Messages are partitioned by key
      ▪ Ordered within a partition
      [diagram: Service A cluster publishes to topic foo (partitions
      0-2); Service B and Service C clusters consume as groups I and II,
      each tracking its own per-partition offsets -- e.g. group I at
      1254/1235/1398 and group II at 698/680/701]
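
A sketch of keyed writes with kafka-python (an assumed client): the key
is hashed to pick the partition, which is what gives the per-partition
ordering above.

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Every message with key b"user-42" lands on the same partition of "foo",
# so a consumer sees this user's events in write order.
producer.send("foo", key=b"user-42", value=b"clicked-checkout")
producer.send("foo", key=b"user-42", value=b"paid")
producer.flush()
```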
  21. Apache Kafka vs. traditional queues
      ▪ The offset is the consumer's responsibility
      ▪ Messages are kept even if consumed (retention policy)
      ▪ Stronger ordering guarantees via partitions
      ▪ Concurrency via partitions
      ▪ Parallelism via different consumer groups
      ▪ At-least-once guaranteed, exactly-once supported
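
A consumer-group sketch with kafka-python (assumed client) showing the
first bullet, offsets as the consumer's responsibility; handle is a
hypothetical processing function.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "foo",
    bootstrap_servers="localhost:9092",
    group_id="service-b",          # groups consume in parallel
    enable_auto_commit=False,      # the offset is our responsibility
)
for record in consumer:
    handle(record.key, record.value)  # hypothetical handler
    consumer.commit()                 # checkpoint our position afterwards
```

Committing only after processing gives at-least-once semantics, and
since Kafka keeps the messages per its retention policy, a crashed
consumer simply resumes from its last committed offset.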
  22. It's a paradigm shift in the way we think about event-driven
      microservices architecture.
  23. Event-Driven Microservices Architecture
      [diagram: ELBs → web servers → services 1-4 and A-C, now exchanging
      events through the pipeline instead of calling each other directly]
  24. Event-Driven Microservices Architecture
      [same diagram, next build]
  25. Event-Driven Microservices Architecture
      [same diagram: a new product requirement arrives]
  26. Event-Driven Microservices Architecture
      [same diagram: Service 5 is added by subscribing to the existing
      event streams]
  27. Comparing the alternatives (pub-sub / RabbitMQ / Kafka):
      Fault tolerant:      -    +     ++
      Concurrency:         -    +     ++
      Delivery guarantee:  no / at-least-once / supports exactly-once
      Scale:               +    +     ++
      Latency:             +    ++    +
      Order:               -    -     +
  28. "Back-pressure allows systems to gracefully respond to load rather
      than collapse under it… will ensure that the system is resilient
      under load."
      -- http://www.reactivemanifesto.org/glossary#Back-Pressure
  29. Kafka has built-in back-pressure: queue lag = log size - offset
      [diagram: CPU-bound Service A and IO-bound Service B each consume
      at their own pace]
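
The slide's lag formula, computed per partition with kafka-python
(assumed client): lag = log end offset - committed consumer offset.

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="service-a")
tp = TopicPartition("foo", 0)
end = consumer.end_offsets([tp])[tp]      # current log size
committed = consumer.committed(tp) or 0   # where this group got to
print("lag:", end - committed)            # slide's queue-lag formula
```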
  30. Monitoring event-driven microservices
      [diagram: ELBs → web servers → services 1-5 and A-C, including
      calls to 3rd-party APIs]
  31. Just spin up another consumer group
      ▪ Easily add more microservices that consume from the same topic
      ▪ Data migration from US- to EU-located servers
      ▪ Replace a legacy microservice with new code -- and run both in
        parallel
      [diagram: Service A and Service A' as separate consumer groups;
      US → EU]
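
"Spinning up another consumer group" is, in client terms, just a new
group_id; a sketch with kafka-python (assumed client) for the Service A'
case, replaying the full retained topic independently of Service A:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "foo",
    bootstrap_servers="localhost:9092",
    group_id="service-a-prime",       # new group => its own offsets
    auto_offset_reset="earliest",     # start from the oldest retained data
)
```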
  32. Time traveling: start consuming from a past date
      ▪ Recalculate data for a customer (Service A → Service A')
      ▪ Introduce a new analytics service (a new Service B starts
        consuming from a past offset)
      ▪ Side effects run in parallel to production
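
A time-travel sketch with kafka-python (assumed client): resolve the
offset nearest a past timestamp, then seek the consumer to it.

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="replayer")
tp = TopicPartition("foo", 0)
consumer.assign([tp])

ts_ms = 1514764800000  # e.g. 2018-01-01 UTC, in milliseconds
offsets = consumer.offsets_for_times({tp: ts_ms})
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)  # resume from that point in time
```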
  33. Data used by an event-driven microservice
      [diagram: My Service consumes a real-time, high-throughput data
      pipeline fed by Wix Store, Wix Payments, Wix CRM, and Wix user
      serverless functions; online data that rarely changes -- how do we
      serve it?]
  34. Apache Kafka log compaction
      ▪ Single partition
      ▪ We use it with infinite retention, as a DB
      ▪ The head of the compacted log is identical to a traditional
        Kafka log
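
One way to create such a compacted, infinite-retention topic, using
kafka-python's admin client (an assumed tool; any CLI works too -- the
broker configs cleanup.policy and retention.ms are standard Kafka):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="service-state",
        num_partitions=1,                  # single partition, as above
        replication_factor=1,
        topic_configs={
            "cleanup.policy": "compact",   # compact instead of delete
            "retention.ms": "-1",          # infinite retention: use as a DB
        },
    )
])
```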
  35. Apache Kafka log compaction -- only the latest version of each key
      is kept.
      Before compaction:
        Offset: 13  17  19  30  31  32  33  34  35  36  37  38
        Key:    k1  k5  k2  k7  k8  k4  k1  k1  k1  k9  k8  k2
        Value:  v5  v2  v7  v1  v4  v6  v1  v2  v9  v6  v22 v25
      After compaction:
        Offset: 17  30  32  35  36  37  38
        Key:    k5  k7  k4  k1  k9  k8  k2
        Value:  v2  v1  v6  v9  v6  v22 v25
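
What the log cleaner does to the table above, in miniature (plain
Python, using the slide's own data): keep only the highest-offset value
per key.

```python
log = [(13, "k1", "v5"), (17, "k5", "v2"), (19, "k2", "v7"), (30, "k7", "v1"),
       (31, "k8", "v4"), (32, "k4", "v6"), (33, "k1", "v1"), (34, "k1", "v2"),
       (35, "k1", "v9"), (36, "k9", "v6"), (37, "k8", "v22"), (38, "k2", "v25")]

latest = {}
for offset, key, value in log:   # later offsets overwrite earlier ones
    latest[key] = (offset, value)

compacted = sorted((off, k, v) for k, (off, v) in latest.items())
print(compacted)
# [(17, 'k5', 'v2'), (30, 'k7', 'v1'), (32, 'k4', 'v6'), (35, 'k1', 'v9'),
#  (36, 'k9', 'v6'), (37, 'k8', 'v22'), (38, 'k2', 'v25')]
```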
  36. Apache Kafka log compaction -- only the latest version of each key
      is kept (the compacted log above).
      ▪ Used as an event-sourcing storage
      ▪ Built-in compaction to a key/value storage
      ▪ CQRS
      ▪ Subscribers are notified using Kafka consumers
  37. [same diagram as slide 33, with the answer filled in: the
      rarely-changing online data is pushed to the real-time microservice
      from a log-compacted topic, alongside the real-time,
      high-throughput pipeline from Wix Store, Wix Payments, Wix CRM, and
      Wix user serverless functions]
  38. Takeaways
      ▪ Event-driven architecture: better decoupling, resilience, scale,
        production monitoring
      ▪ Engineering impact ▪ Culture impact ▪ Innovation impact
      ▪ Open discussion: log compaction, CQRS, Kafka Streams
  39. Apache Kafka Cluster
      ▪ Durable & fault tolerant: data is persisted to disk and
        replicated across following brokers; producers can wait on
        acknowledgement
      ▪ Scalable: elastically and transparently expanded without
        downtime; data streams are partitioned and spread over a cluster
        of machines
      ▪ Distributed by design
      [diagram: brokers 1-4, each holding leader and follower replicas
      of topic A, partitions 1-4]