ElixirConf 2019 - Build Efficient Data Processing Pipelines in Elixir Using Broadway

Announced in February 2019, Broadway is a new open-source tool developed by Plataformatec that aims to streamline data processing pipelines with Elixir. It allows developers to build complex GenStage topologies that can consume and process data efficiently from different sources, including Amazon SQS and RabbitMQ. In this talk, we'll discuss some of the main concepts behind Broadway and how it leverages OTP to achieve fault tolerance, sharing implementation details and architectural aspects along the way. You'll learn how to build efficient data processing pipelines and how to optimize them using real-time metrics based on Telemetry events.

Marlus Saraiva

August 30, 2019

Transcript

  1. Agenda • Why Broadway? • How does it work? • Implementation details • Fault tolerance • Graceful shutdown • Telemetry integration
  2. Broadway: an open source tool developed by Plataformatec that aims to streamline data processing pipelines with Elixir
  3. Data pipelines: “A set of data processing elements connected in series, where the output of one element is the input of the next one” - Wikipedia
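
As a minimal illustration of that definition (not taken from the talk), the sketch below chains three elements using only Elixir's standard `Stream` and `Enum` modules. The module name `PipelineSketch` and the particular steps are assumptions made for the example.

```elixir
defmodule PipelineSketch do
  # Each step consumes the output of the previous one, lazily, element by element.
  def run(range) do
    range
    |> Stream.map(&(&1 * 2))               # element 1: double each number
    |> Stream.filter(&(rem(&1, 3) == 0))   # element 2: keep multiples of three
    |> Enum.sum()                          # element 3: aggregate into one value
  end
end
```

Calling `PipelineSketch.run(1..10)` doubles the numbers, keeps 6, 12 and 18, and returns 36. A single-process chain like this has no concurrency, back-pressure or batching, which is the gap the following slides address.
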
  4. Why Broadway? Many companies are using Elixir to build data processing pipelines • Reimplementing the same features • Assembling complex GenStage topologies • Running into the same common pitfalls
  5. Desired Features • Concurrency • Back-pressure • Batching • Graceful shutdowns • Consume data from different sources like SQS, RabbitMQ, … • More…
  10. Complex GenStage pipelines • How to define the right topology for the pipeline? • How to structure the supervision tree correctly? • How to handle graceful shutdown without data loss?
  12. Broadway (slides 12 to 14 show the same module, labelled in turn Behaviour, Configuration and Callbacks)

        defmodule MyBroadway do
          use Broadway

          def start_link(_opts) do
            Broadway.start_link(MyBroadway,
              producers: [...],
              processors: [...],
              batchers: [...]
            )
          end

          def handle_message(...) do
            ...
          end

          def handle_batch(...) do
            ...
          end
        end
  15. Configuration (defines the topology)

        producers: [
          default: [
            module: {Counter, []},
            stages: 2
          ]
        ],
        processors: [
          default: [stages: 3]
        ]

    Diagram labels: Producers, Processors
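
The slides reference a `Counter` producer module without showing it. The sketch below is one guess at the callback shape such a producer might have, following the GenStage producer contract (`init/1` returning `{:producer, state}` and `handle_demand/2` returning `{:noreply, events, state}`). To keep it runnable with only the standard library, the `use GenStage` line is deliberately left out, and the counting logic is an assumption rather than the speaker's code.

```elixir
defmodule Counter do
  # A real producer would `use GenStage` here (gen_stage package); omitted so
  # this sketch compiles without dependencies. Callback shapes follow GenStage.

  # The options come from `module: {Counter, []}`; the state is the next integer.
  def init(_opts), do: {:producer, 0}

  # Back-pressure: emit exactly as many events as downstream stages asked for.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end
```

Because events are only produced in response to demand, slow processors naturally slow the producer down. A producer plugged into Broadway would, as far as I know, also need to wrap each event in a `Broadway.Message` struct or configure a transformer; that part is not sketched here.
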
  17. Configuration (defines the topology)

        producers: [
          default: [
            module: {Counter, []},
            stages: 1
          ]
        ],
        processors: [
          default: [stages: 2]
        ],
        batchers: [
          sqs: [stages: 2],
          s3: [stages: 1]
        ]

    Diagram labels: Producers, Processors, Batchers, Batch Processors, SQS, S3
  21. Callbacks (define the custom code that runs inside the pipeline)

        def handle_message(...) do
          ...
          if sqs?(message) do
            message |> Message.put_batcher(:sqs)
          else
            message |> Message.put_batcher(:s3)
          end
        end

        def handle_batch(:sqs, ...) do
          ...
        end

        def handle_batch(:s3, ...) do
          ...
        end

    Diagram labels: Producers, Processors, Batchers, Batch Processors, SQS, S3
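
To make the routing idea concrete, here is a dependency-free simulation of what the callbacks above achieve: each message is tagged with a batcher, then messages are grouped per batcher and cut into batches. This is not Broadway's implementation. The struct, the `put_batcher/2` stand-in, the `sqs?/1` rule and the `batches/2` helper are all hypothetical names invented for this sketch.

```elixir
defmodule RoutingSketch do
  # Hypothetical message shape; Broadway's own struct is %Broadway.Message{}.
  defstruct [:data, batcher: :default]

  # Stand-in for Message.put_batcher/2 from the slide.
  def put_batcher(%__MODULE__{} = message, batcher), do: %{message | batcher: batcher}

  # Hypothetical predicate: payloads starting with "sqs:" are bound for SQS.
  def sqs?(%__MODULE__{data: data}), do: String.starts_with?(data, "sqs:")

  # Mirrors handle_message on the slide: tag each message with its batcher.
  def handle_message(message) do
    if sqs?(message), do: put_batcher(message, :sqs), else: put_batcher(message, :s3)
  end

  # Groups tagged messages per batcher, then splits each group into batches.
  def batches(messages, batch_size) do
    messages
    |> Enum.map(&handle_message/1)
    |> Enum.group_by(& &1.batcher)
    |> Map.new(fn {batcher, msgs} -> {batcher, Enum.chunk_every(msgs, batch_size)} end)
  end
end
```

Each list under `:sqs` or `:s3` corresponds to what one `handle_batch/4` clause would receive. In the real pipeline this grouping happens concurrently in separate batcher and batch processor stages rather than in one function call.
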
  27. Complex GenStage pipelines • How to define the right topology for the pipeline? • How to structure the supervision tree correctly? • How to handle graceful shutdown without data loss?
  29. Supervision (slides 29 to 50 are animation frames of one supervision tree diagram; the nesting below is inferred from the supervisor names)

        Broadway GenServer
          Broadway Pipeline Supervisor (:rest_for_one)
            ProducerSupervisor (:one_for_one)
              Producer_1
              Producer_2
            ProcessorSupervisor (:one_for_all)
              Processor_1
              Processor_2
            BatcherPartitionSupervisor (:one_for_one)
              BatcherConsumerSuperv_1 (:rest_for_one)
                Batcher
                ConsumerSupervisor (:one_for_all)
                  Consumer_1
                  Consumer_2
              BatcherConsumerSuperv_2 (:rest_for_one)
                Batcher
                ConsumerSupervisor (:one_for_all)
                  Consumer_1
            Terminator

    The frames differ only in detail: slide 35 omits Producer_1, slide 38 omits Producer_2, and slides 40 to 43 and 48 to 50 add the annotation ( max_restarts = 0 ).
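
The `:rest_for_one` strategy used by the top-level pipeline supervisor can be tried out with the standard library alone. In the sketch below, three placeholder children are started in pipeline order (an assumption; the slides do not state the child order), one is killed, and the pids before and after are compared. `RestForOneDemo` and its functions are hypothetical names, and the Agents merely stand in for real stages.

```elixir
defmodule RestForOneDemo do
  # Starts three placeholder children in pipeline order under one supervisor.
  def start do
    children = for id <- [:producers, :processors, :batchers], do: stage(id)
    Supervisor.start_link(children, strategy: :rest_for_one)
  end

  # Returns a map of child id => current pid.
  def pids(sup) do
    for {id, pid, _type, _modules} <- Supervisor.which_children(sup), into: %{}, do: {id, pid}
  end

  # Kills one child and returns the pids seen before and after the restart.
  def kill_and_wait(sup, id) do
    before = pids(sup)
    Process.exit(before[id], :kill)
    :ok = wait_for_restart(sup, id, before[id], 100)
    {before, pids(sup)}
  end

  # Hypothetical stand-in for a stage: an Agent that only stores its own id.
  defp stage(id), do: Supervisor.child_spec({Agent, fn -> id end}, id: id)

  # Polls until the supervisor reports a new pid for the killed child.
  defp wait_for_restart(_sup, _id, _old_pid, 0), do: {:error, :not_restarted}

  defp wait_for_restart(sup, id, old_pid, attempts) do
    case pids(sup)[id] do
      pid when is_pid(pid) and pid != old_pid ->
        :ok

      _ ->
        Process.sleep(10)
        wait_for_restart(sup, id, old_pid, attempts - 1)
    end
  end
end
```

Killing `:processors` leaves `:producers` with its original pid, while `:processors` and every child started after it (`:batchers`) come back with new pids. With `:one_for_all`, all three would restart, and with `:one_for_one` only the killed child would.
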
  51. Complex GenStage pipelines • How to define the right topology for the pipeline? • How to structure supervision trees correctly? • How to handle graceful shutdown without data loss?
  53. Graceful Shutdown • Step 1: Tell processors to stop subscribing to producers (diagram labels: Producers, Processors, Batchers, Batch Processors, Messages, Terminator)
  56. Graceful Shutdown • Step 2: Cancel and shut down producers • Tell producers to: stop receiving messages; stop accepting demand; flush all events in the buffer; cancel consumers (diagram labels: Producers, Processors, Batchers, Batch Processors, Messages, Terminator)
  57. Graceful Shutdown • Step 3: Monitor consumers and wait until they’re dead • Terminator: monitor and wait (slides 57 to 60 animate one diagram; the stage labels shrink from Processors, Batchers, Batch Processors on slides 57 and 58, to Batchers, Batch Processors on slide 59, to Batch Processors on slide 60)
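
Step 3 can be sketched with plain process monitors. The module below is an illustration under assumptions, not Broadway's Terminator code: `TerminatorSketch` and `await_all/2` are hypothetical names, and the consumers in the usage note are throwaway processes that simply finish after a short sleep.

```elixir
defmodule TerminatorSketch do
  # Monitors every pid and blocks until each one has reported :DOWN.
  # The timeout applies to each monitored process in turn, not to the total.
  def await_all(pids, timeout \\ 5_000) do
    refs = Enum.map(pids, &Process.monitor/1)

    Enum.each(refs, fn ref ->
      receive do
        {:DOWN, ^ref, :process, _pid, _reason} -> :ok
      after
        timeout -> exit(:timeout)
      end
    end)
  end
end
```

For example, `TerminatorSketch.await_all(for ms <- [10, 30, 20], do: spawn(fn -> Process.sleep(ms) end))` returns `:ok` only once all three processes have exited. Waiting on the last stages of the pipeline in this way is what lets in-flight messages finish before the supervision tree is torn down.
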
  61. Complex GenStage pipelines • How to define the right topology for the pipeline? • How to structure supervision trees correctly? • How to handle graceful shutdown without data loss?