Slide 1

@elixirlang / elixir-lang.org

Slide 2

GenStage & Flow github.com/elixir-lang/gen_stage

Slide 3

Prelude: From eager,
 to lazy,
 to concurrent,
 to distributed

Slide 4

Example problem: word counting

Slide 5

Input:  "roses are red\n violets are blue\n"
Output: %{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Slide 6

Eager

Slide 7

File.read!("source")
#=> "roses are red\n violets are blue\n"

Slide 8

File.read!("source")
|> String.split("\n")
#=> ["roses are red", "violets are blue"]

Slide 9

File.read!("source")
|> String.split("\n")
|> Enum.flat_map(&String.split/1)
#=> ["roses", "are", "red", "violets", "are", "blue"]

Slide 10

File.read!("source")
|> String.split("\n")
|> Enum.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)
#=> %{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Slide 11

Eager
• Simple
• Efficient for small collections
• Inefficient for large collections with multiple passes
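
As a small illustration of the last point (not from the slides): each eager `Enum` step materializes a full intermediate collection before the next step runs.

```elixir
# Each Enum call below builds a complete intermediate list in memory
# before the next call starts -- fine for small inputs, wasteful at scale.
[1, 2, 3]
|> Enum.map(&(&1 * 2))   # materializes [2, 4, 6]
|> Enum.sum()
#=> 12
```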

Slide 12

File.read!("really large file")
|> String.split("\n")
|> Enum.flat_map(&String.split/1)

Slide 13

Lazy

Slide 14

File.stream!("source", :line)
#=> #Stream<...>

Slide 15

File.stream!("source", :line)
|> Stream.flat_map(&String.split/1)
#=> #Stream<...>

Slide 16

File.stream!("source", :line)
|> Stream.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)
#=> %{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Slide 17

Lazy
• Folds computations, goes item by item
• Less memory usage at the cost of computation
• Allows us to work with large or infinite collections
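
A hedged sketch (not from the slides) of the last bullet: laziness lets us take from an infinite collection, because nothing runs until `Enum` forces it.

```elixir
# Stream.iterate/2 describes 1, 2, 3, ... without computing anything;
# Enum.take/2 pulls exactly five items through the lazy pipeline.
Stream.iterate(1, &(&1 + 1))
|> Stream.map(&(&1 * &1))
|> Enum.take(5)
#=> [1, 4, 9, 16, 25]
```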

Slide 18

Concurrent

Slide 19

File.stream!("source", :line)
#=> #Stream<...>

Slide 20

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)
#=> #Flow<...>

Slide 21

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)
|> Enum.into(%{})
#=> %{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Slide 22

Flow
• We give up ordering and process locality for concurrency
• Tools for working with bounded and unbounded data

Slide 23

Flow
• It is not magic! There is an overhead when data flows through processes
• Requires volume and/or CPU/IO-bound work to see benefits

Slide 24

Flow stats
• ~1200 lines of code (LOC)
• ~1300 lines of documentation (LOD)

Slide 25

Topics
• 1200 LOC: How is Flow implemented?
• 1300 LOD: How to reason about flows?

Slide 26

GenStage

Slide 27

GenStage
• It is a new behaviour
• Exchanges data between stages transparently with back-pressure
• Breaks into producers, consumers and producer_consumers
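
The slides show a producer and a consumer later on; as a hedged sketch of the third kind of stage, a producer_consumer both receives events from upstream and emits events downstream. The `Doubler` module here is hypothetical, not from the talk:

```elixir
defmodule Doubler do
  use GenStage

  def init(:ok) do
    # No meaningful state is needed for this sketch.
    {:producer_consumer, :no_state}
  end

  # Events arrive from the upstream producer; the list in the second
  # tuple element is emitted to downstream consumers.
  def handle_events(events, _from, state) do
    {:noreply, Enum.map(events, &(&1 * 2)), state}
  end
end
```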

Slide 28

GenStage topology diagram: producers feed producer_consumers, which feed consumers; stages can fan out and fan in.

Slide 29

GenStage: Demand-driven
A (Producer) and B (Consumer):
1. consumer subscribes to producer
2. consumer sends demand (asks 10)
3. producer sends events (sends max 10)

Slide 30

GenStage: Demand-driven
A, B and C in a pipeline: each stage asks its upstream for 10, and each upstream sends at most 10.

Slide 31

GenStage: Demand-driven
• It is a message contract
• It pushes back-pressure to the boundary
• GenStage is one implementation of this contract

Slide 32

GenStage example

Slide 33

A: Counter (Producer), B: Printer (Consumer)

Slide 34

defmodule Producer do
  use GenStage

  def init(counter) do
    {:producer, counter}
  end

  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..counter+demand-1)
    {:noreply, events, counter + demand}
  end
end

Slide 35

state | demand | handle_demand returns
0     | 10     | {:noreply, [0, 1, …, 9], 10}
10    | 5      | {:noreply, [10, 11, 12, 13, 14], 15}
15    | 5      | {:noreply, [15, 16, 17, 18, 19], 20}
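
The arithmetic in this table can be checked with a plain anonymous function mirroring handle_demand/2 (a sketch outside any GenStage process):

```elixir
# Mirrors the Producer's handle_demand/2: emit `demand` consecutive
# integers starting at `counter`, and advance the counter.
handle_demand = fn demand, counter ->
  events = Enum.to_list(counter..counter + demand - 1)
  {:noreply, events, counter + demand}
end

handle_demand.(10, 0)   #=> {:noreply, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 10}
handle_demand.(5, 10)   #=> {:noreply, [10, 11, 12, 13, 14], 15}
```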

Slide 36

defmodule Consumer do
  use GenStage

  def init(:ok) do
    {:consumer, :the_state_does_not_matter}
  end

  def handle_events(events, _from, state) do
    Process.sleep(1000)
    IO.inspect(events)
    {:noreply, [], state}
  end
end

Slide 37

{:ok, counter} = GenStage.start_link(Producer, 0)
{:ok, printer} = GenStage.start_link(Consumer, :ok)
GenStage.sync_subscribe(printer, to: counter)

(wait 1 second)
[0, 1, 2, ..., 499]       (500 events)
(wait 1 second)
[500, 501, 502, ..., 999] (500 events)

Slide 38

Subscribe options
• max_demand: the maximum number of events to ask for (default 1000)
• min_demand: when reached, ask for more events (default 500)
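
These options are passed at subscription time. A sketch reusing the Producer and Consumer modules from the earlier slides (the demand values here are illustrative):

```elixir
{:ok, counter} = GenStage.start_link(Producer, 0)
{:ok, printer} = GenStage.start_link(Consumer, :ok)

# Ask for at most 10 events at a time; re-ask once 5 have been processed.
GenStage.sync_subscribe(printer, to: counter, max_demand: 10, min_demand: 5)
```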

Slide 39

max_demand: 10, min_demand: 5
1. the consumer asks for 10 items
2. the consumer receives 10 items
3. the consumer processes 5 of the 10 items
4. the consumer asks for 5 more
5. the consumer processes the remaining 5

Slide 40

max_demand: 10, min_demand: 0
1. the consumer asks for 10 items
2. the consumer receives 10 items
3. the consumer processes all 10 items
4. the consumer asks for 10 more
5. the consumer waits

Slide 41

GenStage dispatchers

Slide 42

Dispatchers
• Per producer
• Effectively receive the demand and send data
• Allow a producer to dispatch to multiple consumers at once

Slide 43

DemandDispatcher: events 1,2,3,4,5 are spread across the consumers according to their pending demand (e.g. 1 / 2,5 / 3 / 4 across four consumers).

Slide 44

BroadcastDispatcher: events 1,2,3 are sent to every consumer.

Slide 45

PartitionDispatcher with rem(event, 4): events 1,2,3,4,5,6 are routed by partition (1,5 / 2,6 / 3 / 4).
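
A hedged sketch of how a producer could configure the routing above (the `PartitionedProducer` module is hypothetical): the :hash function must return an {event, partition} tuple.

```elixir
defmodule PartitionedProducer do
  use GenStage

  def init(:ok) do
    # Route each event to one of four partitions by rem(event, 4).
    {:producer, :no_state,
     dispatcher:
       {GenStage.PartitionDispatcher,
        partitions: 0..3,
        hash: fn event -> {event, rem(event, 4)} end}}
  end

  def handle_demand(_demand, state), do: {:noreply, [], state}
end
```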

Slide 46

Validating GenStage

Slide 47

GenStage goals
• Support generic producers
• Replace GenEvent
• Introduce DynamicSupervisor

Slide 48

http://bit.ly/genstage

Slide 49

Topics
• 1200 LOC: How is Flow implemented? Its core is an 80 LOC stage
• 1300 LOD: How to reason about flows?

Slide 50

Flow

Slide 51

Input:  "roses are red\n violets are blue\n"
Output: %{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Slide 52

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)
#=> #Flow<...>

Slide 53

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)
|> Enum.into(%{})
#=> %{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Slide 54

File.stream!("source", :line)
|> Flow.from_enumerable()   # Producer

Slide 55

File.stream!("source", :line)
|> Flow.from_enumerable()           # Producer (DemandDispatcher)
|> Flow.flat_map(&String.split/1)   # Stages 1-4

Slide 56

Diagram: the Producer sends "roses are red" and "violets are blue" to the stages, which split them into "roses", "are", "red" and "violets", "are", "blue".

Slide 57

File.stream!("source", :line)
|> Flow.from_enumerable()           # Producer
|> Flow.flat_map(&String.split/1)   # Stages 1-4

Slide 58

File.stream!("source", :line)
|> Flow.from_enumerable()           # Producer
|> Flow.flat_map(&String.split/1)
|> Flow.reduce(fn -> %{} end, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)   # Stages 1-4

Slide 59

Diagram: one stage reduces "roses are red" into %{"are" => 1, "red" => 1, "roses" => 1}; another reduces "violets are blue" into %{"are" => 1, "blue" => 1, "violets" => 1}.

Slide 60

Diagram: the stages now hold %{"are" => 1, "red" => 1, "roses" => 1} and %{"are" => 1, "blue" => 1, "violets" => 1}; the two counts for "are" never meet in a single map.

Slide 61

File.stream!("source", :line)
|> Flow.from_enumerable()           # Producer
|> Flow.flat_map(&String.split/1)   # Stages 1-4
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)   # Stages A-D

Slide 62

Diagram: "roses are red" flows through Stage 1 and is partitioned by word; one reducer stage accumulates %{"are" => 1, "red" => 1}, another accumulates %{"roses" => 1}.

Slide 63

Diagram: after "violets are blue" is processed too, the reducer stage holding "are" now has %{"are" => 2, "red" => 1}; the same word always lands on the same partition.

Slide 64

Diagram: the Producer uses a DemandDispatcher to feed Mappers 1-4, which use a PartitionDispatcher to feed Reducers A-D.

Slide 65

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)
|> Enum.into(%{})

• reduce/3 collects all data into maps
• when it is done, the maps are streamed
• into/2 collects the state into a map

Slide 66

Windows and triggers
• If reduce/3 runs until all data is processed… what happens on unbounded data?
• See the Flow.Window documentation.
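
A hedged sketch of one answer from the Flow.Window docs (API details may differ across versions; `unbounded_source` is a hypothetical name): a global window with a trigger emits the intermediate reduce state every N events instead of waiting for the input to end.

```elixir
# Trigger the reduction state every 10 events, so an unbounded input
# still produces output periodically.
window = Flow.Window.global() |> Flow.Window.trigger_every(10)

unbounded_source
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition(window: window)
|> Flow.reduce(fn -> %{} end, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)
```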

Slide 67

Postlude: From eager,
 to lazy,
 to concurrent,
 to distributed

Slide 68

Enum (eager)

File.read!("source")
|> String.split("\n")
|> Enum.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)

Slide 69

Stream (lazy)

File.stream!("source", :line)
|> Stream.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)

Slide 70

Flow (concurrent)

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map -> Map.update(map, word, 1, & &1 + 1) end)
|> Enum.into(%{})

Slide 71

Flow features
• Provides map and reduce operations, partitions, flow merge, flow join
• Configurable batch size (max & min demand)
• Data windowing with triggers and watermarks

Slide 72

Distributed?
• The Flow API has feature parity with frameworks like Apache Spark
• However, it provides no distribution and no execution guarantees

Slide 73

CHEN, Y., ALSPAUGH, S., AND KATZ, R. Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads.

“small inputs are common in practice: 40–80% of Cloudera customers’ MapReduce jobs and 70% of jobs in a Facebook trace have ≤ 1GB of input”

Slide 74

GOG, I., SCHWARZKOPF, M., CROOKS, N., GROSVENOR, M. P., CLEMENT, A., AND HAND, S. Musketeer: all for one, one for all in data processing systems.

“For between 40-80% of the jobs submitted to MapReduce systems, you’d be better off just running them on a single machine”

Slide 75

Distributed?
• Single machine matters - try it!
• The gap between concurrent and distributed in Elixir is small
• Durability concerns will be tackled next

Slide 76

Thank-yous

Slide 77

Library authors
• Give GenStage a try
• Build your own producers
• TIP: use @behaviour GenStage if you want to make it an optional dep

Slide 78

Inspirations
• Akka Streams - back-pressure contract
• Apache Spark - map reduce API
• Apache Beam - windowing model
• Microsoft Naiad - stage notifications

Slide 79

The Team
• The Elixir Core Team
• Especially James Fish (fishcakez)
• And Eric and Chris for the therapy sessions

Slide 80

Built and designed at a consulting and software engineering company

Slide 81

Consulting and software engineering
• Elixir coaching
• Elixir design review
• Custom development

Slide 82

@elixirlang / elixir-lang.org