GenStage and Flow by @josevalim at ElixirConf

Plataformatec
September 02, 2016

Transcript

  1. @elixirlang / elixir-lang.org

  2. GenStage & Flow
    github.com/elixir-lang/gen_stage

  3. Prelude:
    From eager,
    to lazy,
    to concurrent,
    to distributed

  4. Example problem:
    word counting

  5. "roses are red\nviolets are blue\n"
    %{"are" => 2,
      "blue" => 1,
      "red" => 1,
      "roses" => 1,
      "violets" => 1}

  6. File.read!("source")
    "roses are red\nviolets are blue\n"

  7. File.read!("source")
    |> String.split("\n")
    ["roses are red",
    "violets are blue"]

  8. File.read!("source")
    |> String.split("\n")
    |> Enum.flat_map(&String.split/1)
    ["roses", "are", "red",
     "violets", "are", "blue"]

  9. File.read!("source")
    |> String.split("\n")
    |> Enum.flat_map(&String.split/1)
    |> Enum.reduce(%{}, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    %{"are" => 2,
      "blue" => 1,
      "red" => 1,
      "roses" => 1,
      "violets" => 1}

  10. Eager
    • Simple
    • Efficient for small collections
    • Inefficient for large collections with
    multiple passes

  11. File.read!("really large file")
    |> String.split("\n")
    |> Enum.flat_map(&String.split/1)

  12. File.stream!("source", :line)
    #Stream<...>

  13. File.stream!("source", :line)
    |> Stream.flat_map(&String.split/1)
    #Stream<...>

  14. File.stream!("source", :line)
    |> Stream.flat_map(&String.split/1)
    |> Enum.reduce(%{}, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    %{"are" => 2,
      "blue" => 1,
      "red" => 1,
      "roses" => 1,
      "violets" => 1}

  15. Lazy
    • Folds computations, goes item by item
    • Less memory usage at the cost of
    computation
    • Allows us to work with large or
    infinite collections
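
To make "goes item by item" concrete, here is a small sketch that is not from the talk. It runs the same two map steps through Enum and through Stream and records the order in which the callbacks fire. The module name LazyDemo and its trace/1 helper are illustrative assumptions.

```elixir
defmodule LazyDemo do
  # Runs two map steps with the given module (Enum or Stream) and returns
  # the order in which the callbacks were invoked. Each callback reports
  # itself by sending a message to the calling process.
  def trace(mod) do
    [1, 2]
    |> mod.map(fn x ->
      send(self(), {:double, x})
      x * 2
    end)
    |> mod.map(fn x ->
      send(self(), {:inc, x})
      x + 1
    end)
    |> Enum.to_list()

    drain([])
  end

  # Collects every message currently in the mailbox, oldest first.
  defp drain(acc) do
    receive do
      msg -> drain([msg | acc])
    after
      0 -> Enum.reverse(acc)
    end
  end
end

eager = LazyDemo.trace(Enum)
# eager: the first pass finishes before the second starts
# [double: 1, double: 2, inc: 2, inc: 4]

lazy = LazyDemo.trace(Stream)
# lazy: each item goes through both steps before the next item starts
# [double: 1, inc: 2, double: 2, inc: 4]
```

The eager version builds the intermediate list [2, 4] in full; the lazy version never materialises it, which is where the memory saving comes from.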

  16. File.stream!("source", :line)
    #Stream<...>

  17. File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    |> Flow.partition()
    |> Flow.reduce(fn -> %{} end, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    #Flow<...>

  18. File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    |> Flow.partition()
    |> Flow.reduce(fn -> %{} end, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    |> Enum.into(%{})
    %{"are" => 2,
      "blue" => 1,
      "red" => 1,
      "roses" => 1,
      "violets" => 1}

  19. Flow
    • We give up ordering and process
    locality for concurrency
    • Tools for working with bounded and
    unbounded data

  20. Flow
    • It is not magic! There is an overhead
    when data flows through processes
    • Requires volume and/or CPU- or IO-bound
    work to see benefits

  21. Flow stats
    • ~1200 lines of code (LOC)
    • ~1300 lines of documentation (LOD)

  22. Topics
    • 1200LOC: How is flow implemented?
    • 1300LOD: How to reason about flows?

  23. GenStage
    • It is a new behaviour
    • Exchanges data between stages
    transparently with back-pressure
    • Breaks into producers, consumers and
    producer_consumers

  24. GenStage
    [Diagram: five stages labelled producer, producer consumer (three times) and consumer]

  25. GenStage: Demand-driven
    [Diagram: A (Producer) and B (Consumer), with arrows labelled "Subscribes", "Asks 10" and "Sends max 10"]
    1. consumer subscribes to producer
    2. consumer sends demand
    3. producer sends events
  26. GenStage: Demand-driven
    [Diagram: stages A, B and C, joined by two arrows labelled "Asks 10" and two labelled "Sends max 10"]

  27. GenStage: Demand-driven
    • It is a message contract
    • It pushes back-pressure to the boundary
    • GenStage is one impl of this contract
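
The demand-driven idea can be sketched with plain processes and messages. This is a toy model written for this transcript, not GenStage's actual message protocol: the module DemandSketch and the {:ask, pid, demand} and {:events, list} message shapes are assumptions chosen for illustration.

```elixir
defmodule DemandSketch do
  # Hypothetical producer loop: it emits nothing until a consumer asks,
  # and it never emits more events than were asked for.
  def producer(counter) do
    receive do
      {:ask, consumer, demand} when demand > 0 ->
        events = Enum.to_list(counter..(counter + demand - 1))
        send(consumer, {:events, events})
        producer(counter + demand)
    end
  end

  # Consumer side: send demand, then block until the events arrive.
  def ask(producer, demand) do
    send(producer, {:ask, self(), demand})

    receive do
      {:events, events} -> events
    after
      1_000 -> raise "producer did not reply"
    end
  end
end

producer = spawn(fn -> DemandSketch.producer(0) end)

first = DemandSketch.ask(producer, 10)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

second = DemandSketch.ask(producer, 5)
# [10, 11, 12, 13, 14]
```

Because the producer only works when asked, a slow consumer slows the whole chain down instead of having its mailbox flooded.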

  28. GenStage
    example

  29. [Diagram: A, Counter (Producer), connected to B, Printer (Consumer)]

  30. defmodule Producer do
      use GenStage

      def init(counter) do
        {:producer, counter}
      end

      def handle_demand(demand, counter) when demand > 0 do
        events = Enum.to_list(counter..counter + demand - 1)
        {:noreply, events, counter + demand}
      end
    end

  31. state   demand   handle_demand
    0       10       {:noreply, [0, 1, …, 9], 10}
    10      5        {:noreply, [10, 11, 12, 13, 14], 15}
    15      5        {:noreply, [15, 16, 17, 18, 19], 20}
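
The table can be reproduced without starting any process, because handle_demand/2 is a pure function of the demand and the state. In the sketch below the callback body is copied into a plain module (the name CounterLogic is an assumption) so that it runs without the gen_stage dependency.

```elixir
defmodule CounterLogic do
  # Same body as Producer.handle_demand/2, minus `use GenStage`.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

# Feed the demands 10, 5, 5 through the callback, threading the state.
{rows, final_state} =
  Enum.map_reduce([10, 5, 5], 0, fn demand, counter ->
    {:noreply, events, next_counter} = CounterLogic.handle_demand(demand, counter)
    {{counter, demand, events}, next_counter}
  end)

# rows is a list of {state, demand, events} tuples, one per table row:
# [{0, 10, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]},
#  {10, 5, [10, 11, 12, 13, 14]},
#  {15, 5, [15, 16, 17, 18, 19]}]
# final_state is 20
```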

  32. defmodule Consumer do
      use GenStage

      def init(:ok) do
        {:consumer, :the_state_does_not_matter}
      end

      def handle_events(events, _from, state) do
        Process.sleep(1000)
        IO.inspect(events)
        {:noreply, [], state}
      end
    end

  33. {:ok, counter} = GenStage.start_link(Producer, 0)
    {:ok, printer} = GenStage.start_link(Consumer, :ok)
    GenStage.sync_subscribe(printer, to: counter)
    (wait 1 second)
    [0, 1, 2, ..., 499] (500 events)
    (wait 1 second)
    [500, 501, 502, ..., 999] (500 events)

  34. Subscribe options
    • max_demand: the maximum number
    of events to ask for (default 1000)
    • min_demand: when reached, ask for
    more events (default 500)

  35. max_demand: 10, min_demand: 5
    1. the consumer asks for 10 items
    2. the consumer receives 10 items
    3. the consumer processes 5 of 10 items
    4. the consumer asks for 5 more
    5. the consumer processes the remaining 5

  36. max_demand: 10, min_demand: 0
    1. the consumer asks for 10 items
    2. the consumer receives 10 items
    3. the consumer processes 10 items
    4. the consumer asks for 10 more
    5. the consumer waits
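
The two walkthroughs above can be captured in a small bookkeeping model. This is a simplified sketch of the rule as the slides describe it (top demand back up to max_demand once the outstanding demand drops to min_demand), not GenStage's internal implementation. The module DemandModel and its asks/3 function are hypothetical names.

```elixir
defmodule DemandModel do
  # Returns the sequence of "ask" sizes a consumer sends while it
  # processes `total_events` events one at a time.
  def asks(total_events, max_demand, min_demand) do
    # The consumer starts by asking for max_demand events.
    initial = {[max_demand], max_demand}

    {asks, _pending} =
      Enum.reduce(1..total_events, initial, fn _event, {asks, pending} ->
        # One event processed, so one fewer event is outstanding.
        pending = pending - 1

        if pending <= min_demand do
          # Threshold reached: ask for enough to get back to max_demand.
          {[max_demand - pending | asks], max_demand}
        else
          {asks, pending}
        end
      end)

    Enum.reverse(asks)
  end
end

with_min_5 = DemandModel.asks(10, 10, 5)
# [10, 5, 5]

with_min_0 = DemandModel.asks(10, 10, 0)
# [10, 10]
```

With min_demand: 5 the consumer asks again halfway through its batch, so new events can arrive while it is still busy. With min_demand: 0 it only asks once the batch is finished, which is why it then has to wait.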

  37. GenStage
    dispatchers

  38. Dispatchers
    • Per producer
    • Effectively receive the demand and
    send data
    • Allow a producer to dispatch to
    multiple consumers at once

  39. DemandDispatcher
    [Diagram: prod emits 1,2,3,4,5 to four consumers, which receive 1 / 2,5 / 3 / 4]

  40. BroadcastDispatcher
    [Diagram: prod emits 1,2,3 to four consumers, each of which receives 1,2,3]

  41. PartitionDispatcher
    rem(event, 4)
    [Diagram: prod emits 1,2,3,4,5,6 to four consumers, which receive 1,5 / 2,6 / 3 / 4]
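
The routing rule on the PartitionDispatcher slide can be checked with the standard library alone. The sketch below only models which partition each event lands in; it does not use GenStage.PartitionDispatcher itself, and the variable name by_partition is illustrative.

```elixir
# Group the events 1..6 by the partition that rem(event, 4) selects.
by_partition = Enum.group_by(1..6, fn event -> rem(event, 4) end)

# %{0 => [4], 1 => [1, 5], 2 => [2, 6], 3 => [3]}
```

Events 1 and 5 share a partition, as do 2 and 6, matching the slide. The same event value always maps to the same consumer, which is the property Flow relies on later for partitioned reduces.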

  42. Validating
    GenStage

  43. GenStage goals
    • Support generic producers
    • Replace GenEvent
    • Introduce DynamicSupervisor

  44. http://bit.ly/genstage

  45. Topics
    • 1200LOC: How is flow implemented?
    • Its core is an 80LOC stage
    • 1300LOD: How to reason about flows?

  46. "roses are red\nviolets are blue\n"
    %{"are" => 2,
      "blue" => 1,
      "red" => 1,
      "roses" => 1,
      "violets" => 1}

  47. File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    |> Flow.partition()
    |> Flow.reduce(fn -> %{} end, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    #Flow<...>

  48. File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    |> Flow.partition()
    |> Flow.reduce(fn -> %{} end, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    |> Enum.into(%{})
    %{"are" => 2,
      "blue" => 1,
      "red" => 1,
      "roses" => 1,
      "violets" => 1}

  49. File.stream!("source", :line)
    |> Flow.from_enumerable()
    Producer

  50. File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    Producer
    Stage 1 Stage 2 Stage 3 Stage 4
    DemandDispatcher

  51. Producer
    Stage 1 Stage 2 Stage 3 Stage 4
    [Diagram: one stage receives "roses are red" and emits "roses", "are", "red"; another receives "violets are blue" and emits "violets", "are", "blue"]

  52. File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    Producer
    Stage 1 Stage 2 Stage 3 Stage 4

  53. File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    |> Flow.reduce(fn -> %{} end, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    Producer
    Stage 1 Stage 2 Stage 3 Stage 4

  54. Producer
    Stage 1 Stage 2 Stage 3 Stage 4
    "roses are red"
    %{"are" => 1,
      "red" => 1,
      "roses" => 1}
    "violets are blue"
    %{"are" => 1,
      "blue" => 1,
      "violets" => 1}

  55. Producer
    Stage 1 Stage 2 Stage 3 Stage 4
    %{"are" => 1,
      "red" => 1,
      "roses" => 1}
    %{"are" => 1,
      "blue" => 1,
      "violets" => 1}
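
The two per-stage maps above show why reducing without a partition step is a problem: "are" is counted once in each map. The sketch below, using only the standard library, mimics what collecting both maps into a single map would do. The variable names are illustrative, and this is a model of the situation rather than Flow's own code.

```elixir
stage_1 = %{"are" => 1, "red" => 1, "roses" => 1}
stage_2 = %{"are" => 1, "blue" => 1, "violets" => 1}

# Collecting the entries of both maps into one map keeps only the last
# "are" entry, so one occurrence is silently lost.
collected = [stage_1, stage_2] |> Enum.concat() |> Enum.into(%{})
# collected["are"] is 1, although the input contains "are" twice

# Summing on key collisions gives the right count, but it requires an
# extra merge step after the concurrent work is done.
merged = Map.merge(stage_1, stage_2, fn _word, a, b -> a + b end)
# merged["are"] is 2
```

Routing every occurrence of a word to the same stage before reducing avoids the collision altogether, which is what the partition step on the following slides does.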

  56. File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    |> Flow.partition()
    |> Flow.reduce(fn -> %{} end, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    Producer
    Stage 1 Stage 2 Stage 3 Stage 4
    Stage A Stage B Stage C Stage D

  57. Producer
    Stage 1 Stage 2 Stage 3 Stage 4
    Stage A Stage B Stage C Stage D
    [Diagram: "roses are red" has been processed; one partitioned stage holds %{"are" => 1, "red" => 1} and another holds %{"roses" => 1}]

  58. Producer
    Stage 1 Stage 2 Stage 3 Stage 4
    Stage A Stage B Stage C Stage D
    [Diagram: "roses are red" and "violets are blue" are in flight; one partitioned stage now holds %{"are" => 2, "red" => 1} and another holds %{"roses" => 1}]

  59. [Diagram: Producer, then Mapper 1 to Mapper 4, then Reducer A to Reducer D; the two layers of connections are labelled DemandDispatcher and PartitionDispatcher]
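
The mapper/reducer layout can be approximated with tasks from the standard library. This is a sketch of the idea only, not how Flow is implemented: the module PartitionedCount, the use of :erlang.phash2/2 as the routing hash and the choice of four partitions are all assumptions made for illustration.

```elixir
defmodule PartitionedCount do
  @partitions 4

  def word_count(lines) do
    lines
    # Mapper work: split every line into words.
    |> Enum.flat_map(&String.split/1)
    # Routing: the same word always hashes to the same partition.
    |> Enum.group_by(&:erlang.phash2(&1, @partitions))
    # Reducer work: one concurrent task per non-empty partition.
    |> Enum.map(fn {_partition, words} ->
      Task.async(fn -> Enum.frequencies(words) end)
    end)
    # Each task returns a map; flatten their entries into one list.
    |> Enum.flat_map(&Task.await/1)
    # No word appears in two partitions, so nothing is overwritten here.
    |> Map.new()
  end
end

counts = PartitionedCount.word_count(["roses are red", "violets are blue"])
# %{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}
```

Because the partitions never share a key, the final step can simply collect entries without summing, which mirrors why the flow can end in Enum.into(%{}).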

  60. File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    |> Flow.partition()
    |> Flow.reduce(fn -> %{} end, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    |> Enum.into(%{})
    • reduce/3 collects all data into maps
    • when it is done, the maps are streamed
    • into/2 collects the state into a map

  61. Windows and triggers
    • If reduce/3 runs until all data is processed…
    what happens on unbounded data?
    • See Flow.Window documentation.

  62. Postlude:
    From eager,
    to lazy,
    to concurrent,
    to distributed

  63. Enum (eager)
    File.read!("source")
    |> String.split("\n")
    |> Enum.flat_map(&String.split/1)
    |> Enum.reduce(%{}, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)

  64. Stream (lazy)
    File.stream!("source", :line)
    |> Stream.flat_map(&String.split/1)
    |> Enum.reduce(%{}, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)

  65. Flow (concurrent)
    File.stream!("source", :line)
    |> Flow.from_enumerable()
    |> Flow.flat_map(&String.split/1)
    |> Flow.partition()
    |> Flow.reduce(fn -> %{} end, fn word, map ->
      Map.update(map, word, 1, & &1 + 1)
    end)
    |> Enum.into(%{})

  66. Flow features
    • Provides map and reduce operations,
    partitions, flow merge, flow join
    • Configurable batch size (max & min demand)
    • Data windowing with triggers and
    watermarks

  67. Distributed?
    • Flow API has feature parity with
    frameworks like Apache Spark
    • However, there are no distribution or
    execution guarantees

  68. CHEN, Y., ALSPAUGH, S., AND KATZ, R.
    Interactive analytical processing in big data systems:
    a cross-industry study of MapReduce workloads
    “small inputs are common in practice: 40–80% of
    Cloudera customers’ MapReduce jobs and 70% of
    jobs in a Facebook trace have ≤ 1GB of input”

  69. GOG, I., SCHWARZKOPF, M., CROOKS, N.,
    GROSVENOR, M. P., CLEMENT, A., AND HAND, S.
    Musketeer: all for one, one for all in data
    processing systems.
    “For between 40-80% of the jobs submitted
    to MapReduce systems, you’d be better off
    just running them on a single machine”

  70. Distributed?
    • Single machine matters - try it!
    • The gap between concurrent and
    distributed in Elixir is small
    • Durability concerns will be tackled next

  71. Library authors
    • Give GenStage a try
    • Build your own producers
    • TIP: use @behaviour GenStage if you
    want to make it an optional dep

  72. Inspirations
    • Akka Streams - back pressure contract
    • Apache Spark - map reduce API
    • Apache Beam - windowing model
    • Microsoft Naiad - stage notifications

  73. The Team
    • The Elixir Core Team
    • Especially James Fish (fishcakez)
    • And Eric and Chris for therapy sessions

  74. Built and designed at Plataformatec
    consulting and software engineering

  75. consulting and software engineering
    Elixir coaching
    Elixir design review
    Custom development

  76. @elixirlang / elixir-lang.org
