Flow
• We give up ordering and process
locality for concurrency
• Tools for working with bounded and
unbounded data
Flow
• It is not magic! There is an overhead
when data flows through processes
• Requires volume and/or CPU- or IO-bound
work to see benefits
Flow stats
• ~1200 lines of code (LOC)
• ~1300 lines of documentation (LOD)
Topics
• 1200LOC: How is flow implemented?
• 1300LOD: How to reason about flows?
GenStage
GenStage
• It is a new behaviour
• Exchanges data between stages
transparently with back-pressure
• Breaks into producers, consumers and
producer_consumers
GenStage: Demand-driven
[Diagram: A (Producer) and B (Consumer). B subscribes and asks for 10; A sends at most 10.]
1. consumer subscribes to producer
2. consumer sends demand
3. producer sends events
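The three steps above can be sketched with plain processes and no libraries. This is only an illustration of the demand-driven idea: the message shapes (`{:subscribe, pid}`, `{:ask, pid, n}`, `{:events, list}`) and the module name `DemandSketch` are made up for this sketch and are not GenStage's actual protocol.

```elixir
# Hypothetical message shapes, NOT GenStage's real wire protocol:
#   {:subscribe, consumer_pid}, {:ask, consumer_pid, n}, {:events, list}
defmodule DemandSketch do
  # Producer: counts upward and only emits as many events as were asked for.
  def producer(counter \\ 0) do
    receive do
      {:subscribe, consumer} ->
        send(consumer, :subscribed)
        producer(counter)

      {:ask, consumer, demand} when demand > 0 ->
        events = Enum.to_list(counter..(counter + demand - 1))
        send(consumer, {:events, events})
        producer(counter + demand)
    end
  end

  # Consumer: 1. subscribe, 2. send demand, 3. wait for the events.
  def consume(producer, demand) do
    send(producer, {:subscribe, self()})

    receive do
      :subscribed -> :ok
    end

    send(producer, {:ask, self(), demand})

    receive do
      {:events, events} -> events
    end
  end
end

producer = spawn(fn -> DemandSketch.producer() end)
first = DemandSketch.consume(producer, 10)
IO.inspect(first)
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because the producer never emits more than was asked for, a slow consumer throttles the producer simply by asking less often.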
GenStage: Demand-driven
[Diagram: stages A, B and C. Demand ("Asks 10") flows upstream; events ("Sends max 10") flow downstream.]
GenStage: Demand-driven
• It is a message contract
• It pushes back-pressure to the boundary
• GenStage is one implementation of this contract
GenStage
example
[Diagram: A, the Counter (Producer), sends events to B, the Printer (Consumer).]
defmodule Producer do
  use GenStage

  # The state is the next number to emit.
  def init(counter) do
    {:producer, counter}
  end

  # Emit exactly as many events as the consumer asked for.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Consumer do
  use GenStage

  def init(:ok) do
    {:consumer, :the_state_does_not_matter}
  end

  # Simulate slow work, then print the batch. Consumers emit no events.
  def handle_events(events, _from, state) do
    Process.sleep(1000)
    IO.inspect(events)
    {:noreply, [], state}
  end
end
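To see the two stages exchange data they have to be started and connected. The sketch below is one way to do that as a standalone script; it assumes Elixir 1.12+ (for `Mix.install`) and network access to fetch `gen_stage`. The modules are renamed `Counter` and `Printer` after the diagram, and the `notify` pid passed to the printer is a hypothetical addition so the script can observe the first batch.

```elixir
# Assumes Elixir 1.12+ (for Mix.install) and network access to fetch gen_stage.
Mix.install([{:gen_stage, "~> 1.0"}])

defmodule Counter do
  use GenStage

  def init(counter), do: {:producer, counter}

  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Printer do
  use GenStage

  # `notify` is a hypothetical addition: a pid that gets a copy of each batch,
  # so the script can tell that events arrived.
  def init(notify), do: {:consumer, notify}

  def handle_events(events, _from, notify) do
    IO.inspect(events, limit: 5, label: "printer received")
    send(notify, {:batch, events})
    {:noreply, [], notify}
  end
end

{:ok, counter} = GenStage.start_link(Counter, 0)
{:ok, printer} = GenStage.start_link(Printer, self())

# Nothing flows until the consumer subscribes and sends demand upstream.
{:ok, _subscription} = GenStage.sync_subscribe(printer, to: counter)

first_batch =
  receive do
    {:batch, events} -> events
  after
    5_000 -> raise "no events received within 5 seconds"
  end

# Stop the consumer so the endless counter stops being asked for more.
GenStage.stop(printer)
```

The producer never runs on its own: `handle_demand/2` is only called once the subscription is in place and the printer has asked for events.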
Subscribe options
• max_demand: the maximum number of
events to ask for (default 1000)
• min_demand: once pending events fall to
this level, ask for more (default 500)
max_demand: 10, min_demand: 5
1. the consumer asks for 10 items
2. the consumer receives 10 items
3. the consumer processes 5 of 10 items
4. the consumer asks for 5 more
5. the consumer processes the remaining 5
max_demand: 10, min_demand: 0
1. the consumer asks for 10 items
2. the consumer receives 10 items
3. the consumer processes 10 items
4. the consumer asks for 10 more
5. the consumer waits
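The two walk-throughs above can be observed by passing the options to the subscription and recording how many events each `handle_events/3` call receives. This is a sketch under assumptions: Elixir 1.12+ with network access, a hypothetical `BatchProbe` consumer and `DemandDemo` helper, and a finite producer built with `GenStage.from_enumerable/1`. The exact batch sizes are left for the reader to inspect rather than stated here.

```elixir
# Assumes Elixir 1.12+ (for Mix.install) and network access to fetch gen_stage.
Mix.install([{:gen_stage, "~> 1.0"}])

defmodule BatchProbe do
  use GenStage

  # Hypothetical helper consumer: forwards every batch to `notify`.
  def init(notify), do: {:consumer, notify}

  def handle_events(events, _from, notify) do
    send(notify, {:batch, events})
    {:noreply, [], notify}
  end
end

defmodule DemandDemo do
  # Feeds 30 events through one subscription and returns the size of each
  # batch the consumer saw, in arrival order.
  def batch_sizes(subscribe_opts) do
    {:ok, producer} = GenStage.from_enumerable(1..30)
    {:ok, probe} = GenStage.start_link(BatchProbe, self())
    ref = Process.monitor(probe)
    {:ok, _tag} = GenStage.sync_subscribe(probe, [to: producer] ++ subscribe_opts)
    collect(ref, [])
  end

  # The finite producer exits normally once drained and the probe follows it
  # (default cancel: :permanent), so the :DOWN message marks the end.
  defp collect(ref, sizes) do
    receive do
      {:batch, events} -> collect(ref, [length(events) | sizes])
      {:DOWN, ^ref, :process, _pid, _reason} -> Enum.reverse(sizes)
    after
      5_000 -> raise "timed out waiting for events"
    end
  end
end

with_min_5 = DemandDemo.batch_sizes(max_demand: 10, min_demand: 5)
with_min_0 = DemandDemo.batch_sizes(max_demand: 10, min_demand: 0)
IO.inspect(with_min_5, label: "max 10, min 5")
IO.inspect(with_min_0, label: "max 10, min 0")
```

In both runs no batch can exceed `max_demand`, because the producer is never asked for more than that at once. A higher `min_demand` keeps the consumer topped up while it works; `min_demand: 0` makes it drain everything before asking again.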
GenStage
dispatchers
Dispatchers
• Per producer
• Effectively receive the demand and
send data
• Allow a producer to dispatch to
multiple consumers at once
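A minimal sketch of a producer dispatching to two consumers follows. It assumes Elixir 1.12+ with network access; `BoundedCounter`, `Worker` and `Gather` are hypothetical names. The dispatcher is set in the producer's `init/1`. `GenStage.DemandDispatcher` is already the default and is spelled out only to show where the option goes, and `demand: :accumulate` is used so that no events are emitted before both consumers have subscribed.

```elixir
# Assumes Elixir 1.12+ (for Mix.install) and network access to fetch gen_stage.
Mix.install([{:gen_stage, "~> 1.0"}])

defmodule BoundedCounter do
  use GenStage

  # The dispatcher is chosen per producer, here in init/1.
  def init(limit) do
    {:producer, {0, limit}, dispatcher: GenStage.DemandDispatcher, demand: :accumulate}
  end

  # Emits at most `demand` numbers and never goes past `limit`.
  def handle_demand(demand, {next, limit}) when demand > 0 do
    stop = min(next + demand, limit)
    {:noreply, Enum.to_list(next..(stop - 1)//1), {stop, limit}}
  end
end

defmodule Worker do
  use GenStage

  def init({name, notify}), do: {:consumer, {name, notify}}

  # Reports which consumer handled which events.
  def handle_events(events, _from, {name, notify} = state) do
    send(notify, {:got, name, events})
    {:noreply, [], state}
  end
end

defmodule Gather do
  # Collects {consumer_name, event} pairs until `expected` events have arrived.
  def events(expected, acc \\ []) do
    if length(acc) >= expected do
      acc
    else
      receive do
        {:got, name, events} -> events(expected, acc ++ Enum.map(events, &{name, &1}))
      after
        5_000 -> raise "timed out waiting for events"
      end
    end
  end
end

{:ok, producer} = GenStage.start_link(BoundedCounter, 20)

for name <- [:b, :c] do
  {:ok, worker} = GenStage.start_link(Worker, {name, self()})
  {:ok, _tag} = GenStage.sync_subscribe(worker, to: producer, max_demand: 4, min_demand: 2)
end

# Both consumers are subscribed; release the accumulated demand.
GenStage.demand(producer, :forward)

received = Gather.events(20)
IO.inspect(Enum.group_by(received, &elem(&1, 0), &elem(&1, 1)))
```

With `DemandDispatcher` each event goes to exactly one consumer. Swapping in `GenStage.BroadcastDispatcher` would instead send every event to every consumer.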
File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)
[Diagram: Producer -> Stage 1, Stage 2, Stage 3, Stage 4 -> Stage A, Stage B, Stage C, Stage D]
[Diagram: Producer -> Stages 1-4 -> Stages A-D. After the line "roses are red" is split and partitioned, one stage holds %{"are" => 1, "red" => 1} and another holds %{"roses" => 1}.]
[Diagram: Producer -> Stages 1-4 -> Stages A-D. With "violets are blue" flowing in after "roses are red", one stage now holds %{"are" => 2, "red" => 1} and another still holds %{"roses" => 1}.]
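[Diagram: Producer -> Mapper 1, Mapper 2, Mapper 3, Mapper 4 -> Reducer A, Reducer B, Reducer C, Reducer D. DemandDispatcher sits between the producer and the mappers; PartitionDispatcher sits between the mappers and the reducers.]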
File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)
|> Enum.into(%{})
• reduce/3 collects all data into maps
• when it is done, the maps are streamed
• into/2 collects the state into a map
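The same pipeline can be tried without a file by feeding it an in-memory list. This sketch assumes Elixir 1.12+ with network access to fetch `flow`; the two input lines are the ones from the diagrams, used here in place of `File.stream!/2`.

```elixir
# Assumes Elixir 1.12+ (for Mix.install) and network access to fetch flow.
Mix.install([{:flow, "~> 1.2"}])

# In-memory stand-in for File.stream!("source", :line).
lines = ["roses are red", "violets are blue"]

counts =
  lines
  |> Flow.from_enumerable()
  |> Flow.flat_map(&String.split/1)
  # Route each word to a fixed reducer stage, so one stage owns each word.
  |> Flow.partition()
  |> Flow.reduce(fn -> %{} end, fn word, map ->
    Map.update(map, word, 1, &(&1 + 1))
  end)
  # Each reducer streams its map entries once done; merge them into one map.
  |> Enum.into(%{})

IO.inspect(counts)
# → %{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}
```

Partitioning is what makes the final merge safe: a given word is only ever counted by one reducer, so no two partial maps share a key.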
Windows and triggers
• If reduce/3 runs until all data is processed…
what happens on unbounded data?
• See Flow.Window documentation.
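As a rough illustration of a trigger, the sketch below emits the running sum every 10 events instead of waiting for all the input. It assumes Elixir 1.12+ with network access and the Flow 1.x API (`Flow.Window.trigger_every/2` and `Flow.on_trigger/2`); older releases used different function signatures, so treat the calls as tied to that version.

```elixir
# Assumes Elixir 1.12+ (for Mix.install) and network access to fetch flow.
Mix.install([{:flow, "~> 1.2"}])

# A single global window that fires a trigger after every 10 events.
window = Flow.Window.global() |> Flow.Window.trigger_every(10)

sums =
  Flow.from_enumerable(1..100)
  # One stage keeps all events in a single accumulator for this demo.
  |> Flow.partition(window: window, stages: 1)
  |> Flow.reduce(fn -> 0 end, &(&1 + &2))
  # On each trigger, emit the current sum and keep accumulating.
  |> Flow.on_trigger(fn sum -> {[sum], sum} end)
  |> Enum.to_list()

IO.inspect(sums)
```

Each trigger emits a snapshot of the accumulator, so a consumer of unbounded data gets periodic results rather than none at all. The final element is the total once the input is exhausted.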
Postlude:
From eager,
to lazy,
to concurrent,
to distributed
Flow features
• Provides map and reduce operations,
partitions, flow merge, flow join
• Configurable batch size (max & min demand)
• Data windowing with triggers and
watermarks
Distributed?
• Flow API has feature parity with
frameworks like Apache Spark
• However, there are no distribution or
execution guarantees
CHEN, Y., ALSPAUGH, S., AND KATZ, R.
Interactive analytical processing in big data systems:
a cross-industry study of MapReduce workloads
“small inputs are common in practice: 40–80% of
Cloudera customers’ MapReduce jobs and 70% of
jobs in a Facebook trace have ≤ 1GB of input”
GOG, I., SCHWARZKOPF, M., CROOKS, N.,
GROSVENOR, M. P., CLEMENT, A., AND HAND, S.
Musketeer: all for one, one for all in data
processing systems.
“For between 40-80% of the jobs submitted
to MapReduce systems, you’d be better off
just running them on a single machine”
Distributed?
• Single machine matters - try it!
• The gap between concurrent and
distributed in Elixir is small
• Durability concerns will be tackled next
Thank-yous
Library authors
• Give GenStage a try
• Build your own producers
• TIP: use @behaviour GenStage if you
want to make it an optional dep
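One way to read the tip: declare the behaviour with `@behaviour GenStage` instead of `use GenStage`, so the module does not call into GenStage at compile time. The sketch below uses a hypothetical module name, `MyLib.CounterProducer`, and relies on the assumption that the compiler only warns, rather than fails, when the behaviour module is not available.

```elixir
defmodule MyLib.CounterProducer do
  # Declares the callbacks without `use GenStage`, so compiling this module
  # does not invoke any GenStage macro. Assumption: if GenStage is missing,
  # the compiler emits a warning about the unknown behaviour and carries on.
  @behaviour GenStage

  def init(counter), do: {:producer, counter}

  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

# The callbacks are plain functions and can be exercised without GenStage.
IO.inspect(MyLib.CounterProducer.handle_demand(3, 0))
# → {:noreply, [0, 1, 2], 3}
```

An application that does depend on GenStage could then start the module as a stage. Since `use GenStage` normally injects default callbacks, a module written this way may need to define those itself, which is worth checking against the GenStage documentation.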
Inspirations
• Akka Streams - back pressure contract
• Apache Spark - map reduce API
• Apache Beam - windowing model
• Microsoft Naiad - stage notifications
The Team
• The Elixir Core Team
• Especially James Fish (fishcakez)
• And Eric and Chris for therapy sessions
consulting and software engineering
Built and designed at
Elixir coaching
Elixir design review
Custom development