
MapReduce is dead! Long live Cloud Dataflow

This talk explains Google Cloud Dataflow, one of the most recently released products of the Google Cloud Platform. Dataflow is the result of many years of research on data-processing models done internally at Google, and it has the power to revolutionize how you process your own data.

We will cover the basics of MapReduce, then point out some of the limitations of the model, and show how Dataflow solves most of them.
The goal of the session is to give you an understanding of the possibilities the Google Cloud Platform offers in terms of Big Data processing. Google has been doing this for more than a decade, and we want everyone to share the experience.

Francesc Campoy Flores

May 06, 2015

Transcript

  1. About me
     • Developer Advocate
     • Gopher, but will write Java for food
     • I’m on: @francesc, google.com/+FrancescCampoyFlores, [email protected]

  2. History of Big Data at Google
     [timeline from 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Flume, MillWheel]

  3. MapReduce
     • processing and generating large data sets
     • a parallel, distributed algorithm
     • running on clusters of machines
     • three phases: map, shuffle, reduce (input → output)

  4. Simple example: word count
     • map: extract words from every line of text (lines of text → [word: 1])
     • shuffle (you don’t need to write this): group by word ([word: 1] → {word: [1, 1, 1]})
     • reduce: sum all the counts for every word ({word: [1, 1, 1]} → {word: n})

  5. Simple example: word count
     input:   hello world / cruel world
     map:     hello: 1, world: 1 / cruel: 1, world: 1
     shuffle: hello: [1], world: [1, 1], cruel: [1]
     reduce:  hello: 1, world: 2, cruel: 1

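The three phases of the word-count slide can be sketched in plain Java (a minimal illustration, not the Dataflow or MapReduce API; the class and method names are my own):

```java
import java.util.*;

// Minimal sketch of MapReduce word count:
// map emits (word, 1) pairs, shuffle groups them by key, reduce sums each group.
public class WordCount {
    // map: extract words from every line of text, emitting (word, 1)
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // shuffle: group all emitted values by word
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // reduce: sum all the counts for every word
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new HashMap<>();
        groups.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        // the slide's input: "hello world" / "cruel world"
        System.out.println(reduce(shuffle(map(List.of("hello world", "cruel world")))));
    }
}
```

In a real cluster each phase runs on many machines; here the same three functions run sequentially to make the data flow visible.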
  6. Shakespeare’s autocomplete with MapReduce
     input:   hello world / what world
     map:     hello: 1, world: 1 / what: 1, world: 1
     shuffle: hello: [1], world: [1, 1], what: [1]
     reduce:  hello: 1, world: 2, what: 1

  7. Shakespeare’s autocomplete with MapReduce
     map: emit every prefix of each counted word, keyed by prefix with the word and its count, e.g.
          h: (1, hello), he: (1, hello), …, hello: (1, hello)
          w: (1, what), wh: (1, what), …, what: (1, what)
          w: (2, world), wo: (2, world), …, world: (2, world)
     shuffle: group by prefix, so "w" gets both (1, what) and (2, world)
     reduce: keep the most frequent completion for each prefix, e.g. w: (2, world)

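The autocomplete map step described above can be sketched as a plain-Java helper (an assumed illustration, not code from the deck): for every (word, n) pair produced by word count, emit one entry per prefix, so that grouping by prefix lets a reducer later pick the best completion.

```java
import java.util.*;

// Sketch of the autocomplete map step: from word counts, build a table of
// prefix -> {candidate word -> count}. A reduce over each prefix's candidates
// would then keep the most frequent completion.
public class Autocomplete {
    static Map<String, Map<String, Integer>> byPrefix(Map<String, Integer> wordCounts) {
        Map<String, Map<String, Integer>> out = new HashMap<>();
        wordCounts.forEach((word, n) -> {
            for (int i = 1; i <= word.length(); i++) {
                String prefix = word.substring(0, i);
                out.computeIfAbsent(prefix, p -> new HashMap<>()).put(word, n);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // "w" collects both candidates: what: 1 and world: 2
        System.out.println(byPrefix(Map.of("hello", 1, "world", 2, "what", 1)).get("w"));
    }
}
```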
  8. Problems of MapReduce
     • Very simple concept. Too simple?
     • Tweaking necessary: how many machines for the map? how many machines for the reduce?
     • I don’t care about the infrastructure, I care about the results!

  9. History of Big Data at Google
     [timeline from 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Flume, MillWheel]

  10. BigQuery
      • Analytical database as a service
      • Understands SQL
      • Analyzes terabytes of data in seconds
      • Imports JSON, CSV, data streams
      • $0.02/GB/month storage; $5/TB of queried data

  11. Word count in BigQuery

      SELECT word
      FROM (
        SELECT SPLIT(
          REGEXP_REPLACE(line, '[^a-z^A-Z]*([a-zA-Z]+)[^a-z^A-Z]*', '\\1|'),
          '|') word
        FROM [shakespeare.lines]
      )
      WHERE LENGTH(line) > 0

  12. Word count in BigQuery

      SELECT word, count(*) as n
      FROM (
        SELECT SPLIT(
          REGEXP_REPLACE(line, '[^a-z^A-Z]*([a-zA-Z]+)[^a-z^A-Z]*', '\\1|'),
          '|') word
        FROM [shakespeare.lines]
      )
      WHERE LENGTH(line) > 0
      GROUP BY word

  13. Word count in BigQuery

      SELECT word, count(*) as n
      FROM (
        SELECT SPLIT(
          REGEXP_REPLACE(line, '[^a-z^A-Z]*([a-zA-Z]+)[^a-z^A-Z]*', '\\1|'),
          '|') word
        FROM [shakespeare.lines]
      )
      WHERE LENGTH(line) > 0
      GROUP BY word
      ORDER BY n DESC

  14. History of Big Data at Google
      [timeline from 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Flume, MillWheel]

  15. Associative reducers
      • A reducer is associative if (x ∗ y) ∗ z = x ∗ (y ∗ z) for all x, y, z in S.
      • In other words: you can reduce all the intermediary results in any order.
      • Word counting is associative.

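The point of the definition can be checked with a tiny sketch (my own illustration): because summing is associative, the partial counts produced by several reducers can be combined in any grouping and the total never changes.

```java
import java.util.List;

// Sketch: an associative reducer (sum) lets partial results from different
// machines be combined in any order.
public class Associativity {
    static int sum(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> ones = List.of(1, 1, 1, 1, 1, 1, 1, 1, 1);
        int total = sum(ones);                                // one big reduce
        int partials = sum(List.of(sum(ones.subList(0, 5)),   // two partial reduces
                                   sum(ones.subList(5, 9)))); // then one combine
        System.out.println(total == partials); // grouping does not matter
    }
}
```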
  16. Why is associativity important?
      [diagram: three mappers each emit a stream of 1s; two reducers independently sum their shares into partial results (5 and 4), which can later be combined into the final count]

  17. groupByKey
      in:  PCollection<Pair<K, V>>              // multimap
      out: PCollection<Pair<K, Collection<V>>>  // unimap
      example:
        in:  [“hello”, 1], [“hello”, 1]
        out: [“hello”, [1, 1]]

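A plain-Java sketch of what groupByKey does to the data (not the actual Dataflow/FlumeJava API, and the class name is my own):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of groupByKey: a multimap of (K, V) pairs in,
// one (K, all values for K) entry out.
public class GroupByKeyDemo {
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        // the slide's example: ["hello", 1], ["hello", 1] -> ["hello", [1, 1]]
        System.out.println(groupByKey(List.of(Map.entry("hello", 1), Map.entry("hello", 1))));
    }
}
```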
  18. combineValues (CombFn)
      in:  PCollection<Pair<K, Collection<V>>>  // unimap
      out: PCollection<Pair<K, V>>              // unimap
      example:
        in:  [“hello”, [1, 1, 1]]
        out: [“hello”, 3]

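The same idea in plain Java (a sketch under my own names, not the real CombFn interface): apply a combining function to every group, turning (K, [v1, v2, …]) entries into (K, combined) entries.

```java
import java.util.*;
import java.util.function.BinaryOperator;

// Sketch of combineValues: reduce each key's collection of values
// with a combining function such as Integer::sum.
public class CombineValuesDemo {
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> groups, BinaryOperator<V> fn) {
        Map<K, V> out = new HashMap<>();
        groups.forEach((k, vs) -> out.put(k, vs.stream().reduce(fn).orElseThrow()));
        return out;
    }

    public static void main(String[] args) {
        // the slide's example: ["hello", [1, 1, 1]] -> ["hello", 3]
        System.out.println(combineValues(Map.of("hello", List.of(1, 1, 1)), Integer::sum));
    }
}
```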
  19. flatten
      in:  Collection<PCollection<T>>
      out: PCollection<T>
      example:
        in:  {[“hello”, 1], [“world”, 1]}, {[“what”, 1], [“world”, 1]}
        out: [“hello”, 1], [“world”, 1], [“what”, 1], [“world”, 1]

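Flatten is simple concatenation, which a short plain-Java sketch (my own naming, not the Dataflow API) makes concrete: duplicates survive because no grouping happens here.

```java
import java.util.*;

// Sketch of flatten: merge several collections into a single collection,
// preserving every element, including duplicates.
public class FlattenDemo {
    static <T> List<T> flatten(List<List<T>> collections) {
        List<T> out = new ArrayList<>();
        for (List<T> c : collections) out.addAll(c);
        return out;
    }

    public static void main(String[] args) {
        // the slide's example: two collections of pairs become one
        System.out.println(flatten(List.of(
                List.of(Map.entry("hello", 1), Map.entry("world", 1)),
                List.of(Map.entry("what", 1), Map.entry("world", 1)))));
    }
}
```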
  20. count
      in:  PCollection<T>
      out: PCollection<Pair<T, Integer>>
      built by composition (PTable<K, V> == PCollection<Pair<K, V>>):
        parallelDo:    T → Pair<T, 1>
        groupByKey:    PTable<T, 1> → PTable<T, Collection<1>>
        combineValues: PTable<T, Collection<Integer>> → PTable<T, Integer>

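The composition on this slide can be followed step by step in plain Java (a sketch with my own names, not the real PCollection API): parallelDo emits (t, 1), groupByKey collects the 1s, and combineValues sums them.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of count as the composition parallelDo -> groupByKey -> combineValues.
public class CountDemo {
    static <T> Map<T, Integer> count(List<T> in) {
        // parallelDo: T -> (T, 1)
        List<Map.Entry<T, Integer>> pairs = in.stream()
                .map(t -> Map.entry(t, 1)).collect(Collectors.toList());
        // groupByKey: (T, 1) pairs -> (T, [1, 1, ...])
        Map<T, List<Integer>> groups = pairs.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        // combineValues: sum each group
        Map<T, Integer> out = new HashMap<>();
        groups.forEach((t, ones) -> out.put(t, ones.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("world", "hello", "world")));
    }
}
```

Because the combine step is an associative sum, an optimizer is free to run it partially on each mapper before the shuffle.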
  21. join
      in:  PTable<K, V>, PTable<K, W>
      out: PTable<K, Pair<Collection<V>, Collection<W>>>
      built by composition:
        parallelDo:    Pair<K, V> → Pair<K, Union<V, W>>
                       Pair<K, W> → Pair<K, Union<V, W>>
        flatten:       → PTable<K, Union<V, W>>
        groupByKey:    → PTable<K, Collection<Union<V, W>>>
        combineValues: → PTable<K, Pair<Collection<V>, Collection<W>>>

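The end result of that pipeline is a cogroup, which a plain-Java sketch (my own names and Pair record, not the Dataflow API) shows directly: collect the V values and the W values that share a key into one (Collection&lt;V&gt;, Collection&lt;W&gt;) pair per key.

```java
import java.util.*;

// Sketch of join as a cogroup: for each key, gather all values from the
// left table and all values from the right table into one pair of lists.
public class JoinDemo {
    record Pair<A, B>(A left, B right) {}

    static <K, V, W> Map<K, Pair<List<V>, List<W>>> join(
            List<Map.Entry<K, V>> vs, List<Map.Entry<K, W>> ws) {
        Map<K, Pair<List<V>, List<W>>> out = new HashMap<>();
        for (Map.Entry<K, V> e : vs)
            out.computeIfAbsent(e.getKey(), k -> new Pair<>(new ArrayList<>(), new ArrayList<>()))
               .left().add(e.getValue());
        for (Map.Entry<K, W> e : ws)
            out.computeIfAbsent(e.getKey(), k -> new Pair<>(new ArrayList<>(), new ArrayList<>()))
               .right().add(e.getValue());
        return out;
    }

    public static void main(String[] args) {
        var joined = join(List.of(Map.entry("k", 1)),
                          List.of(Map.entry("k", "a"), Map.entry("k", "b")));
        System.out.println(joined.get("k").left() + " " + joined.get("k").right());
    }
}
```

In the slide's formulation the same grouping is reached by tagging each side with a Union type, flattening both tables together, and grouping by key; this sketch just skips the intermediate representation.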
  22. Why Cloud Dataflow?
      • Your code describes a graph of execution
      • That graph is compiled to one or more MapReduces
      • The graph is optimized (e.g. using associative reducers)
      • Forget about the infrastructure, concentrate on your data
      • Both batch and stream analysis supported

  23. Advantages of Cloud Dataflow
      [charts: average relative runtime and average relative code size across 4 different benchmarks, lower is better, comparing FlumeJava (Dataflow), hand-optimized MapReduce, modular MapReduce, and Sawzall (an internal log-crunching language that runs on MapReduce)]