
MapReduce is dead! Long live Cloud Dataflow

This talk explains Google Cloud Dataflow, one of the most recently released products of the Google Cloud Platform. Dataflow is the result of many years of research on data-processing models done internally at Google, and it has the power to revolutionize how you process your own data.

We will cover the basics of MapReduce, then point out some of the limitations of the model, and show how Dataflow solves most of them.
The goal of the session is to give you an understanding of the possibilities the Google Cloud Platform offers in terms of Big Data processing. Google has been doing this for more than a decade, and we want everyone to share the experience.

Francesc Campoy Flores

May 06, 2015

Transcript

  1. About me
     • Developer Advocate
     • Gopher, but will write Java for food
     • I’m on: @francesc, google.com/+FrancescCampoyFlores, [email protected]

  2. History of Big Data at Google
     [timeline from 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Flume, MillWheel]

  3. MapReduce
     • processing and generating large data sets
     • a parallel, distributed algorithm
     • running on clusters of machines
     • three phases: map, shuffle, reduce (input → output)

  4. Simple example: word count
     • map: extract words from every line of text (lines of text → [word: 1])
     • shuffle (you don’t need to write this): group by word ([word: 1] → {word: [1, 1, 1]})
     • reduce: sum all the counts for every word ({word: [1, 1, 1]} → {word: n})

  5. Simple example: word count
     input:   hello world / cruel world
     map:     hello: 1, world: 1 / cruel: 1, world: 1
     shuffle: hello: [1], world: [1, 1], cruel: [1]
     reduce:  hello: 1, world: 2, cruel: 1

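The three phases of the word-count slide can be sketched in plain Java (a minimal illustration, not the Dataflow or MapReduce API; the class and method names are my own):

```java
import java.util.*;

// Minimal sketch of MapReduce word count:
// map emits (word, 1) pairs, shuffle groups them by key, reduce sums each group.
public class WordCount {
    // map: extract words from every line of text, emitting (word, 1)
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // shuffle: group all emitted values by word
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // reduce: sum all the counts for every word
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new HashMap<>();
        groups.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        // the slide's input: "hello world" / "cruel world"
        System.out.println(reduce(shuffle(map(List.of("hello world", "cruel world")))));
    }
}
```

In a real cluster each phase runs on many machines; here the same three functions run sequentially to make the data flow visible.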
  6. Shakespeare’s autocomplete with MapReduce
     input:   hello world / what world
     map:     hello: 1, world: 1 / what: 1, world: 1
     shuffle: hello: [1], world: [1, 1], what: [1]
     reduce:  hello: 1, world: 2, what: 1

  7. Shakespeare’s autocomplete with MapReduce
     map: emit every prefix of each counted word, keyed by prefix with the word and its count, e.g.
          h: (1, hello), he: (1, hello), …, hello: (1, hello)
          w: (1, what), wh: (1, what), …, what: (1, what)
          w: (2, world), wo: (2, world), …, world: (2, world)
     shuffle: group by prefix, so "w" gets both (1, what) and (2, world)
     reduce: keep the most frequent completion for each prefix, e.g. w: (2, world)

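The autocomplete map step described above can be sketched as a plain-Java helper (an assumed illustration, not code from the deck): for every (word, n) pair produced by word count, emit one entry per prefix, so that grouping by prefix lets a reducer later pick the best completion.

```java
import java.util.*;

// Sketch of the autocomplete map step: from word counts, build a table of
// prefix -> {candidate word -> count}. A reduce over each prefix's candidates
// would then keep the most frequent completion.
public class Autocomplete {
    static Map<String, Map<String, Integer>> byPrefix(Map<String, Integer> wordCounts) {
        Map<String, Map<String, Integer>> out = new HashMap<>();
        wordCounts.forEach((word, n) -> {
            for (int i = 1; i <= word.length(); i++) {
                String prefix = word.substring(0, i);
                out.computeIfAbsent(prefix, p -> new HashMap<>()).put(word, n);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // "w" collects both candidates: what: 1 and world: 2
        System.out.println(byPrefix(Map.of("hello", 1, "world", 2, "what", 1)).get("w"));
    }
}
```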
  8. Problems of MapReduce
     • Very simple concept. Too simple?
     • Tweaking necessary: how many machines for the map? how many machines for the reduce?
     • I don’t care about the infrastructure, I care about the results!

  9. History of Big Data at Google
     [timeline from 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Flume, MillWheel]

  10. BigQuery
      • Analytical database as a service
      • Understands SQL
      • Analyzes terabytes of data in seconds
      • Imports JSON, CSV, data streams
      • $0.02/GB/month storage; $5/TB of queried data

  11. Word count in BigQuery

      SELECT word
      FROM (
        SELECT SPLIT(
          REGEXP_REPLACE(line, '[^a-z^A-Z]*([a-zA-Z]+)[^a-z^A-Z]*', '\\1|'),
          '|') word
        FROM [shakespeare.lines]
      )
      WHERE LENGTH(line) > 0

  12. Word count in BigQuery

      SELECT word, count(*) as n
      FROM (
        SELECT SPLIT(
          REGEXP_REPLACE(line, '[^a-z^A-Z]*([a-zA-Z]+)[^a-z^A-Z]*', '\\1|'),
          '|') word
        FROM [shakespeare.lines]
      )
      WHERE LENGTH(line) > 0
      GROUP BY word

  13. Word count in BigQuery

      SELECT word, count(*) as n
      FROM (
        SELECT SPLIT(
          REGEXP_REPLACE(line, '[^a-z^A-Z]*([a-zA-Z]+)[^a-z^A-Z]*', '\\1|'),
          '|') word
        FROM [shakespeare.lines]
      )
      WHERE LENGTH(line) > 0
      GROUP BY word
      ORDER BY n DESC

  14. History of Big Data at Google
      [timeline from 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Flume, MillWheel]

  15. Associative reducers
      • A reducer is associative if (x ∗ y) ∗ z = x ∗ (y ∗ z) for all x, y, z in S.
      • In other words: you can reduce all the intermediary results in any order.
      • Word counting is associative.

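The point of the definition can be checked with a tiny sketch (my own illustration): because summing is associative, the partial counts produced by several reducers can be combined in any grouping and the total never changes.

```java
import java.util.List;

// Sketch: an associative reducer (sum) lets partial results from different
// machines be combined in any order.
public class Associativity {
    static int sum(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> ones = List.of(1, 1, 1, 1, 1, 1, 1, 1, 1);
        int total = sum(ones);                                // one big reduce
        int partials = sum(List.of(sum(ones.subList(0, 5)),   // two partial reduces
                                   sum(ones.subList(5, 9)))); // then one combine
        System.out.println(total == partials); // grouping does not matter
    }
}
```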
  16. Why is associativity important?
      [diagram: three mappers each emit a stream of 1s; two reducers independently sum their shares into partial results (5 and 4), which can later be combined into the final count]

  17. groupByKey
      in:  PCollection<Pair<K, V>>              // multimap
      out: PCollection<Pair<K, Collection<V>>>  // unimap
      example:
        in:  [“hello”, 1], [“hello”, 1]
        out: [“hello”, [1, 1]]

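A plain-Java sketch of what groupByKey does to the data (not the actual Dataflow/FlumeJava API, and the class name is my own):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of groupByKey: a multimap of (K, V) pairs in,
// one (K, all values for K) entry out.
public class GroupByKeyDemo {
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        // the slide's example: ["hello", 1], ["hello", 1] -> ["hello", [1, 1]]
        System.out.println(groupByKey(List.of(Map.entry("hello", 1), Map.entry("hello", 1))));
    }
}
```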
  18. combineValues (CombFn)
      in:  PCollection<Pair<K, Collection<V>>>  // unimap
      out: PCollection<Pair<K, V>>              // unimap
      example:
        in:  [“hello”, [1, 1, 1]]
        out: [“hello”, 3]

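The same idea in plain Java (a sketch under my own names, not the real CombFn interface): apply a combining function to every group, turning (K, [v1, v2, …]) entries into (K, combined) entries.

```java
import java.util.*;
import java.util.function.BinaryOperator;

// Sketch of combineValues: reduce each key's collection of values
// with a combining function such as Integer::sum.
public class CombineValuesDemo {
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> groups, BinaryOperator<V> fn) {
        Map<K, V> out = new HashMap<>();
        groups.forEach((k, vs) -> out.put(k, vs.stream().reduce(fn).orElseThrow()));
        return out;
    }

    public static void main(String[] args) {
        // the slide's example: ["hello", [1, 1, 1]] -> ["hello", 3]
        System.out.println(combineValues(Map.of("hello", List.of(1, 1, 1)), Integer::sum));
    }
}
```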
  19. flatten
      in:  Collection<PCollection<T>>
      out: PCollection<T>
      example:
        in:  {[“hello”, 1], [“world”, 1]}, {[“what”, 1], [“world”, 1]}
        out: [“hello”, 1], [“world”, 1], [“what”, 1], [“world”, 1]

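Flatten is simple concatenation, which a short plain-Java sketch (my own naming, not the Dataflow API) makes concrete: duplicates survive because no grouping happens here.

```java
import java.util.*;

// Sketch of flatten: merge several collections into a single collection,
// preserving every element, including duplicates.
public class FlattenDemo {
    static <T> List<T> flatten(List<List<T>> collections) {
        List<T> out = new ArrayList<>();
        for (List<T> c : collections) out.addAll(c);
        return out;
    }

    public static void main(String[] args) {
        // the slide's example: two collections of pairs become one
        System.out.println(flatten(List.of(
                List.of(Map.entry("hello", 1), Map.entry("world", 1)),
                List.of(Map.entry("what", 1), Map.entry("world", 1)))));
    }
}
```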
  20. count
      in:  PCollection<T>
      out: PCollection<Pair<T, Integer>>
      built by composition (PTable<K, V> == PCollection<Pair<K, V>>):
        parallelDo:    T → Pair<T, 1>
        groupByKey:    PTable<T, 1> → PTable<T, Collection<1>>
        combineValues: PTable<T, Collection<Integer>> → PTable<T, Integer>

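The composition on this slide can be followed step by step in plain Java (a sketch with my own names, not the real PCollection API): parallelDo emits (t, 1), groupByKey collects the 1s, and combineValues sums them.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of count as the composition parallelDo -> groupByKey -> combineValues.
public class CountDemo {
    static <T> Map<T, Integer> count(List<T> in) {
        // parallelDo: T -> (T, 1)
        List<Map.Entry<T, Integer>> pairs = in.stream()
                .map(t -> Map.entry(t, 1)).collect(Collectors.toList());
        // groupByKey: (T, 1) pairs -> (T, [1, 1, ...])
        Map<T, List<Integer>> groups = pairs.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        // combineValues: sum each group
        Map<T, Integer> out = new HashMap<>();
        groups.forEach((t, ones) -> out.put(t, ones.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("world", "hello", "world")));
    }
}
```

Because the combine step is an associative sum, an optimizer is free to run it partially on each mapper before the shuffle.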
  21. join
      in:  PTable<K, V>, PTable<K, W>
      out: PTable<K, Pair<Collection<V>, Collection<W>>>
      built by composition:
        parallelDo:    Pair<K, V> → Pair<K, Union<V, W>>
                       Pair<K, W> → Pair<K, Union<V, W>>
        flatten:       → PTable<K, Union<V, W>>
        groupByKey:    → PTable<K, Collection<Union<V, W>>>
        combineValues: → PTable<K, Pair<Collection<V>, Collection<W>>>

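The end result of that pipeline is a cogroup, which a plain-Java sketch (my own names and Pair record, not the Dataflow API) shows directly: collect the V values and the W values that share a key into one (Collection&lt;V&gt;, Collection&lt;W&gt;) pair per key.

```java
import java.util.*;

// Sketch of join as a cogroup: for each key, gather all values from the
// left table and all values from the right table into one pair of lists.
public class JoinDemo {
    record Pair<A, B>(A left, B right) {}

    static <K, V, W> Map<K, Pair<List<V>, List<W>>> join(
            List<Map.Entry<K, V>> vs, List<Map.Entry<K, W>> ws) {
        Map<K, Pair<List<V>, List<W>>> out = new HashMap<>();
        for (Map.Entry<K, V> e : vs)
            out.computeIfAbsent(e.getKey(), k -> new Pair<>(new ArrayList<>(), new ArrayList<>()))
               .left().add(e.getValue());
        for (Map.Entry<K, W> e : ws)
            out.computeIfAbsent(e.getKey(), k -> new Pair<>(new ArrayList<>(), new ArrayList<>()))
               .right().add(e.getValue());
        return out;
    }

    public static void main(String[] args) {
        var joined = join(List.of(Map.entry("k", 1)),
                          List.of(Map.entry("k", "a"), Map.entry("k", "b")));
        System.out.println(joined.get("k").left() + " " + joined.get("k").right());
    }
}
```

In the slide's formulation the same grouping is reached by tagging each side with a Union type, flattening both tables together, and grouping by key; this sketch just skips the intermediate representation.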
  22. Why Cloud Dataflow?
      • Your code describes a graph of execution
      • That graph is compiled to one or more MapReduces
      • The graph is optimized (e.g. using associative reducers)
      • Forget about the infrastructure, concentrate on your data
      • Both batch and stream analysis supported

  23. Advantages of Cloud Dataflow
      [charts: average relative runtime and average relative code size across 4 different benchmarks, lower is better, comparing FlumeJava (Dataflow), hand-optimized MapReduce, modular MapReduce, and Sawzall (an internal log-crunching language that runs on MapReduce)]