
MapReduce

This deck introduces the MapReduce framework, a key paradigm for high-performance computing and large-scale data analysis.
Motivating examples, design principles, and the computational paradigm are explained, along with a detailed overview of the execution workflow.

Valerio Maggio

April 24, 2020


Transcript

  1. Map Reduce @leriomaggio valerio.maggio@bristol.ac.uk Valerio Maggio, PhD

  2. Map Reduce: a programming model (and corresponding implementation) for processing and generating large data sets.
     J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Operating Systems Design and Implementation, pages 137–149, 2004.
  3. Background. Time: 2004, the beginning of the multi-core era. Context: industry-wide shift towards multi-core machines; Google infrastructure (cluster-based computing environment) —> process large amounts of raw data (e.g. crawled docs, web request logs) —> compute various kinds of derived data (e.g. inverted indices). Issues: ad-hoc (complex) solutions; input data usually large; parallelise the code; distribute the data; handle failures. These are the key challenges of high-throughput computation.
  4. The idea. Move the complexity into the backend: —> (automatic) parallelisation of computation and distribution of data; —> I/O scheduling and monitoring. Allow for simple solutions with "small" (user-defined) input data: —> decompose the problem into multiple "smaller" tasks (divide and conquer). Each solution must comply with the computational paradigm required by the framework —> Convention over Configuration principle.
  5. Map Reduce Computational Paradigm. The core (functional) primitives: map(f, list) -> list; reduce(f, list) -> v. Adapted synopsis with intermediate key/values: map(k1, v1) -> list(k2, v2); reduce(k2, list(v2)) -> list(v2).
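The core primitives on this slide correspond directly to Python's built-in `map` and `functools.reduce`; a minimal sketch of the two signatures:

```python
from functools import reduce

# map(f, list) -> list: apply f to every element.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# reduce(f, list) -> v: fold the list down to a single value.
total = reduce(lambda acc, x: acc + x, squares)  # 30
```

MapReduce adapts these primitives to key/value pairs, with the framework doing the grouping of intermediate keys between the two steps.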
  6. Programming Model. The computation is expressed as input/output key/value pairs. The user defines two functions: map(k1, v1) -> list(k2, v2) processes an input key/value pair and produces a set of intermediate pairs; reduce(k2, list(v2)) -> list(v2) combines all intermediate values for a particular key and produces a set of merged output values (usually one). !! Convention: intermediate <key, value> data types are strings by default !!
  7. Example: Word Count. map(key, doc) -> (word, "1"); reduce(word, list(counts)) -> count.
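The word-count example above can be sketched in plain Python; the single-process `run` driver below (my own illustrative helper, not part of the framework) stands in for the shuffle/group step that MapReduce performs between map and reduce:

```python
from collections import defaultdict

def mapper(key, doc):
    # key: document name (unused here); doc: its text.
    for word in doc.split():
        yield (word, 1)

def reducer(word, counts):
    # Combine all intermediate counts for one word.
    return (word, sum(counts))

def run(docs):
    # Stand-in for the framework: group intermediate pairs by key,
    # then apply the reducer to each group.
    groups = defaultdict(list)
    for key, doc in docs.items():
        for k, v in mapper(key, doc):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

result = run({"d1": "the quick the", "d2": "quick"})
# result == {"the": 2, "quick": 2}
```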
  8. Example: Reverse Web-Link Graph. map(source, doc) -> (target, source); reduce(target, list(source)) -> (target, list(source)). Used in Google Page Rank.
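A sketch of the reverse web-link graph in the same style, assuming each page is given as a list of outgoing link targets (the `reverse_graph` driver is again my own illustrative stand-in for the framework's grouping step):

```python
from collections import defaultdict

def mapper(source, links):
    # For each outgoing link, emit (target, source):
    # the link direction is reversed.
    for target in links:
        yield (target, source)

def reducer(target, sources):
    # All pages linking to `target`, i.e. its in-links.
    return (target, sorted(sources))

def reverse_graph(pages):
    groups = defaultdict(list)
    for source, links in pages.items():
        for target, src in mapper(source, links):
            groups[target].append(src)
    return dict(reducer(t, s) for t, s in groups.items())

graph = reverse_graph({"a.com": ["b.com"], "c.com": ["b.com", "a.com"]})
# graph == {"b.com": ["a.com", "c.com"], "a.com": ["c.com"]}
```

The output per target is exactly the in-link list that PageRank needs as input.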
  9. Execution Overview. Source: http://bit.ly/map-reduce-paper. Input data is split into M partitions —> M map tasks; intermediate keys are split into R partitions via the partitioning function hash(key) mod R —> R reduce tasks. Workers ≠ tasks: typically M >> R > W (e.g. M = 200K map tasks, R = 5K reduce tasks, W = 2K workers). Reader, combiner, and remote procedure calls move the data; the master tracks worker status. The R output files can serve as input to another MR job.
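The partitioning function hash(key) mod R can be sketched as below; a stable digest (here MD5, my choice for the example) is used instead of Python's built-in `hash()`, which is salted per process and so would not route a key consistently across workers:

```python
import hashlib

def partition(key, R):
    # Route an intermediate key to one of R reduce tasks.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % R
```

Every occurrence of the same key, no matter which map worker emitted it, lands in the same reduce partition, which is what lets each reducer see the complete list(v2) for its keys.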
  10. Fault Tolerance: Re-Execution. On a worker failure: —> detect failures by periodic heartbeats; —> re-execute completed and in-progress map tasks; —> re-execute in-progress reduce tasks; —> task completion is committed through the master. On a master failure (very unlikely): —> write checkpoints periodically & re-execute.
  11. (additional) Refinements. Locality (mapper): —> (with thousands of machines) read input at local disk speed; —> (otherwise) rack switches limit the read rate. Skipping bad records: —> on seg-fault, send a UDP packet to the master (with the record's sequence number); —> if the master sees two failures for the same record => skip it. Reducer: —> custom combiner to save network bandwidth in communication; —> compression of intermediate data.
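The combiner refinement above can be sketched for the word-count example: it runs on the map worker, locally collapsing repeated (word, 1) pairs into a single (word, n) pair before anything crosses the network (function names here are mine, for illustration):

```python
from collections import Counter

def mapper(key, doc):
    for word in doc.split():
        yield (word, 1)

def combiner(pairs):
    # Local pre-aggregation on the map worker: many ("the", 1)
    # pairs collapse into one ("the", n) pair, so far fewer
    # intermediate pairs are sent to the reducers.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

pairs = list(mapper("d1", "a b a a"))   # 4 pairs emitted
combined = combiner(pairs)              # 2 pairs shipped
```

This works for word count because the reduce function (summation) is associative and commutative, so partial sums can be merged safely; a combiner is not valid for every reducer.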
  12. The “Map Reduce effect”

  13. In the next lecture…