
MapReduce

This deck introduces the MapReduce framework, an influential paradigm for high-performance computing and data analysis.
Motivating examples, design principles, and the computational paradigm are explained, along with a detailed overview of the execution workflow.

Valerio Maggio

April 24, 2020

Transcript

  1. Map Reduce: a programming model (and corresponding implementation) for processing and generating large data sets.
     Reference: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Operating Systems Design and Implementation, pages 137–149, 2004.
  2. Background
     Time: 2004, the beginning of the multi-core era.
     Context: an industry-wide shift towards multi-core machines; the Google infrastructure (a cluster-based computing environment) must —> process large amounts of raw data (e.g. crawled docs, web request logs) and —> compute various kinds of derived data (e.g. inverted indices).
     Issues: ad-hoc (complex) solutions; input data usually large; parallelise the code; distribute the data; handle failures. These are the key challenges of high-throughput computation.
  3. The idea
     Move the complexity into the backend: —> (automatic) parallelisation of computation and distribution of data; —> I/O scheduling and monitoring.
     Allow for simple solutions on "small" (user-defined) chunks of the input —> decompose the problem into multiple "smaller" tasks (divide and conquer).
     Each solution must comply with the computational paradigm required by the framework —> the Convention over Configuration principle.
  4. Map Reduce Computational Paradigm
     The core (functional) primitives (see the Python sketch below):
       map(f, list) -> list
       reduce(f, list) -> v
     Adapted synopsis, introducing intermediate key/value pairs:
       map(k1, v1) -> list(k2, v2)
       reduce(k2, list(v2)) -> list(v2)
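A minimal Python sketch of the two core primitives, before the key/value adaptation (Python's built-in map and functools.reduce match the synopsis directly):

    from functools import reduce

    numbers = [1, 2, 3, 4]

    # map(f, list) -> list: apply f to every element.
    squared = list(map(lambda x: x * x, numbers))   # [1, 4, 9, 16]

    # reduce(f, list) -> v: fold the list into a single value.
    total = reduce(lambda a, b: a + b, squared)     # 30
    print(squared, total)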
  5. Programming Model
     The computation is expressed as input/output key/value pairs. The user defines two functions:
       map(k1, v1) -> list(k2, v2): processes an input key/value pair and produces a set of intermediate pairs;
       reduce(k2, list(v2)) -> list(v2): combines all intermediate values for a particular key and produces a set of merged output values (usually one).
     !! Convention: intermediate <key, value> data types are strings by default !!
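A minimal sketch of this model using the canonical word-count example from the MapReduce paper; the shuffle/group-by-key step that the framework normally performs between the two phases is emulated here with a dictionary:

    from collections import defaultdict

    # map(k1, v1) -> list(k2, v2): one ("word", "1") pair per occurrence.
    # Per the slide's convention, intermediate values are strings.
    def map_fn(doc_name, doc_text):
        return [(word, "1") for word in doc_text.split()]

    # reduce(k2, list(v2)) -> list(v2): merge all values for one key.
    def reduce_fn(word, counts):
        return [str(sum(int(c) for c in counts))]

    # Emulate the framework's shuffle: group intermediate pairs by key.
    intermediate = defaultdict(list)
    for key, value in map_fn("doc1", "to be or not to be"):
        intermediate[key].append(value)

    for word, counts in sorted(intermediate.items()):
        print(word, reduce_fn(word, counts))   # e.g. be ['2'] ... to ['2']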
  6. Example: Reverse Web-Link Graph (as used in Google PageRank)
       map(source, doc) -> (target, source)
       reduce(target, list(source)) -> (target, list(source))
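A sketch of the same example in Python, assuming a toy regex-based link extractor (extract_links is illustrative, not part of the framework):

    import re
    from collections import defaultdict

    def extract_links(doc):
        # Toy extractor: pull href targets out of anchor tags.
        return re.findall(r'href="([^"]+)"', doc)

    # map(source, doc) -> (target, source) for every outgoing link.
    def map_fn(source, doc):
        return [(target, source) for target in extract_links(doc)]

    # reduce(target, list(source)) -> (target, list(source)):
    # one adjacency row of the reversed web-link graph.
    def reduce_fn(target, sources):
        return (target, list(sources))

    pages = {"a.html": '<a href="b.html"></a> <a href="c.html"></a>',
             "b.html": '<a href="c.html"></a>'}
    intermediate = defaultdict(list)
    for source, doc in pages.items():
        for target, src in map_fn(source, doc):
            intermediate[target].append(src)

    for target, sources in sorted(intermediate.items()):
        print(reduce_fn(target, sources))
    # ('b.html', ['a.html'])
    # ('c.html', ['a.html', 'b.html'])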
  7. Execution Overview (source: http://bit.ly/map-reduce-paper)
     Input data —> M partitions, one per map task; intermediate keys —> R splits, one per reduce task, via the partitioning function hash(key) mod R.
     Typically M >> R > W, where W is the number of workers (e.g. M = 200K, R = 5K, W = 2K); note that workers ≠ tasks.
     Figure elements from the paper: reader, combiner, remote procedure calls, and the master's worker-status map. The R output files can serve as input to another MR job. (A sketch of the partitioning step follows below.)
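A sketch of the default partitioning step, assuming R = 4 reduce tasks (the value is arbitrary); a stable hash is used because Python's built-in hash() is randomised per process:

    import hashlib

    R = 4  # number of reduce tasks (illustrative)

    def partition(key, r=R):
        # hash(key) mod R: all values for the same intermediate key
        # are routed to the same reduce task.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % r

    for key in ["apple", "banana", "apple"]:
        print(key, "-> reduce task", partition(key))
    # identical keys always land on the same reduce task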
  8. Fault Tolerance: Re-Execution
     On a worker failure (detected by periodic heartbeats): —> re-execute completed and in-progress map tasks; —> re-execute in-progress reduce tasks; —> task completion is committed through the master. (A toy sketch of the detection logic follows below.)
     On a master failure (very unlikely): —> write checkpoints periodically and re-execute.
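A toy sketch of the master's failure-detection logic (all names and timings are illustrative, not from the paper). Map tasks are re-executed even when completed, because their output sits on the failed worker's local disk; reduce tasks need re-execution only if still in progress, since completed reduce output already lives in the global file system:

    import time

    TIMEOUT = 10.0  # seconds of heartbeat silence before a worker is declared failed

    last_heartbeat = {"worker-1": time.time(), "worker-2": time.time() - 30}
    tasks = {"worker-1": [("map", "completed")],
             "worker-2": [("map", "completed"), ("reduce", "completed"),
                          ("reduce", "in-progress")]}

    now = time.time()
    for worker in list(last_heartbeat):
        if now - last_heartbeat[worker] > TIMEOUT:
            for kind, status in tasks.pop(worker):
                # Completed reduce output is safe; everything else reruns.
                if kind == "map" or status == "in-progress":
                    print(f"re-executing {status} {kind} task from {worker}")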
  9. (Additional) Refinements
     Locality (mapper): —> (across thousands of machines) read input at local-disk speed; —> otherwise, rack switches limit the read rate.
     Skipping bad records: —> on a seg-fault, send a UDP packet to the master (with the sequence number of the record); —> if the master sees two failures for the same record => skip it.
     Reducer: —> custom combiner to save network bandwidth in communication (sketched below); —> compression of intermediate data.
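A sketch of a word-count combiner: the reduce logic applied locally on the map side, so many (word, 1) pairs collapse into one partial count per distinct word before crossing the network:

    from collections import Counter

    def map_fn(doc_name, doc_text):
        return [(word, 1) for word in doc_text.split()]

    def combine(pairs):
        # Local partial aggregation on the mapper's output.
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return list(counts.items())

    raw = map_fn("doc1", "to be or not to be")
    print(len(raw), "intermediate pairs before combining")   # 6
    print(len(combine(raw)), "after")                        # 4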