
MapReduce

This deck introduces the MapReduce framework, an influential paradigm for high-performance computing and data analysis.
Motivating examples, design principles, and the computational paradigm are explained, along with a detailed overview of the execution workflow.

Valerio Maggio

April 24, 2020

Transcript

  1. Map Reduce: a programming model (and corresponding implementation) for processing and generating large data sets.
     Reference: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Operating Systems Design and Implementation, pages 137–149, 2004.
  2. Background
     Time: 2004, the beginning of the multi-core era.
     Context: an industry-wide shift towards multi-core machines; the Google infrastructure (a cluster-based computing environment) must —> process large amounts of raw data (e.g. crawled docs, web request logs) and —> compute various kinds of derived data (e.g. inverted indices).
     Issues: ad-hoc (complex) solutions; input data usually large; parallelise the code; distribute the data; handle failures. These are the key challenges of high-throughput computation.
  3. The idea
     Move the complexity into the backend: —> (automatic) parallelisation of computation and distribution of data; —> I/O scheduling and monitoring.
     Allow for simple solutions on "small" (user-defined) chunks of the input —> decompose the problem into multiple "smaller" tasks (divide and conquer).
     Each solution must comply with the computational paradigm required by the framework —> the Convention over Configuration principle.
  4. Map Reduce Computational Paradigm
     The core (functional) primitives (see the Python sketch below):
       map(f, list) -> list
       reduce(f, list) -> v
     Adapted synopsis, introducing intermediate key/value pairs:
       map(k1, v1) -> list(k2, v2)
       reduce(k2, list(v2)) -> list(v2)
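A minimal Python sketch of the two core primitives, before the key/value adaptation (Python's built-in map and functools.reduce match the synopsis directly):

    from functools import reduce

    numbers = [1, 2, 3, 4]

    # map(f, list) -> list: apply f to every element.
    squared = list(map(lambda x: x * x, numbers))   # [1, 4, 9, 16]

    # reduce(f, list) -> v: fold the list into a single value.
    total = reduce(lambda a, b: a + b, squared)     # 30
    print(squared, total)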
  5. Programming Model
     The computation is expressed as input/output key/value pairs. The user defines two functions:
       map(k1, v1) -> list(k2, v2): processes an input key/value pair and produces a set of intermediate pairs;
       reduce(k2, list(v2)) -> list(v2): combines all intermediate values for a particular key and produces a set of merged output values (usually one).
     !! Convention: intermediate <key, value> data types are strings by default !!
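A minimal sketch of this model using the canonical word-count example from the MapReduce paper; the shuffle/group-by-key step that the framework normally performs between the two phases is emulated here with a dictionary:

    from collections import defaultdict

    # map(k1, v1) -> list(k2, v2): one ("word", "1") pair per occurrence.
    # Per the slide's convention, intermediate values are strings.
    def map_fn(doc_name, doc_text):
        return [(word, "1") for word in doc_text.split()]

    # reduce(k2, list(v2)) -> list(v2): merge all values for one key.
    def reduce_fn(word, counts):
        return [str(sum(int(c) for c in counts))]

    # Emulate the framework's shuffle: group intermediate pairs by key.
    intermediate = defaultdict(list)
    for key, value in map_fn("doc1", "to be or not to be"):
        intermediate[key].append(value)

    for word, counts in sorted(intermediate.items()):
        print(word, reduce_fn(word, counts))   # e.g. be ['2'] ... to ['2']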
  6. Example: Reverse Web-Link Graph (as used in Google PageRank)
       map(source, doc) -> (target, source)
       reduce(target, list(source)) -> (target, list(source))
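A sketch of the same example in Python, assuming a toy regex-based link extractor (extract_links is illustrative, not part of the framework):

    import re
    from collections import defaultdict

    def extract_links(doc):
        # Toy extractor: pull href targets out of anchor tags.
        return re.findall(r'href="([^"]+)"', doc)

    # map(source, doc) -> (target, source) for every outgoing link.
    def map_fn(source, doc):
        return [(target, source) for target in extract_links(doc)]

    # reduce(target, list(source)) -> (target, list(source)):
    # one adjacency row of the reversed web-link graph.
    def reduce_fn(target, sources):
        return (target, list(sources))

    pages = {"a.html": '<a href="b.html"></a> <a href="c.html"></a>',
             "b.html": '<a href="c.html"></a>'}
    intermediate = defaultdict(list)
    for source, doc in pages.items():
        for target, src in map_fn(source, doc):
            intermediate[target].append(src)

    for target, sources in sorted(intermediate.items()):
        print(reduce_fn(target, sources))
    # ('b.html', ['a.html'])
    # ('c.html', ['a.html', 'b.html'])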
  7. Execution Overview (source: http://bit.ly/map-reduce-paper)
     Input data —> M partitions, one per map task; intermediate keys —> R splits, one per reduce task, via the partitioning function hash(key) mod R.
     Typically M >> R > W, where W is the number of workers (e.g. M = 200K, R = 5K, W = 2K); note that workers ≠ tasks.
     Figure elements from the paper: reader, combiner, remote procedure calls, and the master's worker-status map. The R output files can serve as input to another MR job. (A sketch of the partitioning step follows below.)
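A sketch of the default partitioning step, assuming R = 4 reduce tasks (the value is arbitrary); a stable hash is used because Python's built-in hash() is randomised per process:

    import hashlib

    R = 4  # number of reduce tasks (illustrative)

    def partition(key, r=R):
        # hash(key) mod R: all values for the same intermediate key
        # are routed to the same reduce task.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % r

    for key in ["apple", "banana", "apple"]:
        print(key, "-> reduce task", partition(key))
    # identical keys always land on the same reduce task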
  8. Fault Tolerance: Re-Execution
     On a worker failure (detected by periodic heartbeats): —> re-execute completed and in-progress map tasks; —> re-execute in-progress reduce tasks; —> task completion is committed through the master. (A toy sketch of the detection logic follows below.)
     On a master failure (very unlikely): —> write checkpoints periodically and re-execute.
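A toy sketch of the master's failure-detection logic (all names and timings are illustrative, not from the paper). Map tasks are re-executed even when completed, because their output sits on the failed worker's local disk; reduce tasks need re-execution only if still in progress, since completed reduce output already lives in the global file system:

    import time

    TIMEOUT = 10.0  # seconds of heartbeat silence before a worker is declared failed

    last_heartbeat = {"worker-1": time.time(), "worker-2": time.time() - 30}
    tasks = {"worker-1": [("map", "completed")],
             "worker-2": [("map", "completed"), ("reduce", "completed"),
                          ("reduce", "in-progress")]}

    now = time.time()
    for worker in list(last_heartbeat):
        if now - last_heartbeat[worker] > TIMEOUT:
            for kind, status in tasks.pop(worker):
                # Completed reduce output is safe; everything else reruns.
                if kind == "map" or status == "in-progress":
                    print(f"re-executing {status} {kind} task from {worker}")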
  9. (Additional) Refinements
     Locality (mapper): —> (across thousands of machines) read input at local-disk speed; —> otherwise, rack switches limit the read rate.
     Skipping bad records: —> on a seg-fault, send a UDP packet to the master (with the sequence number of the record); —> if the master sees two failures for the same record => skip it.
     Reducer: —> custom combiner to save network bandwidth in communication (sketched below); —> compression of intermediate data.
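A sketch of a word-count combiner: the reduce logic applied locally on the map side, so many (word, 1) pairs collapse into one partial count per distinct word before crossing the network:

    from collections import Counter

    def map_fn(doc_name, doc_text):
        return [(word, 1) for word in doc_text.split()]

    def combine(pairs):
        # Local partial aggregation on the mapper's output.
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return list(counts.items())

    raw = map_fn("doc1", "to be or not to be")
    print(len(raw), "intermediate pairs before combining")   # 6
    print(len(combine(raw)), "after")                        # 4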