A programming model (and corresponding implementation) for processing and generating large data sets.
MapReduce
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), pages 137–149, 2004.
Background
Time
—> 2004: the beginning of the multi-core era
Context
—> Industry-wide shift towards multi-core machines
—> Google infrastructure (cluster-based computing environment):
—> process large amounts of raw data (e.g. crawled docs, web request logs)
—> compute various kinds of derived data (e.g. inverted indices)
Issues
—> Ad-hoc (complex) solutions
—> Input data usually large, so every solution must parallelise the code, distribute the data, and handle failures
Key challenges of high-throughput computation
The idea
Move the complexity into the backend
—> (automatic) parallelisation of computation and distribution of data;
—> I/O scheduling and monitoring
Allow for simple solutions over “small” (user-defined) pieces of the input data
—> Decompose the problem into multiple “smaller” tasks (divide and conquer)
Each solution must comply with the computational paradigm required by the framework
—> Convention over Configuration principle
Programming Model
The computation is expressed in terms of input/output key/value pairs.
The user defines two functions:
map(k1, v1) -> list(k2, v2)
—> processes an input key/value pair
—> produces a set of intermediate pairs
reduce(k2, list(v2)) -> list(v2)
—> combines all intermediate values for a particular key
—> produces a set of merged output values (usually just one)
!! Convention: intermediate data types are strings by default !!
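
To make the model concrete, here is the paper's running word-count example rendered as a minimal Python sketch; the toy driver that routes pairs between the two functions is an assumption for illustration, standing in for the framework itself.

def map_fn(key, value):
    # key: document name; value: document contents.
    # Emit an intermediate pair (word, "1") for every word occurrence.
    for word in value.split():
        yield (word, "1")

def reduce_fn(key, values):
    # key: a word; values: all counts emitted for it (strings, per the convention above).
    yield str(sum(int(v) for v in values))

# Toy driver (illustrative only): group intermediate pairs by key, then reduce.
from collections import defaultdict

def run(docs):
    intermediate = defaultdict(list)
    for k1, v1 in docs.items():
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    return {k2: list(reduce_fn(k2, v2s)) for k2, v2s in intermediate.items()}

print(run({"d1": "the quick fox", "d2": "the dog"}))  # {'the': ['2'], 'quick': ['1'], ...}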
Execution Overview
[Figure: execution overview diagram from the paper. Source: http://bit.ly/map-reduce-paper]
—> Input data is split into M partitions, one per map task
—> Intermediate keys are split into R partitions, one per reduce task, using the partitioning function hash(key) mod R (see the sketch after this list)
—> Typical scale: M = 200K map tasks, R = 5K reduce tasks, W = 2K workers; M >> R > W, and workers ≠ tasks
—> Map workers read input through a reader and may run a combiner locally; reduce workers fetch intermediate data via remote procedure calls
—> The master keeps a status map for every task and worker
—> The R output files often become the input of another MR job
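
The default partitioner described above is a one-liner; a Python sketch (Python's built-in hash stands in for the paper's string hash):

def partition(key, R):
    # Map an intermediate key to one of the R reduce partitions.
    # Note: Python randomises string hashes per process; a real
    # implementation would use a stable hash function.
    return hash(key) % R

The paper also lets users supply a custom partitioner, e.g. hash(Hostname(urlkey)) mod R, so that all URLs from the same host end up in the same output file.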
Fault Tolerance: Re-Execution
On a worker failure:
—> detect failures via periodic heartbeats (the master pings every worker)
—> re-execute completed and in-progress map tasks: completed map output sits on the failed worker's local disk, so it is lost
—> re-execute in-progress reduce tasks only: completed reduce output is already in the global file system
—> task completion is committed through the master (see the sketch after this list)
Master failure (very unlikely)
—> write checkpoints periodically & re-execute from the last checkpoint
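
These rules fit in a few lines of master-side bookkeeping. An illustrative Python sketch, assuming a worker-to-last-heartbeat table and a task list (the names and data structures are assumptions, not the paper's implementation):

import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a worker is declared dead (assumed value)

def reap_failed_workers(last_heartbeat, tasks, now=None):
    now = time.time() if now is None else now
    dead = {w for w, t in last_heartbeat.items() if now - t >= HEARTBEAT_TIMEOUT}
    for task in tasks:
        if task["worker"] not in dead:
            continue
        if task["kind"] == "map":
            # Map output lives on the dead worker's local disk: redo even completed tasks.
            task["state"] = "idle"
        elif task["state"] == "in-progress":
            # Completed reduce output is safe in the global FS: only redo in-progress ones.
            task["state"] = "idle"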
(additional) Refinements
Locality (mapper)
—> (thousands of machines) schedule map tasks near their input so workers read at local-disk speed
—> (otherwise) rack switches limit the read rate
Skipping Bad Records
—> on a seg-fault, the worker sends a UDP packet to the master (with the record's sequence number)
—> if the master sees two failures for the same record => skip it on re-execution
Reducer:
—> custom combiner (a partial reduce run on the map worker) to save network bandwidth; see the sketch after this list
—> compression of intermediate data
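
For word count the combiner is just the reducer run early. A minimal Python sketch of local combining on the map worker (the function name is illustrative):

from collections import defaultdict

def combine(pairs):
    # Partially merge (word, "1") pairs before they leave the map worker,
    # so each word crosses the network once per mapper, not once per occurrence.
    merged = defaultdict(int)
    for word, count in pairs:
        merged[word] += int(count)
    return [(word, str(total)) for word, total in merged.items()]

This only works because summing is commutative and associative, which is exactly the condition the paper gives for when a combiner may be applied.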