MapReduce

This deck introduces the MapReduce framework, a widely used paradigm for high-performance computing and data analysis.
It covers motivating examples, design principles, and the computational paradigm, along with a detailed overview of the execution workflow.

Valerio Maggio

April 24, 2020

Transcript

  1. Map Reduce
    @leriomaggio
    Valerio Maggio, PhD

  2. Map Reduce
    A programming model (and corresponding implementation)
    for processing and generating large data sets.
    J. Dean and S. Ghemawat.
    MapReduce: Simplified data processing on large clusters.
    Operating Systems Design and Implementation,
    pages 137–149, 2004.

  3. Background
    Time
    2004: the beginning of the multi-core era
    Context
    Industry-wide shift towards multi-core machines
    Google infrastructure (cluster-based computing environment)
    -> Process large amounts of raw data (e.g. crawled docs, web request logs)
    -> Compute various kinds of derived data (e.g. inverted indices)
    Issues
    Ad-hoc (complex) solutions
    Input data usually large
    Key challenges of high-throughput computation
    Parallelise the code
    Distribute the data
    Handle failures

  4. The idea
    Move the complexity to the backend
    -> (automatic) parallelisation of computation and distribution of data
    -> I/O scheduling and monitoring
    Allow for simple (user-defined) solutions that work on “small” input data
    -> Decompose the problem into multiple “smaller” tasks (divide and conquer)
    Each solution must comply with the computational paradigm required by the framework
    -> Convention over Configuration principle

  5. Map Reduce Computational Paradigm
    The core (functional) primitives:
    map (f, list) -> list
    reduce (f, list) -> v
    Adapted synopsis:
    map (k1, v1) -> list(k2, v2)
    reduce (k2, list(v2)) -> list(v2)
    Intermediate key/values: (k2, v2)
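
The two functional primitives above can be tried directly in Python. This is only a minimal sketch of the primitives themselves (the lambdas and the sample list are illustrative assumptions, not part of the deck); it does not involve any distribution or key/value handling.

```python
from functools import reduce

# map(f, list) -> list: apply f to every element independently.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# reduce(f, list) -> v: fold the whole list into a single value.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Because each call to the mapped function is independent of the others, the map step is the part that parallelises naturally.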

  6. Programming Model
    The computation is expressed as input/output key/value pairs.
    The user defines two functions:
    map (k1, v1) -> list(k2, v2)
    -> Processes an input key/value pair
    -> Produces a set of intermediate pairs
    reduce (k2, list(v2)) -> list(v2)
    -> Combines all intermediate values for a particular key
    -> Produces a set of merged output values (usually one)
    !! Convention: intermediate data types are strings by default !!
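
To make the contract between the two user-defined functions concrete, here is a minimal single-process sketch. The driver `run_mapreduce`, the sample records and the "maximum temperature per city" job are hypothetical names and data chosen for illustration; a real framework would run the same three phases across many machines.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, reduce."""
    intermediate = defaultdict(list)
    # Map phase: each (k1, v1) yields zero or more (k2, v2) pairs.
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            # Grouping ("shuffle"): collect all values sharing a key.
            intermediate[k2].append(v2)
    # Reduce phase: one call per intermediate key.
    return {k2: reduce_fn(k2, values)
            for k2, values in sorted(intermediate.items())}

def map_fn(record_id, line):
    city, temperature = line.split()
    yield city, int(temperature)

def reduce_fn(city, temperatures):
    return [max(temperatures)]  # list(v2), here with a single element

records = [("r1", "rome 21"), ("r2", "oslo 9"), ("r3", "rome 25")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)  # {'oslo': [9], 'rome': [25]}
```

The user code never mentions parallelism or data placement; only the driver would change when moving from one process to a cluster.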

  7. Example: Word Count
    map (key, doc) -> (word, “1”)
    reduce (word, list(counts)) -> count
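
A minimal sketch of this word count in plain Python follows. The sample documents and the in-memory grouping loop are assumptions for illustration; the values are kept as strings to mirror the (word, “1”) pairs on the slide.

```python
from collections import defaultdict

def map_fn(key, doc):
    # Emit (word, "1") for every word occurrence in the document.
    for word in doc.split():
        yield word.lower(), "1"

def reduce_fn(word, counts):
    # Sum all the "1"s collected for this word.
    return str(sum(int(c) for c in counts))

docs = {"d1": "the quick fox", "d2": "The lazy dog and the fox"}

# Group intermediate values by key (the framework's job in a real run).
groups = defaultdict(list)
for key, doc in docs.items():
    for word, one in map_fn(key, doc):
        groups[word].append(one)

word_counts = {word: reduce_fn(word, counts) for word, counts in groups.items()}
print(word_counts["the"], word_counts["fox"], word_counts["dog"])  # 3 2 1
```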

  8. Example: Reverse Web-Link Graph
    map (source, doc) -> (target, source)
    reduce (target, list(source)) -> (target, list(source))
    Google Page Rank
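
The reverse web-link graph can be sketched the same way. The page names, the HTML snippets and the simple `href` regular expression are illustrative assumptions; a real crawler would use a proper HTML parser.

```python
import re
from collections import defaultdict

HREF = re.compile(r'href="([^"]+)"')  # naive link extraction, for the sketch only

def map_fn(source, doc):
    # Emit (target, source) for every link found in the source page.
    for target in HREF.findall(doc):
        yield target, source

def reduce_fn(target, sources):
    # Collect every page that links to this target.
    return target, sorted(sources)

pages = {
    "a.html": '<a href="b.html">b</a> <a href="c.html">c</a>',
    "b.html": '<a href="c.html">c</a>',
    "c.html": "",
}

groups = defaultdict(list)
for source, doc in pages.items():
    for target, src in map_fn(source, doc):
        groups[target].append(src)

reverse_graph = dict(reduce_fn(t, s) for t, s in groups.items())
print(reverse_graph)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```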

  9. Execution
    Overview
    Source: http://bit.ly/map-reduce-paper
    Input data -> M partitions (M map tasks)
    Intermediate keys -> R splits (R reduce tasks)
    Partitioning function: hash(key) mod R
    W workers ≠ tasks; M >> R > W (e.g. M = 200K, R = 5K, W = 2K)
    R output files -> input to another MR job
    Figure labels: Reader, Combiner, Remote Procedure Call, Worker, status map
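
The partitioning function hash(key) mod R can be sketched as below. Using `zlib.crc32` as the hash is an assumption made here because Python's built-in `hash()` is randomised per process for strings; the function name `partition` and the sample pairs are likewise hypothetical.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    # Stable hash of the key, reduced modulo R.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

R = 4
pairs = [("fox", "1"), ("dog", "1"), ("fox", "1"), ("the", "1"), ("dog", "1")]

# Each reduce task r receives only the pairs whose key maps to r.
splits = {r: [] for r in range(R)}
for key, value in pairs:
    splits[partition(key, R)].append((key, value))

# Every occurrence of a key lands in the same split.
fox_splits = {r for r, kv in splits.items() if any(k == "fox" for k, _ in kv)}
print(len(fox_splits))  # 1
```

This is what guarantees that a single reduce task sees all values for a given key, regardless of which map task produced them.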

  10. Fault Tolerance: Re-Execution
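    On a worker failure:
    -> Detect failures via periodic heartbeats
    -> Re-execute completed and in-progress map tasks (their output lives on the failed machine's local disk)
    -> Re-execute in-progress reduce tasks only (completed reduce output is already in the global file system)
    -> Task completion is committed through the master
    On a master failure (very unlikely):
    -> Write periodic checkpoints and re-execute from the last one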
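
The re-execution rules for a worker failure can be written down as a small decision function. This is a sketch under stated assumptions: the `Task` record, its state names and `handle_worker_failure` are hypothetical and only model the bookkeeping, not heartbeats or scheduling.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str                     # "idle", "in_progress" or "completed"
    worker: Optional[str] = None   # worker the task was assigned to

def handle_worker_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Reset to idle every task that must be re-executed."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_map_output:
            task.state, task.worker = "idle", None
            rescheduled.append(task)
    return rescheduled

tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("map", "completed", "w2"),
]
reset = handle_worker_failure(tasks, "w1")
print(len(reset))  # 3
```

Only the completed reduce task on the failed worker and the tasks on other workers are left untouched.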

  11. (additional) Refinements
    Locality (mapper)
    -> (thousands of machines) read input at local-disk speed
    -> (otherwise) rack switches limit the read rate
    Skipping Bad Records
    -> on a segmentation fault, send a UDP packet to the master (with the sequence number of the record)
    -> if the master sees two failures on the same record => skip it
    Reducer:
    -> Custom combiner: saves network bandwidth in communication
    -> Compression of intermediate data
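
The bandwidth saving from a combiner is easy to see on word count. In this minimal sketch the helper names `map_words` and `combine` and the sample sentence are assumptions; the combiner simply pre-aggregates one mapper's output before it would be sent to the reducers.

```python
from collections import Counter

def map_words(doc):
    # Plain mapper output: one (word, 1) pair per occurrence.
    return [(word, 1) for word in doc.split()]

def combine(pairs):
    # Local pre-aggregation on the mapper's machine.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

raw = map_words("to be or not to be")
combined = combine(raw)

print(len(raw), len(combined))  # 6 4
print(dict(combined))           # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

This works because addition is associative and commutative, so summing partial counts on the mapper does not change the final result computed by the reducer.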

  12. The “Map Reduce effect”

  13. In the next lecture…
