Slide 13
Slide 13 text
[Diagram: word count with MapReduce. Flow: Input -> Mappers -> Sort, Shuffle -> Reducers]

Input: (doc1, "…"), (doc2, "…"), (doc3, ""), (doc4, "…")
Document texts shown: "Hadoop uses MapReduce", "There is a Map phase", "There is a Reduce phase"

Mapper output:
  (hadoop, 1), (uses, 1), (mapreduce, 1)
  (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
  (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)

Reducer input after Sort, Shuffle, by key range:
  0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
  m-q: (map, [1]), (mapreduce, [1]), (phase, [1,1])
  r-z: (reduce, [1]), (there, [1,1]), (uses, [1])
Friday, March 16, 12
The mappers emit key-value pairs in which each key is a word and each value is a count.
In the most naive (but also most memory-efficient) implementation, each mapper simply
emits (word, 1) every time it sees “word”, without aggregating any counts locally.
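To make the naive mapper concrete, here is a minimal sketch in plain Python. It is not the Hadoop API. The function name map_word_count, the lowercasing, and the alphanumeric tokenization are assumptions made for illustration, and the document ID passed in the usage line is a hypothetical placeholder.

```python
import re


def map_word_count(doc_id, text):
    """Yield (word, 1) once per word occurrence in a single document."""
    # doc_id is the input key; word count ignores it.
    # Assumed tokenization: lowercase, runs of letters and digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        yield (word, 1)


# Hypothetical document ID; the text is one of the slide's documents.
pairs = list(map_word_count("some_doc", "There is a Map phase"))
empty = list(map_word_count("empty_doc", ""))
```

Because the mapper holds no running totals, its memory use stays constant no matter how large the document is, at the cost of emitting one pair per word occurrence.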
The mappers do not decide which reducer receives each pair. Rather, the job setup
configures the routing, and the Hadoop runtime enforces it during the Sort/Shuffle phase:
the key-value pairs in each mapper are sorted by key (locally within that mapper, not
globally or “totally”), and then the pairs are routed to the correct reducer, which may
run on the same machine or on another one.
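The local sort followed by grouping on the reducer side can be sketched as below. This is a plain Python simulation under simplifying assumptions, not what the Hadoop runtime literally executes: it uses a single reducer, leaves out partitioning, and the helper names sort_locally and merge_and_group are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def sort_locally(pairs):
    """Sort one mapper's output by key; other mappers are not consulted."""
    return sorted(pairs, key=itemgetter(0))


def merge_and_group(sorted_runs):
    """Merge already-sorted runs and collect the values for each key."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Mapper outputs taken from the slide; the third mapper saw the empty document.
mapper_outputs = [
    [("hadoop", 1), ("uses", 1), ("mapreduce", 1)],
    [("there", 1), ("is", 1), ("a", 1), ("map", 1), ("phase", 1)],
    [],
    [("there", 1), ("is", 1), ("a", 1), ("reduce", 1), ("phase", 1)],
]
runs = [sort_locally(output) for output in mapper_outputs]
grouped = merge_and_group(runs)
```

Each run is sorted on its own, so no step needs a global view of all keys; the merge only has to interleave runs that are already in order.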
Note how the keys are partitioned among the reducers by their first character (0-9 and
a-l, m-q, r-z). Also note that the mapper for the empty document emits no pairs, as you
would expect.
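The first-character partitioning on the slide could be expressed roughly as follows. This is a plain Python sketch rather than a Hadoop Partitioner implementation. The names partition_for, route, and reduce_counts are hypothetical, keys are assumed to be non-empty, sending any other leading character to the last reducer is an assumption, and summing the values is the usual word-count reduce step, which the slide implies but does not show.

```python
from collections import defaultdict


def partition_for(key):
    """Return the reducer index (0, 1, or 2) for a non-empty key."""
    first = key[0].lower()
    if first.isdigit() or "a" <= first <= "l":
        return 0  # 0-9, a-l
    if "m" <= first <= "q":
        return 1  # m-q
    return 2  # r-z (assumption: any other character also lands here)


def route(pairs):
    """Group values by key inside the bucket chosen by partition_for."""
    buckets = [defaultdict(list) for _ in range(3)]
    for key, value in pairs:
        buckets[partition_for(key)][key].append(value)
    return buckets


def reduce_counts(bucket):
    """Sum the values for each key, visiting keys in sorted order."""
    return {key: sum(values) for key, values in sorted(bucket.items())}


# All pairs emitted by the slide's mappers (the empty document adds none).
emitted = [
    ("hadoop", 1), ("uses", 1), ("mapreduce", 1),
    ("there", 1), ("is", 1), ("a", 1), ("map", 1), ("phase", 1),
    ("there", 1), ("is", 1), ("a", 1), ("reduce", 1), ("phase", 1),
]
totals = [reduce_counts(bucket) for bucket in route(emitted)]
```

With these ranges, every occurrence of a given word reaches the same reducer, which is what lets each reducer compute a complete count for its keys.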