Slide 27
Step 2: Create the execution plan
• Pipeline as much as possible
• Split into “stages” based on the need to “shuffle” data
[Diagram: RDD lineage split into two stages at the shuffle]
Stage 1: HadoopRDD -> MappedRDD
  Alice, Bob, Andy -> (A, Alice), (B, Bob), (A, Andy)
Stage 2: ShuffledRDD -> MappedValuesRDD -> Array[(Char, Int)]
  (A, (Alice, Andy)), (B, Bob) -> (A, 2), (B, 1)
  Res0 = [(A, 2), (B, 1)]
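
The example on this slide can be traced with a small sketch. It uses plain Scala collections as a stand-in for RDDs, so it runs without Spark (Scala 2.13 or later). The object name ExecutionPlanSketch and the helpers stage1, shuffle, stage2 and run are hypothetical names chosen for illustration; they are not Spark APIs, and the real scheduler is considerably more involved.

```scala
// Plain Scala collections standing in for Spark RDDs (assumption: no Spark involved).
object ExecutionPlanSketch {
  // Stage 1: reading and mapping are pipelined. The iterator pushes each
  // record through the map step without building an intermediate collection.
  def stage1(lines: Iterator[String]): Iterator[(Char, String)] =
    lines.map(name => (name.charAt(0), name))

  // Stage boundary: grouping needs all records with the same key in one
  // place, so every record must be seen before stage 2 can start.
  def shuffle(pairs: Iterator[(Char, String)]): Map[Char, Seq[String]] =
    pairs.toSeq.groupMap(_._1)(_._2)

  // Stage 2: count the names under each key, then collect into an array.
  // Sorting by key only makes the output order deterministic for this sketch.
  def stage2(grouped: Map[Char, Seq[String]]): Array[(Char, Int)] =
    grouped.map { case (key, names) => (key, names.size) }.toArray.sortBy(_._1)

  def run(names: Seq[String]): Array[(Char, Int)] =
    stage2(shuffle(stage1(names.iterator)))

  def main(args: Array[String]): Unit =
    println(run(Seq("Alice", "Bob", "Andy")).mkString("[", ", ", "]"))
    // prints [(A,2), (B,1)]
}
```

The shuffle helper is the only step that has to materialize all of its input, which is why the plan is cut into stages at that point.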