Slide 1

Slide 1 text

INTRODUCTION KNOW THYSELF IN-MAPPER COMBINING PAIRS AND STRIPES WRAPPING UP
Using Local Aggregation in MapReduce Jobs
Andrew Johnson
March 25, 2013

Slide 2

Slide 2 text

WHO AM I?
Software engineer at Explorys
CWRU grad in Computer Science, May 2012

Slide 3

Slide 3 text

OVERVIEW
You’ve got your cluster set up and configured well.
You’ve written your MapReduce jobs.
But they are just not as fast as you’d like.
How do you take your MapReduce jobs to the next level?

Slide 7

Slide 7 text

KNOW THYSELF
Understand your data!
Understand the current performance characteristics of your jobs!
Understand how much harder you can push your cluster!

Slide 10

Slide 10 text

UNDERSTANDING YOUR MAPREDUCE JOBS
You need to have some kind of monitoring in place (e.g., Ganglia).
Your first task is to find out what is limiting the performance of your job.
Watch resource utilization while the job is running: is it CPU, I/O, or memory bound?

Slide 13

Slide 13 text

DISCLAIMER
The algorithmic techniques in this presentation reduce network utilization at the cost of increased CPU and memory utilization in the map tasks.
While trying them out, continue to monitor resource utilization. Bad things can happen if you don’t!
I cannot guarantee that what I’m about to show you will actually improve the performance of your jobs.
Treat implementing these techniques as a science experiment.
That said, I’ve had good results with these techniques.

Slide 18

Slide 18 text

WORD COUNT
Class Mapper
    Method Map(docid a, doc d)
        foreach term t ∈ doc d do
            Emit (term t, count 1)
        end
    end
end
Class Reducer
    Method Reduce(term t, counts [c1, c2, . . .])
        sum ← 0
        foreach count c ∈ counts [c1, c2, . . .] do
            sum ← sum + c
        end
        Emit (term t, count sum)
    end
end
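
The pseudocode above can be sketched in plain Python to make the data flow concrete. This is a toy, single-process stand-in and not Hadoop code: `run_job`, the whitespace tokenizer, and the sample documents are assumptions made for illustration. It also counts how many key-value pairs the mapper emits, which is the quantity local aggregation tries to reduce.

```python
from collections import defaultdict


def map_word_count(doc_id, doc):
    """Mapper: emit (term, 1) for every term occurrence in the document."""
    for term in doc.split():  # assumption: terms are whitespace-separated
        yield term, 1


def reduce_word_count(term, counts):
    """Reducer: sum all counts received for one term."""
    return term, sum(counts)


def run_job(docs, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    emitted = 0
    for doc_id, doc in docs.items():
        for key, value in mapper(doc_id, doc):
            grouped[key].append(value)
            emitted += 1
    results = dict(reducer(key, values) for key, values in grouped.items())
    return results, emitted


docs = {1: "a b a", 2: "b c"}
counts, emitted = run_job(docs, map_word_count, reduce_word_count)
```

In this sketch the mapper emits one pair per term occurrence, so the number of intermediate pairs grows with the size of the input rather than with the number of distinct terms.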

Slide 19

Slide 19 text

COMBINERS
After monitoring this job, we want to improve its performance.
Why not try a combiner?
There is no guarantee how often a combiner is called, if at all.
Using a combiner still requires materializing all key-value pairs.
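
A small Python sketch may help show why the lack of a guarantee matters. Because the framework may apply a combiner zero, one, or several times, the job has to produce the same result in every case. The function names and sample data below are assumptions for illustration, not a Hadoop API.

```python
from collections import defaultdict


def combine(pairs):
    """Combiner: sum counts per term within one map task's output."""
    partial = defaultdict(int)
    for term, count in pairs:
        partial[term] += count
    return list(partial.items())


def reduce_counts(pairs):
    """Reducer: sum whatever counts arrive for each term."""
    totals = defaultdict(int)
    for term, count in pairs:
        totals[term] += count
    return dict(totals)


# The mapper has already materialized every one of these pairs.
map_output = [("a", 1), ("b", 1), ("a", 1)]

not_combined = reduce_counts(map_output)
combined_once = reduce_counts(combine(map_output))
combined_twice = reduce_counts(combine(combine(map_output)))
```

All three results agree because summation is associative and commutative and the combiner's output has the same shape as its input. The combiner only shrinks what is sent on after the mapper has produced every pair.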

Slide 23

Slide 23 text

IN-MAPPER COMBINING
Move the combiner functionality into the mapper
Provides total control over how and when combining takes place
Materialize the minimum number of key-value pairs

Slide 26

Slide 26 text

WORD COUNT
The reducer is the same as before
Class Mapper
    Method Setup
        H ← new Map
    end
    Method Map(docid a, doc d)
        foreach term t ∈ doc d do
            H{t} ← H{t} + 1
        end
    end
    Method Cleanup
        foreach term t ∈ H do
            Emit (term t, count H{t})
        end
    end
end
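
As a rough Python illustration of the mapper above: the class below keeps its counts across calls to `map` and emits only in `cleanup`. The class name, the driver loop, and the whitespace tokenizer are assumptions for this sketch; a real job would rely on the framework to call the setup, map, and cleanup hooks.

```python
from collections import defaultdict


class InMapperCombiningMapper:
    """Word-count mapper that aggregates in memory and emits once at the end."""

    def setup(self):
        # H in the pseudocode: term -> running count for this map task.
        self.counts = defaultdict(int)

    def map(self, doc_id, doc):
        # No pairs are emitted here; state is carried across calls.
        for term in doc.split():
            self.counts[term] += 1

    def cleanup(self):
        # Emit one pair per distinct term seen by this map task.
        return list(self.counts.items())


mapper = InMapperCombiningMapper()
mapper.setup()
for doc_id, doc in {1: "a b a", 2: "b c"}.items():
    mapper.map(doc_id, doc)
emitted = mapper.cleanup()
```

For the same two sample documents, a mapper that emits one pair per term occurrence would produce five pairs; this version emits three, one per distinct term.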

Slide 27

Slide 27 text

POTENTIAL ISSUES
Preserving state across calls to Map may cause bugs
Order-dependent bugs are especially pernicious
Memory usage is a fundamental scalability issue
Don’t forget your monitoring!

Slide 31

Slide 31 text

DEALING WITH MEMORY USAGE
Know your data and your grid!
Memory usage may not be a problem
Periodically flush intermediate key-value pairs
Empirically discover the balance between memory usage and network utilization

Slide 35

Slide 35 text

WORD COUNT
Class Mapper
    Method Setup
        H ← new Map
        BufferSize ← n
    end
    Method Flush
        foreach term t ∈ H do
            Emit (term t, count H{t})
        end
        Clear (H)
    end
    Method Map(docid a, doc d)
        foreach term t ∈ doc d do
            H{t} ← H{t} + 1
        end
        if Size (H) ≥ BufferSize then
            Flush
        end
    end
    Method Cleanup
        Flush
    end
end
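
The flushing mapper can be sketched in Python along the same lines. The class name, the `emit` callback, and the buffer size of 2 are assumptions chosen to keep the example small; in practice the buffer size is the value to tune empirically while monitoring the job.

```python
class BufferedWordCountMapper:
    """Word-count mapper that flushes its in-memory counts when they grow too large."""

    def __init__(self, buffer_size, emit):
        self.buffer_size = buffer_size  # max distinct terms held before flushing
        self.emit = emit                # hypothetical callback standing in for Emit
        self.counts = {}                # H in the pseudocode

    def flush(self):
        for term, count in self.counts.items():
            self.emit(term, count)
        self.counts.clear()

    def map(self, doc_id, doc):
        for term in doc.split():
            self.counts[term] = self.counts.get(term, 0) + 1
        # As in the pseudocode, the size check runs once per document.
        if len(self.counts) >= self.buffer_size:
            self.flush()

    def cleanup(self):
        # Emit whatever is still buffered when the task ends.
        self.flush()


emitted = []
mapper = BufferedWordCountMapper(buffer_size=2, emit=lambda t, c: emitted.append((t, c)))
mapper.map(1, "a b a")  # two distinct terms buffered, so this call flushes
mapper.map(2, "b c")    # again two distinct terms, so this call flushes
mapper.cleanup()        # buffer is already empty
```

With such a small buffer the term `b` is emitted twice, once per flush, so the reducer still has to sum partial counts. A larger buffer emits fewer pairs but holds more in memory.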

Slide 36

Slide 36 text

OVERVIEW
In-mapper combining is not the only local aggregation technique
For more complex problems, consider changing the intermediate key-value types

Slide 38

Slide 38 text

PAIRS APPROACH TO WORD CO-OCCURRENCE
Class Mapper
    Method Map(docid a, doc d)
        foreach term w ∈ doc d do
            foreach term u ∈ Neighbors (w) do
                Emit (pair (w, u), count 1)
            end
        end
    end
end
Class Reducer
    Method Reduce(pair p, counts [c1, c2, . . .])
        sum ← 0
        foreach count c ∈ counts [c1, c2, . . .] do
            sum ← sum + c
        end
        Emit (pair p, count sum)
    end
end
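
A Python sketch of the pairs approach follows. The slides do not define Neighbors, so this example assumes it means the terms within one position of w in the same document; the `neighbors` helper, the `window` parameter, and the sample document are all illustrative assumptions.

```python
from collections import Counter


def neighbors(terms, i, window=1):
    """Assumed definition: terms within `window` positions of index i."""
    lo = max(0, i - window)
    hi = min(len(terms), i + window + 1)
    return [terms[j] for j in range(lo, hi) if j != i]


def map_pairs(doc_id, doc):
    """Mapper: emit ((w, u), 1) for every co-occurring pair."""
    terms = doc.split()
    for i, w in enumerate(terms):
        for u in neighbors(terms, i):
            yield (w, u), 1


def reduce_pairs(emitted):
    """Reducer logic: sum the counts for each pair."""
    totals = Counter()
    for pair, count in emitted:
        totals[pair] += count
    return dict(totals)


emitted_pairs = list(map_pairs(1, "a b a b"))
pair_counts = reduce_pairs(emitted_pairs)
```

Even for this four-term document the mapper emits six intermediate pairs, one per (term, neighbor) occurrence.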

Slide 39

Slide 39 text

IMPROVEMENTS
Can apply in-mapper combining
There will still be a very large number of intermediate key-value pairs
What do you do if the shuffle phase is still the bottleneck?
Change the structure of your key-value pairs!

Slide 43

Slide 43 text

STRIPES APPROACH TO WORD CO-OCCURRENCE
Class Mapper
    Method Map(docid a, doc d)
        foreach term w ∈ doc d do
            H ← new Map
            foreach term u ∈ Neighbors (w) do
                H{u} ← H{u} + 1
            end
            Emit (term w, stripe H)
        end
    end
end
Class Reducer
    Method Reduce(term w, stripes [H1, H2, . . .])
        Hf ← new Map
        foreach stripe H ∈ stripes [H1, H2, . . .] do
            ElementSum (Hf, H)
        end
        Emit (term w, stripe Hf)
    end
end
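
The stripes approach can be sketched in Python in a similar way. As before, the `neighbors` helper with a window of one position is an assumption, since the slides leave Neighbors undefined. `Counter.update` plays the role of ElementSum here.

```python
from collections import Counter


def neighbors(terms, i, window=1):
    """Assumed definition: terms within `window` positions of index i."""
    lo = max(0, i - window)
    hi = min(len(terms), i + window + 1)
    return [terms[j] for j in range(lo, hi) if j != i]


def map_stripes(doc_id, doc):
    """Mapper: emit one stripe (neighbor -> count) per term occurrence."""
    terms = doc.split()
    for i, w in enumerate(terms):
        yield w, Counter(neighbors(terms, i))


def reduce_stripes(emitted):
    """Reducer logic: element-wise sum of all stripes that share a key."""
    final = {}
    for w, stripe in emitted:
        final.setdefault(w, Counter()).update(stripe)
    return final


emitted_stripes = list(map_stripes(1, "a b a b"))
stripe_counts = reduce_stripes(emitted_stripes)
```

For this sample document the mapper emits four stripes, where emitting one pair per (term, neighbor) occurrence would produce six pairs. Each stripe is a larger value to serialize, which is the trade-off to watch while monitoring.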

Slide 44

Slide 44 text

POTENTIAL ISSUES
Memory and CPU utilization will increase
Emit less intermediate data at the cost of more expensive serialization
Don’t forget your monitoring!
In-mapper combining and the flush technique may be used with the stripes approach

Slide 48

Slide 48 text

OTHER RESOURCES
Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer (http://lintool.github.com/MapReduceAlgorithms/index.html)

Slide 49

Slide 49 text

CONTACT INFORMATION
E-mail: [email protected]
Slides are available at my website (http://andrewjamesjohnson.com/talks/) and on GitHub (https://github.com/ajsquared/hadoop-slides)

Slide 50

Slide 50 text

Questions?