Using Local Aggregation In MapReduce Jobs

An overview of local aggregation techniques to improve the performance of MapReduce jobs on Hadoop.

Andrew Johnson

March 25, 2013

Transcript

  1. INTRODUCTION KNOW THYSELF IN-MAPPER COMBINING PAIRS AND STRIPES WRAPPING UP

    Using Local Aggregation in MapReduce Jobs
    Andrew Johnson
    March 25, 2013
  2. WHO AM I?

    Software engineer at Explorys.
    CWRU grad in Computer Science, May 2012.
  3. OVERVIEW

    You’ve got your cluster set up and configured well.
    You’ve written your MapReduce jobs.
    But they are just not as fast as you’d like.
    How do you take your MapReduce jobs to the next level?
  7. KNOW THYSELF

    Understand your data!
    Understand the current performance characteristics of your jobs!
    Understand how much harder you can push your cluster!
  10. UNDERSTANDING YOUR MAPREDUCE JOBS

    You need to have some kind of monitoring in place (e.g., Ganglia).
    Your first task is to find out what is limiting the performance of your job.
    Watch resource utilization while the job is running: is it CPU, I/O, or memory bound?
  13. DISCLAIMER

    The algorithmic techniques in this presentation reduce network utilization at the cost of increased CPU and memory utilization in the map tasks.
    While trying them out, continue to monitor resource utilization. Bad things can happen if you don’t!
    I cannot guarantee that what I’m about to show you will actually improve the performance of your jobs.
    Treat implementing these techniques as a science experiment.
    That said, I’ve had good results with these techniques.
  18. WORD COUNT

    Class Mapper
      Method Map(docid a, doc d)
        foreach term t ∈ doc d do
          Emit (term t, count 1)
        end
      end
    end

    Class Reducer
      Method Reduce(term t, counts [c1, c2, . . .])
        sum ← 0
        foreach count c ∈ counts [c1, c2, . . .] do
          sum ← sum + c
        end
        Emit (term t, count sum)
      end
    end
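
The word count pseudocode can be tried out without a cluster. The following is a minimal Python sketch, not Hadoop code: `run_job` is a hypothetical single-process driver that stands in for the framework's shuffle, and splitting documents on whitespace is an assumption about tokenization.

```python
from collections import defaultdict


def map_word_count(doc_id, doc):
    """Emit (term, 1) for every term occurrence in the document."""
    for term in doc.split():  # assumption: whitespace tokenization
        yield term, 1


def reduce_word_count(term, counts):
    """Sum all counts received for one term."""
    return term, sum(counts)


def run_job(docs, mapper, reducer):
    """Hypothetical toy driver: map, group by key (the 'shuffle'), reduce."""
    grouped = defaultdict(list)
    emitted = 0  # number of intermediate key-value pairs produced by the mapper
    for doc_id, doc in docs.items():
        for key, value in mapper(doc_id, doc):
            grouped[key].append(value)
            emitted += 1
    results = dict(reducer(key, values) for key, values in grouped.items())
    return results, emitted


docs = {1: "a b a", 2: "b a c"}
counts, emitted = run_job(docs, map_word_count, reduce_word_count)
```

The `emitted` counter is the figure to watch here: this mapper produces one intermediate pair per term occurrence, which is what the local aggregation techniques aim to reduce.
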
  19. COMBINERS

    After monitoring this job, we want to improve its performance.
    Why not try a combiner?
    There is no guarantee how often a combiner is called, if at all.
    Using a combiner still requires materializing all key-value pairs.
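
To illustrate the last point, here is a small Python sketch of a single map task. It is a simulation under stated assumptions, not Hadoop's API: `run_map_task` and its `use_combiner` flag are hypothetical, with the flag standing in for the framework's own decision about whether to run the combiner.

```python
from collections import defaultdict


def map_word_count(doc_id, doc):
    """Emit (term, 1) for every term occurrence."""
    for term in doc.split():
        yield term, 1


def combine(pairs):
    """Sum values per key within one map task's output."""
    partial = defaultdict(int)
    for term, count in pairs:
        partial[term] += count
    return list(partial.items())


def run_map_task(docs, use_combiner):
    """Return (pairs materialized by the mapper, pairs sent to the shuffle)."""
    # Every pair is created by the mapper before the combiner sees any of them.
    materialized = [kv for doc_id, doc in docs for kv in map_word_count(doc_id, doc)]
    shuffled = combine(materialized) if use_combiner else materialized
    return len(materialized), len(shuffled)


split = [(1, "a b a"), (2, "b a c")]
without_combiner = run_map_task(split, use_combiner=False)
with_combiner = run_map_task(split, use_combiner=True)
```

In this sketch the combiner shrinks what reaches the shuffle, but the mapper still creates the same number of pairs either way.
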
  23. IN-MAPPER COMBINING

    Move the combiner functionality into the mapper.
    This provides total control over how and when combining takes place.
    Materialize the minimum number of key-value pairs.
  26. WORD COUNT

    The reducer is the same as before.

    Class Mapper
      Method Setup
        H ← new Map<term → count>
      end
      Method Map(docid a, doc d)
        foreach term t ∈ doc d do
          H{t} ← H{t} + 1
        end
      end
      Method Cleanup
        foreach term t ∈ H do
          Emit (term t, count H{t})
        end
      end
    end
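
The in-mapper combining pattern can be sketched in Python as follows. The class and the `run_map_task` driver are hypothetical stand-ins for the Setup/Map/Cleanup lifecycle in the pseudocode, not a real framework API.

```python
from collections import defaultdict


class InMapperCombiningMapper:
    """Keeps partial counts in memory for the lifetime of one map task."""

    def setup(self):
        self.counts = defaultdict(int)  # H in the pseudocode

    def map(self, doc_id, doc):
        for term in doc.split():  # assumption: whitespace tokenization
            self.counts[term] += 1
        return []  # nothing is emitted per input record

    def cleanup(self):
        # Emit one pair per distinct term seen by this map task.
        return list(self.counts.items())


def run_map_task(mapper, docs):
    """Hypothetical driver: setup once, map every record, cleanup once."""
    mapper.setup()
    emitted = []
    for doc_id, doc in docs:
        emitted.extend(mapper.map(doc_id, doc))
    emitted.extend(mapper.cleanup())
    return emitted


emitted = run_map_task(InMapperCombiningMapper(), [(1, "a b a"), (2, "b a c")])
```

For these two toy documents the task emits three pairs, one per distinct term, instead of one pair per term occurrence.
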
  27. POTENTIAL ISSUES

    Preserving state across calls to Map may cause bugs.
    Order-dependent bugs are especially pernicious.
    Memory usage is a fundamental scalability issue.
    Don’t forget your monitoring!
  31. DEALING WITH MEMORY USAGE

    Know your data and your grid! Memory usage may not be a problem.
    Periodically flush intermediate key-value pairs.
    Empirically discover the balance between memory usage and network utilization.
  35. WORD COUNT

    Class Mapper
      Method Setup
        H ← new Map<term → count>
        BufferSize ← n
      end
      Method Flush
        foreach term t ∈ H do
          Emit (term t, count H{t})
        end
        Clear (H)
      end
      Method Map(docid a, doc d)
        foreach term t ∈ doc d do
          H{t} ← H{t} + 1
        end
        if Size (H) ≥ BufferSize then
          Flush
        end
      end
      Method Cleanup
        Flush
      end
    end
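
The flush variant can be sketched in Python as below. `BufferedMapper` is a hypothetical class mirroring the pseudocode, and the tiny `buffer_size` of 2 is chosen only to make the flush visible; a useful value would have to be found empirically.

```python
class BufferedMapper:
    """In-mapper combining with a bound on the number of buffered terms."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size  # BufferSize in the pseudocode
        self.counts = {}                # H in the pseudocode

    def flush(self):
        # Emit everything buffered so far, then free the memory.
        emitted = list(self.counts.items())
        self.counts.clear()
        return emitted

    def map(self, doc_id, doc):
        for term in doc.split():  # assumption: whitespace tokenization
            self.counts[term] = self.counts.get(term, 0) + 1
        # The size check runs once per document, as in the pseudocode.
        if len(self.counts) >= self.buffer_size:
            return self.flush()
        return []

    def cleanup(self):
        return self.flush()


mapper = BufferedMapper(buffer_size=2)
first = mapper.map(1, "a b a")   # buffer reaches 2 distinct terms, so it flushes
second = mapper.map(2, "c")      # buffer holds 1 term, nothing emitted yet
remaining = mapper.cleanup()     # final flush emits what is left
```

Because the buffer is checked after each document rather than after each term, a single very large document can still push the map past `buffer_size` before a flush happens.
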
  36. OVERVIEW

    In-mapper combining is not the only local aggregation technique.
    For more complex problems, consider changing the intermediate key-value types.
  38. PAIRS APPROACH TO WORD CO-OCCURRENCE

    Class Mapper
      Method Map(docid a, doc d)
        foreach term w ∈ doc d do
          foreach term u ∈ Neighbors (w) do
            Emit (pair (w,u), count 1)
          end
        end
      end
    end

    Class Reducer
      Method Reduce(pair p, counts [c1, c2, . . .])
        sum ← 0
        foreach count c ∈ counts [c1, c2, . . .] do
          sum ← sum + c
        end
        Emit (pair p, count sum)
      end
    end
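
A Python sketch of the pairs approach follows. The slides do not define `Neighbors`, so the `neighbors` helper here is an assumption: it treats terms within a fixed window of positions as neighbors. `reduce_pairs` folds the grouping and summing into one step for brevity.

```python
from collections import Counter


def neighbors(terms, i, window=1):
    """Assumed definition: terms within `window` positions of index i."""
    lo = max(0, i - window)
    hi = min(len(terms), i + window + 1)
    return [terms[j] for j in range(lo, hi) if j != i]


def map_pairs(doc_id, doc):
    """Emit ((w, u), 1) for every co-occurring pair of terms."""
    terms = doc.split()
    for i, w in enumerate(terms):
        for u in neighbors(terms, i):
            yield (w, u), 1


def reduce_pairs(pairs):
    """Group by pair and sum the counts (shuffle and reduce in one step)."""
    totals = Counter()
    for pair, count in pairs:
        totals[pair] += count
    return dict(totals)


emitted = list(map_pairs(1, "a b a"))
totals = reduce_pairs(emitted)
```

Each co-occurrence becomes its own intermediate record, so the number of emitted pairs grows with the document length times the window size.
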
  39. IMPROVEMENTS

    You can apply in-mapper combining.
    There will still be a very large number of intermediate key-value pairs.
    What do you do if the shuffle phase is still the bottleneck?
    Change the structure of your key-value pairs!
  43. STRIPES APPROACH TO WORD CO-OCCURRENCE

    Class Mapper
      Method Map(docid a, doc d)
        foreach term w ∈ doc d do
          H ← new Map<term → count>
          foreach term u ∈ Neighbors (w) do
            H{u} ← H{u} + 1
          end
          Emit (term w, stripe H)
        end
      end
    end

    Class Reducer
      Method Reduce(term w, stripes [H1, H2, . . .])
        Hf ← new Map<term → count>
        foreach stripe H ∈ stripes [H1, H2, . . .] do
          ElementSum (Hf, H)
        end
        Emit (term w, stripe Hf)
      end
    end
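
The stripes approach can be sketched in Python with `collections.Counter` playing the role of the stripe. As before, the `neighbors` helper is an assumed window-based definition, since the slides leave `Neighbors` unspecified, and the grouping loop is a hypothetical stand-in for the shuffle.

```python
from collections import Counter, defaultdict


def neighbors(terms, i, window=1):
    """Assumed definition: terms within `window` positions of index i."""
    lo = max(0, i - window)
    hi = min(len(terms), i + window + 1)
    return [terms[j] for j in range(lo, hi) if j != i]


def map_stripes(doc_id, doc):
    """Emit one (term, stripe) record per term occurrence."""
    terms = doc.split()
    for i, w in enumerate(terms):
        yield w, Counter(neighbors(terms, i))


def reduce_stripes(term, stripes):
    """Element-wise sum of all stripes received for one term."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # Counter.update adds counts key by key
    return term, dict(total)


# Toy shuffle: group the emitted stripes by term.
grouped = defaultdict(list)
for w, stripe in map_stripes(1, "a b a"):
    grouped[w].append(stripe)
result = dict(reduce_stripes(w, stripes) for w, stripes in grouped.items())
```

Here the mapper emits one record per term occurrence regardless of how many neighbors that occurrence has; the price is that each record is a map that is more expensive to serialize.
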
  44. POTENTIAL ISSUES

    Memory and CPU utilization will increase.
    Emit less intermediate data at the cost of more expensive serialization.
    Don’t forget your monitoring!
    In-mapper combining and the flush technique may be used with the stripes approach.
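
As one possible illustration of combining the two ideas, the Python sketch below accumulates stripes across calls to `map` and emits them only in `cleanup`. The class name, the window-based neighbor definition, and the absence of a size-based flush are all simplifying assumptions, not something the slides specify.

```python
from collections import Counter, defaultdict


class StripesInMapperCombiner:
    """Accumulates one stripe per term for the lifetime of a map task."""

    def __init__(self, window=1):
        self.window = window                 # assumed neighbor window
        self.stripes = defaultdict(Counter)  # term -> stripe of neighbor counts

    def map(self, doc_id, doc):
        terms = doc.split()
        for i, w in enumerate(terms):
            lo = max(0, i - self.window)
            hi = min(len(terms), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    self.stripes[w][terms[j]] += 1

    def cleanup(self):
        # Emit one merged stripe per distinct term, then release the memory.
        emitted = [(w, dict(stripe)) for w, stripe in self.stripes.items()]
        self.stripes.clear()
        return emitted


mapper = StripesInMapperCombiner()
mapper.map(1, "a b a")
mapper.map(2, "b a")
emitted = dict(mapper.cleanup())
```

A bounded variant would call the same emit-and-clear logic from `map` once the number of buffered terms passes a threshold, which is the flush technique applied to stripes.
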
  48. OTHER RESOURCES

    Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer (http://lintool.github.com/MapReduceAlgorithms/index.html)
  49. CONTACT INFORMATION

    E-mail: [email protected]
    Slides are available at my website (http://andrewjamesjohnson.com/talks/) and on GitHub (https://github.com/ajsquared/hadoop-slides).