
MapReduce Lasagna

Explains MapReduce concepts through the scenario of preparing a lasagna, and introduces the mrjob library.

Shreyas

April 15, 2014

Transcript

  1. MAP-REDUCE & MRJOB

    Shreyas | Graduate Student | School of Information, UC Berkeley @seekshreyas
  2. MAP-REDUCE & MRJOB

    “thinking through it” “working through it”
  3. Assumptions: “Make a Lasagna”

    Project .. problem statement
    • We are making it for 5 people (ourselves)
    • It is for dinner at 6pm
    • We are starting at 9am
    • We’ve reviewed the recipe and gone out for groceries the day before. We have everything we need (kitchen, utensils, raw food materials)
  4. Assumptions: “Make a Lasagna”

    Project .. problem statement
    • We are making it for 5 people (ourselves)
    • It is for dinner at 6pm
    • We are starting at 9am
    • We’ve reviewed the recipe and gone out for groceries the day before. We have everything we need (kitchen, utensils, raw food materials)
    Project equivalents: “Find Similar Users”, 5 machines, time constraint, we have the data
  5. Initial Approach

    We started with the Gantt chart approach
  6. Problems we faced

    • We were all equally skilled and equally unskilled
      • so it was harder to assign roles (everybody wanted QA)
    • There were limits on our physical resources
      • kitchen, oven, ...
    • Because of our lack of experience, it was harder to foresee what should be done first
      • how long / how much
  7. Problems we faced

    • Also, if we were to assign roles, some roles could be more ‘boring’ than others
      • We thought it might be better to pair people up on the boring stuff
    • Tasks seemed sequential
  8. So we looked more granularly and applied what we know better ...
  9. MapReduce

    map: distribute, emitting (key, value)
    shuffle: group the values by key into (key, [value])
    reduce: aggregate each (key, [value])
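
The three phases on this slide can be sketched in plain Python, with no cluster and no library. This is a minimal illustration rather than how any real framework is implemented; the function names and the lasagna task data are made up for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record on its own; collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key, turning (key, value) pairs into key -> [value]."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Apply the reducer to each (key, [value]) group."""
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results


# Hypothetical job: count how many prep tasks each lasagna component needs.
def task_mapper(task):
    component, _description = task
    yield component, 1


def count_reducer(component, ones):
    yield component, sum(ones)


tasks = [
    ("sauce", "chop onions"),
    ("pasta", "boil sheets"),
    ("sauce", "simmer tomatoes"),
    ("cheese", "grate mozzarella"),
    ("sauce", "season"),
]

pairs = map_phase(tasks, task_mapper)
grouped = shuffle_phase(pairs)
counts = reduce_phase(grouped, count_reducer)
print(counts)
```

The mapper only ever sees one task and the reducer only ever sees one key, so either phase could be handed to separate workers without them talking to each other.
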
  10. MapReduce

    worker nodes 1, worker nodes 2, worker nodes 3, worker nodes 4
  11. MapReduce

    master node
    worker nodes 1, worker nodes 2, worker nodes 3, worker nodes 4
  12. MapReduce

    master node
    worker nodes 1, worker nodes 2, worker nodes 3, worker nodes 4
    Stages: Preprocessing, Processing, Planning, Output
    Places: kitchen, living room, dining room, in his head
  13. MAP-REDUCE

    .. is a paradigm
    .. is a way of thinking
    .. is tunnel vision on your records (stateless)
    .. can be done in your head (almost)
  14. MAP-REDUCE

    .. can be done in your head (almost)
  15. MAP-REDUCE

    .. can be done in your head (almost)
    Great! But how do I practice?
  16. MAP-REDUCE

    .. can be done in your head (almost)
    Great! But how do I practice?
    • “set up a virtual cluster”
  17. MAP-REDUCE

    .. can be done in your head (almost)
    Great! But how do I practice?
    • “set up a virtual cluster”: makes my machine slow
  18. MAP-REDUCE

    .. can be done in your head (almost)
    Great! But how do I practice?
    • “set up a virtual cluster”: makes my machine slow
    • “get AWS instances”
  19. MAP-REDUCE

    .. can be done in your head (almost)
    Great! But how do I practice?
    • “set up a virtual cluster”: makes my machine slow
    • “get AWS instances”: costly
  20. MRJOB

    Python library for MapReduce
  21. mrjob: a Python library

    $ pip install mrjob
    review_wordcount.py # count total words in each review
    • import library
  22. mrjob: a Python library

    $ pip install mrjob
    review_wordcount.py # count total words in each review
    • import library
    • instantiate MRJob object
  23. mrjob: a Python library

    $ pip install mrjob
    review_wordcount.py # count total words in each review
    • import library
    • instantiate MRJob object
    • mapper()
  24. mrjob: a Python library

    $ pip install mrjob
    review_wordcount.py # count total words in each review
    • import library
    • instantiate MRJob object
    • mapper()
    • reducer()
  25. mrjob: a Python library

    $ pip install mrjob
    review_wordcount.py # count total words in each review
    • import library
    • instantiate MRJob object
    • mapper()
    • reducer()
    • steps()
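
The code for review_wordcount.py is not part of this transcript, so the following is only a sketch of what a job with those pieces (import, MRJob class, mapper(), reducer(), steps()) might look like. It assumes each input line is one JSON review with `review_id` and `text` fields; the class name `MRReviewWordCount` is made up. The guarded import is an illustration-only convenience, not an mrjob convention.

```python
import json

try:
    from mrjob.job import MRJob
    from mrjob.step import MRStep
except ImportError:
    # Illustration-only fallback so the mapper and reducer can still be
    # called by hand on a machine without mrjob; it cannot run a job.
    MRJob, MRStep = object, None


class MRReviewWordCount(MRJob):
    """Count total words in each review (assumes one JSON review per line)."""

    def mapper(self, _, line):
        # With mrjob's default input protocol the key is None and the
        # value is the raw line of text.
        review = json.loads(line)
        for _word in review["text"].split():
            yield review["review_id"], 1  # yield the pair, don't print it

    def reducer(self, review_id, ones):
        # All the 1s emitted for a single review arrive together.
        yield review_id, sum(ones)

    def steps(self):
        # Optional for a one-step job; shown because the slides list it.
        return [MRStep(mapper=self.mapper, reducer=self.reducer)]
```

To run it as a job, the usual mrjob convention is to end the file with `if __name__ == "__main__": MRReviewWordCount.run()` and invoke it as `python review_wordcount.py reviews.json`, where `reviews.json` is a hypothetical input file. With mrjob installed this should run on the local machine unless another runner is selected.
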
  26. mrjob, MapReduce gotchas

    • don’t print <key,value> ... yield it
  27. mrjob, MapReduce gotchas

    • don’t print <key,value> ... yield it
    • return iterables: return [key,value] (depends on the input protocols)
  28. mrjob, MapReduce gotchas

    • don’t print <key,value> ... yield it
    • return iterables: return [key,value] (depends on the input protocols)
    • “map > reduce > reduce” possible? yes: “map > reduce > (map) > reduce”
  29. mrjob, MapReduce gotchas

    • don’t print <key,value> ... yield it
    • return iterables: return [key,value] (depends on the input protocols)
    • “map > reduce > reduce” possible? yes: “map > reduce > (map) > reduce”
    • tradeoff, O(N**2) vs map-reduce: MapReduce comes with its own overhead. Trade off carefully! E.g. Jaccard coefficient for similarity
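
The “map > reduce > (map) > reduce” gotcha can be tried in plain Python. The sketch below simulates two chained steps, where the second step's mapper passes pairs through unchanged. `run_step` and the other names are hypothetical helpers written for this example, not part of mrjob.

```python
from collections import defaultdict


def run_step(pairs, mapper, reducer):
    """Simulate one map > shuffle > reduce step over (key, value) pairs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    output = []
    for key, values in grouped.items():
        output.extend(reducer(key, values))
    return output


def split_words(_, line):
    for word in line.split():
        yield word, 1  # a generator: pairs are yielded, never printed


def count_words(word, ones):
    # Re-key every count under one shared key so the next reducer sees all.
    yield None, (sum(ones), word)


def identity(key, value):
    # The "(map)" in "map > reduce > (map) > reduce": pass pairs through.
    yield key, value


def most_common(_, count_word_pairs):
    yield max(count_word_pairs)


lines = [(None, "layer sauce layer pasta"), (None, "layer cheese")]
step1 = run_step(lines, split_words, count_words)
step2 = run_step(step1, identity, most_common)
print(step2)
```

The first reducer decides what the second reducer receives by choosing the key it yields, which is why a reducer can feed another reducer with only a pass-through map in between.
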
  30. “Find Similar Users” .. an example

    Yelp Academic Dataset
  31. “Find Similar Users” .. an example
  32. “Find Similar Users” .. an example

    mapper1() yield [user_id, business_id]
  33. “Find Similar Users” .. an example

    reducer1() [user_id, business_id] yield [user_id, list(set(business_id))]
  34. “Find Similar Users” .. an example

    reducer1() [user_id, business_id] yield [user_id, list(set(business_id))]
    [user_id, [business_id]] mapper2()
  35. “Find Similar Users” .. an example

    reducer1() [user_id, business_id] yield [user_id, list(set(business_id))]
    reducer2() yield “all”, [user_id, business_ids]
    [user_id, [business_id]] mapper2()
  36. “Find Similar Users” .. an example

    reducer1() [user_id, business_id] yield [user_id, list(set(business_id))]
    reducer2() yield “all”, [user_id, business_ids]
    [user_id, [business_id]] mapper2()
    for (user_a, biz_a), (user_b, biz_b) in itertools.combinations(user_biz_pairs, r=2): yield jaccard(biz_a, biz_b)
    reducer3()
  37. “Find Similar Users” .. an example

    O(N**2) vs map-reduce tradeoff
    reducer1() [user_id, business_id] yield [user_id, list(set(business_id))]
    reducer2() yield “all”, [user_id, business_ids]
    [user_id, [business_id]] mapper2()
    for (user_a, biz_a), (user_b, biz_b) in itertools.combinations(user_biz_pairs, r=2): yield jaccard(biz_a, biz_b)
    reducer3()
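
The pipeline on these slides can be simulated end to end in plain Python. The transcript does not make the exact step boundaries clear, so the split below (mapper1 > reducer1, then a pass-through map into reducer2, then a pass-through map into reducer3) is one possible reading, not necessarily the author's job. `run_step`, `identity`, the `jaccard` body and the sample reviews are all assumptions made for the sketch, and the last reducer yields the user pair as a key, which the slide does not show.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def run_step(pairs, mapper, reducer):
    """Simulate one map > shuffle > reduce step over (key, value) pairs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    output = []
    for key, values in grouped.items():
        output.extend(reducer(key, values))
    return output


def identity(key, value):
    yield key, value


def mapper1(_, review):
    yield review["user_id"], review["business_id"]


def reducer1(user_id, business_ids):
    # Deduplicate: a user may review the same business more than once.
    yield user_id, sorted(set(business_ids))


def reducer2(user_id, business_id_lists):
    # Re-key everything under "all" so one reducer sees every user.
    for business_ids in business_id_lists:
        yield "all", (user_id, business_ids)


def reducer3(_, user_biz_pairs):
    for (user_a, biz_a), (user_b, biz_b) in combinations(user_biz_pairs, r=2):
        yield (user_a, user_b), jaccard(biz_a, biz_b)


# Made-up records using the field names from the slides.
reviews = [
    {"user_id": "u1", "business_id": "b1"},
    {"user_id": "u1", "business_id": "b2"},
    {"user_id": "u2", "business_id": "b2"},
    {"user_id": "u2", "business_id": "b3"},
    {"user_id": "u1", "business_id": "b1"},
    {"user_id": "u3", "business_id": "b4"},
]

step1 = run_step([(None, r) for r in reviews], mapper1, reducer1)
step2 = run_step(step1, identity, reducer2)
step3 = run_step(step2, identity, reducer3)
scores = dict(step3)
print(scores)
```

With N users the last reducer compares N(N-1)/2 pairs under the single key “all”, so that step gains nothing from extra machines. This is the O(N**2) cost that the slide weighs against the overhead of MapReduce.
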
  38. Acknowledgement

    A lot of the ideas discussed in the slides have been inspired by:
    • my professors of the Data Mining class, for which I have been a TA for 2 years: @jblomo “Jim Blomo” and @jretz “Jimmy Retzlaff”
    • discussions with students
    • this awesome Jeff Dean talk