people (ourselves)
• It is for dinner at 6pm
• We are starting at 9am
• We’ve reviewed the recipe and gone out for groceries the day before. We have everything we need (kitchen, utensils, raw food materials)
Project .. problem statement
Shreyas | Graduate Student | School of Information, UC Berkeley @seekshreyas
Project .. problem statement
• 5 machines
• Time constraint
• We have the data
• “Find Similar Users”
unskilled
• so, harder to assign roles (everybody wanted QA)
• There were limits on our physical resources
• Kitchen, oven, ...
• Because of our lack of experience, it was harder to foresee what should be done first
• How long / how much
some roles could be more ‘boring’ than others
• We thought it might be better to pair people up on the boring tasks.
• Tasks seemed sequential
[Diagram: 3 worker nodes; stages labelled Preprocessing, Processing, Planning, Output; locations labelled kitchen, living room, dining room, “in his head”]
thinking ..
.. is tunnel vision on your records (stateless)
.. can be done in your head (almost)
Great! But how do I practice?
• “Set up a virtual cluster”: makes my machine slow
• “Get AWS instances”: costly
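One inexpensive way to practice, before reaching for a virtual cluster or AWS instances, is to simulate the map, shuffle, and reduce phases inside a single Python process. The sketch below is an assumption added for illustration, not a tool shown in the talk: `run_mapreduce`, `mapper`, `reducer`, and the sample `(user, item)` records are all hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate one map > shuffle > reduce pass in memory (no cluster)."""
    groups = defaultdict(list)
    # Map phase: each record is handled on its own, with no shared state.
    for record in records:
        for key, value in mapper(record):
            # Shuffle: values are grouped by the key the mapper emitted.
            groups[key].append(value)
    # Reduce phase: the reducer sees one key and all of its values at a time.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def mapper(record):
    # Hypothetical record format: (user, item). Emit the item with a count of 1.
    user, item = record
    yield item, 1

def reducer(item, counts):
    # Sum the counts collected for a single item.
    yield item, sum(counts)

records = [("ann", "pasta"), ("bob", "pasta"), ("bob", "salad")]
result = run_mapreduce(records, mapper, reducer)
print(result)  # [('pasta', 2), ('salad', 1)]
```

Because the mapper only ever looks at one record, the same two functions could in principle be moved to a real cluster later; only the runner would change.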
iterables
return [key,value]
depends on the input protocols
yes
“map > reduce > reduce” possible? “map > reduce > (map) > reduce”
tradeoff: O(N**2) vs map-reduce
• MapReduce comes with its own overhead. Trade off carefully!
• E.g.: Jaccard coefficient for similarity
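The “map > reduce > (map) > reduce” pattern and the Jaccard example can be sketched together. For two users with item sets A and B, the Jaccard coefficient is |A ∩ B| / |A ∪ B|, and |A ∪ B| = |A| + |B| − |A ∩ B|. The code below is a minimal in-memory simulation under stated assumptions, not the talk's implementation: `run_step`, the four mapper/reducer functions, and the `baskets` data are hypothetical, and each user's items are assumed to be a set (no duplicates).

```python
from collections import defaultdict
from itertools import combinations

def run_step(records, mapper, reducer):
    """Simulate one map > shuffle > reduce step over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def map_items(user, items):
    # Step 1 map: emit item -> (user, size of that user's item set).
    for item in items:
        yield item, (user, len(items))

def reduce_pairs(item, users):
    # Step 1 reduce: emit every pair of users who share this item.
    for (u1, n1), (u2, n2) in combinations(sorted(users), 2):
        yield (u1, u2), (n1, n2)

def identity_map(pair, sizes):
    # Step 2 map: the optional "(map)" that passes records through unchanged.
    yield pair, sizes

def reduce_jaccard(pair, sizes):
    # Step 2 reduce: one value arrived per shared item, so len() is |A ∩ B|.
    shared = len(sizes)
    n1, n2 = sizes[0]
    yield pair, shared / (n1 + n2 - shared)

baskets = [("ann", {"a", "b", "c"}), ("bob", {"b", "c", "d"}), ("cat", {"x"})]
pairs = run_step(baskets, map_items, reduce_pairs)
similarity = dict(run_step(pairs, identity_map, reduce_jaccard))
print(similarity)  # {('ann', 'bob'): 0.5}
```

Only user pairs that share at least one item are ever generated, which is where this approach can beat a direct all-pairs O(N**2) comparison; for small inputs the extra grouping work may well cost more than it saves, which is the tradeoff the slide warns about.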
Acknowledgement
been inspired by:
• my professors of the Data Mining class, for which I have been a TA for 2 years: @jblomo “Jim Blomo”, @jretz “Jimmy Retzlaff”
• discussions with students
• and this awesome Jeff Dean talk