
Database System Implementation Final Presentation

nelly_lima

December 05, 2013

Transcript

  1. Large-Scale Data Analysis with Hadoop MapReduce
     Hitesh Chhabra, Arindam Das, Sarthak Jaiswal, Neelima Sailaja (Team number 6)

  2. Goals
     • To get familiarised with Hadoop
     • To understand its working, architecture and deployment strategies
     • To exploit the power of Hadoop to bring out trends, patterns and relations in an available dataset
     • To be able to apply this technology to other applications in the future

  3. Hadoop
     • A framework that allows for the distributed processing of large data sets across clusters of computers
     • Designed to scale up from single servers to thousands of machines
     • Designed to detect and handle failures at the application layer
     (a job-driver sketch follows)
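
     A minimal Hadoop job driver, sketched to show the standard packaging-and-submission pattern the deck refers to; the class names and the Text/IntWritable output types are placeholders, not the project's actual code:

       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Job;
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

       public class Driver {
           public static void main(String[] args) throws Exception {
               // Create a job on the cluster configuration and name it.
               Job job = Job.getInstance(new Configuration(), "hadoop-analysis");
               job.setJarByClass(Driver.class);
               job.setMapperClass(MyMapper.class);      // hypothetical mapper class
               job.setReducerClass(MyReducer.class);    // hypothetical reducer class
               job.setOutputKeyClass(Text.class);
               job.setOutputValueClass(IntWritable.class);
               // Input and output locations come from the command line.
               FileInputFormat.addInputPath(job, new Path(args[0]));
               FileOutputFormat.setOutputPath(job, new Path(args[1]));
               System.exit(job.waitForCompletion(true) ? 0 : 1);
           }
       }
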
  4. Census dataset attributes
     • age • workclass • education • occupation • marital-status • race • sex • native-country

  5. Movies and Remakes datasets
     MOVIES:  • Movie ID • Movie name • Year of release • Director name
     REMAKES: • Film ID • Title • Year • Fraction • Prior film ID

  7. Pairs
     • Emit all pairs with dummy counters from the mappers
     • Sum these counters in the reducer
     • The benefit from combiners is limited, as it is likely that all pairs are distinct
     • No in-memory accumulation
     (a code sketch follows)
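
     A minimal sketch of the Pairs pattern, assuming comma-separated records whose first two fields are the co-occurring items; the deck does not show its actual code, so the field positions are hypothetical:

       import java.io.IOException;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;

       public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
           private static final IntWritable ONE = new IntWritable(1);
           private final Text pair = new Text();

           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] fields = value.toString().split(",");
               // One emit per record: the (first item, second item) pair plus a dummy counter.
               pair.set(fields[0] + "\t" + fields[1]);
               context.write(pair, ONE);
           }
       }

       class PairsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
           @Override
           protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
                   throws IOException, InterruptedException {
               int sum = 0;
               for (IntWritable c : counts) sum += c.get();   // plain sum of the dummy counters
               context.write(pair, new IntWritable(sum));
           }
       }

     When nearly every emitted pair is distinct, as the slide notes, a combiner has almost nothing to merge, which is why its benefit is limited here.
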
  8. Stripes
     • Group data by the first item in each pair and maintain an associative array ("stripe") in which counters for all adjacent items are accumulated
     • The reducer receives all stripes for a leading item i, merges them, and emits the same result as the Pairs approach
     • Generates fewer intermediate keys, so the framework has less sorting to do
     • Greatly benefits from combiners
     • Performs in-memory accumulation, which can lead to problems if not implemented properly
     • More complex implementation
     (a code sketch follows)
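
     A minimal sketch of the Stripes pattern under the same hypothetical record layout as the Pairs sketch above:

       import java.io.IOException;
       import java.util.Map;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.MapWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.io.Writable;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;

       public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] fields = value.toString().split(",");
               // One stripe per record: {adjacent item -> 1}, keyed by the leading item.
               MapWritable stripe = new MapWritable();
               stripe.put(new Text(fields[1]), new IntWritable(1));
               context.write(new Text(fields[0]), stripe);
           }
       }

       class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
           @Override
           protected void reduce(Text item, Iterable<MapWritable> stripes, Context context)
                   throws IOException, InterruptedException {
               MapWritable merged = new MapWritable();
               for (MapWritable stripe : stripes) {
                   // Element-wise sum: accumulate each adjacent item's counter in memory.
                   for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                       IntWritable prev = (IntWritable) merged.get(e.getKey());
                       int add = ((IntWritable) e.getValue()).get();
                       merged.put(e.getKey(), new IntWritable(prev == null ? add : prev.get() + add));
                   }
               }
               context.write(item, merged);
           }
       }

     Because stripe merging is associative, the same reducer class can be registered as a combiner (job.setCombinerClass), which is where the pattern's combiner benefit comes from.
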
  9. Pairs vs Stripes

     Pairs                                        Stripes
     Slower                                       Faster
     1 emit per record                            Fewer emits
     Easier to implement                          More difficult to implement
     More reduce input keys                       Fewer reduce input keys
       (841 groups for 841 records)                 (72 groups for 841 records)
     Heap usage: 2X: 563 MB, 10X: 570 MB          Heap usage: 2X: 567 MB, 10X: 567 MB

  10. Join
     • Left-outer join, to ensure that all remakes are included
     • Two distinct mapper classes for reading from the two files
     • Also implemented a partitioner for sorting and grouping
     (a code sketch follows)
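
     A minimal sketch of the reduce-side left-outer join. The record layouts and the "M"/"R" tag strings are assumptions, and the deck's partitioner-based sorting and grouping is replaced here by simply buffering the remakes in the reducer:

       import java.io.IOException;
       import java.util.ArrayList;
       import java.util.List;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;

       public class MoviesMapper extends Mapper<LongWritable, Text, Text, Text> {
           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] f = value.toString().split(",");   // assumed layout: id, name, year
               context.write(new Text(f[0]), new Text("M," + f[1] + "," + f[2]));
           }
       }

       class RemakesMapper extends Mapper<LongWritable, Text, Text, Text> {
           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] f = value.toString().split(",");   // assumed layout: remakeId, remakeName, priorFilmId
               context.write(new Text(f[2]), new Text("R," + f[1]));   // key on the prior film's id
           }
       }

       class JoinReducer extends Reducer<Text, Text, Text, Text> {
           @Override
           protected void reduce(Text priorId, Iterable<Text> values, Context context)
                   throws IOException, InterruptedException {
               String movie = null;                        // name,year of the prior film, if present
               List<String> remakes = new ArrayList<>();
               for (Text v : values) {
                   String s = v.toString();
                   if (s.startsWith("M,")) movie = s.substring(2);
                   else remakes.add(s.substring(2));
               }
               // Left-outer semantics: every remake is emitted, even if the prior film is missing.
               for (String remake : remakes) {
                   context.write(new Text(remake), new Text(movie == null ? "null" : movie));
               }
           }
       }

     In the driver, MultipleInputs.addInputPath binds each input file to its own mapper class, which is how the two distinct mapper codes read from the two files.
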
  11. Join walkthrough (MAP, REDUCE, OUTPUT)
     MOVIES:  (id1, n1, y1) (id2, n2, y2) (id3, n3, y3) (id4, n4, y4)
     REMAKES: (rid1, rn1, id4) (rid2, rn2, id1) (rid2, rn2, id2) (rid3, rn3, id4)
     MAP:     keys every record by movie id: id1 -> [n1 y1; rn2], id2 -> [n2 y2; rn2], id3 -> [n3 y3], id4 -> [n4 y4; rn1; rn3]
     OUTPUT:  (rn2, n1, y1) (rn2, n2, y2) (rn1, n4, y4) (rn3, n4, y4)

  12. Aggregation
     • The mapper extracts from each tuple the values to group by and aggregate, and emits them
     • The reducer receives the values to be aggregated already grouped and calculates an aggregation function
     • Typical aggregation functions like sum or max can be calculated in a streaming fashion, so the reducer need not hold all values simultaneously
     • Our tweak: support parameterized, dynamic queries by providing command-line arguments at job run time
     (a code sketch follows)
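
     A sketch of the parameterized-query tweak, assuming the group-by and aggregate column indices travel from the command line to the mappers via the job Configuration; the property names here are hypothetical:

       import java.io.IOException;
       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;

       public class AggMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
           private int groupCol, aggCol;

           @Override
           protected void setup(Context context) {
               Configuration conf = context.getConfiguration();
               groupCol = conf.getInt("agg.group.col", 0);   // set from args in the driver
               aggCol = conf.getInt("agg.value.col", 1);
           }

           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] f = value.toString().split(",");
               // Emit (group-by value, value to aggregate) as chosen at job run time.
               context.write(new Text(f[groupCol]), new LongWritable(Long.parseLong(f[aggCol])));
           }
       }

       class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
           @Override
           protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                   throws IOException, InterruptedException {
               long sum = 0;                                 // streaming sum; no buffering of values
               for (LongWritable v : values) sum += v.get();
               context.write(key, new LongWritable(sum));
           }
       }

     The driver would forward the query parameters with something like conf.setInt("agg.group.col", Integer.parseInt(args[2])) before submitting the job.
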
  13. Challenges
     • Setup and a steep learning curve
     • Format combinations between the mapper and the reducer
     • Different input and output formats for the mapper and the reducer
     • Extending Writable to create a composite POJO (a sketch follows)
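
     A minimal sketch of that last challenge: a composite POJO made serializable for the shuffle by implementing Writable (the field names are illustrative, not from the project):

       import java.io.DataInput;
       import java.io.DataOutput;
       import java.io.IOException;
       import org.apache.hadoop.io.Writable;

       public class MovieRecord implements Writable {
           private String name;
           private int year;

           public MovieRecord() {}                     // no-arg constructor required by Hadoop

           @Override
           public void write(DataOutput out) throws IOException {
               out.writeUTF(name);                     // serialize fields in a fixed order
               out.writeInt(year);
           }

           @Override
           public void readFields(DataInput in) throws IOException {
               name = in.readUTF();                    // deserialize in exactly the same order
               year = in.readInt();
           }
       }
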
  14. Conclusion
     • From setup to deployment: the whole nine yards
     • The power of MapReduce at huge scale
     • Mastering the internals of the Hadoop MapReduce framework

  15. Future Work
     • Hadoop's power is harnessed on big data on the order of petabytes
     • Optimizing the number of mappers and reducers according to the corpus size
     • Optimizing the join