
Database System Implementation Final Presentation

nelly_lima

December 05, 2013

Transcript

  1. Large-Scale Data Analysis with Hadoop MapReduce
     Hitesh Chhabra, Arindam Das, Sarthak Jaiswal, Neelima Sailaja (Team number 6)

  2. Goals
     • To get familiarised with Hadoop
     • To understand its working, architecture and deployment strategies
     • To exploit the power of Hadoop to bring out trends, patterns and relations in an available dataset
     • To be able to apply this technology to other applications in the future

  3. Hadoop
     • A framework that allows for the distributed processing of large data sets across clusters of computers
     • Designed to scale up from single servers to thousands of machines
     • Designed to detect and handle failures at the application layer
     (a job-driver sketch follows)
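
     A minimal Hadoop job driver, sketched to show the standard packaging-and-submission pattern the deck refers to; the class names and the Text/IntWritable output types are placeholders, not the project's actual code:

       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Job;
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

       public class Driver {
           public static void main(String[] args) throws Exception {
               // Create a job on the cluster configuration and name it.
               Job job = Job.getInstance(new Configuration(), "hadoop-analysis");
               job.setJarByClass(Driver.class);
               job.setMapperClass(MyMapper.class);      // hypothetical mapper class
               job.setReducerClass(MyReducer.class);    // hypothetical reducer class
               job.setOutputKeyClass(Text.class);
               job.setOutputValueClass(IntWritable.class);
               // Input and output locations come from the command line.
               FileInputFormat.addInputPath(job, new Path(args[0]));
               FileOutputFormat.setOutputPath(job, new Path(args[1]));
               System.exit(job.waitForCompletion(true) ? 0 : 1);
           }
       }
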
  4. Census dataset attributes
     • age • workclass • education • occupation • marital-status • race • sex • native-country

  5. Movies and Remakes datasets
     MOVIES:  • Movie ID • Movie name • Year of release • Director name
     REMAKES: • Film ID • Title • Year • Fraction • Prior film ID

  7. Pairs
     • Emit all pairs with dummy counters from the mappers
     • Sum these counters in the reducer
     • The benefit from combiners is limited, as it is likely that all pairs are distinct
     • No in-memory accumulation
     (a code sketch follows)
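
     A minimal sketch of the Pairs pattern, assuming comma-separated records whose first two fields are the co-occurring items; the deck does not show its actual code, so the field positions are hypothetical:

       import java.io.IOException;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;

       public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
           private static final IntWritable ONE = new IntWritable(1);
           private final Text pair = new Text();

           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] fields = value.toString().split(",");
               // One emit per record: the (first item, second item) pair plus a dummy counter.
               pair.set(fields[0] + "\t" + fields[1]);
               context.write(pair, ONE);
           }
       }

       class PairsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
           @Override
           protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
                   throws IOException, InterruptedException {
               int sum = 0;
               for (IntWritable c : counts) sum += c.get();   // plain sum of the dummy counters
               context.write(pair, new IntWritable(sum));
           }
       }

     When nearly every emitted pair is distinct, as the slide notes, a combiner has almost nothing to merge, which is why its benefit is limited here.
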
  8. Stripes
     • Group data by the first item in each pair and maintain an associative array ("stripe") in which counters for all adjacent items are accumulated
     • The reducer receives all stripes for a leading item i, merges them, and emits the same result as the Pairs approach
     • Generates fewer intermediate keys, so the framework has less sorting to do
     • Greatly benefits from combiners
     • Performs in-memory accumulation, which can lead to problems if not implemented properly
     • More complex implementation
     (a code sketch follows)
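
     A minimal sketch of the Stripes pattern under the same hypothetical record layout as the Pairs sketch above:

       import java.io.IOException;
       import java.util.Map;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.MapWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.io.Writable;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;

       public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] fields = value.toString().split(",");
               // One stripe per record: {adjacent item -> 1}, keyed by the leading item.
               MapWritable stripe = new MapWritable();
               stripe.put(new Text(fields[1]), new IntWritable(1));
               context.write(new Text(fields[0]), stripe);
           }
       }

       class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
           @Override
           protected void reduce(Text item, Iterable<MapWritable> stripes, Context context)
                   throws IOException, InterruptedException {
               MapWritable merged = new MapWritable();
               for (MapWritable stripe : stripes) {
                   // Element-wise sum: accumulate each adjacent item's counter in memory.
                   for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                       IntWritable prev = (IntWritable) merged.get(e.getKey());
                       int add = ((IntWritable) e.getValue()).get();
                       merged.put(e.getKey(), new IntWritable(prev == null ? add : prev.get() + add));
                   }
               }
               context.write(item, merged);
           }
       }

     Because stripe merging is associative, the same reducer class can be registered as a combiner (job.setCombinerClass), which is where the pattern's combiner benefit comes from.
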
  9. Pairs vs Stripes

     Pairs                                        Stripes
     Slower                                       Faster
     1 emit per record                            Fewer emits
     Easier to implement                          More difficult to implement
     More reduce input keys                       Fewer reduce input keys
       (841 groups for 841 records)                 (72 groups for 841 records)
     Heap usage: 2X: 563 MB, 10X: 570 MB          Heap usage: 2X: 567 MB, 10X: 567 MB

  10. Join
     • Left-outer join, to ensure that all remakes are included
     • Two distinct mapper classes for reading from the two files
     • Also implemented a partitioner for sorting and grouping
     (a code sketch follows)
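
     A minimal sketch of the reduce-side left-outer join. The record layouts and the "M"/"R" tag strings are assumptions, and the deck's partitioner-based sorting and grouping is replaced here by simply buffering the remakes in the reducer:

       import java.io.IOException;
       import java.util.ArrayList;
       import java.util.List;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;

       public class MoviesMapper extends Mapper<LongWritable, Text, Text, Text> {
           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] f = value.toString().split(",");   // assumed layout: id, name, year
               context.write(new Text(f[0]), new Text("M," + f[1] + "," + f[2]));
           }
       }

       class RemakesMapper extends Mapper<LongWritable, Text, Text, Text> {
           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] f = value.toString().split(",");   // assumed layout: remakeId, remakeName, priorFilmId
               context.write(new Text(f[2]), new Text("R," + f[1]));   // key on the prior film's id
           }
       }

       class JoinReducer extends Reducer<Text, Text, Text, Text> {
           @Override
           protected void reduce(Text priorId, Iterable<Text> values, Context context)
                   throws IOException, InterruptedException {
               String movie = null;                        // name,year of the prior film, if present
               List<String> remakes = new ArrayList<>();
               for (Text v : values) {
                   String s = v.toString();
                   if (s.startsWith("M,")) movie = s.substring(2);
                   else remakes.add(s.substring(2));
               }
               // Left-outer semantics: every remake is emitted, even if the prior film is missing.
               for (String remake : remakes) {
                   context.write(new Text(remake), new Text(movie == null ? "null" : movie));
               }
           }
       }

     In the driver, MultipleInputs.addInputPath binds each input file to its own mapper class, which is how the two distinct mapper codes read from the two files.
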
  11. Join walkthrough (MAP, REDUCE, OUTPUT)
     MOVIES:  (id1, n1, y1) (id2, n2, y2) (id3, n3, y3) (id4, n4, y4)
     REMAKES: (rid1, rn1, id4) (rid2, rn2, id1) (rid2, rn2, id2) (rid3, rn3, id4)
     MAP:     keys every record by movie id: id1 -> [n1 y1; rn2], id2 -> [n2 y2; rn2], id3 -> [n3 y3], id4 -> [n4 y4; rn1; rn3]
     OUTPUT:  (rn2, n1, y1) (rn2, n2, y2) (rn1, n4, y4) (rn3, n4, y4)

  12. Aggregation
     • The mapper extracts from each tuple the values to group by and aggregate, and emits them
     • The reducer receives the values to be aggregated already grouped and calculates an aggregation function
     • Typical aggregation functions like sum or max can be calculated in a streaming fashion, so the reducer need not hold all values simultaneously
     • Our tweak: support parameterized, dynamic queries by providing command-line arguments at job run time
     (a code sketch follows)
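
     A sketch of the parameterized-query tweak, assuming the group-by and aggregate column indices travel from the command line to the mappers via the job Configuration; the property names here are hypothetical:

       import java.io.IOException;
       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;

       public class AggMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
           private int groupCol, aggCol;

           @Override
           protected void setup(Context context) {
               Configuration conf = context.getConfiguration();
               groupCol = conf.getInt("agg.group.col", 0);   // set from args in the driver
               aggCol = conf.getInt("agg.value.col", 1);
           }

           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] f = value.toString().split(",");
               // Emit (group-by value, value to aggregate) as chosen at job run time.
               context.write(new Text(f[groupCol]), new LongWritable(Long.parseLong(f[aggCol])));
           }
       }

       class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
           @Override
           protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                   throws IOException, InterruptedException {
               long sum = 0;                                 // streaming sum; no buffering of values
               for (LongWritable v : values) sum += v.get();
               context.write(key, new LongWritable(sum));
           }
       }

     The driver would forward the query parameters with something like conf.setInt("agg.group.col", Integer.parseInt(args[2])) before submitting the job.
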
  13. Challenges
     • Setup and a steep learning curve
     • Format combinations between the mapper and the reducer
     • Different input and output formats for the mapper and the reducer
     • Extending Writable to create a composite POJO (a sketch follows)
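
     A minimal sketch of that last challenge: a composite POJO made serializable for the shuffle by implementing Writable (the field names are illustrative, not from the project):

       import java.io.DataInput;
       import java.io.DataOutput;
       import java.io.IOException;
       import org.apache.hadoop.io.Writable;

       public class MovieRecord implements Writable {
           private String name;
           private int year;

           public MovieRecord() {}                     // no-arg constructor required by Hadoop

           @Override
           public void write(DataOutput out) throws IOException {
               out.writeUTF(name);                     // serialize fields in a fixed order
               out.writeInt(year);
           }

           @Override
           public void readFields(DataInput in) throws IOException {
               name = in.readUTF();                    // deserialize in exactly the same order
               year = in.readInt();
           }
       }
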
  14. Conclusion
     • From setup to deployment: the whole nine yards
     • The power of MapReduce at huge scale
     • Mastering the internals of the Hadoop MapReduce framework

  15. Future Work
     • Hadoop's power is harnessed on big data on the order of petabytes
     • Optimizing the number of mappers and reducers according to the corpus size
     • Optimizing the join