Hadoop & Map Reduce

Punjab
March 25, 2010

An introduction to Hadoop and Map Reduce

Transcript

  1. Hadoop & MapReduce: Honey, I reduced the Map!
    By Arvinder Singh Kang
    Thursday, March 25, 2010
  2. Overview
    • What is Hadoop?
    • What is MapReduce?
    • Demo
    • What Hadoop is NOT!
    • Five things that could not make it!
    Powered by MapReduce
  3. Motivation
    • How do you scale up applications?
    • Hundreds of terabytes of data
    • Reading that much takes tens of days on a single computer
    • Need lots of cheap computers
    • This fixes the speed problem (15 minutes on 1000 computers), but...
    • Reliability problems
    • In large clusters, computers fail every day
    • Cluster size is not fixed
    • Need common infrastructure
    • Must be efficient and reliable
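The slide's two figures can be checked against each other with a little arithmetic. The sketch below assumes perfectly linear scaling (an idealization with no coordination overhead and no failures) and uses only the numbers quoted on the slide; the function name is made up for illustration.

```python
def single_machine_days(cluster_minutes: float, machines: int) -> float:
    """Days one machine would need, assuming the work divides perfectly evenly."""
    total_machine_minutes = cluster_minutes * machines
    return total_machine_minutes / (60 * 24)  # minutes per day


days = single_machine_days(cluster_minutes=15, machines=1000)
print(f"{days:.1f} days")  # 15 * 1000 / 1440 -> 10.4 days
```

Fifteen minutes on 1000 machines corresponds to roughly ten days on one machine, the same order of magnitude as the "tens of days" quoted above.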
  4. Hadoop
    • The Apache Hadoop project
    • A set of open-source software for reliable, scalable, distributed computing
    • Apache Core
      • Distributed File System - distributes files
      • Map/Reduce - distributes computational work
    • Written in Java
    • Runs on
      • Commodity hardware
      • OS X, Windows, Linux, Solaris
  5. Hadoop DFS
    • Hadoop Distributed File System
    • Designed for large file storage, with a default block size of 64 MB (compared to 4 or 8 KB in ext3)
    • Each block is stored on multiple machines
    • Each block is replicated across servers (3 copies by default)
    • Inspired by GFS
    • Integrates well with MapReduce
    • Runs on top of Linux but keeps a separate namespace
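To make the block and replication figures concrete, here is a small sketch of the bookkeeping. It uses the 64 MB block size and 3 copies quoted on the slide; the function names are hypothetical, and it assumes a partially filled last block only occupies as much space as its data.

```python
BLOCK_SIZE = 64 * 1024**2  # 64 MB default block size, as quoted on the slide
REPLICATION = 3            # default number of copies, as quoted on the slide


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for a file (integer ceiling division)."""
    return (file_size + block_size - 1) // block_size


def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


one_gb = 1024**3
print(block_count(one_gb))                # 16 blocks
print(block_count(one_gb) * REPLICATION)  # 48 block replicas in the cluster
print(raw_storage(one_gb) // one_gb)      # 3 (GB of raw disk for 1 GB of data)
```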
  6. MapReduce
    • Map/Reduce is a programming model for efficient distributed computing
    • Inspired by functional languages such as Lisp and ML
    • Gains efficiency from
      • Streaming through data, reducing seeks
      • Pipelining and lower communication overhead
    • Simpler model
    • Not a silver bullet, but good for highly data-intensive applications
      • Log processing
      • Web index building
  7. MapReduce Simplified Basics
    • MapReduce programs process large volumes of data in parallel
    • Done by dividing the workload across a large number of machines
    • All data elements in MapReduce are immutable
    • Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements. How? Let's see...
  8. Phase 1 - Mapping
    • The first phase of a MapReduce program is called mapping.
    • A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually into an output data element.
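A minimal sketch of the mapping idea in plain Python (not the Hadoop API). The upper-casing mapper is just an illustrative choice; the point is that each element is transformed on its own and the input list is left untouched.

```python
def mapper(element: str) -> str:
    """Transform one input element into one output element."""
    return element.upper()


inputs = ["hadoop", "map", "reduce"]
outputs = [mapper(element) for element in inputs]  # one call per element

print(inputs)   # ['hadoop', 'map', 'reduce']  (input is not modified)
print(outputs)  # ['HADOOP', 'MAP', 'REDUCE']
```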
  9. Phase 2 - Reducing
    • Reducing lets you aggregate values together.
    • A reducer function receives an iterator of input values from an input list. It then combines these values, returning a single output value.
    • Reducing is often used to produce "summary" data, turning a large volume of data into a smaller summary of itself.
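The reducing step can be sketched the same way, again in plain Python rather than the Hadoop API. Summation is an assumed example of "combining"; any aggregation that folds many values into one would do.

```python
from typing import Iterator


def reducer(values: Iterator[int]) -> int:
    """Consume an iterator of values and combine them into a single result."""
    total = 0
    for value in values:
        total += value
    return total


print(reducer(iter([3, 5, 8])))  # 16
```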
  10. Together we are MapReduce
    • The Hadoop MapReduce framework takes these concepts and uses them to process large volumes of information.
    • Every value has a key associated with it. Keys identify related values.
    • For example, in a log of time-coded speedometer readings from multiple cars, each reading can be keyed by the car it came from.
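Here is a small sketch of how keys tie related values together, using the speedometer example. The car IDs, timestamps and speeds are invented sample data, and `group_by_key` is a hypothetical stand-in for the grouping the framework performs between the map and reduce phases.

```python
from collections import defaultdict

# Hypothetical log entries: (car_id, timestamp, speed)
readings = [
    ("car-A", "10:00", 62),
    ("car-B", "10:00", 48),
    ("car-A", "10:05", 70),
    ("car-B", "10:05", 52),
]


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Key each reading by its car, then summarize each car's readings.
pairs = [(car, speed) for car, _time, speed in readings]
grouped = group_by_key(pairs)
max_speed = {car: max(speeds) for car, speeds in grouped.items()}

print(grouped)    # {'car-A': [62, 70], 'car-B': [48, 52]}
print(max_speed)  # {'car-A': 70, 'car-B': 52}
```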
  11. Example - Word Count problem
    • A simple MapReduce program can be written to determine how many times different words appear in a set of files.
    • For example, if we had the files:
      • foo.txt: Sweet, this is the foo file
      • bar.txt: This is the bar file
    • We would expect the output to be (ignoring case and punctuation): sweet 1, this 2, is 2, the 2, foo 1, bar 1, file 2
  12. Example - Word Count
    • Pseudocode

        mapper (filename, file-contents):
          for each word in file-contents:
            emit (word, 1)

        reducer (word, values):
          sum = 0
          for each value in values:
            sum = sum + value
          emit (word, sum)
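The pseudocode can be turned into a small single-process simulation. This is a sketch in plain Python, not the Hadoop API: `run_word_count` is a hypothetical driver that stands in for the framework's grouping step, and lower-casing plus stripping punctuation is an assumption about how words are tokenized.

```python
import re
from collections import defaultdict


def mapper(filename, contents):
    """Emit (word, 1) for every word in the file contents."""
    for word in re.findall(r"[a-z]+", contents.lower()):
        yield word, 1


def reducer(word, values):
    """Sum all the counts emitted for one word."""
    return word, sum(values)


def run_word_count(files):
    """Map every file, group the emitted pairs by word, then reduce each group."""
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)
    return dict(reducer(word, values) for word, values in sorted(grouped.items()))


files = {
    "foo.txt": "Sweet, this is the foo file",
    "bar.txt": "This is the bar file",
}
counts = run_word_count(files)
print(counts)
# {'bar': 1, 'file': 2, 'foo': 1, 'is': 2, 'sweet': 1, 'the': 2, 'this': 2}
```

In a real cluster the mapper calls run on many machines at once and the grouping happens across the network; here everything runs in one loop so the data flow is easy to follow.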
  13. Map/Reduce Benefits
    • Fine-grained Map and Reduce tasks
      • Improved load balancing and faster recovery from failed tasks
    • Automatic re-execution on failure
      • In a large cluster, some nodes are always slow
      • Framework re-executes failed tasks
    • Locality optimizations
      • With large data, bandwidth to data is a problem
      • Map-Reduce + HDFS is a very effective solution
      • Map-Reduce queries HDFS for locations of input data
      • Map tasks are scheduled close to the inputs when possible
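Automatic re-execution can be illustrated with a toy retry loop. This is not Hadoop's scheduler: `run_with_retries`, the `max_attempts` value and the simulated flaky task are all assumptions made for the sketch. Re-running a task is safe here for the same reason it is safe in MapReduce: the task only reads immutable input and produces fresh output.

```python
def run_with_retries(task, max_attempts=4):
    """Run a task, re-executing it on failure up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last failure


# Hypothetical flaky task: fails twice (simulated node failures), then succeeds.
calls = {"n": 0}


def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated node failure")
    return "done"


result, attempts = run_with_retries(flaky_task)
print(result, attempts)  # done 3
```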
  14. Apache Hadoop is not a substitute for a database
    • SQL queries against an indexed, tuned database come back in milliseconds. Hadoop does not do this.
    • Hadoop stores data in files and does not index them.
    • If you want to find something, you have to run a MapReduce job that goes through all the data.
    • However, there is a project adding a column-table database on top of Hadoop: HBase.
  15. MapReduce is not always the best algorithm
    • To get that parallelism, each MapReduce operation must be independent of all the others.
  16. Hadoop Filesystem is not a substitute for a High Availability SAN-hosted FS
    • Hadoop HDFS cheats, delivering high local data access rates by running code near the data, instead of being fast at shipping the data remotely.
    • Instead of using RAID controllers, it uses non-RAIDed storage across multiple machines.
    • It is not currently Highly Available. The Namenode is a Single Point of Failure.
    • It does not currently offer real security.
  17. HDFS is not a POSIX filesystem
    • The POSIX filesystem model has files that can be appended to, seek calls, and file locking.
  18. Who uses Hadoop?
    • Amazon/A9
    • Facebook
    • Google
    • IBM
    • Last.fm
    • New York Times
    • Yahoo!
  19. 5. Hive
    • Data warehouse infrastructure
    • Built on top of Hadoop
    • Provides tools for
      • Easy data summarization
      • Ad hoc querying
      • Analysis of large datasets stored in Hadoop files
    You have no idea how heavy your data is. Ask me!
  20. 4. Pig
    • Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs.
    • A high-level data-flow language and execution framework for parallel computation.
    • Pig's language layer is called Pig Latin
    • Subproject of Hadoop
    I’m Pig. And I speak Pig Latin. No! It is not as tough as Latin and you can’t learn it using Rosetta Stone!
  21. 3. HBase
    • HBase is the Hadoop database.
    • For random, realtime read/write access to Big Data.
    • For hosting very large tables (billions of rows × millions of columns) atop clusters of commodity hardware.
  22. 2. Chukwa
    • A data collection system for managing large distributed systems.
    • Built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework.
    • Includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
  23. 0:) Avro
    • A data serialization system that provides dynamic integration with scripting languages.
  24. References
    • Hadoop Wiki - http://wiki.apache.org/hadoop/HadoopIsNot
    • Apache Hadoop - http://hadoop.apache.org/
    • Yahoo Developer Network - http://developer.yahoo.com/hadoop/tutorial/
    • Programming with Hadoop’s Map/Reduce, ApacheCon EU 2008 - http://docs.huihoo.com/apache/apachecon/eu2008/HadoopProgramming.pdf
    • Map / Reduce: A visual explanation - http://ayende.com (suggested by Dr Dawn Wilkins)
  25. Picture Credits (all pictures found in the public domain)
    • http://icanhascheezburger.com/
    • http://hadoop.apache.org
    • www.paulnoll.com/China/Zodiac/zodiac-pig-pic.html
    • http://www.handsnpaws.com/product/MIDICOST03/Elephant-Ears-Pajama-Pals-Costume.html
    • http://www.myspace.com/BigKyleCO
    • http://developer.yahoo.com/hadoop/
    • http://summerstyle.net/openclipart.org/?ccm=/files/michi/40
    • http://docs.huihoo.com/apache/apachecon/eu2008/HadoopProgramming.pdf
    • http://blog.films.ie/images/skynet.jpg
    • http://en.wikipedia.org/wiki/File:Vinland_Map_HiRes.jpg
    • http://noahrobinson.wordpress.com/2009/05/