of terabytes of data
• Takes tens of days to read on a single computer
• Need lots of cheap computers
• This fixes the speed problem (15 minutes on 1,000 computers), but...
• Reliability problems
• In large clusters, computers fail every day
• Cluster size is not fixed
• Need common infrastructure
• Must be efficient and reliable
Thursday, March 25, 2010
software for reliable, scalable, distributed computing.
• Apache Core
• Distributed File System - distributes files
• Map/Reduce - distributes computational work
• Written in Java
• Runs on
• commodity hardware
• OS X, Windows, Linux, Solaris
large file storage with a default block size of 64 MB (compared to 4 or 8 KB in ext3)
• Each block stored on multiple machines
• Each block replicated across servers (usually 3 copies by default)
• Inspired by GFS
• Integrates well with MapReduce
• Uses the local Linux file system underneath, but maintains a separate namespace
• Inspired by functional languages like Lisp and ML
• Gains efficiency from
• Streaming through data, reducing seeks
• Pipelining and lower communication overhead
• Simpler model
• Not a silver bullet, but good for data-intensive applications
• Log processing
• Web index building
data in parallel
• Done by dividing the workload across a large number of machines
• All data elements in MapReduce are immutable
• Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements. How? Let's see...
MapReduce program is called mapping.
• A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually into an output data element.
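As a minimal sketch of the mapping idea (in Python rather than Hadoop's Java API; the function and data names here are illustrative, not part of Hadoop):

```python
# Illustrative sketch: a Mapper transforms each input element
# independently into an output element.
def to_upper_mapper(element):
    """Transform one input element into one output element."""
    return element.upper()

inputs = ["hadoop", "mapreduce", "hdfs"]
outputs = [to_upper_mapper(e) for e in inputs]
print(outputs)  # ['HADOOP', 'MAPREDUCE', 'HDFS']
```

Because each element is handled independently, the framework is free to run the mapper on many machines at once.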
together.
• The Reducer function receives an iterator of input values from an input list. It combines these values together, returning a single output value.
• Reducing is often used to produce "summary" data, turning a large volume of data into a smaller summary of itself.
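A minimal Python sketch of the reducing idea (illustrative names, not Hadoop's actual API): the reducer consumes an iterator of values and folds them into one summary value.

```python
# Illustrative sketch: a Reducer combines an iterator of values
# into a single summary value.
def sum_reducer(values):
    total = 0
    for v in values:  # works for any iterable, not just a list
        total += v
    return total

print(sum_reducer(iter([1, 2, 3, 4])))  # 10
```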
these concepts and uses them to process large volumes of information.
• Every value has a key associated with it. Keys identify related values, e.g. a log of time-coded speedometer readings from multiple cars.
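To make the key idea concrete, here is a small Python sketch using the speedometer example (the car IDs and readings are invented for illustration): values are grouped by their key, and a reducer-style step summarizes each group.

```python
from collections import defaultdict

# Hypothetical log of (car_id, speed) readings; the key identifies
# which values belong together.
readings = [("car_a", 55), ("car_b", 62), ("car_a", 58), ("car_b", 60)]

by_car = defaultdict(list)
for car_id, speed in readings:  # group values by their key
    by_car[car_id].append(speed)

# A reducer-style summary applied per key: average speed per car.
averages = {car: sum(s) / len(s) for car, s in by_car.items()}
print(averages)  # {'car_a': 56.5, 'car_b': 61.0}
```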
be written to determine how many times different words appear in a set of files.
• For example, if we had the files:
• foo.txt: Sweet, this is the foo file
• bar.txt: This is the bar file
• We would expect the output to be:
mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)
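The pseudocode above can be sketched as a runnable Python program (illustrative only; real Hadoop jobs use the Java API, and the shuffle is simulated here with a dictionary):

```python
from collections import defaultdict

def mapper(file_contents):
    """Emit (word, 1) for every word in the file."""
    for word in file_contents.split():
        yield (word, 1)

def reducer(word, values):
    """Sum the counts emitted for one word."""
    return (word, sum(values))

# The example files from the previous slide.
files = {"foo.txt": "Sweet, this is the foo file",
         "bar.txt": "This is the bar file"}

# Simulate the shuffle phase: group mapper output by key.
grouped = defaultdict(list)
for contents in files.values():
    for word, count in mapper(contents):
        grouped[word].append(count)

counts = dict(reducer(w, vs) for w, vs in grouped.items())
print(counts)
```

Note that a simple whitespace split treats "this" and "This" as different words; a real word-count job would typically normalize case and punctuation in the mapper.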
Improved load balancing & faster recovery from failed tasks
• Automatic re-execution on failure
• In a large cluster, some nodes are always slow
• The framework re-executes failed tasks
• Locality optimizations
• With large data, bandwidth to the data is a problem
• MapReduce + HDFS is a very effective solution
• MapReduce queries HDFS for the locations of input data
• Map tasks are scheduled close to their inputs when possible
SQL calls against an indexed/tuned database return a response in milliseconds. Hadoop does not do this.
• Hadoop stores data in files, and does not index them.
• If you want to find something, you have to run a MapReduce job over all the data.
• However, there is a project adding a column-oriented database on top of Hadoop: HBase.
SAN-hosted FS
• Hadoop HDFS cheats: it delivers high local data access rates by running code near the data, instead of being fast at shipping the data remotely.
• Instead of using RAID controllers, it uses non-RAIDed storage across multiple machines.
• It is not currently highly available: the NameNode is a single point of failure.
• It does not currently offer real security.
of Hadoop
• Provides tools for
• Easy data summarization
• Ad-hoc querying
• Analysis of large datasets stored in Hadoop files
You have no idea how heavy your data is. Ask me!
data sets that consists of a high-level language for expressing data analysis programs
• A high-level data-flow language and execution framework for parallel computation
• Pig's language layer is called Pig Latin
• Subproject of Hadoop
I'm Pig. And I speak Pig Latin. No! It is not as tough as Latin, and you can't learn it using Rosetta Stone!
random, real-time read/write access to Big Data.
• For hosting very large tables -- billions of rows by millions of columns -- atop clusters of commodity hardware.
distributed systems.
• Built on top of the Hadoop Distributed File System (HDFS) and the Map/Reduce framework.
• Includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results to make the best use of the collected data.