Introduction to Hadoop

Saravanan Vijayakumaran, IIT Bombay

What is Hadoop?
• A framework for large-scale distributed data processing
• Advantages
  ◦ Can handle petabytes of data
  ◦ Can scale to a cluster containing thousands of computers
  ◦ Allows a user to focus on the data processing logic
  ◦ Takes care of task distribution and fault tolerance
• Two main components
  ◦ Hadoop Distributed File System
  ◦ MapReduce

A Brief History of Hadoop
• In 2003-2004, Google researchers published papers on the Google File System and MapReduce
• Doug Cutting wrote an initial implementation of Hadoop based on the papers
• Yahoo hired Cutting in 2006 and provided a team for development
• In 2008, Yahoo announced that their search engine was using Hadoop running on a 10,000+ core Linux cluster
• The word Hadoop is the name Cutting’s son gave to a toy elephant

Use Cases of Hadoop
• Filtering
  ◦ Web search
  ◦ Product search
• Classification
  ◦ Spam detection
  ◦ Fraud/anomaly detection
• Recommendation Engine
  ◦ Online retailers suggesting products
  ◦ Social networks suggesting people you may know

Hadoop vs RDBMS
RDBMS
• Structured data
• Optimized for queries and updates at arbitrary locations
• Suitable for small data
• Cannot scale to web-scale data
Hadoop
• Structured/unstructured data
• Optimized for sequential reading and batch processing of data
• Suitable for web-scale data
• High latency for small data

Hadoop vs MPI
MPI
• Suitable for long-running computations which involve small amounts of data
• No fault tolerance provided by the framework
• More flexible program structure
Hadoop
• Typically used for short computations on large amounts of data
• Fault tolerance provided by default
• Programs restricted to have “MapReduce” structure

Hadoop Distributed File System
• A file is split into blocks of size 64 MB (the default in Hadoop 1; later versions default to 128 MB)
• Each block is stored as 3 replicas by default
• Blocks are distributed randomly on the machines in the cluster
• NameNode: machine which holds the file-to-block mapping
• DataNodes: machines which store the blocks
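The block and replication figures above translate into simple arithmetic. The following is a minimal sketch, not part of any Hadoop API: the class name `HdfsBlockMath` and the 1000 MB example file are assumptions made for illustration.

```java
// Back-of-the-envelope HDFS arithmetic; HdfsBlockMath is a hypothetical helper,
// not a Hadoop class.
class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster when every block is kept
    // `replication` times.
    static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 1000 * MB; // assumed example file size
        System.out.println(numBlocks(fileSize, 64 * MB) + " blocks");
        System.out.println(rawStorageBytes(fileSize, 3) / MB + " MB of raw storage");
    }
}
```

With 64 MB blocks, a 1000 MB file occupies 16 blocks (15 full blocks and one partial block), and three replicas consume about 3000 MB of raw disk space across the cluster.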

MapReduce
• Arbitrary programs cannot be run on Hadoop
• Programs should conform to the MapReduce programming model
• MapReduce programs transform lists of input data elements into lists of output data
• The input and output lists are constrained to be lists of key-value pairs
  ◦ A key-value pair is an ordered pair (k,v) where k is the key and v is the value
  ◦ Example: [ (“Alice”, 28), (“Bob”, 35), (“Eve”, 28) ]
• Two list processing idioms are used
  ◦ Map
  ◦ Reduce

Map
A mapper function transforms each input list element to an output list element

Examples of Map
• Square
  [3, 6, 5, 9, 10] → [9, 36, 25, 81, 100]
• isPrime
  [3, 6, 5, 9, 10] → [True, False, True, False, False]
• toUpper
  [“This”, “is”, “a”, “test”] → [“THIS”, “IS”, “A”, “TEST”]
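The three map examples can be reproduced in plain Java without Hadoop. This is a minimal sketch; the class `MapExamples` and its generic `map` helper are names introduced here for illustration.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration of the map idiom; MapExamples is a hypothetical class.
class MapExamples {
    // Apply a mapper function to every element of the input list.
    static <A, B> List<B> map(List<A> input, Function<A, B> mapper) {
        return input.stream().map(mapper).collect(Collectors.toList());
    }

    // Trial division; adequate for small numbers.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> nums = List.of(3, 6, 5, 9, 10);
        System.out.println(map(nums, x -> x * x));
        System.out.println(map(nums, MapExamples::isPrime));
        System.out.println(map(List.of("This", "is", "a", "test"), String::toUpperCase));
    }
}
```

Each output element depends only on the corresponding input element, which is what lets Hadoop run mappers on different machines independently.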

Reduce
A reducer function combines the values in an input list to a single output value

Examples of Reduce
• Summation
  [3, 6, 5, 9, 10] → 33
• Median
  [3, 6, 5, 9, 10] → 6
• Histogram
  [4, 6, 1, 6, 4, 4, 1, 1] → [(1,3), (4,3), (6,2)]
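The reduce examples can be sketched the same way in plain Java. The class `ReduceExamples` is hypothetical, and the median helper assumes an odd-length list, as in the example above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the reduce idiom; ReduceExamples is a hypothetical class.
class ReduceExamples {
    // Summation: fold the list into a single total.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    // Median of an odd-length list: the middle element after sorting.
    static int median(List<Integer> values) {
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);
    }

    // Histogram: value -> number of occurrences, ordered by value.
    static Map<Integer, Integer> histogram(List<Integer> values) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int v : values) counts.merge(v, 1, Integer::sum);
        return counts;
    }
}
```

Note that the histogram output is itself a list of key-value pairs, so a reducer's single output value is not restricted to a scalar.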

An Example Application: Word Count
• Count how many times different words appear in a set of files
  ◦ Use case: Spam filtering
• Suppose we have two files
  ◦ file1.txt: Hello, this is the first file
  ◦ file2.txt: This is the second file
• Expected output
    hello   1
    this    2
    is      2
    the     2
    first   1
    second  1
    file    2

Word Count as MapReduce: Mapper
Mapper pseudocode
    mapper (file-contents):
        for each word in file-contents:
            emit (word, 1)
• file1.txt: Hello, this is the first file
  ◦ Mapper output: (hello, 1) (this, 1) (is, 1) (the, 1) (first, 1) (file, 1)
• file2.txt: This is the second file
  ◦ Mapper output: (this, 1) (is, 1) (the, 1) (second, 1) (file, 1)

Word Count as MapReduce: Reducer
• Hadoop groups values with the same key
• Reducer pseudocode
    reducer (word, values):
        sum = 0
        for each value in values:
            sum = sum + value
        emit (word, sum)
• Output of mapper stage
    (hello, 1) (this, 1) (is, 1) (the, 1) (first, 1) (file, 1)
    (this, 1) (is, 1) (the, 1) (second, 1) (file, 1)
• Input to reducer stage
    (hello, [1]) (this, [1, 1]) (is, [1, 1]) (the, [1, 1]) (first, [1]) (file, [1, 1]) (second, [1])
• Output of reducer stage
    (hello, 1) (this, 2) (is, 2) (the, 2) (first, 1) (file, 2) (second, 1)
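The mapper, grouping, and reducer stages can be simulated end to end in plain Java. This is a sketch only: the class `WordCountSketch` is hypothetical, the `group` method merely stands in for the grouping that Hadoop performs between the two stages, and lowercasing plus punctuation stripping is an assumption chosen so that the output matches the expected counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of word count; not Hadoop code.
class WordCountSketch {
    // Mapper: emit (word, 1) for each word. Lowercasing and splitting on
    // non-alphanumeric characters is an assumption of this sketch.
    static List<Map.Entry<String, Integer>> mapper(String fileContents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : fileContents.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) out.add(Map.entry(token, 1));
        }
        return out;
    }

    // Stand-in for Hadoop's grouping step: collect all values with the same key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reducer: sum the values for one word.
    static int reducer(String word, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Run mapper over every file, group, then reduce each key.
    static Map<String, Integer> run(List<String> files) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String f : files) mapped.addAll(mapper(f));
        Map<String, Integer> counts = new TreeMap<>();
        group(mapped).forEach((word, values) -> counts.put(word, reducer(word, values)));
        return counts;
    }
}
```

Calling `WordCountSketch.run` on the contents of file1.txt and file2.txt yields the seven counts listed in the expected output.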

MapReduce Data Flow
• Several mapper processes are created, each processing file blocks on a node
• Intermediate (key,value) pairs are exchanged to send all values with the same key to a single reducer
• Each reducer generates an output file
• The reducer outputs can be fed to a second MapReduce job for further processing

Combiner Function
• Suppose a file has 1000 occurrences of the word “is”
• 1000 key-value pairs equal to (is, 1) will be created and sent to the reducer for “is”
• A more efficient method is to just send (is, 1000)
• This node-local processing is done by the Combiner
• In this case, the Combiner has the same implementation as the Reducer
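The saving from a combiner can be seen in a small simulation. This is a sketch in plain Java, not Hadoop's Combiner interface; the class `CombinerSketch` and its `combine` method are hypothetical names.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates node-local combining of (word, 1) pairs; not Hadoop code.
class CombinerSketch {
    // Collapse the (word, 1) pairs produced on one node into a single
    // (word, localCount) pair per distinct word.
    static Map<String, Integer> combine(List<String> wordsOnOneNode) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : wordsOnOneNode) local.merge(w, 1, Integer::sum);
        return local;
    }

    public static void main(String[] args) {
        List<String> words = Collections.nCopies(1000, "is");
        Map<String, Integer> combined = combine(words);
        System.out.println(words.size() + " pairs sent without a combiner");
        System.out.println(combined.size() + " pair sent with a combiner: " + combined);
    }
}
```

Reusing the reducer logic as the combiner works here because addition is associative and commutative, so summing partial sums gives the same total as summing the individual 1s.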

Partitioner Function
• The default implementation distributes the keys among the reducers by hash, which is effectively random
    ReducerIndex = Hash(key) % NumReducers
• In WordCount, suppose we want all keys starting with the same letter to go to the same reducer
• A custom Partitioner can achieve this
    ReducerIndex = Hash(FirstLetterOfKey) % NumReducers
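The two partitioning formulas can be written as plain Java functions. This is a sketch, not Hadoop's Partitioner interface; the class `PartitionerSketch` and its method names are hypothetical. The bit mask is needed because Java hash codes can be negative, and the custom version assumes non-empty keys and treats letters case-insensitively.

```java
// Plain-Java versions of the two partitioning formulas; not Hadoop code.
class PartitionerSketch {
    // Default-style partitioning: hash of the whole key. The mask clears the
    // sign bit so the reducer index is never negative.
    static int hashPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Custom partitioning: hash only the first letter, so all words starting
    // with the same letter go to the same reducer. Assumes a non-empty key.
    static int firstLetterPartition(String key, int numReducers) {
        char first = Character.toLowerCase(key.charAt(0));
        return (Character.hashCode(first) & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"this", "the", "second"}) {
            System.out.println(word + ": default=" + hashPartition(word, 4)
                    + ", firstLetter=" + firstLetterPartition(word, 4));
        }
    }
}
```

With the custom version, "this" and "the" always land on the same reducer, whereas the default version may separate them.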

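WordCount Mapper in Java
    // Imports go at the top of the enclosing source file
    // (this is the older org.apache.hadoop.mapred API)
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Nested inside the WordCount job class
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        // Reused for every emitted pair to avoid allocating new objects
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            // Splits on whitespace only; case and punctuation are kept as-is
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);   // emit (word, 1)
            }
        }
    }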

• MapClass is the user-defined class implementing the Mapper interface
• User needs to implement the map function
• By default, Hadoop assumes that the input is a text file where each line needs to be processed independently
• Input key to Mapper is LongWritable - the byte offset of a line in a file
• Input value is Text - the contents of the line
• Output key of Mapper is of type Text
• Output value is IntWritable
• Create a variable to hold the constant one
• Split a line into words
• For each word, emit a key-value pair consisting of (word, 1)

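WordCount Reducer in Java
    // Additional imports at the top of the enclosing source file
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Nested inside the WordCount job class
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();   // add up the 1s for this word
            }
            output.collect(key, new IntWritable(sum));   // emit (word, sum)
        }
    }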

• Reduce is the user-defined class implementing the Reducer interface
• User needs to implement the reduce function
• Input key to Reducer is Text - the word emitted by the mapper
• Input value is a list of IntWritable values - the list of 1s
• Output key of Reducer is of type Text
• Output value is IntWritable
• Initialize sum to zero
• Add all the 1s in the values list
• Emit the key-value pair (word, sum)

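WordCount Driver
    // Additional imports at the top of the enclosing source file
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public void run(String inputPath, String outputPath) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the (key, value) pairs written by the reducer
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(Reduce.class);

        FileInputFormat.addInputPath(conf, new Path(inputPath));
        FileOutputFormat.setOutputPath(conf, new Path(outputPath));

        JobClient.runJob(conf);   // submit the job and wait for it to finish
    }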

• Initialize the JobConf object
• Give it a name
• Set the Reducer output key type to be Text
• Set the Reducer output value type to be IntWritable
• The Mapper input key-value types are assumed to be the default - (LongWritable, Text)
• The classes which implement the map and reduce functions are specified
• The locations of the input files and output files are specified
• The job is then executed

Fault Tolerance
• In large clusters, individual nodes or network components may fail
• Hadoop achieves fault tolerance by restarting tasks
• A MapReduce job is monitored by a JobTracker
• Each map task and reduce task is assigned to a TaskTracker
• If a TaskTracker fails to ping the JobTracker for one minute, it is assumed to have crashed
• Other TaskTrackers will re-execute tasks assigned to the failed TaskTracker

MapReduce Design Patterns
• Summarization
  ◦ Numerical summarization
  ◦ Inverted Index
• Filtering
  ◦ Top 10 lists
• Data organization
  ◦ Sorting

Numerical Summarizations
• Minimum, Maximum, Average, Median, Standard Deviation
• Suppose we have bank transaction data in the following format
• We want to find the maximum, minimum and average transaction amount for each of the PAN card numbers

    Date        PAN card number  Amount
    21/11/2015  ABCDE1234F       80,000
    30/11/2015  PQRST4567U       1,20,000
    01/12/2015  ABCDE1234F       25,000
    23/01/2016  GHIJK2345L       1,00,000

MapReduce Structure
• Input to the mapper will be a list of (LongWritable, Text) pairs
• For each input pair, mapper outputs a (PAN card, Amount) pair
• All mapper outputs with the same PAN card will arrive at a single reducer
• Input to reducer will be (PAN card, [Amount1, Amount2, ..., AmountN])
• Maximum and minimum are the largest and smallest amounts in the list
• For average, divide sum of amounts by number of amounts
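The summarization structure can be sketched in plain Java. This is an in-memory simulation, not Hadoop code: the class `PanSummarySketch` is hypothetical, and the whitespace-separated line layout (date, PAN, amount with comma separators) is an assumption based on the transaction table. Instead of materializing the list of amounts, the sketch feeds them into `java.util.LongSummaryStatistics`, which tracks minimum, maximum, and average.

```java
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of per-PAN summarization; not Hadoop code.
class PanSummarySketch {
    // Mapper: parse one line and return {PAN, amount}. The field layout
    // "date PAN amount" is assumed; digit-group commas are removed.
    static String[] mapper(String line) {
        String[] fields = line.trim().split("\\s+");
        return new String[] { fields[1], fields[2].replace(",", "") };
    }

    // Grouping plus reducer: per PAN, accumulate min, max and average.
    static Map<String, LongSummaryStatistics> summarize(Iterable<String> lines) {
        Map<String, LongSummaryStatistics> byPan = new TreeMap<>();
        for (String line : lines) {
            String[] kv = mapper(line);
            byPan.computeIfAbsent(kv[0], k -> new LongSummaryStatistics())
                 .accept(Long.parseLong(kv[1]));
        }
        return byPan;
    }
}
```

For each PAN, `getMin()`, `getMax()`, and `getAverage()` on the resulting statistics object give the three requested values; for ABCDE1234F in the table these are 25,000, 80,000, and 52,500.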

Inverted Index
• Faster web search
  ◦ Each web page contains a list of words
  ◦ An inverted index is a list of webpages which contain a particular word
• Citations
  ◦ Every patent may cite some other past patents
  ◦ An inverted index is a list of patents which cite a particular patent

Vehicle Tracking
• Suppose streets in Delhi are equipped with CCTVs which can perform number plate recognition
• Each camera sends a list of vehicles it has seen
• Suppose we want to know which areas a vehicle visited on a particular day

    Date        CCTV ID  Vehicle Number
    21/11/2015  123      DL9C 1234
    30/11/2015  101      DL5A 7890
    01/12/2015  123      DL8B 5555
    23/01/2016  155      DL9C 1234

MapReduce Structure
• Input to the mapper will be a list of (LongWritable, Text) pairs
• For each input pair, mapper outputs a (Vehicle Number, CCTV ID) pair if the vehicle was seen on the day of interest
• All mapper outputs with the same Vehicle Number will arrive at a single reducer
• Input to reducer will be (Vehicle Number, [CCTV ID1, CCTV ID2, ..., CCTV IDn])
• The reducer removes any duplicates in the CCTV ID list
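The vehicle tracking job is an inverted index from vehicle number to camera IDs. The following plain-Java sketch simulates it in memory; the class `VehicleTrackingSketch` is hypothetical, and the line layout (date, CCTV ID, then a vehicle number that may contain a space) is an assumption based on the table of sightings.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory simulation of the vehicle tracking job; not Hadoop code.
class VehicleTrackingSketch {
    // Returns vehicle number -> set of CCTV IDs that saw it on the given day.
    static Map<String, Set<String>> camerasByVehicle(List<String> lines, String day) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String line : lines) {
            // Assumed layout: date, CCTV ID, vehicle number (rest of the line).
            String[] fields = line.trim().split("\\s+", 3);
            if (!fields[0].equals(day)) continue;   // mapper-side date filter
            // Using a set removes duplicate CCTV IDs, as the reducer would.
            index.computeIfAbsent(fields[2], k -> new TreeSet<>()).add(fields[1]);
        }
        return index;
    }
}
```

Filtering by date in the mapper keeps irrelevant sightings from being sent over the network to the reducers.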

Top 10 List
• Given bank transaction data, suppose we want to find the 10 largest transactions

    Date        PAN card number  Amount
    21/11/2015  ABCDE1234F       80,000
    30/11/2015  PQRST4567U       1,20,000
    01/12/2015  ABCDE1234F       25,000
    23/01/2016  GHIJK2345L       1,00,000

Top 10 List
[Figure: three Input Splits each feed a Top Ten Mapper; each mapper sends its local top 10 to a single Top Ten Reducer, which writes the final top 10 to the Top Ten Output]

MapReduce Structure
• Set the number of reducers to one
• Input to the mapper will be a list of (LongWritable, Text) pairs
• Each mapper outputs ten (NULL, (Amount, PAN Card, Date)) pairs corresponding to the ten largest amounts
• All mapper outputs will arrive at the single reducer
• The input to the reducer will be (NULL, [(A_1, PC_1, D_1), (A_2, PC_2, D_2), …, (A_10M, PC_10M, D_10M)]), where M is the number of mappers
• The reducer computes the top ten transactions
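The local-then-global top-ten structure can be sketched in plain Java. This is an in-memory simulation, not Hadoop code: the class `TopKSketch` is hypothetical, it tracks amounts only (dropping the PAN card and date for brevity), and it uses a bounded min-heap, which is one common way to keep the k largest items.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// In-memory simulation of the top-k pattern; not Hadoop code.
class TopKSketch {
    // Keep the k largest amounts using a min-heap of size at most k.
    static List<Long> topK(List<Long> amounts, int k) {
        PriorityQueue<Long> heap = new PriorityQueue<>();
        for (long a : amounts) {
            heap.add(a);
            if (heap.size() > k) heap.poll();   // evict the current smallest
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder()); // largest first
        return result;
    }

    // Each split yields a local top k (the mapper's job); a single reducer
    // then computes the top k of all local results.
    static List<Long> globalTopK(List<List<Long>> splits, int k) {
        List<Long> candidates = new ArrayList<>();
        for (List<Long> split : splits) candidates.addAll(topK(split, k));
        return topK(candidates, k);
    }
}
```

The single reducer sees at most k items per mapper, so its input stays small no matter how large the original data set is.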

Sorting
• Given bank transaction data, suppose we want to sort all transactions in ascending order of amounts

    Date        PAN card number  Amount
    21/11/2015  ABCDE1234F       80,000
    30/11/2015  PQRST4567U       1,20,000
    01/12/2015  ABCDE1234F       25,000
    23/01/2016  GHIJK2345L       1,00,000

MapReduce Structure
• Input to the mapper will be a list of (LongWritable, Text) pairs
• Suppose we set the mapper output to an (Amount, (PAN Card, Date)) pair
• All mapper outputs corresponding to the same amount will arrive at a single reducer
• But the default partitioner in Hadoop does not guarantee that amounts which are close together will arrive at the same reducer
• We need a custom partitioner

Custom Partitioner for Sorting
[Figure: three Input Splits each feed a Mapper followed by a Custom Partitioner; the partitioners route records to three Reducers by amount range (0 to 1L, 1L to 5L, 5L and above), and each Reducer writes a Sorted Output]

MapReduce Structure
• Input to the mapper will be a list of (LongWritable, Text) pairs
• Mapper outputs are (Amount, (PAN Card, Date)) pairs
• Mapper outputs with amounts in the same range will arrive at the same reducer
• The input to each reducer will be sorted by amount
    (Amount1, [(PAN Card1, Date1), (PAN Card2, Date2), …])
    (Amount2, [(PAN Card3, Date3), (PAN Card4, Date4), …])
    …
    (AmountN, [(PAN Card5, Date5), (PAN Card6, Date6), …])
• Each reducer will output all the transactions it receives
• The outputs of all reducers can be concatenated, in order of their amount ranges, to get the sorted data
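Range partitioning followed by concatenation can be sketched in plain Java. This is an in-memory simulation, not Hadoop's Partitioner interface: the class `RangeSortSketch` is hypothetical, it sorts amounts only, and the treatment of the boundaries (1L goes to the middle range, 5L to the top range) is an assumption, where 1L denotes one lakh (100,000).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// In-memory simulation of sorting with a range partitioner; not Hadoop code.
class RangeSortSketch {
    static final long LAKH = 100_000L;

    // Custom partitioner: reducer 0 gets amounts below 1L, reducer 1 gets
    // 1L up to (but not including) 5L, reducer 2 gets 5L and above.
    static int partition(long amount) {
        if (amount < LAKH) return 0;
        if (amount < 5 * LAKH) return 1;
        return 2;
    }

    static List<Long> sort(List<Long> amounts) {
        List<List<Long>> reducers = new ArrayList<>();
        for (int i = 0; i < 3; i++) reducers.add(new ArrayList<>());
        for (long a : amounts) reducers.get(partition(a)).add(a);

        List<Long> output = new ArrayList<>();
        for (List<Long> r : reducers) {
            Collections.sort(r);   // stands in for the sorted-by-key reducer input
            output.addAll(r);      // concatenate reducer outputs in range order
        }
        return output;
    }
}
```

Because every amount handled by reducer 0 is smaller than every amount handled by reducer 1, and so on, concatenating the per-reducer outputs gives a globally sorted list. Ranges that match the data distribution keep the load balanced across reducers.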

Reasons for Hadoop Popularity
• Ease of use
  ◦ Hadoop takes care of the challenges of distributed computing
  ◦ User can focus on the data processing logic
  ◦ Same program can be executed on a 10 machine cluster or 1000 machine cluster
• MapReduce is flexible
  ◦ A large class of problems can be expressed as MapReduce computations
• Scale out
  ◦ Can scale to clusters having thousands of machines
  ◦ Can handle web-scale data

Learning resources
• Introduction to Hadoop and MapReduce, MOOC from Udacity, https://www.udacity.com/courses/ud617
• Hadoop Tutorial from Yahoo!, https://developer.yahoo.com/hadoop/tutorial/
• Hadoop: The Definitive Guide, Tom White, O'Reilly Media, 2012
• Hadoop in Action, Chuck Lam, Manning Publications, 2010
• MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly Media, 2012

Attribution
Some figures were taken from the "Hadoop Tutorial from Yahoo!" by Yahoo! Inc., which is licensed under a Creative Commons Attribution 3.0 Unported License. No changes were made to the figures. https://creativecommons.org/licenses/by/3.0/

Thanks for your attention!
