Slide 1

Introduction to Hadoop
Saravanan Vijayakumaran
IIT Bombay

Slide 2

What is Hadoop?
● A framework for large-scale distributed data processing
● Advantages
  ○ Can handle petabytes of data
  ○ Can scale to a cluster containing thousands of computers
  ○ Allows a user to focus on the data processing logic
  ○ Takes care of task distribution and fault tolerance
● Two main components
  ○ Hadoop Distributed File System
  ○ MapReduce

Slide 3

A Brief History of Hadoop
● In 2003-2004, Google researchers published papers on the Google File System and MapReduce
● Doug Cutting did an initial implementation of Hadoop based on the papers
● Yahoo hired Cutting in 2006 and provided a team for development
● In 2008, Yahoo announced that their search engine was using Hadoop running on a 10,000+ core Linux cluster
● The word Hadoop is the name Cutting’s son gave to a toy elephant

Slide 4

Use Cases of Hadoop
● Filtering
  ○ Web search
  ○ Product search
● Classification
  ○ Spam detection
  ○ Fraud/anomaly detection
● Recommendation engines
  ○ Online retailers suggesting products
  ○ Social networks suggesting people you may know

Slide 5

Hadoop vs RDBMS

RDBMS
● Structured data
● Optimized for queries and updates at arbitrary locations
● Suitable for small data
● Cannot scale to web-scale data

Hadoop
● Structured/unstructured data
● Optimized for sequential reading and batch processing of data
● Suitable for web-scale data
● High latency for small data

Slide 6

Hadoop vs MPI

MPI
● Suitable for long-running computations which involve small amounts of data
● No fault tolerance provided by the framework
● More flexible program structure

Hadoop
● Typically used for short computations on large amounts of data
● Fault tolerance provided by default
● Programs restricted to have “MapReduce” structure

Slide 7

Hadoop Distributed File System
● A file is split into blocks of size 64 MB
● Each block is replicated 3 times
● Blocks are distributed randomly on the machines in the cluster
● NameNode: the machine which holds the file-to-block mapping
● DataNodes: the machines which store the blocks (the sketch below shows how to list a file’s block locations)
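
As an aside (not on the original slide), the HDFS Java API exposes this layout. A minimal sketch that lists a file's blocks and the DataNodes holding their replicas; the path /data/file1.txt is hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListBlocks {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Hypothetical HDFS path, used only for illustration
      FileStatus status = fs.getFileStatus(new Path("/data/file1.txt"));
      // One BlockLocation per block of the file (64 MB each by default)
      BlockLocation[] blocks =
          fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        // Each block reports the DataNodes holding its replicas
        System.out.println(block.getOffset() + " "
            + java.util.Arrays.toString(block.getHosts()));
      }
    }
  }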

Slide 8

MapReduce
● Arbitrary programs cannot be run on Hadoop
● Programs should conform to the MapReduce programming model
● MapReduce programs transform lists of input data elements into lists of output data elements
● The input and output lists are constrained to be lists of key-value pairs
  ○ A key-value pair is an ordered pair (k, v) where k is the key and v is the value
  ○ Example: [ (“Alice”, 28), (“Bob”, 35), (“Eve”, 28) ]
● Two list processing idioms are used
  ○ Map
  ○ Reduce

Slide 9

Map

A mapper function transforms each input list element into an output list element.

Slide 10

Examples of Map
● Square: [3, 6, 5, 9, 10] → [9, 36, 25, 81, 100]
● isPrime: [3, 6, 5, 9, 10] → [True, False, True, False, False]
● toUpper: [“This”, “is”, “a”, “test”] → [“THIS”, “IS”, “A”, “TEST”]
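
The map idiom itself is independent of Hadoop. A minimal plain-Java sketch of the Square example (the class name MapIdiom is illustrative):

  import java.util.ArrayList;
  import java.util.List;

  public class MapIdiom {
    // Transform each input element into one output element
    static List<Integer> square(List<Integer> input) {
      List<Integer> output = new ArrayList<Integer>();
      for (int x : input) {
        output.add(x * x);
      }
      return output; // [3, 6, 5, 9, 10] -> [9, 36, 25, 81, 100]
    }
  }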

Slide 11

Reduce

A reducer function combines the values in an input list into a single output value.

Slide 12

Examples of Reduce
● Summation: [3, 6, 5, 9, 10] → 33
● Median: [3, 6, 5, 9, 10] → 6
● Histogram: [4, 6, 1, 6, 4, 4, 1, 1] → [(1, 3), (4, 3), (6, 2)]
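
Likewise, a plain-Java sketch of the Summation example (the class name ReduceIdiom is illustrative):

  import java.util.List;

  public class ReduceIdiom {
    // Combine all values in the input list into a single output value
    static int sum(List<Integer> input) {
      int total = 0;
      for (int x : input) {
        total += x;
      }
      return total; // [3, 6, 5, 9, 10] -> 33
    }
  }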

Slide 13

An Example Application: Word Count
● Count how many times different words appear in a set of files
  ○ Use case: Spam filtering
● Suppose we have two files
  ○ file1.txt: Hello, this is the first file
  ○ file2.txt: This is the second file
● Expected output

  hello   1
  this    2
  is      2
  the     2
  first   1
  second  1
  file    2

Slide 14

Word Count as MapReduce: Mapper
● Mapper pseudocode

  mapper(file-contents):
      for each word in file-contents:
          emit(word, 1)

● file1.txt: Hello, this is the first file
  ○ Mapper output: (hello, 1) (this, 1) (is, 1) (the, 1) (first, 1) (file, 1)
● file2.txt: This is the second file
  ○ Mapper output: (this, 1) (is, 1) (the, 1) (second, 1) (file, 1)

Slide 15

Word Count as MapReduce: Reducer
● Hadoop groups values with the same key
● Reducer pseudocode

  reducer(word, values):
      sum = 0
      for each value in values:
          sum = sum + value
      emit(word, sum)

● Output of mapper stage: (hello, 1) (this, 1) (is, 1) (the, 1) (first, 1) (file, 1) (this, 1) (is, 1) (the, 1) (second, 1) (file, 1)
● Input to reducer stage: (hello, [1]) (this, [1, 1]) (is, [1, 1]) (the, [1, 1]) (first, [1]) (file, [1, 1]) (second, [1])
● Output of reducer stage: (hello, 1) (this, 2) (is, 2) (the, 2) (first, 1) (file, 2) (second, 1)

Slide 16

MapReduce Data Flow
● Several mapper processes are created, each processing file blocks on a node
● Intermediate (key, value) pairs are exchanged to send all values with the same key to a single reducer
● Each reducer generates an output file
● The reducer outputs can be fed to a second MapReduce job for further processing

Slide 17

Combiner Function
● Suppose a file has 1000 occurrences of the word “is”
● 1000 key-value pairs equal to (is, 1) will be created and sent to the reducer responsible for “is”
● A more efficient method is to just send (is, 1000)
● This node-local processing is done by the Combiner
● In this case, the Combiner has the same implementation as the Reducer (see the snippet below)
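
In the old org.apache.hadoop.mapred API used in the code slides that follow, the combiner is registered in the driver. For WordCount this is a single line, and reusing Reduce is valid because summation is associative and commutative:

  // In the WordCount driver, alongside setMapperClass/setReducerClass
  conf.setCombinerClass(Reduce.class);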

Slide 18

Partitioner Function
● The default implementation distributes keys among the reducers by hashing:

  ReducerIndex = Hash(key) % NumReducers

● In WordCount, suppose we want all keys starting with the same letter to go to the same reducer
● A custom Partitioner can achieve this (a sketch follows):

  ReducerIndex = Hash(FirstLetterOfKey) % NumReducers
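
A sketch of such a custom Partitioner in the old API; the class name and hashing scheme are illustrative, not from the original slides (imports omitted, as on the other slides):

  public static class FirstLetterPartitioner
      implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      // All words starting with the same letter get the same reducer index
      char firstLetter = Character.toLowerCase(key.toString().charAt(0));
      return (firstLetter & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Registered in the driver with:
  conf.setPartitionerClass(FirstLetterPartitioner.class);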

Slide 19

WordCount Mapper in Java

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

Slide 20

WordCount Mapper in Java

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

● MapClass is the user-defined class implementing the Mapper interface
● The user needs to implement the map function

Slide 21

WordCount Mapper in Java

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

● By default, Hadoop assumes that the input is a text file where each line needs to be processed independently
● The input key to the Mapper is a LongWritable: the byte offset of the line in the file
● The input value is Text: the contents of the line

Slide 22

WordCount Mapper in Java

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

● The output key of the Mapper is of type Text
● The output value is IntWritable

Slide 23

WordCount Mapper in Java

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

● Create a variable to hold the constant one
● Split a line into words
● For each word, emit the key-value pair (word, 1)

Slide 24

WordCount Reducer in Java

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

Slide 25

WordCount Reducer in Java

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

● Reduce is the user-defined class implementing the Reducer interface
● The user needs to implement the reduce function

Slide 26

WordCount Reducer in Java

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

● The input key to the Reducer is Text: the word emitted by the mapper
● The input value is a list of IntWritable values: the list of 1s

Slide 27

WordCount Reducer in Java

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

● The output key of the Reducer is of type Text
● The output value is IntWritable

Slide 28

WordCount Reducer in Java

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

● Initialize sum to zero
● Add all the 1s in the values list
● Emit the key-value pair (word, sum)

Slide 29

WordCount Driver

  public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));

    JobClient.runJob(conf);
  }

Slide 30

WordCount Driver

  public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));

    JobClient.runJob(conf);
  }

● Initialize the JobConf object
● Give the job a name

Slide 31

WordCount Driver

  public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));

    JobClient.runJob(conf);
  }

● Set the Reducer output key type to Text
● Set the Reducer output value type to IntWritable
● The Mapper input key-value types are assumed to be the default: (LongWritable, Text)

Slide 32

WordCount Driver

  public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));

    JobClient.runJob(conf);
  }

● The classes which implement the map and reduce functions are specified

Slide 33

WordCount Driver

  public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));

    JobClient.runJob(conf);
  }

● The locations of the input and output files are specified
● The job is then executed

Slide 34

Fault Tolerance
● In large clusters, individual nodes or network components may fail
● Hadoop achieves fault tolerance by restarting tasks
● A MapReduce job is monitored by a JobTracker
● Each map task and reduce task is assigned to a TaskTracker
● If a TaskTracker fails to ping the JobTracker for one minute, it is assumed to have crashed
● Other TaskTrackers will re-execute the tasks assigned to the failed TaskTracker

Slide 35

MapReduce Design Patterns
● Summarization
  ○ Numerical summarization
  ○ Inverted index
● Filtering
  ○ Top 10 lists
● Data organization
  ○ Sorting

Slide 36

Numerical Summarizations
● Minimum, Maximum, Average, Median, Standard Deviation
● Suppose we have bank transaction data in the following format
● We want to find the maximum, minimum and average transaction amount for each of the PAN card numbers

  Date        PAN card number  Amount
  21/11/2015  ABCDE1234F         80,000
  30/11/2015  PQRST4567U       1,20,000
  01/12/2015  ABCDE1234F         25,000
  23/01/2016  GHIJK2345L       1,00,000

Slide 37

MapReduce Structure
● Input to the mapper will be a list of (LongWritable, Text) pairs
● For each input pair, the mapper outputs a (PAN card, Amount) pair
● All mapper outputs with the same PAN card will arrive at a single reducer
● Input to the reducer will be (PAN card, [Amount1, Amount2, ..., AmountN])
● The maximum and minimum are the largest and smallest amounts in the list
● For the average, divide the sum of the amounts by the number of amounts
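
A sketch of the summarization reducer, under the assumption that the mapper emits (PAN card, Amount) as (Text, DoubleWritable) pairs; the class name and output format are illustrative (imports omitted, as on the other slides):

  public static class SummaryReducer extends MapReduceBase
      implements Reducer<Text, DoubleWritable, Text, Text> {
    public void reduce(Text panCard, Iterator<DoubleWritable> amounts,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
      double min = Double.POSITIVE_INFINITY;
      double max = Double.NEGATIVE_INFINITY;
      double sum = 0;
      int count = 0;
      while (amounts.hasNext()) {
        double amount = amounts.next().get();
        min = Math.min(min, amount);
        max = Math.max(max, amount);
        sum += amount;
        count++;
      }
      // Emit "maximum minimum average" for this PAN card number
      output.collect(panCard, new Text(max + " " + min + " " + sum / count));
    }
  }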

Slide 38

Inverted Index
● Faster web search
  ○ Each web page contains a list of words
  ○ An inverted index is a list of the web pages which contain a particular word
● Citations
  ○ Every patent may cite some other past patents
  ○ An inverted index is a list of the patents which cite a particular patent

Slide 39

Vehicle Tracking
● Suppose streets in Delhi are equipped with CCTVs which can perform number plate recognition
● Each camera sends a list of vehicles it has seen
● Suppose we want to know which areas a vehicle visited on a particular day

  Date        CCTV ID  Vehicle Number
  21/11/2015  123      DL9C 1234
  30/11/2015  101      DL5A 7890
  01/12/2015  123      DL8B 5555
  23/01/2016  155      DL9C 1234

Slide 40

MapReduce Structure
● Input to the mapper will be a list of (LongWritable, Text) pairs
● For each input pair, the mapper outputs a (Vehicle Number, CCTV ID) pair if the vehicle was seen on the day of interest
● All mapper outputs with the same Vehicle Number will arrive at a single reducer
● Input to the reducer will be (Vehicle Number, [CCTV ID1, CCTV ID2, ..., CCTV IDn])
● The reducer removes any duplicates in the CCTV ID list
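
A sketch of the deduplicating reducer, assuming the (Vehicle Number, CCTV ID) pairs arrive as (Text, Text); the class name is illustrative (imports omitted, as on the other slides):

  public static class DedupReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text vehicle, Iterator<Text> cctvIds,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
      java.util.Set<String> seen = new java.util.HashSet<String>();
      while (cctvIds.hasNext()) {
        // A set keeps each CCTV ID only once per vehicle
        seen.add(cctvIds.next().toString());
      }
      for (String id : seen) {
        output.collect(vehicle, new Text(id));
      }
    }
  }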

Slide 41

Top 10 List
● Given bank transaction data, suppose we want to find the 10 largest transactions

  Date        PAN card number  Amount
  21/11/2015  ABCDE1234F         80,000
  30/11/2015  PQRST4567U       1,20,000
  01/12/2015  ABCDE1234F         25,000
  23/01/2016  GHIJK2345L       1,00,000

Slide 42

Top 10 List

[Figure: each Input Split feeds a Top Ten Mapper, which emits a local top 10; all the local top 10 lists go to a single Top Ten Reducer, which produces the final top 10 as the output.]

Slide 43

MapReduce Structure
● Set the number of reducers to one
● Input to the mapper will be a list of (LongWritable, Text) pairs
● Each mapper outputs ten (NULL, (Amount, PAN Card, Date)) pairs corresponding to the ten largest amounts it has seen
● All mapper outputs will arrive at the single reducer
● The input to the reducer will be (NULL, [(A1, PC1, D1), (A2, PC2, D2), …, (A10M, PC10M, D10M)]), i.e., ten triples from each of the M mappers
● The reducer computes the top ten transactions
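
One way to keep the local top 10 inside each mapper is a TreeMap keyed by amount, as in this sketch (Hadoop plumbing and record parsing omitted; note that ties on amount overwrite each other here, which a real implementation would have to handle):

  import java.util.TreeMap;

  public class LocalTopTen {
    // TreeMap keeps entries sorted by amount, so the smallest is easy to evict
    private final TreeMap<Double, String> topTen =
        new TreeMap<Double, String>();

    public void consider(double amount, String record) {
      topTen.put(amount, record);
      if (topTen.size() > 10) {
        // Evict the record with the smallest amount
        topTen.remove(topTen.firstKey());
      }
    }
    // After all input is seen, the 10 surviving records are emitted with a
    // NULL key (in the old API, from close(), using an OutputCollector
    // reference saved during map())
  }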

Slide 44

Sorting
● Given bank transaction data, suppose we want to sort all transactions in ascending order of amounts

  Date        PAN card number  Amount
  21/11/2015  ABCDE1234F         80,000
  30/11/2015  PQRST4567U       1,20,000
  01/12/2015  ABCDE1234F         25,000
  23/01/2016  GHIJK2345L       1,00,000

Slide 45

MapReduce Structure
● Input to the mapper will be a list of (LongWritable, Text) pairs
● Suppose we set the mapper output to an (Amount, (PAN Card, Date)) pair
● All mapper outputs corresponding to the same amount will arrive at the same reducer
● But the default partitioner in Hadoop does not guarantee that amounts which are close to each other will arrive at the same reducer
● We need a custom partitioner

Slide 46

Custom Partitioner for Sorting

[Figure: each Input Split feeds a Mapper; a Custom Partitioner routes each mapper output by amount range (0 to 1L, 1L to 5L, 5L and above) to one of three Reducers, each of which produces a sorted output.]

Slide 47

MapReduce Structure
● Input to the mapper will be a list of (LongWritable, Text) pairs
● Mapper outputs are (Amount, (PAN Card, Date)) pairs
● Mapper outputs with amounts in the same range will arrive at the same reducer
● The input to each reducer will be sorted by amount:

  (Amount1, [(PAN Card1, Date1), (PAN Card2, Date2), …])
  (Amount2, [(PAN Card3, Date3), (PAN Card4, Date4), …])
  …
  (AmountN, [(PAN Card5, Date5), (PAN Card6, Date6), …])

● Each reducer will output all the transactions it receives
● The outputs of all reducers can be concatenated to get the sorted data
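
A sketch of the range partitioner from the figure on the previous slide, assuming the mapper output key is the amount as a DoubleWritable; the cut points (1 lakh and 5 lakh) and the three-reducer setup are illustrative (imports omitted, as on the other slides):

  public static class AmountRangePartitioner
      implements Partitioner<DoubleWritable, Text> {
    public void configure(JobConf job) {}
    public int getPartition(DoubleWritable amount, Text value,
                            int numPartitions) {
      // Assumes numPartitions == 3, matching the figure
      double a = amount.get();
      if (a < 100000) return 0;  // 0 to 1L
      if (a < 500000) return 1;  // 1L to 5L
      return 2;                  // 5L and above
    }
  }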

Slide 48

Reasons for Hadoop Popularity
● Ease of use
  ○ Hadoop takes care of the challenges of distributed computing
  ○ The user can focus on the data processing logic
  ○ The same program can be executed on a 10 machine cluster or a 1000 machine cluster
● MapReduce is flexible
  ○ A large class of problems can be expressed as MapReduce computations
● Scale out
  ○ Can scale to clusters having thousands of machines
  ○ Can handle web-scale data

Slide 49

Learning resources
● Introduction to Hadoop and MapReduce, MOOC from Udacity, https://www.udacity.com/courses/ud617
● Hadoop Tutorial from Yahoo!, https://developer.yahoo.com/hadoop/tutorial/
● Hadoop: The Definitive Guide, Tom White, O'Reilly Media, 2012
● Hadoop in Action, Chuck Lam, Manning Publications, 2010
● MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly Media, 2012

Slide 50

Attribution Some figures were taken from the "Hadoop Tutorial from Yahoo!" by Yahoo! Inc. which is licensed under a Creative Commons Attribution 3.0 Unported License. No changes were made to the figures. https://creativecommons.org/licenses/by/3.0/

Slide 51

Thanks for your attention!