Hadoop 101

Mohit Soni
eBay Inc.
BarCamp Chennai - 5

About Me
• I work as a Software Engineer at eBay
• Worked on large-scale data processing with eBay Research Labs

First Things First

MapReduce
• Inspired by functional operations
  – Map
  – Reduce
• Functional operations do not modify data; they generate new data
• Original data remains unmodified

Functional Operations (Python code)

Map
    def sqr(n):
        return n * n

    numbers = [1, 2, 3, 4]
    list(map(sqr, numbers)) -> [1, 4, 9, 16]

Reduce
    from functools import reduce  # reduce is not a builtin in Python 3

    def add(i, j):
        return i + j

    reduce(add, numbers) -> 10

MapReduce
    def MapReduce(data, mapper, reducer):
        # Map every element first, then fold the mapped values into one result.
        return reduce(reducer, map(mapper, data))

    MapReduce(numbers, sqr, add) -> 30


What is Hadoop?
• Framework for large-scale data processing
• Based on Google’s MapReduce and GFS
• An Apache Software Foundation project
• Open Source!
• Written in Java
• Oh, btw

Why Hadoop?
• Need to process lots of data (petabyte scale)
• Need to parallelize processing across a multitude of CPUs
• Achieves the above while KeepIng Software Simple
• Gives scalability with low-cost commodity hardware

Hadoop fans
Source: Hadoop Wiki

When to use and not use Hadoop?

Hadoop is a good choice for:
• Indexing data
• Log analysis
• Image manipulation
• Sorting large-scale data
• Data mining

Hadoop is not a good choice:
• For real-time processing
• For processing-intensive tasks with little data
• If you have Jaguar or RoadRunner in your stock

HDFS – Overview
• Hadoop Distributed File System
• Based on Google’s GFS (Google File System)
• Write-once, read-many access model
• Fault tolerant
• Efficient for batch processing

HDFS – Blocks
• HDFS splits input data into blocks
• Block size in HDFS: 64/128 MB (configurable)
• Block size in *nix: 4 KB
[Diagram: Input Data split into Block 1, Block 2, Block 3]
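The splitting step can be illustrated with a minimal Python sketch. This is not HDFS code: `split_into_blocks` is a hypothetical helper, and the 4-byte block size is chosen only so the output is readable.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Toy sizes for readability; the slide quotes 64 or 128 MB for HDFS.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
print(blocks)  # [b'abcd', b'efgh', b'ij']
```

Only the last block may be shorter than the block size, and concatenating the blocks in order gives back the input.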

HDFS – Replication
• Blocks are replicated across nodes to handle hardware failure
• Node failure is handled gracefully, without loss of data
[Diagram: copies of Block 1, Block 2 and Block 3 spread across nodes, two copies of each]
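A rough sketch of why replication survives a node failure, assuming a simple round-robin placement with two copies per block. The placement rule, the node names and both helper functions are assumptions made for illustration; this is not HDFS's actual placement policy.

```python
def place_replicas(block_ids, nodes, replication=2):
    """Round-robin placement: copy k of block i goes to node (i + k) mod len(nodes)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {node: [] for node in nodes}
    for i, block in enumerate(block_ids):
        for k in range(replication):
            placement[nodes[(i + k) % len(nodes)]].append(block)
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks still readable from some other node after failed_node goes down."""
    return {
        block
        for node, blocks in placement.items()
        if node != failed_node
        for block in blocks
    }


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
all_blocks = ["Block 1", "Block 2", "Block 3"]
placement = place_replicas(all_blocks, nodes)

# Losing any single node still leaves one copy of every block.
for failed in nodes:
    assert surviving_blocks(placement, failed) == set(all_blocks)
```

With two copies on distinct nodes, a single failure cannot remove both copies of a block, which is the property the slide describes.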

HDFS – Architecture
[Diagram: Client, NameNode, and a cluster of DataNodes]

HDFS – NameNode
• NameNode (Master)
  – Manages filesystem metadata
  – Manages replication of blocks
  – Manages read/write access to files
• Metadata
  – List of files
  – List of blocks that constitute a file
  – List of DataNodes on which blocks reside, etc.
• Single Point of Failure (candidate for spending $$)
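The metadata listed above can be pictured as two lookup tables. The class below is a toy model under that assumption; `ToyNameNode`, its method names, the file path and the block and node IDs are all hypothetical and do not mirror the real NameNode.

```python
class ToyNameNode:
    """Holds metadata only: file -> block IDs, and block ID -> DataNodes."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode names

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def report_block(self, block_id, datanode):
        """Record that a DataNode says it stores this block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return (block ID, sorted DataNode names) pairs in file order."""
        return [
            (block, sorted(self.block_locations.get(block, ())))
            for block in self.file_blocks[path]
        ]


namenode = ToyNameNode()
namenode.add_file("/logs/day1.txt", ["blk_1", "blk_2"])
namenode.report_block("blk_1", "dn1")
namenode.report_block("blk_1", "dn2")
namenode.report_block("blk_2", "dn2")
print(namenode.locate("/logs/day1.txt"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn2'])]
```

The model stores no file contents at all, which matches the slide: the NameNode answers "where are the blocks", and the data itself lives elsewhere.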

HDFS – DataNode
• DataNode (Slave)
  – Contains actual data
  – Manages data blocks
  – Informs NameNode about the block IDs it stores
  – Clients read/write data blocks from DataNodes
  – Performs block replication as instructed by NameNode
• Block Replication
  – Supports various pluggable replication strategies
  – Clients read blocks from the nearest DataNode
• Data Pipelining
  – Client writes a block to the first DataNode
  – First DataNode forwards data to the next DataNode in the pipeline
  – When the block is replicated across all replicas, the next block is chosen
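The data pipelining steps can be sketched as follows. This is a simplified in-memory model, not the HDFS write protocol: `pipeline_write`, the node names and the block IDs are hypothetical, and a whole block is forwarded in one step.

```python
def pipeline_write(blocks, pipeline):
    """Write blocks one at a time through an ordered pipeline of DataNodes.

    Returns the per-node storage and a log of (sender, receiver, block ID) hops.
    """
    storage = {node: {} for node in pipeline}
    transfers = []
    for block_id, payload in blocks:
        sender = "client"  # the client only talks to the first DataNode
        for node in pipeline:
            transfers.append((sender, node, block_id))
            storage[node][block_id] = payload
            sender = node  # this node forwards the block to the next one
        # The loop moves to the next block only after every replica has this one.
    return storage, transfers


storage, transfers = pipeline_write(
    [("blk_1", b"aa"), ("blk_2", b"bb")],
    ["dn1", "dn2", "dn3"],
)
print(transfers[:3])
# [('client', 'dn1', 'blk_1'), ('dn1', 'dn2', 'blk_1'), ('dn2', 'dn3', 'blk_1')]
```

The transfer log shows the point of the design: the client sends each block once, and the DataNodes carry the replication traffic among themselves.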

Hadoop - Architecture
[Diagram: User, JobTracker, TaskTrackers, NameNode, and DataNodes]

Hadoop - Terminology
• JobTracker (Master)
  – 1 JobTracker per cluster
  – Accepts job requests from users
  – Schedules Map and Reduce tasks for TaskTrackers
  – Monitors task and TaskTracker status
  – Re-executes tasks on failure
• TaskTracker (Slave)
  – Multiple TaskTrackers in a cluster
  – Run Map and Reduce tasks
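The "re-execute on failure" behaviour can be sketched like this. The round-robin assignment, the retry limit, the tracker names and the `run_task` callback are assumptions for illustration; the real JobTracker scheduler is considerably more involved.

```python
def run_with_retries(tasks, trackers, run_task, max_attempts=3):
    """Assign tasks round-robin and retry a failed task on the next tracker."""
    results = {}
    for i, task in enumerate(tasks):
        for attempt in range(max_attempts):
            tracker = trackers[(i + attempt) % len(trackers)]
            try:
                results[task] = run_task(tracker, task)
                break  # task succeeded, move on to the next task
            except RuntimeError:
                continue  # task failed here; re-execute on another tracker
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results


def flaky_run(tracker, task):
    """Hypothetical task runner in which tracker 'tt1' always fails."""
    if tracker == "tt1":
        raise RuntimeError("tracker down")
    return f"{task} done on {tracker}"


results = run_with_retries(["map-0", "map-1"], ["tt1", "tt2"], flaky_run)
print(results)
# {'map-0': 'map-0 done on tt2', 'map-1': 'map-1 done on tt2'}
```

Here `map-0` is first sent to the failing tracker and then re-executed on `tt2`, so the job still completes.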

MapReduce – Flow
Input -> Map -> Shuffle + Sort -> Reduce -> Output
[Diagram: Input Data feeds several Map tasks; after Shuffle + Sort, several Reduce tasks produce Output Data]
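The flow above can be imitated on a single machine in a few lines of Python. This is only a sketch of the three stages, not Hadoop's execution model; `run_mapreduce`, `word_mapper` and `sum_reducer` are hypothetical names.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle + sort, and reduce sequentially in one process."""
    # Map: each input record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]

    # Shuffle + sort: group all values by key, then visit keys in sorted order.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: collapse the values of each key into one output value.
    return [(key, reducer(key, groups[key])) for key in sorted(groups)]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    return sum(values)


output = run_mapreduce(["a b a", "b c"], word_mapper, sum_reducer)
print(output)  # [('a', 2), ('b', 2), ('c', 1)]
```

In a cluster the Map and Reduce calls run in parallel on different nodes, and the shuffle moves data over the network. The sketch keeps only the data flow.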

Word Count
Hadoop’s HelloWorld

Word Count Example
• Input
  – Text files
• Output
  – Single file containing (Word <TAB> Count)
• Map Phase
  – Generates (Word, Count) pairs
  – [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]
• Reduce Phase
  – For each word, calculates aggregate
  – [{a,7}, {b,5}, {c,6}]
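The numbers in this example can be checked by hand or with a short script. The sketch below simply sums the three map outputs from the slide per word; the variable names are illustrative.

```python
from collections import Counter

# The three map outputs shown on the slide, as (word, count) pairs.
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

# Reduce phase: for each word, add up every count emitted for it.
totals = Counter()
for pairs in map_outputs:
    for word, count in pairs:
        totals[word] += count

print(sorted(totals.items()))  # [('a', 7), ('b', 5), ('c', 6)]

# Output format from the slide: Word <TAB> Count, one word per line.
lines = [f"{word}\t{count}" for word, count in sorted(totals.items())]
print("\n".join(lines))
```

For example, a = 1 + 1 + 2 + 3 = 7, which matches the reduce output on the slide.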

Word Count – Mapper
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input line; emits (word, 1) for every token.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out,
                        Reporter reporter) throws IOException {
            String l = value.toString();
            StringTokenizer t = new StringTokenizer(l);
            while (t.hasMoreTokens()) {
                word.set(t.nextToken());
                out.collect(word, one);
            }
        }
    }

Word Count – Reducer
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Called once per word; sums all counts collected for that word.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }

Word Count – Config
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WordCountConfig {
        public static void main(String[] args) throws Exception {
            // Expect exactly two arguments: input path and output path.
            if (args.length != 2) {
                System.exit(1);
            }
            JobConf conf = new JobConf(WordCountConfig.class);
            conf.setJobName("Word Counter");
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            conf.setMapperClass(WordCountMapper.class);
            conf.setCombinerClass(WordCountReducer.class);
            conf.setReducerClass(WordCountReducer.class);
            // Not on the original slide: declare the (Text, IntWritable) output types.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            JobClient.runJob(conf);
        }
    }

Diving Deeper
• http://hadoop.apache.org/
• Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
• Tom White, Hadoop: The Definitive Guide, O’Reilly
• Setting up a Single-Node Cluster: http://bit.ly/glNzs4
• Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP

Catching-Up
• Follow me on twitter @mohitsoni
• http://mohitsoni.com/
