Slide 1

Hadoop 101
Mohit Soni, eBay Inc.
BarCamp Chennai - 5

Slide 2

About Me
• I work as a Software Engineer at eBay
• Worked on large-scale data processing with eBay Research Labs

Slide 3

First Things First

Slide 4

MapReduce
• Inspired by the functional operations map and reduce
• Functional operations do not modify data; they generate new data
• The original data remains unmodified

Slide 5

Functional Operations (Python code)

Map:
    def sqr(n):
        return n * n
    nums = [1, 2, 3, 4]
    map(sqr, nums) -> [1, 4, 9, 16]

Reduce:
    def add(i, j):
        return i + j
    reduce(add, nums) -> 10

MapReduce:
    def MapReduce(data, mapper, reducer):
        return reduce(reducer, map(mapper, data))
    MapReduce(nums, sqr, add) -> 30
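The slide's examples are Python 2 style. In Python 3, `map` returns a lazy iterator and `reduce` lives in `functools`, so a runnable version of the same examples looks like this (the variable is renamed `nums` to avoid shadowing the built-in `list`):

```python
from functools import reduce

def sqr(n):
    return n * n

def add(i, j):
    return i + j

def map_reduce(data, mapper, reducer):
    # reduce folds the mapped values into a single result
    return reduce(reducer, map(mapper, data))

nums = [1, 2, 3, 4]
print(list(map(sqr, nums)))        # [1, 4, 9, 16]
print(reduce(add, nums))           # 10
print(map_reduce(nums, sqr, add))  # 30
```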

Slide 7

Slide 7 text

• Framework for large-scale data processing • Based on Google’s MapReduce and GFS • An Apache Software Foundation project • Open Source! • Written in Java • Oh, btw What is Hadoop ? BarCamp Chennai - 5 Mohit Soni

Slide 8

Slide 8 text

• Need to process lots of data (PetaByte scale) • Need to parallelize processing across multitude of CPUs • Achieves above while KeepIng Software Simple • Gives scalability with low-cost commodity hardware Why Hadoop ? BarCamp Chennai - 5 Mohit Soni

Slide 9

Hadoop fans
(Source: Hadoop Wiki)

Slide 10

When to use and not use Hadoop?

Hadoop is a good choice for:
• Indexing data
• Log analysis
• Image manipulation
• Sorting large-scale data
• Data mining

Hadoop is not a good choice:
• For real-time processing
• For processing-intensive tasks with little data
• If you have a Jaguar or RoadRunner (supercomputer) in stock

Slide 11

Slide 11 text

• Hadoop Distributed File System • Based on Google’s GFS (Google File System) • Write once read many access model • Fault tolerant • Efficient for batch-processing HDFS – Overview BarCamp Chennai - 5 Mohit Soni

Slide 12

Slide 12 text

• HDFS splits input data into blocks • Block size in HDFS: 64/128MB (configurable) • Block size *nix: 4KB HDFS – Blocks BarCamp Chennai - 5 Mohit Soni Block 1 Block 2 Block 3 Input Data
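The split is plain ceiling division; the last block may be only partially filled. A small sketch (the function name and defaults here are illustrative, not part of any Hadoop API):

```python
import math

def num_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Number of HDFS blocks needed for a file (last block may be partial)."""
    return math.ceil(file_size_bytes / block_size_bytes)

one_gb = 1024 * 1024 * 1024
print(num_blocks(one_gb))                     # 16 blocks of 64 MB
print(num_blocks(one_gb, 128 * 1024 * 1024))  # 8 blocks of 128 MB
```

Larger blocks mean fewer blocks per file, which keeps the NameNode's metadata small and lets each map task stream a big sequential chunk — the reason HDFS blocks are thousands of times larger than filesystem blocks.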

Slide 13

Slide 13 text

• Blocks are replicated across nodes to handle hardware failure • Node failure is handled gracefully, without loss of data HDFS – Replication BarCamp Chennai - 5 Mohit Soni Block 1 Block 2 Block 1 Block 3 Block 2 Block 3

Slide 14

Slide 14 text

HDFS – Architecture BarCamp Chennai - 5 Mohit Soni NameNode Client Cluster DataNodes

Slide 15

Slide 15 text

• NameNode (Master) – Manages filesystem metadata – Manages replication of blocks – Manages read/write access to files • Metadata – List of files – List of blocks that constitutes a file – List of DataNodes on which blocks reside, etc • Single Point of Failure (candidate for spending $$) HDFS – NameNode BarCamp Chennai - 5 Mohit Soni
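A toy model of that metadata (a deliberately simplified sketch — real HDFS persists this in an fsimage plus edit log, and all names below are invented for illustration):

```python
# A file maps to an ordered list of block IDs, and each block ID
# maps to the DataNodes currently holding a replica of it.
file_to_blocks = {
    "/logs/2011-02-05.log": ["blk_1", "blk_2", "blk_3"],
}
block_to_datanodes = {
    "blk_1": ["datanode-a", "datanode-b"],
    "blk_2": ["datanode-b", "datanode-c"],
    "blk_3": ["datanode-a", "datanode-c"],
}

def locate(path):
    """Return, per block, the DataNodes a client could read from."""
    return [block_to_datanodes[b] for b in file_to_blocks[path]]

print(locate("/logs/2011-02-05.log"))
```

Note that the file data itself never passes through the NameNode: clients ask it *where* blocks live, then read and write directly against the DataNodes.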

Slide 16

Slide 16 text

• DataNode (Slave) – Contains actual data – Manages data blocks – Informs NameNode about block IDs stored – Client read/write data blocks from DataNode – Performs block replication as instructed by NameNode • Block Replication – Supports various pluggable replication strategies – Clients read blocks from nearest DataNode • Data Pipelining – Client write block to first DataNode – First DataNode forwards data to next DataNode in pipeline – When block is replicated across all replicas, next block is chosen HDFS – DataNode BarCamp Chennai - 5 Mohit Soni

Slide 17

Slide 17 text

Hadoop - Architecture BarCamp Chennai - 5 Mohit Soni JobTracker TaskTracker TaskTracker NameNode DataNode DataNode DataNode DataNode DataNode DataNode User

Slide 18

Slide 18 text

• JobTracker (Master) – 1 Job Tracker per cluster – Accepts job requests from users – Schedule Map and Reduce tasks for TaskTrackers – Monitors tasks and TaskTrackers status – Re-execute task on failure • TaskTracker (Slave) – Multiple TaskTrackers in a cluster – Run Map and Reduce tasks Hadoop - Terminology BarCamp Chennai - 5 Mohit Soni

Slide 19

Slide 19 text

Input Data Input Map Shuffle + Sort Reduce Output Map Map Map Output Data Reduce Reduce BarCamp Chennai - 5 Mohit Soni MapReduce – Flow

Slide 20

Slide 20 text

Word Count Hadoop’s HelloWorld BarCamp Chennai - 5 Mohit Soni

Slide 21

Slide 21 text

• Input – Text files • Output – Single file containing (Word Count) • Map Phase – Generates (Word, Count) pairs – [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}] • Reduce Phase – For each word, calculates aggregate – [{a,7}, {b,5}, {c,6}] BarCamp Chennai - 5 Mohit Soni Word Count Example
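The phases above can be simulated in a few lines of plain Python (a sketch of the data flow, not the Hadoop API); the shuffle-and-sort step is modeled by grouping the emitted pairs by key:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    return [(word, 1) for doc in documents for word in doc.split()]

def shuffle(pairs):
    """Group values by key, as the shuffle + sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Three toy "documents" whose per-word totals match the slide's example
docs = ["a b a", "a a b b b c c c c c", "a a a b c"]
print(reduce_phase(shuffle(map_phase(docs))))  # {'a': 7, 'b': 5, 'c': 6}
```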

Slide 22

Slide 22 text

public class WordCountMapper extends MapReduceBase implements Mapper { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector out, Reporter reporter) throws Exception { String l = value.toString(); StringTokenizer t = new StringTokenizer(l); while(t.hasMoreTokens()) { word.set(t.nextToken()); out.collect(word, one); } } } BarCamp Chennai - 5 Mohit Soni Word Count – Mapper

Slide 23

Slide 23 text

public class WordCountReducer extends MapReduceBase implements Reducer { public void reduce(Text key, Iterator values, OutputCollector out, Reporter reporter) throws Exception { int sum = 0; while(values.hasNext()) { sum += values.next().get(); } out.collect(key, new IntWritable(sum)); } } BarCamp Chennai - 5 Mohit Soni Word Count – Reducer

Slide 24

Slide 24 text

public class WordCountConfig { public static void main(String[] args) throws Exception { if (args.length() != 2) { System.exit(1); } JobConf conf = new JobConf(WordCountConfig.class); conf.setJobName(“Word Counter”); FileInputFormat.addInputPath(conf, new Path(args[0]); FileInputFormat.addOutputPath(conf, new Path(args[1])); conf.setMapperClass(WordCountMapper.class); conf.setCombinerClass(WordCountReducer.class); conf.setReducerClass(WordCountReducer.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); JobClient.runJob(conf); } } BarCamp Chennai - 5 Mohit Soni Word Count – Config

Slide 25

Slide 25 text

• http://hadoop.apache.org/ • Jeffrey Dean and Sanjay Ghemwat, MapReduce: Simplified Data Processing on Large Clusters • Tom White, Hadoop: The Definitive Guide, O’Reilly • Setting up a Single-Node Cluster: http://bit.ly/glNzs4 • Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP Diving Deeper BarCamp Chennai - 5 Mohit Soni

Slide 26

Slide 26 text

• Follow me on twitter @mohitsoni • http://mohitsoni.com/ Catching-Up BarCamp Chennai - 5 Mohit Soni