Hadoop 101

mohit
December 04, 2010

Slides from BarCamp

Transcript

  1. Hadoop 101

    Mohit Soni, eBay Inc.
    BarCamp Chennai - 5
  2. About Me

    • I work as a Software Engineer at eBay
    • Worked on large-scale data processing with eBay Research Labs
  3. First Things First

  4. MapReduce

    • Inspired by functional operations
      – Map
      – Reduce
    • Functional operations do not modify data; they generate new data
    • Original data remains unmodified
  5. Functional Operations (Python code)

    Map

        def sqr(n): return n * n
        nums = [1, 2, 3, 4]
        list(map(sqr, nums))       # [1, 4, 9, 16]

    Reduce

        from functools import reduce   # needed on Python 3; a builtin on Python 2
        def add(i, j): return i + j
        nums = [1, 2, 3, 4]
        reduce(add, nums)          # 10

    MapReduce

        def MapReduce(data, mapper, reducer):
            return reduce(reducer, map(mapper, data))
        MapReduce(nums, sqr, add)  # 30
  7. What is Hadoop?

    • Framework for large-scale data processing
    • Based on Google’s MapReduce and GFS
    • An Apache Software Foundation project
    • Open Source!
    • Written in Java
    • Oh, btw
  8. Why Hadoop?

    • Need to process lots of data (petabyte scale)
    • Need to parallelize processing across a multitude of CPUs
    • Achieves the above while KeepIng Software Simple
    • Gives scalability with low-cost commodity hardware
  9. Hadoop fans

    Source: Hadoop Wiki
  10. When to use and not use Hadoop?

    Hadoop is a good choice for:
    • Indexing data
    • Log analysis
    • Image manipulation
    • Sorting large-scale data
    • Data mining

    Hadoop is not a good choice:
    • For real-time processing
    • For processing-intensive tasks with little data
    • If you have Jaguar or Roadrunner (supercomputers) in your stock
  11. HDFS – Overview

    • Hadoop Distributed File System
    • Based on Google’s GFS (Google File System)
    • Write-once, read-many access model
    • Fault tolerant
    • Efficient for batch processing
  12. HDFS – Blocks

    • HDFS splits input data into blocks
    • Block size in HDFS: 64/128 MB (configurable)
    • Block size on *nix: 4 KB

    [Diagram: Input Data split into Block 1, Block 2, Block 3]
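
The block arithmetic can be sketched in a few lines of Python. This is only an illustration: `split_into_blocks` is a hypothetical helper, not a Hadoop API, and it assumes the 64 MB block size quoted on the slide.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return (offset, length) pairs that cover a file of file_size bytes.

    Hypothetical helper for illustration; block_size defaults to the
    64 MB figure from the slide.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

blocks = split_into_blocks(200 * MB)
print(len(blocks))          # 4
print(blocks[-1][1] // MB)  # 8
```

A 200 MB input therefore maps to three full 64 MB blocks plus one 8 MB block.
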
  13. HDFS – Replication

    • Blocks are replicated across nodes to handle hardware failure
    • Node failure is handled gracefully, without loss of data

    [Diagram: Blocks 1, 2 and 3, each stored twice across the nodes]
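
Why replication survives a node failure can be shown with a toy placement. The sketch below is an assumption-laden simplification: it places replicas round-robin, whereas real HDFS uses its own placement policy, and the node names and both helper functions are made up for the example.

```python
def place_replicas(block_ids, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy policy for illustration only; not how HDFS chooses nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement

def after_failure(placement, failed_node):
    """Return the replicas that remain once failed_node is gone."""
    return {block: [n for n in holders if n != failed_node]
            for block, holders in placement.items()}

placement = place_replicas([1, 2, 3], ["node-a", "node-b", "node-c"])
remaining = after_failure(placement, "node-b")
print(remaining)  # {1: ['node-a'], 2: ['node-c'], 3: ['node-c', 'node-a']}
```

Every block still has at least one live copy after `node-b` fails, which is the property the slide describes.
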
  14. HDFS – Architecture

    [Diagram: Client, NameNode, and a cluster of DataNodes]
  15. HDFS – NameNode

    • NameNode (Master)
      – Manages filesystem metadata
      – Manages replication of blocks
      – Manages read/write access to files
    • Metadata
      – List of files
      – List of blocks that constitute a file
      – List of DataNodes on which blocks reside, etc.
    • Single Point of Failure (candidate for spending $$)
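
The metadata listed above amounts to two lookup tables. The following is a minimal sketch of that idea, not the real NameNode: `ToyNameNode`, its method names, the path, and the block and node IDs are all hypothetical.

```python
class ToyNameNode:
    """Illustrative in-memory metadata store; not the real NameNode."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode names

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def report_block(self, datanode, block_id):
        # Record that a DataNode holds a copy of this block.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return [(block ID, sorted DataNode names)] for a file."""
        return [(b, sorted(self.block_locations.get(b, set())))
                for b in self.file_blocks[path]]

nn = ToyNameNode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"])
nn.report_block("dn1", "blk_1")
nn.report_block("dn2", "blk_1")
nn.report_block("dn2", "blk_2")
print(nn.locate("/logs/day1"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn2'])]
```

Because both tables live in one process in this sketch, losing that process loses the mapping from files to blocks, which is one way to picture the single point of failure.
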
  16. HDFS – DataNode

    • DataNode (Slave)
      – Contains actual data
      – Manages data blocks
      – Informs NameNode about block IDs stored
      – Clients read and write data blocks directly from and to DataNodes
      – Performs block replication as instructed by NameNode
    • Block Replication
      – Supports various pluggable replication strategies
      – Clients read blocks from the nearest DataNode
    • Data Pipelining
      – Client writes a block to the first DataNode
      – First DataNode forwards data to the next DataNode in the pipeline
      – When the block has been written to all replicas, the next block is chosen
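
The data-pipelining steps above can be sketched as a chain of forwards. This is a simplified model under stated assumptions: `ToyDataNode` and `write_blocks` are hypothetical names, whole blocks are passed in memory, and acknowledgements and failure handling are left out.

```python
class ToyDataNode:
    """Illustrative DataNode that stores blocks in a dict."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        # Store the block locally, then forward it along the pipeline.
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data, downstream[1:])

def write_blocks(blocks, pipeline):
    """Client side: send each block to the first DataNode only."""
    for block_id, data in blocks:
        # The call returns once every replica holds the block,
        # so the next block starts only after that.
        pipeline[0].receive(block_id, data, pipeline[1:])

pipeline = [ToyDataNode("dn1"), ToyDataNode("dn2"), ToyDataNode("dn3")]
write_blocks([("blk_1", b"aaa"), ("blk_2", b"bbb")], pipeline)
print([sorted(dn.blocks) for dn in pipeline])
# [['blk_1', 'blk_2'], ['blk_1', 'blk_2'], ['blk_1', 'blk_2']]
```

The client talks only to the first DataNode; the replicas are filled by node-to-node forwarding.
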
  17. Hadoop – Architecture

    [Diagram: User, JobTracker, TaskTrackers, NameNode, DataNodes]
  18. Hadoop – Terminology

    • JobTracker (Master)
      – One JobTracker per cluster
      – Accepts job requests from users
      – Schedules Map and Reduce tasks for TaskTrackers
      – Monitors the status of tasks and TaskTrackers
      – Re-executes tasks on failure
    • TaskTracker (Slave)
      – Multiple TaskTrackers in a cluster
      – Runs Map and Reduce tasks
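
Re-execution on failure can be pictured as a retry loop over the available TaskTrackers. The sketch below is an assumption, not the JobTracker's actual scheduling logic: trackers are modelled as plain callables, and `run_task` and the `max_attempts` value are hypothetical.

```python
def run_task(task, trackers, max_attempts=4):
    """Try a task on successive trackers until one attempt succeeds.

    Hypothetical scheduler for illustration; max_attempts is an
    arbitrary value chosen for the example.
    """
    errors = []
    for attempt in range(max_attempts):
        tracker = trackers[attempt % len(trackers)]
        try:
            return tracker(task)
        except Exception as exc:  # a failed attempt; try the next tracker
            errors.append(exc)
    raise RuntimeError("task %r failed after %d attempts" % (task, max_attempts))

def flaky_tracker(task):
    raise IOError("tracker lost")

def healthy_tracker(task):
    return task.upper()

print(run_task("map-0", [flaky_tracker, healthy_tracker]))  # MAP-0
```

The first attempt fails on `flaky_tracker`, and the task is rerun on `healthy_tracker` without the caller noticing.
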
  19. MapReduce – Flow

    [Diagram: Input Data → Map tasks → Shuffle + Sort → Reduce tasks → Output Data]
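
The three stages of the flow can be imitated in a single process. This is a minimal sketch, assuming all data fits in memory: `run_mapreduce` and the sample mapper and reducer are hypothetical, and nothing here is distributed.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of Map -> Shuffle + Sort -> Reduce."""
    # Map: each record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle + Sort: order pairs by key so equal keys sit together.
    pairs.sort(key=itemgetter(0))
    # Reduce: call the reducer once per key with all of its values.
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

def first_letter_and_length(line):
    for word in line.split():
        yield (word[0], len(word))

def longest(key, values):
    return (key, max(values))

lines = ["apple avocado", "banana", "blueberry apricot"]
print(run_mapreduce(lines, first_letter_and_length, longest))
# [('a', 7), ('b', 9)]
```

The sort step is what guarantees that each reducer call sees every value for its key in one group.
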
  20. Word Count: Hadoop’s HelloWorld

  21. Word Count Example

    • Input
      – Text files
    • Output
      – Single file containing (Word <TAB> Count)
    • Map Phase
      – Generates (Word, Count) pairs
      – [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]
    • Reduce Phase
      – For each word, calculates aggregate
      – [{a,7}, {b,5}, {c,6}]
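
The numbers on this slide can be checked directly. The sketch below takes the three map outputs as given and sums the counts per word; `reduce_counts` and `format_output` are hypothetical helper names, not Hadoop APIs.

```python
from collections import defaultdict

# The three map-phase outputs shown on the slide.
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

def reduce_counts(outputs):
    """Sum the counts for each word across all map outputs."""
    totals = defaultdict(int)
    for output in outputs:
        for word, count in output:
            totals[word] += count
    return sorted(totals.items())

def format_output(totals):
    """Render the result as Word <TAB> Count lines."""
    return "\n".join("%s\t%d" % (word, count) for word, count in totals)

totals = reduce_counts(map_outputs)
print(totals)  # [('a', 7), ('b', 5), ('c', 6)]
print(format_output(totals))
```

The totals match the reduce-phase result on the slide: a is 1 + 1 + 2 + 3 = 7, b is 1 + 3 + 1 = 5, and c is 5 + 1 = 6.
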
  22. Word Count – Mapper

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;

        public class WordCountMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // Emits (word, 1) for every whitespace-separated token in the line.
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> out,
                            Reporter reporter) throws IOException {
                String l = value.toString();
                StringTokenizer t = new StringTokenizer(l);
                while (t.hasMoreTokens()) {
                    word.set(t.nextToken());
                    out.collect(word, one);
                }
            }
        }
  23. Word Count – Reducer

        import java.io.IOException;
        import java.util.Iterator;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;

        public class WordCountReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {

            // Sums all counts received for a word and emits (word, total).
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> out,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                out.collect(key, new IntWritable(sum));
            }
        }
  24. Word Count – Config

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;

        public class WordCountConfig {
            public static void main(String[] args) throws Exception {
                // Expect exactly two arguments: input path and output path.
                if (args.length != 2) {
                    System.exit(1);
                }
                JobConf conf = new JobConf(WordCountConfig.class);
                conf.setJobName("Word Counter");
                FileInputFormat.addInputPath(conf, new Path(args[0]));
                FileOutputFormat.setOutputPath(conf, new Path(args[1]));
                conf.setMapperClass(WordCountMapper.class);
                conf.setCombinerClass(WordCountReducer.class);
                conf.setReducerClass(WordCountReducer.class);
                conf.setInputFormat(TextInputFormat.class);
                conf.setOutputFormat(TextOutputFormat.class);
                // Output types must be declared; the defaults do not match (Text, IntWritable).
                conf.setOutputKeyClass(Text.class);
                conf.setOutputValueClass(IntWritable.class);
                JobClient.runJob(conf);
            }
        }
  25. Diving Deeper

    • http://hadoop.apache.org/
    • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
    • Tom White, Hadoop: The Definitive Guide, O’Reilly
    • Setting up a Single-Node Cluster: http://bit.ly/glNzs4
    • Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP
  26. Catching-Up

    • Follow me on Twitter @mohitsoni
    • http://mohitsoni.com/