
Hadoop 101

mohit
December 04, 2010


Slides from BarCamp


Transcript

  1. Hadoop 101
     Mohit Soni, eBay Inc.
     BarCamp Chennai - 5
  2. About Me
     • I work as a Software Engineer at eBay
     • Worked on large-scale data processing with eBay Research Labs
  3. First Things First

  4. MapReduce
     • Inspired by functional operations: Map and Reduce
     • Functional operations do not modify data; they generate new data
     • The original data remains unmodified
  5. Functional Operations (Python code)

     Map:
         def sqr(n): return n * n
         list = [1,2,3,4]
         map(sqr, list) -> [1,4,9,16]

     Reduce:
         def add(i, j): return i + j
         list = [1,2,3,4]
         reduce(add, list) -> 10

     MapReduce:
         def MapReduce(data, mapper, reducer):
             return reduce(reducer, map(mapper, data))
         MapReduce(list, sqr, add) -> 30
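The slide's snippets are Python 2 style; a runnable Python 3 version of the same idea (where map() returns an iterator and reduce() lives in functools) looks like this:

```python
from functools import reduce

def sqr(n):
    return n * n

def add(i, j):
    return i + j

def map_reduce(data, mapper, reducer):
    # The original data is never modified; map() and reduce() produce new values.
    return reduce(reducer, map(mapper, data))

data = [1, 2, 3, 4]
print(list(map(sqr, data)))        # [1, 4, 9, 16]
print(reduce(add, data))           # 10
print(map_reduce(data, sqr, add))  # 30
```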
  6. [image slide]

  7. What is Hadoop?
     • Framework for large-scale data processing
     • Based on Google's MapReduce and GFS
     • An Apache Software Foundation project
     • Open source!
     • Written in Java
  8. Why Hadoop?
     • Need to process lots of data (petabyte scale)
     • Need to parallelize processing across a multitude of CPUs
     • Achieves the above while keeping the software simple
     • Gives scalability with low-cost commodity hardware
  9. Hadoop fans (Source: Hadoop Wiki)
  10. When to use and when not to use Hadoop?
      Hadoop is a good choice for:
      • Indexing data
      • Log analysis
      • Image manipulation
      • Sorting large-scale data
      • Data mining
      Hadoop is not a good choice:
      • For real-time processing
      • For processing-intensive tasks with little data
      • If you have a Jaguar or a RoadRunner in your stock
  11. HDFS – Overview
      • Hadoop Distributed File System
      • Based on Google's GFS (Google File System)
      • Write-once, read-many access model
      • Fault tolerant
      • Efficient for batch processing
  12. HDFS – Blocks
      • HDFS splits input data into blocks
      • Block size in HDFS: 64/128 MB (configurable)
      • Block size on *nix: 4 KB
      [Diagram: Input Data split into Block 1, Block 2, Block 3]
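To make the splitting concrete, here is a small sketch (not HDFS's actual logic) of how a file breaks into fixed-size blocks, with the last block holding the remainder:

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Simplified illustration of HDFS-style splitting: a file is stored
    as fixed-size blocks, and the final block holds only the leftover bytes."""
    n_full, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * n_full
    if remainder:
        blocks.append(remainder)
    return blocks

# A 200 MB file with 64 MB blocks occupies three full blocks plus one partial block.
print(split_into_blocks(200))  # [64, 64, 64, 8]
```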
  13. HDFS – Replication
      • Blocks are replicated across nodes to handle hardware failure
      • Node failure is handled gracefully, without loss of data
      [Diagram: three nodes holding Block 1, Block 2; Block 1, Block 3; Block 2, Block 3]
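A toy round-robin placement shows why replication survives a node failure. This is only a sketch with invented node names; real HDFS uses a rack-aware placement policy:

```python
from itertools import cycle

def place_replicas(blocks, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin.
    Illustrates the idea only; not HDFS's actual rack-aware strategy."""
    placement = {b: [] for b in blocks}
    node_iter = cycle(nodes)
    for b in blocks:
        while len(placement[b]) < replication:
            n = next(node_iter)
            if n not in placement[b]:
                placement[b].append(n)
    return placement

p = place_replicas(["Block 1", "Block 2", "Block 3"], ["node1", "node2", "node3"])
# Each block lives on 2 of the 3 nodes (matching the slide's diagram),
# so any single node can fail without losing data.
```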
  14. HDFS – Architecture
      [Diagram: Client, NameNode, and a cluster of DataNodes]
  15. HDFS – NameNode
      • NameNode (Master)
        – Manages filesystem metadata
        – Manages replication of blocks
        – Manages read/write access to files
      • Metadata
        – List of files
        – List of blocks that constitute a file
        – List of DataNodes on which blocks reside, etc.
      • Single point of failure (a candidate for spending $$)
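The NameNode's metadata can be pictured as two lookup tables: file to blocks, and block to DataNodes. A toy model (file, block, and node names are invented for illustration):

```python
# Toy model of NameNode metadata; not the real data structures.
namespace = {
    "/logs/2010-12-04.txt": ["blk_1", "blk_2"],
}
block_locations = {
    "blk_1": ["datanode1", "datanode2"],
    "blk_2": ["datanode2", "datanode3"],
}

def locate(path):
    """Answer a client's read request: for each block of the file,
    list the DataNodes it can be fetched from."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate("/logs/2010-12-04.txt"))
# [('blk_1', ['datanode1', 'datanode2']), ('blk_2', ['datanode2', 'datanode3'])]
```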
  16. HDFS – DataNode
      • DataNode (Slave)
        – Contains the actual data
        – Manages data blocks
        – Informs the NameNode about the block IDs it stores
        – Clients read/write data blocks from DataNodes
        – Performs block replication as instructed by the NameNode
      • Block Replication
        – Supports various pluggable replication strategies
        – Clients read blocks from the nearest DataNode
      • Data Pipelining
        – The client writes a block to the first DataNode
        – The first DataNode forwards the data to the next DataNode in the pipeline
        – When the block is replicated across all replicas, the next block is chosen
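The write pipeline described above can be mimicked in a few lines. This is a toy sketch; real HDFS streams each block to the pipeline in small packets rather than as one unit:

```python
def pipeline_write(block_data, pipeline):
    """Toy HDFS write pipeline: the client hands the block to the first
    DataNode; each DataNode keeps a copy and forwards the data to the
    next node, until every replica in the pipeline is written."""
    stored = {}
    for node in pipeline:          # client -> dn1 -> dn2 -> dn3
        stored[node] = block_data  # each node stores its copy, then forwards
    return stored

replicas = pipeline_write(b"block-1 bytes", ["dn1", "dn2", "dn3"])
print(sorted(replicas))  # ['dn1', 'dn2', 'dn3']
```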
  17. Hadoop – Architecture
      [Diagram: User, JobTracker, TaskTrackers, NameNode, and DataNodes]
  18. Hadoop – Terminology
      • JobTracker (Master)
        – 1 JobTracker per cluster
        – Accepts job requests from users
        – Schedules Map and Reduce tasks for TaskTrackers
        – Monitors task and TaskTracker status
        – Re-executes tasks on failure
      • TaskTracker (Slave)
        – Multiple TaskTrackers in a cluster
        – Run Map and Reduce tasks
  19. MapReduce – Flow
      [Diagram: Input Data -> Map tasks -> Shuffle + Sort -> Reduce tasks -> Output Data]
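The "Shuffle + Sort" step in the middle of the flow can be modeled in a few lines: it groups every (key, value) pair emitted by the mappers under its key, so each reducer sees one key with all of its values. A toy sketch:

```python
from collections import defaultdict

def shuffle_and_sort(mapped_pairs):
    """Toy model of the shuffle + sort phase: group mapper output
    by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

pairs = [("b", 1), ("a", 1), ("a", 1), ("c", 1), ("b", 1)]
print(shuffle_and_sort(pairs))  # [('a', [1, 1]), ('b', [1, 1]), ('c', [1])]
```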
  20. Word Count – Hadoop's HelloWorld

  21. Word Count Example
      • Input – text files
      • Output – a single file containing (Word <TAB> Count)
      • Map phase – generates (Word, Count) pairs
        – [{a,1}, {b,1}, {a,1}]  [{a,2}, {b,3}, {c,5}]  [{a,3}, {b,1}, {c,1}]
      • Reduce phase – for each word, calculates the aggregate
        – [{a,7}, {b,5}, {c,6}]
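Before the Java version on the next slides, the same two phases fit in a few lines of plain Python (a sketch of the idea, not Hadoop code):

```python
from collections import Counter

def map_phase(line):
    # Emit a (word, 1) pair for every token in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # For each word, sum up the emitted counts.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["a b a", "a b b c"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(pairs))  # {'a': 3, 'b': 3, 'c': 1}
```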
  22. Word Count – Mapper

      public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> out,
                          Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  out.collect(word, one);
              }
          }
      }
  23. Word Count – Reducer

      public class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> out,
                             Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              out.collect(key, new IntWritable(sum));
          }
      }
  24. Word Count – Config

      public class WordCountConfig {
          public static void main(String[] args) throws Exception {
              if (args.length != 2) {
                  System.exit(1);
              }
              JobConf conf = new JobConf(WordCountConfig.class);
              conf.setJobName("Word Counter");
              FileInputFormat.addInputPath(conf, new Path(args[0]));
              FileOutputFormat.setOutputPath(conf, new Path(args[1]));
              conf.setMapperClass(WordCountMapper.class);
              conf.setCombinerClass(WordCountReducer.class);
              conf.setReducerClass(WordCountReducer.class);
              conf.setInputFormat(TextInputFormat.class);
              conf.setOutputFormat(TextOutputFormat.class);
              JobClient.runJob(conf);
          }
      }
  25. Diving Deeper
      • http://hadoop.apache.org/
      • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
      • Tom White, Hadoop: The Definitive Guide, O'Reilly
      • Setting up a single-node cluster: http://bit.ly/glNzs4
      • Setting up a multi-node cluster: http://bit.ly/f5KqCP
  26. Catching Up
      • Follow me on twitter @mohitsoni
      • http://mohitsoni.com/