Hadoop 101

mohit
December 04, 2010

Slides from BarCamp

Transcript

  1. Hadoop 101
    Mohit Soni
    eBay Inc.
    BarCamp Chennai - 5 Mohit Soni


  2. About Me
    • I work as a Software Engineer at eBay
    • Worked on large-scale data processing with eBay Research Labs

  3. First Things First

  4. MapReduce
    • Inspired by functional operations
    – Map
    – Reduce
    • Functional operations do not modify data; they generate new data
    • Original data remains unmodified

  5. Functional Operations (Python code)
    Map
        def sqr(n):
            return n * n
        nums = [1, 2, 3, 4]
        list(map(sqr, nums)) -> [1, 4, 9, 16]
    Reduce
        from functools import reduce  # reduce is not a builtin in Python 3
        def add(i, j):
            return i + j
        nums = [1, 2, 3, 4]
        reduce(add, nums) -> 10
    MapReduce
        def MapReduce(data, mapper, reducer):
            # Apply mapper to every element, then fold the results with reducer
            return reduce(reducer, map(mapper, data))
        MapReduce(nums, sqr, add) -> 30

  6. [Image-only slide]

  7. What is Hadoop?
    • Framework for large-scale data processing
    • Based on Google’s MapReduce and GFS
    • An Apache Software Foundation project
    • Open Source!
    • Written in Java
    • Oh, btw

  8. Why Hadoop?
    • Need to process lots of data (petabyte scale)
    • Need to parallelize processing across a multitude of CPUs
    • Achieves the above while KeepIng Software Simple
    • Gives scalability with low-cost commodity hardware

  9. Hadoop fans
    Source: Hadoop Wiki

  10. When to use and not use Hadoop?
    Hadoop is a good choice for:
    • Indexing data
    • Log analysis
    • Image manipulation
    • Sorting large-scale data
    • Data mining
    Hadoop is not a good choice:
    • For real-time processing
    • For processing-intensive tasks with little data
    • If you have a supercomputer such as Jaguar or Roadrunner in stock

  11. HDFS – Overview
    • Hadoop Distributed File System
    • Based on Google’s GFS (Google File System)
    • Write-once, read-many access model
    • Fault tolerant
    • Efficient for batch processing

  12. HDFS – Blocks
    • HDFS splits input data into blocks
    • Block size in HDFS: 64/128 MB (configurable)
    • Block size in *nix file systems: 4 KB
    Diagram: Input Data split into Block 1, Block 2, Block 3
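To make the block arithmetic on this slide concrete, here is a minimal sketch that works out how many blocks a file would occupy. The helper name `split_into_blocks` and the 200 MB example file are assumptions made for illustration; only the 64 MB block size comes from the slide, and none of this is a Hadoop API.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes of the blocks a file of file_size bytes would occupy.

    Hypothetical helper: it only mirrors the arithmetic of fixed-size
    splitting and is not part of any Hadoop API.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be only partially filled
    return sizes

blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])  # 4 [64, 64, 64, 8]
```

With a 4 KB block size the same 200 MB file would need 51,200 blocks, which hints at why a large block size keeps the amount of per-block bookkeeping small.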

  13. HDFS – Replication
    • Blocks are replicated across nodes to handle hardware failure
    • Node failure is handled gracefully, without loss of data
    Diagram: Block 1, Block 2 and Block 3 spread across nodes, two copies of each
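The claim that a node failure loses no data can be checked with a small simulation. This is a sketch under stated assumptions: the round-robin placement, the node names and the function names are invented for illustration and are not the placement policy HDFS itself applies.

```python
def place_replicas(block_ids, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin is used only to keep the sketch short; it is
    not the placement policy HDFS itself applies.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def surviving_blocks(placement, failed_node):
    """Return the blocks that still have at least one replica after a node fails."""
    return {block for block, holders in placement.items()
            if any(node != failed_node for node in holders)}

nodes = ["node1", "node2", "node3"]
placement = place_replicas(["block1", "block2", "block3"], nodes)
for node in nodes:
    # With two replicas on distinct nodes, losing any single node loses no block.
    assert surviving_blocks(placement, node) == set(placement)
```

With a replication factor of 1 the same check would fail, since each block would then live on exactly one node.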

  14. HDFS – Architecture
    Diagram: Client, NameNode, and a cluster of DataNodes

  15. HDFS – NameNode
    • NameNode (Master)
    – Manages filesystem metadata
    – Manages replication of blocks
    – Manages read/write access to files
    • Metadata
    – List of files
    – List of blocks that constitute a file
    – List of DataNodes on which blocks reside, etc.
    • Single Point of Failure (candidate for spending $$)
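The three metadata lists on this slide can be pictured as two lookup tables. The sketch below is only a mental model: the file path, block IDs, DataNode names and the `locate` helper are all hypothetical and do not reflect how the NameNode actually stores its state.

```python
# Hypothetical in-memory metadata, mirroring the lists on the slide.
file_to_blocks = {
    "/logs/access.log": ["blk_1", "blk_2"],   # blocks that constitute a file
}
block_to_datanodes = {
    "blk_1": ["datanode-a", "datanode-b"],    # DataNodes holding each block
    "blk_2": ["datanode-b", "datanode-c"],
}

def list_files():
    """Return the list of known files."""
    return sorted(file_to_blocks)

def locate(path):
    """Return [(block_id, [datanodes...]), ...] for a file, in block order."""
    if path not in file_to_blocks:
        raise FileNotFoundError(path)
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

print(list_files())
print(locate("/logs/access.log"))
```

A client would ask for this kind of mapping first and then talk to the listed DataNodes for the actual bytes, which is why losing the master's metadata makes the stored blocks unusable.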

  16. HDFS – DataNode
    • DataNode (Slave)
    – Contains actual data
    – Manages data blocks
    – Informs NameNode about block IDs stored
    – Clients read/write data blocks from/to DataNodes
    – Performs block replication as instructed by NameNode
    • Block Replication
    – Supports various pluggable replication strategies
    – Clients read blocks from nearest DataNode
    • Data Pipelining
    – Client writes block to first DataNode
    – First DataNode forwards data to next DataNode in pipeline
    – When block is replicated across all replicas, next block is chosen
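The data pipelining steps can be sketched as a loop in which each DataNode stores a block and hands it on. Everything here (function names, node names, the in-memory `storage` dictionary) is an assumption for illustration; real DataNodes stream data over the network rather than appending to Python lists.

```python
def pipeline_write(block, pipeline, storage):
    """Write one block through a chain of DataNodes.

    The client hands the block to the first node only; each node stores
    it and forwards it to the next node in the pipeline.
    Returns the order in which nodes received the block.
    """
    received = []
    for node in pipeline:
        storage.setdefault(node, []).append(block)  # node stores the block
        received.append(node)                       # then forwards it onward
    return received

def write_file(blocks, pipeline):
    """Write blocks one at a time; the next block starts only after the
    current one has reached every replica in the pipeline."""
    storage = {}
    for block in blocks:
        acked = pipeline_write(block, pipeline, storage)
        assert acked == list(pipeline)  # all replicas hold the block
    return storage

storage = write_file(["block1", "block2"], ["dn1", "dn2", "dn3"])
print(storage)
```

The design point this illustrates is that the client sends each block once, and the replication traffic is carried by the DataNodes themselves.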

  17. Hadoop - Architecture
    Diagram: User, JobTracker, NameNode, two TaskTrackers, and six DataNodes

  18. Hadoop - Terminology
    • JobTracker (Master)
    – One JobTracker per cluster
    – Accepts job requests from users
    – Schedules Map and Reduce tasks for TaskTrackers
    – Monitors task and TaskTracker status
    – Re-executes tasks on failure
    • TaskTracker (Slave)
    – Multiple TaskTrackers in a cluster
    – Run Map and Reduce tasks
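Re-execution on failure is the part of the JobTracker's role that is easiest to show in code. The sketch below is a simplified, single-process analogy: `run_with_retries`, `make_flaky_task` and the limit of three attempts are assumptions for illustration, not Hadoop's scheduler or its defaults.

```python
def run_with_retries(task, max_attempts=3):
    """Run task(); on failure re-run it, up to max_attempts times.

    Returns (result, attempts_used). Re-raises the last error if every
    attempt fails. max_attempts=3 is an arbitrary choice for this sketch.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as err:  # any task failure triggers a re-run
            last_error = err
    raise last_error

def make_flaky_task(failures_before_success, result):
    """Hypothetical task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}
    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("simulated task failure")
        return result
    return task

result, attempts = run_with_retries(make_flaky_task(2, "done"))
print(result, attempts)  # done 3
```

Re-running a task is safe because, as noted earlier in the deck, map and reduce operations generate new data and leave the original input unmodified.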

  19. MapReduce – Flow
    Stages: Input, Map, Shuffle + Sort, Reduce, Output
    Diagram: Input Data feeds three Map tasks; after Shuffle + Sort, two Reduce tasks produce the Output Data
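The flow on this slide can be imitated in a single process to see what each stage contributes. This is a sketch only: `run_mapreduce` and the example job (longest word length per first letter) are assumptions for illustration, and a real cluster runs the map and reduce tasks on different machines.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Simulate the Map -> Shuffle + Sort -> Reduce flow in one process.

    mapper(record) yields (key, value) pairs;
    reducer(key, values) returns one (key, result) pair.
    """
    # Map: every input record produces zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle + Sort: order pairs by key so equal keys become adjacent.
    mapped.sort(key=itemgetter(0))
    # Reduce: call the reducer once per key with all of that key's values.
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

# Example job (an assumption for illustration): longest word per first letter.
def first_letter_mapper(word):
    yield word[0], len(word)

def max_reducer(key, values):
    return key, max(values)

print(run_mapreduce(["hadoop", "hdfs", "map", "mapreduce"],
                    first_letter_mapper, max_reducer))
# [('h', 6), ('m', 9)]
```

The sort step is what guarantees that a reducer sees every value for its key in one call, which is the job of the Shuffle + Sort stage in the diagram.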

  20. Word Count
    Hadoop’s HelloWorld

  21. Word Count Example
    • Input
    – Text files
    • Output
    – Single file containing (Word, Count) pairs
    • Map Phase
    – Generates (Word, Count) pairs
    – [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]
    • Reduce Phase
    – For each word, calculates aggregate
    – [{a,7}, {b,5}, {c,6}]
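The numbers on this slide can be verified by summing the three map outputs by word. The pairs below are copied from the slide; the function name `reduce_counts` and the use of `collections.Counter` are choices made for this sketch, not part of Hadoop.

```python
from collections import Counter

# Partial (word, count) pairs from three map tasks, copied from the slide.
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

def reduce_counts(map_outputs):
    """Sum the counts for each word across all map outputs."""
    totals = Counter()
    for pairs in map_outputs:
        for word, count in pairs:
            totals[word] += count
    return sorted(totals.items())

print(reduce_counts(map_outputs))  # [('a', 7), ('b', 5), ('c', 6)]
```

The result matches the reduce-phase output on the slide: a appears 1 + 1 + 2 + 3 = 7 times, b appears 1 + 3 + 1 = 5 times, and c appears 5 + 1 = 6 times.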

  22. Word Count – Mapper
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // Emits (word, 1) for every token in the input line.
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            String l = value.toString();
            StringTokenizer t = new StringTokenizer(l);
            while (t.hasMoreTokens()) {
                word.set(t.nextToken());
                out.collect(word, one);
            }
        }
    }

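  23. Word Count – Reducer
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Sums all counts collected for a word and emits (word, total).
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }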

  24. Word Count – Config
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    public class WordCountConfig {
        public static void main(String[] args) throws Exception {
            // Expect exactly two arguments: input path and output path.
            if (args.length != 2) {
                System.exit(1);
            }
            JobConf conf = new JobConf(WordCountConfig.class);
            conf.setJobName("Word Counter");
            // Output types must match what the mapper and reducer emit.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            conf.setMapperClass(WordCountMapper.class);
            conf.setCombinerClass(WordCountReducer.class);
            conf.setReducerClass(WordCountReducer.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            JobClient.runJob(conf);
        }
    }

  25. Diving Deeper
    • http://hadoop.apache.org/
    • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
    • Tom White, Hadoop: The Definitive Guide, O’Reilly
    • Setting up a Single-Node Cluster: http://bit.ly/glNzs4
    • Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP

  26. Catching-Up
    • Follow me on Twitter @mohitsoni
    • http://mohitsoni.com/