Hello World! in Hadoop

Every journey into a new programming language or paradigm starts with a simple Hello World! program. The Hadoop world is no different, and its Hello World equivalent is the Word Count program.

The presentation covers the basic concepts needed to understand the Word Count program, and shows how to set it up and run it on your own.

Chandra Yarlagadda

June 26, 2015

Transcript

  1. [Slide: map of the Big Data landscape, grouping the ecosystem into Infrastructure (Storage, Hadoop Related, NoSQL / NewSQL / MPP Databases, Cluster Services, Security, Collection / Transport, ...), Analytics (Data Visualization, Statistical Computing, Machine Learning, Sentiment Analysis, ...), Applications, and Data Sources. Source: www.bigdatalandscape.com]
  2. Before We Jump In!
     • Hadoop 1.2.1
     • Daemon
     • Pseudo-Distributed Mode (a sample configuration follows below)
     • Cloudera VM
     • Do Not Start the Cloudera Manager
     • Word Count
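     For reference, a plain Hadoop 1.2.1 install is switched into pseudo-distributed mode (all daemons on one machine, each in its own JVM) with roughly the following configuration. This is a sketch of the standard Hadoop 1.x single-node setup; the Cloudera VM ships pre-configured, so none of it is needed there.

         <!-- conf/core-site.xml -->
         <configuration>
           <property>
             <name>fs.default.name</name>
             <value>hdfs://localhost:9000</value>
           </property>
         </configuration>

         <!-- conf/hdfs-site.xml: one replica, since every "node" is the same machine -->
         <configuration>
           <property>
             <name>dfs.replication</name>
             <value>1</value>
           </property>
         </configuration>

         <!-- conf/mapred-site.xml -->
         <configuration>
           <property>
             <name>mapred.job.tracker</name>
             <value>localhost:9001</value>
           </property>
         </configuration>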
  3. HDFS (Hadoop Distributed File System): designed for storing
     • very large files, with
     • streaming data access,
     • running on clusters of commodity hardware
  4. [Diagram sequence: HDFS architecture. The Name Node holds the file-system metadata in an fsImage and records every change in an Edit Log; the Data Nodes store the actual data, with each block (Block1, Block2, Block3) replicated across several nodes. A Secondary Name Node periodically merges the Edit Log into the fsImage, so that if the Name Node dies, a new Name Node can be brought up from that merged checkpoint.]
  5. HDFS - ATTRIBUTES
     • Distributed
     • Write Once - Read Many Times pattern
     • High Throughput - High Latency
     • CAP (Consistency - Availability - Partition Tolerance)
     • Not suitable for a lot of small files
     • High network utilization and disk I/O
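     The deck never shows client code, but the write-once / read-many character comes through in the API. A minimal sketch against the Hadoop FileSystem API, assuming a running HDFS; the path and class name here are made up for illustration:

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.FSDataOutputStream;
         import org.apache.hadoop.fs.FileSystem;
         import org.apache.hadoop.fs.Path;

         public class HdfsHello {
           public static void main(String[] args) throws Exception {
             // Reads fs.default.name (the Name Node address) from the cluster config
             Configuration conf = new Configuration();
             FileSystem fs = FileSystem.get(conf);

             // Write once ...
             Path p = new Path("/user/cloudera/hello.txt");  // hypothetical path
             FSDataOutputStream out = fs.create(p);
             out.writeUTF("Twinkle Twinkle Little Star");
             out.close();

             // ... read many times; files are appended or replaced, never edited in place
             System.out.println(fs.getFileStatus(p).getLen() + " bytes written");
             fs.close();
           }
         }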
  6. MAP REDUCE: a programming model and an associated implementation for
     • processing and generating large data sets, with a
     • parallel,
     • distributed algorithm, on a
     • cluster (HDFS)
  7. MAP REDUCE - KEY POINTS
     • A Map phase and a Reduce phase
     • The user defines the Map and Reduce functions
     • Both phases operate on sets of key-value pairs
  8. [Diagram sequence: MAP REDUCE - SINGLE REDUCER. Input Splits 1-3 each feed a Mapper; all Mapper output flows into one Reducer, whose result lands in HDFS. The Job Tracker schedules the job, a Task Tracker on each node runs its map and reduce tasks, and the Name Node supplies the block locations.]
  9. [Diagram sequence: MAP REDUCE - MULTIPLE REDUCERS. The same layout, but the Mapper output is partitioned across two Reducers, with a Sort and Shuffle step between the map side and the reduce side.]
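     The diagrams don't show the knob, but the reducer count is a job setting. A sketch, assuming the job object configured on the MAIN() slide later in the deck; by default keys are assigned to reducers by HashPartitioner (hash of the key modulo the number of reducers):

         // Two reducers: each intermediate key goes to exactly one of them,
         // chosen by the partitioner during the sort-and-shuffle step.
         job.setNumReduceTasks(2);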
  10. [Diagram sequence: MAP REDUCE - ZERO REDUCERS. The Mappers write their output straight to HDFS; there is no sort and shuffle and no reduce phase.]
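     The map-only variant is the same setting at zero (again assuming the job object from the MAIN() slide):

         // Zero reducers: the shuffle is skipped and each map task writes
         // its output directly to HDFS (typically one part-m-xxxxx file each).
         job.setNumReduceTasks(0);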
  11. KEY - VALUE
      map(InputKey, InputValue) => set of (IntermediateKey, IntermediateValue)
      reduce(IntermediateKey, IntermediateValues) => set of (OutputKey, OutputValue)

      Example, for the input line "Twinkle Twinkle Little Star" (its key, 0, is the line's byte offset in the file):

          (0, Twinkle Twinkle Little Star)  =>  map     =>  Little 1 | Star 1 | Twinkle 1,1
          Little 1 | Star 1 | Twinkle 1,1   =>  reduce  =>  Little 1 | Star 1 | Twinkle 2
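      To make those two signatures concrete before the Hadoop versions appear, here is a plain-Java sketch of the same flow with no Hadoop involved (the class and variable names are made up for illustration):

          import java.util.AbstractMap;
          import java.util.ArrayList;
          import java.util.List;
          import java.util.Map;
          import java.util.TreeMap;

          public class WordCountByHand {
            public static void main(String[] args) {
              // map: (0, "Twinkle Twinkle Little Star") => one (word, 1) pair per token
              List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
              for (String token : "Twinkle Twinkle Little Star".split("\\s+")) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
              }

              // sort & shuffle: group values by key => Little=[1], Star=[1], Twinkle=[1, 1]
              Map<String, List<Integer>> grouped = new TreeMap<>();
              for (Map.Entry<String, Integer> e : pairs) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
              }

              // reduce: sum each key's values => Little 1, Star 1, Twinkle 2
              for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) sum += v;
                System.out.println(e.getKey() + "\t" + sum);
              }
            }
          }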
  12. MAP CODE
      map(InputKey, InputValue) => set of (IntermediateKey, IntermediateValue)

          public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {

            // Reused across calls rather than re-allocated for every token
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // Called once per record: key = byte offset, value = one line of input
            public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);  // emit (word, 1)
              }
            }
          }

      Input:  0 | Twinkle Twinkle Little Star
      Output: Little 1 | Star 1 | Twinkle 1,1
  13. REDUCE CODE
      reduce(IntermediateKey, IntermediateValues) => set of (OutputKey, OutputValue)

          public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            // Called once per intermediate key, with all of that key's values
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();  // add up the 1s the mappers emitted
              }
              result.set(sum);
              context.write(key, result);  // emit (word, total count)
            }
          }

      Input:  Little 1 | Star 1 | Twinkle 1,1
      Output: Little 1 | Star 1 | Twinkle 2
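      Not shown in the deck, but worth knowing: because this sum is associative, the same class can also be registered as a combiner (as the official word-count tutorial does), pre-aggregating counts on the map side so less data crosses the network in the shuffle:

          // Assumes the job object from the MAIN() slide
          job.setCombinerClass(IntSumReducer.class);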
  14. MAIN()

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setInputFormatClass(TextInputFormat.class);
            // args[0] = HDFS input directory; args[1] = output directory, which must not exist yet
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
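      For the WordCount class to compile as a whole, it needs roughly these imports (the standard ones for the org.apache.hadoop.mapreduce API; verify against your Hadoop version):

          import java.io.IOException;
          import java.util.StringTokenizer;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.Mapper;
          import org.apache.hadoop.mapreduce.Reducer;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
          import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      Packaged into a jar, the job is typically launched with hadoop jar wordcount.jar WordCount <input dir> <output dir>, with both directories in HDFS.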
  15. RESOURCES
      • https://hadoop.apache.org/
      • Hadoop: The Definitive Guide, Tom White
      • http://hortonworks.com/tutorials/
      • http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cloudera_quickstart_vm.html