Hello World! in Hadoop

Every journey into a new programming language or paradigm starts with a simple Hello World! program. The Hadoop world is no different, and its Hello World equivalent is the Word Count program.

The presentation covers the basic concepts needed to understand the Word Count program, and shows how to set it up and run it on your own.

Chandra Yarlagadda

June 26, 2015

Transcript

  1. [Slide: map of the Big Data landscape, grouping the ecosystem into Infrastructure (Storage, Hadoop Related, NoSQL / NewSQL / MPP Databases, Cluster Services, Security, Collection / Transport, ...), Analytics (Data Visualization, Statistical Computing, Machine Learning, Sentiment Analysis, ...), Applications, and Data Sources. Source: www.bigdatalandscape.com]
  2. Before We Jump In!
     • Hadoop 1.2.1
     • Daemon
     • Pseudo-Distributed Mode (a sample configuration follows below)
     • Cloudera VM
     • Do Not Start the Cloudera Manager
     • Word Count
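     For reference, a plain Hadoop 1.2.1 install is switched into pseudo-distributed mode (all daemons on one machine, each in its own JVM) with roughly the following configuration. This is a sketch of the standard Hadoop 1.x single-node setup; the Cloudera VM ships pre-configured, so none of it is needed there.

         <!-- conf/core-site.xml -->
         <configuration>
           <property>
             <name>fs.default.name</name>
             <value>hdfs://localhost:9000</value>
           </property>
         </configuration>

         <!-- conf/hdfs-site.xml: one replica, since every "node" is the same machine -->
         <configuration>
           <property>
             <name>dfs.replication</name>
             <value>1</value>
           </property>
         </configuration>

         <!-- conf/mapred-site.xml -->
         <configuration>
           <property>
             <name>mapred.job.tracker</name>
             <value>localhost:9001</value>
           </property>
         </configuration>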
  3. HDFS (Hadoop Distributed File System): designed for storing
     • very large files, with
     • streaming data access,
     • running on clusters of commodity hardware
  4. [Diagram sequence: HDFS architecture. The Name Node holds the file-system metadata in an fsImage and records every change in an Edit Log; the Data Nodes store the actual data, with each block (Block1, Block2, Block3) replicated across several nodes. A Secondary Name Node periodically merges the Edit Log into the fsImage, so that if the Name Node dies, a new Name Node can be brought up from that merged checkpoint.]
  5. HDFS - ATTRIBUTES
     • Distributed
     • Write Once - Read Many Times pattern
     • High Throughput - High Latency
     • CAP (Consistency - Availability - Partition Tolerance)
     • Not suitable for a lot of small files
     • High network utilization and disk I/O
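     The deck never shows client code, but the write-once / read-many character comes through in the API. A minimal sketch against the Hadoop FileSystem API, assuming a running HDFS; the path and class name here are made up for illustration:

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.FSDataOutputStream;
         import org.apache.hadoop.fs.FileSystem;
         import org.apache.hadoop.fs.Path;

         public class HdfsHello {
           public static void main(String[] args) throws Exception {
             // Reads fs.default.name (the Name Node address) from the cluster config
             Configuration conf = new Configuration();
             FileSystem fs = FileSystem.get(conf);

             // Write once ...
             Path p = new Path("/user/cloudera/hello.txt");  // hypothetical path
             FSDataOutputStream out = fs.create(p);
             out.writeUTF("Twinkle Twinkle Little Star");
             out.close();

             // ... read many times; files are appended or replaced, never edited in place
             System.out.println(fs.getFileStatus(p).getLen() + " bytes written");
             fs.close();
           }
         }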
  6. MAP REDUCE: a programming model and an associated implementation for
     • processing and generating large data sets, with a
     • parallel,
     • distributed algorithm, on a
     • cluster (HDFS)
  7. MAP REDUCE - KEY POINTS
     • A Map phase and a Reduce phase
     • The user defines the Map and Reduce functions
     • Both phases operate on sets of key-value pairs
  8. [Diagram sequence: MAP REDUCE - SINGLE REDUCER. Input Splits 1-3 each feed a Mapper; all Mapper output flows into one Reducer, whose result lands in HDFS. The Job Tracker schedules the job, a Task Tracker on each node runs its map and reduce tasks, and the Name Node supplies the block locations.]
  9. [Diagram sequence: MAP REDUCE - MULTIPLE REDUCERS. The same layout, but the Mapper output is partitioned across two Reducers, with a Sort and Shuffle step between the map side and the reduce side.]
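     The diagrams don't show the knob, but the reducer count is a job setting. A sketch, assuming the job object configured on the MAIN() slide later in the deck; by default keys are assigned to reducers by HashPartitioner (hash of the key modulo the number of reducers):

         // Two reducers: each intermediate key goes to exactly one of them,
         // chosen by the partitioner during the sort-and-shuffle step.
         job.setNumReduceTasks(2);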
  10. [Diagram sequence: MAP REDUCE - ZERO REDUCERS. The Mappers write their output straight to HDFS; there is no sort and shuffle and no reduce phase.]
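     The map-only variant is the same setting at zero (again assuming the job object from the MAIN() slide):

         // Zero reducers: the shuffle is skipped and each map task writes
         // its output directly to HDFS (typically one part-m-xxxxx file each).
         job.setNumReduceTasks(0);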
  11. KEY - VALUE
      map(InputKey, InputValue) => set of (IntermediateKey, IntermediateValue)
      reduce(IntermediateKey, IntermediateValues) => set of (OutputKey, OutputValue)

      Example, for the input line "Twinkle Twinkle Little Star" (its key, 0, is the line's byte offset in the file):

          (0, Twinkle Twinkle Little Star)  =>  map     =>  Little 1 | Star 1 | Twinkle 1,1
          Little 1 | Star 1 | Twinkle 1,1   =>  reduce  =>  Little 1 | Star 1 | Twinkle 2
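      To make those two signatures concrete before the Hadoop versions appear, here is a plain-Java sketch of the same flow with no Hadoop involved (the class and variable names are made up for illustration):

          import java.util.AbstractMap;
          import java.util.ArrayList;
          import java.util.List;
          import java.util.Map;
          import java.util.TreeMap;

          public class WordCountByHand {
            public static void main(String[] args) {
              // map: (0, "Twinkle Twinkle Little Star") => one (word, 1) pair per token
              List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
              for (String token : "Twinkle Twinkle Little Star".split("\\s+")) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
              }

              // sort & shuffle: group values by key => Little=[1], Star=[1], Twinkle=[1, 1]
              Map<String, List<Integer>> grouped = new TreeMap<>();
              for (Map.Entry<String, Integer> e : pairs) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
              }

              // reduce: sum each key's values => Little 1, Star 1, Twinkle 2
              for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) sum += v;
                System.out.println(e.getKey() + "\t" + sum);
              }
            }
          }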
  12. MAP CODE
      map(InputKey, InputValue) => set of (IntermediateKey, IntermediateValue)

          public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {

            // Reused across calls rather than re-allocated for every token
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // Called once per record: key = byte offset, value = one line of input
            public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);  // emit (word, 1)
              }
            }
          }

      Input:  0 | Twinkle Twinkle Little Star
      Output: Little 1 | Star 1 | Twinkle 1,1
  13. REDUCE CODE
      reduce(IntermediateKey, IntermediateValues) => set of (OutputKey, OutputValue)

          public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            // Called once per intermediate key, with all of that key's values
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();  // add up the 1s the mappers emitted
              }
              result.set(sum);
              context.write(key, result);  // emit (word, total count)
            }
          }

      Input:  Little 1 | Star 1 | Twinkle 1,1
      Output: Little 1 | Star 1 | Twinkle 2
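      Not shown in the deck, but worth knowing: because this sum is associative, the same class can also be registered as a combiner (as the official word-count tutorial does), pre-aggregating counts on the map side so less data crosses the network in the shuffle:

          // Assumes the job object from the MAIN() slide
          job.setCombinerClass(IntSumReducer.class);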
  14. MAIN()

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setInputFormatClass(TextInputFormat.class);
            // args[0] = HDFS input directory; args[1] = output directory, which must not exist yet
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
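      For the WordCount class to compile as a whole, it needs roughly these imports (the standard ones for the org.apache.hadoop.mapreduce API; verify against your Hadoop version):

          import java.io.IOException;
          import java.util.StringTokenizer;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.Mapper;
          import org.apache.hadoop.mapreduce.Reducer;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
          import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      Packaged into a jar, the job is typically launched with hadoop jar wordcount.jar WordCount <input dir> <output dir>, with both directories in HDFS.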
  15. RESOURCES
      • https://hadoop.apache.org/
      • Hadoop: The Definitive Guide, Tom White
      • http://hortonworks.com/tutorials/
      • http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cloudera_quickstart_vm.html