Who is Gabriel Reid?
• Day job at TomTom
• @ReidGabriel on Twitter
• Committer on Crunch
Map Reduce
(diagram: Input → Map tasks → Reduce tasks → Output)
Word count
Input: big data is big
Map output: (big, 1), (data, 1), (is, 1), (big, 1)
Grouped: (big, (1, 1)), (data, (1)), (is, (1))
Reduce output: (big, 2), (data, 1), (is, 1)
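The map → group → reduce flow above can be sketched in plain Java with standard collections (no Hadoop involved; class and method names here are illustrative only):

```java
import java.util.*;

public class WordCountSketch {
    public static Map<String, Integer> wordCount(String input) {
        // Map phase: emit a (word, 1) pair for each token
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String word : input.split("\\s+")) {
            mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        // Shuffle phase: group the 1s by word, e.g. (big, (1, 1))
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // Reduce phase: sum each group, e.g. (big, 2)
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("big data is big"));
        // {big=2, data=1, is=1}
    }
}
```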
What is Crunch?
• Abstraction on Hadoop MapReduce
• Higher-level ease of use, lower-level power
• Apache Incubator project
Origins of Crunch
• Follows the lead of Hadoop MapReduce (Google MapReduce), HBase (Google BigTable), and HDFS (GFS)
• Based on the Google FlumeJava paper
• Started by a former FlumeJava user (@joshwills)
• In the Apache Incubator since the end of May 2012
Why an abstraction on MapReduce?
• Boilerplate
• Job control and dependencies
• Optimizations
WordCount in traditional MapReduce (1)
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
WordCount in traditional MapReduce (2)
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
WordCount in traditional MapReduce (3)
Job job = new Job();
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
Word count in Crunch
Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
PCollection<String> lines = pipeline.readTextFile(args[0]);
PCollection<String> words = lines.parallelDo(
    new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          emitter.emit(tokenizer.nextToken());
        }
      }
    }, Writables.strings());
PTable<String, Long> wordCounts = Aggregate.count(words);
pipeline.writeTextFile(wordCounts, args[1]);
pipeline.done();
PCollection
• Immutable, typed collection of values
• Typically stored on HDFS
DoFn
• Operates on a single value
• Outputs 0, 1, or n values
• Single input type and output type
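The "0, 1, or n values" contract can be sketched with minimal plain-Java stand-ins for Crunch's DoFn and Emitter (the names SimpleDoFn, SimpleEmitter, and apply are hypothetical, not the Crunch API):

```java
import java.util.*;

// Hypothetical stand-ins for Crunch's Emitter and DoFn, for illustration only
interface SimpleEmitter<T> { void emit(T value); }

abstract class SimpleDoFn<S, T> {
    public abstract void process(S input, SimpleEmitter<T> emitter);
}

public class DoFnDemo {
    // Emits n values per input: one per whitespace-separated token
    static final SimpleDoFn<String, String> TOKENIZE = new SimpleDoFn<String, String>() {
        public void process(String line, SimpleEmitter<String> emitter) {
            for (String token : line.split("\\s+")) {
                emitter.emit(token);
            }
        }
    };

    // Emits 0 or 1 values per input: drops words of 2 characters or fewer
    static final SimpleDoFn<String, String> KEEP_LONG = new SimpleDoFn<String, String>() {
        public void process(String word, SimpleEmitter<String> emitter) {
            if (word.length() > 2) {
                emitter.emit(word);
            }
        }
    };

    // Runs a DoFn over every input, collecting whatever it emits
    static <S, T> List<T> apply(SimpleDoFn<S, T> fn, List<S> inputs) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = apply(TOKENIZE, List.of("big data is big"));
        System.out.println(tokens);                   // [big, data, is, big]
        System.out.println(apply(KEEP_LONG, tokens)); // [big, data, big]
    }
}
```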
Implementing a DoFn
public class ToStringFn<T> extends DoFn<T, String> {
  @Override
  public void process(T input, Emitter<String> emitter) {
    emitter.emit(input.toString());
  }
}
Types of PCollections
• PCollection<T>
• PTable<K, V>
• PGroupedTable<K, V>
PCollections in practice
PCollection<Person> people = ...;
PTable<String, Person> peopleByDept = people.by(
    new ExtractDepartmentFn(), Writables.strings());
PGroupedTable<String, Person> peopleGroupedByDept =
    peopleByDept.groupByKey();
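The by + groupByKey pipeline above can be mimicked in plain Java to show the resulting shape (Person and groupByDept are hypothetical names standing in for the slide's types; this illustrates the semantics, not the Crunch API):

```java
import java.util.*;

public class GroupByDeptSketch {
    // Hypothetical record; in the slide, ExtractDepartmentFn derives the key
    record Person(String name, String department) {}

    // by(...) keys each value; groupByKey() collects the values per key
    static Map<String, List<String>> groupByDept(List<Person> people) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Person p : people) {
            grouped.computeIfAbsent(p.department(), k -> new ArrayList<>())
                   .add(p.name());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
            new Person("Alice", "Engineering"),
            new Person("Bob", "Sales"),
            new Person("Carol", "Engineering"));
        System.out.println(groupByDept(people));
        // {Engineering=[Alice, Carol], Sales=[Bob]}
    }
}
```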
Other MapReduce abstractions
• Hive
• Pig
• Cascading
How to get it
• First Apache release coming soon
• http://cwiki.apache.org/confluence/display/CRUNCH
• https://git-wip-us.apache.org/repos/asf/incubator-crunch.git