Origins of Crunch • Follows lead of Hadoop MapReduce (Google MapReduce), HBase (Google BigTable), HDFS (GFS) • Based on Google FlumeJava paper • Started by a former FlumeJava user (@joshwills) • In Apache Incubator since end of May 2012
WordCount in traditional MapReduce (1)

public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
WordCount in traditional MapReduce (2)

public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
WordCount in traditional MapReduce (3)

Job job = new Job();
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
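For reference, the three snippets above amount to a word-frequency count. A minimal plain-Java sketch of the same logic (in-memory only, no Hadoop types, not part of the original slides) makes the data flow explicit:

```java
import java.util.*;

public class WordCountLocal {
    // Same map/reduce logic as the Mapper/Reducer pair above, collapsed into
    // one in-memory pass: tokenize each line, then sum a count per word.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c")));
        // {a=2, b=2, c=1}
    }
}
```

Note how much of the MapReduce version is boilerplate around this small core; that gap is exactly what Crunch's higher-level API targets.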
Implementing a DoFn

public class ToStringFn<T> extends DoFn<T, String> {
  @Override
  public void process(T input, Emitter<String> emitter) {
    emitter.emit(input.toString());
  }
}
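The DoFn callback pattern can be sketched without Hadoop at all. Below, Emitter and DoFn are hypothetical minimal stand-ins for the Crunch interfaces of the same names (the real ones carry extra machinery such as serialization and configuration), just to show how process() may emit zero or more outputs per input:

```java
import java.util.*;

// Minimal stand-ins for Crunch's Emitter and DoFn (illustrative only).
interface Emitter<T> { void emit(T t); }

abstract class DoFn<S, T> {
    public abstract void process(S input, Emitter<T> emitter);
}

public class DoFnSketch {
    // The ToStringFn from the slide, written against the stand-in DoFn.
    static class ToStringFn<T> extends DoFn<T, String> {
        @Override
        public void process(T input, Emitter<String> emitter) {
            emitter.emit(input.toString());
        }
    }

    // Apply a DoFn to each element, collecting everything it emits;
    // this models what a parallelDo() call does across a PCollection.
    static <S, T> List<T> apply(DoFn<S, T> fn, List<S> inputs) {
        List<T> out = new ArrayList<>();
        for (S s : inputs) fn.process(s, out::add);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(apply(new ToStringFn<Integer>(), List.of(1, 2, 3)));
        // [1, 2, 3]
    }
}
```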
PCollections in practice

PCollection<Person> people = ...;

PTable<String, Person> peopleByDept = people.by(
    new ExtractDepartmentFn(), Writables.strings());

PGroupedTable<String, Person> peopleGroupedByDept =
    peopleByDept.groupByKey();
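The semantics of by(...).groupByKey() can be illustrated locally: each value is keyed by a function of itself (here, its department), then values are collected per key. This is a plain-Java sketch of that behavior, not Crunch code; the name/dept string pairs stand in for the slide's Person records:

```java
import java.util.*;

public class GroupBySketch {
    // Models what by(new ExtractDepartmentFn(), ...).groupByKey() computes,
    // using in-memory collections. Each element of `people` is a
    // {name, department} pair; the department plays the role of the key.
    static Map<String, List<String>> groupByDept(List<String[]> people) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] p : people) {
            // p[0] = name, p[1] = department (the extracted key)
            grouped.computeIfAbsent(p[1], k -> new ArrayList<>()).add(p[0]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> people = List.of(
            new String[]{"Ann", "Eng"},
            new String[]{"Bob", "Sales"},
            new String[]{"Cy", "Eng"});
        System.out.println(groupByDept(people));
        // {Eng=[Ann, Cy], Sales=[Bob]}
    }
}
```

On a cluster, the grouping step is what triggers the shuffle phase of the underlying MapReduce job; the in-memory map here is only a model of the result.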
How to get it

• First Apache release coming soon
• http://cwiki.apache.org/confluence/display/CRUNCH
• https://git-wip-us.apache.org/repos/asf/incubator-crunch.git