Introduction to Crunch

August 28, 2012

An introduction to Crunch, an Apache Incubator project for simple and efficient MapReduce pipelines in Java



  1. Who is Gabriel Reid?

    • Day job at TomTom
    • @ReidGabriel on Twitter
    • Committer on Crunch
  2. Word count

    Input:          big data is big
    Map output:     (big, 1) (data, 1) (is, 1) (big, 1)
    Grouped by key: (big, (1, 1)) (data, (1)) (is, (1))
    Reduce output:  (big, 2) (data, 1) (is, 1)
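The word count flow above can be traced in plain Java, without Hadoop or Crunch. This is only a minimal in-memory sketch: the class name `WordCountPhases` and its `map`, `shuffle`, and `reduce` methods are made up for illustration and are not part of any library.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it only mimics the three phases in memory.
public class WordCountPhases {

    // Map phase: emit one (word, 1) pair per whitespace-separated token.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: collect all values that share a key (sorted by key).
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values of each key.
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = shuffle(map("big data is big"));
        System.out.println(grouped);         // {big=[1, 1], data=[1], is=[1]}
        System.out.println(reduce(grouped)); // {big=2, data=1, is=1}
    }
}
```

On a real cluster the map and reduce steps run in parallel on different machines and the shuffle moves data between them; the sketch keeps everything in one process to make the data flow visible.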
  3. What is Crunch?

    • Abstraction on Hadoop MapReduce
    • Higher-level ease of use, lower-level power
    • Apache Incubator project
  4. Origins of Crunch

    • Follows the lead of Hadoop MapReduce (Google MapReduce), HBase (Google BigTable), HDFS (GFS)
    • Based on the Google FlumeJava paper
    • Started by a former FlumeJava user (@joshwills)
    • In the Apache Incubator since the end of May 2012
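  5. WordCount in traditional MapReduce (1)

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Emits (word, 1) for every token in the input line.
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, one);
        }
      }
    }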
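  6. WordCount in traditional MapReduce (2)

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

      // Sums the counts emitted for one word and writes (word, total).
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }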
  7. WordCount in traditional MapReduce (3)

    Job job = new Job();
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  8. Word count in Crunch

    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    PCollection<String> words = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              emitter.emit(tokenizer.nextToken());
            }
          }
        }, Writables.strings());

    PTable<String, Long> wordCounts = Aggregate.count(words);
    pipeline.writeTextFile(wordCounts, args[1]);
    pipeline.done();
  9. DoFn

    • Operates on a single value
    • Outputs 0, 1, or n values
    • Single input type and output type
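The "0, 1, or n values" point can be illustrated with a small plain-Java model. This is a sketch only: `SimpleDoFn`, `SimpleEmitter`, and `apply` below are hypothetical stand-ins invented for this example and are not Crunch's own `DoFn` and `Emitter` classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of the DoFn idea; not the Crunch API.
public class DoFnSketch {

    // Stand-in for an emitter: receives every output value.
    public interface SimpleEmitter<T> {
        void emit(T value);
    }

    // Stand-in for a DoFn: one input value, any number of emitted outputs.
    public interface SimpleDoFn<S, T> {
        void process(S input, SimpleEmitter<T> emitter);
    }

    // Emits 0 or 1 values per input: blank lines are dropped.
    public static final SimpleDoFn<String, String> DROP_BLANK = (line, emitter) -> {
        if (!line.trim().isEmpty()) {
            emitter.emit(line);
        }
    };

    // Emits exactly 1 value per input: the line length.
    public static final SimpleDoFn<String, Integer> LENGTH =
        (line, emitter) -> emitter.emit(line.length());

    // Emits n values per input: one per whitespace-separated token.
    public static final SimpleDoFn<String, String> TOKENS = (line, emitter) -> {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                emitter.emit(token);
            }
        }
    };

    // Applies the function to each input in turn and collects all emitted values.
    public static <S, T> List<T> apply(List<S> inputs, SimpleDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "", "is big");
        System.out.println(apply(lines, DROP_BLANK)); // [big data, is big]
        System.out.println(apply(lines, LENGTH));     // [8, 0, 6]
        System.out.println(apply(lines, TOKENS));     // [big, data, is, big]
    }
}
```

Because each call sees a single input and writes to an emitter instead of returning a value, filtering, mapping, and flat-mapping all fit the same shape.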
  10. Implementing a DoFn

    public class ToStringFn<T> extends DoFn<T, String> {

      // Emits the string form of each input value.
      @Override
      public void process(T input, Emitter<String> emitter) {
        emitter.emit(input.toString());
      }
    }
  11. PCollections in practice

    PCollection<Person> people = ...;

    PTable<String, Person> peopleByDept = people.by(
        new ExtractDepartmentFn(), Writables.strings());

    PGroupedTable<String, Person> peopleGroupedByDept = peopleByDept.groupByKey();
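To make the two steps concrete, here is a rough in-memory analogue in plain Java: first key every element (what `by` does on the slide), then gather the values per key (what `groupByKey` does). The `Person` class, its fields, and the sample names are assumptions made up for this sketch; the slide does not show them, and none of this is Crunch code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogue of by(...) followed by groupByKey().
public class GroupByDeptSketch {

    // Assumed shape of a person record; the real Person class is not shown in the deck.
    public static final class Person {
        final String name;
        final String department;

        public Person(String name, String department) {
            this.name = name;
            this.department = department;
        }
    }

    // Step 1: turn each person into a (department, name) pair.
    public static List<Map.Entry<String, String>> byDepartment(List<Person> people) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (Person p : people) {
            pairs.add(new AbstractMap.SimpleEntry<>(p.department, p.name));
        }
        return pairs;
    }

    // Step 2: collect all values that share a key (sorted by key).
    public static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Made-up sample data.
    public static List<Person> sample() {
        return Arrays.asList(
            new Person("Ada", "engineering"),
            new Person("Grace", "sales"),
            new Person("Linus", "engineering"));
    }

    public static void main(String[] args) {
        System.out.println(groupByKey(byDepartment(sample())));
        // {engineering=[Ada, Linus], sales=[Grace]}
    }
}
```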
  12. Input and Output

    • Text files
    • Sequence files
    • Avro files
    • HBase
    • Anything else with an InputFormat or OutputFormat implementation
  13. Example: web server log analysis

    PCollection<WebServerLogEntry> webserverEntries = pipeline.read(
        At.avroFile(inputPath, Avros.records(WebServerLogEntry.class)));

    PTable<String, Long> topGetsByBytes = webserverEntries
        .filter(new RequestMethodFilterFn("GET"))
        .parallelDo(
            new WebServerEntryToUrlAndResponseSizeFn(),
            Avros.tableOf(Avros.strings(), Avros.longs()))
        .groupByKey()
        .combineValues(CombineFn.<String>SUM_LONGS())
        .top(50);
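The logic of this pipeline (keep GET requests, sum response bytes per URL, take the largest totals) can be checked on a handful of records in plain Java. The sketch below is an assumption-laden analogue, not Crunch code: `LogEntry` and its fields are a simplified, hypothetical stand-in for `WebServerLogEntry`, whose actual fields are not shown in the deck, and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogue of the log analysis pipeline.
public class TopUrlsSketch {

    // Assumed, simplified log record; not the WebServerLogEntry from the slide.
    public static final class LogEntry {
        final String method;
        final String url;
        final long responseBytes;

        public LogEntry(String method, String url, long responseBytes) {
            this.method = method;
            this.url = url;
            this.responseBytes = responseBytes;
        }
    }

    public static Map<String, Long> topGetsByBytes(List<LogEntry> entries, int n) {
        // Filter to GET requests and sum the response sizes per URL.
        Map<String, Long> bytesByUrl = new TreeMap<>();
        for (LogEntry e : entries) {
            if ("GET".equals(e.method)) {
                bytesByUrl.merge(e.url, e.responseBytes, Long::sum);
            }
        }
        // Order by total bytes, largest first, and keep the first n entries.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(bytesByUrl.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : sorted.subList(0, Math.min(n, sorted.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    // Made-up sample data.
    public static List<LogEntry> sample() {
        return Arrays.asList(
            new LogEntry("GET", "/a", 100L),
            new LogEntry("GET", "/b", 500L),
            new LogEntry("POST", "/a", 900L),
            new LogEntry("GET", "/a", 300L),
            new LogEntry("GET", "/c", 50L));
    }

    public static void main(String[] args) {
        System.out.println(topGetsByBytes(sample(), 2)); // {/b=500, /a=400}
    }
}
```

A small reference implementation like this can double as the expected result when unit-testing the real pipeline on the same sample records.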
  14. Under the hood

    [Diagram; labels: WebServerEntries, RequestMethodFilterFn, LogEntryToPairFn, Top-50 pre-filter, Calculate sums, Intermediate output, Top 50 filter, Calculated PTable]
  15. Developing with Crunch

    • Runs on a cluster or locally
    • Unit test-friendly
    • PCollection#materialize
    • MemPipeline
  16. Scala, anyone?

    object WordCount extends PipelineApp {

      def countWords(file: String) = {
        read(from.textFile(file))
          .flatMap(_.split("\\W+").filter(!_.isEmpty()))
          .count
      }

      val counts = countWords(args(0))
      write(counts, to.textFile(args(1)))
    }
  17. How to get it

    • First Apache release coming soon
    • http://cwiki.apache.org/confluence/display/CRUNCH
    • https://git-wip-us.apache.org/repos/asf/incubator-crunch.git