Introduction to Crunch

greid
August 28, 2012

An introduction to Crunch, an Apache Incubator project for simple and efficient MapReduce pipelines in Java.

Transcript

  1. Crunch
    Simple and Efficient MapReduce Pipelines

  2. Who is Gabriel Reid?
    • Day job at TomTom
    • @ReidGabriel on Twitter
    • Committer on Crunch

  3. Map Reduce
    Input → Map → Reduce → Output
    (several Map tasks and several Reduce tasks run in parallel)

  4. Word count
    Input:  big data is big
    Map:    (big, 1) (data, 1) (is, 1) (big, 1)
    Group:  (big, (1, 1)) (data, (1)) (is, (1))
    Reduce: (big, 2) (data, 1) (is, 1)
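
The three phases on this slide can be traced in a few lines of plain Java. This is only an in-memory model for following the data flow, not Hadoop code; the class and method names are made up for the sketch.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the word-count phases; no Hadoop involved.
class WordCountPhases {

    // Map phase: emit a (word, 1) pair for every token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Group phase: collect the emitted values per key, sorted by key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = group(map("big data is big"));
        System.out.println(grouped);          // {big=[1, 1], data=[1], is=[1]}
        System.out.println(reduce(grouped));  // {big=2, data=1, is=1}
    }
}
```

In a real job the map and reduce steps run as separate tasks on different machines, and the grouping is done by the framework between them.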

  5. What is Crunch?
    • Abstraction on Hadoop MapReduce
    • Higher-level ease of use, lower-level power
    • Apache Incubator project

  6. Origins of Crunch
    • Follows the lead of Hadoop MapReduce (Google MapReduce), HBase (Google BigTable), HDFS (GFS)
    • Based on the Google FlumeJava paper
    • Started by a former FlumeJava user (@joshwills)
    • In Apache Incubator since end of May 2012

  7. Why an abstraction on MapReduce?
    • Boilerplate
    • Job control and dependencies
    • Optimizations

  8. WordCount in traditional MapReduce (1)
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          // Emit (word, 1) for every token in the line.
          word.set(tokenizer.nextToken());
          context.write(word, one);
        }
      }
    }

  9. WordCount in traditional MapReduce (2)
    public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        // Sum all the 1s emitted for this word.
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }

  10. WordCount in traditional MapReduce (3)
    Job job = new Job();
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);

  11. Crunch Basics: PCollection & DoFn
    PCollection → DoFn → PCollection

  12. Word count in Crunch
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split every line into words.
    PCollection<String> words = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              emitter.emit(tokenizer.nextToken());
            }
          }
        }, Writables.strings());

    PTable<String, Long> wordCounts = Aggregate.count(words);
    pipeline.writeTextFile(wordCounts, args[1]);
    pipeline.done();

  13. PCollection
    • Immutable, typed collection of values
    • Typically stored on HDFS

  14. DoFn
    • Operates on a single value
    • Outputs 0, 1, or n values
    • Single input type and output type

  15. Implementing a DoFn
    public class ToStringFn<T> extends DoFn<T, String> {
      @Override
      public void process(T input, Emitter<String> emitter) {
        emitter.emit(input.toString());
      }
    }

  16. Built-in DoFn implementations
    • MapFn
    • FilterFn
    • CombineFn
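
How MapFn and FilterFn narrow the general DoFn contract can be modelled in plain Java. The types below (`Emitter`, `DoFn`, `MapFn`, `FilterFn`, and the `apply` helper) are simplified stand-ins written for this sketch, not Crunch's actual classes; CombineFn, which works on the grouped values of a table, is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for this sketch; these are NOT Crunch's real classes.
class DoFnModel {

    interface Emitter<T> {
        void emit(T value);
    }

    // General case: may emit zero, one, or many outputs per input.
    interface DoFn<S, T> {
        void process(S input, Emitter<T> emitter);
    }

    // Special case: exactly one output per input.
    static abstract class MapFn<S, T> implements DoFn<S, T> {
        abstract T map(S input);

        @Override
        public void process(S input, Emitter<T> emitter) {
            emitter.emit(map(input));
        }
    }

    // Special case: zero or one output per input, with the type unchanged.
    static abstract class FilterFn<T> implements DoFn<T, T> {
        abstract boolean accept(T input);

        @Override
        public void process(T input, Emitter<T> emitter) {
            if (accept(input)) {
                emitter.emit(input);
            }
        }
    }

    // Hypothetical helper: runs a function over an in-memory list.
    static <S, T> List<T> apply(List<S> inputs, DoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    // Keep words longer than two characters, then map each to its length.
    static List<Integer> lengthsOfLongWords(List<String> words) {
        List<String> longWords = apply(words, new FilterFn<String>() {
            @Override
            boolean accept(String word) {
                return word.length() > 2;
            }
        });
        return apply(longWords, new MapFn<String, Integer>() {
            @Override
            Integer map(String word) {
                return word.length();
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(lengthsOfLongWords(Arrays.asList("big", "data", "is", "big")));
        // prints [3, 4, 3]
    }
}
```

Because the logic lives in small classes that take one value and an emitter, it can be exercised against an in-memory list like this without starting a cluster.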

  17. Types of PCollections
    • PCollection<T>
    • PTable<K, V>
    • PGroupedTable<K, V>

  18. PCollections in practice
    PCollection people = ...;
    PTable peopleByDept = people.by(
        new ExtractDepartmentFn(), Writables.strings());
    PGroupedTable peopleGroupedByDept =
        peopleByDept.groupByKey();

  19. Multiple PCollections
    • Union
    • Join
    • CoGroup
    • CrossJoin
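
What these four operations compute can be shown on small in-memory collections. The sketch below is plain Java, not Crunch code: tables are modelled as maps from key to a list of values, the join is assumed to be an inner join, and every name and data value is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the four operations; not Crunch code.
class MultiCollectionSketch {

    // Union: all elements of both collections, duplicates kept.
    static List<String> union(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Cross join: every element of a paired with every element of b.
    static List<String> cross(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        for (String x : a) {
            for (String y : b) {
                out.add(x + "/" + y);
            }
        }
        return out;
    }

    // CoGroup: for each key, the left values and the right values side by side.
    static Map<String, List<List<String>>> cogroup(Map<String, List<String>> left,
                                                   Map<String, List<String>> right) {
        Map<String, List<List<String>>> out = new TreeMap<>();
        List<String> keys = union(new ArrayList<>(left.keySet()), new ArrayList<>(right.keySet()));
        for (String key : keys) {
            List<String> l = left.getOrDefault(key, new ArrayList<>());
            List<String> r = right.getOrDefault(key, new ArrayList<>());
            out.put(key, Arrays.asList(l, r));
        }
        return out;
    }

    // Inner join: for keys present on both sides, pair each left value
    // with each right value.
    static Map<String, List<String>> join(Map<String, List<String>> left,
                                          Map<String, List<String>> right) {
        Map<String, List<String>> out = new TreeMap<>();
        for (Map.Entry<String, List<List<String>>> e : cogroup(left, right).entrySet()) {
            List<String> pairs = cross(e.getValue().get(0), e.getValue().get(1));
            if (!pairs.isEmpty()) {
                out.put(e.getKey(), pairs);
            }
        }
        return out;
    }

    // Sample data: people keyed by department.
    static Map<String, List<String>> people() {
        Map<String, List<String>> m = new TreeMap<>();
        m.put("eng", Arrays.asList("ann", "bob"));
        m.put("ops", Arrays.asList("cy"));
        return m;
    }

    // Sample data: offices keyed by department.
    static Map<String, List<String>> offices() {
        Map<String, List<String>> m = new TreeMap<>();
        m.put("eng", Arrays.asList("berlin"));
        m.put("hr", Arrays.asList("oslo"));
        return m;
    }

    public static void main(String[] args) {
        System.out.println(cogroup(people(), offices()));
        // {eng=[[ann, bob], [berlin]], hr=[[], [oslo]], ops=[[cy], []]}
        System.out.println(join(people(), offices()));
        // {eng=[ann/berlin, bob/berlin]}
    }
}
```

The join here is built on top of the cogroup, which shows how the two relate: a cogroup keeps every key from either side, while the inner join keeps only keys that have values on both.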

  20. Data types
    • “Simple” datatypes
    • org.apache.hadoop.io.Writable
    • Apache Avro (including reflection!)

  21. Input and Output
    • Text files
    • Sequence files
    • Avro files
    • HBase
    • Anything else with an InputFormat or OutputFormat implementation

  22. Example: web server log analysis
    PCollection<WebServerLogEntry> webserverEntries = pipeline.read(
        At.avroFile(inputPath, Avros.records(WebServerLogEntry.class)));

    // Total response bytes per URL for GET requests, top 50 by size.
    PTable<String, Long> topGetsByBytes = webserverEntries
        .filter(new RequestMethodFilterFn("GET"))
        .parallelDo(
            new WebServerEntryToUrlAndResponseSizeFn(),
            Avros.tableOf(Avros.strings(), Avros.longs()))
        .groupByKey()
        .combineValues(CombineFn.SUM_LONGS())
        .top(50);
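
To make the result of this pipeline concrete, here is the same computation over an in-memory list using Java streams rather than Crunch. The `LogEntry` class, its field names, and the sample data are assumptions made for the sketch; the real `WebServerLogEntry` is not shown in the slides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory equivalent of the pipeline's logic, using Java streams instead of Crunch.
class TopUrlsSketch {

    // Hypothetical stand-in for WebServerLogEntry; the field names are assumptions.
    static class LogEntry {
        final String method;
        final String url;
        final long responseBytes;

        LogEntry(String method, String url, long responseBytes) {
            this.method = method;
            this.url = url;
            this.responseBytes = responseBytes;
        }
    }

    static List<Map.Entry<String, Long>> topGetsByBytes(List<LogEntry> entries, int limit) {
        // filter, parallelDo, groupByKey, combineValues: total bytes per URL for GETs.
        Map<String, Long> bytesByUrl = entries.stream()
            .filter((LogEntry e) -> "GET".equals(e.method))
            .collect(Collectors.groupingBy((LogEntry e) -> e.url,
                Collectors.summingLong((LogEntry e) -> e.responseBytes)));

        // top(limit): largest totals first, ties broken by URL for a stable result.
        return bytesByUrl.entrySet().stream()
            .sorted((a, b) -> {
                int byBytes = Long.compare(b.getValue(), a.getValue());
                return byBytes != 0 ? byBytes : a.getKey().compareTo(b.getKey());
            })
            .limit(limit)
            .collect(Collectors.toList());
    }

    // Made-up sample data.
    static List<LogEntry> sample() {
        return Arrays.asList(
            new LogEntry("GET", "/a", 100),
            new LogEntry("GET", "/b", 300),
            new LogEntry("POST", "/a", 500),
            new LogEntry("GET", "/a", 250));
    }

    public static void main(String[] args) {
        System.out.println(topGetsByBytes(sample(), 2));   // [/a=350, /b=300]
    }
}
```

The POST request is dropped by the filter, so `/a` totals 350 bytes rather than 850.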

  23. Under the hood
    [Diagram: execution plan for the web server log analysis. Labels shown:
    WebServerEntries, filter, RequestMethodFilterFn, LogEntryToPairFn,
    Calculate sums, Top-50 pre-filter, Intermediate output, Top 50,
    Calculated PTable]
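
One plausible reading of the "Top-50 pre-filter" label is that each task trims its own results to a local top N before the final top N is computed, so that far less data reaches the last step. The plain-Java sketch below illustrates that idea only; it is not taken from Crunch, and the partitioning and names are made up. It assumes the sums are already final, so each value appears in exactly one partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration of a local top-N pre-filter followed by a global top N.
class TopNPreFilterSketch {

    // Largest n values of one list, in descending order.
    static List<Long> topN(List<Long> values, int n) {
        List<Long> sorted = new ArrayList<>(values);
        sorted.sort(Collections.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    // Pre-filter every partition locally, then take the top n of the survivors.
    static List<Long> topNWithPreFilter(List<List<Long>> partitions, int n) {
        List<Long> survivors = new ArrayList<>();
        for (List<Long> partition : partitions) {
            survivors.addAll(topN(partition, n));   // at most n values per partition
        }
        return topN(survivors, n);
    }

    public static void main(String[] args) {
        List<List<Long>> partitions = Arrays.asList(
            Arrays.asList(5L, 90L, 12L, 7L),
            Arrays.asList(40L, 3L, 88L),
            Arrays.asList(1L, 2L));
        System.out.println(topNWithPreFilter(partitions, 2));   // [90, 88]
    }
}
```

The result is the same as sorting everything at once, because a value outside its own partition's top N cannot be in the global top N.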

  24. Developing with Crunch
    • Runs on a cluster or locally
    • Unit test-friendly
    • PCollection#materialize
    • MemPipeline

  25. Scala, anyone?
    object WordCount extends PipelineApp {
      def countWords(file: String) = {
        read(from.textFile(file))
          .flatMap(_.split("\\W+").filter(!_.isEmpty()))
          .count
      }

      val counts = countWords(args(0))
      write(counts, to.textFile(args(1)))
    }

  26. Other MapReduce abstractions
    • Hive
    • Pig
    • Cascading

  27. How to get it
    • First Apache release coming soon
    • http://cwiki.apache.org/confluence/display/CRUNCH
    • https://git-wip-us.apache.org/repos/asf/incubator-crunch.git
