Slide 1


Crunch: Simple and Efficient MapReduce Pipelines

Slide 2


Who is Gabriel Reid?
• Day job at TomTom
• @ReidGabriel on Twitter
• Committer on Crunch

Slide 3


Map Reduce

Diagram: Input is split across multiple Map tasks, whose output is shuffled to Reduce tasks, which produce the Output.

Slide 4


Word count

Input: "big data is big"
Map output: (big, 1), (data, 1), (is, 1), (big, 1)
Grouped: (big, (1, 1)), (data, (1)), (is, (1))
Reduce output: (big, 2), (data, 1), (is, 1)

Slide 5


What is Crunch?
• Abstraction on top of Hadoop MapReduce
• Higher-level ease of use, lower-level power
• Apache Incubator project

Slide 6


Origins of Crunch
• Follows the lead of Hadoop MapReduce (Google MapReduce), HBase (Google BigTable), HDFS (GFS)
• Based on the Google FlumeJava paper
• Started by a former FlumeJava user (@joshwills)
• In the Apache Incubator since end of May 2012

Slide 7


Why an abstraction on MapReduce?
• Boilerplate
• Job control and dependencies
• Optimizations

Slide 8


WordCount in traditional MapReduce (1)

public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

Slide 9


WordCount in traditional MapReduce (2)

public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Slide 10


WordCount in traditional MapReduce (3)

Job job = new Job();
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);

Slide 11


Crunch Basics: PCollection & DoFn

Diagram: a DoFn transforms an input PCollection into an output PCollection.

Slide 12


Word count in Crunch

Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
PCollection<String> lines = pipeline.readTextFile(args[0]);
PCollection<String> words = lines.parallelDo(
    new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          emitter.emit(tokenizer.nextToken());
        }
      }
    }, Writables.strings());
PTable<String, Long> wordCounts = Aggregate.count(words);
pipeline.writeTextFile(wordCounts, args[1]);
pipeline.done();

Slide 13


PCollection<T>
• Immutable, typed collection of values
• Typically stored on HDFS

Slide 14


DoFn • Operates on a single value • Outputs 0, 1, or n values • Single input type and output type

Slide 15


Implementing a DoFn

public class ToStringFn<T> extends DoFn<T, String> {
  @Override
  public void process(T input, Emitter<String> emitter) {
    emitter.emit(input.toString());
  }
}
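Applying it is a one-liner; a minimal usage sketch, where the records collection and its Record element type are assumed for illustration:

// Convert every element of an assumed PCollection<Record> to its
// string form, producing a PCollection<String>.
PCollection<Record> records = ...;
PCollection<String> strings = records.parallelDo(
    new ToStringFn<Record>(), Writables.strings());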

Slide 16


Built-in DoFn implementations
• MapFn
• FilterFn
• CombineFn
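A minimal sketch of the two simplest specializations; the class names are illustrative, and the input is assumed to be string values. MapFn emits exactly one output per input, while FilterFn keeps or drops each input:

// MapFn<S, T>: exactly one output value per input value
public class ToLowerCaseFn extends MapFn<String, String> {
  @Override
  public String map(String input) {
    return input.toLowerCase();
  }
}

// FilterFn<T>: keep each input value for which accept returns true
public class NonEmptyFn extends FilterFn<String> {
  @Override
  public boolean accept(String input) {
    return !input.isEmpty();
  }
}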

Slide 17


Types of PCollections
• PCollection<T>
• PTable<K, V>
• PGroupedTable<K, V>

Slide 18


PCollections in practice

PCollection<Person> people = ...;
PTable<String, Person> peopleByDept = people.by(
    new ExtractDepartmentFn(), Writables.strings());
PGroupedTable<String, Person> peopleGroupedByDept =
    peopleByDept.groupByKey();
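The ExtractDepartmentFn used above is not shown on the slide; a minimal sketch, assuming a Person class with a getDepartment() accessor, could look like this:

// Key-extraction function for PCollection#by: maps each Person to the
// department name it should be keyed on (Person is an assumed type).
public class ExtractDepartmentFn extends MapFn<Person, String> {
  @Override
  public String map(Person person) {
    return person.getDepartment();
  }
}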

Slide 19


Multiple PCollections
• Union
• Join
• CoGroup
• CrossJoin
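As an illustration, an inner join of two PTables that share a key type can be expressed with the Join helper in org.apache.crunch.lib (the Person and Office value types here are assumptions):

// Inner join on the common String key; each result value pairs up the
// matching entries from the two input tables.
PTable<String, Person> peopleByDept = ...;
PTable<String, Office> officesByDept = ...;
PTable<String, Pair<Person, Office>> joined =
    Join.join(peopleByDept, officesByDept);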

Slide 20


Data types
• “Simple” datatypes
• org.apache.hadoop.io.Writable
• Apache Avro (including reflection!)
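As a sketch of the reflection support, Avros.reflects can derive a PType straight from a plain Java class (the LogEntry class is made up for illustration):

// Plain Java class; no Writable or Avro-specific code required.
public class LogEntry {
  public String url;
  public long responseSize;
}

// Avro reflection derives the serialization schema from the fields,
// so LogEntry instances can live in a PCollection<LogEntry>.
PType<LogEntry> logEntryType = Avros.reflects(LogEntry.class);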

Slide 21


Input and Output
• Text files
• Sequence files
• Avro files
• HBase
• Anything else with an InputFormat or OutputFormat implementation
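A small sketch of the source/target style, assuming string values in a sequence file (the paths and the SeqFileExample class name are placeholders; At lives in org.apache.crunch.io):

Pipeline pipeline = new MRPipeline(SeqFileExample.class);
// Read string values from a sequence file source...
PCollection<String> values = pipeline.read(
    At.sequenceFile("/data/input", Writables.strings()));
// ...and write them back out as plain text.
pipeline.writeTextFile(values, "/data/output");
pipeline.done();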

Slide 22


Example: web server log analysis

PCollection<WebServerLogEntry> webserverEntries = pipeline.read(
    At.avroFile(inputPath, Avros.records(WebServerLogEntry.class)));
PTable<String, Long> topGetsByBytes = webserverEntries
    .filter(new RequestMethodFilterFn("GET"))
    .parallelDo(
        new WebServerEntryToUrlAndResponseSizeFn(),
        Avros.tableOf(Avros.strings(), Avros.longs()))
    .groupByKey()
    .combineValues(CombineFn.SUM_LONGS())
    .top(50);

Slide 23


Under the hood

Diagram: the planned execution of the example. WebServerEntries passes through RequestMethodFilterFn and LogEntryToPairFn in the map phase, a top-50 pre-filter runs before the intermediate output, sums are calculated in the reduce phase, and a final top-50 filter produces the calculated PTable.

Slide 24


Developing with Crunch
• Runs on a cluster or locally
• Unit test-friendly
• PCollection#materialize
• MemPipeline
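A hedged unit-test sketch combining MemPipeline and materialize; SplitWordsFn stands in for the word-splitting DoFn from the earlier word count slide:

// In-memory pipeline: runs entirely in the JVM, no cluster needed.
PCollection<String> lines = MemPipeline.collectionOf("big data is big");
PCollection<String> words = lines.parallelDo(
    new SplitWordsFn(), Writables.strings());
// materialize() makes the computed values directly iterable,
// so the result can be asserted on in a plain unit test.
for (String word : words.materialize()) {
  System.out.println(word);
}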

Slide 25


Scala, anyone?

object WordCount extends PipelineApp {
  def countWords(file: String) = {
    read(from.textFile(file))
      .flatMap(_.split("\\W+").filter(!_.isEmpty()))
      .count
  }

  val counts = countWords(args(0))
  write(counts, to.textFile(args(1)))
}

Slide 26


Other MapReduce abstractions
• Hive
• Pig
• Cascading

Slide 27


How to get it
• First Apache release coming soon
• http://cwiki.apache.org/confluence/display/CRUNCH
• https://git-wip-us.apache.org/repos/asf/incubator-crunch.git