Introduction to Crunch

Crunch: Simple and Efficient MapReduce Pipelines

Who is Gabriel Reid? • Day job at TomTom •
@ReidGabriel on Twitter • Committer on Crunch

MapReduce [diagram: Input, Map tasks, Reduce tasks, Output]

Word count
Input: big data is big
Map output: (big, 1) (data, 1) (is, 1) (big, 1)
Grouped by key: (big, (1, 1)) (data, (1)) (is, (1))
Reduce output: (big, 2) (data, 1) (is, 1)
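
The three phases of the word count above can be traced in plain Java. This is only a minimal in-memory sketch with no Hadoop involved; the class name WordCountPhases and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the three phases on the slide; not Hadoop code.
final class WordCountPhases {

    // Map phase: emit a (word, 1) pair for every token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: collect all values that share a key, sorted by key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(group(map("big data is big"))));
        // prints {big=2, data=1, is=1}
    }
}
```

On a real cluster the map and reduce calls run in parallel on different machines, and the grouping step is performed by the framework between them.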

What is Crunch? • Abstraction on Hadoop MapReduce • Higher-level
ease of use, lower-level power • Apache Incubator project

Origins of Crunch • Follows lead of Hadoop MapReduce (Google
MapReduce), HBase (Google BigTable), HDFS (GFS) • Based on Google FlumeJava paper • Started by a former FlumeJava user (@joshwills) • In Apache Incubator since end of May 2012

Why an abstraction on MapReduce? • Boilerplate • Job control
and dependencies • Optimizations

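WordCount in traditional MapReduce (1)
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every token in the input line.
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }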

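WordCount in traditional MapReduce (2)
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Sums the counts emitted for one word.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }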

WordCount in traditional MapReduce (3)
    Job job = new Job();
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);

Crunch Basics: PCollection & DoFn [diagram: PCollection, DoFn, PCollection]

Word count in Crunch
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split every line into words.
    PCollection<String> words = lines.parallelDo(
        new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    emitter.emit(tokenizer.nextToken());
                }
            }
        }, Writables.strings());

    // Count the occurrences of each word and write the result.
    PTable<String, Long> wordCounts = Aggregate.count(words);
    pipeline.writeTextFile(wordCounts, args[1]);
    pipeline.done();

PCollection • Immutable, typed collection of values • Typically stored
on HDFS

DoFn • Operates on a single value • Outputs 0,
1, or n values • Single input type and output type
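
To make the "0, 1, or n values" point concrete, here is a plain-Java model of the idea. SketchEmitter, SketchDoFn, and parallelDo below are hypothetical stand-ins written for this sketch; they are not the Crunch classes and they run in memory only.

```java
import java.util.ArrayList;
import java.util.List;

final class DoFnSketch {

    // Hypothetical stand-in for an emitter: receives each output value.
    interface SketchEmitter<T> {
        void emit(T value);
    }

    // Hypothetical stand-in for a DoFn: one input, any number of outputs.
    interface SketchDoFn<S, T> {
        void process(S input, SketchEmitter<T> emitter);
    }

    // Applies fn to each input in turn and gathers everything it emits.
    static <S, T> List<T> parallelDo(List<S> inputs, SketchDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    // Emits nothing for a blank line, one value for one word, n values for n words.
    static final SketchDoFn<String, String> SPLIT_WORDS = (line, emitter) -> {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                emitter.emit(token);
            }
        }
    };

    public static void main(String[] args) {
        System.out.println(parallelDo(List.of("", "big", "big data is big"), SPLIT_WORDS));
        // prints [big, big, data, is, big]
    }
}
```

The function sees one input at a time and has no access to the rest of the collection, which is what lets a framework run it in parallel.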

Implementing a DoFn
    public class ToStringFn<T> extends DoFn<T, String> {
        // Emits the string form of every input value.
        @Override
        public void process(T input, Emitter<String> emitter) {
            emitter.emit(input.toString());
        }
    }

Built-in DoFn implementations • MapFn • FilterFn • CombineFn
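
The three specializations differ in how many outputs they produce: a MapFn yields exactly one output per input, a FilterFn keeps or drops each input, and a CombineFn merges the values grouped under a key. The plain-Java sketch below models only those shapes; mapEach, keepIf, and combine are hypothetical helper names and not Crunch methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

final class BuiltInFnSketch {

    // MapFn-like: exactly one output per input.
    static <S, T> List<T> mapEach(List<S> inputs, Function<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            out.add(fn.apply(input));
        }
        return out;
    }

    // FilterFn-like: each input is either kept unchanged or dropped.
    static <T> List<T> keepIf(List<T> inputs, Predicate<T> accept) {
        List<T> out = new ArrayList<>();
        for (T input : inputs) {
            if (accept.test(input)) {
                out.add(input);
            }
        }
        return out;
    }

    // CombineFn-like: fold the values of one key into a single value.
    // Assumes a non-empty list, since a key always has at least one value.
    static <V> V combine(List<V> values, BinaryOperator<V> op) {
        V result = values.get(0);
        for (V value : values.subList(1, values.size())) {
            result = op.apply(result, value);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mapEach(List.of("big", "data"), String::length)); // prints [3, 4]
        System.out.println(keepIf(List.of(3, 4), n -> n > 3));               // prints [4]
        System.out.println(combine(List.of(1L, 1L), Long::sum));             // prints 2
    }
}
```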

Types of PCollections • PCollection<T> • PTable<K,V> • PGroupedTable<K,V>

PCollections in practice
    PCollection<Person> people = ...;

    // Key each person by department, then group the people that share a key.
    PTable<String, Person> peopleByDept =
        people.by(new ExtractDepartmentFn(), Writables.strings());
    PGroupedTable<String, Person> peopleGroupedByDept = peopleByDept.groupByKey();
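
What keying and grouping do to the data can be modelled in memory. The sketch below is plain Java rather than Crunch; the Person fields, the byAndGroup helper, and the sample names are all assumptions made for illustration, since the slide does not show them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

final class GroupByKeySketch {

    // Hypothetical value type; the slide does not show Person's fields.
    static final class Person {
        final String name;
        final String department;

        Person(String name, String department) {
            this.name = name;
            this.department = department;
        }
    }

    // Models by(...) followed by groupByKey(): extract a key from each value,
    // then gather all values that share the same key.
    static <K extends Comparable<K>, V> Map<K, List<V>> byAndGroup(
            List<V> values, Function<V, K> extractKey) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (V value : values) {
            grouped.computeIfAbsent(extractKey.apply(value), k -> new ArrayList<>()).add(value);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<Person> people = List.of(
                new Person("Ann", "Maps"),
                new Person("Bob", "Traffic"),
                new Person("Cy", "Maps"));
        Map<String, List<Person>> byDept = byAndGroup(people, p -> p.department);
        byDept.forEach((dept, members) -> System.out.println(dept + ": " + members.size()));
    }
}
```

The intermediate keyed form corresponds to the PTable and the grouped map to the PGroupedTable, with the difference that Crunch distributes both across the cluster instead of holding them in one map.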

Multiple PCollections • Union • Join • CoGroup • CrossJoin

Data types • “Simple” datatypes • org.apache.hadoop.io.Writable • Apache Avro (including reflection!)

Input and Output • Text files • Sequence files • Avro files • HBase
• Anything else with an InputFormat or OutputFormat implementation

Example: web server log analysis
    PCollection<WebServerLogEntry> webserverEntries = pipeline.read(
        At.avroFile(inputPath, Avros.records(WebServerLogEntry.class)));

    // Sum the response sizes of GET requests per URL and keep the 50 largest.
    PTable<String, Long> topGetsByBytes = webserverEntries
        .filter(new RequestMethodFilterFn("GET"))
        .parallelDo(
            new WebServerEntryToUrlAndResponseSizeFn(),
            Avros.tableOf(Avros.strings(), Avros.longs()))
        .groupByKey()
        .combineValues(CombineFn.<String>SUM_LONGS())
        .top(50);
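
The same filter, sum, and top-n steps can be followed on a small in-memory list. This is a plain-Java sketch using streams, not Crunch; the LogEntry fields, the topByBytes helper, and the sample entries are assumptions, since the slide does not show the fields of WebServerLogEntry.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TopUrlsSketch {

    // Hypothetical log entry with only the fields this sketch needs.
    static final class LogEntry {
        final String method;
        final String url;
        final long responseBytes;

        LogEntry(String method, String url, long responseBytes) {
            this.method = method;
            this.url = url;
            this.responseBytes = responseBytes;
        }
    }

    // Filter by request method, sum response sizes per URL, keep the n largest sums.
    static List<Map.Entry<String, Long>> topByBytes(List<LogEntry> entries, String method, int n) {
        Map<String, Long> bytesByUrl = entries.stream()
                .filter((LogEntry e) -> method.equals(e.method))
                .collect(Collectors.groupingBy(
                        (LogEntry e) -> e.url,
                        Collectors.summingLong((LogEntry e) -> e.responseBytes)));
        return bytesByUrl.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<LogEntry> entries = List.of(
                new LogEntry("GET", "/a", 100L),
                new LogEntry("GET", "/b", 50L),
                new LogEntry("POST", "/a", 1000L),
                new LogEntry("GET", "/a", 25L));
        System.out.println(topByBytes(entries, "GET", 2));
        // prints [/a=125, /b=50]
    }
}
```

The in-memory version sorts every URL before taking the first n, which is only practical because the data fits in one process.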

Under the hood [execution plan diagram; labels: WebServerEntries, Calculate sums, Intermediate output, Calculated PTable, Top 50 filter, RequestMethodFilterFn, LogEntryToPairFn, Top-50 pre-filter]

Developing with Crunch •Runs on a cluster or locally •Unit
test-friendly •PCollection#materialize •MemPipeline

Scala, anyone?
    object WordCount extends PipelineApp {
      def countWords(file: String) = {
        read(from.textFile(file))
          .flatMap(_.split("\\W+").filter(!_.isEmpty()))
          .count
      }
      val counts = countWords(args(0))
      write(counts, to.textFile(args(1)))
    }

Other MapReduce abstractions • Hive • Pig • Cascading

How to get it • First Apache release coming soon • http://cwiki.apache.org/confluence/display/CRUNCH
• https://git-wip-us.apache.org/repos/asf/incubator-crunch.git
