Slide 18
Slide 18 text
Apache Crunch
• Abstraction layer on top of MapReduce
• More developer-friendly than Pig/Hive (in many cases)
• Modeled after Google’s FlumeJava
• Flexible Data Types:
PCollection, PTable
• Simple but Powerful Operations:
parallelDo(), groupByKey(), combineValues(),
flatten(), count(), join(), sort(), top()
• Robust join strategies: reduce-side, map-side, sharded joins, bloom-filter joins, cogroups, etc.
• Three runner pipelines: MRPipeline (Hadoop), SparkPipeline,
MemPipeline
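To make the types, operations, and runners above a little more concrete, here is a minimal word-count-style sketch of a Crunch pipeline. This is not the pipeline from this talk; the class name and the input/output paths are illustrative.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCountSketch {
  public static void main(String[] args) {
    // MemPipeline runs the whole pipeline in local memory (handy for tests);
    // MRPipeline or SparkPipeline would run the same code on a cluster.
    Pipeline pipeline = MemPipeline.getInstance();

    PCollection<String> lines = pipeline.readTextFile("input.txt");

    // parallelDo() applies a DoFn to every element and can emit zero or more outputs.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() does the groupByKey/combineValues work under the covers,
    // producing a PTable of word -> occurrence count.
    PTable<String, Long> counts = words.count();

    pipeline.writeTextFile(counts, "counts-out");
    pipeline.done();
  }
}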
So for these reasons, we decided to give Apache Crunch a try.
Crunch is an abstraction layer for defining data pipelines that, under the covers, compile down to a series of MapReduce jobs. Distinct from other abstraction layers like, say, Pig and Hive, it’s geared less towards data
scientists and more towards developers. It’s more like Cascading/Scalding, if you are familiar with those. The Crunch API itself is modeled after Google’s FlumeJava project, which is what they use for this at Google.
Its goal is to be simple and flexible. We found that, with some tweaks, Crunch actually enabled us to meet all of our goals. So let me just jump right into what we did…
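As a hedged aside on the join-strategies bullet from this slide: in code, those variants are chosen through Crunch’s JoinStrategy interface. The sketch below uses the plain reduce-side DefaultJoinStrategy; the table names and key type are hypothetical, not from this talk.

import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.join.DefaultJoinStrategy;
import org.apache.crunch.lib.join.JoinStrategy;
import org.apache.crunch.lib.join.JoinType;

public class JoinSketch {
  // Inner-joins two PTables on a shared string key (e.g. a user id).
  public static PTable<String, Pair<Long, String>> innerJoin(
      PTable<String, Long> countsByUser, PTable<String, String> namesByUser) {
    // DefaultJoinStrategy is the classic reduce-side join; MapsideJoinStrategy,
    // ShardedJoinStrategy, or BloomFilterJoinStrategy can be dropped in instead
    // when one input is small, keys are skewed, or one side is sparse.
    JoinStrategy<String, Long, String> strategy =
        new DefaultJoinStrategy<String, Long, String>();
    return strategy.join(countsByUser, namesByUser, JoinType.INNER_JOIN);
  }
}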