An introduction to Apache Crunch

Apache Crunch
• What is it?
• How does it work?
• Why use it?
• Hadoop MapReduce pipelines
• Scrunch
• Joins
www.semtech-solutions.co.nz [email protected]

Apache Crunch – Pipeline
• Crunch is based on Google's FlumeJava
• Provides a Java-based API for MapReduce pipelines
• It uses an MST (multiple serializable type) data model
• Good for processing complex data types
• Better for "non-tuple" data types, e.g.
  – Images
  – Audio
  – Seismic data

Apache Crunch – Pipeline
• What is a MapReduce pipeline?
  – Map
  – Shuffle
  – Reduce
  – Combine
• Arranged in sequence and/or in parallel
• Potentially very long chains
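The four stages named on this slide can be illustrated with a word count written in plain Java. This is only a sketch of the general MapReduce idea, not Crunch code: the class `MapReduceStagesSketch` and its method names are made up for illustration, and everything runs in memory on small lists rather than on Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceStagesSketch {

    // Map + combine: turn one input split into partial (word, count) pairs,
    // pre-summing counts locally so less data has to cross the shuffle.
    static Map<String, Integer> mapAndCombine(List<String> split) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String line : split) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    partial.merge(word, 1, Integer::sum);
                }
            }
        }
        return partial;
    }

    // Shuffle: group the partial counts from every split by key.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> partials) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) ->
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(count));
        }
        return grouped;
    }

    // Reduce: collapse the grouped values of each key into a single total.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((word, counts) ->
            totals.put(word, counts.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    // Chain the stages in sequence; each split could be mapped in parallel.
    static Map<String, Integer> wordCount(List<List<String>> splits) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> split : splits) {
            partials.add(mapAndCombine(split));
        }
        return reduce(shuffle(partials));
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
            List.of("crunch runs pipelines", "pipelines of map reduce"),
            List.of("map reduce pipelines"));
        System.out.println(wordCount(splits));
    }
}
```

A longer pipeline would feed the output of one such job into the map stage of the next, which is where the long chains mentioned on the slide come from.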

Apache Crunch – Scala
• Scrunch is a Scala wrapper for Apache Crunch
• Reduced code
• Functional and OO styles
• Uses type inference for Map / Reduce
• Incorporates Java Materialize functionality
• Includes a REPL (read-eval-print loop)

Apache Crunch – Joins
• Details of joins available in Crunch
  – Inner / Outer, like SQL joins
  – Same with Left / Right / Full joins
  – MapSide join is an in-memory join
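The join semantics listed on this slide can be sketched in plain Java. This is not the Crunch join API: `JoinSketch` and its methods are hypothetical names, each table is assumed to be a map with unique keys and non-null values, and a `null` in an output row marks the side that had no match. Probing an in-memory map by key is also the basic idea behind an in-memory (map-side) join, where one table is assumed to be small enough to hold in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class JoinSketch {

    // Each output row is [key, leftValue, rightValue]; null marks a missing side.
    static List<List<String>> join(Map<String, String> left, Map<String, String> right,
                                   boolean keepUnmatchedLeft, boolean keepUnmatchedRight) {
        List<List<String>> rows = new ArrayList<>();
        // Probe the in-memory right table once per left row.
        for (Map.Entry<String, String> l : left.entrySet()) {
            String match = right.get(l.getKey());
            if (match != null || keepUnmatchedLeft) {
                rows.add(Arrays.asList(l.getKey(), l.getValue(), match));
            }
        }
        // Optionally add right rows whose key never appeared on the left.
        if (keepUnmatchedRight) {
            for (Map.Entry<String, String> r : right.entrySet()) {
                if (!left.containsKey(r.getKey())) {
                    rows.add(Arrays.asList(r.getKey(), null, r.getValue()));
                }
            }
        }
        return rows;
    }

    static List<List<String>> inner(Map<String, String> l, Map<String, String> r) {
        return join(l, r, false, false);
    }

    static List<List<String>> leftOuter(Map<String, String> l, Map<String, String> r) {
        return join(l, r, true, false);
    }

    static List<List<String>> rightOuter(Map<String, String> l, Map<String, String> r) {
        return join(l, r, false, true);
    }

    static List<List<String>> fullOuter(Map<String, String> l, Map<String, String> r) {
        return join(l, r, true, true);
    }

    public static void main(String[] args) {
        Map<String, String> users = new TreeMap<>(Map.of("1", "ann", "2", "bob"));
        Map<String, String> orders = new TreeMap<>(Map.of("2", "book", "3", "pen"));
        System.out.println("inner: " + inner(users, orders));
        System.out.println("left:  " + leftOuter(users, orders));
        System.out.println("right: " + rightOuter(users, orders));
        System.out.println("full:  " + fullOuter(users, orders));
    }
}
```

The four variants differ only in which unmatched rows they keep, which is why a single helper with two flags is enough for this sketch.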

Apache Crunch – Performance
• A lightweight API that runs efficiently
• Crunch is a thin veneer on top of MapReduce
• Two implementations available
  – Hadoop Writables
  – Avro
• Avro implementation much faster

Apache Crunch – API
• Data Model
  – Pipeline
  – MRPipeline
  – MemPipeline
  – PCollection
  – PTable
  – PGroupedTable
  – Source
  – Target
  – Emitter
  – PType
• Operators
  – DoFn
  – CombineFn
  – FilterFn
  – Joins
  – Cartesian
  – Sort
  – Secondary Sort
  – PObject
  – BloomFilters
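To give a feel for how the DoFn, FilterFn and Emitter items relate, here is a plain-Java sketch of the pattern. It does not use the Crunch library: `SketchEmitter`, `SketchDoFn`, `SketchFilterFn` and `parallelDo` are hypothetical stand-ins loosely modelled on the names in the list, and the real signatures should be checked against the Crunch Javadoc.

```java
import java.util.ArrayList;
import java.util.List;

final class DoFnSketch {

    // Hypothetical stand-in for an emitter: collects zero or more outputs per input.
    interface SketchEmitter<T> {
        void emit(T value);
    }

    // Hypothetical stand-in for a DoFn: one input record in, any number of records out.
    interface SketchDoFn<S, T> {
        void process(S input, SketchEmitter<T> emitter);
    }

    // Hypothetical stand-in for a FilterFn: a record is kept only when accept is true.
    interface SketchFilterFn<T> {
        boolean accept(T input);
    }

    // Apply a function to every record of an in-memory "collection".
    static <S, T> List<T> parallelDo(List<S> input, SketchDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S record : input) {
            fn.process(record, out::add);
        }
        return out;
    }

    // A filter is a function that emits its input unchanged or not at all.
    static <T> List<T> filter(List<T> input, SketchFilterFn<T> fn) {
        return DoFnSketch.<T, T>parallelDo(input, (record, emitter) -> {
            if (fn.accept(record)) {
                emitter.emit(record);
            }
        });
    }

    // Example function: split each line into words, emitting one record per word.
    static List<String> splitWords(List<String> lines) {
        return DoFnSketch.<String, String>parallelDo(lines, (line, emitter) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        });
    }

    public static void main(String[] args) {
        List<String> words = splitWords(List.of("map shuffle reduce", "combine"));
        System.out.println(filter(words, w -> w.length() > 3));
    }
}
```

The emitter indirection is what lets one input produce zero, one, or many outputs, so the same shape covers mapping, filtering and flattening.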

Contact Us
• Feel free to contact us at
  – www.semtech-solutions.co.nz
  – [email protected]
• We offer IT project consultancy
• We are happy to hear about your problems
• You can just pay for those hours that you need to solve your problems

Mike Frampton
