An introduction to Apache Crunch

A short introduction to Apache Crunch: what it is, and how it simplifies and aids the
creation of Hadoop pipelines.

Mike Frampton

December 17, 2013

Transcript

  1. Apache Crunch
     • What is it?
     • How does it work?
     • Why use it?
     • Hadoop MapReduce pipelines
     • Scrunch
     • Joins
     www.semtech-solutions.co.nz [email protected]
  2. Apache Crunch – Pipeline
     • Crunch is based on Google's FlumeJava
     • Provides a Java-based API for M/R pipelines
     • It uses an MST (multiple serializable type) data model
     • Good for processing complex data types
     • Better suited to “non-tuple” data types, e.g.
       – Images
       – Audio
       – Seismic data
  3. Apache Crunch – Pipeline
     • What is a MapReduce pipeline?
       – Map
       – Shuffle
       – Reduce
       – Combine
     • Stages are arranged in sequence and/or in parallel
     • Potentially very long chains
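The four stages named on this slide can be sketched in plain Java. This is only an in-memory illustration under stated assumptions: it uses neither Hadoop nor Crunch, the class name `StageSketch` and its method names are hypothetical, and word counting is chosen simply as a familiar example. Note that in Hadoop the combine step runs on each mapper's output before the shuffle, which is the order used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the four stages; it does not use Hadoop or Crunch.
class StageSketch {

    // Map: turn each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Combine: pre-aggregate one mapper's output locally, before the shuffle.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            partial.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(partial.entrySet());
    }

    // Shuffle: group the values from all mappers by key.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce: collapse each key's values into a single total.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((word, counts) ->
            totals.put(word, counts.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    // Runs the stages over the input splits, one simulated mapper per split.
    static Map<String, Integer> wordCount(List<List<String>> splits) {
        List<List<Map.Entry<String, Integer>>> mapperOutputs = new ArrayList<>();
        for (List<String> split : splits) {
            mapperOutputs.add(combine(map(split)));
        }
        return reduce(shuffle(mapperOutputs));
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
            List.of("crunch builds pipelines", "pipelines chain stages"),
            List.of("crunch plans stages"));
        System.out.println(wordCount(splits));
    }
}
```

A longer pipeline would feed the output of one such job into the `map` step of the next, which is the chaining that a pipeline API is meant to manage.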
  4. Apache Crunch – Scala
     • Scrunch is a Scala wrapper for Apache Crunch
     • Reduced code
     • Functional and OO styles
     • Uses type inference for Map / Reduce
     • Incorporates the Java materialize functionality
     • Includes a REPL (read-eval-print loop)
  5. Apache Crunch – Joins
     • Details of joins available in Crunch
       – Inner and outer joins, as in SQL
       – Likewise left, right and full joins
       – The map-side join is an in-memory join
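The join semantics listed above can be illustrated with a small plain-Java sketch. It does not use the Crunch join API; `JoinSketch`, `index` and `join` are hypothetical names, and rows are modelled as simple key/value entries. The index built from the smaller table stands in for the in-memory side of a map-side join, assuming that table fits in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of join semantics; it does not use the Crunch join API.
class JoinSketch {

    // Builds an in-memory index of the smaller table: the idea behind a map-side join.
    static Map<String, List<String>> index(List<Map.Entry<String, String>> smallTable) {
        Map<String, List<String>> byKey = new HashMap<>();
        for (Map.Entry<String, String> row : smallTable) {
            byKey.computeIfAbsent(row.getKey(), k -> new ArrayList<>()).add(row.getValue());
        }
        return byKey;
    }

    // Streams the left table past the index of the right table.
    // keepUnmatchedLeft = false gives an inner join; true gives a left outer join,
    // where a missing right side is rendered as "null".
    static List<String> join(List<Map.Entry<String, String>> left,
                             List<Map.Entry<String, String>> right,
                             boolean keepUnmatchedLeft) {
        Map<String, List<String>> rightByKey = index(right);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> row : left) {
            List<String> matches = rightByKey.get(row.getKey());
            if (matches != null) {
                for (String rightValue : matches) {
                    out.add(row.getKey() + ":" + row.getValue() + "," + rightValue);
                }
            } else if (keepUnmatchedLeft) {
                out.add(row.getKey() + ":" + row.getValue() + ",null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> users =
            List.of(Map.entry("1", "alice"), Map.entry("2", "bob"));
        List<Map.Entry<String, String>> countries =
            List.of(Map.entry("1", "nz"), Map.entry("3", "uk"));
        System.out.println(join(users, countries, false)); // inner join
        System.out.println(join(users, countries, true));  // left outer join
    }
}
```

A right outer join is the same operation with the two tables swapped, and a full join keeps the unmatched rows from both sides.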
  6. Apache Crunch – Performance
     • A lightweight API that runs efficiently
     • Crunch is a thin veneer on top of MapReduce
     • Two implementations available
       – Hadoop Writables
       – Avro
     • The Avro implementation is much faster
  7. Apache Crunch – API
     • Data Model
       – Pipeline
       – MRPipeline
       – MemPipeline
       – PCollection
       – PTable
       – PGroupedTable
       – Source
       – Target
       – Emitter
       – PType
     • Operators
       – DoFn
       – CombineFn
       – FilterFn
       – Joins
       – Cartesian
       – Sort
       – Secondary Sort
       – PObject
       – BloomFilters
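To give a feel for how the DoFn, FilterFn and Emitter entries relate, here is a minimal in-memory sketch. The interfaces `SimpleEmitter`, `SimpleDoFn` and `SimpleFilterFn` are hypothetical stand-ins that only mirror the general shape of the Crunch types (a function receives an input and an emitter, and may emit any number of outputs); they are not the real classes, and `parallelDo` here is just a loop over a list rather than a distributed operation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins that mirror the shape of Crunch's Emitter, DoFn and
// FilterFn; they are not the real classes and run purely in memory.
class DoFnSketch {

    // Receives the values produced by a function.
    interface SimpleEmitter<T> {
        void emit(T value);
    }

    // A function that may emit zero, one or many outputs per input.
    interface SimpleDoFn<S, T> {
        void process(S input, SimpleEmitter<T> emitter);
    }

    // A predicate that decides whether an element is kept.
    interface SimpleFilterFn<T> {
        boolean accept(T input);
    }

    // Applies the function to every element and collects whatever it emits.
    static <S, T> List<T> parallelDo(List<S> input, SimpleDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S element : input) {
            fn.process(element, out::add);
        }
        return out;
    }

    // A filter is a function that emits its input only when the predicate accepts it.
    static <T> List<T> filter(List<T> input, SimpleFilterFn<T> fn) {
        return DoFnSketch.<T, T>parallelDo(input, (element, emitter) -> {
            if (fn.accept(element)) {
                emitter.emit(element);
            }
        });
    }

    public static void main(String[] args) {
        List<String> lines = List.of("apache crunch", "thin veneer on mapreduce");
        // One input line can emit several words.
        List<String> words = parallelDo(lines, (String line, SimpleEmitter<String> emitter) -> {
            for (String word : line.split(" ")) {
                emitter.emit(word);
            }
        });
        System.out.println(filter(words, word -> word.length() > 5));
    }
}
```

The emitter-based signature is what lets a single function act as a map, a flat-map or a filter, depending on how many values it emits per input.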
  8. Contact Us
     • Feel free to contact us at
       – www.semtech-solutions.co.nz
       – [email protected]
     • We offer IT project consultancy
     • We are happy to hear about your problems
     • You pay only for the hours you need to solve your problems