
Why Hadoop MapReduce needs Scala

An introduction to Scoobi and Scalding, two Scala DSLs for Hadoop.

This presentation was first given at the combined meetup of the Dutch Scala Enthusiasts and the Dutch Hadoop User Group on Thursday, March 22nd, 2012.

Age Mooij

March 26, 2012


Transcript

  1. @agemooij
    A Look at Scoobi and Scalding
    Scala DSLs for Hadoop
    Why
    Needs Scala
    Scoobi
    Scalding


  2. Obligatory “About Me” Slide


  3. Rocks!


  4. But programming it kinda Sucks!


  5. Hello World: Word Count using Hadoop MapReduce


    Split lines into words
    Turn each word into a Pair(word, 1)
    Group by word (done by the framework in the sort-and-shuffle phase)
    For each word, sum the 1s to get the total
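    The slide's code isn't captured in this transcript. A minimal Scala port of the classic raw-Hadoop word count, with illustrative class names of our own, would look roughly like this:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    import scala.collection.JavaConverters._

    // Turn each word into a Pair(word, 1)
    class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()

      override def map(key: LongWritable, value: Text,
                       context: Mapper[LongWritable, Text, Text, IntWritable]#Context) {
        value.toString.split("""\s+""").foreach { w =>
          word.set(w)
          context.write(word, one)
        }
      }
    }

    // The framework groups by word during sort-and-shuffle;
    // the reducer then sums the 1s to get the total per word
    class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          context: Reducer[Text, IntWritable, Text, IntWritable]#Context) {
        context.write(key, new IntWritable(values.asScala.map(_.get).sum))
      }
    }

    // Low-level glue code that actually runs the job on the cluster
    object WordCount {
      def main(args: Array[String]) {
        val job = new Job(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenizerMapper])
        job.setMapperClass(classOf[TokenizerMapper])
        job.setCombinerClass(classOf[IntSumReducer])
        job.setReducerClass(classOf[IntSumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        job.waitForCompletion(true)
      }
    }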


    Low-level glue code
    Lots of small, unintuitive Mapper and Reducer classes
    Lots of Hadoop intrusiveness
    (Context, Writables, Exceptions, etc.)
    Actually runs the code on the cluster


  8. This does not make me
    a happy developer
    Especially for things that are a little bit more complicated than counting words:
    Hard to compose/chain jobs into real programs
    Unintuitive, invasive programming model
    Lots of low-level boilerplate code
    Branching, Joins, CoGroups, etc. are hard to implement


  9. What are the
    alternatives?
    Apache Pig?
    Cascading?


  10. Counting Words using Apache Pig
    Already a lot better, but anything more complex gets hard pretty fast.
    Handy for quick exploration of data
    Pig is hard to customize
    Nice!
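    The Pig script itself isn't captured in this transcript; the classic Pig Latin word count, reconstructed, looks like this:

    -- load lines, split into words, group, and count
    lines   = LOAD 'input.txt' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
    STORE counts INTO 'output';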


    package cascadingtutorial.wordcount;

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.Scheme;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Lfs;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    /**
     * Word count example in Cascading (1.x API; imports reconstructed)
     */
    public class Main {
      public static void main(String[] args) {
        String inputPath = args[0];
        String outputPath = args[1];

        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        // Choose HDFS for URI-style paths, the local filesystem otherwise
        Tap sourceTap = inputPath.matches("^[^:]+://.*") ?
          new Hfs(inputScheme, inputPath) :
          new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*") ?
          new Hfs(outputScheme, outputPath) :
          new Lfs(outputScheme, outputPath);

        // Split each line into words, group by word, count per group
        Pipe wcPipe = new Each("wordcount",
                               new Fields("line"),
                               new RegexSplitGenerator(new Fields("word"), "\\s+"),
                               new Fields("word"));
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        Flow parsedLogFlow = new FlowConnector(properties)
          .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
      }
    }
    Counting Words using Cascading
    Pipes & Filters
    Not very intuitive
    Lots of boilerplate code
    Very powerful
    Record Model


  12. Meh...
    Getting better, but definitely not perfect yet


  13. How would we count words in plain Scala?
      Our language of choice
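    The slide's code isn't captured; a sketch of the in-memory Scala word count, with a hypothetical input file name:

    // A sketch of the plain-Scala version the slide likely showed
    val lines: Seq[String] = scala.io.Source.fromFile("input.txt").getLines().toSeq

    val counts: Map[String, Int] =
      lines
        .flatMap(_.split("""\s+"""))  // split lines into words
        .groupBy(identity)            // group by word
        .mapValues(_.size)            // for each word, count the occurrences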


  14. Nice!
    Familiar, intuitive
    What if...?


  15. But that code doesn’t scale to my cluster!
      Or does it?
      Meanwhile at Google...


  16. Introducing
    Scoobi & Scalding
    Scala DSLs for Hadoop MapReduce
    NOTE: my relative familiarity with either platform: Scoobi 95%, Scalding 5%


  17. http://github.com/nicta/scoobi
    A Scala library that implements a higher-level programming model for Hadoop MapReduce


  18. Counting Words using Scoobi
    Split lines into words
    Turn each word into a Pair(word, 1)
    Group by word
    For each word, sum the 1s to get the total
    Actually runs the code on the cluster
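    The slide's code isn't captured; this closely follows the word count from the Scoobi README of the time:

    import com.nicta.scoobi.Scoobi._

    object WordCount extends ScoobiApp {
      def run() {
        val lines: DList[String] = fromTextFile(args(0))

        val counts: DList[(String, Int)] =
          lines.flatMap(_.split(" "))   // split lines into words
               .map(word => (word, 1))  // turn each word into a Pair(word, 1)
               .groupByKey              // group by word
               .combine(_ + _)          // sum the 1s to get the total

        persist(toTextFile(counts, args(1)))  // actually runs the jobs on the cluster
      }
    }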


  19. Scoobi is...
    • A distributed collections abstraction:
    • Distributed collection objects abstract data in HDFS
    • Methods on these objects abstract map/reduce operations
    • Programs manipulate distributed collection objects
    • Scoobi turns these manipulations into MapReduce jobs
    • Based on Google’s FlumeJava / Cascades
    • A source code generator
    • A staging compiler
    • A job plan optimizer
    • Open sourced by NICTA
    • Written in Scala (W00t!)


  20. DList[T]
    • Abstracts storage of data and files on HDFS
    • Calling methods on DList objects to transform and manipulate them abstracts the mapper, combiner, sort-and-shuffle, and reducer phases of MapReduce
    • Persisting a DList triggers compilation of the graph into one or more MR jobs and their execution
    • Very familiar: like standard Scala Lists
    • Strongly typed
    • Parameterized with rich types and Tuples
    • Easy list manipulation using typical higher-order functions like map, flatMap, filter, etc.


  21. DList[T]
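    The code on this slide isn't captured; a small hypothetical example, with file name and record format of our own:

    import com.nicta.scoobi.Scoobi._

    // people.txt is assumed to hold "name,age" lines
    val people: DList[(String, Int)] =
      fromTextFile("people.txt")
        .map { line => val Array(name, age) = line.split(","); (name, age.toInt) }

    // Reads like standard Scala collection code, but compiles to MapReduce
    val adultNames: DList[String] =
      people.filter { case (_, age) => age >= 18 }
            .map    { case (name, _) => name }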


  22. Under the Hood
    A Staging Compiler


    IO
    • Can read/write text files, Sequence files, and Avro files
    • Can influence sorting (raw, secondary)
    Serialization
    • Serialization of custom types through Scala type classes and WireFormat[T]
    • Scoobi implements WireFormat[T] for primitive types, strings, tuples, Option[T], Either[A, B], Iterable[T]
    • Out-of-the-box support for serialization of Scala case classes


  24. IO/Serialization I
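    The code isn't captured; a sketch of case class serialization, assuming Scoobi's mkCaseWireFormat helper and a Timestamp type of our own:

    import com.nicta.scoobi.Scoobi._

    case class Timestamp(millis: Long)

    // Derives a WireFormat from the case class's apply/unapply
    implicit val timestampFormat: WireFormat[Timestamp] =
      mkCaseWireFormat(Timestamp, Timestamp.unapply _)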


  25. IO/Serialization II
    For normal (i.e. non-case) classes
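    Again reconstructed: a hand-written WireFormat for a non-case class, assuming the 2012-era trait with toWire/fromWire and a Point class of our own:

    import java.io.{DataInput, DataOutput}
    import com.nicta.scoobi.Scoobi._

    class Point(val x: Int, val y: Int)

    implicit val pointFormat: WireFormat[Point] = new WireFormat[Point] {
      def toWire(p: Point, out: DataOutput) { out.writeInt(p.x); out.writeInt(p.y) }
      def fromWire(in: DataInput): Point = new Point(in.readInt(), in.readInt())
    }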


  26. Scalding!
    http://github.com/twitter/scalding
    A Scala library that implements a higher-level programming model for Hadoop MapReduce, built on top of Cascading


  27. Counting Words using Scalding
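    The slide's code isn't captured; this mirrors the word count from the Scalding README of the time, using the fields-based API:

    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }  // split lines into words
        .groupBy('word) { _.size }                                          // group by word and count
        .write(Tsv(args("output")))
    }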


  28. Scalding is...
    • A distributed collections abstraction
    • A wrapper around Cascading (i.e. no source code generation)
    • Based on the same record model (i.e. named fields)
    • Less strongly typed
    • Uses Kryo Serialization
    • Used by Twitter in production
    • Written in Scala (W00t!)


  29. Let’s see a bigger example...
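    The bigger example isn't captured; in the same spirit, a hypothetical Scalding job, with schema and field names of our own, joining two datasets:

    import com.twitter.scalding._

    // Hypothetical inputs: users.tsv holds (userId, name), clicks.tsv holds (userId, url)
    class TopClickersJob(args: Args) extends Job(args) {
      val users  = Tsv("users.tsv",  ('userId, 'name))
      val clicks = Tsv("clicks.tsv", ('userId, 'url))

      clicks
        .groupBy('userId) { _.size('clicks) }                // clicks per user
        .joinWithSmaller('userId -> 'userId, users.read)     // attach user names
        .groupAll { _.sortBy('clicks).reverse.take(10) }     // global top 10 (single reducer)
        .write(Tsv(args("output")))
    }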


  30. How do they compare?
    Different approaches, similar power
    Small feature differences, which will even out over time
    Scoobi gets a little closer to idiomatic Scala
    Twitter is definitely a bigger fish than NICTA, so Scalding gets all the attention
    Both open sourced (last year)


  31. Which one should I use?
    Ehm...


  32. Further Info
    http://github.com/nicta/scoobi
    (The README is very good)
    http://github.com/twitter/scalding
    http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/


  33. Some Spam...
    Ask us about our interesting training offers!
    • Cloudera Developer Training (multiple times this year)
    • Fast Track to Scala, with Typesafe (26-27 April)


  34. Questions?
