Why Hadoop MapReduce needs Scala

An introduction to Scoobi and Scalding, two Scala DSLs for Hadoop.

This presentation was first given at the combined meetup of the Dutch Scala Enthusiasts and the Dutch Hadoop User Group on Thursday, March 22nd, 2012.

Age Mooij

March 26, 2012

Transcript

  1. @agemooij: A Look at Scoobi and Scalding, Scala DSLs for Hadoop. Why Hadoop MapReduce Needs Scala: Scoobi & Scalding
  2. Obligatory “About Me” Slide

  3. Rocks!

  4. But programming it kinda Sucks!

  5. Hello World: Word Count using Hadoop MapReduce

  6. Split lines into words; turn each word into a Pair(word, 1); group by word; for each word, sum the 1s to get the total
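The slide's actual code is not captured in the transcript. Written directly against Hadoop's raw org.apache.hadoop.mapreduce API, the steps above look roughly like this (a sketch in Scala to show the glue code the next slides complain about; class names are mine, and the surrounding job-setup code is omitted):

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Mapper: split each line into words and emit a Pair(word, 1) for each
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    for (w <- value.toString.split("\\s+") if w.nonEmpty) {
      word.set(w)
      context.write(word, one)
    }
}

// Reducer: for each word (grouped by the framework), sum the 1s to get the total
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}
```

Note how much of this is Hadoop plumbing (Writables, Context, the Mapper/Reducer type parameters) rather than the four-line word-count logic itself.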
  7. Low-level glue code: lots of small, unintuitive Mapper and Reducer classes; lots of Hadoop intrusiveness (Context, Writables, Exceptions, etc.). Only a small part actually runs the code on the cluster.
  8. This does not make me a happy developer, especially for things that are a little bit more complicated than counting words: unintuitive, invasive programming model; lots of low-level boilerplate code; hard to compose/chain jobs into real programs; branching, joins, CoGroups, etc. hard to implement
  9. What are the alternatives? Apache Pig? Apache Cascading?

  10. Counting Words using Apache Pig. Nice! Handy for quick exploration of data. Already a lot better, but anything more complex gets hard pretty fast, and Pig is hard to customize.
  11. Counting Words using Apache Cascading. Very powerful record model; Pipes & Filters; but not very intuitive, with lots of boilerplate code:

      package cascadingtutorial.wordcount;

      /** Wordcount example in Cascading */
      public class Main {
        public static void main(String[] args) {
          String inputPath = args[0];
          String outputPath = args[1];

          Scheme inputScheme = new TextLine(new Fields("offset", "line"));
          Scheme outputScheme = new TextLine();

          Tap sourceTap = inputPath.matches("^[^:]+://.*")
              ? new Hfs(inputScheme, inputPath)
              : new Lfs(inputScheme, inputPath);
          Tap sinkTap = outputPath.matches("^[^:]+://.*")
              ? new Hfs(outputScheme, outputPath)
              : new Lfs(outputScheme, outputPath);

          Pipe wcPipe = new Each("wordcount",
              new Fields("line"),
              new RegexSplitGenerator(new Fields("word"), "\\s+"),
              new Fields("word"));
          wcPipe = new GroupBy(wcPipe, new Fields("word"));
          wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

          Properties properties = new Properties();
          FlowConnector.setApplicationJarClass(properties, Main.class);
          Flow parsedLogFlow = new FlowConnector(properties)
              .connect(sourceTap, sinkTap, wcPipe);
          parsedLogFlow.start();
          parsedLogFlow.complete();
        }
      }
  12. Meh... Getting better, but definitely not perfect yet

  13. How would we count words in plain Scala, our language of choice?
  14. Nice! Familiar, intuitive. What if...?
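The plain-Scala version shown on this slide is not in the transcript; a minimal sketch of what it likely looked like (collection and variable names are mine):

```scala
// Word count over an in-memory collection, in plain Scala
val lines = List("hello world", "hello scala")

val counts: Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))                    // split lines into words
    .groupBy(identity)                           // group by word
    .map { case (word, ws) => word -> ws.size }  // sum the occurrences per word
```

The same three higher-order functions reappear almost verbatim in the Scoobi version later in the deck, which is exactly the point of the "What if...?" on this slide.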

  15. But that code doesn't scale to my cluster! Or does it? Meanwhile, at Google...
  16. Introducing Scoobi & Scalding, Scala DSLs for Hadoop MapReduce. (NOTE: my relative familiarity with either platform: Scoobi 95%, Scalding 5%)
  17. http://github.com/nicta/scoobi: a Scala library that implements a higher-level programming model for Hadoop MapReduce
  18. Counting Words using Scoobi: split lines into words; turn each word into a Pair(word, 1); group by word; for each word, sum the 1s to get the total. Only one call actually runs the code on the cluster.
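The Scoobi code on this slide is not captured in the transcript. The word-count example from the Scoobi documentation of that era looked roughly like this (a sketch; exact method names varied slightly between early Scoobi releases, and it assumes Scoobi on the classpath):

```scala
import com.nicta.scoobi.Scoobi._

object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val counts: DList[(String, Int)] =
      lines.flatMap(_.split(" "))   // split lines into words
           .map(word => (word, 1))  // turn each word into a Pair(word, 1)
           .groupByKey              // group by word
           .combine(_ + _)          // for each word, sum the 1s

    // persist triggers compilation into MR jobs and actually
    // runs the code on the cluster
    persist(toTextFile(counts, args(1)))
  }
}
```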
  19. Scoobi is...
      • A distributed collections abstraction:
        • Distributed collection objects abstract data in HDFS
        • Methods on these objects abstract map/reduce operations
        • Programs manipulate distributed collection objects
        • Scoobi turns these manipulations into MapReduce jobs
      • Based on Google's FlumeJava / Cascades
      • A source code generator
      • A staging compiler
      • A job plan optimizer
      • Open sourced by NICTA
      • Written in Scala (W00t!)
  20. DList[T]
      • Abstracts storage of data and files on HDFS
      • Calling methods on DList objects to transform and manipulate them abstracts the mapper, combiner, sort-and-shuffle, and reducer phases of MapReduce
      • Persisting a DList triggers compilation of the graph into one or more MR jobs and their execution
      • Very familiar: like standard Scala Lists
        • Strongly typed
        • Parameterized with rich types and Tuples
        • Easy list manipulation using typical higher-order functions like map, flatMap, filter, etc.
  21. DList[T]

  22. Under the Hood A Staging Compiler

  23. IO / Serialization
      • Can read/write text files, Sequence files and Avro files
      • Can influence sorting (raw, secondary)
      • Serialization of custom types through Scala type classes and WireFormat[T]
      • Scoobi implements WireFormat[T] for primitive types, strings, tuples, Option[T], Either[T], Iterable[T]
      • Out-of-the-box support for serialization of Scala case classes
  24. IO/Serialization I

  25. IO/Serialization II: for normal (i.e. non-case) classes
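The serialization code from these two slides is missing from the transcript. For case classes, Scoobi of that era could derive a WireFormat from the companion's apply/unapply, roughly as below (a sketch; the helper name mkCaseWireFormat comes from the Scoobi docs, but its exact signature differed across versions, and Person is an illustrative type of mine):

```scala
import com.nicta.scoobi.Scoobi._

case class Person(name: String, age: Int)

// Bring a WireFormat[Person] into implicit scope so that a
// DList[Person] can be serialized between map and reduce phases
implicit val personFormat: WireFormat[Person] =
  mkCaseWireFormat(Person, Person.unapply _)
```

For normal (non-case) classes, slide 25 showed writing a WireFormat[T] instance by hand instead of deriving one.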

  26. Scalding! http://github.com/twitter/scalding: a Scala library that implements a higher-level programming model for Hadoop MapReduce, built on Cascading
  27. Counting Words using Scalding
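The Scalding code is likewise not in the transcript. The canonical word count from the Scalding README of 2012, using the fields-based API, was roughly (a sketch; assumes Scalding and Cascading on the classpath):

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                        // read lines, field 'line
    .flatMap('line -> 'word) { line: String =>   // split each line into words
      line.split("\\s+")
    }
    .groupBy('word) { _.size }                   // group by word and count
    .write(Tsv(args("output")))                  // write (word, count) pairs
}
```

Note the Cascading heritage: operations are expressed over named fields ('line, 'word) rather than over typed values as in Scoobi.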

  28. Scalding is...
      • A distributed collections abstraction
      • A wrapper around Cascading (i.e. no source code generation)
      • Based on the same record model (i.e. named fields)
      • Less strongly typed
      • Uses Kryo serialization
      • Used by Twitter in production
      • Written in Scala (W00t!)
  29. Let’s see a bigger example...

  30. How do they compare? Different approaches, similar power. Small feature differences, which will even out over time. Scoobi gets a little closer to idiomatic Scala. Twitter is definitely a bigger fish than NICTA, so Scalding gets all the attention. Both were open sourced last year.
  31. Which one should I use? Ehm...

  32. Further Info
      http://github.com/nicta/scoobi (the README is very good)
      scoobi-users@googlegroups.com
      scoobi-dev@googlegroups.com
      http://github.com/twitter/scalding
      cascading-user@googlegroups.com
      http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
  33. Some Spam... Ask us about our interesting training offers!
      • Cloudera Developer Training: multiple times this year
      • Fast Track to Scala (with Typesafe): 26-27 April
  34. Questions?