Slide 1

@agemooij

A Look at Scoobi and Scalding: Scala DSLs for Hadoop

Why · Needs · Scala · Scoobi · Scalding

Slide 2

Obligatory “About Me” Slide

Slide 3

Rocks!

Slide 4

But programming it kinda Sucks!

Slide 5

Hello World Word Count using Hadoop MapReduce

Slide 6

1. Split lines into words
2. Turn each word into a Pair(word, 1)
3. Group by word (?)
4. For each word, sum the 1s to get the total
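In raw Hadoop those steps end up spread across separate Mapper and Reducer classes. A rough Scala sketch against Hadoop's org.apache.hadoop.mapreduce API, for flavor only (the slide presumably shows the Java original; class names here are mine):

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Mapper: split lines into words and emit (word, 1) for each.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("""\s+""").foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Reducer: for each word, sum the 1s to get the total.
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}
```

Even before the Job wiring, Context objects and Writable wrappers leak into every method.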

Slide 7

• Low-level glue code
• Lots of small, unintuitive Mapper and Reducer classes
• Lots of Hadoop intrusiveness (Context, Writables, Exceptions, etc.)
• Actually runs the code on the cluster

Slide 8

This does not make me a happy Hadoop developer! Especially for things that are a little bit more complicated than counting words:
• Unintuitive, invasive programming model
• Hard to compose/chain jobs into real, more complicated programs
• Lots of low-level boilerplate code
• Branching, Joins, CoGroups, etc. hard to implement

Slide 9

What Are the Alternatives?

Slide 10

Counting Words using Apache Pig

Nice! Already a lot better, but anything more complex gets hard pretty fast. Handy for quick exploration of data! Pig is hard to customize/extend, and the same goes for Hive.

Slide 11

package cascadingtutorial.wordcount;

/** Wordcount example in Cascading */
public class Main {
  public static void main(String[] args) {
    String inputPath = args[0];
    String outputPath = args[1];

    Scheme inputScheme = new TextLine(new Fields("offset", "line"));
    Scheme outputScheme = new TextLine();

    Tap sourceTap = inputPath.matches("^[^:]+://.*")
        ? new Hfs(inputScheme, inputPath)
        : new Lfs(inputScheme, inputPath);
    Tap sinkTap = outputPath.matches("^[^:]+://.*")
        ? new Hfs(outputScheme, outputPath)
        : new Lfs(outputScheme, outputPath);

    Pipe wcPipe = new Each("wordcount",
        new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"),
        new Fields("word"));
    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);
    Flow parsedLogFlow = new FlowConnector(properties)
        .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();
  }
}

Pipes & Filters: very powerful! Joins & CoGroups. But not very intuitive, lots of boilerplate code, and the record model is a strange new abstraction.

Slide 12

Meh... I’m lazy. I want more power with less work!

Slide 13

How would we count words in plain Scala? (My current language of choice)

Slide 14

Nice! Familiar, intuitive. What if...?
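The plain-Scala version the slide refers to presumably looks something like this (a sketch using standard collections; the function name is mine):

```scala
// Word count over in-memory lines using plain Scala collections.
def wordCount(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.split("""\s+""").filter(_.nonEmpty))           // split lines into words
    .map(word => (word, 1))                                   // turn each word into a Pair(word, 1)
    .groupBy(_._1)                                            // group by word
    .map { case (word, ones) => (word, ones.map(_._2).sum) }  // sum the 1s per word
```

The same four steps as the MapReduce diagram, but as ordinary higher-order functions.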

Slide 15

But that code doesn’t scale to my cluster! Or does it? Meanwhile at Google...

Slide 16

Introducing Scoobi & Scalding: Scala DSLs for Hadoop MapReduce

NOTE: my relative familiarity with either platform: Scoobi 95%, Scalding 5%

Slide 17

http://github.com/nicta/scoobi

A Scala library that implements a higher-level programming model for Hadoop MapReduce

Slide 18

Counting Words using Scoobi
1. Split lines into words
2. Turn each word into a Pair(word, 1)
3. Group by word
4. For each word, sum the 1s to get the total
Actually runs the code on the cluster
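The slide's annotations suggest a pipeline along these lines (a sketch against the Scoobi 0.4-era API; treat exact signatures as approximate):

```scala
import com.nicta.scoobi.Scoobi._

// Word count as a Scoobi DList pipeline.
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val counts: DList[(String, Int)] =
      lines.flatMap(_.split("""\s+"""))   // split lines into words
           .map(word => (word, 1))        // turn each word into a Pair(word, 1)
           .groupByKey                    // group by word
           .combine((a: Int, b: Int) => a + b)  // sum the 1s per word

    // Persisting triggers compilation into MapReduce jobs and runs them.
    persist(toTextFile(counts, args(1)))
  }
}
```

Until persist is called, nothing runs; the calls only build a job graph.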

Slide 19

Scoobi is...
• A distributed collections abstraction:
  • Distributed collection objects abstract data in HDFS
  • Methods on these objects abstract map/reduce operations
  • Programs manipulate distributed collection objects
  • Scoobi turns these manipulations into MapReduce jobs
• Based on Google’s FlumeJava / Cascades
• A source code generator (it generates Java code!)
• A job plan optimizer
• Open sourced by NICTA
• Written in Scala (W00t!)

Slide 20

DList[T]
• Abstracts storage of data and files on HDFS
• Calling methods on DList objects to transform and manipulate them abstracts the mapper, combiner, sort-and-shuffle, and reducer phases of MapReduce
• Persisting a DList triggers compilation of the graph into one or more MapReduce jobs and their execution
• Very familiar: like standard Scala Lists
• Strongly typed
• Parameterized with rich types and Tuples
• Easy list manipulation using typical higher-order functions like map, flatMap, filter, etc.

Slide 21

DList[T]

Slide 22

IO / Serialization
• Can read/write text files, Sequence files, and Avro files
• Can influence sorting (raw, secondary)
• Serialization of custom types through Scala type classes and WireFormat[T]
• Scoobi implements WireFormat[T] for primitive types, strings, tuples, Option[T], Either[T], Iterable[T], etc.
• Out-of-the-box support for serialization of Scala case classes
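For a custom type you supply a WireFormat[T] type-class instance; for case classes Scoobi can build one from the apply/unapply pair. A sketch based on Scoobi's documented mkCaseWireFormat helper (the PageView type is a made-up example, and the exact signature may vary by version):

```scala
import com.nicta.scoobi.Scoobi._

// A custom record type to be used inside a DList.
case class PageView(url: String, visits: Int)

object PageViewFormats {
  // Derives a WireFormat from the case class's apply/unapply pair,
  // so PageView values can move between mappers and reducers
  // without hand-written serialization code.
  implicit val pageViewFormat: WireFormat[PageView] =
    mkCaseWireFormat(PageView, PageView.unapply _)
}
```

With the implicit in scope, DList[PageView] works like any built-in element type.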

Slide 23

IO/Serialization I

Slide 24

IO/Serialization II For normal (i.e. non-case) classes

Slide 25

Further Info
http://nicta.github.com/scoobi/
[email protected]
[email protected]

Version 0.4 released today (!):
• Avro, Sequence Files
• Materialized DObjects
• DList reduction methods (product, min, etc.)
• Vastly improved testing support
• Less overhead
• Much more

Slide 26

Scalding!
http://github.com/twitter/scalding

A Scala library that implements a higher-level programming model for Cascading (and thus Hadoop MapReduce)

Slide 27

Counting Words using Scalding
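The code on this slide is presumably close to Scalding's canonical field-based word count from its README of that era (a sketch; treat the exact API as approximate):

```scala
import com.twitter.scalding._

// Word count using Scalding's field-based (named-fields) API.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                       // each record has a 'line field
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") } // split lines into words
    .groupBy('word) { _.size }                                  // group by word and count
    .write(Tsv(args("output")))                                 // write (word, count) pairs
}
```

Note how the Cascading record model shows through: fields are addressed by symbolic names ('line, 'word) rather than by static types.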

Slide 28

Scalding is...
• A distributed collections abstraction
• A wrapper around Cascading (i.e. no source code generation)
• Based on the same record model (i.e. named fields)
• Less strongly typed
• Uses Kryo serialization
• Used by Twitter in production
• Written in Scala (W00t!)

Slide 29

Further Info
http://github.com/twitter/scalding
https://github.com/twitter/scalding/wiki
@scalding
[email protected]
http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/

Current version: 0.5.4

Slide 30

How do they compare?
• Different approaches, similar power
• Small feature differences, which will even out over time
• Scoobi gets a little closer to idiomatic Scala
• Both open sourced (last year)
• Scoobi has better docs!
• Twitter is definitely a bigger fish than NICTA, so Scalding gets all the attention

Slide 31

Which one should I use? Ehm... ...I’m extremely prejudiced!

Slide 32

Questions?