
Scalding: MapReduce made easy

Michael Rose
September 20, 2013

Basic intro to Scalding given for internal FullContact tech talks.

Transcript

1. Cascading
   • Cascading is an application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop.
   • http://docs.cascading.org/impatient/
2. Cascading
   • Based on the idea of data flows.
   • You route the data; Cascading decides how to process it.
3. Sample data for the word count flow (a single map/reduce job):

   doc_id  text
   doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
   doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
   doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
   doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
   doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
4. Cascading tracks data dependencies and automatically generates MapReduce jobs from modest amounts of code. Not only for simple jobs...
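   For flavor, a minimal sketch of the kind of word-count pipe assembly the "Impatient" tutorial builds (Cascading's Java API, called here from Scala; the field names and the split regex are assumptions, not the tutorial's exact code):

   import cascading.operation.aggregator.Count
   import cascading.operation.regex.RegexSplitGenerator
   import cascading.pipe.{Each, Every, GroupBy, Pipe}
   import cascading.tuple.Fields

   // Split each document's "text" field into one "token" tuple per word.
   val docPipe = new Each(new Pipe("wordcount"), new Fields("text"),
     new RegexSplitGenerator(new Fields("token"), "[ \\[\\]\\(\\),.]"))

   // Group by token and count occurrences. Cascading plans the map and
   // reduce phases from this dependency graph; you never write a Mapper
   // or Reducer class yourself.
   val wcPipe = new Every(new GroupBy(docPipe, new Fields("token")),
     Fields.ALL, new Count(), Fields.ALL)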
5. Scalding
   • Scala DSL for Cascading.
   • The two are very tightly linked.
   • More than half of the questions on cascading-users are about Scalding.
   • I think Cascading is kind of ugly.
   • “It's concise and functional, and it's just good, clean fun.”
6. Scalding
   • Has two DSLs: the type-safe API and the Fields API.
   • We will be using the Fields API for all of these examples (a type-safe sketch follows below for contrast).
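   For contrast, a minimal sketch of the same word count in the type-safe API, along the lines of Scalding's own README example (not from the original deck; the class name is an assumption):

   import com.twitter.scalding._

   class TypedWordCountJob(args: Args) extends Job(args) {
     // Same word count, but with compile-time checked tuples
     // instead of field symbols.
     TypedPipe.from(TextLine(args("input")))
       .flatMap { line => line.toLowerCase.split("\\s+").filter(_.nonEmpty) }
       .groupBy(identity)
       .size
       .write(TypedTsv[(String, Long)](args("output")))
   }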
7. Word count in the Fields API:

   class WordCountJob(args: Args) extends Job(args) {
     TextLine(args("input"))
       .flatMap('line -> 'word) { line: String => tokenize(line) }
       .groupBy('word) { _.size }
       .write(Tsv(args("output")))

     // Split a piece of text into individual words.
     def tokenize(text: String): Array[String] = {
       // Lowercase each word and remove punctuation.
       text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
     }
   }
8. A real job: LinkedInScrapeValidationSetJob.

   class LinkedInScrapeValidationSetJob(args: Args) extends Job(args) {

     // On Rome, you might need to set this mapred.child.tmp value.
     override def config(implicit mode: Mode) = {
       super.config(mode) ++ Map(
         "mapred.child.tmp" -> "./tmp"
       )
     }

     // In Cascading, a SequenceFile != Hadoop SequenceFile.
     // WritableSequenceFile is what you want.
     val searchResults = WritableSequenceFile[org.apache.hadoop.io.Text, SearchResultWritable](
       args("input"), new Fields("key", "value")
     )

     // Sometimes you need to discard bad data: add a trap. Can be /dev/null.
     val identibaseDump = Tsv(args("identibaseDump"), new Fields("email", "account"))
       .addTrap(Tsv("hdfs://scratch/tmp/wtf")) // data with newlines needs to go away

     // Extract URLs (this is what processes the SequenceFile records).
     val searchResultUrls = searchResults
       .map('value -> 'url) { line: SearchResultWritable => line.getResult.get_url() }

     // Process only LinkedIn URLs, get non-empty pub suffixes.
     val searchResultLiSuffixes = searchResultUrls
       .filter('url) { url: String => url.contains("linkedin") }
       .map('url -> 'liUrl) { url: String => liUrlSuffix(url) }
       .filter('liUrl) { line: String => !line.isEmpty }

     // Extract pub suffixes for the join from the Identibase dump.
     val identibasePairs = identibaseDump
       .map('account -> 'urlSuffix) { account: String => accountToLiUrlSuffix(account) }
       .filter('urlSuffix) { line: String => !line.isEmpty }

     // Join it.
     val joined = searchResultLiSuffixes
       .joinWithSmaller('liUrl -> 'urlSuffix, identibasePairs)

     joined
       .project(new Fields("email", "liUrl"))
       .write(Tsv(args("output")))

     // Puburl extractors.
     val urlPattern = "([0-9a-f]+)/([0-9a-f]+/[0-9a-f]+)".r
     def accountToLiUrlSuffix(s: String): String = urlPattern.findFirstIn(s).getOrElse("")

     def liUrlSuffix(s: String): String = s.split("/").takeRight(3).mkString("/")
   }

   Notes:
   • This particular job joined some of Dan's old LinkedIn data with edges from Identibase.
   • Important: when doing joins, it's VERY important to know the relative sizes of your data sets, or your joins will be stupid slow. That's because it's a HashJoin, to avoid the full NxM cost of a join.
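   To make the sizing note concrete, a hedged sketch of the Fields-API join variants (the job, pipes, and field names are all hypothetical):

   import com.twitter.scalding._

   class JoinSizingJob(args: Args) extends Job(args) {
     val purchases = Tsv(args("purchases"), ('userId, 'amount)) // the big stream
     val users = Tsv(args("users"), ('uid, 'email))             // the smaller stream

     purchases
       // joinWithSmaller: the right side is the smaller stream (the common case).
       // joinWithLarger: the right side is the bigger stream.
       // joinWithTiny: the right side fits in memory, giving a map-side
       //   hash join with no reduce phase.
       .joinWithSmaller('userId -> 'uid, users)
       .project('email, 'amount)
       .write(Tsv(args("output")))
   }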
9. Get familiar. A lot of large-scale problems quickly become a lot easier to solve.
10. I use Scalding in local mode for data processing. It's hard to beat the TSV/CSV stuff.
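   A sketch of the sort of local-mode cleanup job I mean (the job name, file layout, and fields are hypothetical); run it with scald.rb --local as shown at the end of the deck:

   import com.twitter.scalding._

   // Hypothetical job: keep only rows with a plausible email address.
   class CleanContactsJob(args: Args) extends Job(args) {
     Csv(args("input"), fields = ('name, 'email))
       .filter('email) { e: String => e.contains("@") }
       .write(Tsv(args("output")))
   }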
11. A Scalding job effortlessly finds that Douglas County is the fastest-growing county in the country. Not far behind: Summit County. What a surprise... Data democratization: https://gist.github.com/krishnanraman/4696053
12. I'd encourage you to take a look at this tutorial and run the examples. It's what convinced me, and I put it in production soon after. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
13. Read data & basic aggregation:

   /**
    * The input is a TSV file with three columns: (user, movie, rating).
    */
   val INPUT_FILENAME = "data/ratings.tsv"

   /**
    * Read in the input and give each field a type and name.
    */
   val ratings = Tsv(INPUT_FILENAME, ('user, 'movie, 'rating))

   /**
    * Let's also keep track of the total number of people who rated each movie.
    */
   val numRaters = ratings
     // Put the number of people who rated each movie into a field called "numRaters".
     .groupBy('movie) { _.size }
     .rename('size -> 'numRaters)

   // Merge `ratings` with `numRaters`, by joining on their movie fields.
   val ratingsWithSize = ratings.joinWithSmaller('movie -> 'movie, numRaters)

   // ratingsWithSize = (user, movie, rating, numRaters)
14. Joining the data with itself (small -> big):

   /**
    * To get all pairs of co-rated movies, we'll join `ratings` against itself.
    * So first make a dummy copy of the ratings that we can join against.
    */
   val ratings2 = ratingsWithSize
     .rename(('user, 'movie, 'rating, 'numRaters) -> ('user2, 'movie2, 'rating2, 'numRaters2))

   /**
    * Now find all pairs of co-rated movies (pairs of movies that a user has rated) by
    * joining the duplicate rating streams on their user fields.
    */
   val ratingPairs = ratingsWithSize
     .joinWithSmaller('user -> 'user2, ratings2)
     // De-dupe so that we don't calculate similarity of both (A, B) and (B, A).
     .filter('movie, 'movie2) { movies: (String, String) => movies._1 < movies._2 }
     .project('movie, 'rating, 'numRaters, 'movie2, 'rating2, 'numRaters2)

   // By grouping on ('movie, 'movie2), we can now get all the people who rated any pair of movies.
   // ratingPairs = (movie, rating, numRaters, movie2, rating2, numRaters2)
15. Setup for similarity metrics:

   /**
    * Compute dot products, norms, sums, and sizes of the rating vectors.
    */
   val vectorCalcs = ratingPairs
     // Compute (x*y, x^2, y^2), which we need for dot products and norms.
     .map(('rating, 'rating2) -> ('ratingProd, 'ratingSq, 'rating2Sq)) {
       ratings: (Double, Double) =>
         (ratings._1 * ratings._2, math.pow(ratings._1, 2), math.pow(ratings._2, 2))
     }
     .groupBy('movie, 'movie2) { group =>
       group.size // length of each vector
         .sum('ratingProd -> 'dotProduct)
         .sum('rating -> 'ratingSum)
         .sum('rating2 -> 'rating2Sum)
         .sum('ratingSq -> 'ratingNormSq)
         .sum('rating2Sq -> 'rating2NormSq)
         .max('numRaters)  // Just an easy way to make sure the numRaters field stays.
         .max('numRaters2) // Confusing hack, because we already aggregated this earlier.
       // All of these operations chain together like in a builder object.
     }
16. Similarity metrics:

   val PRIOR_COUNT = 10
   val PRIOR_CORRELATION = 0

   val similarities = vectorCalcs
     .map(('size, 'dotProduct, 'ratingSum, 'rating2Sum, 'ratingNormSq, 'rating2NormSq, 'numRaters, 'numRaters2) ->
          ('correlation, 'regularizedCorrelation, 'cosineSimilarity, 'jaccardSimilarity)) {

       fields: (Double, Double, Double, Double, Double, Double, Double, Double) =>

       val (size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq, numRaters, numRaters2) = fields

       val corr = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq)
       val regCorr = regularizedCorrelation(size, dotProduct, ratingSum, rating2Sum,
         ratingNormSq, rating2NormSq, PRIOR_COUNT, PRIOR_CORRELATION)
       val cosSim = cosineSimilarity(dotProduct, math.sqrt(ratingNormSq), math.sqrt(rating2NormSq))
       val jaccard = jaccardSimilarity(size, numRaters, numRaters2)

       (corr, regCorr, cosSim, jaccard)
     }
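   The deck never shows the similarity helpers themselves; these sketches follow the definitions in the Edwin Chen tutorial linked earlier (treat the exact signatures as assumptions, not the deck's code):

   // Pearson correlation, computed from the running sums.
   def correlation(size: Double, dotProduct: Double, ratingSum: Double,
                   rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double): Double = {
     val numerator = size * dotProduct - ratingSum * rating2Sum
     val denominator = math.sqrt(size * ratingNormSq - ratingSum * ratingSum) *
                       math.sqrt(size * rating2NormSq - rating2Sum * rating2Sum)
     numerator / denominator
   }

   // Shrink the raw correlation toward the prior when the sample is small.
   def regularizedCorrelation(size: Double, dotProduct: Double, ratingSum: Double,
                              rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double,
                              virtualCount: Double, priorCorrelation: Double): Double = {
     val unregularized = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq)
     val w = size / (size + virtualCount)
     w * unregularized + (1 - w) * priorCorrelation
   }

   // Cosine of the angle between the two rating vectors.
   def cosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double): Double =
     dotProduct / (ratingNorm * rating2Norm)

   // |intersection| / |union| of the two movies' rater sets.
   def jaccardSimilarity(usersInCommon: Double, totalUsers1: Double, totalUsers2: Double): Double = {
     val union = totalUsers1 + totalUsers2 - usersInCommon
     usersInCommon / union
   }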
17. Write results (writeHeader defaults to false):

   /**
    * Output all similarities to a TSV file.
    */
   similarities
     .project('item, 'item2, 'correlation, 'regularizedCorrelation, 'cosineSimilarity,
              'jaccardSimilarity, 'size, 'numRaters, 'numRaters2)
     .write(Tsv(args("output"), writeHeader = true))
18. groupRandomly -> send to n random reducers
    shuffle -> send to n random reducers
    partition -> separates on some predicate, then applies grouping functions over each resulting partition
    reduce -> applies an associative aggregation over a group (e.g. summation); this happens in the mapper
    fold -> a more fundamental reduce: it can run any function over a group and return any type; runs in the reduce phase
    approxUniques -> faster uniques, with an error percentage
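    A hedged sketch of a few of these in a groupBy block (the job and field names are hypothetical, and approxUniques' exact signature may differ by Scalding version):

    import com.twitter.scalding._

    // Hypothetical job assuming a TSV of (user, item, amount).
    class GroupOpsJob(args: Args) extends Job(args) {
      Tsv(args("input"), ('user, 'item, 'amount))
        .groupBy('user) { group =>
          group
            // Associative aggregation; can be partially applied map-side.
            .reduce('amount -> 'totalAmount) { (a: Double, b: Double) => a + b }
            // Arbitrary function and result type; forces the reduce phase.
            .foldLeft('item -> 'items)(List[String]()) { (acc: List[String], i: String) => i :: acc }
            // Estimated distinct count, traded against a small error percentage.
            .approxUniques('item -> 'uniqueItems)
        }
        .write(Tsv(args("output")))
    }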
19. I started with the following example project. It's a nice start, with instructions for running on EMR too:
    https://github.com/snowplow/scalding-example-project

    On Hadoop:
    hadoop jar scalding-li-0.0.1.jar com.fullcontact.hadoop.scalding.ScrapeToTsv --hdfs --input hdfs://scratch/user/xorlev/search_results --output hdfs://scratch/user/xorlev/urls-tsv/scrape.tsv

    Local mode (what I used to run it on Rome):
    ~/Code/oss/scalding/scripts/scald.rb --local ~/Code/test-scald/src/main/scala/test/analytics/ThreadSimilarityJob.scala --input /tmp/data_original.csv --output /tmp/out.csv
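    For context, that example project wires up the `hadoop jar` dispatch with a small runner main class along these lines (a sketch, not their exact code):

    import com.twitter.scalding.Tool
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.util.ToolRunner

    // The jar's manifest points at this object; the first program argument
    // is the Job class to run (e.g. com.fullcontact.hadoop.scalding.ScrapeToTsv),
    // followed by --hdfs or --local and the job's own --input/--output args.
    object JobRunner {
      def main(args: Array[String]): Unit =
        ToolRunner.run(new Configuration, new Tool, args)
    }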