Slide 1

Scalding
Not just burning yourself: building super simple MapReduce jobs

Slide 2

It all starts with Cascading

Slide 3

Cascading
• Cascading is an application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop.
• http://docs.cascading.org/impatient/

Slide 4

Cascading
• Based on the idea of data-flows.
• You route the data; it decides how to process it.

Slide 5


Slide 6

Sample Data: Word Count Flow (a single Map/Reduce job)

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]

Slide 7

Graphviz output for debugging justice

Slide 8

.dot output

Slide 9

Cascading tracks data dependencies and automatically generates MapReduce jobs from modest amounts of code. Not only for simple jobs...

Slide 10

I thought this was about Scalding?

Slide 11

Scalding
• Scala DSL for Cascading
• The two are very tightly linked.
• More than half of the questions on cascading-users are about Scalding.
• I think Cascading is kind of ugly.
• “It's concise and functional, and it's just good, clean fun.”

Slide 12

Scalding
• Has two DSLs: the Type Safe API and the Fields API.
• We will be using the Fields API for all these examples (the Type Safe version is sketched below for contrast).
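
For contrast, here is a hedged sketch of roughly what the word count on the next slide looks like in the Type Safe API. It follows the canonical Scalding typed example and assumes a Scalding version that ships TypedPipe/TypedTsv:

import com.twitter.scalding._

class TypedWordCountJob(args: Args) extends Job(args) {
  // Read lines, split into lowercase words, and count occurrences of each word.
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line: String => line.toLowerCase.split("\\s+") }
    .groupBy(identity)
    .size
    .write(TypedTsv[(String, Long)](args("output")))
}

The compiler checks the (String, Long) tuple types end to end, rather than relying on the runtime field names the Fields API uses.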

Slide 13

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
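
For a rough sense of the result: run over the sample data from slide 6, the output TSV has one (word, count) row per distinct word. Computed by hand from that sample, it would include lines like:

rain    5
shadow  4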

Slide 15

Time for a real example
It might be readable

Slide 16

class LinkedInScrapeValidationSetJob(args : Args) extends Job(args) {

  override def config(implicit mode: Mode) = {
    super.config(mode) ++ Map(
      "mapred.child.tmp" -> "./tmp"
    )
  }

  val searchResults = WritableSequenceFile[org.apache.hadoop.io.Text, SearchResultWritable]( args("input"), new Fields("key", "value") )
  val identibaseDump = Tsv(args("identibaseDump"), new Fields("email", "account"))
    .addTrap(Tsv("hdfs://scratch/tmp/wtf")) // data with newlines needs to go away

  // Extract URLs
  val searchResultUrls = searchResults
    .map('value -> 'url) { line : SearchResultWritable => line.getResult.get_url() }

  // Process only linkedin urls, get non-empty pub suffixes
  val searchResultLiSuffixes = searchResultUrls
    .filter('url) { url: String => url.contains("linkedin") }
    .map('url -> 'liUrl) { url: String => liUrlSuffix(url) }
    .filter('liUrl) { line: String => !line.isEmpty }

  // Extract pub suffixes for join from Identibase dump
  val identibasePairs = identibaseDump
    .map('account -> 'urlSuffix) { account : String => accountToLiUrlSuffix(account) }
    .filter('urlSuffix) { line: String => !line.isEmpty }

  // Join it
  val joined = searchResultLiSuffixes
    .joinWithSmaller('liUrl -> 'urlSuffix, identibasePairs)

  joined
    .project(new Fields("email", "liUrl"))
    .write(Tsv(args("output")))

  // Puburl extractors
  val urlPattern = "([0-9a-f]+)/([0-9a-f]+/[0-9a-f]+)".r
  def accountToLiUrlSuffix(s: String): String = urlPattern.findFirstIn(s).getOrElse("")

  def liUrlSuffix(s: String): String = s.split("/").takeRight(3).mkString("/")
}

• This particular job joined some of Dan's old LinkedIn data with edges from Identibase.
• Sometimes you need to discard bad data: add a trap. It can be /dev/null.
• In Cascading, a SequenceFile != Hadoop SequenceFile. WritableSequenceFile is what you want; it's what processes the SequenceFile input here.
• Important: when doing joins, it's VERY important to know the relative sizes of your data sets, or your joins will be stupid slow. That's because it's a HashJoin to avoid the full NxM cost of a join (see the sketch below).
• On Rome, you might need to set this mapred.child.tmp value.
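
On that join-sizing point, the Fields API encodes the relative-size hint in the method name itself. A hedged sketch (joinWithSmaller, joinWithLarger, and joinWithTiny are real Scalding methods; the pipes and fields here are hypothetical):

// bigPipe, smallPipe, tinyPipe and the 'key/'key2 fields are hypothetical.
val a = bigPipe.joinWithSmaller('key -> 'key2, smallPipe) // right-hand side is the smaller data set
val b = smallPipe.joinWithLarger('key -> 'key2, bigPipe)  // right-hand side is the larger data set
val c = bigPipe.joinWithTiny('key -> 'key2, tinyPipe)     // right-hand side fits in memory: map-side join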

Slide 17

Get familiar. A lot of large-scale problems quickly become a lot easier to solve.

Slide 18

I use Scalding in local mode for data processing. It's hard to beat the TSV/CSV stuff.
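
A minimal local-mode sketch of the kind of TSV/CSV munging meant here. The job, field names, and file paths are hypothetical; Csv, Tsv, and groupBy are the real Fields-API calls:

import com.twitter.scalding._

// Count rows per city in a small CSV file, entirely in local mode.
class CityCountJob(args: Args) extends Job(args) {
  Csv(args("input"), ",", ('city, 'population))
    .groupBy('city) { _.size }   // adds a 'size field with the row count per city
    .write(Tsv(args("output")))
}

Run it without a cluster via scald.rb --local CityCountJob.scala --input cities.csv --output counts.tsv, the same pattern as the scald.rb invocation on slide 32.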

Slide 19

Data democratization
A Scalding job effortlessly finds that Douglas County is the fastest-growing county in the country. Not far behind: Summit County. What a surprise...
https://gist.github.com/krishnanraman/4696053

Slide 20

I’d encourage you to take a look at this tutorial and run the examples. It's what convinced me, and I put it in production soon after.
http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/

Slide 21

Movie Recommendations
Tutorial from the last slide

Slide 22

Read data & basic aggregation

/**
 * The input is a TSV file with three columns: (user, movie, rating).
 */
val INPUT_FILENAME = "data/ratings.tsv"

/**
 * Read in the input and give each field a type and name.
 */
val ratings = Tsv(INPUT_FILENAME, ('user, 'movie, 'rating))

/**
 * Let's also keep track of the total number of people who rated each movie.
 */
val numRaters = ratings
  // Put the number of people who rated each movie into a field called "numRaters".
  .groupBy('movie) { _.size }.rename('size -> 'numRaters)

// Merge `ratings` with `numRaters`, by joining on their movie fields.
val ratingsWithSize = ratings.joinWithSmaller('movie -> 'movie, numRaters)

// ratingsWithSize = (user, movie, rating, numRaters)

Slide 23

Joining the data with itself (small -> big)

/**
 * To get all pairs of co-rated movies, we'll join `ratings` against itself.
 * So first make a dummy copy of the ratings that we can join against.
 */
val ratings2 = ratingsWithSize
  .rename(('user, 'movie, 'rating, 'numRaters) -> ('user2, 'movie2, 'rating2, 'numRaters2))

/**
 * Now find all pairs of co-rated movies (pairs of movies that a user has rated) by
 * joining the duplicate rating streams on their user fields.
 */
val ratingPairs = ratingsWithSize
  .joinWithSmaller('user -> 'user2, ratings2)
  // De-dupe so that we don't calculate similarity of both (A, B) and (B, A).
  .filter('movie, 'movie2) { movies : (String, String) => movies._1 < movies._2 }
  .project('movie, 'rating, 'numRaters, 'movie2, 'rating2, 'numRaters2)

// By grouping on ('movie, 'movie2), we can now get all the people who rated any pair of movies.
// ratingPairs = (movie, rating, numRaters, movie2, rating2, numRaters2)

Slide 24

Setup for Similarity Metrics

/**
 * Compute dot products, norms, sums, and sizes of the rating vectors.
 */
val vectorCalcs = ratingPairs
  // Compute (x*y, x^2, y^2), which we need for dot products and norms.
  .map(('rating, 'rating2) -> ('ratingProd, 'ratingSq, 'rating2Sq)) {
    ratings : (Double, Double) =>
      (ratings._1 * ratings._2, math.pow(ratings._1, 2), math.pow(ratings._2, 2))
  }
  .groupBy('movie, 'movie2) { group =>
    group.size // length of each vector
      .sum('ratingProd -> 'dotProduct)
      .sum('rating -> 'ratingSum)
      .sum('rating2 -> 'rating2Sum)
      .sum('ratingSq -> 'ratingNormSq)
      .sum('rating2Sq -> 'rating2NormSq)
      .max('numRaters)  // Just an easy way to make sure the numRaters field stays.
      .max('numRaters2) // Confusing hack, because we already aggregated this earlier.
    // All of these operations chain together like in a builder object.
  }

Slide 25

Similarity Metrics

val PRIOR_COUNT = 10
val PRIOR_CORRELATION = 0

val similarities = vectorCalcs
  .map(('size, 'dotProduct, 'ratingSum, 'rating2Sum, 'ratingNormSq, 'rating2NormSq, 'numRaters, 'numRaters2) ->
       ('correlation, 'regularizedCorrelation, 'cosineSimilarity, 'jaccardSimilarity)) {

    fields : (Double, Double, Double, Double, Double, Double, Double, Double) =>

    val (size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq, numRaters, numRaters2) = fields

    val corr = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq)
    val regCorr = regularizedCorrelation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq, PRIOR_COUNT, PRIOR_CORRELATION)
    val cosSim = cosineSimilarity(dotProduct, math.sqrt(ratingNormSq), math.sqrt(rating2NormSq))
    val jaccard = jaccardSimilarity(size, numRaters, numRaters2)

    (corr, regCorr, cosSim, jaccard)
  }
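
The helpers correlation, regularizedCorrelation, cosineSimilarity, and jaccardSimilarity are defined elsewhere in the tutorial. A sketch of the standard formulas they compute, reconstructed rather than copied from the slide:

// Pearson correlation from the precomputed sums.
def correlation(size: Double, dotProduct: Double, ratingSum: Double,
                rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double): Double = {
  val numerator = size * dotProduct - ratingSum * rating2Sum
  val denominator = math.sqrt(size * ratingNormSq - ratingSum * ratingSum) *
                    math.sqrt(size * rating2NormSq - rating2Sum * rating2Sum)
  numerator / denominator
}

// Shrink the raw correlation towards a prior, weighted by the sample size.
def regularizedCorrelation(size: Double, dotProduct: Double, ratingSum: Double,
                           rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double,
                           virtualCount: Double, priorCorrelation: Double): Double = {
  val unregularized = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq)
  val w = size / (size + virtualCount)
  w * unregularized + (1 - w) * priorCorrelation
}

def cosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double): Double =
  dotProduct / (ratingNorm * rating2Norm)

def jaccardSimilarity(usersInCommon: Double, totalUsers1: Double, totalUsers2: Double): Double =
  usersInCommon / (totalUsers1 + totalUsers2 - usersInCommon)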

Slide 26

Write Results

/**
 * Output all similarities to a TSV file.
 */
similarities
  .project('item, 'item2, 'correlation, 'regularizedCorrelation, 'cosineSimilarity, 'jaccardSimilarity, 'size, 'numRaters, 'numRaters2)
  .write(Tsv(args("output"), writeHeader = true)) // writeHeader defaults to false

Slide 27

Other cool functions
And gotchas

Slide 28


Slide 29

groupRandomly -> send to n random reducers
shuffle -> send to n random reducers
partition -> separates on some predicate and then applies grouping functions over the resulting partition
reduce -> applies an associative aggregation over a group (e.g. summation). This happens in the mapper.
fold -> a more fundamental reduce; it can run any function over any group and return any type. Runs in the reduce phase.
approxUniques -> faster uniques with an error percentage
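
A hedged sketch of the reduce-vs-fold distinction on a Fields-API group (GroupBuilder.reduce and GroupBuilder.foldLeft are real methods; the pipe and field names are hypothetical):

// Associative sum: Scalding can run this map-side as a combiner.
val totals = pipe.groupBy('key) { _.reduce('value -> 'total) { (a: Int, b: Int) => a + b } }

// Arbitrary fold into a different type (a Set here): runs only in the reduce phase.
val sets = pipe.groupBy('key) {
  _.foldLeft('value -> 'valueSet)(Set.empty[String]) { (acc: Set[String], v: String) => acc + v }
}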

Slide 30

Joins
• Run as inner joins by default.
• This is configurable per-join (see the sketch below).
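
A hedged sketch of overriding the joiner per-call (the joiner parameter and the cascading.pipe.joiner classes are real; users and events are hypothetical pipes):

import cascading.pipe.joiner.{LeftJoin, OuterJoin}

val leftJoined  = users.joinWithSmaller('id -> 'userId, events, joiner = new LeftJoin)
val outerJoined = users.joinWithSmaller('id -> 'userId, events, joiner = new OuterJoin)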

Slide 31

Running yo' jobs
Not just a pretty face (job)

Slide 32

I started with the following example project:
https://github.com/snowplow/scalding-example-project
It's a nice start. Instructions for running on EMR too.

Running on Hadoop:
hadoop jar scalding-li-0.0.1.jar com.fullcontact.hadoop.scalding.ScrapeToTsv --hdfs --input hdfs://scratch/user/xorlev/search_results --output hdfs://scratch/user/xorlev/urls-tsv/scrape.tsv

Local mode. To run it on Rome I used:
~/Code/oss/scalding/scripts/scald.rb --local ~/Code/test-scald/src/main/scala/test/analytics/ThreadSimilarityJob.scala --input /tmp/data_original.csv --output /tmp/out.csv