
Scalding: MapReduce made easy

Michael Rose
September 20, 2013

Basic intro to Scalding given for internal FullContact tech talks.

Transcript

1. Cascading
   • Cascading is an application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop.
   • http://docs.cascading.org/impatient/
2. Cascading
   • Based on the idea of data flows.
   • You route the data; Cascading decides how to process it.
3. Sample data for the word count flow (a single map/reduce job):

   doc_id  text
   doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
   doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
   doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
   doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
   doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
4. Cascading tracks data dependencies and automatically generates MapReduce jobs from modest amounts of code. Not only for simple jobs...
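   For flavor, a minimal sketch of the kind of word-count pipe assembly the "Impatient" tutorial builds (Cascading's Java API, called here from Scala; the field names and the split regex are assumptions, not the tutorial's exact code):

   import cascading.operation.aggregator.Count
   import cascading.operation.regex.RegexSplitGenerator
   import cascading.pipe.{Each, Every, GroupBy, Pipe}
   import cascading.tuple.Fields

   // Split each document's "text" field into one "token" tuple per word.
   val docPipe = new Each(new Pipe("wordcount"), new Fields("text"),
     new RegexSplitGenerator(new Fields("token"), "[ \\[\\]\\(\\),.]"))

   // Group by token and count occurrences. Cascading plans the map and
   // reduce phases from this dependency graph; you never write a Mapper
   // or Reducer class yourself.
   val wcPipe = new Every(new GroupBy(docPipe, new Fields("token")),
     Fields.ALL, new Count(), Fields.ALL)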
5. Scalding
   • Scala DSL for Cascading.
   • The two are very tightly linked.
   • More than half of the questions on cascading-users are about Scalding.
   • I think Cascading is kind of ugly.
   • “It's concise and functional, and it's just good, clean fun.”
6. Scalding
   • Has two DSLs: the type-safe API and the Fields API.
   • We will be using the Fields API for all of these examples (a type-safe sketch follows below for contrast).
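   For contrast, a minimal sketch of the same word count in the type-safe API, along the lines of Scalding's own README example (not from the original deck; the class name is an assumption):

   import com.twitter.scalding._

   class TypedWordCountJob(args: Args) extends Job(args) {
     // Same word count, but with compile-time checked tuples
     // instead of field symbols.
     TypedPipe.from(TextLine(args("input")))
       .flatMap { line => line.toLowerCase.split("\\s+").filter(_.nonEmpty) }
       .groupBy(identity)
       .size
       .write(TypedTsv[(String, Long)](args("output")))
   }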
7. Word count in the Fields API:

   class WordCountJob(args: Args) extends Job(args) {
     TextLine(args("input"))
       .flatMap('line -> 'word) { line: String => tokenize(line) }
       .groupBy('word) { _.size }
       .write(Tsv(args("output")))

     // Split a piece of text into individual words.
     def tokenize(text: String): Array[String] = {
       // Lowercase each word and remove punctuation.
       text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
     }
   }
8. A real job: LinkedInScrapeValidationSetJob.

   class LinkedInScrapeValidationSetJob(args: Args) extends Job(args) {

     // On Rome, you might need to set this mapred.child.tmp value.
     override def config(implicit mode: Mode) = {
       super.config(mode) ++ Map(
         "mapred.child.tmp" -> "./tmp"
       )
     }

     // In Cascading, a SequenceFile != Hadoop SequenceFile.
     // WritableSequenceFile is what you want.
     val searchResults = WritableSequenceFile[org.apache.hadoop.io.Text, SearchResultWritable](
       args("input"), new Fields("key", "value")
     )

     // Sometimes you need to discard bad data: add a trap. Can be /dev/null.
     val identibaseDump = Tsv(args("identibaseDump"), new Fields("email", "account"))
       .addTrap(Tsv("hdfs://scratch/tmp/wtf")) // data with newlines needs to go away

     // Extract URLs (this is what processes the SequenceFile records).
     val searchResultUrls = searchResults
       .map('value -> 'url) { line: SearchResultWritable => line.getResult.get_url() }

     // Process only LinkedIn URLs, get non-empty pub suffixes.
     val searchResultLiSuffixes = searchResultUrls
       .filter('url) { url: String => url.contains("linkedin") }
       .map('url -> 'liUrl) { url: String => liUrlSuffix(url) }
       .filter('liUrl) { line: String => !line.isEmpty }

     // Extract pub suffixes for the join from the Identibase dump.
     val identibasePairs = identibaseDump
       .map('account -> 'urlSuffix) { account: String => accountToLiUrlSuffix(account) }
       .filter('urlSuffix) { line: String => !line.isEmpty }

     // Join it.
     val joined = searchResultLiSuffixes
       .joinWithSmaller('liUrl -> 'urlSuffix, identibasePairs)

     joined
       .project(new Fields("email", "liUrl"))
       .write(Tsv(args("output")))

     // Puburl extractors.
     val urlPattern = "([0-9a-f]+)/([0-9a-f]+/[0-9a-f]+)".r
     def accountToLiUrlSuffix(s: String): String = urlPattern.findFirstIn(s).getOrElse("")

     def liUrlSuffix(s: String): String = s.split("/").takeRight(3).mkString("/")
   }

   Notes:
   • This particular job joined some of Dan's old LinkedIn data with edges from Identibase.
   • Important: when doing joins, it's VERY important to know the relative sizes of your data sets, or your joins will be stupid slow. That's because it's a HashJoin, to avoid the full NxM cost of a join.
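   To make the sizing note concrete, a hedged sketch of the Fields-API join variants (the job, pipes, and field names are all hypothetical):

   import com.twitter.scalding._

   class JoinSizingJob(args: Args) extends Job(args) {
     val purchases = Tsv(args("purchases"), ('userId, 'amount)) // the big stream
     val users = Tsv(args("users"), ('uid, 'email))             // the smaller stream

     purchases
       // joinWithSmaller: the right side is the smaller stream (the common case).
       // joinWithLarger: the right side is the bigger stream.
       // joinWithTiny: the right side fits in memory, giving a map-side
       //   hash join with no reduce phase.
       .joinWithSmaller('userId -> 'uid, users)
       .project('email, 'amount)
       .write(Tsv(args("output")))
   }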
9. Get familiar. A lot of large-scale problems quickly become a lot easier to solve.
10. I use Scalding in local mode for data processing. It's hard to beat the TSV/CSV stuff.
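   A sketch of the sort of local-mode cleanup job I mean (the job name, file layout, and fields are hypothetical); run it with scald.rb --local as shown at the end of the deck:

   import com.twitter.scalding._

   // Hypothetical job: keep only rows with a plausible email address.
   class CleanContactsJob(args: Args) extends Job(args) {
     Csv(args("input"), fields = ('name, 'email))
       .filter('email) { e: String => e.contains("@") }
       .write(Tsv(args("output")))
   }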
11. A Scalding job effortlessly finds that Douglas County is the fastest-growing county in the country. Not far behind: Summit County. What a surprise... Data democratization: https://gist.github.com/krishnanraman/4696053
12. I'd encourage you to take a look at this tutorial and run the examples. It's what convinced me, and I put it in production soon after. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
13. Read data & basic aggregation:

   /**
    * The input is a TSV file with three columns: (user, movie, rating).
    */
   val INPUT_FILENAME = "data/ratings.tsv"

   /**
    * Read in the input and give each field a type and name.
    */
   val ratings = Tsv(INPUT_FILENAME, ('user, 'movie, 'rating))

   /**
    * Let's also keep track of the total number of people who rated each movie.
    */
   val numRaters = ratings
     // Put the number of people who rated each movie into a field called "numRaters".
     .groupBy('movie) { _.size }
     .rename('size -> 'numRaters)

   // Merge `ratings` with `numRaters`, by joining on their movie fields.
   val ratingsWithSize = ratings.joinWithSmaller('movie -> 'movie, numRaters)

   // ratingsWithSize = (user, movie, rating, numRaters)
14. Joining the data with itself (small -> big):

   /**
    * To get all pairs of co-rated movies, we'll join `ratings` against itself.
    * So first make a dummy copy of the ratings that we can join against.
    */
   val ratings2 = ratingsWithSize
     .rename(('user, 'movie, 'rating, 'numRaters) -> ('user2, 'movie2, 'rating2, 'numRaters2))

   /**
    * Now find all pairs of co-rated movies (pairs of movies that a user has rated) by
    * joining the duplicate rating streams on their user fields.
    */
   val ratingPairs = ratingsWithSize
     .joinWithSmaller('user -> 'user2, ratings2)
     // De-dupe so that we don't calculate similarity of both (A, B) and (B, A).
     .filter('movie, 'movie2) { movies: (String, String) => movies._1 < movies._2 }
     .project('movie, 'rating, 'numRaters, 'movie2, 'rating2, 'numRaters2)

   // By grouping on ('movie, 'movie2), we can now get all the people who rated any pair of movies.
   // ratingPairs = (movie, rating, numRaters, movie2, rating2, numRaters2)
15. Setup for similarity metrics:

   /**
    * Compute dot products, norms, sums, and sizes of the rating vectors.
    */
   val vectorCalcs = ratingPairs
     // Compute (x*y, x^2, y^2), which we need for dot products and norms.
     .map(('rating, 'rating2) -> ('ratingProd, 'ratingSq, 'rating2Sq)) {
       ratings: (Double, Double) =>
         (ratings._1 * ratings._2, math.pow(ratings._1, 2), math.pow(ratings._2, 2))
     }
     .groupBy('movie, 'movie2) { group =>
       group.size // length of each vector
         .sum('ratingProd -> 'dotProduct)
         .sum('rating -> 'ratingSum)
         .sum('rating2 -> 'rating2Sum)
         .sum('ratingSq -> 'ratingNormSq)
         .sum('rating2Sq -> 'rating2NormSq)
         .max('numRaters)  // Just an easy way to make sure the numRaters field stays.
         .max('numRaters2) // Confusing hack, because we already aggregated this earlier.
       // All of these operations chain together like in a builder object.
     }
16. Similarity metrics:

   val PRIOR_COUNT = 10
   val PRIOR_CORRELATION = 0

   val similarities = vectorCalcs
     .map(('size, 'dotProduct, 'ratingSum, 'rating2Sum, 'ratingNormSq, 'rating2NormSq, 'numRaters, 'numRaters2) ->
          ('correlation, 'regularizedCorrelation, 'cosineSimilarity, 'jaccardSimilarity)) {

       fields: (Double, Double, Double, Double, Double, Double, Double, Double) =>

       val (size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq, numRaters, numRaters2) = fields

       val corr = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq)
       val regCorr = regularizedCorrelation(size, dotProduct, ratingSum, rating2Sum,
         ratingNormSq, rating2NormSq, PRIOR_COUNT, PRIOR_CORRELATION)
       val cosSim = cosineSimilarity(dotProduct, math.sqrt(ratingNormSq), math.sqrt(rating2NormSq))
       val jaccard = jaccardSimilarity(size, numRaters, numRaters2)

       (corr, regCorr, cosSim, jaccard)
     }
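   The deck never shows the similarity helpers themselves; these sketches follow the definitions in the Edwin Chen tutorial linked earlier (treat the exact signatures as assumptions, not the deck's code):

   // Pearson correlation, computed from the running sums.
   def correlation(size: Double, dotProduct: Double, ratingSum: Double,
                   rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double): Double = {
     val numerator = size * dotProduct - ratingSum * rating2Sum
     val denominator = math.sqrt(size * ratingNormSq - ratingSum * ratingSum) *
                       math.sqrt(size * rating2NormSq - rating2Sum * rating2Sum)
     numerator / denominator
   }

   // Shrink the raw correlation toward the prior when the sample is small.
   def regularizedCorrelation(size: Double, dotProduct: Double, ratingSum: Double,
                              rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double,
                              virtualCount: Double, priorCorrelation: Double): Double = {
     val unregularized = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq)
     val w = size / (size + virtualCount)
     w * unregularized + (1 - w) * priorCorrelation
   }

   // Cosine of the angle between the two rating vectors.
   def cosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double): Double =
     dotProduct / (ratingNorm * rating2Norm)

   // |intersection| / |union| of the two movies' rater sets.
   def jaccardSimilarity(usersInCommon: Double, totalUsers1: Double, totalUsers2: Double): Double = {
     val union = totalUsers1 + totalUsers2 - usersInCommon
     usersInCommon / union
   }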
17. Write results (writeHeader defaults to false):

   /**
    * Output all similarities to a TSV file.
    */
   similarities
     .project('item, 'item2, 'correlation, 'regularizedCorrelation, 'cosineSimilarity,
              'jaccardSimilarity, 'size, 'numRaters, 'numRaters2)
     .write(Tsv(args("output"), writeHeader = true))
18. groupRandomly -> send to n random reducers
    shuffle -> send to n random reducers
    partition -> separates on some predicate, then applies grouping functions over each resulting partition
    reduce -> applies an associative aggregation over a group (e.g. summation); this happens in the mapper
    fold -> a more fundamental reduce: it can run any function over a group and return any type; runs in the reduce phase
    approxUniques -> faster uniques, with an error percentage
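    A hedged sketch of a few of these in a groupBy block (the job and field names are hypothetical, and approxUniques' exact signature may differ by Scalding version):

    import com.twitter.scalding._

    // Hypothetical job assuming a TSV of (user, item, amount).
    class GroupOpsJob(args: Args) extends Job(args) {
      Tsv(args("input"), ('user, 'item, 'amount))
        .groupBy('user) { group =>
          group
            // Associative aggregation; can be partially applied map-side.
            .reduce('amount -> 'totalAmount) { (a: Double, b: Double) => a + b }
            // Arbitrary function and result type; forces the reduce phase.
            .foldLeft('item -> 'items)(List[String]()) { (acc: List[String], i: String) => i :: acc }
            // Estimated distinct count, traded against a small error percentage.
            .approxUniques('item -> 'uniqueItems)
        }
        .write(Tsv(args("output")))
    }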
19. I started with the following example project. It's a nice start, with instructions for running on EMR too:
    https://github.com/snowplow/scalding-example-project

    On Hadoop:
    hadoop jar scalding-li-0.0.1.jar com.fullcontact.hadoop.scalding.ScrapeToTsv --hdfs --input hdfs://scratch/user/xorlev/search_results --output hdfs://scratch/user/xorlev/urls-tsv/scrape.tsv

    Local mode (what I used to run it on Rome):
    ~/Code/oss/scalding/scripts/scald.rb --local ~/Code/test-scald/src/main/scala/test/analytics/ThreadSimilarityJob.scala --input /tmp/data_original.csv --output /tmp/out.csv
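    For context, that example project wires up the `hadoop jar` dispatch with a small runner main class along these lines (a sketch, not their exact code):

    import com.twitter.scalding.Tool
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.util.ToolRunner

    // The jar's manifest points at this object; the first program argument
    // is the Job class to run (e.g. com.fullcontact.hadoop.scalding.ScrapeToTsv),
    // followed by --hdfs or --local and the job's own --input/--output args.
    object JobRunner {
      def main(args: Array[String]): Unit =
        ToolRunner.run(new Configuration, new Tool, args)
    }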