
Large-Scale Machine Learning at Twitter


Slides for talk at 2012 Hadoop Summit.

Jimmy Lin

June 18, 2012


Transcript

  1. Training a Smarter Pig: Large-Scale Machine Learning at Twitter

  2. Source: Wikipedia (All Souls College, Oxford) From the Ivory Tower…

  3. Source: Wikipedia (Factory) … to building sh*t that works.

  4. #numbers

  5. 140 characters, 140 million active users, 340 million Tweets per day
  6. Petabytes of total DW capacity, ~100 TB of daily ingestion, tens of thousands of daily Hadoop jobs
  7. #shamelessplugs Wed. 4:30pm

  8. Source: Wikipedia (Everest) “traditional” business intelligence

  9. Goals Develop a generic machine learning platform Make machine learning

    tools easier to use
  10. Source: Wikipedia (Plumbing) Contributions

  11. What we were doing and what wasn't working… • Design goals • Scaling up
    machine learning • Integration with Pig • Simple case study • Path forward…
  12. Machine learning… B.H. Source: Wikipedia (Stonehenge) Summer 2008: Twitter acquires

    Summize (real-time search, sentiment analysis)
  13. Source: http://www.flickr.com/photos/neilsingapore/4119503693/ Machine learning… B.P.

  14. [Architecture diagram: frontends, databases, and services producing data]

  15. Production considerations: dependency management, scheduling, resource allocation,
    monitoring, error reporting, alerting, …
  16. "One-off" machine learning: data munging (joining multiple datasets, feature
    extraction, …), then down-sample, download, and train
  17. What doesn't work… 1. Down-sampling for training on a single processor:
    defeats the whole point of big data! 2. Ad hoc productionizing: disconnected
    from the rest of the production Oink workflow, so none of the benefits of Oink.
    So, we redesigned it… with two goals:
  18. Seamless scaling Source: Wikipedia (Galaxy)

  19. Integration with production workflows Source: Wikipedia (Oil refinery)

  20. Source: Wikipedia (Sorting) Classification

  21. Supervised classification in a nutshell. Given training instances, each a
    (sparse) feature vector $x_i$ with label $y_i$, induce a function $f$ of a
    parametric form such that the empirical loss is minimized:
    $\arg\min_{w} \frac{1}{n} \sum_{i=0}^{n} \ell(f(x_i; w), y_i)$
    where $\ell$ is the loss function and $w$ the model parameters (weights).
    Key insight: machine learning as an optimization problem! (Closed-form
    solutions are generally not possible.)
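    To make the objective concrete (an illustrative instantiation, not something
    stated on this slide): with a linear model, the log loss used for logistic
    regression later in the deck, and labels $y_i \in \{-1, +1\}$, the problem becomes

        $\arg\min_{w} \frac{1}{n} \sum_{i=0}^{n} \log\left(1 + e^{-y_i \, w^{\top} x_i}\right)$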
  22. Gradient Descent. Repeat until convergence:
    $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; w^{(t)}), y_i)$
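    A rough sketch of this update in Python (illustrative only; it assumes the log
    loss, and is not the deck's actual implementation, which lives behind Pig):

        import numpy as np

        def batch_gradient_descent(X, y, learning_rate=0.1, iterations=100):
            # Batch gradient descent for logistic regression, labels y in {-1, +1}.
            # Each iteration touches all n training instances before updating w.
            n, d = X.shape
            w = np.zeros(d)
            for t in range(iterations):
                margins = y * (X @ w)                   # y_i * <w, x_i>
                coeffs = -y / (1.0 + np.exp(margins))   # per-instance gradient coefficient
                gradient = (X.T @ coeffs) / n           # average gradient over all instances
                w -= learning_rate * gradient           # w(t+1) = w(t) - gamma(t) * gradient
            return w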
  23. Intuition behind the math… new weights $w^{(t+1)}$ = old weights $w^{(t)}$
    minus an update based on the gradient of the loss $\ell(x)$ ($\nabla$ being the
    multi-dimensional analogue of $\frac{d}{dx}$):
    $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; w^{(t)}), y_i)$
  24. Gradient Descent:
    $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; w^{(t)}), y_i)$
    Source: Wikipedia (Hills)
  25. $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; w^{(t)}), y_i)$
    In MapReduce: mappers compute partial gradients; a single reducer sums them and
    updates the model; iterate until convergence.
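    A minimal sketch of that decomposition (hypothetical helper functions, assuming
    the same log loss as above; the point is only the data flow):

        import numpy as np

        def gradient_mapper(shard_X, shard_y, w):
            # Mapper: partial gradient of the log loss over one shard of the input.
            coeffs = -shard_y / (1.0 + np.exp(shard_y * (shard_X @ w)))
            return shard_X.T @ coeffs, len(shard_y)   # (summed gradient, instance count)

        def gradient_reducer(partials, w, learning_rate):
            # Single reducer: sum the partial gradients and apply one update step.
            total_gradient = sum(g for g, _ in partials)
            n = sum(c for _, c in partials)
            return w - learning_rate * (total_gradient / n)

        # A driver resubmits this map/reduce pass until the weights converge,
        # which is exactly the per-iteration overhead the next slide criticizes.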
  26. Shortcomings: • Hadoop is bad at iterative algorithms: high job startup costs,
    awkward to retain state across iterations • High sensitivity to skew: iteration
    speed bounded by the slowest task • Potentially poor cluster utilization: must
    shuffle all data to a single reducer
  27. Gradient Descent Source: Wikipedia (Hills)

  28. Stochastic Gradient Descent Source: Wikipedia (Water Slide)

  29. Gradient Descent vs. Stochastic Gradient Descent (SGD).
    "Batch" learning: update the model after considering all training instances:
    $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; w^{(t)}), y_i)$
    "Online" learning: update the model after considering each (randomly-selected)
    training instance:
    $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell(f(x; w^{(t)}), y)$
    Solves the iteration problem! What about the single-reducer problem?
    In practice… just as good!
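    A sketch of the online update (again Python for illustration, with the log
    loss assumed):

        import numpy as np

        def sgd(X, y, learning_rate=0.1, epochs=1, seed=0):
            # Stochastic gradient descent: update the model after each
            # randomly-ordered training instance, instead of once per full pass.
            rng = np.random.default_rng(seed)
            n, d = X.shape
            w = np.zeros(d)
            for _ in range(epochs):
                for i in rng.permutation(n):
                    margin = y[i] * (X[i] @ w)
                    gradient = -y[i] * X[i] / (1.0 + np.exp(margin))
                    w -= learning_rate * gradient   # w(t+1) = w(t) - gamma(t) * gradient
            return w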
  30. Ensembles Source: Wikipedia (Orchestra)

  31. Ensemble Learning: • Learn multiple models • Simplest possible technique:
    majority voting • Simple weighted voting • Why does it work? If errors are
    uncorrelated, multiple classifiers being wrong is less likely; reduces the
    variance component of error • Embarrassingly parallel ensemble learning:
    train each classifier on partitioned input • Contrast with boosting: more
    difficult to parallelize
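    A sketch of the voting step (hypothetical names; it assumes linear member
    models whose sign gives a -1/+1 prediction):

        import numpy as np

        def ensemble_predict(models, x, vote_weights=None):
            # Each member model was trained independently on its own data partition.
            votes = np.sign([w @ x for w in models])     # per-classifier prediction in {-1, +1}
            if vote_weights is None:
                vote_weights = np.ones(len(models))      # uniform weights = simple majority voting
            return 1 if np.dot(vote_weights, votes) >= 0 else -1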
  32. When you only have a hammer… get rid of everything that's not a nail!
    Stochastic gradient descent + ensemble methods: a good fit for acyclic dataflows.
    Source: Wikipedia (Hammer)
  33. It's like an aggregate function! Machine learning is basically a user-defined
    aggregate function!
    AVG: initialize (sum = 0, count = 0); update (add to sum, increment count);
    terminate (return sum / count).
    SGD: initialize (weights); update
    ($w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell(f(x; w^{(t)}), y)$);
    terminate (return weights).
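    A sketch of that analogy in Python (not the actual Pig storage function; just
    the initialize / update / terminate shape of an aggregate):

        import numpy as np

        class SGDAggregator:
            # An online learner written in the shape of a user-defined aggregate
            # function: initialize state, fold in one tuple at a time, terminate.
            def __init__(self, dim, learning_rate=0.1):
                self.w = np.zeros(dim)            # initialize: zero weight vector
                self.learning_rate = learning_rate

            def update(self, x, y):
                # Fold in one (feature vector, label) pair with a single SGD step.
                gradient = -y * x / (1.0 + np.exp(y * (x @ self.w)))
                self.w -= self.learning_rate * gradient

            def terminate(self):
                return self.w                     # like AVG returning sum / count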
  34. Classifier

  35. It’s just Pig! For “free”: dependency management, scheduling, resource allocation,

    monitoring, error reporting, alerting, …
  36. Classifier Training

        training = load 'training.txt' using SVMLightStorage()
            as (target: int, features: map[]);
        store training into 'model/' using FeaturesLRClassifierBuilder();

    Want an ensemble?

        training = foreach training generate label, features, RANDOM() as random;
        training = order training by random parallel 5;

    Logistic regression + SGD (L2 regularization); Pegasos variant (full SGD or sub-gradient)
  37. Making Predictions

        define Classify ClassifyWithLR('model/');
        data = load 'test.txt' using SVMLightStorage()
            as (target: double, features: map[]);
        data = foreach data generate target, Classify(features) as prediction;

    Want an ensemble?

        define Classify ClassifyWithEnsemble('model/', 'classifier.LR', 'vote');
  38. Sentiment Analysis Case Study • Binary polarity classification: {positive,
    negative} sentiment • Independently interesting task • Illustrates the
    end-to-end flow • Use the "emoticon trick" to gather data • Data: test = 500k
    positive / 500k negative tweets from 9/1/2011; training = {1m, 10m, 100m}
    instances from before (50/50 split) • Features: sliding-window byte-4-grams
    (sketched below) • Models: logistic regression with SGD (L2 regularization);
    ensembles of various sizes (simple weighted voting)
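    To make "sliding-window byte-4-grams" concrete, a rough sketch (my own
    illustrative helper, not the production feature extractor):

        def byte_4gram_features(text):
            # Every 4-byte window of the UTF-8 encoded tweet becomes a binary feature.
            data = text.encode("utf-8")
            return {data[i:i + 4]: 1 for i in range(len(data) - 3)}

        # byte_4gram_features("great day") -> features keyed by
        # b"grea", b"reat", b"eat ", b"at d", b"t da", b" day"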
  39.
        -- Load tweets
        status = load '/tables/statuses/$DATE' using StatusLoader()
            as (id: long, uid: long, text: chararray);
        status = foreach status generate text, RANDOM() as random;
        status = filter status by IdentifyLanguage(text) == 'en';

        -- Branch 1: filter for positive examples (positive emoticons)
        positive = filter status by ContainsPositiveEmoticon(text) and
            not ContainsNegativeEmoticon(text) and length(text) > 20;
        positive = foreach positive generate (int) 1 as label,
            RemovePositiveEmoticons(text) as text, random;
        positive = order positive by random;   -- Randomize ordering of tweets.
        positive = limit positive $N;          -- Take N positive examples.

        -- Branch 2: filter for negative examples (negative emoticons)
        negative = filter status by ContainsNegativeEmoticon(text) and
            not ContainsPositiveEmoticon(text) and length(text) > 20;
        negative = foreach negative generate (int) -1 as label,
            RemoveNegativeEmoticons(text) as text, random;
        negative = order negative by random;   -- Randomize ordering of tweets.
        negative = limit negative $N;          -- Take N negative examples.

        -- Shuffle positive and negative examples together, randomize order
        training = union positive, negative;
        training = foreach training generate (int) $0 as label,
            (chararray) $1 as text, RANDOM() as random;
        training = order training by random parallel $PARTITIONS;
        training = foreach training generate label, text;

        -- Train!
        store training into '$OUTPUT' using LRClassifierBuilder();
  40. [Plot: accuracy (0.75–0.82) vs. number of classifiers in the ensemble, for
    1m, 10m, and 100m training instances; ensembles with 10m examples gain
    accuracy "for free".]
  41. Twitter Applications: anti-abuse, follower recommendation, user modeling, …

  42. Related Work • Big data and machine learning: integration into the DB?
    (MADlib, Bismarck); integration into a custom DSL? (Spark); integration into
    an existing package? (MATLAB, R) • Faster iterative algorithms in MapReduce:
    HaLoop, Twister, PrIter (require a different programming model) • Why not just
    use Mahout? Our core ML libraries pre-date Mahout; tighter integration with our
    internal workflows; Mahout + Pig can be integrated in exactly the same way!
  43. Source: Wikipedia (Sonoran Desert) Jimmy Lin and Alek Kolcz. Large-Scale

    Machine Learning at Twitter. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD 2012), May 2012, Scottsdale, Arizona.
  44. Questions? …btw, we’re hiring Twittering Machine. Paul Klee (1922) watercolor

    and ink