
Large-Scale Machine Learning at Twitter


Slides for a talk at the 2012 Hadoop Summit.

Jimmy Lin

June 18, 2012


Transcript

1. What we were doing and what wasn't working…
   - Design goals
   - Scaling up machine learning
   - Integration with Pig
   - Simple case study
   - Path forward…

2. What doesn't work…
   1. Down-sampling for training on a single processor
      - Defeats the whole point of big data!
   2. Ad hoc productionizing
      - Disconnected from the rest of the production Oink workflow
      - None of the benefits of Oink
   So, we redesigned it… with two goals:

3. Supervised classification in a nutshell
   Given training instances (x_i, y_i), induce a classifier f such that the empirical loss is minimized.
   Consider functions of a parametric form, with model parameters (weights) w, (sparse) feature vectors x_i, labels y_i, and loss function \ell:
      \arg\min_{w} \frac{1}{n} \sum_{i=0}^{n} \ell( f(x_i; w), y_i )
   Key insight: machine learning as an optimization problem! (Closed-form solutions are generally not possible.)

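   As a concrete instance of this template (a worked example, not from the deck itself): the logistic regression classifiers used later fit it with a linear scoring function f(x; w) = w^{\top} x and the log loss, for labels y \in \{-1, +1\}:

      \ell( f(x_i; w), y_i ) = \log( 1 + \exp( -y_i \, w^{\top} x_i ) )
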
4. Gradient Descent
   Repeat until convergence:
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )

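   A minimal single-machine sketch of this batch update in Python (illustrative only; the deck's actual stack is Pig plus library code). The gradient is that of the log loss for a linear model, matching the logistic-regression setting used later; the function and variable names are mine:

    import numpy as np

    def batch_gradient_descent(X, y, gamma=0.1, iterations=100):
        """Batch gradient descent for log loss, labels y in {-1, +1}.

        Each step averages the gradient over *all* n training instances,
        mirroring w(t+1) = w(t) - gamma(t) * (1/n) * sum_i grad l(f(x_i; w), y_i).
        """
        n, d = X.shape
        w = np.zeros(d)
        for t in range(iterations):
            margins = y * (X @ w)                                  # y_i * w^T x_i
            grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n      # averaged gradient
            w -= gamma * grad
        return w
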
5. Intuition behind the math…
   (the gradient \nabla of \ell(x) generalizes the ordinary derivative d/dx)
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )
   New weights = old weights, minus an update based on the gradient.

6. Gradient Descent
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )
   Source: Wikipedia (Hills)

7. Gradient descent as a MapReduce job:
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )
   [Diagram: mappers compute partial gradients; a single reducer combines them and updates the model; iterate until convergence.]

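   A schematic of the map/reduce split in this diagram, sketched in plain Python rather than the Hadoop API (names are mine). Every mapper's partial gradient funnels into one reducer, and the driver must resubmit the whole job once per gradient step, which is exactly what the next slide criticizes:

    import numpy as np

    def map_partial_gradient(X_split, y_split, w):
        """Mapper: emit the partial gradient sum (and count) for one input split."""
        margins = y_split * (X_split @ w)                 # log loss, labels in {-1, +1}
        grad_sum = -(X_split.T @ (y_split / (1.0 + np.exp(margins))))
        return grad_sum, len(y_split)

    def reduce_update(partials, w, gamma):
        """Single reducer: average the partial gradients and update the model."""
        total = sum(n for _, n in partials)
        grad = sum(g for g, _ in partials) / total
        return w - gamma * grad

    # The driver resubmits the job with the new w until convergence --
    # one full MapReduce pass per gradient step.
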
8. Shortcomings
   - Hadoop is bad at iterative algorithms
   - High job startup costs
   - Awkward to retain state across iterations
   - High sensitivity to skew
   - Iteration speed bounded by the slowest task
   - Potentially poor cluster utilization
   - Must shuffle all data to a single reducer

9. Gradient Descent vs. Stochastic Gradient Descent (SGD)
   “Batch” learning: update the model after considering all training instances
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )
   “Online” learning: update the model after considering each (randomly selected) training instance
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell( f(x; w^{(t)}), y )
   Solves the iteration problem! What about the single-reducer problem?
   In practice… just as good!

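   A sketch of the online update in Python (illustrative; the deck's learners live behind Pig storage functions). One streaming pass over a single shuffled partition trains one classifier, which is why SGD removes the need for repeated batch iterations:

    import numpy as np

    def sgd(instances, num_features, gamma=0.1):
        """Stochastic gradient descent for log loss, labels in {-1, +1}.

        The model is updated after each (randomly ordered) training
        instance, so a single pass replaces many batch iterations.
        """
        w = np.zeros(num_features)
        for x, y in instances:             # x: NumPy feature vector, y in {-1, +1}
            margin = y * (x @ w)
            grad = -y * x / (1.0 + np.exp(margin))
            w -= gamma * grad
        return w
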
10. Ensemble Learning
   - Learn multiple models
   - Simplest possible technique: majority voting
   - Simple weighted voting
   - Why does it work?
     - If errors are uncorrelated, multiple classifiers being wrong is less likely
     - Reduces the variance component of error
   - Embarrassingly parallel ensemble learning: train each classifier on partitioned input (a minimal sketch follows below)
     - Contrast with boosting: more difficult to parallelize

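   A minimal sketch of simple weighted voting in Python, assuming each ensemble member is a linear model (a weight vector) trained as above; with equal weights it reduces to majority voting. The names are mine:

    import numpy as np

    def ensemble_predict(models, weights, x):
        """Weighted vote over independently trained linear classifiers.

        Each model casts a +1/-1 vote; votes are combined with per-model
        weights, and the sign of the weighted sum is the prediction.
        """
        votes = np.array([np.sign(x @ w) for w in models])
        return 1 if float(np.dot(weights, votes)) >= 0.0 else -1
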
11. When you only have a hammer… get rid of everything that's not a nail!
   - Stochastic gradient descent
   - Ensemble methods
   … a good fit for acyclic dataflows
   Source: Wikipedia (Hammer)

12. It's like an aggregate function!
   Machine learning is basically a user-defined aggregate function!

   Phase        AVG                            SGD
   initialize   sum = 0; count = 0             initialize weights
   update       add to sum; increment count    w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell( f(x; w^{(t)}), y )
   terminate    return sum / count             return weights

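   A sketch of the analogy in Python; this mirrors the initialize/update/terminate table above rather than Pig's actual UDF interfaces, and the class names are mine:

    import numpy as np

    class AvgUDAF:
        """The familiar AVG aggregate in initialize/update/terminate form."""
        def initialize(self):
            self.sum, self.count = 0.0, 0
        def update(self, value):
            self.sum += value
            self.count += 1
        def terminate(self):
            return self.sum / self.count

    class SgdUDAF:
        """SGD cast in the same shape: the 'aggregate' of the training
        instances it consumes is the final weight vector."""
        def __init__(self, num_features, gamma=0.1):
            self.num_features, self.gamma = num_features, gamma
        def initialize(self):
            self.w = np.zeros(self.num_features)
        def update(self, x, y):                      # one (x, y) training instance
            margin = y * (x @ self.w)
            self.w -= self.gamma * (-y * x / (1.0 + np.exp(margin)))
        def terminate(self):
            return self.w
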
13. Classifier Training
   training = load 'training.txt' using SVMLightStorage()
       as (target: int, features: map[]);
   store training into 'model/' using FeaturesLRClassifierBuilder();

   Want an ensemble?
   training = foreach training generate label, features, RANDOM() as random;
   training = order training by random parallel 5;

   Logistic regression + SGD (L2 regularization)
   Pegasos variant (fully SGD or sub-gradient)

14. Making Predictions
   define Classify ClassifyWithLR('model/');
   data = load 'test.txt' using SVMLightStorage()
       as (target: double, features: map[]);
   data = foreach data generate target, Classify(features) as prediction;

   Want an ensemble?
   define Classify ClassifyWithEnsemble('model/', 'classifier.LR', 'vote');

15. Sentiment Analysis Case Study
   - Binary polarity classification: {positive, negative} sentiment
     - Independently interesting task
     - Illustrates the end-to-end flow
     - Use the “emoticon trick” to gather data
   - Data
     - Test: 500k positive / 500k negative tweets from 9/1/2011
     - Training: {1m, 10m, 100m} instances from before (50/50 split)
   - Features: sliding-window byte 4-grams (see the sketch below)
   - Models:
     - Logistic regression with SGD (L2 regularization)
     - Ensembles of various sizes (simple weighted voting)

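   An illustrative Python sketch of sliding-window byte 4-gram features; the deck does not show the exact encoding its library uses, so the details here (UTF-8 bytes, raw counts) are assumptions:

    def byte_4grams(text):
        """Sliding-window byte 4-gram features for a tweet (illustrative).

        Encodes the text as UTF-8 and counts every 4-byte window into a
        sparse map, the kind of map[] feature vector the Pig scripts
        pass around.
        """
        data = text.encode('utf-8')
        features = {}
        for i in range(len(data) - 3):
            gram = data[i:i + 4]
            features[gram] = features.get(gram, 0) + 1
        return features
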
16. -- Load tweets
    status = load '/tables/statuses/$DATE' using StatusLoader()
        as (id: long, uid: long, text: chararray);
    status = foreach status generate text, RANDOM() as random;
    status = filter status by IdentifyLanguage(text) == 'en';

    -- Branch 1: filter for positive examples (positive emoticons)
    positive = filter status by ContainsPositiveEmoticon(text)
        and not ContainsNegativeEmoticon(text) and length(text) > 20;
    positive = foreach positive generate (int) 1 as label,
        RemovePositiveEmoticons(text) as text, random;
    positive = order positive by random;  -- Randomize ordering of tweets.
    positive = limit positive $N;         -- Take N positive examples.

    -- Branch 2: filter for negative examples (negative emoticons)
    negative = filter status by ContainsNegativeEmoticon(text)
        and not ContainsPositiveEmoticon(text) and length(text) > 20;
    negative = foreach negative generate (int) -1 as label,
        RemoveNegativeEmoticons(text) as text, random;
    negative = order negative by random;  -- Randomize ordering of tweets.
    negative = limit negative $N;         -- Take N negative examples.

    -- Shuffle together: randomize order of positive and negative examples
    training = union positive, negative;
    training = foreach training generate (int) $0 as label,
        (chararray) $1 as text, RANDOM() as random;
    training = order training by random parallel $PARTITIONS;
    training = foreach training generate label, text;

    -- Train!
    store training into '$OUTPUT' using LRClassifierBuilder();

17. [Chart: accuracy (roughly 0.75–0.82) vs. number of classifiers in the ensemble, with series for 1m, 10m, and 100m training instances; callout: ensembles built from 10m examples improve accuracy “for free.”]

18. Related Work
   - Big data and machine learning:
     - Integration into a DB? MADlib, Bismarck
     - Integration into a custom DSL? Spark
     - Integration into an existing package? MATLAB, R
   - Faster iterative algorithms in MapReduce
     - HaLoop, Twister, PrIter: require a different programming model
   - Why not just use Mahout?
     - Our core ML libraries pre-date Mahout
     - Tighter integration with our internal workflows
     - Mahout + Pig can be integrated in exactly the same way!

19. Source: Wikipedia (Sonoran Desert)
    Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD 2012), May 2012, Scottsdale, Arizona.