
Large-Scale Machine Learning at Twitter


Slides for a talk at the 2012 Hadoop Summit.

Jimmy Lin

June 18, 2012


Transcript

1. What we were doing and what wasn't working…
   - Design goals
   - Scaling up machine learning
   - Integration with Pig
   - Simple case study
   - Path forward…

2. What doesn't work…
   1. Down-sampling for training on a single processor
      - Defeats the whole point of big data!
   2. Ad hoc productionizing
      - Disconnected from the rest of the production Oink workflow
      - None of the benefits of Oink
   So, we redesigned it… with two goals:

3. Supervised classification in a nutshell
   Given training instances (x_i, y_i), induce a classifier f such that the empirical loss is minimized.
   Consider functions of a parametric form, with model parameters (weights) w, (sparse) feature vectors x_i, labels y_i, and loss function \ell:
      \arg\min_{w} \frac{1}{n} \sum_{i=0}^{n} \ell( f(x_i; w), y_i )
   Key insight: machine learning as an optimization problem! (Closed-form solutions are generally not possible.)

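   As a concrete instance of this template (a worked example, not from the deck itself): the logistic regression classifiers used later fit it with a linear scoring function f(x; w) = w^{\top} x and the log loss, for labels y \in \{-1, +1\}:

      \ell( f(x_i; w), y_i ) = \log( 1 + \exp( -y_i \, w^{\top} x_i ) )
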
4. Gradient Descent
   Repeat until convergence:
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )

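   A minimal single-machine sketch of this batch update in Python (illustrative only; the deck's actual stack is Pig plus library code). The gradient is that of the log loss for a linear model, matching the logistic-regression setting used later; the function and variable names are mine:

    import numpy as np

    def batch_gradient_descent(X, y, gamma=0.1, iterations=100):
        """Batch gradient descent for log loss, labels y in {-1, +1}.

        Each step averages the gradient over *all* n training instances,
        mirroring w(t+1) = w(t) - gamma(t) * (1/n) * sum_i grad l(f(x_i; w), y_i).
        """
        n, d = X.shape
        w = np.zeros(d)
        for t in range(iterations):
            margins = y * (X @ w)                                  # y_i * w^T x_i
            grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n      # averaged gradient
            w -= gamma * grad
        return w
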
5. Intuition behind the math…
   (the gradient \nabla of \ell(x) generalizes the ordinary derivative d/dx)
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )
   New weights = old weights, minus an update based on the gradient.

6. Gradient Descent
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )
   Source: Wikipedia (Hills)

7. Gradient descent as a MapReduce job:
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )
   [Diagram: mappers compute partial gradients; a single reducer combines them and updates the model; iterate until convergence.]

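   A schematic of the map/reduce split in this diagram, sketched in plain Python rather than the Hadoop API (names are mine). Every mapper's partial gradient funnels into one reducer, and the driver must resubmit the whole job once per gradient step, which is exactly what the next slide criticizes:

    import numpy as np

    def map_partial_gradient(X_split, y_split, w):
        """Mapper: emit the partial gradient sum (and count) for one input split."""
        margins = y_split * (X_split @ w)                 # log loss, labels in {-1, +1}
        grad_sum = -(X_split.T @ (y_split / (1.0 + np.exp(margins))))
        return grad_sum, len(y_split)

    def reduce_update(partials, w, gamma):
        """Single reducer: average the partial gradients and update the model."""
        total = sum(n for _, n in partials)
        grad = sum(g for g, _ in partials) / total
        return w - gamma * grad

    # The driver resubmits the job with the new w until convergence --
    # one full MapReduce pass per gradient step.
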
8. Shortcomings
   - Hadoop is bad at iterative algorithms
   - High job startup costs
   - Awkward to retain state across iterations
   - High sensitivity to skew
   - Iteration speed bounded by the slowest task
   - Potentially poor cluster utilization
   - Must shuffle all data to a single reducer

9. Gradient Descent vs. Stochastic Gradient Descent (SGD)
   “Batch” learning: update the model after considering all training instances
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell( f(x_i; w^{(t)}), y_i )
   “Online” learning: update the model after considering each (randomly selected) training instance
      w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell( f(x; w^{(t)}), y )
   Solves the iteration problem! What about the single-reducer problem?
   In practice… just as good!

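   A sketch of the online update in Python (illustrative; the deck's learners live behind Pig storage functions). One streaming pass over a single shuffled partition trains one classifier, which is why SGD removes the need for repeated batch iterations:

    import numpy as np

    def sgd(instances, num_features, gamma=0.1):
        """Stochastic gradient descent for log loss, labels in {-1, +1}.

        The model is updated after each (randomly ordered) training
        instance, so a single pass replaces many batch iterations.
        """
        w = np.zeros(num_features)
        for x, y in instances:             # x: NumPy feature vector, y in {-1, +1}
            margin = y * (x @ w)
            grad = -y * x / (1.0 + np.exp(margin))
            w -= gamma * grad
        return w
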
10. Ensemble Learning
   - Learn multiple models
   - Simplest possible technique: majority voting
   - Simple weighted voting
   - Why does it work?
     - If errors are uncorrelated, multiple classifiers being wrong is less likely
     - Reduces the variance component of error
   - Embarrassingly parallel ensemble learning: train each classifier on partitioned input (a minimal sketch follows below)
     - Contrast with boosting: more difficult to parallelize

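   A minimal sketch of simple weighted voting in Python, assuming each ensemble member is a linear model (a weight vector) trained as above; with equal weights it reduces to majority voting. The names are mine:

    import numpy as np

    def ensemble_predict(models, weights, x):
        """Weighted vote over independently trained linear classifiers.

        Each model casts a +1/-1 vote; votes are combined with per-model
        weights, and the sign of the weighted sum is the prediction.
        """
        votes = np.array([np.sign(x @ w) for w in models])
        return 1 if float(np.dot(weights, votes)) >= 0.0 else -1
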
11. When you only have a hammer… get rid of everything that's not a nail!
   - Stochastic gradient descent
   - Ensemble methods
   … a good fit for acyclic dataflows
   Source: Wikipedia (Hammer)

12. It's like an aggregate function!
   Machine learning is basically a user-defined aggregate function!

   Phase        AVG                            SGD
   initialize   sum = 0; count = 0             initialize weights
   update       add to sum; increment count    w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell( f(x; w^{(t)}), y )
   terminate    return sum / count             return weights

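   A sketch of the analogy in Python; this mirrors the initialize/update/terminate table above rather than Pig's actual UDF interfaces, and the class names are mine:

    import numpy as np

    class AvgUDAF:
        """The familiar AVG aggregate in initialize/update/terminate form."""
        def initialize(self):
            self.sum, self.count = 0.0, 0
        def update(self, value):
            self.sum += value
            self.count += 1
        def terminate(self):
            return self.sum / self.count

    class SgdUDAF:
        """SGD cast in the same shape: the 'aggregate' of the training
        instances it consumes is the final weight vector."""
        def __init__(self, num_features, gamma=0.1):
            self.num_features, self.gamma = num_features, gamma
        def initialize(self):
            self.w = np.zeros(self.num_features)
        def update(self, x, y):                      # one (x, y) training instance
            margin = y * (x @ self.w)
            self.w -= self.gamma * (-y * x / (1.0 + np.exp(margin)))
        def terminate(self):
            return self.w
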
13. Classifier Training
   training = load 'training.txt' using SVMLightStorage()
       as (target: int, features: map[]);
   store training into 'model/' using FeaturesLRClassifierBuilder();

   Want an ensemble?
   training = foreach training generate label, features, RANDOM() as random;
   training = order training by random parallel 5;

   Logistic regression + SGD (L2 regularization)
   Pegasos variant (fully SGD or sub-gradient)

14. Making Predictions
   define Classify ClassifyWithLR('model/');
   data = load 'test.txt' using SVMLightStorage()
       as (target: double, features: map[]);
   data = foreach data generate target, Classify(features) as prediction;

   Want an ensemble?
   define Classify ClassifyWithEnsemble('model/', 'classifier.LR', 'vote');

15. Sentiment Analysis Case Study
   - Binary polarity classification: {positive, negative} sentiment
     - Independently interesting task
     - Illustrates the end-to-end flow
     - Use the “emoticon trick” to gather data
   - Data
     - Test: 500k positive / 500k negative tweets from 9/1/2011
     - Training: {1m, 10m, 100m} instances from before (50/50 split)
   - Features: sliding-window byte 4-grams (see the sketch below)
   - Models:
     - Logistic regression with SGD (L2 regularization)
     - Ensembles of various sizes (simple weighted voting)

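   An illustrative Python sketch of sliding-window byte 4-gram features; the deck does not show the exact encoding its library uses, so the details here (UTF-8 bytes, raw counts) are assumptions:

    def byte_4grams(text):
        """Sliding-window byte 4-gram features for a tweet (illustrative).

        Encodes the text as UTF-8 and counts every 4-byte window into a
        sparse map, the kind of map[] feature vector the Pig scripts
        pass around.
        """
        data = text.encode('utf-8')
        features = {}
        for i in range(len(data) - 3):
            gram = data[i:i + 4]
            features[gram] = features.get(gram, 0) + 1
        return features
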
16. -- Load tweets
    status = load '/tables/statuses/$DATE' using StatusLoader()
        as (id: long, uid: long, text: chararray);
    status = foreach status generate text, RANDOM() as random;
    status = filter status by IdentifyLanguage(text) == 'en';

    -- Branch 1: filter for positive examples (positive emoticons)
    positive = filter status by ContainsPositiveEmoticon(text)
        and not ContainsNegativeEmoticon(text) and length(text) > 20;
    positive = foreach positive generate (int) 1 as label,
        RemovePositiveEmoticons(text) as text, random;
    positive = order positive by random;  -- Randomize ordering of tweets.
    positive = limit positive $N;         -- Take N positive examples.

    -- Branch 2: filter for negative examples (negative emoticons)
    negative = filter status by ContainsNegativeEmoticon(text)
        and not ContainsPositiveEmoticon(text) and length(text) > 20;
    negative = foreach negative generate (int) -1 as label,
        RemoveNegativeEmoticons(text) as text, random;
    negative = order negative by random;  -- Randomize ordering of tweets.
    negative = limit negative $N;         -- Take N negative examples.

    -- Shuffle together: randomize order of positive and negative examples
    training = union positive, negative;
    training = foreach training generate (int) $0 as label,
        (chararray) $1 as text, RANDOM() as random;
    training = order training by random parallel $PARTITIONS;
    training = foreach training generate label, text;

    -- Train!
    store training into '$OUTPUT' using LRClassifierBuilder();

17. [Chart: accuracy (roughly 0.75–0.82) vs. number of classifiers in the ensemble, with series for 1m, 10m, and 100m training instances; callout: ensembles built from 10m examples improve accuracy “for free.”]

18. Related Work
   - Big data and machine learning:
     - Integration into a DB? MADlib, Bismarck
     - Integration into a custom DSL? Spark
     - Integration into an existing package? MATLAB, R
   - Faster iterative algorithms in MapReduce
     - HaLoop, Twister, PrIter: require a different programming model
   - Why not just use Mahout?
     - Our core ML libraries pre-date Mahout
     - Tighter integration with our internal workflows
     - Mahout + Pig can be integrated in exactly the same way!

19. Source: Wikipedia (Sonoran Desert)
    Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD 2012), May 2012, Scottsdale, Arizona.