Slide 1

Training a Smarter Pig: Large-Scale Machine Learning at Twitter

Slide 2

Source: Wikipedia (All Souls College, Oxford) From the Ivory Tower…

Slide 3

Source: Wikipedia (Factory) … to building sh*t that works.

Slide 4

#numbers

Slide 5

140 characters
140 million active users
340 million Tweets per day

Slide 6

Petabytes of total data warehouse capacity
~100 TB daily ingestion
Tens of thousands of daily Hadoop jobs

Slide 7

#shamelessplugs Wed. 4:30pm

Slide 8

Source: Wikipedia (Everest) “traditional” business intelligence

Slide 9

Goals
- Develop a generic machine learning platform
- Make machine learning tools easier to use

Slide 10

Source: Wikipedia (Plumbing) Contributions

Slide 11

- What we were doing and what wasn’t working…
- Design goals
- Scaling up machine learning
- Integration with Pig
- Simple case study
- Path forward…

Slide 12

Machine learning… B.H.
Summer 2008: Twitter acquires Summize (real-time search, sentiment analysis)
Source: Wikipedia (Stonehenge)

Slide 13

Machine learning… B.P.
Source: http://www.flickr.com/photos/neilsingapore/4119503693/

Slide 14

frontend databases and services data

Slide 15

Production considerations:
- dependency management
- scheduling
- resource allocation
- monitoring
- error reporting
- alerting
- …

Slide 16

“One off” machine learning:
- Data munging: joining multiple datasets, feature extraction, …
- Down-sample, train, download

Slide 17

What doesn’t work…

1. Down-sampling for training on a single processor
   - Defeats the whole point of big data!
2. Ad hoc productionizing
   - Disconnected from the rest of the production Oink workflow
   - None of the benefits of Oink

So, we redesigned it… with two goals:

Slide 18

Seamless scaling Source: Wikipedia (Galaxy)

Slide 19

Integration with production workflows Source: Wikipedia (Oil refinery)

Slide 20

Source: Wikipedia (Sorting) Classification

Slide 21

Supervised classification in a nutshell

Given (sparse) feature vectors $x_i$ with labels $y_i$, consider functions of a parametric form $f(x; w)$ with model parameters (weights) $w$. Induce $f$ such that the empirical loss, for some loss function $\ell$, is minimized:

\[ \operatorname{argmin}_{w} \; \frac{1}{n} \sum_{i=0}^{n} \ell\big(f(x_i; w),\, y_i\big) \]

(closed-form solutions are generally not possible)

Key insight: machine learning as an optimization problem!
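The objective above can be sketched in a few lines. The logistic loss and the sparse-dict linear model used here are illustrative assumptions, not Twitter's actual implementation:

```python
import math

def logistic_loss(score, y):
    # Log loss for a label y in {-1, +1} and a raw model score f(x; w).
    return math.log(1.0 + math.exp(-y * score))

def f(x, w):
    # Parametric form: a linear model f(x; w) = w . x over sparse dict features.
    return sum(w.get(k, 0.0) * v for k, v in x.items())

def empirical_loss(data, w):
    # (1/n) * sum of per-instance losses -- the quantity being minimized.
    return sum(logistic_loss(f(x, w), y) for x, y in data) / len(data)

data = [({"good": 1.0}, +1), ({"bad": 1.0}, -1)]
w = {"good": 2.0, "bad": -2.0}
print(empirical_loss(data, w))  # low loss: the weights agree with the labels
```

Learning is then the search over `w` that drives `empirical_loss` down, which is exactly what the gradient methods on the next slides do.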

Slide 22

Gradient Descent

Repeat until convergence:

\[ w^{(t+1)} = w^{(t)} - \gamma^{(t)} \, \frac{1}{n} \sum_{i=0}^{n} \nabla \ell\big(f(x_i; w^{(t)}),\, y_i\big) \]

Slide 23

Intuition behind the math…

\[ w^{(t+1)} = w^{(t)} - \gamma^{(t)} \, \frac{1}{n} \sum_{i=0}^{n} \nabla \ell\big(f(x_i; w^{(t)}),\, y_i\big) \]

New weights = old weights, adjusted by an update based on the gradient ($\nabla$ generalizes the one-dimensional derivative $\frac{d}{dx}\,\ell(x)$ to many dimensions).

Slide 24

Gradient Descent

\[ w^{(t+1)} = w^{(t)} - \gamma^{(t)} \, \frac{1}{n} \sum_{i=0}^{n} \nabla \ell\big(f(x_i; w^{(t)}),\, y_i\big) \]

Source: Wikipedia (Hills)

Slide 25

\[ w^{(t+1)} = w^{(t)} - \gamma^{(t)} \, \frac{1}{n} \sum_{i=0}^{n} \nabla \ell\big(f(x_i; w^{(t)}),\, y_i\big) \]

- mappers: compute partial gradients
- single reducer: combine partial gradients and update the model
- iterate until convergence
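One iteration of this mapper/reducer pattern, simulated in-process; the shards, toy squared loss, and one-dimensional model are illustrative stand-ins for the actual Hadoop mappers and reducer:

```python
def partial_gradient(shard, w):
    # "Mapper": sum of per-instance gradients over this data shard
    # (squared loss for a toy model y ~ w*x).
    g = sum((w * x - y) * x for x, y in shard)
    return g, len(shard)

def reduce_and_update(partials, w, gamma):
    # Single "reducer": combine the partial sums, average, take a gradient step.
    total_g = sum(g for g, _ in partials)
    total_n = sum(n for _, n in partials)
    return w - gamma * total_g / total_n

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]  # data partitioned across mappers
w = 0.0
for _ in range(100):  # "iterate until convergence"
    partials = [partial_gradient(s, w) for s in shards]
    w = reduce_and_update(partials, w, gamma=0.1)
print(w)  # approaches 2.0 (data generated by y = 2x)
```

Note that each of these iterations is a full pass over the data, which on Hadoop means one job per iteration; that cost is exactly the shortcoming the next slide lists.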

Slide 26

Shortcomings
- Hadoop is bad at iterative algorithms
  - High job startup costs
  - Awkward to retain state across iterations
- High sensitivity to skew
  - Iteration speed bounded by the slowest task
- Potentially poor cluster utilization
  - Must shuffle all data to a single reducer

Slide 27

Gradient Descent Source: Wikipedia (Hills)

Slide 28

Stochastic Gradient Descent Source: Wikipedia (Water Slide)

Slide 29

Gradient Descent — “batch” learning: update the model after considering all training instances

\[ w^{(t+1)} = w^{(t)} - \gamma^{(t)} \, \frac{1}{n} \sum_{i=0}^{n} \nabla \ell\big(f(x_i; w^{(t)}),\, y_i\big) \]

Stochastic Gradient Descent (SGD) — “online” learning: update the model after considering each (randomly-selected) training instance

\[ w^{(t+1)} = w^{(t)} - \gamma^{(t)} \, \nabla \ell\big(f(x; w^{(t)}),\, y\big) \]

In practice… just as good! Solves the iteration problem! What about the single reducer problem?
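The online update can be sketched as follows; the squared loss, one-dimensional model, and step count are illustrative assumptions:

```python
import random

def sgd(data, gamma=0.05, steps=2000, seed=0):
    # Stochastic gradient descent: update after EACH randomly-selected
    # instance, per w(t+1) = w(t) - gamma * grad l(f(x; w(t)), y).
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)   # one randomly-selected training instance
        grad = (w * x - y) * x    # gradient of squared loss for this instance
        w = w - gamma * grad      # online update -- no pass over the full data
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x
print(sgd(data))
```

No sum over all n instances appears anywhere, which is why the per-iteration Hadoop job disappears; the single-reducer question is what the ensemble slides answer next.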

Slide 30

Ensembles Source: Wikipedia (Orchestra)

Slide 31

Ensemble Learning
- Learn multiple models
  - Simplest possible technique: majority voting
  - Simple weighted voting
- Why does it work?
  - If errors are uncorrelated, multiple classifiers being wrong simultaneously is less likely
  - Reduces the variance component of error
- Embarrassingly parallel ensemble learning: train each classifier on partitioned input
  - Contrast with boosting: more difficult to parallelize
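Simple weighted voting can be sketched as below; the three toy classifiers and the tie-breaking toward +1 are hypothetical choices for illustration:

```python
def weighted_vote(classifiers, weights, x):
    # Each classifier votes {-1, +1}; the weighted sum decides the label.
    score = sum(w * c(x) for c, w in zip(classifiers, weights))
    return 1 if score >= 0 else -1

# Three classifiers with (somewhat) uncorrelated errors:
c1 = lambda x: 1 if x > 0 else -1    # correct on x = 5
c2 = lambda x: 1 if x > 10 else -1   # wrong on x = 5
c3 = lambda x: 1 if x > -3 else -1   # correct on x = 5

print(weighted_vote([c1, c2, c3], [1.0, 1.0, 1.0], 5))  # majority outvotes c2: +1
```

With equal weights this reduces to majority voting; the one wrong classifier is outvoted, which is the variance-reduction argument in miniature.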

Slide 32

When you only have a hammer… get rid of everything that’s not a nail!

Stochastic gradient descent + ensemble methods: a good fit for acyclic dataflows

Source: Wikipedia (Hammer)

Slide 33

It’s like an aggregate function! Machine learning is basically a user-defined aggregate function!

            AVG                           SGD
initialize  sum = 0; count = 0            initialize weights
update      add to sum; increment count   w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)
terminate   return sum / count            return weights
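The analogy as code: both AVG and SGD fit the same initialize/update/terminate shape. The class names and the squared-loss SGD step are illustrative assumptions, not the actual UDAF interface:

```python
class Avg:
    def initialize(self):
        self.sum, self.count = 0.0, 0
    def update(self, value):
        self.sum += value           # add to sum
        self.count += 1             # increment count
    def terminate(self):
        return self.sum / self.count

class SgdTrainer:
    def __init__(self, gamma=0.05):
        self.gamma = gamma
    def initialize(self):
        self.w = 0.0                # initialize weights
    def update(self, instance):
        x, y = instance             # one training example per update() call
        self.w -= self.gamma * (self.w * x - y) * x  # SGD step (squared loss)
    def terminate(self):
        return self.w               # return weights

avg = Avg(); avg.initialize()
for v in [1.0, 2.0, 3.0]:
    avg.update(v)
print(avg.terminate())  # 2.0
```

Because training has the same shape as AVG, the dataflow engine can treat a trainer like any other aggregate, which is what makes the Pig integration on the following slides possible.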

Slide 34

Classifier

Slide 35

It’s just Pig! For “free”: dependency management, scheduling, resource allocation, monitoring, error reporting, alerting, …

Slide 36

Classifier Training

training = load 'training.txt' using SVMLightStorage()
    as (target: int, features: map[]);
store training into 'model/' using FeaturesLRClassifierBuilder();

Logistic regression + SGD (L2 regularization); Pegasos variant (fully SGD or sub-gradient)

Want an ensemble?

training = foreach training generate label, features, RANDOM() as random;
training = order training by random parallel 5;

Slide 37

Making Predictions

define Classify ClassifyWithLR('model/');
data = load 'test.txt' using SVMLightStorage()
    as (target: double, features: map[]);
data = foreach data generate target, Classify(features) as prediction;

Want an ensemble?

define Classify ClassifyWithEnsemble('model/', 'classifier.LR', 'vote');

Slide 38

Sentiment Analysis Case Study
- Binary polarity classification: {positive, negative} sentiment
  - Independently interesting task
  - Illustrates the end-to-end flow
  - Use the “emoticon trick” to gather labeled data
- Data
  - Test: 500k positive / 500k negative tweets from 9/1/2011
  - Training: {1m, 10m, 100m} instances from before that date (50/50 split)
- Features: sliding-window byte 4-grams
- Models:
  - Logistic regression with SGD (L2 regularization)
  - Ensembles of various sizes (simple weighted voting)
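One plausible reading of “sliding-window byte 4-grams” is counting every 4-byte window of the tweet; the exact encoding used is not specified on the slide, so treat this as a sketch:

```python
def byte_4grams(text):
    # Slide a 4-byte window over the UTF-8 bytes of the text and count
    # each distinct 4-gram -- a sparse feature map per tweet.
    data = text.encode("utf-8")
    feats = {}
    for i in range(len(data) - 3):
        gram = data[i:i + 4]
        feats[gram] = feats.get(gram, 0) + 1
    return feats

print(byte_4grams(":) ok"))  # 5 bytes -> two overlapping 4-grams
```

Byte n-grams need no tokenizer or language-specific preprocessing, which suits multilingual, noisy tweet text.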

Slide 39

-- Load tweets
status = load '/tables/statuses/$DATE' using StatusLoader()
    as (id: long, uid: long, text: chararray);
status = foreach status generate text, RANDOM() as random;
status = filter status by IdentifyLanguage(text) == 'en';

-- Branch 1: filter for positive examples
positive = filter status by ContainsPositiveEmoticon(text)
    and not ContainsNegativeEmoticon(text) and length(text) > 20;
positive = foreach positive generate (int) 1 as label,
    RemovePositiveEmoticons(text) as text, random;
positive = order positive by random;  -- Randomize ordering of tweets
positive = limit positive $N;         -- Take N positive examples

-- Branch 2: filter for negative examples
negative = filter status by ContainsNegativeEmoticon(text)
    and not ContainsPositiveEmoticon(text) and length(text) > 20;
negative = foreach negative generate (int) -1 as label,
    RemoveNegativeEmoticons(text) as text, random;
negative = order negative by random;  -- Randomize ordering of tweets
negative = limit negative $N;         -- Take N negative examples

-- Shuffle positive and negative examples together, randomize
training = union positive, negative;
training = foreach training generate (int) $0 as label,
    (chararray) $1 as text, RANDOM() as random;
training = order training by random parallel $PARTITIONS;
training = foreach training generate label, text;

-- Train!
store training into '$OUTPUT' using LRClassifierBuilder();

Slide 40

[Chart: accuracy (roughly 0.75–0.82) vs. number of classifiers in the ensemble, for training sets of 1m, 10m, and 100m instances. Annotation: ensembles with 10m examples gain accuracy “for free.”]

Slide 41

Twitter Applications
- Anti-abuse
- Follower recommendation
- User modeling
- …

Slide 42

Related Work
- Big data and machine learning:
  - Integration into a DB? MADlib, Bismarck
  - Integration into a custom DSL? Spark
  - Integration into an existing package? MATLAB, R
- Faster iterative algorithms in MapReduce
  - HaLoop, Twister, PrIter: require a different programming model
- Why not just use Mahout?
  - Our core ML libraries pre-date Mahout
  - Tighter integration with our internal workflows
  - Mahout + Pig can be integrated in exactly the same way!

Slide 43

Source: Wikipedia (Sonoran Desert) Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD 2012), May 2012, Scottsdale, Arizona.

Slide 44

Questions?

…btw, we’re hiring

Twittering Machine, Paul Klee (1922), watercolor and ink