
Large-Scale Machine Learning at Twitter


Slides for a talk at the 2012 Hadoop Summit.

Jimmy Lin

June 18, 2012


Transcript

  1. Training a Smarter Pig: Large-Scale Machine Learning at Twitter


  2. Source: Wikipedia (All Souls College, Oxford)
    From the Ivory Tower…


  3. Source: Wikipedia (Factory)
    … to building sh*t that works.



  4. #numbers



  5. 140 characters


    140 million active users


    340 million Tweets per day



PBs of total data warehouse capacity


    ~100 TB daily ingestion


Tens of thousands of daily Hadoop jobs



  7. #shamelessplugs


    Wed. 4:30pm


  8. Source: Wikipedia (Everest)
    “traditional” business intelligence


  9. Goals


    Develop a generic machine learning platform

    Make machine learning tools easier to use



  10. Source: Wikipedia (Plumbing)
    Contributions


  11.   What we were doing and what
    wasn’t working…

      Design goals

      Scaling up machine learning

      Integration with Pig

      Simple case study

    Path forward…


  12. Machine learning… B.H.

    Source: Wikipedia (Stonehenge)
    Summer 2008: Twitter acquires Summize
    (real-time search, sentiment analysis)



  13. Source: http://www.flickr.com/photos/neilsingapore/4119503693/
    Machine learning… B.P.


  14. [Architecture diagram: frontend, databases and services; data]


  15. Production considerations:


    dependency management


    scheduling


    resource allocation


    monitoring


    error reporting


    alerting






  16. “one off” machine learning

    data munging

Joining multiple datasets

    Feature extraction



    down-sample

    train

    download


  17. What doesn’t work…

    1.  Down-sampling for training on a single processor

      Defeats the whole point of big data!

    2.  Ad hoc productionizing

      Disconnected from rest of production Oink workflow

      None of the benefits of Oink

    So, we redesigned it… with two goals:



  18. Seamless scaling

    Source: Wikipedia (Galaxy)


  19. Integration with production workflows

    Source: Wikipedia (Oil refinery)


  20. Source: Wikipedia (Sorting)
    Classification


  21. Supervised classification in a nutshell

    Given training examples, each a (sparse) feature vector x_i with a label y_i,
    induce a function f s.t. loss is minimized.

    Consider functions of a parametric form, with model parameters (weights) $w$:

    $\arg\min_{w} \frac{1}{n}\sum_{i=0}^{n} \ell\big(f(x_i; w),\, y_i\big)$

    where $\ell$ is the loss function and the sum is the empirical loss
    (closed-form solutions generally not possible).

    Key insight: machine learning as an optimization problem!

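    As a concrete instance of this parametric form (an illustration, not a slide from the deck):
    a linear model with the logistic loss, i.e. the logistic-regression-with-SGD learner used
    later in the talk, plugs in as

    $f(x; w) = w^{\top} x, \qquad \ell\big(f(x; w), y\big) = \log\!\big(1 + e^{-y\, w^{\top} x}\big), \qquad y \in \{-1, +1\}.$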

  22. Gradient Descent

    Repeat until convergence:

    $w(t+1) = w(t) - \gamma(t)\,\frac{1}{n}\sum_{i=0}^{n} \nabla \ell\big(f(x_i; w(t)),\, y_i\big)$

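    A minimal sketch of this batch update in Python/NumPy (an illustration; the function and
    variable names are mine, not code from the deck), assuming a linear model with logistic loss
    and labels in {-1, +1}:

    import numpy as np

    def batch_gradient_descent(X, y, steps=100, gamma=0.1):
        # X: n x d matrix of feature vectors, y: n-vector of {-1, +1} labels
        n, d = X.shape
        w = np.zeros(d)                                   # initial weights w(0)
        for t in range(steps):                            # "repeat until convergence" (fixed steps here)
            margins = y * (X @ w)
            # average gradient of the logistic loss over all n training instances
            grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
            w = w - gamma * grad                          # w(t+1) = w(t) - gamma(t) * gradient
        return w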



  23. Intuition behind the math…

    $w(t+1) = w(t) - \gamma(t)\,\frac{1}{n}\sum_{i=0}^{n} \nabla \ell\big(f(x_i; w(t)),\, y_i\big)$

    New weights $w(t+1)$ = old weights $w(t)$, updated by a step against the averaged gradient.



  24. Gradient Descent

    $w(t+1) = w(t) - \gamma(t)\,\frac{1}{n}\sum_{i=0}^{n} \nabla \ell\big(f(x_i; w(t)),\, y_i\big)$

    Source: Wikipedia (Hills)



  25. $w(t+1) = w(t) - \gamma(t)\,\frac{1}{n}\sum_{i=0}^{n} \nabla \ell\big(f(x_i; w(t)),\, y_i\big)$

    [Diagram: mappers each compute a partial gradient over their split of the training data;
    a single reducer sums the partial gradients and updates the model; iterate until convergence.]

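    A rough sketch of one such iteration as a map/reduce pair (Python/NumPy, with hypothetical
    names; my illustration of the slide's diagram, not Twitter's implementation):

    import numpy as np

    def map_partial_gradient(X_split, y_split, w):
        # mapper: partial (summed, not averaged) logistic-loss gradient over one data split
        margins = y_split * (X_split @ w)
        partial = -(X_split.T @ (y_split / (1.0 + np.exp(margins))))
        return partial, len(y_split)                      # emit partial gradient and instance count

    def reduce_update(partials, w, gamma):
        # single reducer: combine partial gradients, then take one step to update the model
        total_grad = sum(grad for grad, _ in partials)
        n = sum(count for _, count in partials)
        return w - gamma * (total_grad / n)

    A driver would rerun this pair, feeding the updated weights back to the mappers, until
    convergence; the per-iteration job startup and the single-reducer shuffle are exactly the
    shortcomings the next slide lists.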

  26. Shortcomings

      Hadoop is bad at iterative algorithms
        High job startup costs
        Awkward to retain state across iterations

      High sensitivity to skew
        Iteration speed bounded by slowest task

      Potentially poor cluster utilization
        Must shuffle all data to a single reducer


  27. Gradient Descent

    Source: Wikipedia (Hills)


  28. Stochastic Gradient Descent

    Source: Wikipedia (Water Slide)


  29. Gradient Descent ("batch" learning): update model after considering all
    training instances

    $w(t+1) = w(t) - \gamma(t)\,\frac{1}{n}\sum_{i=0}^{n} \nabla \ell\big(f(x_i; w(t)),\, y_i\big)$

    Stochastic Gradient Descent (SGD) ("online" learning): update model after considering
    each (randomly-selected) training instance

    $w(t+1) = w(t) - \gamma(t)\,\nabla \ell\big(f(x; w(t)),\, y\big)$

    In practice… just as good!

    Solves the iteration problem!

    What about the single reducer problem?

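    A minimal single-pass SGD sketch (Python/NumPy, illustrative names of my own; logistic loss,
    labels in {-1, +1}):

    import numpy as np

    def sgd_train(instances, d, gamma=0.1):
        # instances: iterable of (feature_vector, label) pairs, ideally in random order
        w = np.zeros(d)                                   # initialize weights
        for x, y in instances:
            margin = y * np.dot(w, x)
            grad = -y * x / (1.0 + np.exp(margin))        # gradient of the loss at this one instance
            w = w - gamma * grad                          # update after every instance
        return w

    Because the learner consumes instances one at a time in a single stream, it no longer needs
    a Hadoop job per iteration; the remaining question, as the slide asks, is how to avoid
    funneling all data through one reducer, which is where ensembles come in.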

  30. Ensembles

    Source: Wikipedia (Orchestra)


  31. Ensemble Learning

      Learn multiple models
        Simplest possible technique: majority voting
        Simple weighted voting (see the sketch below)

      Why does it work?
        If errors are uncorrelated, multiple classifiers being wrong is less likely
        Reduces the variance component of error

      Embarrassingly parallel ensemble learning:
        Train each classifier on partitioned input
        Contrast with boosting: more difficult to parallelize

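    A sketch of simple weighted voting over an ensemble of linear classifiers (Python/NumPy,
    hypothetical names; the deck does not prescribe this exact scheme):

    import numpy as np

    def ensemble_predict(models, x, weights=None):
        # models: list of weight vectors, one per classifier; weights: per-classifier vote weights
        if weights is None:
            weights = [1.0] * len(models)                 # equal weights = simple majority voting
        votes = sum(a * np.sign(np.dot(w, x)) for a, w in zip(weights, models))
        return 1 if votes >= 0 else -1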

  32. When you only have a hammer…

    … get rid of everything that’s not a nail!

    Stochastic gradient descent

    Ensemble methods

    Source: Wikipedia (Hammer)
    … good fit for acyclic dataflows


  33. It's like an aggregate function!

    Machine learning is basically a user-defined aggregate function!

    AVG:
      initialize:  sum = 0; count = 0
      update:      add to sum; increment count
      terminate:   return sum / count

    SGD:
      initialize:  initialize weights
      update:      $w(t+1) = w(t) - \gamma(t)\,\nabla \ell\big(f(x; w(t)),\, y\big)$
      terminate:   return weights

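    A sketch of SGD in that initialize/update/terminate shape (plain Python with a hypothetical
    interface; Pig's actual UDF and storage classes are not shown on the slides):

    import numpy as np

    class SGDAggregator:
        # SGD expressed like an aggregate function over (feature_vector, label) tuples
        def __init__(self, d, gamma=0.1):
            self.w = np.zeros(d)                          # initialize: zero weight vector
            self.gamma = gamma

        def update(self, x, y):                           # update: one gradient step per tuple
            margin = y * np.dot(self.w, x)
            self.w -= self.gamma * (-y * x / (1.0 + np.exp(margin)))

        def terminate(self):                              # terminate: emit the learned weights
            return self.w

    Because only these three hooks are needed, the learner can be driven by an ordinary store
    statement, which is what the following Pig examples do.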

  34. Classifier


  35. It’s just Pig!

    For “free”: dependency management,
    scheduling, resource allocation,
    monitoring, error reporting, alerting, …


  36. Classifier Training

    training = load 'training.txt' using SVMLightStorage()
        as (target: int, features: map[]);
    store training into 'model/'
        using FeaturesLRClassifierBuilder();

    Want an ensemble?

    training = foreach training generate
        target, features, RANDOM() as random;
    training = order training by random parallel 5;

    Logistic regression + SGD (L2 regularization)
    Pegasos variant (fully SGD or sub-gradient)


  37. Making Predictions

    define Classify ClassifyWithLR('model/');
    data = load 'test.txt' using SVMLightStorage()
        as (target: double, features: map[]);
    data = foreach data generate target,
        Classify(features) as prediction;

    Want an ensemble?

    define Classify ClassifyWithEnsemble('model/',
        'classifier.LR', 'vote');


  38. Sentiment Analysis Case Study

      Binary polarity classification: {positive, negative} sentiment
        Independently interesting task
        Illustrates end-to-end flow
        Use the "emoticon trick" to gather data

      Data
        Test: 500k positive / 500k negative tweets from 9/1/2011
        Training: {1m, 10m, 100m} instances from before (50/50 split)

      Features: sliding-window byte 4-grams (see the sketch below)

      Models:
        Logistic regression with SGD (L2 regularization)
        Ensembles of various sizes (simple weighted voting)

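    A sketch of sliding-window byte 4-gram feature extraction (Python; the function name and the
    count-map representation are my assumptions, the deck only names the feature type):

    def byte_4grams(text):
        # every overlapping 4-byte window of the UTF-8 encoded tweet, as a sparse count map
        data = text.encode('utf-8')
        features = {}
        for i in range(len(data) - 3):
            gram = data[i:i + 4]
            features[gram] = features.get(gram, 0) + 1
        return features

    # e.g. byte_4grams("hello") -> {b'hell': 1, b'ello': 1}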

  39. -- Load tweets
    status = load '/tables/statuses/$DATE' using StatusLoader()
        as (id: long, uid: long, text: chararray);
    status = foreach status generate text, RANDOM() as random;
    status = filter status by IdentifyLanguage(text) == 'en';

    -- Branch 1: filter for positive examples (positive emoticons)
    positive = filter status by ContainsPositiveEmoticon(text) and not ContainsNegativeEmoticon(text)
        and length(text) > 20;
    positive = foreach positive generate (int) 1 as label, RemovePositiveEmoticons(text) as text, random;
    positive = order positive by random;     -- Randomize ordering of tweets
    positive = limit positive $N;            -- Take N positive examples

    -- Branch 2: filter for negative examples (negative emoticons)
    negative = filter status by ContainsNegativeEmoticon(text) and not ContainsPositiveEmoticon(text)
        and length(text) > 20;
    negative = foreach negative generate (int) -1 as label, RemoveNegativeEmoticons(text) as text, random;
    negative = order negative by random;     -- Randomize ordering of tweets
    negative = limit negative $N;            -- Take N negative examples

    -- Shuffle together: randomize order of positive and negative examples
    training = union positive, negative;
    training = foreach training generate (int) $0 as label, (chararray) $1 as text, RANDOM() as random;
    training = order training by random parallel $PARTITIONS;
    training = foreach training generate label, text;

    -- Train!
    store training into '$OUTPUT' using LRClassifierBuilder();


  40. [Chart: accuracy (y-axis, roughly 0.75 to 0.82) vs. number of classifiers in the ensemble
    (x-axis, 1 to 41), with separate curves for 1m, 10m, and 100m training instances; the gain
    from larger ensembles is annotated "for free".]

    Ensembles with 10m examples


  41. Twitter Applications


    Anti-abuse


    Follower recommendation


    User modeling






  42. Related Work

      Big data and machine learning:
        Integration into DB? MADlib, Bismarck
        Integration into custom DSL? Spark
        Integration into existing package? MATLAB, R

      Faster iterative algorithms in MapReduce:
        HaLoop, Twister, PrIter: requires a different programming model

      Why not just use Mahout?
        Our core ML libraries pre-date Mahout
        Tighter integration with our internal workflows
        Mahout + Pig can be integrated in exactly the same way!


  43. Source: Wikipedia (Sonoran Desert)
    Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter.
    Proceedings of the 2012 ACM SIGMOD International Conference on
    Management of Data (SIGMOD 2012), May 2012, Scottsdale, Arizona.


  44. Questions?

    …btw, we’re hiring

    Twittering Machine. Paul Klee (1922) watercolor and ink
