"Small Data" Machine Learning

What do you do if you have a single-letter Twitter handle and your reply stream is virtually useless because people use “@a” in ways that make you lose faith in humanity (“I’m @a bar” is the best of it)? These days, you might call machine learning to the rescue. ML is a hot topic, but it can be difficult to get started with in a large project. This session explains how I used ML and a couple of other tricks to build a reply stream cleaner that sanitizes my Twitter consumption, and it will hopefully inspire you to use ML in your own projects.

Andrei Zmievski

January 26, 2013

Transcript

  1. 2.

    WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
  3. 4.
  4. 7.

    @a: Acquired in October 2008. Had a different account earlier, but then @k asked if I wanted it. Know many other single-letter Twitterers.
  5. 14.

    CONS: I hate humanity. Disadvantages: visual filtering is next to impossible. Could be a set of hard-coded rules derived empirically.
  8. 18.

    REPLYCLEANER: Even with false negatives, reduces garbage to where visual filtering is possible.
    - uses a trained model to classify tweets into good/bad
    - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
  10. 24.
  11. 28.

    “Field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)
  12. 33.

    Supervised vs. unsupervised learning. Supervised: labeled dataset; training maps inputs to desired outputs. Examples: regression (predicting house prices), classification (spam filtering).
  13. 36.
  14. 38.

    features      weights
    1             45.7
    # of rooms    102.3
    sq. footage   0.94
    house age     -10.1
    yard?         83.0

    features · weights = prediction

    Feature vector and weights vector. A 1 is added to pad the feature vector (it accounts for the initial offset / bias / intercept weight and simplifies calculation). The dot product produces a linear predictor.
  19. 43.

    $X = [\, 1 \;\; x_1 \;\; x_2 \;\; \dots \,]$

    $\theta = [\, \theta_0 \;\; \theta_1 \;\; \theta_2 \;\; \dots \,]$

    Dot product: $\theta \cdot X = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots$
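As a minimal sketch of the dot product above, the snippet below uses the weights from the house-price slides. The house itself (3 rooms, 1500 sq. ft., 20 years old, with a yard) is made up for illustration, and `linear_predictor` is a hypothetical helper name.

```python
# Weights in the order [bias, # of rooms, sq. footage, house age, yard?].
weights = [45.7, 102.3, 0.94, -10.1, 83.0]

# Hypothetical house; the leading 1 pads the vector so the bias weight applies.
features = [1, 3, 1500, 20, 1]


def linear_predictor(theta, x):
    """Return theta . x = theta_0 * 1 + theta_1 * x_1 + theta_2 * x_2 + ..."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same length")
    return sum(t * xi for t, xi in zip(theta, x))


prediction = linear_predictor(weights, features)
print(prediction)
```

Padding with a 1 means the intercept needs no special case: it is just one more term in the sum.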
  21. 45.

    training data → learning algorithm → hypothesis. Hypothesis (decision function): what the system has learned so far. The hypothesis is applied to new data.
  22. 46.

    $h_\theta(X)$: $\theta$ = parameters, $X$ = input data, output = prediction $y$. The task of our algorithm is to determine the parameters of the hypothesis.
  26. 50.

    LINEAR REGRESSION [Plot: whisky price ($) versus whisky age] Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  29. 53.

    LOGISTIC REGRESSION $g(z) = \frac{1}{1 + e^{-z}}$, where $z = \theta \cdot X$ [Plot: g(z) versus z, rising from 0 to 1 and passing through 0.5 at z = 0] Logistic function (also called the sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at the origin. z is just our old dot product, the linear predictor.
  31. 55.

    LOGISTIC REGRESSION $h_\theta(X) = \frac{1}{1 + e^{-\theta \cdot X}}$ = probability that y=1 for input X. If the hypothesis describes spam, then given X = body of email, an output of 0.7 means there’s a 70% chance it’s spam. Thresholding on that is up to you.
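A small sketch of the logistic hypothesis described above. The function names are illustrative, and the two-branch form of `sigmoid` is an implementation choice (not from the slides) that avoids overflow for very negative z.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x): estimated probability that y = 1 for the padded input x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


print(sigmoid(0))                             # crosses 0.5 at the origin
print(hypothesis([0.1, 0.1], [1, 12.0]))
```

Because the output always lies between 0 and 1, it can be read as a probability and thresholded wherever your tolerance dictates.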
  32. 61.

    Features should be independent & discriminant. Independent: feature A should not co-occur (correlate) highly with feature B. Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
  33. 62.

    possible features
    ‣ @a at the end of the tweet
    ‣ @a...
    ‣ length < N chars
    ‣ # of user mentions in the tweet
    ‣ # of hashtags
    ‣ language!
    ‣ @a followed by punctuation and a word character (except for apostrophe)
  34. 63.

    feature = extractor(tweet). For each feature, write a small function that takes a tweet and returns a numeric value.
  35. 64.

    corpus → extractors → feature vectors. Run the set of these functions over the corpus and build up feature vectors (an array of arrays). Save to DB.
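The extractor idea can be sketched as follows. This is only an illustration: the tweet shape (a dict with a `text` key), the extractor names, the regular expressions, and the length cut-off of 20 are all assumptions, not the talk's actual code.

```python
import re


def ends_with_mention(tweet):
    """1 if @a is the last thing in the tweet (ignoring trailing punctuation)."""
    return 1 if re.search(r"@a\W*$", tweet["text"], re.IGNORECASE) else 0


def is_short(tweet, n=20):
    """1 if the tweet is shorter than n characters (n is an arbitrary choice)."""
    return 1 if len(tweet["text"]) < n else 0


def mention_count(tweet):
    return len(re.findall(r"@\w+", tweet["text"]))


def hashtag_count(tweet):
    return len(re.findall(r"#\w+", tweet["text"]))


EXTRACTORS = [ends_with_mention, is_short, mention_count, hashtag_count]


def feature_vector(tweet):
    # The leading 1 pads the vector for the bias weight.
    return [1] + [extract(tweet) for extract in EXTRACTORS]


corpus = [
    {"text": "I'm @a bar"},
    {"text": "@a nice talk on #ml today, thanks!"},
]
vectors = [feature_vector(tweet) for tweet in corpus]  # array of arrays
print(vectors)
```

Keeping each extractor as its own small function makes it cheap to add or drop features between training runs.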
  36. 67.

    Top 12 Languages

    code  language    count
    id    Indonesian  3548
    en    English     1804
    tl    Tagalog     733
    es    Spanish     329
    so    Somali      305
    ja    Japanese    300
    pt    Portuguese  262
    ar    Arabic      256
    nl    Dutch       150
    it    Italian     137
    sw    Swahili     118
    fr    French      92

    I guarantee you people aren’t tweeting at me in Swahili.
  37. 68.

    Language Detection: pear/Text_LanguageDetect, pecl/textcat. Can’t trust the language field in the user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.
  39. 70.

    ✓ Clean up text (remove mentions, links, etc.)
    ✓ Run language detection
    ✓ If unknown/low weight, pretend it’s English, else:
    ✓ If not a character set-determined language, try harder:
      ✓ Tokenize into words
      ✓ Difference with English vocabulary
      ✓ If words remain, run parts-of-speech tagger on each
      ✓ For NNS, VBZ, and VBD run stemming algorithm
      ✓ If result is in English vocabulary, remove from remaining
      ✓ If remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words)
      ✓ If ratio < 20%, pretend it’s English

    Result: English / Not English

    A lot of this is heuristic-based, after some trial-and-error. Seems to help with my corpus.
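The "try harder" part of the checklist above might look roughly like this. It is a simplified sketch under stated assumptions: the vocabulary is a tiny hypothetical stand-in for a real English word list, the parts-of-speech step is skipped, and `crude_stem` is a naive suffix stripper standing in for a real stemming algorithm.

```python
import re

# Hypothetical, tiny stand-in for a real English vocabulary corpus.
ENGLISH_VOCAB = {"i", "am", "at", "a", "bar", "the", "walk", "dog", "like", "this", "talk"}


def crude_stem(word):
    """Very rough suffix stripping; a real stemmer would replace this."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Remove links, mentions, and hashtags before looking at the words."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"[@#]\w+", " ", text)


def unusual_word_ratio(text, vocab=ENGLISH_VOCAB):
    words = re.findall(r"[a-z']+", clean(text).lower())
    if not words:
        return 0.0
    # Difference with the vocabulary, then give stemmed forms a second chance.
    remaining = [w for w in words if w not in vocab]
    remaining = [w for w in remaining if crude_stem(w) not in vocab]
    return len(remaining) / len(words)


def looks_english(text, max_ratio=0.20):
    """Pretend it's English when fewer than 20% of the words are unusual."""
    return unusual_word_ratio(text) < max_ratio


print(looks_english("I walked the dogs @a http://x.co"))
print(looks_english("aku suka ini @a"))
```

The 20% cut-off comes from the checklist; everything else here would need tuning against a real corpus.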
  40. 71.

    BINARY CLASSIFICATION: Grunt work. Built a web-based tool to display tweets a page at a time and select good ones.
  41. 72.

    INPUT: feature vectors. OUTPUT: labels (good/bad). Had my input and output.
  42. 74.

    BIAS CORRECTION: BAD vs. GOOD. 99% = bad (fewer than 100 tweets were good). Training a model as-is would not produce good results. Need to adjust the bias.
  43. 76.

    OVERSAMPLING: use multiple copies of good tweets to equalize with bad. Problem: bias very high, each good tweet would have to be copied 100 times and would not contribute any variance to the good category.
  45. 78.

    UNDERSAMPLING: drop most of the bad tweets to equalize with good. Problem: total corpus ends up being < 200 tweets, not enough for training.
  47. 80.

    SYNTHETIC OVERSAMPLING: Synthesize feature vectors by determining what constitutes a good tweet and doing weighted random selection of feature values.
  48. 81.

    chance           feature
    90%              “good” language
    70% / 25% / 5%   no hashtags / 1 hashtag / 2 hashtags
    2%               @a at the end
    85%              rand length > 10

    Example of synthesized values, one per row: 1, 2, 0, 77

    The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000).
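A rough sketch of weighted random selection using the chances listed above. The function name, the vector layout, and the reading of "85% rand length > 10" as "85% of the time pick a random length from 11 to 140, otherwise 1 to 10" are assumptions; as noted, the real synthesis was more involved.

```python
import random


def synthesize_good_vector(rng):
    """Return [good_language, hashtags, at_a_at_end, length] for a fake good tweet."""
    good_language = 1 if rng.random() < 0.90 else 0
    hashtags = rng.choices([0, 1, 2], weights=[70, 25, 5])[0]
    at_end = 1 if rng.random() < 0.02 else 0
    if rng.random() < 0.85:
        length = rng.randint(11, 140)   # assumed upper bound: 140 characters
    else:
        length = rng.randint(1, 10)
    return [good_language, hashtags, at_end, length]


def synthesize(count, seed=None):
    """Generate `count` synthetic feature vectors, reproducibly if seeded."""
    rng = random.Random(seed)
    return [synthesize_good_vector(rng) for _ in range(count)]


print(synthesize(3, seed=1))
```

Unlike plain oversampling, each synthetic vector differs slightly, so the good class gains some variance instead of exact copies.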
  53. 86.

    Model Training: We have the hypothesis (decision function) and the training set. How do we actually determine the weights/parameters?
  54. 87.

    COST FUNCTION (reality vs. prediction): $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$. Measures how far the prediction of the system is from reality. The cost depends on the parameters. The lower the cost, the closer we are to the ideal parameters for the model.
  57. 90.

    LOGISTIC COST: $\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
  58. 91.

    [Plots: cost versus h(x) between 0 and 1, one curve for y=1 and one for y=0] Correct guess: cost = 0. Incorrect guess: cost = huge. When y=1 and h(x) is 1 (good guess), cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y=0.
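The logistic cost can be sketched in a few lines. The function names are illustrative, and clipping h into (eps, 1 - eps) is an added safeguard (not from the slides) so that log(0) never occurs.

```python
import math


def cost(h, y, eps=1e-12):
    """Logistic cost for one example: -log(h) if y = 1, -log(1 - h) if y = 0."""
    h = min(max(h, eps), 1.0 - eps)   # keep log() away from 0
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


def total_cost(predictions, labels):
    """J(theta): average cost over all m examples."""
    m = len(labels)
    return sum(cost(h, y) for h, y in zip(predictions, labels)) / m


print(cost(0.99, 1))   # confident and right: small cost
print(cost(0.01, 1))   # confident and wrong: large cost
print(total_cost([0.786, 0.438], [1, 0]))
```

The asymmetry is the point: a confident wrong answer costs far more than a hesitant one.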
  59. 93.

    GRADIENT DESCENT: Random starting point. Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down a hill.
  60. 94.

    GRADIENT DESCENT: $\theta_i = \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}$. Each step adjusts the parameters according to the slope.
  61. 95.

    $\theta_i$ = each parameter. Have to update them simultaneously (the whole vector at a time).
  62. 96.

    $\alpha$ = learning rate. Controls how big a step you take. If α is big, you have an aggressive gradient descent; if α is small, you take tiny steps. If too small, it takes too long. If too big, it can overshoot the minimum and fail to converge.
  63. 97.

    $\frac{\partial J(\theta)}{\partial \theta_i}$ = derivative, aka “the slope”. The slope indicates the steepness of the descent step for each weight, i.e. direction. Keep going for a number of iterations or until the cost is below a threshold (convergence). Graph the cost function versus # of iterations and see where it starts to approach 0; past that are diminishing returns.
  64. 98.

    FINAL UPDATE ALGORITHM: $\theta_i = \theta_i - \alpha \sum_{j=1}^{m} \left( h_\theta(x^{(j)}) - y^{(j)} \right) x_i^{(j)}$. The derivative for logistic regression simplifies to this term. Have to update the weights simultaneously!
  65. 99.

    Worked example of one iteration (X1_0 means element 0 of X1, and so on):

    X1 = [1 12.0], y1 = 1
    X2 = [1 -3.5], y2 = 0
    θ = [0.1 0.1], α = 0.05

    Hypothesis for each data point, based on the current parameters:

    h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
    h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438

    Each parameter is updated in order and the result is saved to a temporary:

    T0 = 0.1 - 0.05 • ((h(X1) - y1) • X1_0 + (h(X2) - y2) • X2_0)
       = 0.1 - 0.05 • ((0.786 - 1) • 1 + (0.438 - 0) • 1)
       = 0.089

    T1 = 0.1 - 0.05 • ((h(X1) - y1) • X1_1 + (h(X2) - y2) • X2_1)
       = 0.1 - 0.05 • ((0.786 - 1) • 12.0 + (0.438 - 0) • -3.5)
       = 0.305

    Note that the hypotheses don’t change within the iteration.
  76. 110.

    θ = [T0 T1]. Replace the parameter (weights) vector with the temporaries.
  77. 111.

    θ = [0.089 0.305], α = 0.05. Do the next iteration.
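The iteration worked through above can be reproduced with a short sketch. The function names are illustrative; the data, starting weights, and learning rate are the ones from the slides.

```python
import math


def h(theta, x):
    """Logistic hypothesis for one padded input vector."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every weight."""
    # Compute every hypothesis once, using the current parameters only.
    predictions = [h(theta, x) for x in X]
    # Build the new vector from temporaries; theta itself is not modified.
    return [
        theta_i - alpha * sum((p - yj) * x[i] for p, yj, x in zip(predictions, y, X))
        for i, theta_i in enumerate(theta)
    ]


X = [[1, 12.0], [1, -3.5]]
y = [1, 0]
theta = [0.1, 0.1]

new_theta = gradient_step(theta, X, y, alpha=0.05)
print(new_theta)
```

With unrounded intermediate values the two weights come out at roughly 0.0888 and 0.3051. Returning a fresh list is what keeps the update simultaneous: no weight ever sees a half-updated vector.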
  78. 113.
  79. 115.

    DATA → TRAINING set + TEST set. Train the model on the training set, then test the results on the test set. Lather, rinse, repeat feature selection/synthesis/training until results are "good enough". Pick the best parameters and save them (DB, other).
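A minimal sketch of the split-and-test loop, assuming each row is a `(feature_vector, label)` pair with label 1 = bad and 0 = good. The 30% test fraction and the helper names are arbitrary choices for illustration, not from the talk.

```python
import random


def split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and split them into (training, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    test_size = round(len(rows) * test_fraction)
    cut = len(rows) - test_size
    return rows[:cut], rows[cut:]


def evaluate(predict, test_rows):
    """Count true/false positives/negatives, where positive means 'bad tweet'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for features, label in test_rows:
        if predict(features) == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["fn" if label == 1 else "tn"] += 1
    return counts


rows = [([1, i], i % 2) for i in range(10)]
training, test = split(rows)
print(len(training), len(test))
print(evaluate(lambda features: 1 if features[1] > 4 else 0, test))
```

Since a bad verdict ends in a block, the false positive count ("fp") is the number to watch between rounds.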
  80. 116.

    Putting It All Together: Let’s put our model to use, finally. The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet, though...
  81. 117.

    Load the model: the weights we have calculated via training. Easiest is to load them from the DB (can be used to test different models).
  82. 118.

    HARD CODED RULES
    SKIP:
    ‣ truncated retweets: "RT @A ..."
    ‣ tweets from friends
    ‣ @ mentions of friends

    We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it is fine to skip those.
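The skip rules might be expressed like this. The tweet shape (a dict with `text`, `author`, and `mentions` keys) and the function name are hypothetical, and treating a truncated retweet as any text starting with "RT @a" is an assumption about what the rule matches.

```python
import re


def should_skip(tweet, friends):
    """Return True when a hard-coded rule applies and the model is bypassed.

    `friends` is a set of lowercase screen names.
    """
    if re.match(r"RT @a\b", tweet["text"], re.IGNORECASE):
        return True                                   # truncated retweet
    if tweet["author"].lower() in friends:
        return True                                   # tweet from a friend
    mentioned = {name.lower() for name in tweet.get("mentions", [])}
    return bool(mentioned & friends)                  # mentions a friend


friends = {"bob"}
print(should_skip({"text": "RT @a: some old tweet", "author": "x"}, friends))
print(should_skip({"text": "I'm @a bar", "author": "y", "mentions": ["a"]}, friends))
```

Running these cheap checks first means the classifier only sees tweets whose status is actually in doubt.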
  86. 126.

    Remember this? $h_\theta(X) = \frac{1}{1 + e^{-\theta \cdot X}}$ and $\theta \cdot X = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots$ First is our hypothesis.
  87. 127.

    Finally: $h_\theta(X) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots)}}$. If h > 0.5, the tweet is bad, otherwise good. Remember that the output of h() is 0..1 (probability). 0.5 is your threshold; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
  88. 128.

    3 simple steps
    ‣ extract features: invoke the feature extractor to construct the feature vector for this tweet.
    ‣ run the model: evaluate the decision function over the feature vector (input the calculated feature parameters into the equation).
    ‣ act on the result: use the output of the classifier.
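Put together, the three steps could look like the sketch below. The `extract` callable, the toy weights, and the returned labels are placeholders for illustration; the 0.9 default mirrors the threshold mentioned earlier in the talk.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(tweet, theta, extract, threshold=0.9):
    x = extract(tweet)                                    # 1. extract features
    p_bad = sigmoid(sum(t * xi for t, xi in zip(theta, x)))   # 2. run the model
    verdict = "bad" if p_bad > threshold else "good"      # 3. act on the result
    return verdict, p_bad


# Toy extractor and weights, purely for demonstration.
def toy_extract(tweet):
    return [1, len(tweet["text"])]


toy_theta = [-5.0, 1.0]
print(classify({"text": "a" * 10}, toy_theta, toy_extract))
print(classify({"text": "a" * 6}, toy_theta, toy_extract))
```

Raising the threshold trades missed garbage for fewer wrongly blocked people, which matters when the action taken is a block.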
  91. 132.

    Lessons Learned
    ‣ Blocking is the only option (and is final). No way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
    ‣ Streaming API delivery is incomplete. Some tweets are shown on the website, but never seen through the API.
    ‣ ReplyCleaner judged to be ~80% effective. Lots of room for improvement.
    ‣ Twitter API is a pain in the rear. Connection handling, backoff in case of problems, undocumented API errors, etc.
    ‣ PHP sucks at math-y stuff.
  97. 138.

    NEXT STEPS
    ★ Realtime feedback: click on the tweets that are bad and it immediately incorporates them into the model.
    ★ More features
    ★ Grammar analysis: to eliminate the common "@a bar" or "two @a time" occurrences.
    ★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
    ★ Clockwork Raven for manual classification: farm out manual classification to Mechanical Turk.
    ★ Other minimization algos (BFGS, conjugate gradient): may help avoid local minima, no need to pick alpha, often faster than GD.
    ★ Need pecl/scikit-learn
  98. 139.

    TOOLS
    ★ MongoDB (great fit for JSON data)
    ★ pear/Text_LanguageDetect
    ★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
    ★ Parts-Of-Speech tagging
    ★ SplFixedArray in PHP (memory savings and slightly faster)
    ★ phirehose
    ★ Python’s scikit-learn (for validation)
    ★ Code sample
  99. 140.