"Small Data" Machine Learning

Small Data Machine Learning Andrei Zmievski

WORK We are all superheroes, because we help our customers
keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.

MATH SOME AWE

@a Acquired in October 2008. Had a different account earlier,
but then @k asked if I wanted it. I know many other single-letter Twitterers.

FAME Advantages

FAME FORTUNE Wall Street Journal?!

lol, what?! FAME FORTUNE FOLLOWERS

140-length(“@a “)=137 MAXIMUM REPLY SPACE!

CONS (Disadvantages)
I hate humanity.
Visual filtering is next to impossible. Could be a set of hard-coded rules derived empirically.

Machine Learning to the Rescue! Being grumpy makes you learn
stuff

REPLYCLEANER
- Uses a trained model to classify tweets into good/bad.
- Blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline.
- Even with false negatives, reduces garbage to where visual filtering is possible.

I still hate humanity

Machine Learning A branch of Artificial Intelligence. No widely accepted
definition.

“Field of study that gives computers the ability to learn
without being explicitly programmed.” — Arthur Samuel (1959)

SPAM FILTERING

RECOMMENDATIONS

TRANSLATION

CLUSTERING And many more: medical diagnoses, detecting credit card fraud,
etc.

supervised: Labeled dataset; training maps input to desired outputs.
Examples: regression (predicting house prices), classification (spam filtering).

unsupervised: No labels in the dataset; the algorithm needs to
find structure. Example: clustering.

Feature individual measurable property of the phenomenon under observation usually
numeric

Feature Vector a set of features for an observation Think
of it as an array

Features and weights

feature          weight
1 (padding)      45.7
# of rooms       102.3
sq. footage      0.94
house age        -10.1
yard?            83.0

The feature vector is paired with a weights vector. A 1 is added to pad the feature vector: it accounts for the initial offset (bias / intercept weight) and simplifies the calculation. The dot product of the two vectors produces a linear predictor (the prediction).

Dot product

$$X = \begin{bmatrix} 1 & x_1 & x_2 & \dots \end{bmatrix}$$
$$\theta = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \dots \end{bmatrix}$$
$$\theta \cdot X = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots$$
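A minimal Python sketch of the linear predictor may help here. The weights are the ones shown on the slide; the house itself (3 rooms, 1200 sq. ft., 20 years old, has a yard) is a made-up example, and `dot` is an illustrative helper name rather than anything from the talk.

```python
# Weights from the slide; the leading 45.7 pairs with the padding 1 (intercept).
weights = [45.7, 102.3, 0.94, -10.1, 83.0]

# Hypothetical house (not from the talk): padding 1, 3 rooms,
# 1200 sq. ft., 20 years old, yard present (1 = yes).
features = [1, 3, 1200, 20, 1]


def dot(theta, x):
    """Return the dot product of two equal-length vectors."""
    if len(theta) != len(x):
        raise ValueError("vectors must have the same length")
    return sum(t * v for t, v in zip(theta, x))


# theta . X = theta_0 * 1 + theta_1 * x_1 + theta_2 * x_2 + ...
prediction = dot(weights, features)
print(round(prediction, 1))
```

Because the feature vector starts with 1, the intercept needs no special handling; it is just another term of the sum.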

training data → learning algorithm → hypothesis
Hypothesis (decision function): what the system has learned so far. The hypothesis is applied to new data.

$h_\theta(X) = y$, where $\theta$ are the parameters, $X$ is the input data, and $y$ is the prediction.
The task of our algorithm is to determine the parameters of the hypothesis.

LINEAR REGRESSION
[Plot: whisky price ($) versus whisky age]
Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.

LOGISTIC REGRESSION
$$g(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta \cdot X$$
[Plot: g(z) versus z, rising from 0 to 1 and passing through 0.5 at z = 0]
Logistic function (also called the sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at the origin (z = 0). z is just our old dot product, the linear predictor.

LOGISTIC REGRESSION
$$h_\theta(X) = \frac{1}{1 + e^{-\theta \cdot X}}$$
Probability that y=1 for input X. If the hypothesis describes spam, then given X = body of email, an output of 0.7 means there’s a 70% chance it’s spam. Thresholding on that is up to you.
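The logistic hypothesis is short enough to sketch in Python. The function names are illustrative, and the two-branch form of `sigmoid` is an addition of mine to avoid floating-point overflow for very negative z; it computes the same function as the formula on the slide.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z is negative, so e^z is in (0, 1)
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """Estimated probability that y = 1 for feature vector x (x[0] must be 1)."""
    z = sum(t * v for t, v in zip(theta, x))  # the linear predictor
    return sigmoid(z)


print(sigmoid(0))  # the curve crosses 0.5 at z = 0
```

The output is always between 0 and 1, which is what makes it usable as a probability, unlike the unbounded output of linear regression.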

Building the Tool

Corpus collection of source data used for training and testing
the model

Twitter → MongoDB via phirehose (8500 tweets). phirehose hooks into the streaming API.

Feature Identification

independent & discriminant
Independent: feature A should not co-occur (correlate) highly with feature B. Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).

Possible features:
‣ @a at the end of the tweet
‣ @a...
‣ length < N chars
‣ # of user mentions in the tweet
‣ # of hashtags
‣ language!
‣ @a followed by punctuation and a word character (except for apostrophe)

feature = extractor(tweet) For each feature, write a small function
that takes a tweet and returns a numeric value

corpus → extractors → feature vectors
Run the set of these functions over the corpus and build up feature vectors (an array of arrays). Save to DB.
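As a rough Python sketch of the extractor idea: each function takes a tweet and returns a number, and the feature vector is the list of those numbers with a leading 1. Everything here is an assumption for illustration: tweets are treated as plain strings rather than the JSON objects the streaming API delivers, the regular expressions are simplistic, and the length cutoff of 20 stands in for the unspecified N.

```python
import re


def ends_with_mention(tweet):
    """1 if the tweet ends with @a, else 0."""
    return 1 if tweet.rstrip().lower().endswith("@a") else 0


def is_short(tweet, n=20):
    """1 if the tweet is shorter than n characters (n is a hypothetical cutoff)."""
    return 1 if len(tweet) < n else 0


def count_mentions(tweet):
    """Number of @user mentions, using a simplistic pattern."""
    return len(re.findall(r"@\w+", tweet))


def count_hashtags(tweet):
    """Number of #hashtags, using a simplistic pattern."""
    return len(re.findall(r"#\w+", tweet))


EXTRACTORS = [ends_with_mention, is_short, count_mentions, count_hashtags]


def feature_vector(tweet):
    """Run every extractor over one tweet; the leading 1 pads the vector."""
    return [1] + [extract(tweet) for extract in EXTRACTORS]


# Two made-up tweets standing in for the corpus.
corpus = ["I saw @a movie today #fun", "@a thanks for the talk!"]
vectors = [feature_vector(tweet) for tweet in corpus]  # array of arrays
print(vectors)
```

Keeping each extractor as its own small function makes it cheap to add or drop a feature and retrain.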

Language Matters high correlation between the language of the tweet
and its category (good/bad)

Indonesian or Tagalog? Garbage.

Top 12 Languages

code  language    count
id    Indonesian  3548
en    English     1804
tl    Tagalog     733
es    Spanish     329
so    Somali      305
ja    Japanese    300
pt    Portuguese  262
ar    Arabic      256
nl    Dutch       150
it    Italian     137
sw    Swahili     118
fr    French      92

I guarantee you people aren’t tweeting at me in Swahili.

Language Detection
pear/Text_LanguageDetect, pecl/textcat
Can’t trust the language field in the user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.

✓ Clean up the text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown/low weight, pretend it’s English, else:
✓ If not a character set-determined language, try harder:
  ✓ Tokenize into words
  ✓ Difference with English vocabulary
  ✓ If words remain, run parts-of-speech tagger on each
  ✓ For NNS, VBZ, and VBD run stemming algorithm
  ✓ If result is in English vocabulary, remove from remaining
  ✓ If remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words)
  ✓ If ratio < 20%, pretend it’s English
Outcome: English / Not English
A lot of this is heuristic-based, after some trial and error. Seems to help with my corpus.
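The "try harder" part of the checklist above can be sketched in Python as follows. This is a simplified illustration, not the talk's implementation: it skips the language detector, the parts-of-speech tagger and the stemmer, and `ENGLISH_VOCAB` is a tiny stand-in for a real English vocabulary corpus. Only the clean-up, the vocabulary difference and the 20% ratio test are shown.

```python
import re

# Tiny stand-in vocabulary (hypothetical); a real run would load a word list.
ENGLISH_VOCAB = {"the", "talk", "was", "great", "thanks", "for", "a", "i", "loved", "it"}


def clean(text):
    """Remove links, mentions and hashtags before looking at the words."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"[@#]\w+", " ", text)


def unusual_word_ratio(text, vocabulary):
    """size(remaining) / size(words), where remaining = words not in vocabulary."""
    words = re.findall(r"[a-z']+", clean(text).lower())
    if not words:
        return 0.0  # nothing to judge; the caller then treats it as English
    remaining = [word for word in words if word not in vocabulary]
    return len(remaining) / len(words)


def looks_english(text, vocabulary=ENGLISH_VOCAB, threshold=0.2):
    """Pretend it's English when fewer than 20% of the words are unusual."""
    return unusual_word_ratio(text, vocabulary) < threshold


print(looks_english("@a thanks for the great talk http://t.co/x"))
print(looks_english("@a makan nasi goreng"))
```

In the full procedure, stemming plural nouns and inflected verbs before the final ratio would keep ordinary English word forms from being counted as unusual.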

BINARY CLASSIFICATION Grunt work Built a web-based tool to display
tweets a page at a time and select good ones

INPUT: feature vectors. OUTPUT: labels (good/bad).
Had my input and output.

BIAS CORRECTION One more thing to address

BIAS CORRECTION
BAD vs. GOOD: 99% of the tweets were bad (fewer than 100 tweets were good). Training a model as-is would not produce good results. Need to adjust the bias.

OVERSAMPLING
Use multiple copies of good tweets to equalize with bad. Problem: the bias is very high, so each good tweet would have to be copied 100 times and would not contribute any variance to the good category.

UNDERSAMPLING
Drop most of the bad tweets to equalize with good. Problem: total corpus ends up being < 200 tweets, not enough for training.

SYNTHETIC OVERSAMPLING
Synthesize feature vectors by determining what constitutes a good tweet and doing weighted random selection of feature values.

chance  feature
90%     “good” language
70%     no hashtags
25%     1 hashtag
5%      2 hashtags
2%      @a at the end
85%     rand length > 10

Example synthesized vector: [1, 2, 0, 77]

The actual synthesis is somewhat more complex and was also based on trial and error. Synthesized tweets + existing good tweets = 2/3 of the # of bad tweets in the training corpus (limited to 1000).
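A minimal Python sketch of weighted random selection using the chances from the table. It is only an approximation of the idea: the talk says the real synthesis is more complex, and the length ranges (11 to 140 for the 85% case, 1 to 10 otherwise) are my assumptions, as is the function name.

```python
import random


def synthesize_good_vector(rng):
    """Draw one synthetic 'good tweet' feature vector.

    Order: [good language, # of hashtags, @a at the end, length].
    """
    good_language = 1 if rng.random() < 0.90 else 0
    # 70% no hashtags, 25% one hashtag, 5% two hashtags.
    hashtags = rng.choices([0, 1, 2], weights=[70, 25, 5])[0]
    ends_with_mention = 1 if rng.random() < 0.02 else 0
    if rng.random() < 0.85:
        length = rng.randint(11, 140)  # assumed range for "length > 10"
    else:
        length = rng.randint(1, 10)    # assumed range for the remaining 15%
    return [good_language, hashtags, ends_with_mention, length]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
synthetic = [synthesize_good_vector(rng) for _ in range(1000)]
print(synthetic[:3])
```

Unlike plain oversampling, every synthesized vector can differ from the others, so the good category gains some variance instead of 100 identical copies.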

Model Training We have the hypothesis (decision function) and the
training set. How do we actually determine the weights/parameters?

COST FUNCTION
[Diagram: REALITY vs. PREDICTION]
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^i), y^i)$$
Measures how far the prediction of the system is from reality. The cost depends on the parameters. The lower the cost, the closer we are to the ideal parameters for the model.

LOGISTIC COST
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

[Plots: cost versus h(x) for y=1 and for y=0. Correct guess: cost = 0. Incorrect guess: cost = huge.]
When y=1 and h(x) is 1 (good guess), the cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y=0.
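The logistic cost can be sketched in a few lines of Python. The clamping constant `EPS` is my addition so that a prediction of exactly 0 or 1 does not hit log(0); the function names are illustrative.

```python
import math

EPS = 1e-12  # keeps predictions strictly inside (0, 1) so log() is defined


def cost(h, y):
    """Logistic cost for one example: -log(h) if y = 1, -log(1 - h) if y = 0."""
    h = min(max(h, EPS), 1.0 - EPS)
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


def total_cost(predictions, labels):
    """J(theta): the average cost over all m examples."""
    m = len(labels)
    return sum(cost(h, y) for h, y in zip(predictions, labels)) / m


# A confident wrong guess costs far more than an unsure one.
print(cost(0.99, 1), cost(0.5, 1), cost(0.01, 1))
```

The penalty grows without bound as a wrong prediction becomes more confident, which is what pushes the parameters away from such predictions during training.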

minimize cost OVER θ Finding the best values of Theta
that minimize the cost

GRADIENT DESCENT Random starting point. Pretend you’re standing on a
hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down from a hill.

GRADIENT DESCENT
$$\theta_i = \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}$$
Each step adjusts the parameters according to the slope.
- $\theta_i$: each parameter. Have to update them simultaneously (the whole vector at a time).
- $\alpha$: the learning rate. Controls how big a step you take. If α is big, gradient descent is aggressive; if α is small, it takes tiny steps. If too small, it takes too long. If too big, it can overshoot the minimum and fail to converge.
- $\partial J(\theta) / \partial \theta_i$: the derivative, aka “the slope”. It indicates the steepness of the descent step for each weight, i.e. the direction.
Keep going for a number of iterations or until the cost is below a threshold (convergence). Graph the cost function versus # of iterations and see where it starts to approach 0; past that are diminishing returns.

FINAL UPDATE ALGORITHM
$$\theta_i = \theta_i - \alpha \sum_{j=1}^{m} (h_\theta(x^j) - y^j)\, x_i^j$$
The derivative for logistic regression simplifies to this term (the constant factor 1/m is absorbed into α). Have to update the weights simultaneously!

Worked example (one iteration)

X1 = [1 12.0]   y1 = 1
X2 = [1 -3.5]   y2 = 0
θ = [0.1 0.1]   α = 0.05

Hypothesis for each data point based on current parameters:
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438

Each parameter is updated in order and the result is saved to a temporary:
T0 = 0.1 - 0.05 • ((h(X1) - y1) • X1_0 + (h(X2) - y2) • X2_0)
   = 0.1 - 0.05 • ((0.786 - 1) • 1 + (0.438 - 0) • 1)
   = 0.089
T1 = 0.1 - 0.05 • ((h(X1) - y1) • X1_1 + (h(X2) - y2) • X2_1)
   = 0.1 - 0.05 • ((0.786 - 1) • 12.0 + (0.438 - 0) • -3.5)
   = 0.305

Note that the hypotheses don’t change within the iteration.

Replace the parameter (weights) vector with the temporaries: θ = [T0 T1] = [0.089 0.305]. Do the next iteration.
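The single iteration worked through on these slides can be checked with a short Python sketch. The data, starting weights and learning rate are the ones from the example; the function names are illustrative. All hypotheses are computed once, before any weight changes, which is what makes the update simultaneous.

```python
import math


def hypothesis(theta, x):
    """Logistic hypothesis: 1 / (1 + e^-(theta . x))."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(theta, X, y, alpha):
    """One batch gradient descent iteration with a simultaneous update."""
    # Hypotheses use the current parameters and stay fixed for the whole step.
    predictions = [hypothesis(theta, x) for x in X]
    temporaries = []
    for i in range(len(theta)):
        slope = sum((h - yj) * x[i] for h, yj, x in zip(predictions, y, X))
        temporaries.append(theta[i] - alpha * slope)
    return temporaries  # replaces the old weights vector as a whole


X = [[1, 12.0], [1, -3.5]]
y = [1, 0]
theta = gradient_step([0.1, 0.1], X, y, alpha=0.05)
print([round(t, 3) for t in theta])
```

Calling `gradient_step` in a loop, and stopping after a fixed number of iterations or once the cost stops improving, gives the full training procedure.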

CROSS Training
Used to assess the results of the training.

DATA: TRAINING | TEST
Train the model on the training set, then test the results on the test set. Rinse, lather, repeat feature selection/synthesis/training until results are "good enough". Pick the best parameters and save them (DB, other).
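A small Python sketch of holding out a test set and scoring predictions against it. The 30% test fraction, the fixed seed and the helper names are assumptions; the talk does not say how the split was made.

```python
import random


def split_dataset(vectors, labels, test_fraction=0.3, seed=0):
    """Shuffle the examples and split them into (train, test) pairs of lists."""
    indices = list(range(len(vectors)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([vectors[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([vectors[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


def accuracy(predicted, actual):
    """Fraction of test examples whose predicted label matches the real one."""
    if not actual:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Made-up data: ten feature vectors with alternating labels.
vectors = [[1, i] for i in range(10)]
labels = [i % 2 for i in range(10)]
(train_X, train_y), (test_X, test_y) = split_dataset(vectors, labels)
print(len(train_X), len(test_X))
```

With a corpus as skewed as this one, plain accuracy can look good even for a poor model, so counting false positives and false negatives separately is worth considering.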

Putting It All Together Let’s put our model to use,
finally. The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet, though...

Load the model The weights we have calculated via training
Easiest is to load them from DB (can be used to test different models).

HARD CODED RULES
We apply some hardcoded rules to filter out the tweets we are certain are good or bad. SKIP:
- truncated retweets: "RT @A ..."
- @ mentions of friends
- tweets from friends
The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
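One way the skip rules might look in Python. This is a guess at the logic, not the tool's code: the tweet is a plain dict with hypothetical keys `text`, `author` and `mentions`, `friends` is a set of lowercase screen names, and treating any tweet that begins with "RT @a" as a truncated retweet is my simplification.

```python
import re


def should_skip(tweet, friends, me="a"):
    """Return True if a hard-coded rule says the classifier is not needed."""
    text = tweet["text"].strip()
    # Rule 1: retweets of the form "RT @a ..." (simplified truncation check).
    if re.match(r"rt @%s\b" % re.escape(me), text, re.IGNORECASE):
        return True
    # Rule 2: tweets written by friends.
    if tweet["author"].lower() in friends:
        return True
    # Rule 3: tweets that also mention a friend.
    others = [name.lower() for name in tweet["mentions"] if name.lower() != me]
    return any(name in friends for name in others)


friends = {"k"}  # hypothetical friend list
tweet = {"text": "@a @k lunch?", "author": "x", "mentions": ["a", "k"]}
print(should_skip(tweet, friends))
```

Running these checks before the model keeps obviously safe tweets from ever being scored, which matters when the only available action is an irreversible block.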

Classifying Tweets: GOOD or BAD
This is the moment we’ve been waiting for.

Remember this? First is our hypothesis.
$$h_\theta(X) = \frac{1}{1 + e^{-\theta \cdot X}}$$
$$\theta \cdot X = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots$$

Finally
$$h_\theta(X) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots)}}$$
If h > 0.5, the tweet is bad, otherwise good. Remember that the output of h() is 0..1 (probability). 0.5 is your threshold; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
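Thresholding the hypothesis can be sketched in Python like this. The weights and feature vectors are made up for illustration; only the 0.5 and 0.9 thresholds come from the talk.

```python
import math


def hypothesis(theta, x):
    """Probability that the tweet is bad, given its feature vector x."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def classify(theta, x, threshold=0.9):
    """Label a tweet 'bad' only when the model is confident enough."""
    return "bad" if hypothesis(theta, x) > threshold else "good"


theta = [-1.0, 2.0, 0.5]  # hypothetical trained weights
borderline = [1, 1, 0]    # h is about 0.73
obvious = [1, 3, 0]       # h is about 0.99

print(classify(theta, borderline, threshold=0.5))
print(classify(theta, borderline))
print(classify(theta, obvious))
```

Raising the threshold trades more missed garbage (false negatives) for fewer wrongly blocked people (false positives), which suits a tool whose action is final.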

3 simple steps
- extract features: invoke the feature extractor to construct the feature vector for this tweet.
- run the model: evaluate the decision function over the feature vector (input the calculated feature parameters into the equation).
- act on the result: use the output of the classifier.

BAD? block user! Also save the tweet to DB for
future analysis.

Lessons Learned
- Blocking is the only option (and is final). No way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
- Streaming API delivery is incomplete. Some tweets are shown on the website, but never seen through the API.
- ReplyCleaner judged to be ~80% effective. Lots of room for improvement.
- Twitter API is a pain in the rear. Connection handling, backoff in case of problems, undocumented API errors, etc.
- PHP sucks at math-y stuff.

NEXT STEPS
★ Realtime feedback: click on the tweets that are bad and it immediately incorporates them into the model.
★ More features
★ Grammar analysis: to eliminate the common "@a bar" or "two @a time" occurrences.
★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
★ Clockwork Raven for manual classification: farm out manual classification to Mechanical Turk.
★ Other minimization algos (BFGS, conjugate gradient): may help avoid local minima, no need to pick alpha, often faster than GD.
★ Need pecl/scikit-learn

TOOLS
★ MongoDB (great fit for JSON data)
★ pear/Text_LanguageDetect
★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
★ Parts-Of-Speech tagging
★ SplFixedArray in PHP (memory savings and slightly faster)
★ phirehose
★ Python’s scikit-learn (for validation)
★ Code sample

Questions?