Slide 1

Slide 1 text

Small Data Machine Learning Andrei Zmievski

Slide 2

Slide 2 text

WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.

Slide 3

Slide 3 text

WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.

Slide 4

Slide 4 text

MATH

Slide 5

Slide 5 text

MATH SOME

Slide 6

Slide 6 text

MATH SOME AWE

Slide 7

Slide 7 text

@a Acquired in October 2008. Had a different account earlier, but then @k asked if I wanted it. I know many other single-letter Twitterers.

Slide 8

Slide 8 text

FAME Advantages

Slide 9

Slide 9 text

FAME FORTUNE

Slide 10

Slide 10 text

FAME FORTUNE Wall Street Journal?!

Slide 11

Slide 11 text

FAME FORTUNE FOLLOWERS

Slide 12

Slide 12 text

lol, what?! FAME FORTUNE FOLLOWERS

Slide 13

Slide 13 text

140 - length("@a ") = 137 MAXIMUM REPLY SPACE!

Slide 14

Slide 14 text

CONS Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically

Slide 15

Slide 15 text

CONS I hate humanity Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically

Slide 16

Slide 16 text

CONS I hate humanity Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically

Slide 17

Slide 17 text

Machine Learning to the Rescue! Being grumpy makes you learn stuff

Slide 18

Slide 18 text

REPLYCLEANER Even with false negatives, reduces garbage to where visual filtering is possible - uses trained model to classify tweets into good/bad - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline

Slide 19

Slide 19 text

REPLYCLEANER Even with false negatives, reduces garbage to where visual filtering is possible - uses trained model to classify tweets into good/bad - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline

Slide 20

Slide 20 text

REPLYCLEANER

Slide 21

Slide 21 text

REPLYCLEANER

Slide 22

Slide 22 text

REPLYCLEANER

Slide 23

Slide 23 text

REPLYCLEANER

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

I still hate humanity

Slide 26

Slide 26 text

I still hate humanity I still hate humanity

Slide 27

Slide 27 text

Machine Learning: a branch of Artificial Intelligence. No widely accepted definition.

Slide 28

Slide 28 text

“Field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel (1959)

Slide 29

Slide 29 text

SPAM FILTERING

Slide 30

Slide 30 text

RECOMMENDATIONS

Slide 31

Slide 31 text

TRANSLATION

Slide 32

Slide 32 text

CLUSTERING And many more: medical diagnoses, detecting credit card fraud, etc.

Slide 33

Slide 33 text

supervised vs. unsupervised. Supervised: labeled dataset; training maps inputs to desired outputs. Examples: regression (predicting house prices), classification (spam filtering).

Slide 34

Slide 34 text

supervised vs. unsupervised. Unsupervised: no labels in the dataset; the algorithm needs to find structure. Example: clustering.

Slide 35

Slide 35 text

Feature: an individual measurable property of the phenomenon under observation; usually numeric.

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Feature Vector: a set of features for an observation. Think of it as an array.

Slide 38

Slide 38 text

features: # of rooms, sq. footage, house age, yard? Feature vector and weights vector; a 1 is added to pad the vector (accounts for the initial offset / bias / intercept weight, simplifies calculation). The dot product produces a linear predictor.

Slide 39

Slide 39 text

features: # of rooms, sq. footage, house age, yard? weights: 102.3, 0.94, -10.1, 83.0. Feature vector and weights vector; a 1 is added to pad the vector (accounts for the initial offset / bias / intercept weight, simplifies calculation). The dot product produces a linear predictor.

Slide 40

Slide 40 text

features: # of rooms, sq. footage, house age, yard?, plus a padding 1; weights: 102.3, 0.94, -10.1, 83.0, 45.7. The 1 is added to pad the vector (accounts for the initial offset / bias / intercept weight, simplifies calculation). The dot product produces a linear predictor.

Slide 41

Slide 41 text

features: # of rooms, sq. footage, house age, yard?, plus a padding 1; weights: 102.3, 0.94, -10.1, 83.0, 45.7. The 1 is added to pad the vector (accounts for the initial offset / bias / intercept weight, simplifies calculation). The dot product produces a linear predictor.

Slide 42

Slide 42 text

features · weights = prediction. features: # of rooms, sq. footage, house age, yard?, plus a padding 1; weights: 102.3, 0.94, -10.1, 83.0, 45.7. The 1 is added to pad the vector (accounts for the initial offset / bias / intercept weight, simplifies calculation). The dot product produces a linear predictor.

Slide 43

Slide 43 text

X = [1 x1 x2 ...]   θ = [θ0 θ1 θ2 ...]   (dot product)

Slide 44

Slide 44 text

X = [1 x1 x2 ...]   θ = [θ0 θ1 θ2 ...]   θ · X = θ0 + θ1·x1 + θ2·x2 + ...   (dot product)
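As a concrete aside (not on the slides), a minimal PHP sketch of that dot product, padding the feature vector with a leading 1 as in X = [1 x1 x2 ...]. The example feature values (rooms, square footage, age, yard) are made up; the weights are the ones from the earlier house-price slides, with 45.7 assumed to be the bias/intercept weight θ0.

<?php
// Dot product of a weight vector and a (padded) feature vector.
// Both arrays are assumed to have the same length and ordering.
function dot_product(array $theta, array $features): float
{
    $sum = 0.0;
    foreach ($theta as $i => $weight) {
        $sum += $weight * $features[$i];
    }
    return $sum;
}

// Hypothetical house: 4 rooms, 2100 sq. ft., 25 years old, has a yard.
$features = [1, 4, 2100, 25, 1];              // leading 1 = padding for the bias
$theta    = [45.7, 102.3, 0.94, -10.1, 83.0]; // bias first, then one weight per feature
echo dot_product($theta, $features), "\n";    // the linear predictor (a price estimate)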

Slide 45

Slide 45 text

training data → learning algorithm → hypothesis. The hypothesis (decision function) is what the system has learned so far; the hypothesis is applied to new data.

Slide 46

Slide 46 text

h_θ(X). The task of our algorithm is to determine the parameters of the hypothesis.

Slide 47

Slide 47 text

h_θ(X), where X is the input data. The task of our algorithm is to determine the parameters of the hypothesis.

Slide 48

Slide 48 text

h_θ(X), where θ are the parameters and X is the input data. The task of our algorithm is to determine the parameters of the hypothesis.

Slide 49

Slide 49 text

h_θ(X) = y, where θ are the parameters, X is the input data, and y is the prediction. The task of our algorithm is to determine the parameters of the hypothesis.

Slide 50

Slide 50 text

LINEAR REGRESSION (scatter plot: whisky age on the x-axis, whisky price $ on the y-axis). Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded; thresholding on some value is tricky and does not produce good results.

Slide 51

Slide 51 text

LINEAR REGRESSION (scatter plot: whisky age on the x-axis, whisky price $ on the y-axis). Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded; thresholding on some value is tricky and does not produce good results.

Slide 52

Slide 52 text

LINEAR REGRESSION (scatter plot: whisky age on the x-axis, whisky price $ on the y-axis). Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded; thresholding on some value is tricky and does not produce good results.

Slide 53

Slide 53 text

LOGISTIC REGRESSION: g(z) = 1 / (1 + e^(-z)). Logistic function (also sigmoid function). Asymptotes at 0 and 1, crosses 0.5 at the origin. z is just our old dot product, the linear predictor.

Slide 54

Slide 54 text

LOGISTIC REGRESSION: g(z) = 1 / (1 + e^(-z)), with z = θ · X. Logistic function (also sigmoid function). Asymptotes at 0 and 1, crosses 0.5 at the origin. z is just our old dot product, the linear predictor.

Slide 55

Slide 55 text

LOGISTIC REGRESSION: h_θ(X) = 1 / (1 + e^(-θ·X)), the probability that y = 1 for input X. If the hypothesis describes spam, then given X = body of an email, h_θ(X) = 0.7 means there's a 70% chance it's spam. Thresholding on that is up to you.
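A minimal PHP sketch of that hypothesis (not from the slides): the sigmoid applied to the linear predictor θ · X. Function names are illustrative.

<?php
// Logistic (sigmoid) function: squashes any real z into (0, 1).
function sigmoid(float $z): float
{
    return 1.0 / (1.0 + exp(-$z));
}

// Hypothesis: probability that y = 1 for the given (padded) feature vector.
function hypothesis(array $theta, array $features): float
{
    $z = 0.0;
    foreach ($theta as $i => $weight) {   // theta . X, the linear predictor
        $z += $weight * $features[$i];
    }
    return sigmoid($z);
}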

Slide 56

Slide 56 text

Building the Tool

Slide 57

Slide 57 text

Corpus: a collection of source data used for training and testing the model.

Slide 58

Slide 58 text

Twitter → phirehose → MongoDB. phirehose hooks into the streaming API.

Slide 59

Slide 59 text

Twitter → phirehose → MongoDB. phirehose hooks into the streaming API. 8500 tweets collected.

Slide 60

Slide 60 text

Feature Identification

Slide 61

Slide 61 text

independent & discriminant. Independent: feature A should not correlate highly with feature B. Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).

Slide 62

Slide 62 text

possible features
‣ @a at the end of the tweet
‣ @a...
‣ length < N chars
‣ # of user mentions in the tweet
‣ # of hashtags
‣ language!
‣ @a followed by punctuation and a word character (except for apostrophe)

Slide 63

Slide 63 text

feature = extractor(tweet). For each feature, write a small function that takes a tweet and returns a numeric value.

Slide 64

Slide 64 text

corpus → extractors → feature vectors. Run the set of these functions over the corpus and build up the feature vectors (an array of arrays). Save to DB.
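A hedged sketch of what those extractors might look like in PHP; the specific feature functions and their names are illustrative, not the ones actually used.

<?php
// Hypothetical feature extractors: each takes the tweet text and returns a number.
$extractors = [
    'at_a_end'      => fn(string $t): int => (int) preg_match('/@a\s*$/i', $t),
    'length'        => fn(string $t): int => mb_strlen($t),
    'mention_count' => fn(string $t): int => (int) preg_match_all('/@\w+/', $t),
    'hashtag_count' => fn(string $t): int => (int) preg_match_all('/#\w+/', $t),
];

// Build one feature vector per tweet, padded with a leading 1 for the bias weight.
function extract_features(string $tweet, array $extractors): array
{
    $vector = [1];
    foreach ($extractors as $extractor) {
        $vector[] = $extractor($tweet);
    }
    return $vector;
}

// Run over the corpus (array of tweet texts) to get the array of arrays.
$corpus  = ['@a great talk!', 'follow me!!! #spam #follow @a'];
$vectors = array_map(fn(string $t): array => extract_features($t, $extractors), $corpus);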

Slide 65

Slide 65 text

Language Matters high correlation between the language of the tweet and its category (good/bad)

Slide 66

Slide 66 text

Indonesian or Tagalog? Garbage.

Slide 67

Slide 67 text

Top 12 Languages
id Indonesian 3548
en English 1804
tl Tagalog 733
es Spanish 329
so Somali 305
ja Japanese 300
pt Portuguese 262
ar Arabic 256
nl Dutch 150
it Italian 137
sw Swahili 118
fr French 92
I guarantee you people aren't tweeting at me in Swahili.

Slide 68

Slide 68 text

Language Detection. Can't trust the language field in the user's profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.

Slide 69

Slide 69 text

Language Detection: pear/Text_LanguageDetect, pecl/textcat. Can't trust the language field in the user's profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.

Slide 70

Slide 70 text

EnglishNotEnglish
✓ Clean up the text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown / low weight, pretend it's English; else:
✓ If not a character-set-determined language, try harder:
✓ Tokenize into words
✓ Difference with English vocabulary
✓ If words remain, run a parts-of-speech tagger on each
✓ For NNS, VBZ, and VBD run a stemming algorithm
✓ If the result is in the English vocabulary, remove it from remaining
✓ If the remaining list is not empty, calculate: unusual_word_ratio = size(remaining) / size(words)
✓ If ratio < 20%, pretend it's English
A lot of this is heuristic-based, after some trial and error. Seems to help with my corpus.
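For illustration only, a heavily simplified PHP sketch of just the last step of that heuristic (the unusual-word ratio); the cleanup regex is a guess, the POS-tagging and stemming steps are omitted, and $vocabulary is assumed to be an English word list loaded elsewhere as a set (word => true).

<?php
// Decide whether to treat a tweet as English based on the ratio of words
// that are not in the English vocabulary. Simplified: no POS tagging/stemming.
function looks_english(string $text, array $vocabulary): bool
{
    $text  = preg_replace('/@\w+|https?:\/\/\S+/', ' ', $text);  // strip mentions, links
    $words = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return true;
    }
    $remaining = array_filter($words, fn(string $w): bool => !isset($vocabulary[$w]));
    $ratio     = count($remaining) / count($words);
    return $ratio < 0.20;   // fewer than 20% unusual words: pretend it's English
}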

Slide 71

Slide 71 text

BINARY CLASSIFICATION. Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.

Slide 72

Slide 72 text

INPUT: feature vectors. OUTPUT: labels (good/bad). Had my input and output.

Slide 73

Slide 73 text

BIAS CORRECTION One more thing to address

Slide 74

Slide 74 text

BIAS CORRECTION: BAD vs. GOOD. 99% of the corpus = bad (fewer than 100 tweets were good). Training a model as-is would not produce good results; need to adjust the bias.

Slide 75

Slide 75 text

BIAS CORRECTION BAD GOOD

Slide 76

Slide 76 text

OVERSAMPLING. Oversampling: use multiple copies of the good tweets to equalize with the bad. Problem: the bias is very high; each good tweet would have to be copied 100 times and would not contribute any variance to the good category.

Slide 77

Slide 77 text

OVERSAMPLING. Oversampling: use multiple copies of the good tweets to equalize with the bad. Problem: the bias is very high; each good tweet would have to be copied 100 times and would not contribute any variance to the good category.

Slide 78

Slide 78 text

UNDERSAMPLING. Undersampling: drop most of the bad tweets to equalize with the good. Problem: the total corpus ends up being < 200 tweets, not enough for training.

Slide 79

Slide 79 text

UNDERSAMPLING. Undersampling: drop most of the bad tweets to equalize with the good. Problem: the total corpus ends up being < 200 tweets, not enough for training.

Slide 80

Slide 80 text

Synthetic OVERSAMPLING. Synthesize feature vectors by determining what constitutes a good tweet and doing a weighted random selection of feature values.

Slide 81

Slide 81 text

chance → feature
90% → "good" language
70% / 25% / 5% → no hashtags / 1 hashtag / 2 hashtags
2% → @a at the end
85% → rand length > 10
The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).

Slide 82

Slide 82 text

chance → feature
90% → "good" language
70% / 25% / 5% → no hashtags / 1 hashtag / 2 hashtags
2% → @a at the end
85% → rand length > 10
Sampled values so far: 1.
The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).

Slide 83

Slide 83 text

chance → feature
90% → "good" language
70% / 25% / 5% → no hashtags / 1 hashtag / 2 hashtags
2% → @a at the end
85% → rand length > 10
Sampled values so far: 1, 2.
The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).

Slide 84

Slide 84 text

chance → feature
90% → "good" language
70% / 25% / 5% → no hashtags / 1 hashtag / 2 hashtags
2% → @a at the end
85% → rand length > 10
Sampled values so far: 1, 2, 0.
The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).

Slide 85

Slide 85 text

chance → feature
90% → "good" language
70% / 25% / 5% → no hashtags / 1 hashtag / 2 hashtags
2% → @a at the end
85% → rand length > 10
Sampled values so far: 1, 2, 0, 77.
The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
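A hedged PHP sketch of that kind of synthesis: weighted random selection of feature values with probabilities mirroring the table above. The helper, the vector layout, and the ranges used for the length feature are illustrative assumptions, not the actual code.

<?php
// Pick a value according to a value => probability map (probabilities sum to ~1.0).
function weighted_pick(array $choices): int
{
    $r = mt_rand() / mt_getrandmax();
    foreach ($choices as $value => $p) {
        if (($r -= $p) <= 0) {
            return $value;
        }
    }
    return array_key_last($choices);
}

// Synthesize one "good" feature vector (layout: padding, language, hashtags, @a-at-end, length).
function synthesize_good_tweet(): array
{
    return [
        1,                                                            // padding for the bias weight
        (mt_rand(1, 100) <= 90) ? 1 : 0,                              // "good" language: 90%
        weighted_pick([0 => 0.70, 1 => 0.25, 2 => 0.05]),             // no/1/2 hashtags: 70/25/5%
        (mt_rand(1, 100) <= 2) ? 1 : 0,                               // @a at the end: 2%
        (mt_rand(1, 100) <= 85) ? mt_rand(11, 137) : mt_rand(1, 10),  // random length > 10: 85%
    ];
}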

Slide 86

Slide 86 text

Model Training. We have the hypothesis (decision function) and the training set; how do we actually determine the weights/parameters?

Slide 87

Slide 87 text

COST FUNCTION. Measures how far the prediction of the system is from reality. The cost depends on the parameters. The lower the cost, the closer we are to the ideal parameters for the model.

Slide 88

Slide 88 text

COST FUNCTION: reality vs. prediction. Measures how far the prediction of the system is from reality. The cost depends on the parameters. The lower the cost, the closer we are to the ideal parameters for the model.

Slide 89

Slide 89 text

COST FUNCTION: J(θ) = (1/m) · Σ_{i=1..m} Cost(h_θ(x^(i)), y^(i)). Measures how far the prediction of the system is from reality. The cost depends on the parameters; the lower the cost, the closer we are to the ideal parameters for the model.

Slide 90

Slide 90 text

LOGISTIC COST: Cost(h_θ(x), y) = -log(h_θ(x)) if y = 1; -log(1 - h_θ(x)) if y = 0.

Slide 91

Slide 91 text

Cost curves for y = 1 and y = 0 (cost plotted against h(x) from 0 to 1). Correct guess: cost = 0. Incorrect guess: cost = huge. When y = 1 and h(x) is 1 (good guess), the cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y = 0.
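A small PHP sketch of that cost, averaged over the training set (not from the slides); it reuses the hypothesis() sketch from earlier. $X is the array of feature vectors, $y the array of 0/1 labels.

<?php
// Logistic cost over the whole training set:
// average of -log(h(x)) when y = 1 and -log(1 - h(x)) when y = 0.
function cost(array $theta, array $X, array $y): float
{
    $m   = count($X);
    $sum = 0.0;
    for ($j = 0; $j < $m; $j++) {
        $h    = hypothesis($theta, $X[$j]);
        $sum += ($y[$j] == 1) ? -log($h) : -log(1 - $h);
    }
    return $sum / $m;
}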

Slide 92

Slide 92 text

minimize cost over θ: finding the best values of θ that minimize the cost.

Slide 93

Slide 93 text

GRADIENT DESCENT Random starting point. Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down from a hill.

Slide 94

Slide 94 text

GRADIENT DESCENT: θ_i := θ_i - α · ∂J(θ)/∂θ_i. Each step adjusts the parameters according to the slope.

Slide 95

Slide 95 text

θ_i := θ_i - α · ∂J(θ)/∂θ_i, where θ_i is each parameter. Have to update them simultaneously (the whole vector at a time).

Slide 96

Slide 96 text

θ_i := θ_i - α · ∂J(θ)/∂θ_i, where α is the learning rate. It controls how big a step you take: if α is big, you have an aggressive gradient descent; if α is small, you take tiny steps. Too small and the tiny steps take too long; too big and you can overshoot the minimum and fail to converge.

Slide 97

Slide 97 text

θ_i := θ_i - α · ∂J(θ)/∂θ_i, where ∂J(θ)/∂θ_i is the derivative, aka "the slope". The slope indicates the steepness of the descent step for each weight, i.e. the direction. Keep going for a number of iterations or until the cost is below a threshold (convergence). Graph the cost function versus the number of iterations and see where it starts to approach 0; past that are diminishing returns.

Slide 98

Slide 98 text

FINAL UPDATE ALGORITHM: θ_i := θ_i - α · Σ_{j=1..m} (h_θ(x^(j)) - y^(j)) · x_i^(j). The derivative for logistic regression simplifies to this term. Have to update the weights simultaneously!
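A minimal PHP sketch of batch gradient descent using that update rule (again building on the hypothesis() sketch above); the starting parameters, fixed iteration count, and lack of a convergence check are simplifications.

<?php
// Batch gradient descent for logistic regression.
// Update rule: theta_i := theta_i - alpha * sum_j (h(X_j) - y_j) * X_j[i],
// with all parameters updated simultaneously via a temporary vector.
function train(array $X, array $y, float $alpha, int $iterations): array
{
    $theta = array_fill(0, count($X[0]), 0.1);   // arbitrary starting point
    for ($iter = 0; $iter < $iterations; $iter++) {
        $h = [];
        foreach ($X as $j => $features) {        // hypotheses don't change within an iteration
            $h[$j] = hypothesis($theta, $features);
        }
        $temp = $theta;
        foreach ($theta as $i => $value) {
            $gradient = 0.0;
            foreach ($X as $j => $features) {
                $gradient += ($h[$j] - $y[$j]) * $features[$i];
            }
            $temp[$i] = $value - $alpha * $gradient;
        }
        $theta = $temp;                          // simultaneous update
    }
    return $theta;
}

// The two-point example worked through on the following slides:
$theta = train([[1, 12.0], [1, -3.5]], [1, 0], 0.05, 100);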

Slide 99

Slide 99 text

X1 = [1 12.0] X2 = [1 -3.5] Hypothesis for each data point based on current parameters. Each parameter is updated in order and result is saved to a temporary.

Slide 100

Slide 100 text

y1 = 1 y2 = 0 X1 = [1 12.0] X2 = [1 -3.5] Hypothesis for each data point based on current parameters. Each parameter is updated in order and result is saved to a temporary.

Slide 101

Slide 101 text

y1 = 1 y2 = 0 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.1 0.1] Hypothesis for each data point based on current parameters. Each parameter is updated in order and result is saved to a temporary.

Slide 102

Slide 102 text

y1 = 1 y2 = 0 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.1 0.1] α = 0.05 Hypothesis for each data point based on current parameters. Each parameter is updated in order and the result is saved to a temporary.

Slide 103

Slide 103 text

y1 = 1 y2 = 0 h(X1) = 1 / (1 + e^(-(0.1·1 + 0.1·12.0))) = 0.786 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.1 0.1] α = 0.05 Hypothesis for each data point based on current parameters. Each parameter is updated in order and the result is saved to a temporary.

Slide 104

Slide 104 text

y1 = 1 y2 = 0 h(X1) = 1 / (1 + e^(-(0.1·1 + 0.1·12.0))) = 0.786 h(X2) = 1 / (1 + e^(-(0.1·1 + 0.1·(-3.5)))) = 0.438 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.1 0.1] α = 0.05 Hypothesis for each data point based on current parameters. Each parameter is updated in order and the result is saved to a temporary.

Slide 105

Slide 105 text

y1 = 1 y2 = 0 h(X1) = 1 / (1 + e^(-(0.1·1 + 0.1·12.0))) = 0.786 h(X2) = 1 / (1 + e^(-(0.1·1 + 0.1·(-3.5)))) = 0.438 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.1 0.1] (computing T0) α = 0.05 Hypothesis for each data point based on current parameters. Each parameter is updated in order and the result is saved to a temporary.

Slide 106

Slide 106 text

y1 = 1 y2 = 0 T0 = 0.1 - 0.05·((h(X1) - y1)·X1[0] + (h(X2) - y2)·X2[0]) h(X1) = 1 / (1 + e^(-(0.1·1 + 0.1·12.0))) = 0.786 h(X2) = 1 / (1 + e^(-(0.1·1 + 0.1·(-3.5)))) = 0.438 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.1 0.1] α = 0.05 Hypothesis for each data point based on current parameters. Each parameter is updated in order and the result is saved to a temporary.

Slide 107

Slide 107 text

y1 = 1 y2 = 0 T0 = 0.1 - 0.05·((h(X1) - y1)·X1[0] + (h(X2) - y2)·X2[0]) = 0.1 - 0.05·((0.786 - 1)·1 + (0.438 - 0)·1) h(X1) = 1 / (1 + e^(-(0.1·1 + 0.1·12.0))) = 0.786 h(X2) = 1 / (1 + e^(-(0.1·1 + 0.1·(-3.5)))) = 0.438 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.1 0.1] α = 0.05 Hypothesis for each data point based on current parameters. Each parameter is updated in order and the result is saved to a temporary.

Slide 108

Slide 108 text

y1 = 1 y2 = 0 T0 = 0.1 - 0.05·((h(X1) - y1)·X1[0] + (h(X2) - y2)·X2[0]) = 0.1 - 0.05·((0.786 - 1)·1 + (0.438 - 0)·1) = 0.088 h(X1) = 1 / (1 + e^(-(0.1·1 + 0.1·12.0))) = 0.786 h(X2) = 1 / (1 + e^(-(0.1·1 + 0.1·(-3.5)))) = 0.438 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.1 0.1] α = 0.05 Hypothesis for each data point based on current parameters. Each parameter is updated in order and the result is saved to a temporary.

Slide 109

Slide 109 text

y1 = 1 y2 = 0 T1 = 0.1 - 0.05·((h(X1) - y1)·X1[1] + (h(X2) - y2)·X2[1]) = 0.1 - 0.05·((0.786 - 1)·12.0 + (0.438 - 0)·(-3.5)) = 0.305 h(X1) = 1 / (1 + e^(-(0.1·1 + 0.1·12.0))) = 0.786 h(X2) = 1 / (1 + e^(-(0.1·1 + 0.1·(-3.5)))) = 0.438 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.1 0.1] α = 0.05 Note that the hypotheses don't change within the iteration.

Slide 110

Slide 110 text

y1 = 1 y2 = 0 X1 = [1 12.0] X2 = [1 -3.5] θ = [T0 T1] α = 0.05 Replace the parameter (weights) vector with the temporaries.

Slide 111

Slide 111 text

y1 = 1 y2 = 0 X1 = [1 12.0] X2 = [1 -3.5] θ = [0.088 0.305] α = 0.05 Do the next iteration.

Slide 112

Slide 112 text

Training CROSS. Used to assess the results of the training.

Slide 113

Slide 113 text

DATA

Slide 114

Slide 114 text

DATA TRAINING

Slide 115

Slide 115 text

DATA: TRAINING + TEST. Train the model on the training set, then test the results on the test set. Rinse, lather, repeat feature selection/synthesis/training until the results are "good enough". Pick the best parameters and save them (DB, other).
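A hedged sketch of a simple random train/test split in PHP; the 70/30 ratio is an assumption, the slides don't specify one.

<?php
// Shuffle the corpus and split feature vectors + labels into training and test sets.
function split_corpus(array $vectors, array $labels, float $trainFraction = 0.7): array
{
    $indices = range(0, count($vectors) - 1);
    shuffle($indices);
    $cut  = (int) floor(count($indices) * $trainFraction);
    $pick = fn(array $idx, array $src): array => array_values(array_map(fn($i) => $src[$i], $idx));

    return [
        'train' => [$pick(array_slice($indices, 0, $cut), $vectors),
                    $pick(array_slice($indices, 0, $cut), $labels)],
        'test'  => [$pick(array_slice($indices, $cut), $vectors),
                    $pick(array_slice($indices, $cut), $labels)],
    ];
}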

Slide 116

Slide 116 text

Putting It All Together. Let's put our model to use, finally. The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet, though...

Slide 117

Slide 117 text

Load the model: the weights we have calculated via training. Easiest is to load them from the DB (which can be used to test different models).

Slide 118

Slide 118 text

HARD CODED RULES We apply some hard-coded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don't show up on the Web or other tools anyway, so it's fine to skip those.

Slide 119

Slide 119 text

HARD CODED RULES SKIP truncated retweets: "RT @A ..." We apply some hard-coded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don't show up on the Web or other tools anyway, so it's fine to skip those.

Slide 120

Slide 120 text

HARD CODED RULES SKIP truncated retweets: "RT @A ..." @ mentions of friends We apply some hard-coded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don't show up on the Web or other tools anyway, so it's fine to skip those.

Slide 121

Slide 121 text

HARD CODED RULES SKIP truncated retweets: "RT @A ..." tweets from friends @ mentions of friends We apply some hard-coded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don't show up on the Web or other tools anyway, so it's fine to skip those.
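A hedged PHP sketch of that pre-filter. The tweet array layout follows the Twitter REST/Streaming API JSON (text, user.id, entities.user_mentions); $friends is assumed to be a set of friend user IDs (id => true) loaded elsewhere, and the return-value convention is illustrative.

<?php
// Hard-coded rules: return true if the tweet can be skipped without
// running the model (truncated RT, from a friend, or mentions a friend).
function should_skip(array $tweet, array $friends): bool
{
    if (preg_match('/^RT @a\b/i', $tweet['text'])) {
        return true;                               // truncated retweet: "RT @a ..."
    }
    if (isset($friends[$tweet['user']['id']])) {
        return true;                               // tweet from a friend
    }
    foreach ($tweet['entities']['user_mentions'] ?? [] as $mention) {
        if (isset($friends[$mention['id']])) {
            return true;                           // @ mention of a friend
        }
    }
    return false;                                  // undecided: hand it to the classifier
}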

Slide 122

Slide 122 text

Classifying Tweets This is the moment we’ve been waiting for.

Slide 123

Slide 123 text

Classifying Tweets GOOD This is the moment we’ve been waiting for.

Slide 124

Slide 124 text

Classifying Tweets GOOD BAD This is the moment we’ve been waiting for.

Slide 125

Slide 125 text

h_θ(X) = 1 / (1 + e^(-θ·X)). Remember this? First is our hypothesis.

Slide 126

Slide 126 text

h_θ(X) = 1 / (1 + e^(-θ·X)). Remember this? θ·X = θ0 + θ1·X1 + θ2·X2 + ... First is our hypothesis.

Slide 127

Slide 127 text

Finally: h_θ(X) = 1 / (1 + e^(-(θ0 + θ1·X1 + θ2·X2 + ...))). If h > 0.5, the tweet is bad, otherwise good. Remember that the output of h() is 0..1 (a probability). 0.5 is your threshold; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.

Slide 128

Slide 128 text

extract features 3 simple steps Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature parameters into the equation). Use the output of the classifier.

Slide 129

Slide 129 text

extract features run the model 3 simple steps Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature parameters into the equation). Use the output of the classifier.

Slide 130

Slide 130 text

extract features run the model act on the result 3 simple steps Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature parameters into the equation). Use the output of the classifier.
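Tying those three steps together, a minimal PHP sketch that reuses the extract_features() and hypothesis() sketches from earlier; the 0.9 threshold mirrors the value mentioned a few slides back, and the function name is illustrative.

<?php
// Classify one incoming tweet: extract features, run the model, act on the result.
function is_bad_tweet(string $text, array $theta, array $extractors, float $threshold = 0.9): bool
{
    $features = extract_features($text, $extractors);   // 1. extract features
    $p        = hypothesis($theta, $features);           // 2. run the model (probability of "bad")
    return $p > $threshold;                               // 3. act on the result
}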

Slide 131

Slide 131 text

BAD? block user! Also save the tweet to DB for future analysis.

Slide 132

Slide 132 text

Lessons Learned Blocking is the only option (and is final) Streaming API delivery is incomplete ReplyCleaner judged to be ~80% effective Twitter API is a pain in the rear

Slide 133

Slide 133 text

Lessons Learned Blocking is the only option (and is final) Streaming API delivery is incomplete ReplyCleaner judged to be ~80% effective Twitter API is a pain in the rear -Connection handling, backoff in case of problems, undocumented API errors, etc.

Slide 134

Slide 134 text

Lessons Learned Blocking is the only option (and is final) Streaming API delivery is incomplete ReplyCleaner judged to be ~80% effective Twitter API is a pain in the rear -No way for blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.

Slide 135

Slide 135 text

Lessons Learned Blocking is the only option (and is final) Streaming API delivery is incomplete ReplyCleaner judged to be ~80% effective Twitter API is a pain in the rear -Some tweets are shown on the website, but never seen through the API.

Slide 136

Slide 136 text

Lessons Learned Blocking is the only option (and is final) Streaming API delivery is incomplete ReplyCleaner judged to be ~80% effective Twitter API is a pain in the rear -Lots of room for improvement.

Slide 137

Slide 137 text

Lessons Learned Blocking is the only option (and is final) Streaming API delivery is incomplete ReplyCleaner judged to be ~80% effective Twitter API is a pain in the rear PHP sucks at math-y stuff -Lots of room for improvement.

Slide 138

Slide 138 text

NEXT STEPS
★ Realtime feedback
★ More features
★ Grammar analysis
★ Support Vector Machines or decision trees
★ Clockwork Raven for manual classification
★ Other minimization algos: BFGS, conjugate gradient
★ Need pecl/scikit-learn
Click on the tweets that are bad and it immediately incorporates them into the model. Grammar analysis to eliminate the common "@a bar" or "two @a time" occurrences. SVMs are more appropriate for biased data sets. Farm out manual classification to Mechanical Turk. May help avoid local minima, no need to pick alpha, often faster than GD.

Slide 139

Slide 139 text

TOOLS
★ MongoDB
★ pear/Text_LanguageDetect
★ English vocabulary corpus
★ Parts-Of-Speech tagging
★ SplFixedArray
★ phirehose
★ Python's scikit-learn (for validation)
★ Code sample
MongoDB (great fit for JSON data). English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/. SplFixedArray in PHP (memory savings and slightly faster).

Slide 140

Slide 140 text

Questions?