
"Small Data" Machine Learning

"Small Data" Machine Learning

What do you do if you have a single letter Twitter handle and your reply stream is virtually useless due to people using “@a” in ways that make you lose faith in humanity (“I’m @a bar” is the best of it)? These days, you might call machine learning to the rescue. ML is a hot new topic, but may be a bit difficult to get started with in a large project. This session explains how I used ML and a couple of other tricks to build a reply stream cleaner that sanitizes my Twitter consumption and will hopefully inspire you to use ML in your own projects.

Andrei Zmievski

January 26, 2013
Transcript

  1. Small Data Machine Learning
    Andrei Zmievski


  2. WORK
    We are all superheroes, because we help our customers keep their mission-critical apps
    running smoothly. If interested, I can show you a demo of what I’m working on. Come find
    me.

  6. MATH
    SOME
    AWE

  7. @a
    Acquired in October 2008
    Had a different account earlier, but then @k asked if I wanted it.
    Know many other single-letter Twitterers.


  10. FAME
    FORTUNE
    Wall Street Journal?!

  12. lol, what?!
    FAME
    FORTUNE
    FOLLOWERS

  13. 140-length(“@a “)=137
    MAXIMUM REPLY SPACE!


  15. CONS
    I hate humanity
    Disadvantages
    Visual filtering is next to impossible
    Could be a set of hard-coded rules derived empirically

  17. Machine Learning
    to the Rescue!
    Being grumpy makes you learn stuff


  18. REPLYCLEANER
    Even with false negatives, reduces garbage to where visual filtering is possible
    - uses trained model to classify tweets into good/bad
    - blocks the authors of the bad ones, since Twitter does not have a way to remove an
      individual tweet from the timeline

  25. I still hate humanity

  27. Machine Learning
    A branch of Artificial Intelligence
    No widely accepted definition


  28. “Field of study that gives computers the ability to learn
    without being explicitly programmed.”
    — Arthur Samuel (1959)

  29. SPAM FILTERING

  30. RECOMMENDATIONS

  31. TRANSLATION

  32. CLUSTERING
    And many more: medical diagnoses, detecting credit card fraud, etc.

  33. supervised
    Labeled dataset, training maps input to desired outputs
    Examples: regression (predicting house prices), classification (spam filtering)

  34. unsupervised
    No labels in the dataset, the algorithm needs to find structure
    Example: clustering

  35. Feature
    individual measurable property of the phenomenon under observation
    usually numeric

  37. Feature Vector
    a set of features for an observation
    Think of it as an array

  42. features: [1, # of rooms, sq. footage, house age, yard?]
    weights: [45.7, 102.3, 0.94, -10.1, 83.0]
    features · weights = prediction
    feature vector and weights vector
    1 added to pad the vector (accounts for the initial offset / bias / intercept weight,
    simplifies calculation)
    dot product produces a linear predictor

  44. X = [1  x1  x2  ...]
    θ = [θ0  θ1  θ2  ...]
    θ · X = θ0 + θ1·x1 + θ2·x2 + ...
    dot product
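The padded dot product on this slide can be sketched in a few lines of Python (the deck's own tooling was PHP; this block is purely illustrative):

```python
def linear_predictor(theta, x):
    """Dot product of weight vector theta and feature vector x.

    x is expected to already include the leading 1 that pads the
    vector for the intercept (bias) weight theta[0].
    """
    return sum(t * xi for t, xi in zip(theta, x))

# Padded feature vector: [1, x1, x2, ...]
x = [1, 3.0, 2.0]
theta = [0.5, 2.0, -1.0]
z = linear_predictor(theta, x)  # 0.5 + 2.0*3.0 - 1.0*2.0 = 4.5
```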

  45. training data
    learning algorithm
    hypothesis
    Hypothesis (decision function): what the system has learned so far
    Hypothesis is applied to new data


  49. y = hθ(X)
    θ: parameters, X: input data, y: prediction
    The task of our algorithm is to determine the parameters of the hypothesis.

  50. LINEAR REGRESSION
    (plot: whisky age vs. whisky price $)
    Models the relationship between a scalar dependent variable y and one or more explanatory
    variables denoted X. Here X = whisky age, y = whisky price.
    Linear regression does not work well for classification because its output is unbounded.
    Thresholding on some value is tricky and does not produce good results.

  54. LOGISTIC REGRESSION
    g(z) = 1 / (1 + e^(-z))
    z = θ · X
    Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at the origin.
    z is just our old dot product, the linear predictor.

  55. LOGISTIC REGRESSION
    hθ(X) = 1 / (1 + e^(-θ·X))
    Probability that y=1 for input X
    If the hypothesis describes spam, then given X = body of email, hθ(X) = 0.7 means there’s a
    70% chance it’s spam. Thresholding on that is up to you.
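A minimal Python version of the logistic hypothesis above (illustrative, not the deck's PHP code):

```python
import math

def hypothesis(theta, x):
    """Logistic hypothesis h_theta(x) = 1 / (1 + e^(-theta . x)).

    Returns the probability that y = 1 for input x, where x is padded
    with a leading 1 for the intercept weight.
    """
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

# g(z) crosses 0.5 at the origin, as on the slide: z = 0 gives 0.5.
print(hypothesis([0.0, 0.0], [1, 5.0]))  # 0.5
```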

  56. Building the Tool


  57. Corpus
    collection of source data used for training and testing the model

  59. Twitter → phirehose → MongoDB
    8500 tweets
    phirehose hooks into the streaming API

  60. Feature
    Identification


  61. independent & discriminant
    Independent: feature A should not co-occur (correlate) with feature B highly.
    Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts
    with is not a good feature).


  62. possible features
    @a at the end of the tweet
    @a...
    length < N chars
    # of user mentions in the tweet
    # of hashtags
    language!
    @a followed by punctuation and a word character (except for apostrophe)

  63. feature = extractor(tweet)
    For each feature, write a small function that takes a tweet and returns a numeric value

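A sketch of what such per-feature extractor functions could look like in Python; the function names and regexes here are hypothetical stand-ins, not the actual extractors (which were PHP):

```python
import re

def ends_with_a(tweet):
    """1 if the tweet ends with @a, else 0 (hypothetical feature)."""
    return 1 if re.search(r'@a\s*$', tweet, re.IGNORECASE) else 0

def mention_count(tweet):
    """# of user mentions in the tweet."""
    return len(re.findall(r'@\w+', tweet))

def hashtag_count(tweet):
    """# of hashtags."""
    return len(re.findall(r'#\w+', tweet))

def length_feature(tweet):
    """Raw tweet length in characters."""
    return len(tweet)

EXTRACTORS = [ends_with_a, mention_count, hashtag_count, length_feature]

def feature_vector(tweet):
    """Run every extractor and pad with the leading 1 (intercept)."""
    return [1] + [f(tweet) for f in EXTRACTORS]

print(feature_vector("I'm @a bar #fail"))  # [1, 0, 1, 1, 16]
```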

  64. corpus
    extractors
    feature vectors
    Run the set of these functions over the corpus and build up feature vectors
    Array of arrays
    Save to DB


  65. Language Matters
    high correlation between the language of the tweet and its category (good/bad)

  66. Indonesian or Tagalog? Garbage.

  67. Top 12 Languages
    id  Indonesian  3548
    en  English     1804
    tl  Tagalog      733
    es  Spanish      329
    so  Somali       305
    ja  Japanese     300
    pt  Portuguese   262
    ar  Arabic       256
    nl  Dutch        150
    it  Italian      137
    sw  Swahili      118
    fr  French        92
    I guarantee you people aren’t tweeting at me in Swahili.

  69. Language Detection
    pear/Text_LanguageDetect, pecl/textcat
    Can’t trust the language field in the user’s profile data.
    Used character N-grams and character sets for detection.
    Has its own error rate, so needs some post-processing.

  70. EnglishNotEnglish
    ✓ Clean up text (remove mentions, links, etc.)
    ✓ Run language detection
    ✓ If unknown/low weight, pretend it’s English, else:
    ✓ If not a character set-determined language, try harder:
    ✓ Tokenize into words
    ✓ Difference with English vocabulary
    ✓ If words remain, run parts-of-speech tagger on each
    ✓ For NNS, VBZ, and VBD run stemming algorithm
    ✓ If result is in English vocabulary, remove from remaining
    ✓ If remaining list is not empty, calculate:
      unusual_word_ratio = size(remaining)/size(words)
    ✓ If ratio < 20%, pretend it’s English
    A lot of this is heuristic-based, after some trial-and-error.
    Seems to help with my corpus.
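The unusual-word-ratio step of the heuristic can be sketched like this in Python, assuming an English vocabulary set loaded elsewhere (the names are illustrative):

```python
def unusual_word_ratio(words, english_vocab):
    """Fraction of tokens not found in the English vocabulary.

    One small piece of the EnglishNotEnglish checklist: after cleanup,
    tokenization, and stemming, whatever is still unknown counts as
    'unusual'. english_vocab is assumed to be a set of known English
    words loaded from a corpus.
    """
    if not words:
        return 0.0
    remaining = [w for w in words if w.lower() not in english_vocab]
    return len(remaining) / len(words)

def probably_english(words, english_vocab, threshold=0.2):
    """Pretend it's English if fewer than 20% of the words are unknown."""
    return unusual_word_ratio(words, english_vocab) < threshold

vocab = {"the", "cat", "sat", "on", "mat"}
print(probably_english(["the", "cat", "sat", "on", "mat"], vocab))  # True
```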

  71. BINARY CLASSIFICATION
    Grunt work
    Built a web-based tool to display tweets a page at a time and select good ones


  72. INPUT: feature vectors
    OUTPUT: labels (good/bad)
    Had my input and output

  74. BIAS CORRECTION
    BAD vs. GOOD: 99% bad (fewer than 100 tweets were good)
    Training a model as-is would not produce good results
    Need to adjust the bias

  76. OVERSAMPLING
    Oversampling: use multiple copies of good tweets to equalize with bad
    Problem: bias very high, each good tweet would have to be copied 100 times, and would not
    contribute any variance to the good category

  78. UNDERSAMPLING
    Undersampling: drop most of the bad tweets to equalize with good
    Problem: total corpus ends up being < 200 tweets, not enough for training

  80. SYNTHETIC OVERSAMPLING
    Synthesize feature vectors by determining what constitutes a good tweet and doing weighted
    random selection of feature values.

  85. chance  feature
    90%   “good” language
    70%   no hashtags
    25%   1 hashtag
    5%    2 hashtags
    2%    @a at the end
    85%   rand length > 10
    (sample synthesized values from the build: 1, 2, 0, 77)
    The actual synthesis is somewhat more complex and was also trial-and-error based
    Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus
    (limited to 1000)
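The weighted random selection could look roughly like this in Python; the generators below only mirror the chances in the table and are not the actual (more complex) synthesis:

```python
import random

def synthesize_good_tweet_vector(rng):
    """Draw one synthetic 'good' feature vector by weighted random choice.

    Each feature value is drawn independently with the chances from
    the slide's table; the exact value ranges are assumptions.
    """
    good_language = 1 if rng.random() < 0.90 else 0
    # 70% no hashtags, 25% one hashtag, 5% two hashtags
    hashtags = rng.choices([0, 1, 2], weights=[70, 25, 5])[0]
    a_at_end = 1 if rng.random() < 0.02 else 0
    # 85% chance of a random length over 10, otherwise a short tweet
    length = rng.randint(11, 137) if rng.random() < 0.85 else rng.randint(1, 10)
    return [1, good_language, hashtags, a_at_end, length]  # leading 1 = intercept

rng = random.Random(42)
synthetic = [synthesize_good_tweet_vector(rng) for _ in range(100)]
```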

  86. Model Training
    We have the hypothesis (decision function) and the training set.
    How do we actually determine the weights/parameters?

  88. COST FUNCTION
    REALITY vs. PREDICTION
    Measures how far the prediction of the system is from the reality.
    The cost depends on the parameters.
    The lower the cost, the closer we are to the ideal parameters for the model.

  89. COST FUNCTION
    J(θ) = (1/m) Σ_{i=1..m} Cost(hθ(x_i), y_i)

  90. LOGISTIC COST
    Cost(hθ(x), y) = −log(hθ(x))      if y = 1
                     −log(1 − hθ(x))  if y = 0
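The logistic cost and the averaged cost function J(θ) from the slides above, as a small Python sketch:

```python
import math

def logistic_cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def total_cost(predictions, labels):
    """J(theta) = (1/m) * sum of per-example costs."""
    m = len(labels)
    return sum(logistic_cost(h, y) for h, y in zip(predictions, labels)) / m

# A confident correct guess costs ~0; a confident wrong guess costs a lot.
print(logistic_cost(0.99, 1))  # ~0.01
print(logistic_cost(0.01, 1))  # ~4.6
```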

  91. (plots: cost vs. hθ(x), for y=1 and y=0)
    Correct guess: Cost = 0
    Incorrect guess: Cost = huge
    When y=1 and h(x) is 1 (good guess), cost is 0, but the closer h(x) gets to 0 (wrong guess),
    the more we penalize the algorithm. Same for y=0.

  92. minimize cost over θ
    Finding the best values of θ that minimize the cost

  93. GRADIENT DESCENT
    Random starting point.
    Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step.
    Repeat.
    Imagine a ball rolling down from a hill.

  94. GRADIENT DESCENT
    θi := θi − α · ∂J(θ)/∂θi
    Each step adjusts the parameters according to the slope

  95. θi := θi − α · ∂J(θ)/∂θi   (for each parameter)
    Have to update them simultaneously (the whole vector at a time).

  96. α is the learning rate: controls how big a step you take
    If α is big, you have an aggressive gradient descent
    If too small, tiny steps, takes too long
    If too big, can overshoot the minimum and fail to converge

  97. ∂J(θ)/∂θi is the derivative, aka “the slope”
    The slope indicates the steepness of the descent step for each weight, i.e. direction.
    Keep going for a number of iterations or until cost is below a threshold (convergence).
    Graph the cost function versus # of iterations and see where it starts to approach 0; past
    that are diminishing returns.

  98. FINAL UPDATE ALGORITHM
    θi := θi − α · Σ_{j=1..m} (hθ(x_j) − y_j) · x_j,i
    Derivative for logistic regression simplifies to this term.
    Have to update the weights simultaneously!
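The batch update above, as a Python sketch (with the simultaneous update via a temporary list that the notes require; the deck's real implementation was PHP):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent_step(theta, X, y, alpha):
    """One batch update: theta_i := theta_i - alpha * sum_j (h(x_j) - y_j) * x_j_i.

    All parameters are updated simultaneously: the hypotheses are
    computed once from the old theta, and all new values are built
    before theta is replaced.
    """
    m = len(X)
    h = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for x in X]
    return [
        theta[i] - alpha * sum((h[j] - y[j]) * X[j][i] for j in range(m))
        for i in range(len(theta))
    ]

# The deck's tiny two-example dataset:
X = [[1, 12.0], [1, -3.5]]
y = [1, 0]
theta = [0.1, 0.1]
theta = gradient_descent_step(theta, X, y, alpha=0.05)
print([round(t, 3) for t in theta])  # [0.089, 0.305]
```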

  108. y1 = 1, y2 = 0
    X1 = [1 12.0]
    X2 = [1 -3.5]
    θ = [0.1 0.1]
    α = 0.05
    h(X1) = 1 / (1 + e^-(0.1·1 + 0.1·12.0)) = 0.786
    h(X2) = 1 / (1 + e^-(0.1·1 + 0.1·-3.5)) = 0.438
    T0 = 0.1 - 0.05 · ((h(X1) - y1) · X1,0 + (h(X2) - y2) · X2,0)
       = 0.1 - 0.05 · ((0.786 - 1) · 1 + (0.438 - 0) · 1)
       = 0.088
    Hypothesis for each data point based on current parameters.
    Each parameter is updated in order and the result is saved to a temporary.

  109. T1 = 0.1 - 0.05 · ((h(X1) - y1) · X1,1 + (h(X2) - y2) · X2,1)
       = 0.1 - 0.05 · ((0.786 - 1) · 12.0 + (0.438 - 0) · -3.5)
       = 0.305
    Note that the hypotheses don’t change within the iteration.

  111. θ = [T0 T1] = [0.088 0.305]
    Replace the parameter (weights) vector with the temporaries.
    Do the next iteration.
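The iteration above can be double-checked numerically (the slide's 0.088 is the truncated value of ≈0.0888):

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

X1, X2 = [1, 12.0], [1, -3.5]
y1, y2 = 1, 0
theta = [0.1, 0.1]
alpha = 0.05

# Hypotheses are computed once from the old parameters...
h1 = sigmoid(theta[0] * X1[0] + theta[1] * X1[1])  # ~0.786
h2 = sigmoid(theta[0] * X2[0] + theta[1] * X2[1])  # ~0.438

# ...then both temporaries are built from those same hypotheses.
t0 = theta[0] - alpha * ((h1 - y1) * X1[0] + (h2 - y2) * X2[0])
t1 = theta[1] - alpha * ((h1 - y1) * X1[1] + (h2 - y2) * X2[1])
print(round(t0, 3), round(t1, 3))  # 0.089 0.305
```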

  112. CROSS-VALIDATION
    Used to assess the results of the training.

  115. DATA = TRAINING + TEST
    Train the model on the training set, then test the results on the test set.
    Rinse, lather, repeat feature selection/synthesis/training until results are “good enough”.
    Pick the best parameters and save them (DB, other).

  116. Putting It All
    Together
    Let’s put our model to use, finally.
    The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do
    certain error handling, etc. Once we get the actual tweet though…


  117. Load the model
    The weights we have calculated via training
    Easiest is to load them from DB (can be used to test different models).

  121. HARDCODED RULES
    SKIP:
    truncated retweets: "RT @A ..."
    tweets from friends
    @ mentions of friends
    We apply some hardcoded rules to filter out the tweets we are certain are good or bad.
    The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip
    those.

  124. Classifying Tweets
    GOOD or BAD
    This is the moment we’ve been waiting for.

  127. Finally
    hθ(X) = 1 / (1 + e^-(θ0 + θ1X1 + θ2X2 + ...))
    If h > 0.5, the tweet is bad, otherwise good
    Remember that the output of h() is 0..1 (a probability).
    0.5 is your threshold; adjust it for your degree of tolerance. I used 0.9 to reduce false
    positives.

  130. 3 simple steps
    extract features
    run the model
    act on the result
    Invoke the feature extractor to construct the feature vector for this tweet.
    Evaluate the decision function over the feature vector (input the calculated feature values
    into the equation).
    Use the output of the classifier.
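The three steps, as a Python sketch; the weights and the single length feature below are made up for illustration and are not the trained model:

```python
import math

def classify(tweet, theta, extractors, threshold=0.9):
    """Extract features, run the model, act on the result.

    The 0.9 threshold (rather than 0.5) mirrors the choice above to
    reduce false positives; h above it means 'bad'.
    """
    x = [1] + [f(tweet) for f in extractors]       # 1. extract features
    z = sum(t * xi for t, xi in zip(theta, x))
    h = 1.0 / (1.0 + math.exp(-z))                 # 2. run the model
    return "bad" if h > threshold else "good"      # 3. act on the result

# Hypothetical single-feature model: h (probability of bad) drops as
# the tweet gets longer. Weights are assumed, not trained.
extractors = [len]
theta = [5.0, -0.5]
print(classify("@a", theta, extractors))  # bad
```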

  131. BAD?
    block
    user!
    Also save the tweet to DB for future analysis.


  132. Lessons Learned
    Blocking is the only option (and is final): there is no way for the blocked person to get
    ahold of you via Twitter anymore, so when training the model, err on the side of caution.
    Streaming API delivery is incomplete: some tweets are shown on the website, but never seen
    through the API.
    ReplyCleaner judged to be ~80% effective: lots of room for improvement.
    Twitter API is a pain in the rear: connection handling, backoff in case of problems,
    undocumented API errors, etc.
    PHP sucks at math-y stuff

  138. NEXT STEPS
    ★ Realtime feedback: click on the tweets that are bad and it immediately incorporates them
      into the model.
    ★ More features
    ★ Grammar analysis: to eliminate the common “@a bar” or “two @a time” occurrences.
    ★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
    ★ Clockwork Raven for manual classification: farm out manual classification to Mechanical
      Turk.
    ★ Other minimization algos: BFGS, conjugate gradient. May help avoid local minima, no need
      to pick α, often faster than gradient descent.
    ★ Need pecl/scikit-learn

  139. TOOLS
    ★ MongoDB (great fit for JSON data)
    ★ pear/Text_LanguageDetect
    ★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
    ★ Parts-of-speech tagging
    ★ SplFixedArray in PHP (memory savings and slightly faster)
    ★ phirehose
    ★ Python’s scikit-learn (for validation)
    ★ Code sample

  140. Questions?