MOM! My algorithms SUCK

Abe Stanway
September 19, 2013

Given at Monitorama.eu 2013 in Berlin. http://vimeo.com/75183236

Transcript

  1. @abestanway
    MOM! my algorithms SUCK

  2. i know how to fix
    monitoring once
    and for all.

  3. a real human physically staring
    at a single metric 24/7

  4. that human will then alert a
    sleeping engineer when her
    metric does something weird

  5. Boom. Perfect Monitoring™.

  6. this works because humans are
    excellent visual pattern matchers*
    *there are, of course, many advanced
    statistical applications where signal
    cannot be determined from noise just
    by looking at the data.

  7. can we teach software to be
    as good at simple anomaly
    detection as humans are?

  8. let’s explore.

  9. anomalies = not “normal”

  10. humans can tell what
    “normal” is by just looking
    at a timeseries.

  11. the human definition:
    “if a datapoint is not within
    reasonable bounds, more or
    less, of what usually happens,
    it’s an anomaly”

  12. there are real statistics
    that describe what we
    mentally approximate

  13. [no text transcribed for this slide]

  14. “what usually happens”
    the mean

  15. “more or less”
    the standard deviation

  16. “reasonable bounds”

  17. so, in math speak, a metric is
    anomalous if its latest datapoint
    lies more than three standard
    deviations away from the mean
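
The rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function name `is_anomalous`, the `sigmas` parameter, and the choice to compute the mean and standard deviation from the history that precedes the latest datapoint are all assumptions made here.

```python
import statistics


def is_anomalous(series, sigmas=3.0):
    """Return True if the latest point is more than `sigmas` standard
    deviations away from the mean of the points before it."""
    history, latest = series[:-1], series[-1]
    if len(history) < 2:
        return False  # not enough history to estimate a spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        return False  # a perfectly flat history gives no usable bounds
    return abs(latest - mean) > sigmas * stdev
```

Excluding the latest point from the statistics is a design choice: if the point under test is included, a single large outlier inflates the standard deviation it is being compared against.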

  18. we have essentially derived
    statistical process control.

  19. pioneered in the 1920s.
    heavily used in
    industrial engineering
    for quality control on
    assembly lines.

  20. traditional control charts
    specification limits

  21. grounded in
    exchangeability
    past = future

  22. needs to be stationary

  23. produced by independent
    random variables, with well-
    defined expected values

  24. this allows for
    statistical inference

  25. in other words, you need good
    lookin’ timeseries for this to work.

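  26. normal distribution:
    a more concise
    definition of
    good lookin’
    [figure: normal curve centered on μ; on each side of the mean, 34.1% of values fall within one σ, 13.6% between one and two σ, and 2.1% between two and three σ]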

  27. if you’ve got a normal distribution, chances are
    you’ve got an exchangeable, stationary series
    produced by independent random variables

  28. 99.7% of values fall within 3σ of the mean
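
The 99.7% figure follows from the normal distribution itself and can be checked with the error function. The helper name `fraction_within` is introduced here for illustration only.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution that lies within k standard
    deviations of its mean (both sides combined)."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
```

For k = 3 this gives roughly 0.9973, which is where the 99.7% (and the remaining 0.3%) comes from.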

  29. if your datapoint is in
    here, it’s an anomaly.
    [figure: the same normal curve with its 34.1%, 13.6%, and 2.1% bands; a callout marks the anomalous region]

  30. when only 0.3% lie beyond 3σ...

  31. ...you get a high
    signal to noise ratio...

  32. ...where “signal” indicates a
    fundamental state change, as opposed
    to a random, improbable variation.

  33. a fundamental state change in
    the process means a different
    probability distribution function
    that describes the process

  34. anomaly detection:
    determining when probability
    distribution function shifts have
    occurred, as early as possible.

  35. [figure; surviving labels: μ, 1]

  36. a new PDF that
    describes a new
    process
    [figure; surviving labels: μ, 1]

  37. drilling holes
    sawing boards
    forging steel

  38. snapped drill bit
    teeth missing on table saw
    steel, like, melted

  39. processes with well planned
    expected values that only suffer
    small, random deviances when
    working properly...

  40. ...and massive “deviances”, aka,
    probability function shifts, when
    working improperly.

  41. the bad news:

  42. server infrastructures
    aren’t like assembly lines

  43. systems are active
    participants in their
    own design

  44. processes don’t have well
    defined expected values

  45. they aren’t produced by genuinely
    independent random variables.

  46. large variance does not
    necessarily indicate poor quality

  47. they have seasonality

  48. skewed distributions!
    less than 99.73% of all
    values lie within 3σ, so
    breaching 3σ is not
    necessarily bad
    [figure label: possibly normal range]
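
To make the skew point concrete, the sketch below compares how much probability lies beyond 3σ for a normal distribution and for an exponential one. The exponential distribution is an assumption chosen here because its tail has a closed form; the slides do not name a specific skewed distribution, and both function names are hypothetical.

```python
import math


def normal_tail_beyond(k):
    """Probability that a normal value lies more than k sigma from its mean."""
    return math.erfc(k / math.sqrt(2))


def exponential_tail_beyond(k, rate=1.0):
    """Probability that an exponential value lies more than k sigma from its
    mean. For this distribution the mean and sigma both equal 1 / rate."""
    mean = sigma = 1.0 / rate
    upper = mean + k * sigma
    lower = mean - k * sigma
    p_upper = math.exp(-rate * upper)
    # The lower bound only matters when it is above zero (k < 1).
    p_lower = 1.0 - math.exp(-rate * lower) if lower > 0 else 0.0
    return p_upper + p_lower


print(f"normal:      {normal_tail_beyond(3):.4%}")
print(f"exponential: {exponential_tail_beyond(3):.4%}")
```

Under this assumption the exponential series breaches 3σ several times more often than a normal one would, with no change in the underlying process.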

  49. the dirty secret: using SPC-based
    algorithms results in lots and lots
    of false positives, and probably lots
    of false negatives as well

  50. no way to retroactively find the
    false negatives short of combing
    with human eyes!

  51. how do we
    combat this?*
    *warning!
    ideas!

  52. we could always use
    custom fit models...

  53. ...after all, as long as the
    *errors* from the model
    are normally distributed,
    we can use 3σ
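
One way to read this slide: fit a model, then apply the 3σ test to the model's errors rather than to the raw values. In the sketch below a straight-line trend fit by least squares stands in for the custom model. That stand-in, along with the names `fit_line` and `residual_is_anomalous`, is an assumption for illustration and not the model used in the talk.

```python
import statistics


def fit_line(ys):
    """Ordinary least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def residual_is_anomalous(series, sigmas=3.0):
    """Flag the latest point if its forecast error exceeds `sigmas` standard
    deviations of the historical fit errors. Assumes at least three
    historical points and a fit that is not perfectly exact."""
    history, latest = series[:-1], series[-1]
    slope, intercept = fit_line(history)
    residuals = [y - (slope * x + intercept) for x, y in enumerate(history)]
    spread = statistics.stdev(residuals)
    forecast = slope * len(history) + intercept
    return abs(latest - forecast) > sigmas * spread
```

The 3σ test is only as trustworthy as the assumption that the residuals are roughly normal, which is the condition the slide states.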

  54. Parameters are cool!
    a pretty decent forecast
    based on an artisanal
    handcrafted model

  55. but fitting models is
    hard, even by hand.

  56. possible to implement a class of
    ML algorithms that determine
    models based on distribution of
    errors, using Q-Q plots

  57. Q-Q plots can also be used to
    determine if the PDF has
    changed, although hard to do
    with limited sample size
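
A Q-Q plot pairs the sorted sample with the quantiles of a reference distribution; a straight line suggests the sample follows that distribution. The sketch below computes the points against a standard normal and scores how straight they are. The plotting position `(i + 0.5) / n`, the use of a correlation coefficient as the score, and both function names are assumptions made here, not details from the talk.

```python
import math
import statistics


def qq_points(sample):
    """Pair each sorted sample value with the matching standard normal quantile."""
    n = len(sample)
    std_normal = statistics.NormalDist()
    # (i + 0.5) / n is one common choice of plotting position.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(sample)))


def qq_straightness(sample):
    """Pearson correlation of the Q-Q points; values near 1 mean the points
    lie close to a straight line."""
    points = qq_points(sample)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in points)
    x_ss = sum((x - x_mean) ** 2 for x in xs)
    y_ss = sum((y - y_mean) ** 2 for y in ys)
    return cov / math.sqrt(x_ss * y_ss)
```

With small samples the score is noisy, which matches the slide's caveat about limited sample size.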

  58. consensus: throw lots of
    different models at a series,
    hope it all shakes out.

  59. [yes] [yes] [no] [no] [yes] [yes]
    =
    anomaly!
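
The voting idea on this slide can be sketched as a function that runs several detectors and counts the yes votes. The name `consensus_is_anomalous` and the simple-majority threshold are assumptions; the slide shows four of six detectors agreeing but does not state the cutoff.

```python
def consensus_is_anomalous(series, detectors, min_fraction=0.5):
    """Run every detector on the series and report an anomaly when more than
    `min_fraction` of them vote yes. Each detector is any callable that takes
    the series and returns a truthy or falsy value."""
    votes = [bool(detect(series)) for detect in detectors]
    return sum(votes) / len(votes) > min_fraction
```

With the slide's pattern of votes, `[yes, yes, no, no, yes, yes]`, four of six agree, so a simple majority reports an anomaly.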

  60. of course, if your models are
    all SPC-based, this doesn’t
    really get you anywhere

  61. use exponentially weighted
    moving averages to adapt faster
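
An exponentially weighted moving average gives recent points more weight, so it tracks a level shift faster than a plain mean over the same history. This is a minimal sketch; the function name `ewma`, the default `alpha`, and seeding the average with the first value are assumptions.

```python
def ewma(series, alpha=0.3):
    """Exponentially weighted moving average of `series`.
    `alpha` is the weight given to each new point (0 < alpha <= 1);
    the average is seeded with the first value."""
    smoothed = []
    for x in series:
        previous = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * previous)
    return smoothed


# After a jump from 0 to 10, the EWMA closes most of the gap within a few
# points, while the plain mean over the same 15 points is still far below 10.
shifted = [0] * 10 + [10] * 5
print(ewma(shifted, alpha=0.5)[-1], sum(shifted) / len(shifted))
```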

  62. fourier transforms to
    detect seasonality
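
A Fourier transform shows which frequencies carry the most energy in a series; a strong peak points to a seasonal period. The sketch below uses a naive O(n²) discrete Fourier transform from the standard library to keep the example self-contained. The function name `dominant_period` and the rule of picking the single strongest frequency are assumptions; a real series would usually need detrending and a check that the peak stands out from noise.

```python
import cmath
import math
import statistics


def dominant_period(series):
    """Return the period (in samples) of the strongest frequency component,
    or None if the series has no variation."""
    n = len(series)
    mean = statistics.fmean(series)
    centered = [x - mean for x in series]  # drop the zero-frequency term
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None
```

For example, four days of hourly samples from a clean daily sine wave yield a period of 24.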

  63. second order anomalies: is the
    series “anomalously anomalous”?

  64. ...this is all very hard.

  65. so, we can either
    change what we expect
    of monitoring...

  66. ...and treat it as a way of
    building noisy situational
    awareness, not absolute
    directives (alerts)...

  67. ...or we can change what we
    expect out of engineering...

  68. ...and construct strict
    specifications and expected
    values of all metrics.

  69. neither is going to happen.

  70. so we have to crack
    this algorithm nut.

  71. ...ugh.
    @abestanway
