Berlin 2013 - Session - Abe Stanway

Monitorama
September 19, 2013


Transcript

  1. @abestanway
    MOM! my algorithms SUCK

  2. i know how to fix
    monitoring once
    and for all.

  3. a real human physically staring
    at a single metric 24/7

  4. that human will then alert a
    sleeping engineer when her
    metric does something weird

  5. Boom. Perfect Monitoring™.

  6. this works because humans are
    excellent visual pattern matchers*
    *there are, of course, many advanced
    statistical applications where signal
    cannot be determined from noise just
    by looking at the data.

  7. can we teach software to be
    as good at simple anomaly
    detection as humans are?

  8. let’s explore.

  9. anomalies = not “normal”

  10. humans can tell what
    “normal” is by just looking
    at a timeseries.

  11. the human definition:
    “if a datapoint is not within
    reasonable bounds, more or
    less, of what usually happens,
    it’s an anomaly”

  12. there are real statistics
    that describe what we
    mentally approximate

  13. “what usually happens”
    the mean

  14. “more or less”
    the standard deviation

  15. “reasonable bounds”

  16. so, in math speak, a metric is
    anomalous if the latest datapoint
    lies more than three standard
    deviations from the mean

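
The rule on the slide above fits in a few lines of code. A minimal sketch (the function name and sample series are invented for this example, not from the talk):

```python
import statistics

def is_anomalous(history, latest, sigmas=3):
    """Three-sigma rule: flag `latest` when it lies more than
    `sigmas` standard deviations from the mean of `history`."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return False  # a perfectly flat series has no bounds to breach
    return abs(latest - mean) > sigmas * stdev

# a steady series hovering around 10
history = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
print(is_anomalous(history, 10.5))  # small wiggle: not anomalous
print(is_anomalous(history, 45))    # far outside 3 sigma: anomalous
```
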
  17. we have essentially derived
    statistical process control.

  18. pioneered in the 1920s.
    heavily used in
    industrial engineering
    for quality control on
    assembly lines.

  19. traditional control charts
    specification limits

  20. grounded in
    exchangeability
    past = future

  21. needs to be stationary

  22. produced by independent
    random variables, with well-
    defined expected values

  23. this allows for
    statistical inference

  24. in other words, you need good
    lookin’ timeseries for this to work.

  25. normal distribution:
    a more concise
    definition of
    good lookin’
    [bell curve centered on μ: 34.1% of values
    within one σ of the mean on either side,
    13.6% between one and two σ, and 2.1%
    between two and three σ]

  26. if you’ve got a normal distribution, chances are
    you’ve got an exchangeable, stationary series
    produced by independent random variables

  27. 99.7% fall within 3σ

  28. [the same bell curve, with the
    tails beyond 3σ shaded]
    if your datapoint is in
    here, it’s an anomaly.

  29. when only 0.3% lie beyond 3σ...

  30. ...you get a high
    signal to noise ratio...

  31. ...where “signal” indicates a
    fundamental state change, as opposed
    to a random, improbable variation.

  32. a fundamental state change in
    the process means a different
    probability distribution function
    that describes the process

  33. anomaly detection:
    determining when probability
    distribution function shifts have
    occurred, as early as possible.

  34. [a shifted bell curve with a new mean]
    a new PDF that
    describes a new
    process

  35. drilling holes
    sawing boards
    forging steel

  36. snapped drill bit
    teeth missing on table saw
    steel, like, melted

  37. processes with well planned
    expected values that only suffer
    small, random deviances when
    working properly...

  38. ...and massive “deviances”, aka,
    probability function shifts, when
    working improperly.

  39. the bad news:

  40. server infrastructures
    aren’t like assembly lines

  41. systems are active
    participants in their
    own design

  42. processes don’t have well
    defined expected values

  43. they aren’t produced by genuinely
    independent random variables.

  44. large variance does not
    necessarily indicate poor quality

  45. they have seasonality

  46. skewed distributions!
    less than 99.73% of all
    values lie within 3σ, so
    breaching 3σ is not
    necessarily bad
    [skewed distribution, long tail labeled
    “possibly normal range”]

  47. the dirty secret: using SPC-based
    algorithms results in lots and lots
    of false positives, and probably lots
    of false negatives as well

  48. no way to retroactively find the
    false negatives short of combing
    with human eyes!

  49. how do we
    combat this?*
    *warning!
    ideas!

  50. we could always use
    custom fit models...

  51. ...after all, as long as the
    *errors* from the model
    are normally distributed,
    we can use 3σ

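
One way to sketch that idea: fit a trend model, then apply the 3σ test to the model's errors rather than the raw values, so a steadily growing metric isn't flagged just for growing. Hand-rolled least squares to stay dependency-free; names and data are illustrative:

```python
import statistics

def linear_residuals(ys):
    """Fit a least-squares line to the series and return the
    residuals (observed minus fitted)."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = statistics.mean(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope /= sum((x - x_mean) ** 2 for x in range(n))
    intercept = y_mean - slope * x_mean
    return [y - (slope * x + intercept) for x, y in enumerate(ys)]

def residual_anomaly(ys, sigmas=3):
    """3σ test on the model's errors instead of the raw series."""
    resid = linear_residuals(ys)
    return abs(resid[-1]) > sigmas * statistics.pstdev(resid[:-1])

trend = [float(i) for i in range(30)]
print(residual_anomaly(trend + [30.0]))  # on-trend point: not anomalous
print(residual_anomaly(trend + [60.0]))  # jump off the trend: anomalous
```
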
  52. Parameters are cool!
    a pretty decent forecast
    based on an artisanal
    handcrafted model

  53. but fitting models is
    hard, even by hand.

  54. possible to implement a class of
    ML algorithms that determine
    models based on distribution of
    errors, using Q-Q plots

  55. Q-Q plots can also be used to
    determine if the PDF has
    changed, although hard to do
    with limited sample size

  56. consensus: throw lots of
    different models at a series,
    hope it all shakes out.

  57. [yes] [yes] [no] [no] [yes] [yes]
    =
    anomaly!

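
The voting scheme above can be sketched as a quorum over independent detectors. The three toy detectors below are stand-ins for illustration, not the talk's actual algorithms:

```python
import statistics

def deviates_from_mean(series):
    head, latest = series[:-1], series[-1]
    return abs(latest - statistics.mean(head)) > 3 * statistics.pstdev(head)

def deviates_from_median(series):
    head, latest = series[:-1], series[-1]
    return abs(latest - statistics.median(head)) > 3 * statistics.pstdev(head)

def outside_historic_range(series):
    head, latest = series[:-1], series[-1]
    return latest > max(head) or latest < min(head)

def consensus(series, detectors, quorum=2):
    """Anomalous only if at least `quorum` detectors vote yes,
    which damps the false positives of any single model."""
    votes = sum(1 for detect in detectors if detect(series))
    return votes >= quorum

panel = [deviates_from_mean, deviates_from_median, outside_historic_range]
calm = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 10.5]
spike = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 45]
print(consensus(calm, panel))   # 0 votes: not anomalous
print(consensus(spike, panel))  # 3 votes: anomalous
```
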
  58. of course, if your models are
    all SPC-based, this doesn’t
    really get you anywhere

  59. use exponentially weighted
    moving averages to adapt faster

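
An EWMA weights recent points more heavily, so the baseline tracks a level shift much faster than a plain mean does. A minimal sketch (the smoothing factor and data are illustrative):

```python
def ewma(series, alpha=0.3):
    """Exponentially weighted moving average: each new point pulls
    the running average toward it by a factor of `alpha`."""
    avg = series[0]
    for x in series[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg

# after a jump from 10 to 20, the EWMA has nearly caught up
# while the plain mean still lags far behind
shifted = [10.0] * 20 + [20.0] * 5
print(ewma(shifted))                # about 18.3
print(sum(shifted) / len(shifted))  # 12.0
```
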
  60. fourier transforms to
    detect seasonality

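
The idea: take the series' Fourier transform and look for a strong peak; the peak's frequency gives the season length. A naive O(n²) DFT sketch using only the stdlib (a real implementation would use an FFT library; names and data are illustrative):

```python
import cmath
import math

def dominant_period(series):
    """Return the length (in samples) of the strongest cycle,
    found as the peak of a discrete Fourier transform."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # drop the DC component
    best_k, best_power = 1, 0.0
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return n / best_k  # samples per cycle

# ten days of a clean 24-sample daily cycle
daily = [math.sin(2 * math.pi * i / 24) for i in range(240)]
print(dominant_period(daily))  # 24.0
```
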
  61. second order anomalies: is the
    series “anomalously anomalous”?

  62. ...this is all very hard.

  63. so, we can either
    change what we expect
    of monitoring...

  64. ...and treat it as a way of
    building noisy situational
    awareness, not absolute
    directives (alerts)...

  65. ...or we can change what we
    expect out of engineering...

  66. ...and construct strict
    specifications and expected
    values of all metrics.

  67. neither is going to happen.

  68. so we have to crack
    this algorithm nut.

  69. ...ugh.
    @abestanway
