MOM! My algorithms SUCK

Abe Stanway
September 19, 2013

Given at 2013 in Berlin.



  1. @abestanway MOM! my algorithms SUCK

  2. i know how to fix monitoring once and for all.

  3. a real human physically staring at a single metric 24/7

  4. that human will then alert a sleeping engineer when her metric does something weird

  5. Boom. Perfect Monitoring™.

  6. this works because humans are excellent visual pattern matchers*
    *there are, of course, many advanced statistical applications where signal cannot be determined from noise just by looking at the data.

  7. can we teach software to be as good at simple anomaly detection as humans are?

  8. let’s explore.

  9. anomalies = not “normal”

  10. humans can tell what “normal” is by just looking at a timeseries.

  11. the human definition: “if a datapoint is not within reasonable bounds, more or less, of what usually happens, it’s an anomaly”

  12. there are real statistics that describe what we mentally approximate

  13. None
  14. “what usually happens” the mean

  15. “more or less” the standard deviation

  16. “reasonable bounds” 3σ

  17. so, in math speak, a metric is anomalous if its latest datapoint is more than three standard deviations away from the mean
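
the rule on slide 17 fits in a few lines of Python. this is a minimal sketch, not code from the talk: the name `is_anomalous`, the `threshold` parameter, and the choice to estimate the mean and standard deviation from the points *before* the latest one are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(series, threshold=3.0):
    """Flag the latest datapoint if it sits more than `threshold`
    standard deviations away from the mean of the earlier points.

    Assumption: the baseline excludes the latest point, so a large
    spike cannot inflate its own reference statistics.
    """
    history, latest = series[:-1], series[-1]
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) > threshold * sigma
```

calling `is_anomalous(recent_values)` on a window of recent values answers only one question: is the newest point outside the 3σ band of the points before it?
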
  18. we have essentially derived statistical process control.

  19. pioneered in the 1920s. heavily used in industrial engineering for quality control on assembly lines.

  20. traditional control charts specification limits

  21. grounded in exchangeability: past = future

  22. needs to be stationary

  23. produced by independent random variables, with well-defined expected values

  24. this allows for statistical inference

  25. in other words, you need good lookin’ timeseries for this to work.

  26. normal distribution: a more concise definition of good lookin’
    [figure: normal curve centered on μ, with bands of 34.1%, 13.6%, and 2.1% on each side]

  27. if you’ve got a normal distribution, chances are you’ve got an exchangeable, stationary series produced by independent random variables

  28. 99.7% of values fall within 3σ of the mean

  29. if your datapoint is in here, it’s an anomaly.
    [figure: normal curve centered on μ, with bands of 34.1%, 13.6%, and 2.1% on each side]

  30. when only 0.3% lie beyond 3σ...

  31. ...you get a high signal to noise ratio...

  32. ...where “signal” indicates a fundamental state change, as opposed to a random, improbable variation.

  33. a fundamental state change in the process means a different probability distribution function that describes the process

  34. anomaly detection: determining when probability distribution function shifts have occurred, as early as possible.

  35. μ 1

  36. a new PDF that describes a new process (figure label: μ 1)

  37. drilling holes, sawing boards, forging steel

  38. snapped drill bit, teeth missing on table saw, steel, like,

  39. processes with well planned expected values that only suffer small, random deviances when working properly...

  40. ...and massive “deviances”, aka, probability function shifts, when working improperly.

  41. the bad news:

  42. server infrastructures aren’t like assembly lines

  43. systems are active participants in their own design

  44. processes don’t have well defined expected values

  45. they aren’t produced by genuinely independent random variables.

  46. large variance does not necessarily indicate poor quality

  47. they have seasonality

  48. skewed distributions! less than 99.73% of all values lie within 3σ, so breaching 3σ is not necessarily bad
    [figure labels: 3σ, possibly normal range]
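
to see the skew problem in numbers, the sketch below counts how many points of a sample fall outside 3σ. the two synthetic samples (a gaussian one and an exponential one), the seed, and the helper name `fraction_outside_3_sigma` are assumptions picked for illustration; none of them come from the talk.

```python
import random
from statistics import mean, pstdev


def fraction_outside_3_sigma(values):
    """Share of values lying more than three standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    outside = sum(1 for v in values if abs(v - mu) > 3 * sigma)
    return outside / len(values)


rng = random.Random(42)  # fixed seed so the comparison is repeatable

# Assumption: a bell-shaped metric and a long-tailed (skewed) metric.
normal_like = [rng.gauss(100, 10) for _ in range(20_000)]
skewed = [rng.expovariate(1.0) for _ in range(20_000)]

print("normal-like:", fraction_outside_3_sigma(normal_like))
print("skewed:     ", fraction_outside_3_sigma(skewed))
```

for the gaussian sample the share should land near 0.3%. for the exponential sample it should land near 2%, several times higher, even though nothing about that process has changed.
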
  49. the dirty secret: using SPC-based algorithms results in lots and lots of false positives, and probably lots of false negatives as well

  50. no way to retroactively find the false negatives short of combing with human eyes!

  51. how do we combat this?*
    *warning! ideas!

  52. we could always use custom fit models...

  53. ...after all, as long as the *errors* from the model are normally distributed, we can use 3σ
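
one way to read slides 52 and 53: fit a model, then apply the 3σ test to the model's errors instead of the raw values. the sketch below assumes the simplest possible custom model, a least-squares straight line over time. the line model and the names `fit_line` and `residual_is_anomalous` are illustrative assumptions, not the talk's method.

```python
from statistics import mean, stdev


def fit_line(ys):
    """Least-squares fit of y = slope * t + intercept for t = 0, 1, 2, ..."""
    ts = range(len(ys))
    t_bar, y_bar = mean(ts), mean(ys)
    slope = (sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
             / sum((t - t_bar) ** 2 for t in ts))
    return slope, y_bar - slope * t_bar


def residual_is_anomalous(series, threshold=3.0):
    """Apply the 3-sigma rule to the forecast error, not to the raw value."""
    history, latest = series[:-1], series[-1]
    if len(history) < 3:
        return False  # too few points to fit a line and estimate error spread
    slope, intercept = fit_line(history)
    residuals = [y - (slope * t + intercept) for t, y in enumerate(history)]
    forecast = slope * len(history) + intercept
    sigma = stdev(residuals)
    if sigma == 0:
        return latest != forecast
    return abs(latest - forecast) > threshold * sigma
```

a steadily growing metric keeps drifting away from its historical mean, so the plain 3σ test will eventually fire on it. here the same point passes as long as it stays close to the fitted trend.
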
  54. Parameters are cool! a pretty decent forecast based on an artisanal handcrafted model

  55. but fitting models is hard, even by hand.

  56. possible to implement a class of ML algorithms that determine models based on distribution of errors, using Q-Q plots

  57. Q-Q plots can also be used to determine if the PDF has changed, although hard to do with limited sample size
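
a Q-Q plot pairs the sorted, standardized errors with the quantiles a normal distribution would produce; normal errors give points near the diagonal. the sketch below computes those points without drawing anything. the plotting positions `(i + 0.5) / n` and the summary helper `max_qq_gap` are assumptions chosen for illustration, one convention among several.

```python
from statistics import NormalDist, mean, stdev


def qq_points(errors):
    """Return (theoretical, sample) quantile pairs against a standard normal."""
    n = len(errors)
    mu, sigma = mean(errors), stdev(errors)
    standard_normal = NormalDist()
    points = []
    for i, value in enumerate(sorted(errors)):
        p = (i + 0.5) / n  # plotting position, strictly inside (0, 1)
        theoretical = standard_normal.inv_cdf(p)
        standardized = (value - mu) / sigma
        points.append((theoretical, standardized))
    return points


def max_qq_gap(errors):
    """Largest vertical distance from the diagonal; small means 'looks normal'."""
    return max(abs(sample - theory) for theory, sample in qq_points(errors))
```

with a small sample the gap is noisy, which is the limited-sample-size caveat from slide 57.
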
  58. consensus: throw lots of different models at a series, hope it all shakes out.

  59. [yes] [yes] [no] [no] [yes] [yes] = anomaly!
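
the voting idea on slides 58 and 59 can be sketched as below. the slides do not say how many "yes" votes are needed, so the simple-majority rule (`min_fraction=0.5`) is an assumption, and the two toy detectors `three_sigma` and `far_from_median` are hypothetical stand-ins for "lots of different models".

```python
from statistics import mean, median, stdev


def three_sigma(series):
    """Toy detector: latest point is beyond 3 sigma of the earlier points."""
    history, latest = series[:-1], series[-1]
    return abs(latest - mean(history)) > 3 * stdev(history)


def far_from_median(series, factor=6.0):
    """Toy detector: latest point is far from the median, measured in
    median absolute deviations (the factor of 6 is an arbitrary choice)."""
    history, latest = series[:-1], series[-1]
    center = median(history)
    mad = median(abs(v - center) for v in history)
    return abs(latest - center) > factor * mad


def consensus_says_anomaly(series, detectors, min_fraction=0.5):
    """Return True when more than `min_fraction` of the detectors vote yes."""
    votes = [bool(detector(series)) for detector in detectors]
    return sum(votes) > min_fraction * len(votes)
```

with six detectors voting yes, yes, no, no, yes, yes, four of six clear the majority bar, which matches the slide's "= anomaly!".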

  60. of course, if your models are all SPC-based, this doesn’t really get you anywhere

  61. use exponentially weighted moving averages to adapt faster
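
an exponentially weighted moving average gives recent points more weight, so the baseline follows a level shift faster than a plain mean does. a minimal sketch, assuming a smoothing factor `alpha` between 0 and 1 (the default of 0.3 is an arbitrary illustrative value) and seeding the average with the first datapoint:

```python
def ewma(series, alpha=0.3):
    """Exponentially weighted moving average of a non-empty series.

    Each output is alpha * current value + (1 - alpha) * previous average,
    so a larger alpha forgets old data faster.
    """
    smoothed = []
    current = series[0]  # assumption: seed the average with the first point
    for value in series:
        current = alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed
```

for example, `ewma([0, 10, 10], alpha=0.5)` moves halfway toward each new value, giving `[0.0, 5.0, 7.5]`.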

  62. fourier transforms to detect seasonality
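
a rough way to look for seasonality is to find the frequency with the most power and read off its period. the sketch below uses a naive discrete Fourier transform from the standard library, not an FFT, so it only suits short series. the function name `dominant_period` is hypothetical, and taking the single strongest frequency as "the season" is a simplifying assumption.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest frequency component,
    or None if the series has no variation at all."""
    n = len(series)
    mean_value = sum(series) / n
    centered = [v - mean_value for v in series]  # drop the constant offset
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):  # k = number of cycles across the window
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return None if best_k is None else n / best_k
```

for a clean sine wave that repeats every 12 samples over a 120-sample window, this returns `12.0`. real metrics are noisier, and a strong trend can swamp the seasonal peak.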

  63. second order anomalies: is the series “anomalously anomalous”?

  64. ...this is all very hard.

  65. so, we can either change what we expect of monitoring...

  66. ...and treat it as a way of building noisy situational awareness, not absolute directives (alerts)...

  67. ...or we can change what we expect out of engineering...

  68. ...and construct strict specifications and expected values of all metrics.

  69. neither is going to happen.

  70. so we have to crack this algorithm nut.

  71. ...ugh. @abestanway