Given at Monitorama.eu 2013 in Berlin. http://vimeo.com/75183236
MOM! my algorithms SUCK
@abestanway
i know how to fix monitoring once and for all.
a real human physically staring at a single metric 24/7
that human will then alert a sleeping engineer when her metric does something weird
Boom. Perfect Monitoring™.
this works because humans are excellent visual pattern matchers*
*there are, of course, many advanced statistical applications where signal cannot be determined from noise just by looking at the data.
can we teach software to be as good at simple anomaly detection as humans are?
let’s explore.
anomalies = not “normal”
humans can tell what “normal” is by just looking at a timeseries.
the human definition:
“if a datapoint is not within reasonable bounds, more or less, of what usually happens, it’s an anomaly”
there are real statistics that describe what we mentally approximate
“what usually happens”: the mean
“more or less”: the standard deviation
“reasonable bounds”: 3σ
so, in math speak, a metric is anomalous if the latest datapoint is more than three standard deviations away from the mean, in either direction: |x - μ| > 3σ
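a minimal sketch of that rule in python, to make it concrete. the function name `is_anomalous`, the default threshold, and the choice to compute the mean and standard deviation from the earlier datapoints only (leaving the latest one out of its own baseline) are assumptions for illustration, not something the slides specify.

```python
from statistics import mean, pstdev


def is_anomalous(series, threshold=3.0):
    """Three-sigma check on the latest datapoint of `series`.

    Assumption: the baseline (mean and standard deviation) is computed
    from every datapoint except the latest one.
    """
    if len(series) < 2:
        raise ValueError("need at least two datapoints")
    *history, latest = series
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) > threshold * sigma


if __name__ == "__main__":
    print(is_anomalous([10, 11, 9, 10, 12, 10, 11, 9, 10, 50]))
    print(is_anomalous([10, 11, 9, 10, 12, 10, 11, 9, 10, 11]))
```

the first call ends in a spike and gets flagged; the second ends in an ordinary value and does not.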
we have essentially derived statistical process control.
pioneered in the 1920s. heavily used in industrial engineering for quality control on assembly lines.
traditional control charts
specification limits
grounded in exchangeability
past = future
needs to be stationary
produced by independent random variables, with well-defined expected values
this allows for statistical inference
in other words, you need good lookin’ timeseries for this to work.
normal distribution: a more concise definition of good lookin’
[figure: normal curve centered on μ; on each side, 34.1% of values lie within 1σ, 13.6% between 1σ and 2σ, and 2.1% between 2σ and 3σ]
if you’ve got a normal distribution, chances are you’ve got an exchangeable, stationary series produced by independent random variables
99.7% of values fall within 3σ of the mean
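the percentages can be checked straight from the normal CDF. a small sketch using python's standard library; the helper name `coverage` is made up for this example.

```python
from statistics import NormalDist


def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mu = 0, sigma = 1
    return z.cdf(k) - z.cdf(-k)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"within {k} sigma: {coverage(k):.2%}")
```

this gives roughly 68.3%, 95.4% and 99.7% for 1σ, 2σ and 3σ, so only about 0.3% of a truly normal series lands outside the 3σ band.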
[figure: normal curve around μ with the 34.1% / 13.6% / 2.1% bands]
if your datapoint is in here, it’s an anomaly.
when only 0.3% lie beyond 3σ...
...you get a high signal to noise ratio...
...where “signal” indicates a fundamental state change, as opposed to a random, improbable variation.
a fundamental state change in the process means a different probability distribution function that describes the process
anomaly detection:
determining when probability distribution function shifts have occurred, as early as possible.
[figure: PDF labeled μ1]
[figure: PDF labeled μ1]
a new PDF that describes a new process
drilling holes
sawing boards
forging steel
snapped drill bit
teeth missing on table saw
steel, like, melted
processes with well planned expected values that only suffer small, random deviances when working properly...
...and massive “deviances”, aka, probability function shifts, when working improperly.
the bad news:
server infrastructures aren’t like assembly lines
systems are active participants in their own design
processes don’t have well defined expected values
they aren’t produced by genuinely independent random variables.
large variance does not necessarily indicate poor quality
they have seasonality
skewed distributions!
less than 99.73% of all values lie within 3σ, so breaching 3σ is not necessarily bad
[figure: skewed distribution; the region past 3σ is labeled “possibly normal range”]
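to put a number on this, here is a sketch comparing a normal distribution with one skewed distribution. the exponential distribution is picked purely as an illustration (the slides don't name a specific one), and both function names are hypothetical.

```python
import math
from statistics import NormalDist


def normal_outside(k=3.0):
    """Fraction of a normal distribution farther than k sigma from the mean."""
    return 2 * NormalDist().cdf(-k)


def exponential_outside(k=3.0, rate=1.0):
    """Fraction of an exponential distribution farther than k sigma from its mean.

    For Exponential(rate), the mean and standard deviation are both 1 / rate.
    """
    mu = sigma = 1.0 / rate
    lower, upper = mu - k * sigma, mu + k * sigma
    below = 1 - math.exp(-rate * lower) if lower > 0 else 0.0
    above = math.exp(-rate * upper)
    return below + above


if __name__ == "__main__":
    print(f"normal:      {normal_outside():.2%}")
    print(f"exponential: {exponential_outside():.2%}")
```

for the exponential case the fraction outside 3σ is e^-4, a bit under 2%, versus about 0.27% for the normal case. so a perfectly healthy skewed metric breaches 3σ several times more often than the rule of thumb promises.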
the dirty secret: using SPC-based algorithms results in lots and lots of false positives, and probably lots of false negatives as well
no way to retroactively find the false negatives short of combing with human eyes!
how do we combat this?*
*warning! ideas!
we could always use custom fit models...
...after all, as long as the *errors* from the model are normally distributed, we can use 3σ
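a sketch of what "3σ on the errors" could look like. the model here is a plain least-squares line, chosen only because it is short; the slides don't commit to any particular model, and `fit_line` and `residual_anomaly` are made-up names.

```python
from statistics import mean, pstdev


def fit_line(ys):
    """Least-squares fit of y = a + b * t for t = 0 .. len(ys) - 1."""
    ts = range(len(ys))
    t_bar, y_bar = mean(ts), mean(ys)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    sxx = sum((t - t_bar) ** 2 for t in ts)
    b = sxy / sxx
    return y_bar - b * t_bar, b


def residual_anomaly(series, threshold=3.0):
    """Flag the latest datapoint if its forecast error exceeds 3 sigma
    of the model's past errors (assumption: a linear trend model)."""
    *history, latest = series
    a, b = fit_line(history)
    residuals = [y - (a + b * t) for t, y in enumerate(history)]
    sigma = pstdev(residuals)
    error = latest - (a + b * len(history))  # one-step-ahead forecast error
    if sigma == 0:
        return error != 0
    return abs(error) > threshold * sigma


if __name__ == "__main__":
    trend = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.2, 13.8, 16.1, 17.9]
    print(residual_anomaly(trend + [20.0]))
    print(residual_anomaly(trend + [30.0]))
```

the series climbs by about 2 per step, so 20.0 is right on the forecast and passes, while 30.0 is far outside the spread of past errors and gets flagged. the check runs on the model's errors, not on the raw values.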
Parameters are cool!
a pretty decent forecast based on an artisanal handcrafted model
but fitting models is hard, even by hand.
possible to implement a class of ML algorithms that determine models based on distribution of errors, using Q-Q plots
Q-Q plots can also be used to determine if the PDF has changed, although hard to do with limited sample size
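for reference, a Q-Q plot pairs each sorted error with the quantile a normal distribution would put at the same rank; if the errors are normal, the points fall on a straight line. a rough sketch, where the plotting positions `(i + 0.5) / n` and the use of a correlation score as a "straightness" measure are assumptions of this example, and both function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def qq_points(errors):
    """(theoretical normal quantile, observed value) pairs for a Q-Q plot.

    Assumes the errors are not all identical (sigma must be positive).
    """
    n = len(errors)
    ref = NormalDist(mean(errors), pstdev(errors))
    ordered = sorted(errors)
    return [(ref.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


def qq_correlation(errors):
    """Pearson correlation of the Q-Q points; close to 1 means roughly normal."""
    points = qq_points(errors)
    t_bar = mean(t for t, _ in points)
    s_bar = mean(s for _, s in points)
    cov = sum((t - t_bar) * (s - s_bar) for t, s in points)
    t_var = sum((t - t_bar) ** 2 for t, _ in points)
    s_var = sum((s - s_bar) ** 2 for _, s in points)
    return cov / (t_var * s_var) ** 0.5


if __name__ == "__main__":
    normal_like = [NormalDist(5, 2).inv_cdf((i + 0.5) / 50) for i in range(50)]
    skewed = [2 ** (i / 5) for i in range(50)]
    print(round(qq_correlation(normal_like), 3))
    print(round(qq_correlation(skewed), 3))
```

the normal-looking sample scores essentially 1, the skewed one noticeably lower. with only a handful of datapoints the score gets noisy, which is the limited-sample-size problem mentioned above.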
consensus: throw lots of different models at a series, hope it all shakes out.
[yes] [yes] [no] [no] [yes] [yes] = anomaly!
of course, if your models are all SPC-based, this doesn’t really get you anywhere
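the voting idea can be sketched in a few lines. everything named here (`consensus`, the `quorum` parameter, the two toy detectors and their thresholds) is hypothetical and only meant to show the shape of the approach; note that the second detector uses the median instead of the mean, so the ensemble is not purely 3σ-based.

```python
from statistics import mean, median, pstdev


def three_sigma(series, threshold=3.0):
    """Vote yes if the latest point is over 3 sigma from the mean of the rest."""
    *history, latest = series
    sigma = pstdev(history)
    return sigma > 0 and abs(latest - mean(history)) > threshold * sigma


def median_deviation(series, threshold=6.0):
    """Vote yes if the latest point is far from the median, measured in
    median absolute deviations (threshold is an arbitrary example value)."""
    *history, latest = series
    med = median(history)
    mad = median(abs(x - med) for x in history)
    return mad > 0 and abs(latest - med) > threshold * mad


def consensus(series, detectors, quorum=0.5):
    """Return (is_anomaly, votes); anomaly means more than `quorum`
    of the detectors voted yes."""
    votes = [bool(detect(series)) for detect in detectors]
    return sum(votes) > quorum * len(votes), votes


if __name__ == "__main__":
    spike = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
    print(consensus(spike, [three_sigma, median_deviation]))
```

with six detectors voting yes, yes, no, no, yes, yes, four out of six clears the default quorum and the point counts as an anomaly.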
use exponentially weighted moving averages to adapt faster
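a sketch of an EWMA-based check, assuming the common incremental update for an exponentially weighted mean and variance. the smoothing factor `alpha=0.3`, the threshold, and the function name are example choices, not values from the talk.

```python
def ewma_anomaly(series, alpha=0.3, threshold=3.0):
    """Flag the latest datapoint if it is more than `threshold`
    exponentially weighted standard deviations from the EWMA.

    Recent datapoints carry more weight, so the baseline follows
    level shifts instead of averaging over the whole history.
    """
    *history, latest = series
    avg, var = history[0], 0.0
    for x in history[1:]:
        diff = x - avg
        incr = alpha * diff
        avg += incr                            # weighted mean update
        var = (1 - alpha) * (var + diff * incr)  # weighted variance update
    sigma = var ** 0.5
    if sigma == 0:
        return latest != avg
    return abs(latest - avg) > threshold * sigma


if __name__ == "__main__":
    low = [10, 11, 9, 10, 12, 10, 11, 9, 10]
    high = [20, 21, 19, 20, 22, 20, 21, 19, 20]
    print(ewma_anomaly(low + high + [21]))  # normal for the new level
    print(ewma_anomaly(low + high + [10]))  # back at the old level
```

after the series moves from around 10 to around 20, the weighted average sits near 20, so 21 passes and a sudden 10 is flagged. a plain mean over the whole history would sit between the two levels, with a standard deviation inflated by the shift itself.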
fourier transforms to detect seasonality
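one way this could work: take the discrete fourier transform of the series and look for the frequency with the most power. the sketch below uses a naive O(n²) transform from the standard library to stay self-contained (a real implementation would likely use an FFT), and `dominant_period` is a made-up name.

```python
import cmath
import math


def dominant_period(series):
    """Period (in samples) of the strongest frequency component,
    or None if the series has no variation at all."""
    n = len(series)
    avg = sum(series) / n
    centered = [x - avg for x in series]  # drop the constant (k = 0) term
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return None if best_k is None else n / best_k


if __name__ == "__main__":
    # Four full cycles of a 12-sample season.
    seasonal = [math.sin(2 * math.pi * t / 12) for t in range(48)]
    print(dominant_period(seasonal))
```

once the period is known, a datapoint can be compared with the same point in earlier cycles instead of with the overall mean.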
second order anomalies: is the series “anomalously anomalous”?
...this is all very hard.
so, we can either change what we expect of monitoring...
...and treat it as a way of building noisy situational awareness, not absolute directives (alerts)...
...or we can change what we expect out of engineering...
...and construct strict specifications and expected values of all metrics.
neither is going to happen.
so we have to crack this algorithm nut.
...ugh.
@abestanway