Given at Monitorama.eu 2013 in Berlin. http://vimeo.com/75183236

MOM! my algorithms SUCK
@abestanway

i know how to fix monitoring once and for all.

a real human physically staring at a single metric 24/7

that human will then alert a sleeping engineer when her metric does something weird

Boom. Perfect Monitoring™.

this works because humans are excellent visual pattern matchers*

*there are, of course, many advanced statistical applications where signal cannot be determined from noise just by looking at the data.

can we teach software to be as good at simple anomaly detection as humans are?

let’s explore.

anomalies = not “normal”

humans can tell what “normal” is by just looking at a timeseries.

the human definition: “if a datapoint is not within reasonable bounds, more or less, of what usually happens, it’s an anomaly”

there are real statistics that describe what we mentally approximate

“what usually happens”: the mean

“more or less”: the standard deviation

“reasonable bounds”: 3σ

so, in math speak, a metric is anomalous if the absolute value of the difference between the latest datapoint and the mean is over three standard deviations
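a minimal sketch of that rule in python, assuming the series is a plain list of numbers and that the mean and standard deviation are taken over everything before the latest datapoint (the slides don't say which window to use). `is_anomalous` and `threshold` are names made up for illustration.

```python
from statistics import mean, stdev

def is_anomalous(series, threshold=3.0):
    """True if the latest datapoint is more than `threshold` standard
    deviations away from the mean of the datapoints before it."""
    history, latest = series[:-1], series[-1]
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # a perfectly flat history: any change at all is "weird"
        return latest != mu
    return abs(latest - mu) > threshold * sigma

# hypothetical metric: hovers around 10, then jumps
print(is_anomalous([10, 11, 9, 10, 12, 10, 9, 11, 10, 50]))  # True
print(is_anomalous([10, 11, 9, 10, 12, 10, 9, 11, 10, 11]))  # False
```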

we have essentially derived statistical process control.

pioneered in the 1920s. heavily used in industrial engineering for quality control on assembly lines.

traditional control charts: specification limits

grounded in exchangeability: past = future

needs to be stationary

produced by independent random variables, with well-defined expected values

this allows for statistical inference

in other words, you need good lookin’ timeseries for this to work.

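normal distribution: a more concise definition of good lookin’

[figure: normal curve centered on μ; 34.1% of values in each band between μ and μ ± σ, 13.6% in each band between 1σ and 2σ, 2.1% in each band between 2σ and 3σ]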

if you’ve got a normal distribution, chances are you’ve got an exchangeable, stationary series produced by independent random variables

99.7% of values fall within 3σ of the mean

if your datapoint is in here, it’s an anomaly.

[figure: the normal curve again, with the anomalous region marked]

when only 0.3% lie beyond 3σ...
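those percentages come straight out of the normal distribution's error function. a quick sketch to check them, using only the standard library; `fraction_within` is a made-up helper name.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    inside = fraction_within(k)
    print(f"within {k} sigma: {inside * 100:.2f}%  beyond: {(1 - inside) * 100:.2f}%")
# within 1 sigma: 68.27%, within 2 sigma: 95.45%, within 3 sigma: 99.73%
```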

...you get a high signal to noise ratio...

...where “signal” indicates a fundamental state change, as opposed to a random, improbable variation.

a fundamental state change in the process means a different probability distribution function that describes the process

anomaly detection: determining when probability distribution function shifts have occurred, as early as possible.

μ1

a new PDF that describes a new process (figure label: μ1)

drilling holes
sawing boards
forging steel

snapped drill bit
teeth missing on table saw
steel, like, melted

processes with well planned expected values that only suffer small, random deviances when working properly...

...and massive “deviances”, aka, probability function shifts, when working improperly.

the bad news:

server infrastructures aren’t like assembly lines

systems are active participants in their own design

processes don’t have well defined expected values

they aren’t produced by genuinely independent random variables.

large variance does not necessarily indicate poor quality

they have seasonality

skewed distributions! less than 99.73% of all values lie within 3σ, so breaching 3σ is not necessarily bad

[figure: skewed distribution with the 3σ line marked; the region beyond it is labeled “possibly normal range”]
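a small simulation makes the point. it assumes an exponential distribution as a stand-in for a skewed metric (think latencies); that choice, the seed, and the sample size are all illustrative. in theory about 0.27% of a normal sample lands beyond 3σ, versus roughly 1.8% of an exponential one.

```python
import random
from statistics import mean, pstdev

def fraction_beyond_3_sigma(values):
    """Share of values more than three standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    outside = sum(1 for v in values if abs(v - mu) > 3 * sigma)
    return outside / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(50_000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(50_000)]

print("normal:", fraction_beyond_3_sigma(normal_sample))
print("skewed:", fraction_beyond_3_sigma(skewed_sample))
```

with the skewed sample, a 3σ alert fires several times more often on data that is behaving exactly as it always does.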

the dirty secret: using SPC-based algorithms results in lots and lots of false positives, and probably lots of false negatives as well

no way to retroactively find the false negatives short of combing with human eyes!

how do we combat this?*

*warning! ideas!

we could always use custom fit models...

...after all, as long as the *errors* from the model are normally distributed, we can use 3σ
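a sketch of what that could look like: run the 3σ test on the model's errors instead of on the raw values. the linear `trend` model, the noise values, and the function names are hypothetical; any function mapping a time index to a predicted value would slot in.

```python
from statistics import mean, stdev

def residual_is_anomalous(series, model, threshold=3.0):
    """Apply the three-sigma rule to the model's errors (actual minus
    predicted) rather than to the raw datapoints."""
    errors = [value - model(t) for t, value in enumerate(series)]
    past, latest = errors[:-1], errors[-1]
    mu, sigma = mean(past), stdev(past)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) > threshold * sigma

def trend(t):
    # hypothetical handcrafted model: steady linear growth
    return 5.0 + 2.0 * t

noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1]
healthy = [trend(t) + e for t, e in enumerate(noise)]
broken = healthy[:-1] + [healthy[-1] + 10.0]

print(residual_is_anomalous(healthy, trend))  # False
print(residual_is_anomalous(broken, trend))   # True
```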

Parameters are cool! a pretty decent forecast based on an artisanal handcrafted model

but fitting models is hard, even by hand.

possible to implement a class of ML algorithms that determine models based on distribution of errors, using Q-Q plots

Q-Q plots can also be used to determine if the PDF has changed, although hard to do with limited sample size
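a Q-Q plot pairs each sorted observation with the quantile a reference distribution would predict at that rank; if the points fall on a straight line, the distributions match. the sketch below builds those pairs against a fitted normal and boils the "is it a straight line?" eyeball test down to a correlation coefficient. that numeric shortcut is an assumption made here for illustration, not something from the talk, and the function names are made up.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(errors):
    """Pair each sorted error with the normal quantile expected at its rank."""
    n = len(errors)
    fitted = NormalDist(mean(errors), stdev(errors))
    ordered = sorted(errors)
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

def qq_correlation(errors):
    """Pearson correlation of theoretical vs. observed quantiles.
    Close to 1.0 means the points sit near a straight line."""
    points = qq_points(errors)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(7)  # fixed seed so the run is repeatable
normal_errors = [rng.gauss(0, 1) for _ in range(2000)]
skewed_errors = [rng.expovariate(1.0) for _ in range(2000)]

print(qq_correlation(normal_errors))  # very close to 1
print(qq_correlation(skewed_errors))  # noticeably lower
```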

consensus: throw lots of different models at a series, hope it all shakes out.

[yes] [yes] [no] [no] [yes] [yes] = anomaly!
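the voting itself is the easy part. a sketch, assuming each detector is a function that takes the series and returns a yes/no, and that a simple majority is enough to call it (the slide shows four of six; the exact rule is an assumption here).

```python
def consensus(series, detectors, min_votes=None):
    """Ask every detector for a verdict; report an anomaly when at
    least `min_votes` of them say yes (default: simple majority)."""
    votes = [bool(detector(series)) for detector in detectors]
    if min_votes is None:
        min_votes = len(votes) // 2 + 1
    return sum(votes) >= min_votes

# stand-in detectors that just reproduce the votes on the slide
slide_votes = [True, True, False, False, True, True]
detectors = [lambda series, v=v: v for v in slide_votes]

print(consensus([1, 2, 3], detectors))  # True: 4 of 6 say yes
```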

of course, if your models are all SPC-based, this doesn’t really get you anywhere

use exponentially weighted moving averages to adapt faster
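one way to sketch this: keep an exponentially weighted mean and variance, so recent datapoints count more than old ones, then apply the same 3σ test against those. the smoothing factor `alpha=0.3` and the function name are arbitrary choices for illustration.

```python
def ewma_is_anomalous(series, alpha=0.3, threshold=3.0):
    """Three-sigma test against an exponentially weighted mean and
    standard deviation built from the datapoints before the latest one."""
    history, latest = series[:-1], series[-1]
    avg = history[0]
    var = 0.0
    for x in history[1:]:
        diff = x - avg
        incr = alpha * diff
        avg += incr                              # weighted mean moves toward x
        var = (1 - alpha) * (var + diff * incr)  # weighted variance update
    std = var ** 0.5
    if std == 0:
        return latest != avg
    return abs(latest - avg) > threshold * std

steady = [10, 11, 9, 10, 11, 9, 10, 11, 9, 10]
print(ewma_is_anomalous(steady + [30]))    # True
print(ewma_is_anomalous(steady + [10.5]))  # False
```

a larger `alpha` forgets the past faster, which adapts to a new normal sooner but also makes the baseline jumpier.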

fourier transforms to detect seasonality
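a sketch of the idea with a naive discrete fourier transform from the standard library: find the frequency carrying the most power and report its period. it assumes evenly spaced samples; the O(n²) loop is fine for a demo, but a real FFT library is the better tool for long series. `dominant_period` is a made-up name.

```python
import cmath
import math

def dominant_period(series):
    """Period (in samples) of the strongest non-constant frequency,
    found with a naive discrete Fourier transform."""
    n = len(series)
    avg = sum(series) / n
    centered = [x - avg for x in series]  # drop the constant component
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k is not None else None

# hypothetical metric: one cycle every 24 samples, ten cycles long
series = [10 + 5 * math.sin(2 * math.pi * t / 24) for t in range(240)]
print(dominant_period(series))  # 24.0
```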

second order anomalies: is the series “anomalously anomalous”?
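one possible reading of that, sketched below: count how often a first-order detector fires in each window of time, then ask whether the latest count is itself out of line with the earlier counts. this interpretation, the window size, and all names are assumptions for illustration.

```python
from statistics import mean, stdev

def anomalously_anomalous(flags, window=10, threshold=3.0):
    """`flags` holds one boolean per datapoint from a first-order detector.
    Count flags per window, then three-sigma test the latest count."""
    counts = [
        sum(flags[i:i + window])
        for i in range(0, len(flags) - window + 1, window)
    ]
    if len(counts) < 3:
        return False  # not enough windows to judge
    past, latest = counts[:-1], counts[-1]
    mu, sigma = mean(past), stdev(past)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) > threshold * sigma

def window_with(k, size=10):
    # hypothetical helper: a window in which k datapoints were flagged
    return [True] * k + [False] * (size - k)

# a noisy metric that usually trips the detector 2 to 4 times per window
usual = [f for k in (3, 4, 2, 3, 4, 2, 3) for f in window_with(k)]
print(anomalously_anomalous(usual + window_with(10)))  # True
print(anomalously_anomalous(usual + window_with(4)))   # False
```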

...this is all very hard.

so, we can either change what we expect of monitoring...

...and treat it as a way of building noisy situational awareness, not absolute directives (alerts)...

...or we can change what we expect out of engineering...

...and construct strict specifications and expected values of all metrics.

neither is going to happen.

so we have to crack this algorithm nut.

...ugh.
@abestanway