a real human physically staring
at a single metric 24/7
Slide 4
that human will then alert a
sleeping engineer when her
metric does something weird
Slide 5
Boom. Perfect Monitoring™.
Slide 6
this works because humans are
excellent visual pattern matchers*
*there are, of course, many advanced
statistical applications where signal
cannot be determined from noise just
by looking at the data.
Slide 7
can we teach software to be
as good at simple anomaly
detection as humans are?
Slide 8
let’s explore.
Slide 9
anomalies = not “normal”
Slide 10
humans can tell what
“normal” is by just looking
at a timeseries.
Slide 11
the human definition:
“if a datapoint is not within
reasonable bounds, more or
less, of what usually happens,
it’s an anomaly”
Slide 12
there are real statistics
that describe what we
mentally approximate
Slide 13
Slide 14
“what usually happens”
the mean
Slide 15
“more or less”
the standard deviation
Slide 16
“reasonable bounds”
3σ
Slide 17
so, in math speak, a metric is
anomalous if the latest datapoint
deviates from the mean by more than
three standard deviations: |x − μ| > 3σ
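(a minimal sketch of that rule in Python; the function and the sample series are made up here, just to make the math concrete)

```python
import statistics

def is_anomalous(history, latest):
    # the rule from the slide: flag the latest datapoint if it
    # deviates from the mean by more than three standard deviations
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(latest - mu) > 3 * sigma

history = [99.2, 100.1, 100.4, 99.8, 100.0, 99.5, 100.3]
print(is_anomalous(history, 100.2))  # False: well within 3 sigma
print(is_anomalous(history, 140.0))  # True: way outside 3 sigma
```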
Slide 18
we have essentially derived
statistical process control.
Slide 19
pioneered in the 1920s.
heavily used in
industrial engineering
for quality control on
assembly lines.
Slide 20
traditional control charts
specification limits
Slide 21
grounded in
exchangeability
past = future
Slide 22
needs to be stationary
Slide 23
produced by independent
random variables, with well-
defined expected values
Slide 24
this allows for
statistical inference
Slide 25
in other words, you need good
lookin’ timeseries for this to work.
Slide 26
normal distribution:
a more concise
definition of
good lookin’
[bell curve: 34.1% of values within one σ of μ on each side, 13.6% between one and two σ, 2.1% between two and three σ]
Slide 27
if you’ve got a normal distribution, chances are
you’ve got an exchangeable, stationary series
produced by independent random variables
Slide 28
99.7% fall within 3σ
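(the figure is easy to verify, assuming scipy is around; a quick sketch:)

```python
from scipy.stats import norm

# probability mass of a normal distribution within mu +/- 3 sigma
p = norm.cdf(3) - norm.cdf(-3)
print(f"{p:.4%}")  # 99.7300%
```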
Slide 29
[the same bell curve, with the region beyond 3σ highlighted]
if your datapoint is in
here, it’s an anomaly.
Slide 30
when only 0.3% lie beyond 3σ...
Slide 31
...you get a high
signal to noise ratio...
Slide 32
...where “signal” indicates a
fundamental state change, as opposed
to a random, improbable variation.
Slide 33
a fundamental state change in
the process means a different
probability distribution function
that describes the process
Slide 34
anomaly detection:
determining when probability
distribution function shifts have
occurred, as early as possible.
Slide 35
[chart: a probability distribution centered at μ]
Slide 36
[chart: the distribution again, shifted away from μ]
a new PDF that
describes a new
process
Slide 37
drilling holes
sawing boards
forging steel
Slide 38
snapped drill bit
teeth missing on table saw
steel, like, melted
Slide 39
processes with well planned
expected values that only suffer
small, random deviances when
working properly...
Slide 40
...and massive “deviances”, aka,
probability function shifts, when
working improperly.
Slide 41
the bad news:
Slide 42
server infrastructures
aren’t like assembly lines
Slide 43
systems are active
participants in their
own design
Slide 44
processes don’t have well
defined expected values
Slide 45
they aren’t produced by genuinely
independent random variables.
Slide 46
large variance does not
necessarily indicate poor quality
Slide 47
they have seasonality
Slide 48
skewed distributions!
less than 99.73% of all
values lie within 3σ, so
breaching 3σ is not
necessarily bad
[chart: a skewed distribution whose possibly-normal range extends past the 3σ line]
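(to see how far off 3σ coverage gets, here’s a sketch using a lognormal distribution as a stand-in for a skewed metric; the distribution choice is an assumption, not from the slides)

```python
import numpy as np

rng = np.random.default_rng(42)
# a heavily skewed (lognormal) "metric" -- nothing like a bell curve
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

mu, sigma = x.mean(), x.std()
within = np.mean(np.abs(x - mu) <= 3 * sigma)
print(f"within 3 sigma: {within:.2%}")  # ~98%, noticeably below 99.73%
```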
Slide 49
the dirty secret: using SPC-based
algorithms results in lots and lots
of false positives, and probably lots
of false negatives as well
Slide 50
no way to retroactively find the
false negatives short of combing
with human eyes!
Slide 51
how do we
combat this?*
*warning!
ideas!
Slide 52
we could always use
custom fit models...
Slide 53
...after all, as long as the
*errors* from the model
are normally distributed,
we can use 3σ
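(a minimal sketch of the idea, with a simple linear trend standing in for the artisanal model; the 3σ test moves from the raw series to the model’s errors)

```python
import numpy as np

def residual_anomalies(t, y):
    # fit a straight-line trend, then apply the 3-sigma rule to the
    # residuals (the model's errors) instead of the raw series
    slope, intercept = np.polyfit(t, y, deg=1)
    residuals = y - (slope * t + intercept)
    return np.abs(residuals) > 3 * residuals.std()

t = np.arange(100, dtype=float)
y = 0.5 * t + np.random.default_rng(0).normal(0, 1, 100)
y[60] += 10  # a spike the raw 3-sigma test would miss under the trend
print(np.nonzero(residual_anomalies(t, y))[0])  # [60]
```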
Slide 54
Parameters are cool!
a pretty decent forecast
based on an artisanal
handcrafted model
Slide 55
but fitting models is
hard, even by hand.
Slide 56
possible to implement a class of
ML algorithms that determine
models based on distribution of
errors, using Q-Q plots
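(one way to operationalize the Q-Q idea, sketched with scipy’s probplot; the 0.99 cutoff is an arbitrary assumption)

```python
import numpy as np
from scipy import stats

def errors_look_normal(residuals, r_threshold=0.99):
    # probplot fits residual quantiles against normal quantiles and
    # reports the correlation r; near 1 means the errors look normal
    # enough for the 3-sigma rule to apply
    _, (slope, intercept, r) = stats.probplot(residuals, dist="norm")
    return r >= r_threshold

rng = np.random.default_rng(1)
print(errors_look_normal(rng.normal(0, 1, 500)))    # True: normal errors
print(errors_look_normal(rng.exponential(1, 500)))  # False: skewed errors
```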
Slide 57
Q-Q plots can also be used to
determine if the PDF has
changed, although hard to do
with limited sample size
Slide 58
consensus: throw lots of
different models at a series,
hope it all shakes out.
Slide 59
[yes] [yes] [no] [no] [yes] [yes]
=
anomaly!
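(a sketch of that vote; the three detectors here are hypothetical stand-ins)

```python
def consensus_anomaly(value, detectors, quorum=0.5):
    # call it an anomaly only if more than `quorum` of the detectors agree
    votes = [detect(value) for detect in detectors]
    return sum(votes) / len(votes) > quorum

detectors = [
    lambda v: v > 110,           # static threshold
    lambda v: abs(v - 100) > 9,  # 3-sigma stand-in
    lambda v: v < 0,             # "negative values are impossible"
]
print(consensus_anomaly(112, detectors))  # 2 of 3 say yes -> anomaly
```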
Slide 60
of course, if your models are
all SPC-based, this doesn’t
really get you anywhere
Slide 61
use exponentially weighted
moving averages to adapt faster
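(a minimal EWMA sketch; alpha controls how aggressively it forgets the past)

```python
def ewma(series, alpha=0.3):
    # recent points dominate the average, so the baseline adapts
    # to level shifts faster than a plain mean does
    avg = series[0]
    out = [avg]
    for x in series[1:]:
        avg = alpha * x + (1 - alpha) * avg
        out.append(avg)
    return out

print(ewma([10, 10, 10, 20, 20, 20]))
# -> [10, 10, 10, 13.0, 15.1, ~16.57]: chases the shift within a few points
```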
Slide 62
Fourier transforms to
detect seasonality
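(a sketch of spotting the dominant cycle with numpy’s FFT; the hourly-sampled daily cycle is made-up demo data)

```python
import numpy as np

def dominant_period(series, sample_spacing=1.0):
    # the strongest peak in the spectrum is the series' dominant cycle
    detrended = series - np.mean(series)
    spectrum = np.abs(np.fft.rfft(detrended))
    freqs = np.fft.rfftfreq(len(series), d=sample_spacing)
    peak = np.argmax(spectrum[1:]) + 1  # skip the zero-frequency bin
    return 1.0 / freqs[peak]

t = np.arange(24 * 14)  # two weeks of hourly datapoints
series = np.sin(2 * np.pi * t / 24) + np.random.default_rng(2).normal(0, 0.2, t.size)
print(dominant_period(series))  # ~24.0, i.e. a daily cycle
```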
Slide 63
second order anomalies: is the
series “anomalously anomalous”?
Slide 64
...this is all very hard.
Slide 65
so, we can either
change what we expect
of monitoring...
Slide 66
...and treat it as a way of
building noisy situational
awareness, not absolute
directives (alerts)...
Slide 67
...or we can change what we
expect out of engineering...
Slide 68
...and construct strict
specifications and expected
values of all metrics.