Slide 1

@abestanway MOM! my algorithms SUCK

Slide 2

i know how to fix monitoring once and for all.

Slide 3

a real human physically staring at a single metric 24/7

Slide 4

that human will then alert a sleeping engineer when her metric does something weird

Slide 5

Boom. Perfect Monitoring™.

Slide 6

this works because humans are excellent visual pattern matchers*

*there are, of course, many advanced statistical applications where signal cannot be determined from noise just by looking at the data.

Slide 7

can we teach software to be as good at simple anomaly detection as humans are?

Slide 8

let’s explore.

Slide 9

anomalies = not “normal”

Slide 10

humans can tell what “normal” is by just looking at a timeseries.

Slide 11

the human definition: “if a datapoint is not within reasonable bounds, more or less, of what usually happens, it’s an anomaly”

Slide 12

there are real statistics that describe what we mentally approximate

Slide 13

No content

Slide 14

“what usually happens” the mean

Slide 15

“more or less” the standard deviation

Slide 16

“reasonable bounds” 3σ

Slide 17

so, in math speak, a metric is anomalous if the latest datapoint deviates from the mean by more than three standard deviations: |x − μ| > 3σ
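
As a quick aside (not from the deck), that rule fits in a few lines of Python; `is_anomalous` and the sample series are made up for illustration:

```python
import statistics

def is_anomalous(series, latest, sigmas=3.0):
    """Flag `latest` if it sits more than `sigmas` standard
    deviations away from the series mean: |x - mu| > 3*sigma."""
    mean = statistics.fmean(series)
    stdev = statistics.pstdev(series)
    if stdev == 0:
        return False  # a perfectly flat series has no bands to breach
    return abs(latest - mean) > sigmas * stdev

history = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
print(is_anomalous(history, 10.5))  # → False (within bounds)
print(is_anomalous(history, 25))    # → True (far outside 3σ)
```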

Slide 18

we have essentially derived statistical process control.

Slide 19

pioneered in the 1920s. heavily used in industrial engineering for quality control on assembly lines.

Slide 20

traditional control charts: specification limits

Slide 21

grounded in exchangeability: past = future

Slide 22

needs to be stationary

Slide 23

produced by independent random variables, with well-defined expected values

Slide 24

this allows for statistical inference

Slide 25

in other words, you need good lookin’ timeseries for this to work.

Slide 26

normal distribution: a more concise definition of good lookin’

(figure: bell curve with bands at μ±1σ, μ±2σ, μ±3σ covering 34.1%, 13.6%, and 2.1% on each side)

Slide 27

if you’ve got a normal distribution, chances are you’ve got an exchangeable, stationary series produced by independent random variables

Slide 28

99.7% of values fall within 3σ
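
The 68–95–99.7 coverage can be checked directly with Python's standard library (a quick aside, not part of the deck):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, sigma 1

def within(k):
    """Probability mass within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

print(round(within(1), 4))  # → 0.6827
print(round(within(2), 4))  # → 0.9545
print(round(within(3), 4))  # → 0.9973
```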

Slide 29

(figure: the same bell curve; if your datapoint is in the tail beyond 3σ, it’s an anomaly.)

Slide 30

when only 0.3% lie beyond 3σ...

Slide 31

...you get a high signal-to-noise ratio...

Slide 32

...where “signal” indicates a fundamental state change, as opposed to a random, improbable variation.

Slide 33

a fundamental state change means the process is now described by a different probability distribution function

Slide 34

anomaly detection: determining when probability distribution function shifts have occurred, as early as possible.

Slide 35

(figure: a distribution with a new mean, μ₁)

Slide 36

a new PDF that describes a new process (figure: distribution shifted to a new mean, μ₁)

Slide 37

drilling holes sawing boards forging steel

Slide 38

snapped drill bit teeth missing on table saw steel, like, melted

Slide 39

processes with well-planned expected values that only suffer small, random deviances when working properly...

Slide 40

...and massive “deviances”, aka, probability function shifts, when working improperly.

Slide 41

the bad news:

Slide 42

server infrastructures aren’t like assembly lines

Slide 43

systems are active participants in their own design

Slide 44

processes don’t have well-defined expected values

Slide 45

they aren’t produced by genuinely independent random variables.

Slide 46

large variance does not necessarily indicate poor quality

Slide 47

they have seasonality

Slide 48

skewed distributions! less than 99.73% of all values lie within 3σ, so breaching 3σ is not necessarily bad. (figure: a skewed distribution where the region beyond 3σ is a possibly normal range)
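
To see how badly skew breaks the 3σ promise, here is a small simulation of my own (using an exponential series as a stand-in for a skewed metric like request latency):

```python
import random
import statistics

random.seed(7)
# An exponential series is heavily right-skewed, like many latency metrics.
skewed = [random.expovariate(1.0) for _ in range(10_000)]

mean = statistics.fmean(skewed)
sd = statistics.pstdev(skewed)
breaches = sum(abs(x - mean) > 3 * sd for x in skewed)

# A normal series would breach ~0.27% of the time; this one breaches
# far more often, so each individual breach alert means much less.
print(breaches / len(skewed))
```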

Slide 49

the dirty secret: using SPC-based algorithms results in lots and lots of false positives, and probably lots of false negatives as well

Slide 50

no way to retroactively find the false negatives short of combing with human eyes!

Slide 51

how do we combat this?*

*warning! ideas!

Slide 52

we could always use custom fit models...

Slide 53

...after all, as long as the *errors* from the model are normally distributed, we can use 3σ
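
A minimal sketch of that idea, assuming a simple least-squares linear trend as the “custom fit model” (the helper names here are mine, not from the talk): fit the model, then apply 3σ to its errors instead of the raw series.

```python
import statistics

def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ... — a
    stand-in for whatever hand-fitted model you'd actually use."""
    n = len(ys)
    xbar, ybar = (n - 1) / 2, statistics.fmean(ys)
    slope = sum((x - xbar) * (y - ybar) for x, y in enumerate(ys))
    slope /= sum((x - xbar) ** 2 for x in range(n))
    intercept = ybar - slope * xbar
    return lambda x: intercept + slope * x

# A steadily climbing series: raw 3σ would flag the trend itself,
# but the *errors* from the model stay small and roughly normal.
series = [2 * i + noise for i, noise in
          enumerate([0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1])]
model = linear_fit(series)
residuals = [y - model(i) for i, y in enumerate(series)]
sigma = statistics.pstdev(residuals)

next_value = 25.0  # a genuine jump, well off the fitted trend
error = next_value - model(len(series))
print(abs(error) > 3 * sigma)  # → True
```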

Slide 54

a pretty decent forecast based on an artisanal handcrafted model. Parameters are cool!

Slide 55

but fitting models is hard, even by hand.

Slide 56

possible to implement a class of ML algorithms that determine models based on distribution of errors, using Q-Q plots

Slide 57

Q-Q plots can also be used to determine if the PDF has changed, although hard to do with limited sample size
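
A rough sketch of the Q-Q idea (my own illustration, not any particular implementation): pair sorted errors with standard-normal quantiles, then measure how straight the resulting line is.

```python
import math
import statistics

def qq_points(errors):
    """Pair each sorted error with the matching standard-normal
    quantile; for normal errors the points fall on a straight line."""
    n = len(errors)
    std = statistics.NormalDist()
    return [(std.inv_cdf((i + 0.5) / n), e)
            for i, e in enumerate(sorted(errors))]

def looks_normal(errors, tol=0.95):
    """Crude straightness check via Pearson correlation of the two
    quantile axes; a real system would use a proper normality test."""
    xs, ys = zip(*qq_points(errors))
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) > tol

normalish = statistics.NormalDist().samples(200, seed=42)
skewed = [math.exp(3 * e) for e in normalish]  # wildly non-normal
print(looks_normal(normalish))  # → True
print(looks_normal(skewed))     # → False
```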

Slide 58

consensus: throw lots of different models at a series, hope it all shakes out.

Slide 59

[yes] [yes] [no] [no] [yes] [yes] = anomaly!
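
A toy version of that consensus vote (the detector choices here are illustrative, not the ensemble from any real system):

```python
import statistics

def sigma_rule(series, latest, sigmas):
    """Basic SPC check: is `latest` more than `sigmas` deviations out?"""
    mean, sd = statistics.fmean(series), statistics.pstdev(series)
    return sd > 0 and abs(latest - mean) > sigmas * sd

# Each "model" votes independently; real ensembles mix genuinely
# different algorithms, not just re-tuned copies of one rule.
detectors = [
    lambda s, x: sigma_rule(s, x, 3),
    lambda s, x: sigma_rule(s[-30:], x, 3),  # recent window only
    lambda s, x: abs(x - statistics.median(s)) > 3 * statistics.pstdev(s),
]

def consensus(series, latest, quorum=2):
    """Anomaly only if at least `quorum` detectors vote yes."""
    votes = sum(d(series, latest) for d in detectors)
    return votes >= quorum

series = [10 + (i % 5) * 0.1 for i in range(50)]
print(consensus(series, 30.0))  # → True: everyone agrees
print(consensus(series, 10.2))  # → False: within normal wiggle
```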

Slide 60

of course, if your models are all SPC-based, this doesn’t really get you anywhere

Slide 61

use exponentially weighted moving averages to adapt faster
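
One way to sketch that (standard EWMA mean/variance recurrences; the function itself is my own illustration): the baseline tracks recent behaviour, so a steady trend stops looking anomalous while a sudden jump still does.

```python
def ewma_detector(series, latest, alpha=0.3, sigmas=3.0):
    """Track an exponentially weighted mean and variance; recent
    points dominate, so the baseline adapts to drifting metrics."""
    mean, var = float(series[0]), 0.0
    for x in series[1:]:
        diff = x - mean
        incr = alpha * diff
        mean += incr
        var = (1 - alpha) * (var + diff * incr)
    return var > 0 and abs(latest - mean) > sigmas * var ** 0.5

ramp = list(range(100))  # a steadily climbing metric
print(ewma_detector(ramp, 100))  # → False: the next ramp step fits
print(ewma_detector(ramp, 130))  # → True: a sudden jump does not
```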

Slide 62

fourier transforms to detect seasonality
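
A sketch of the Fourier approach (a naive DFT for clarity; in practice you'd reach for an FFT library such as numpy.fft): the frequency bin with the most energy reveals the dominant seasonal period.

```python
import cmath
import math

def dominant_period(series):
    """Return samples-per-cycle of the strongest periodic component,
    found by brute-force DFT over all frequency bins."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # drop the DC component
    best_k, best_power = 1, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return n / best_k

# Four "days" of a metric with a clean 24-sample daily cycle:
daily = [math.sin(2 * math.pi * t / 24) for t in range(96)]
print(dominant_period(daily))  # → 24.0
```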

Slide 63

second order anomalies: is the series “anomalously anomalous”?

Slide 64

...this is all very hard.

Slide 65

so, we can either change what we expect of monitoring...

Slide 66

...and treat it as a way of building noisy situational awareness, not absolute directives (alerts)...

Slide 67

...or we can change what we expect out of engineering...

Slide 68

...and construct strict specifications and expected values of all metrics.

Slide 69

neither is going to happen.

Slide 70

so we have to crack this algorithm nut.

Slide 71

...ugh. @abestanway