Slide 1

Slide 1 text

A DEEP DIVE INTO Monitoring with Skyline. Abe Stanway (@abestanway), Jon Cowie (@jonlives)

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

We have a large stack.

Slide 4

Slide 4 text

41 shards, 24 api servers, 72 web servers, 42 Gearman boxes, 150 node Hadoop cluster, 15 memcached boxes, 60 search machines

Slide 5

Slide 5 text

41 shards, 24 api servers, 72 web servers, 42 Gearman boxes, 150 node Hadoop cluster, 15 memcached boxes, 60 search machines (plus a lot more for various services)

Slide 6

Slide 6 text

Not to mention the app itself.

Slide 7

Slide 7 text

We practice continuous deployment.

Slide 8

Slide 8 text

de • ploy /diˈploi/ Verb To release your code for the world to see, hopefully without breaking the Internet

Slide 9

Slide 9 text

Everyone deploys. 250+ committers.

Slide 10

Slide 10 text

Hundreds of boxes hosting constantly evolving code...

Slide 11

Slide 11 text

...it’s a miracle we stay up, right?

Slide 12

Slide 12 text

We optimize for quick recovery by anticipating problems...

Slide 13

Slide 13 text

...instead of fearing human error.

Slide 14

Slide 14 text

Can’t fix what you don’t measure! - W. Edwards Deming

Slide 15

Slide 15 text

Homemade: StatsD, Skyline, Oculus, Supergrep. Not homemade: Graphite, Nagios, Ganglia.

Slide 16

Slide 16 text

Real time error logging

Slide 17

Slide 17 text

“Not all things that break throw errors.” - Oscar Wilde

Slide 18

Slide 18 text

StatsD

Slide 19

Slide 19 text

StatsD::increment("foo.bar")
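
A minimal Python sketch of what a call like this puts on the wire, assuming the standard StatsD counter format (`name:value|c` sent over UDP, port 8125 by default). The helper names and the default host are hypothetical and not taken from any particular StatsD client.

```python
import socket


def counter_packet(metric, value=1):
    """Build a StatsD counter datagram, e.g. b"foo.bar:1|c"."""
    return f"{metric}:{value}|c".encode("ascii")


def increment(metric, host="127.0.0.1", port=8125):
    """Fire-and-forget: send one counter increment over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(counter_packet(metric), (host, port))
    finally:
        sock.close()
```

Because the transport is UDP, the application never waits on the metrics pipeline, which is part of why sprinkling counters everywhere is cheap.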

Slide 20

Slide 20 text

If it moves, graph it!

Slide 21

Slide 21 text

If it moves, graph it! we would graph them ➞

Slide 22

Slide 22 text

If it doesn’t move, graph it anyway (it might make a run for it)

Slide 23

Slide 23 text

DASHBOARDS!

Slide 24

Slide 24 text

[1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 60] [1358731200, 20] [1358731200, 20]

Slide 25

Slide 25 text

DASHBOARDS! x 250,000

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

lol nagios

Slide 28

Slide 28 text

Unknown anomalies

Slide 29

Slide 29 text

Kale.

Slide 30

Slide 30 text

Kale: - leaves - green stuff

Slide 31

Slide 31 text

Kale: - leaves - green stuff OCULUS SKYLINE

Slide 32

Slide 32 text

Q). How do you analyze a timeseries for anomalies in real time?

Slide 33

Slide 33 text

A). Lots of HTTP requests to Graphite’s API!

Slide 34

Slide 34 text

Q). How do you analyze a quarter million timeseries for anomalies in real time?

Slide 35

Slide 35 text

SKYLINE

Slide 36

Slide 36 text

SKYLINE

Slide 37

Slide 37 text

A real time anomaly detection system

Slide 38

Slide 38 text

Real time?

Slide 39

Slide 39 text

Kinda.

Slide 40

Slide 40 text

StatsD Ten second resolution

Slide 41

Slide 41 text

Ganglia One minute resolution

Slide 42

Slide 42 text

Best case: ~10s, ~1min

Slide 43

Slide 43 text

Takes about 70 seconds with our throughput.

Slide 44

Slide 44 text

Still faster than you would have discovered it otherwise.

Slide 45

Slide 45 text

Memory > Disk

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

Q). How do you get a quarter million timeseries into Redis on time?

Slide 48

Slide 48 text

STREAM THAT SHIT!

Slide 49

Slide 49 text

Graphite’s relay agent ➞ original graphite + backup graphite

Slide 50

Slide 50 text

Graphite’s relay agent ➞ original graphite + backup graphite, sending pickles: [statsd.numStats, [1365603422, 82345]] [statsd.numStats, [1365603432, 80611]] [statsd.numStats, [1365603412, 73421]]

Slide 51

Slide 51 text

Graphite’s relay agent ➞ original graphite + skyline, sending pickles: [statsd.numStats, [1365603422, 82345]] [statsd.numStats, [1365603432, 80611]] [statsd.numStats, [1365603412, 73421]]
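
A rough sketch of the pickle framing, assuming Graphite's documented pickle protocol: a pickled list of `(metric, (timestamp, value))` tuples behind a 4-byte big-endian length header. The function names are hypothetical, and this is not Skyline's actual listener code.

```python
import pickle
import struct


def encode_batch(datapoints):
    """Frame (metric, (timestamp, value)) tuples as one length-prefixed pickle."""
    payload = pickle.dumps(datapoints, protocol=2)
    header = struct.pack("!L", len(payload))  # 4-byte big-endian length prefix
    return header + payload


def decode_batch(message):
    """Read the length prefix, then unpickle exactly that many bytes.

    Only unpickle data from a source you trust, such as your own relay.
    """
    (length,) = struct.unpack("!L", message[:4])
    return pickle.loads(message[4:4 + length])


batch = [
    ("statsd.numStats", (1365603412, 73421)),
    ("statsd.numStats", (1365603422, 82345)),
    ("statsd.numStats", (1365603432, 80611)),
]
assert decode_batch(encode_batch(batch)) == batch
```

Anything that can read this framing can sit behind the relay as if it were another Graphite.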

Slide 52

Slide 52 text

We import from Ganglia too.

Slide 53

Slide 53 text

Storing timeseries

Slide 54

Slide 54 text

Minimize I/O Minimize memory

Slide 55

Slide 55 text

redis.append() - Strings - Constant time - One operation per update

Slide 56

Slide 56 text

JSON?

Slide 57

Slide 57 text

get statsD.numStats => “[1358711400, 51],”

Slide 58

Slide 58 text

get statsD.numStats => “[1358711400, 51], [1358711410, 23],”

Slide 59

Slide 59 text

get statsD.numStats => “[1358711400, 51], [1358711410, 23], [1358711420, 45],”
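
A small Python sketch of the append-a-fragment idea. A plain dict stands in for Redis so the example runs anywhere; `append` and `read_series` are hypothetical names, and wrapping the string in brackets before parsing is an assumption about how such fragments would be decoded.

```python
import json

store = {}  # stand-in for Redis: key -> string


def append(key, timestamp, value):
    """Mimic an APPEND: tack one serialized datapoint onto the stored string."""
    store[key] = store.get(key, "") + json.dumps([timestamp, value]) + ","


def read_series(key):
    """Wrap the fragments in brackets so the whole string parses as one JSON array."""
    raw = store.get(key, "")
    return json.loads("[" + raw.rstrip(",") + "]")


append("statsD.numStats", 1358711400, 51)
append("statsD.numStats", 1358711410, 23)
append("statsD.numStats", 1358711420, 45)
```

Writes stay at one cheap operation per update; the cost moves to the reader, which has to parse the entire string every time.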

Slide 60

Slide 60 text

OVER HALF CPU time spent decoding JSON

Slide 61

Slide 61 text

[1,2]

Slide 62

Slide 62 text

[ 1 , 2 ] Stuff we care about Extra bullshit

Slide 63

Slide 63 text

MESSAGEPACK

Slide 64

Slide 64 text

MESSAGEPACK A binary-based serialization protocol

Slide 65

Slide 65 text

\x92\x01\x02 Array header (size in the low 4 bits; larger arrays use a 16 or 32 bit big endian integer) Things we care about

Slide 66

Slide 66 text

\x92\x01\x02 Array header (size in the low 4 bits; larger arrays use a 16 or 32 bit big endian integer) Things we care about \x92\x02\x03

Slide 67

Slide 67 text

CUT IN HALF Run Time + Memory Used
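
To make the byte layout concrete, here is a hand-rolled Python sketch of the small MessagePack subset needed for `[timestamp, value]` pairs of non-negative integers. In practice one would use a MessagePack library; the function names here are hypothetical, and floats are deliberately left out. Note that a two-element array gets the header byte 0x92 (0x90 plus the length).

```python
import struct


def pack_uint(n):
    """Encode a non-negative integer (up to 32 bits) in MessagePack."""
    if n <= 0x7F:
        return bytes([n])                      # positive fixint: the byte is the value
    if n <= 0xFF:
        return b"\xcc" + struct.pack(">B", n)  # uint 8
    if n <= 0xFFFF:
        return b"\xcd" + struct.pack(">H", n)  # uint 16
    if n <= 0xFFFFFFFF:
        return b"\xce" + struct.pack(">I", n)  # uint 32
    raise ValueError("integer too large for this sketch")


def pack_pair(timestamp, value):
    """Two-element fixarray: header byte 0x92, then the two integers."""
    return bytes([0x90 | 2]) + pack_uint(timestamp) + pack_uint(value)


def unpack_uint(data, pos):
    """Decode one integer at pos; return (value, next position)."""
    tag = data[pos]
    if tag <= 0x7F:
        return tag, pos + 1
    fmt = {0xCC: ">B", 0xCD: ">H", 0xCE: ">I"}[tag]
    (value,) = struct.unpack_from(fmt, data, pos + 1)
    return value, pos + 1 + struct.calcsize(fmt)


def unpack_stream(data):
    """Decode back-to-back pairs, as produced by appending pack_pair() outputs."""
    pairs, pos = [], 0
    while pos < len(data):
        if data[pos] != 0x92:
            raise ValueError("expected a two-element fixarray")
        first, pos = unpack_uint(data, pos + 1)
        second, pos = unpack_uint(data, pos)
        pairs.append([first, second])
    return pairs
```

A datapoint like `[1358711400, 51]` takes 7 bytes this way, against 17 characters for the JSON fragment with its trailing comma, and concatenated objects decode as a stream with no wrapping step.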

Slide 68

Slide 68 text

ROOMBA.PY CLEANS THE DATA
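
The slide does not spell out what cleaning involves. One plausible reading, and it is an assumption here, is trimming datapoints older than a retention window and dropping duplicate timestamps so the appended strings do not grow forever. `trim` and its parameters are hypothetical.

```python
def trim(series, now, max_age_seconds):
    """Keep only datapoints newer than the cutoff, dropping duplicate timestamps."""
    cutoff = now - max_age_seconds
    seen = set()
    kept = []
    for timestamp, value in series:
        if timestamp >= cutoff and timestamp not in seen:
            seen.add(timestamp)
            kept.append([timestamp, value])
    return kept
```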

Slide 69

Slide 69 text

“Wait...you wrote this in Python?”

Slide 70

Slide 70 text

Great statistics libraries Not fun for parallelism

Slide 71

Slide 71 text

The Analyzer: Simple map/reduce design

Slide 72

Slide 72 text

The Analyzer: Assign Redis keys to each process. Process decodes and analyzes.
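
A sketch of the key assignment in Python, assuming each worker simply takes a contiguous slice of the key list. `keys_for_process` and `analyze_slice` are hypothetical names; in a real setup each slice would be handed to its own process (for example via `multiprocessing`).

```python
import math


def keys_for_process(keys, process_index, process_count):
    """Return the contiguous slice of keys that one worker is responsible for."""
    per_process = math.ceil(len(keys) / process_count)
    start = process_index * per_process
    return keys[start:start + per_process]


def analyze_slice(keys, fetch, is_anomalous):
    """Map step for one worker: fetch each series and keep the anomalous keys."""
    return [key for key in keys if is_anomalous(fetch(key))]
```

Separate processes sidestep the parallelism pain mentioned earlier, since each worker decodes and analyzes its own keys without sharing state.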

Slide 73

Slide 73 text

The Analyzer: Anomalous metrics written as JSON. setInterval() retrieves from front end.

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

What does it mean to be anomalous?

Slide 76

Slide 76 text

Consensus model

Slide 77

Slide 77 text

[yes] [yes] [no] [no] [yes] [yes] = anomaly!
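
The vote can be sketched in a few lines of Python. The threshold is an assumption: the slide shows four of six algorithms agreeing, but does not state the cutoff, so `consensus` is left as a parameter.

```python
def consensus_vote(series, algorithms, consensus):
    """Run every algorithm; flag the series if at least `consensus` of them say yes."""
    votes = [bool(algorithm(series)) for algorithm in algorithms]
    return votes.count(True) >= consensus, votes
```

For example, six stub algorithms voting yes, yes, no, no, yes, yes give an anomaly with `consensus=4` and no anomaly with `consensus=5`.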

Slide 78

Slide 78 text

Helps correct model mismatches

Slide 79

Slide 79 text

Implement everything you can get your hands on

Slide 80

Slide 80 text

Basic algorithm: “A metric is anomalous if its latest datapoint is over three standard deviations above its moving average.”
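
A minimal Python sketch of that rule. Two simplifications are assumed here: the "moving average" is the plain mean of a trailing window, and deviations in either direction count, not just values above the average. The function name and the window size are hypothetical.

```python
import statistics


def three_sigma(series, window=30):
    """Flag the latest datapoint if it sits more than 3 standard deviations
    from the mean of the points immediately before it."""
    history, latest = series[-(window + 1):-1], series[-1]
    if len(history) < 2:
        return False  # not enough data to estimate a deviation
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(latest - mean) > 3 * sigma
```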

Slide 81

Slide 81 text

...(aka, the basic tenet of SPC) http://en.wikipedia.org/wiki/Statistical_process_control

Slide 82

Slide 82 text

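Normal distribution around the mean: 34.1% of values within one standard deviation on each side, 13.6% between one and two, 2.1% between two and three.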

Slide 83

Slide 83 text

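Normal distribution around the mean: 34.1% of values within one standard deviation on each side, 13.6% between one and two, 2.1% between two and three. If your datapoint is in here (the tails beyond three standard deviations), it’s an anomaly.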

Slide 84

Slide 84 text

Histogram binning

Slide 85

Slide 85 text

Take some data

Slide 86

Slide 86 text

Find most recent datapoint (value is 40)

Slide 87

Slide 87 text

Make a histogram

Slide 88

Slide 88 text

Check which bin contains most recent data

Slide 89

Slide 89 text

Check which bin contains most recent data: latest value is 40, tiny bin size, so...anomaly!
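
The steps above can be sketched in Python without any libraries. The number of bins and the "tiny bin" cutoff (`max_count`) are assumptions chosen for illustration, not values from the talk.

```python
def histogram_bins(series, bins=10, max_count=2):
    """Flag the latest value if it falls in a sparsely populated histogram bin."""
    low, high = min(series), max(series)
    if high == low:
        return False  # every value is identical, so nothing stands out
    width = (high - low) / bins

    def bin_index(value):
        # Clamp so the maximum value lands in the last bin, not one past it.
        return min(int((value - low) / width), bins - 1)

    counts = [0] * bins
    for value in series:
        counts[bin_index(value)] += 1
    return counts[bin_index(series[-1])] <= max_count
```

Unlike the three-sigma rule, this makes no assumption about the shape of the distribution; it only asks whether values like the latest one have been rare.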

Slide 90

Slide 90 text

Ordinary least squares

Slide 91

Slide 91 text

Take some data

Slide 92

Slide 92 text

Fit a regression line

Slide 93

Slide 93 text

Find residuals

Slide 94

Slide 94 text

Three sigma winner!
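
A Python sketch of the regression approach. One deliberate simplification: the line is fitted to the points before the latest one, and the latest point is then compared against the projection, so the suspect value cannot pull the fit toward itself. The function names are hypothetical.

```python
import statistics


def fit_line(ys):
    """Ordinary least squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = statistics.mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def least_squares(series):
    """Flag the latest point if its residual exceeds 3 sigma of the earlier residuals."""
    history, latest = series[:-1], series[-1]
    if len(history) < 3:
        return False  # too few points for a meaningful fit
    slope, intercept = fit_line(history)
    residuals = [y - (slope * x + intercept) for x, y in enumerate(history)]
    sigma = statistics.pstdev(residuals)
    predicted = slope * len(history) + intercept
    return abs(latest - predicted) > 3 * sigma
```

Because the trend is modeled explicitly, a steadily growing metric does not trip the check just by growing.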

Slide 95

Slide 95 text

Median absolute deviation

Slide 96

Slide 96 text

Median absolute deviation (calculate residuals with respect to median instead of regression line)
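
In Python this comes down to two medians. The threshold of 6 is an assumption for illustration; the slides do not give one.

```python
import statistics


def median_absolute_deviation(series, threshold=6):
    """Flag the latest point if its distance from the median is more than
    `threshold` times the median absolute deviation of the series."""
    median = statistics.median(series)
    deviations = [abs(value - median) for value in series]
    mad = statistics.median(deviations)
    if mad == 0:
        return False  # at least half the points equal the median; ratio undefined
    return deviations[-1] / mad > threshold
```

Medians barely move when a single spike enters the data, so one outlier does not inflate the yardstick used to judge the next one.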

Slide 97

Slide 97 text

Exponentially weighted moving average

Slide 98

Slide 98 text

Instead of:

Slide 99

Slide 99 text

Add a decay factor!

Slide 100

Slide 100 text

Adding decay discounts older values.
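
A sketch of the contrast in Python, assuming the textbook recurrence s[t] = alpha * x[t] + (1 - alpha) * s[t-1], where `alpha` is the smoothing factor: the closer it is to 1, the faster old values fade.

```python
def simple_average(series):
    """Every point counts equally, no matter how old it is."""
    return sum(series) / len(series)


def ewma(series, alpha):
    """Exponentially weighted moving average of a non-empty series.

    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]
    """
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed
```

For `[10, 20, 30]` the plain average is 20, while `ewma(..., 0.5)` ends at 22.5 because the most recent value carries the most weight.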

Slide 101

Slide 101 text

Four horsemen of the modelpocalypse

Slide 102

Slide 102 text

1. Seasonality 2. Spike influence 3. Normality 4. Parameters

Slide 103

Slide 103 text

Anomaly?

Slide 104

Slide 104 text

Nope.

Slide 105

Slide 105 text

Spikes artificially raise the moving average. Figure labels: Anomaly detected (yay!), Bigger moving average, Anomaly missed :(

Slide 106

Slide 106 text

Real world data doesn’t necessarily follow a perfect normal distribution.

Slide 107

Slide 107 text

!=

Slide 108

Slide 108 text

Simple systems, simple definitions of “anomalous”

Slide 109

Slide 109 text

Complex systems, complex definitions of “anomalous”

Slide 110

Slide 110 text

Not to mention that complex systems evolve

Slide 111

Slide 111 text

How to avoid false positives upon the evolution of the measured processes?

Slide 112

Slide 112 text

Ionno.

Slide 113

Slide 113 text

Parameters!

Slide 114

Slide 114 text

Parameters are cool! Predicted page views

Slide 115

Slide 115 text

Cool model bro. (it’s a simplified Holt-Winters)

Slide 116

Slide 116 text

What are the parameters?

Slide 117

Slide 117 text

Seasonality: 365 day; Overall trend weight: .68; Seasonal regression weight: .32; EWMA smoothing factor: .1

Slide 118

Slide 118 text

Must train before discovering lowest error for parameters
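
To give a feel for what training means, here is a toy Python sketch that tunes a single parameter (an EWMA smoothing factor) by brute force, scoring each candidate by its one-step-ahead squared error. This is far simpler than the Holt-Winters variant on the slides; the function names and candidate values are hypothetical.

```python
def one_step_error(series, alpha):
    """Sum of squared errors when each point is forecast by the previous EWMA value."""
    smoothed, error = series[0], 0.0
    for value in series[1:]:
        error += (value - smoothed) ** 2
        smoothed = alpha * value + (1 - alpha) * smoothed
    return error


def best_alpha(series, candidates):
    """Try every candidate and keep the one with the lowest forecast error."""
    return min(candidates, key=lambda alpha: one_step_error(series, alpha))
```

Each candidate needs a full pass over the history, so the cost is candidates times datapoints for one parameter on one metric. Multiply by several parameters and a quarter million metrics and the expense becomes obvious.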

Slide 119

Slide 119 text

Mad expensive, yo. these people do not represent our CPUs

Slide 120

Slide 120 text

No good anomalies without good models.

Slide 121

Slide 121 text

A robust set of algorithms is the current focus of this project.

Slide 122

Slide 122 text

Thanks! @abestanway github.com/etsy/skyline