Slide 1

Slide 1 text

BRING THE NOISE! MAKING SENSE OF A HAILSTORM OF METRICS Abe Stanway @abestanway Jon Cowie @jonlives

Slide 2

Slide 2 text

Ninety minutes is a long time. This talk:
- motivations ~10 min
- skyline ~25 min
- oculus ~30 min
- demo! ~10 min
- questions ~15 min

Slide 3

Slide 3 text

Ninety minutes is a long time. This talk:
- motivations ~10 min
- skyline ~25 min
- oculus ~30 min
- demo! ~10 min
- questions ~15 min
But we have some sweet stuff to show you.

Slide 4

Slide 4 text

Background and Motivations

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

1.5 billion page views $117 million of goods sold 950 thousand users

Slide 7

Slide 7 text

1.5 billion page views $117 million of goods sold 950 thousand users (in December ’12)

Slide 8

Slide 8 text

We practice continuous deployment.

Slide 9

Slide 9 text

de • ploy /diˈploi/ Verb To release your code for the world to see, hopefully without breaking the Internet

Slide 10

Slide 10 text

Everyone deploys. 250+ committers.

Slide 11

Slide 11 text

Day one: DEPLOY

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

30+ DEPLOYS A DAY (~8 commits per deploy!)

Slide 14

Slide 14 text

“30 deploys a day? Is that safe?”

Slide 15

Slide 15 text

We optimize for quick recovery by anticipating problems...

Slide 16

Slide 16 text

...instead of fearing human error.

Slide 17

Slide 17 text

Can’t fix what you don’t measure! - W. Edwards Deming

Slide 18

Slide 18 text

homemade!: StatsD, Supergrep, Skyline, Oculus. not homemade: Graphite, Nagios, Ganglia.

Slide 19

Slide 19 text

Real time error logging

Slide 20

Slide 20 text

“Not all things that break throw errors.” - Oscar Wilde

Slide 21

Slide 21 text

StatsD

Slide 22

Slide 22 text

StatsD::increment("foo.bar")
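
On the wire, a counter increment like the one above is a single small UDP datagram. A minimal sketch in Python, assuming the usual StatsD line format `<name>:<value>|c` and the conventional port 8125; `counter_packet` and `increment` are hypothetical helper names, not part of any client library.

```python
import socket


def counter_packet(name, value=1):
    """Format a StatsD counter increment, e.g. b'foo.bar:1|c'."""
    return f"{name}:{value}|c".encode("ascii")


def increment(name, host="127.0.0.1", port=8125):
    """Fire and forget: send the counter over UDP, no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(counter_packet(name), (host, port))
```

Because the send is fire and forget, instrumenting a code path costs almost nothing, which is what makes "graph everything" practical.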

Slide 23

Slide 23 text

If it moves, graph it!

Slide 24

Slide 24 text

If it moves, graph it! we would graph them ➞

Slide 25

Slide 25 text

If it doesn’t move, graph it anyway (it might make a run for it)

Slide 26

Slide 26 text

DASHBOARDS!

Slide 27

Slide 27 text

[1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 60] [1358731200, 20] [1358731200, 20]

Slide 28

Slide 28 text

DASHBOARDS! x 250,000

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

lol nagios

Slide 31

Slide 31 text

“...but there are also unknown unknowns - there are things we do not know we don’t know.”

Slide 32

Slide 32 text

Unknown anomalies

Slide 33

Slide 33 text

Unknown correlations

Slide 34

Slide 34 text

Kale.

Slide 35

Slide 35 text

Kale: - leaves - green stuff

Slide 36

Slide 36 text

Kale: - leaves - green stuff OCULUS SKYLINE

Slide 37

Slide 37 text

Q). How do you analyze a timeseries for anomalies in real time?

Slide 38

Slide 38 text

A). Lots of HTTP requests to Graphite’s API!

Slide 39

Slide 39 text

Q). How do you analyze a quarter million timeseries for anomalies in real time?

Slide 40

Slide 40 text

SKYLINE

Slide 41

Slide 41 text

SKYLINE

Slide 42

Slide 42 text

A real time anomaly detection system

Slide 43

Slide 43 text

Real time?

Slide 44

Slide 44 text

Kinda.

Slide 45

Slide 45 text

StatsD Ten second resolution

Slide 46

Slide 46 text

Ganglia One minute resolution

Slide 47

Slide 47 text

Best case: ~10s to ~1min

Slide 48

Slide 48 text

Takes about 90 seconds with our throughput.

Slide 49

Slide 49 text

Still faster than you would have discovered it otherwise.

Slide 50

Slide 50 text

Memory > Disk

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

Q). How do you get a quarter million timeseries into Redis on time?

Slide 53

Slide 53 text

STREAM IT!

Slide 54

Slide 54 text

Graphite’s relay agent original graphite backup graphite

Slide 55

Slide 55 text

Graphite’s relay agent ➞ pickles ➞ original graphite, backup graphite
[statsd.numStats, [1365603412, 73421]]
[statsd.numStats, [1365603422, 82345]]
[statsd.numStats, [1365603432, 80611]]

Slide 56

Slide 56 text

Graphite’s relay agent ➞ pickles ➞ original graphite, skyline
[statsd.numStats, [1365603412, 73421]]
[statsd.numStats, [1365603422, 82345]]
[statsd.numStats, [1365603432, 80611]]
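
A rough sketch of what one of those pickles looks like on the wire, assuming Carbon's documented pickle framing: a 4-byte big-endian length header followed by a pickled list of `(path, (timestamp, value))` tuples. `encode_batch` and `decode_batch` are hypothetical names; how Skyline's own listener reads the stream is not shown on the slides.

```python
import pickle
import struct


def encode_batch(datapoints):
    """datapoints: list of (metric_name, (timestamp, value)) tuples."""
    payload = pickle.dumps(datapoints, protocol=2)
    # Length header lets the receiver know how many bytes to read.
    return struct.pack("!L", len(payload)) + payload


def decode_batch(message):
    """Inverse of encode_batch. Only unpickle data from a trusted relay."""
    (length,) = struct.unpack("!L", message[:4])
    return pickle.loads(message[4:4 + length])


batch = [
    ("statsd.numStats", (1365603412, 73421)),
    ("statsd.numStats", (1365603422, 82345)),
    ("statsd.numStats", (1365603432, 80611)),
]
message = encode_batch(batch)
```

Anything that can accept this framing can sit where the backup Graphite used to be, which is the trick the slide describes.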

Slide 57

Slide 57 text

We import from Ganglia too.

Slide 58

Slide 58 text

Storing timeseries

Slide 59

Slide 59 text

Minimize I/O Minimize memory

Slide 60

Slide 60 text

redis.append() - Strings - Constant time - One operation per update

Slide 61

Slide 61 text

JSON?

Slide 62

Slide 62 text

=> get statsD.numStats
----------------------------
"[1358711400, 51],"

Slide 63

Slide 63 text

=> get statsD.numStats
----------------------------
"[1358711400, 51], [1358711410, 23],"

Slide 64

Slide 64 text

=> get statsD.numStats
----------------------------
"[1358711400, 51], [1358711410, 23], [1358711420, 45],"
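
The append-only JSON scheme can be sketched as follows. This is an illustration under assumptions: a plain dict stands in for Redis, `append` mimics the semantics of the Redis APPEND command, and the way the reader repairs the trailing comma is a guess at one workable approach, not the talk's actual code.

```python
import json

store = {}  # stand-in for Redis: key -> string value


def append(key, fragment):
    """Mimic Redis APPEND: concatenate onto the existing string value."""
    store[key] = store.get(key, "") + fragment


def add_datapoint(key, timestamp, value):
    """One write per update: a JSON pair plus a trailing comma."""
    append(key, json.dumps([timestamp, value]) + ",")


def read_series(key):
    """Drop the trailing comma and wrap in brackets to get valid JSON."""
    raw = store.get(key, "")
    return json.loads("[" + raw.rstrip(",") + "]")


add_datapoint("statsD.numStats", 1358711400, 51)
add_datapoint("statsD.numStats", 1358711410, 23)
add_datapoint("statsD.numStats", 1358711420, 45)
```

Writes stay cheap, but every read has to parse the whole string as text, which is where the decoding cost on the next slide comes from.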

Slide 65

Slide 65 text

OVER HALF CPU time spent decoding JSON

Slide 66

Slide 66 text

[1,2]

Slide 67

Slide 67 text

[ 1 , 2 ] Stuff we care about Extra junk

Slide 68

Slide 68 text

MESSAGEPACK

Slide 69

Slide 69 text

MESSAGEPACK A binary-based serialization protocol

Slide 70

Slide 70 text

\x92\x01\x02 Array header: \x92 (one byte holding the size, 2; arrays of 16 or more elements use a 16 or 32 bit big endian size instead) Things we care about: \x01\x02

Slide 71

Slide 71 text

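\x92\x01\x02 \x92\x02\x03 Array header: \x92 (one byte holding the size, 2; arrays of 16 or more elements use a 16 or 32 bit big endian size instead) Things we care about: \x01\x02 and \x02\x03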
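
To make the byte layout concrete, here is a hand-rolled sketch that packs `[timestamp, value]` pairs using three MessagePack type bytes: `0x92` (array of two elements), `0xce` (unsigned 32 bit integer) and `0xcb` (64 bit float). It is for illustration only; real code would normally use the `msgpack` package (`msgpack.packb` and the streaming `msgpack.Unpacker`), and the fixed-width layout chosen here is an assumption that keeps the decoder trivial.

```python
import struct

PAIR_SIZE = 1 + 5 + 9  # array header + uint32 item + float64 item


def pack_pair(timestamp, value):
    """Encode [timestamp, value] as fixarray(2), uint32, float64."""
    return (b"\x92"
            + b"\xce" + struct.pack(">I", timestamp)
            + b"\xcb" + struct.pack(">d", value))


def unpack_stream(blob):
    """Decode back-to-back pairs produced by pack_pair."""
    pairs = []
    for offset in range(0, len(blob), PAIR_SIZE):
        chunk = blob[offset:offset + PAIR_SIZE]
        if (chunk[0], chunk[1], chunk[6]) != (0x92, 0xCE, 0xCB):
            raise ValueError("unexpected MessagePack type byte")
        (timestamp,) = struct.unpack(">I", chunk[2:6])
        (value,) = struct.unpack(">d", chunk[7:15])
        pairs.append((timestamp, value))
    return pairs


blob = b""
for ts, v in [(1358711400, 51.0), (1358711410, 23.0)]:
    blob += pack_pair(ts, v)  # same shape as one Redis APPEND per datapoint
```

Each packed pair is self-delimiting, so appended pairs decode as a stream with no separators and no trailing-comma repair.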

Slide 72

Slide 72 text

CUT IN HALF Run Time + Memory Used

Slide 73

Slide 73 text

ROOMBA.PY CLEANS THE DATA

Slide 74

Slide 74 text

“Wait...you wrote this in Python?”

Slide 75

Slide 75 text

Great statistics libraries Not fun for parallelism

Slide 76

Slide 76 text

Assign Redis keys to each process Process decodes and analyzes The Analyzer
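
One simple way to assign keys to processes is to give each worker a contiguous slice of the key list. A minimal sketch; the function name, the 0-based worker index and the example key names are all hypothetical.

```python
from math import ceil


def assigned_keys(keys, process_index, process_count):
    """Return the contiguous slice of keys owned by one worker (0-based)."""
    per_process = ceil(len(keys) / process_count)
    start = process_index * per_process
    return keys[start:start + per_process]


keys = [f"metrics.{n}" for n in range(10)]
slices = [assigned_keys(keys, i, 4) for i in range(4)]
```

Each worker process would then fetch, decode and analyze only its own slice, so no coordination is needed beyond knowing its index.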

Slide 77

Slide 77 text

Anomalous metrics written as JSON setInterval() retrieves from front end The Analyzer

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

What does it mean to be anomalous?

Slide 80

Slide 80 text

Consensus model

Slide 81

Slide 81 text

Implement everything you can get your hands on

Slide 82

Slide 82 text

Basic algorithm: “A metric is anomalous if its latest datapoint is over three standard deviations above its moving average.”
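
The basic rule translates almost directly into code. A minimal sketch, assuming a trailing window of 30 points and that the latest point is excluded from its own average; both choices are illustrative, not taken from the talk.

```python
from statistics import mean, pstdev


def three_sigma(series, window=30):
    """True if the latest value is over 3 std devs above the trailing mean."""
    if len(series) < 3:
        return False  # not enough history to judge
    *history, latest = series[-(window + 1):]
    return latest > mean(history) + 3 * pstdev(history)


steady = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
```

With this history the threshold sits a little under 13, so a new value of 11 passes quietly and a new value of 40 is flagged.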

Slide 83

Slide 83 text

Grubbs’ test, ordinary least squares

Slide 84

Slide 84 text

Histogram binning
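
A consensus model can be sketched as a vote among independent detectors. The three detectors below are simplified stand-ins chosen to keep the example dependency-free (a full Grubbs' test or least-squares fit needs more machinery), and every threshold is an assumption.

```python
from statistics import mean, median, pstdev


def above_three_sigma(series):
    *history, latest = series
    return latest > mean(history) + 3 * pstdev(history)


def far_from_median(series, factor=6):
    """Latest point deviates from the median by more than factor * MAD."""
    *history, latest = series
    med = median(history)
    mad = median(abs(v - med) for v in history)
    return abs(latest - med) > factor * mad


def rare_histogram_bin(series, bins=10, max_count=1):
    """Latest point lands in a bin that held at most max_count earlier points."""
    *history, latest = series
    lo, hi = min(history), max(history)
    if not lo <= latest <= hi:
        return True  # outside every bin seen so far
    width = (hi - lo) / bins or 1.0

    def bin_of(value):
        return min(int((value - lo) / width), bins - 1)

    return sum(bin_of(v) == bin_of(latest) for v in history) <= max_count


ALGORITHMS = [above_three_sigma, far_from_median, rare_histogram_bin]


def is_anomalous(series, consensus=2):
    """Flag the metric only if enough detectors agree."""
    votes = sum(1 for algorithm in ALGORITHMS if algorithm(series))
    return votes >= consensus


steady = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
```

Requiring agreement trades some sensitivity for fewer false alarms, since any single detector has blind spots on some kinds of series.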

Slide 85

Slide 85 text

Four horsemen of the modelpocalypse

Slide 86

Slide 86 text

1. Seasonality 2. Spike influence 3. Normality 4. Parameters

Slide 87

Slide 87 text

Anomaly?

Slide 88

Slide 88 text

Nope.

Slide 89

Slide 89 text

Spikes artificially raise the moving average. Anomaly detected (yay!) ➞ bigger moving average ➞ anomaly missed :(
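
The effect is easy to reproduce with the three-sigma rule. In this sketch (the data is made up for illustration), the same new value is flagged against a clean window but missed once a single earlier spike sits in that window.

```python
from statistics import mean, pstdev


def three_sigma(history, latest):
    return latest > mean(history) + 3 * pstdev(history)


clean = [9, 11] * 10           # mean 10, standard deviation 1
spiked = clean[:-1] + [100]    # same window with one earlier spike in it

detected = three_sigma(clean, 15)   # threshold is 13, so 15 is flagged
missed = three_sigma(spiked, 15)    # the spike lifts the threshold well past 15
```

One outlier inflates both the mean and the standard deviation, so the detector stays desensitized until the spike ages out of the window.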

Slide 90

Slide 90 text

Real world data doesn’t necessarily follow a perfect normal distribution.

Slide 91

Slide 91 text

Too many metrics to fit parameters for them all!

Slide 92

Slide 92 text

A robust set of algorithms is the current focus of this project.

Slide 93

Slide 93 text

Q). How do you analyze a quarter million timeseries for correlations?

Slide 94

Slide 94 text

OCULUS

Slide 95

Slide 95 text

Image comparison is expensive and slow

Slide 96

Slide 96 text

“[[975, 1365528530], [643, 1365528540], [750, 1365528550], [992, 1365528560], [580, 1365528570], [586, 1365528580], [649, 1365528590], [548, 1365528600], [901, 1365528610], [633, 1365528620]]” Use raw timeseries instead of raw graphs

Slide 97

Slide 97 text

Naming Things Cache Invalidation Numerical Comparison? HARD PROBLEMS

Slide 98

Slide 98 text

Naming Things Cache Invalidation Numerical Comparison? HARD PROBLEMS

Slide 99

Slide 99 text

Euclidean Distance

Slide 100

Slide 100 text

Dynamic Time Warping (helps with phase shifts)
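
A small sketch of both measures shows why warping helps with phase shifts. This is the textbook dynamic-programming form of DTW with absolute difference as the step cost; the example series are made up.

```python
from math import inf, sqrt


def euclidean(a, b):
    """Point-by-point distance; both series must have the same length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Classic dynamic time warping cost, O(len(a) * len(b)) time and memory."""
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[-1][-1]


pulse = [0, 0, 5, 9, 5, 0, 0, 0]
shifted = [0, 0, 0, 5, 9, 5, 0, 0]  # same shape, one step later
```

The Euclidean distance penalizes the one-step shift at every point of the pulse, while DTW stretches the flat sections and reports the two shapes as identical.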

Slide 101

Slide 101 text

We’ve solved it!

Slide 102

Slide 102 text

O(N²)

Slide 103

Slide 103 text

O(N²) x 250k

Slide 104

Slide 104 text

Too slow!

Slide 105

Slide 105 text

doesn’t

Slide 106

Slide 106 text

No need to run DTW on all 250k.

Slide 107

Slide 107 text

Discard obviously dissimilar metrics.

Slide 108

Slide 108 text

Shape Description Alphabet: “975 643 643 750 992 992 992 580” ➞ “sharpdecrement flat increment sharpincrement flat flat sharpdecrement”

Slide 109

Slide 109 text

Shape Description Alphabet: “975 643 643 750 992 992 992 580” ➞ (normalization step) “24 4 4 11 25 25 25 0 1” ➞ “sharpdecrement flat increment sharpincrement flat flat sharpdecrement”
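
One way such a fingerprint could be produced is sketched below. The 0 to 25 scale, the rounding and the "sharp" threshold of 10 are assumptions picked so the example reproduces the token string from the slide; intermediate normalized values need not match the slide's, and Oculus's real rules may differ.

```python
def normalize(values, top=25):
    """Rescale to integers in 0..top so series of any magnitude compare."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1
    return [round((v - lo) / span * top) for v in values]


def shape_tokens(values, sharp=10):
    """Describe each step between neighbouring normalized points with a word."""
    points = normalize(values)
    tokens = []
    for previous, current in zip(points, points[1:]):
        delta = current - previous
        if delta == 0:
            tokens.append("flat")
        else:
            size = "sharp" if abs(delta) >= sharp else ""
            tokens.append(size + ("increment" if delta > 0 else "decrement"))
    return " ".join(tokens)


fingerprint = shape_tokens([975, 643, 643, 750, 992, 992, 992, 580])
```

Turning numbers into words means an ordinary full-text search engine can do the coarse similarity filtering.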

Slide 110

Slide 110 text

No content

Slide 111

Slide 111 text

Search for shape description fingerprint in Elasticsearch

Slide 112

Slide 112 text

Run DTW on results as final polish

Slide 113

Slide 113 text

O(N²) on ~10k metrics

Slide 114

Slide 114 text

Still too slow.

Slide 115

Slide 115 text

Fast DTW - O(N): coarsen, project, refine
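
The three steps can be sketched in a compact recursive form. This is a simplified illustration of the coarsen, project, refine idea rather than a tuned implementation: it assumes `radius` is at least 1, uses absolute difference as the step cost, and sorts the window for clarity, which costs a little more than strictly linear time.

```python
from itertools import product
from math import inf


def windowed_dtw(a, b, window):
    """DTW restricted to the (i, j) cells in window; returns (cost, path)."""
    best = {(-1, -1): (0.0, None)}
    for i, j in sorted(window):  # row-major order: predecessors come first
        prev = min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                   key=lambda cell: best.get(cell, (inf, None))[0])
        best[i, j] = (abs(a[i] - b[j]) + best.get(prev, (inf, None))[0], prev)
    cell, path = (len(a) - 1, len(b) - 1), []
    while cell != (-1, -1):      # walk back from the end to recover the path
        path.append(cell)
        cell = best[cell][1]
    return best[len(a) - 1, len(b) - 1][0], path[::-1]


def halve(series):
    """Coarsen: average neighbouring pairs (a trailing odd point is dropped)."""
    return [(series[i] + series[i + 1]) / 2 for i in range(0, len(series) - 1, 2)]


def project(path, len_a, len_b, radius):
    """Map a coarse path onto the finer grid, padded by radius coarse cells."""
    window, offsets = set(), range(-radius, radius + 1)
    for (i, j), di, dj, si, sj in product(path, offsets, offsets, (0, 1), (0, 1)):
        fi, fj = 2 * (i + di) + si, 2 * (j + dj) + sj
        if 0 <= fi < len_a and 0 <= fj < len_b:
            window.add((fi, fj))
    return window


def fastdtw(a, b, radius=1):
    """Approximate DTW: solve a coarse problem, then refine near its path."""
    if len(a) <= radius + 2 or len(b) <= radius + 2:
        full = set(product(range(len(a)), range(len(b))))
        return windowed_dtw(a, b, full)
    _, coarse_path = fastdtw(halve(a), halve(b), radius)
    return windowed_dtw(a, b, project(coarse_path, len(a), len(b), radius))


pulse = [0, 0, 5, 9, 5, 0, 0, 0]
shifted = [0, 0, 0, 5, 9, 5, 0, 0]
approx_cost, approx_path = fastdtw(pulse, shifted)
```

Because only cells near the projected path are evaluated, the result is an upper bound on the exact DTW cost; a larger `radius` tightens it at the price of more work.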

Slide 116

Slide 116 text

Elasticsearch Details Phrase search for first pass scores across shape description fingerprints

Slide 117

Slide 117 text

Elasticsearch Details Phrase search for first pass scores across shape description fingerprints. Custom FastDTW and Euclidean distance plugins to score across the remaining filtered timeseries.
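
The two-pass idea can be sketched without Elasticsearch at all. In this toy version the index is an in-memory list, the phrase match requires contiguous tokens (a slop of zero, stricter than a real phrase query with slop), and the second pass uses Euclidean distance; the metric names and values are hypothetical.

```python
from math import sqrt

INDEX = [  # hypothetical documents: id, shape fingerprint, normalized values
    {"id": "web.requests", "fingerprint": "sdec inc sinc sdec", "values": [10, 1, 2, 15, 4]},
    {"id": "web.errors", "fingerprint": "sdec inc sinc sdec", "values": [12, 2, 3, 14, 3]},
    {"id": "db.queries", "fingerprint": "flat flat inc flat", "values": [5, 5, 5, 6, 6]},
]


def contains_phrase(fingerprint, phrase):
    """True if the phrase tokens appear contiguously and in order."""
    tokens, wanted = fingerprint.split(), phrase.split()
    return any(tokens[i:i + len(wanted)] == wanted
               for i in range(len(tokens) - len(wanted) + 1))


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query_fingerprint, query_values):
    # First pass: a cheap text match discards obviously dissimilar shapes.
    candidates = [doc for doc in INDEX
                  if contains_phrase(doc["fingerprint"], query_fingerprint)]
    # Second pass: numeric distance on the survivors only, closest first.
    return sorted(candidates, key=lambda doc: euclidean(doc["values"], query_values))


results = search("inc sinc sdec", [11, 2, 3, 14, 3])
```

The expensive numeric comparison only ever runs on the handful of documents that survive the text filter.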

Slide 118

Slide 118 text

Elasticsearch Structure { :id => "statsd.numStats", :fingerprint => "sdec inc sinc sdec", :values => "10 1 2 15 4" }

Slide 119

Slide 119 text

Mappings Specify tokenizers “Untouched” fields

Slide 120

Slide 120 text

First pass query (shape description fingerprint): :match => { :fingerprint => { :query => "sdec inc sinc sdec inc", :type => "phrase", :slop => 20 } }

Slide 121

Slide 121 text

Refinement query (raw timeseries): { :custom_score => { :query => <first_pass_query>, :script => "oculus_dtw", :params => { :query_value => "10 20 20 10 30", :query_field => "values.untouched" } } }

Slide 122

Slide 122 text

Skyline Elasticsearch Resque Sinatra Ganglia Graphite StatsD KALE Flask

Slide 123

Slide 123 text

Populating Elasticsearch

Slide 124

Slide 124 text

ES Index resque workers

Slide 125

Slide 125 text

Too slow to update and search

Slide 126

Slide 126 text

New Index Last Index Webapp
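
The slide only names the three pieces, so the following is one plausible reading rather than Oculus's actual mechanism: workers fill a new index while the web app keeps searching the last complete one, and the two are swapped when the import finishes. Plain dicts stand in for Elasticsearch indexes and the class name is hypothetical.

```python
class RotatingIndex:
    """Serve searches from the last complete index while a new one is built."""

    def __init__(self):
        self.last = {}  # read by the web app
        self.new = {}   # written by the import workers

    def write(self, metric_id, document):
        self.new[metric_id] = document

    def search(self, metric_id):
        return self.last.get(metric_id)

    def rotate(self):
        """Promote the freshly built index and start the next one empty."""
        self.last, self.new = self.new, {}
```

Searches never touch an index that is being bulk-updated, at the cost of results lagging by one import cycle.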

Slide 127

Slide 127 text

Sinatra frontend Queries ES Renders results

Slide 128

Slide 128 text

Collections

Slide 129

Slide 129 text

devops <3

Slide 130

Slide 130 text

No content

Slide 131

Slide 131 text

Special thanks to: Dr. Neil Gunther, PerfDynamics; Dr. Brian Whitman, Echonest; Burc Arpat, Facebook; Seth Walker, Etsy; Rafe Colburn, Etsy; Mike Rembetsy, Etsy; John Allspaw, Etsy

Slide 132

Slide 132 text

@abestanway @jonlives Thanks! github.com/etsy/skyline github.com/etsy/oculus