A Deep Dive into Monitoring with Skyline

A talk I gave for the NYC Data Engineering Meetup at eBay. Video: http://g33ktalk.com/etsy-a-deep-dive-into-monitoring-with-skyline/

Abe Stanway

July 23, 2013

Transcript

  1. A Deep Dive into Monitoring with Skyline. Abe Stanway (@abestanway) and Jon Cowie (@jonlives)
  2. None
  3. We have a large stack.

  4. 41 shards, 24 API servers, 72 web servers, 42 Gearman boxes, 150-node Hadoop cluster, 15 memcached boxes, 60 search machines

  5. 41 shards, 24 API servers, 72 web servers, 42 Gearman boxes, 150-node Hadoop cluster, 15 memcached boxes, 60 search machines (plus a lot more for various services)
  6. Not to mention the app itself.

  7. We practice continuous deployment.

  8. de • ploy /diˈploi/ Verb: to release your code for the world to see, hopefully without breaking the Internet
  9. Everyone deploys. 250+ committers.

  10. Hundreds of boxes hosting constantly evolving code...

  11. ...it’s a miracle we stay up, right?

  12. We optimize for quick recovery by anticipating problems...

  13. ...instead of fearing human error.

  14. Can’t fix what you don’t measure! - W. Edwards Deming

  15. StatsD, Skyline, Oculus, Supergrep (homemade!); Graphite, Nagios, Ganglia (not homemade)

  16. Real time error logging

  17. “Not all things that break throw errors.” - Oscar Wilde

  18. StatsD

  19. StatsD::increment(“foo.bar”)
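
The `StatsD::increment` call above is from Etsy's PHP client; under the hood it is a fire-and-forget UDP packet in the StatsD wire format (`metric:1|c` for a counter increment). A minimal Python sketch, assuming a StatsD daemon on the default port 8125:

```python
import socket

def statsd_increment(metric, host="127.0.0.1", port=8125):
    """Build and send a StatsD counter increment (fire-and-forget UDP)."""
    payload = f"{metric}:1|c".encode()  # StatsD wire format for a counter
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))  # no ack: UDP keeps the app fast
    except OSError:
        pass  # no network available; the payload is still built
    return payload

packet = statsd_increment("foo.bar")
```

Because the packet is UDP, instrumented code never blocks on the metrics pipeline, which is what makes "if it moves, graph it" cheap enough to do everywhere.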

  20. If it moves, graph it!

  21. If it moves, graph it! we would graph them ➞

  22. If it doesn’t move, graph it anyway (it might make a run for it)
  23. DASHBOARDS!

  24. [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 60] [1358731200, 20] [1358731200, 20]
  25. DASHBOARDS! x 250,000

  26. None
  27. lol nagios

  28. Unknown anomalies

  29. Kale.

  30. Kale: - leaves - green stuff

  31. Kale: - leaves - green stuff OCULUS SKYLINE

  32. Q). How do you analyze a timeseries for anomalies in real time?
  33. A). Lots of HTTP requests to Graphite’s API!

  34. Q). How do you analyze a quarter million timeseries for anomalies in real time?
  35. SKYLINE

  36. SKYLINE

  37. A real time anomaly detection system

  38. Real time?

  39. Kinda.

  40. StatsD: ten second resolution

  41. Ganglia: one minute resolution

  42. Best case: ~10s to ~1min behind real time

  43. Takes about 70 seconds with our throughput.

  44. Still faster than you would have discovered it otherwise.

  45. Memory > Disk

  46. None
  47. Q). How do you get a quarter million timeseries into Redis on time?
  48. STREAM THAT SHIT!

  49. Graphite’s relay agent: original graphite, backup graphite

  50. Graphite’s relay agent: original graphite, backup graphite. pickles: [statsd.numStats, [1365603412, 73421]] [statsd.numStats, [1365603422, 82345]] [statsd.numStats, [1365603432, 80611]]

  51. Graphite’s relay agent: original graphite, skyline. pickles: [statsd.numStats, [1365603412, 73421]] [statsd.numStats, [1365603422, 82345]] [statsd.numStats, [1365603432, 80611]]
  52. We import from Ganglia too.

  53. Storing timeseries

  54. Minimize I/O Minimize memory

  55. redis.append() - Strings - Constant time - One operation per update
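
A sketch of that append-only storage idea. Redis's APPEND command concatenates bytes onto a string value in amortized constant time, so each update is one operation. A plain dict stands in for Redis here so the example is self-contained (real code would call `append` on a redis-py client), and the key name is made up:

```python
# Minimal sketch of Skyline-style append-only timeseries storage.
class FakeRedis:
    """Stand-in for a Redis client; only the APPEND semantics matter here."""

    def __init__(self):
        self.store = {}

    def append(self, key, value):
        # Like Redis APPEND: concatenate onto the existing string value.
        self.store[key] = self.store.get(key, b"") + value
        return len(self.store[key])

r = FakeRedis()
for ts, val in [(1358711400, 51), (1358711410, 23), (1358711420, 45)]:
    r.append("metrics.statsd.numStats", f"[{ts}, {val}],".encode())

blob = r.store["metrics.statsd.numStats"]
```

The trade-off, as the next slides show, is that the blob has to be decoded on every read, which is where the serialization format starts to matter.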
  56. JSON?

  57. “[1358711400, 51],” => get statsD.numStats ----------------------------

  58. “[1358711400, 51], [1358711410, 23],” => get statsD.numStats ----------------------------

  59. “[1358711400, 51], [1358711410, 23], [1358711420, 45],” => get statsD.numStats ----------------------------

  60. OVER HALF CPU time spent decoding JSON

  61. [1,2]

  62. [ 1 , 2 ] The digits are stuff we care about; the brackets and commas are extra bullshit
  63. MESSAGEPACK

  64. MESSAGEPACK A binary-based serialization protocol

  65. \x93\x01\x02 Array size (16 or 32 bit big endian integer) Things we care about

  66. \x93\x01\x02 Array size (16 or 32 bit big endian integer) Things we care about \x93\x02\x03
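
Those byte strings can be reproduced by hand. A hand-rolled sketch of MessagePack's encoding for a `[timestamp, value]` pair, just to show where the bytes come from (real code would use the msgpack library; this only covers non-negative integers up to 32 bits):

```python
def pack_pair(ts, val):
    """Encode [ts, val] in MessagePack: a fixarray header followed by ints."""
    def pack_int(n):
        if n < 128:
            return bytes([n])                  # positive fixint: the byte itself
        return b"\xce" + n.to_bytes(4, "big")  # 0xce marks a big-endian uint 32
    return bytes([0x90 | 2]) + pack_int(ts) + pack_int(val)  # 0x92: fixarray of 2

packed = pack_pair(1358711400, 51)  # 7 bytes vs. 18 characters of JSON
```

A Unix timestamp plus a small value fits in 7 bytes, with no quote, bracket, or comma characters to parse back out, which is where the "cut in half" run time and memory on the next slide come from.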
  67. CUT IN HALF Run Time + Memory Used

  68. ROOMBA.PY CLEANS THE DATA

  69. “Wait...you wrote this in Python?”

  70. Great statistics libraries. Not fun for parallelism.

  71. The Analyzer: a simple map/reduce design

  72. The Analyzer: assign Redis keys to each process; each process decodes and analyzes
  73. The Analyzer: anomalous metrics are written out as JSON, and the front end retrieves them with setInterval()
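
The Analyzer's map/reduce shape can be sketched as: shard the Redis keys across workers, let each worker decode and analyze its share, then merge the results. This runs sequentially for clarity where Skyline would hand each chunk to a `multiprocessing.Process`; the key names and the anomaly predicate are made-up stand-ins:

```python
# Hypothetical metric keys standing in for the quarter million in Redis.
ALL_KEYS = [f"metrics.host{i}.cpu" for i in range(20)]

def analyze_chunk(keys):
    # "map" step: each worker decodes its timeseries and runs the detectors.
    # A deterministic toy predicate marks some keys as anomalous.
    return [k for k in keys if sum(map(ord, k)) % 5 == 0]

n_workers = 4
chunks = [ALL_KEYS[i::n_workers] for i in range(n_workers)]  # assign keys round-robin
results = map(analyze_chunk, chunks)                         # would be one process per chunk
anomalies = sorted(k for chunk in results for k in chunk)    # "reduce": merge the votes
```

Because each timeseries is analyzed independently, the work is embarrassingly parallel; the only shared state is the Redis store on one side and the JSON results file on the other.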
  74. None
  75. What does it mean to be anomalous?

  76. Consensus model

  77. [yes] [yes] [no] [no] [yes] [yes] = anomaly!

  78. Helps correct model mismatches

  79. Implement everything you can get your hands on
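
The consensus model above can be sketched with a few toy detectors: run every algorithm you have, count the yes votes, and call the datapoint anomalous only when enough of them agree. The detectors and thresholds here are illustrative stand-ins, not Skyline's real algorithms:

```python
def over_fixed_ceiling(s):
    return s[-1] > 50                      # arbitrary illustrative threshold

def big_jump(s):
    return abs(s[-1] - s[-2]) > 10         # sudden step from the previous point

def far_above_mean(s):
    history = s[:-1]
    return s[-1] > 2 * sum(history) / len(history)

ALGORITHMS = [over_fixed_ceiling, big_jump, far_above_mean]
CONSENSUS = 2                               # votes needed to call it an anomaly

def is_anomalous(series):
    votes = sum(1 for algo in ALGORITHMS if algo(series))
    return votes >= CONSENSUS               # [yes] [yes] [no] = anomaly!
```

No single model fits every metric, so requiring agreement damps the false positives any one mismatched model would produce on its own.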

  80. Basic algorithm: “A metric is anomalous if its latest datapoint is over three standard deviations above its moving average.”

  81. ...(aka, the basic tenet of SPC) http://en.wikipedia.org/wiki/Statistical_process_control
  82. Mean 34.1% 34.1% 13.6% 13.6% 2.1% 2.1%

  83. Mean 34.1% 34.1% 13.6% 13.6% 2.1% 2.1% (if your datapoint is in here, it’s an anomaly)
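
The basic three-sigma rule is a few lines of code. In this sketch a plain mean of the history stands in for the moving average:

```python
from statistics import mean, stdev

def three_sigma(series):
    """A metric is anomalous if its latest datapoint is over three standard
    deviations above its (moving) average."""
    history, latest = series[:-1], series[-1]
    return latest - mean(history) > 3 * stdev(history)

quiet = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21]
```

On well-behaved data only about 0.3% of points fall outside three sigma, which is the SPC argument for the threshold.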
  84. Histogram binning

  85. Take some data

  86. Find most recent datapoint value is 40

  87. Make a histogram

  88. Check which bin contains most recent data

  89. Check which bin contains most recent data: latest value is 40, tiny bin size, so...anomaly!
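
The histogram-binning steps above can be sketched directly: bin the whole series, then flag the latest datapoint if it lands in a near-empty bin. The bin count and the "tiny bin" cutoff here are illustrative choices, not Skyline's settings:

```python
def histogram_bins(series, n_bins=15):
    """Flag the latest datapoint if it falls into a near-empty histogram bin."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1  # guard against a perfectly flat series
    counts = [0] * n_bins
    for x in series:
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    latest_bin = min(int((series[-1] - lo) / width), n_bins - 1)
    return counts[latest_bin] <= 1   # "tiny bin": little besides the point itself
```

Unlike the three-sigma test, this makes no normality assumption; it only asks whether the latest value is rare relative to the empirical distribution.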
  90. Ordinary least squares

  91. Take some data

  92. Fit a regression line

  93. Find residuals

  94. Three sigma winner!
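
Slides 91-94 in code: fit a regression line, find the residuals, and apply three sigma to them. One reasonable variant, fitting only the history so a spike cannot tilt its own trend line (an assumption of this sketch, not necessarily Skyline's exact formulation):

```python
def least_squares_check(series):
    """Fit OLS through the history; flag the newest point if it sits more than
    three sigma (of the fit's residuals) away from the line's prediction."""
    history, latest = series[:-1], series[-1]
    n = len(history)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(history) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, history))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, history)]
    sigma = (sum(r * r for r in residuals) / (n - 2)) ** 0.5  # regression std error
    prediction = slope * n + intercept
    return abs(latest - prediction) > 3 * sigma

quiet = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21]
```

Measuring against the trend line rather than a flat mean lets this detector tolerate metrics that drift steadily upward or downward.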

  95. Median absolute deviation

  96. Median absolute deviation (calculate residuals with respect to median instead of regression line)
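
In code, the median-based variant looks like this. The threshold of 6 is a commonly used cutoff for MAD tests, assumed here rather than quoted from the talk:

```python
from statistics import median

def mad_check(series, threshold=6):
    """Median absolute deviation: measure the latest point's deviation from
    the median, scaled by the median of all such deviations. The median
    baseline keeps past spikes from distorting it."""
    med = median(series)
    deviations = [abs(x - med) for x in series]
    mad = median(deviations)
    if mad == 0:
        return False  # perfectly flat series: nothing to scale by
    return deviations[-1] / mad > threshold

quiet = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21]
```

Because medians ignore extreme values, a single earlier spike barely moves the baseline, which addresses the "spike influence" horseman a few slides later.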
  97. Exponentially weighted moving average

  98. Instead of:

  99. Add a decay factor!

  100. Adding decay discounts older values.
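
The decay-factor idea is a one-line recurrence: instead of weighting every datapoint equally, each new point gets weight alpha and the running average keeps 1 - alpha, so old values fade geometrically. A minimal sketch (the default alpha of 0.1 matches the smoothing factor mentioned later in the talk):

```python
def ewma(series, alpha=0.1):
    """Exponentially weighted moving average with decay factor alpha."""
    avg = series[0]
    for x in series[1:]:
        avg = alpha * x + (1 - alpha) * avg  # new point in, old average discounted
    return avg
```

A smaller alpha means a smoother, slower-moving average; a larger alpha tracks the series more closely but forgives past spikes faster.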

  101. Four horsemen of the modelpocalypse

  102. 1. Seasonality 2. Spike influence 3. Normality 4. Parameters

  103. Anomaly?

  104. Nope.

  105. Spikes artificially raise the moving average: anomaly detected (yay!), but the next anomaly is missed :( because of the bigger moving average
  106. Real world data doesn’t necessarily follow a perfect normal distribution.

  107. !=

  108. Simple systems, simple definitions of “anomalous”

  109. Complex systems, complex definitions of “anomalous”

  110. Not to mention that complex systems evolve

  111. How do you avoid false positives when the measured processes evolve?
  112. Ionno.

  113. Parameters!

  114. Parameters are cool! Predicted page views

  115. Cool model bro. (it’s a simplified Holt-Winters)

  116. What are the parameters?

  117. Seasonality: 365 days. Overall trend weight: .68. Seasonal regression weight: .32. EWMA smoothing factor: .1
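
Those parameters read like a blend of an EWMA level and a per-phase seasonal term, weighted against each other. One plausible and much simplified reading, purely for illustration; the update rules below are assumptions of this sketch, not Etsy's actual model:

```python
def seasonal_blend_forecast(series, period, trend_w=0.68, seasonal_w=0.32, alpha=0.1):
    """Toy simplified-Holt-Winters-style forecast: an EWMA level (smoothing
    factor alpha) blended with a per-phase seasonal average, weighted by
    trend_w and seasonal_w."""
    level = series[0]
    seasonal = {}
    for i, x in enumerate(series):
        level = alpha * x + (1 - alpha) * level        # EWMA smoothing of the level
        seasonal.setdefault(i % period, []).append(x)  # group history by seasonal phase
    phase = len(series) % period                       # phase of the next point
    seasonal_part = sum(seasonal[phase]) / len(seasonal[phase])
    return trend_w * level + seasonal_w * seasonal_part
```

Even in this toy form, the next slide's point is visible: the weights and the smoothing factor have to be fit to each metric, and with a 365-day season that training needs at least a year of history per timeseries.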
  118. Must train to discover the lowest-error parameters

  119. Mad expensive, yo. these people do not represent our CPUs

  120. No good anomalies without good models.

  121. A robust set of algorithms is the current focus of this project.
  122. Thanks! @abestanway github.com/etsy/skyline