Berlin 2013 - Kale Workshop - Abe Stanway

WELCOME TO BROOKLYN: A WORKSHOP ON KALE Abe Stanway @abestanway

Disclaimer: still in beta

Kale is composed of two sister services: Skyline and Oculus

SKYLINE

Q). How do you analyze a timeseries for anomalies in
real time?

A). Lots of HTTP requests to Graphite’s API!

Q). How do you analyze a quarter million timeseries for
anomalies in real time?

Skyline!

Real time?

Kinda.

StatsD Ten second resolution

Ganglia One minute resolution

~ 10s ( ~ 1min Best case:

( Takes about 70 seconds with our throughput.

( Still faster than you would have discovered it otherwise.

Memory > Disk

Q). How do you get a quarter million timeseries into
Redis on time?

STREAM THAT SHIT!

Graphite’s relay agent original graphite backup graphite

Graphite’s relay agent original graphite backup graphite [statsd.numStats, [1365603422, 82345]]
pickles [statsd.numStats, [1365603432, 80611]] [statsd.numStats, [1365603412, 73421]]

Graphite’s relay agent original graphite skyline [statsd.numStats, [1365603422, 82345]] pickles
[statsd.numStats, [1365603432, 80611]] [statsd.numStats, [1365603412, 73421]]

We import from Ganglia too.

Storing timeseries

Minimize I/O Minimize memory

redis.append() - Strings - Constant time - One operation per
update

“[1358711400, 51],” => get statsD.numStats ----------------------------

“[1358711400, 51], => get statsD.numStats ---------------------------- [1358711410, 23],”

“[1358711400, 51], => get statsD.numStats ---------------------------- [1358711410, 23], [1358711420, 45],”

OVER HALF CPU time spent decoding JSON

[ 1 , 2 ] Stuff we care about Extra
bullshit

MESSAGEPACK

MESSAGEPACK A binary-based serialization protocol

\x93\x01\x02 Array size (16 or 32 bit big endian integer)
Things we care about

\x93\x01\x02 Array size (16 or 32 bit big endian integer)
Things we care about \x93\x02\x03

CUT IN HALF Run Time + Memory Used

ROOMBA.PY CLEANS THE DATA

“Wait...you wrote this in Python?”

Great statistics libraries Not fun for parallelism

Simple map/reduce design The Analyzer

Assign Redis keys to each process Process decodes and analyzes
The Analyzer

Anomalous metrics written as JSON setInterval() retrieves from front end
The Analyzer

What does it mean to be anomalous?

Consensus model

[yes] [yes] [no] [no] [yes] [yes] = anomaly!

Helps correct model mismatches

Implement everything you can get your hands on

Basic algorithm: “A metric is anomalous if its latest datapoint
is over three standard deviations above its moving average.”

Histogram binning

Take some data

Find most recent datapoint value is 40

Make a histogram

Check which bin contains most recent data

Check which bin contains most recent data latest value is
40, tiny bin size, so...anomaly!

Ordinary least squares

Take some data

Fit a regression line

Find residuals

Three sigma winner!

Median absolute deviation

Median absolute deviation (calculate residuals with respect to median instead
of regression line)

Exponentially weighted moving average

Instead of:

Add a decay factor!

These algorithms aren’t good enough.

A robust set of algorithms is the current focus of
this project.

Q). How do you analyze a quarter million timeseries for
correlations?

OCULUS

Image comparison is expensive and slow

“[[975, 1365528530], [643, 1365528540], [750, 1365528550], [992, 1365528560], [580, 1365528570],
[586, 1365528580], [649, 1365528590], [548, 1365528600], [901, 1365528610], [633, 1365528620]]” Use raw timeseries instead of raw graphs

Euclidian Distance

Dynamic Time Warping (helps with phase shifts)

We’ve solved it!

O(N2) x 250k

Too slow!

No need to run DTW on all 250k.

Discard obviously dissimilar metrics.

“975 643 643 750 992 992 992 580” “sharpdecrement flat
increment sharpincrement flat flat shapdecrement” Shape Description Alphabet

“975 643 643 750 992 992 992 580” “sharpdecrement flat
increment sharpincrement flat flat shapdecrement” Shape Description Alphabet “24 4 4 11 25 25 25 0 1” (normalization step)

Search for shape description fingerprint in Elasticsearch

Run DTW on results as final polish

O(N2) on ~10k metrics

Still too slow.

Fast DTW - O(N) similar strategy - coarse, then refine

Elasticsearch Details Phrase search for first pass scores across shape
description fingerprints

Elasticsearch Details Phrase search for first pass scores across shape
description fingerprints Custom FastDTW and euclidian distance plugins to score across the remaining filtered timeseries

Elasticsearch Structure { :id => “statsd.numStats”, :fingerprint => “sdec inc
sinc sdec”, :values => "10 1 2 15 4" }

First pass query :match => { :fingerprint => { :query
=> “sdec inc sinc sdec inc”, :type => "phrase", :slop => 20 } } shape description fingerprint

Refinement query {:custom_score => { :query => <ﬁrst_pass_query>, :script =>
"oculus_dtw", :params => { :query_value => “10 20 20 10 30”, :query_ﬁeld => "values.untouched", }, } raw timeseries

Skyline Elasticsearch Resque Sinatra Ganglia Graphite StatsD KALE Flask

Populating Elasticsearch

ES Index resque workers

Too slow to update and search

New Index Last Index Webapp

Sinatra frontend Queries ES Renders results

Happy monitoring. @abestanway github.com/etsy/skyline github.com/etsy/oculus

Berlin 2013 - Kale Workshop - Abe Stanway

Berlin 2013 - Kale Workshop - Abe Stanway

More Decks by Monitorama

Featured

Transcript