WELCOME TO BROOKLYN: A WORKSHOP ON KALE. Abe Stanway, @abestanway
Disclaimer: still in beta
Kale is composed of two sister services: Skyline and Oculus
SKYLINE
Q). How do you analyze a timeseries for anomalies in real time?
A). Lots of HTTP requests to Graphite’s API!
Q). How do you analyze a quarter million timeseries for anomalies in real time?
Skyline!
Real time?
Kinda.
StatsD: ten second resolution
Ganglia: one minute resolution
Best case: ~10s to ~1min
Takes about 70 seconds with our throughput.
Still faster than you would have discovered it otherwise.
Memory > Disk
Q). How do you get a quarter million timeseries into Redis on time?
STREAM THAT SHIT!
Graphite’s relay agent: original graphite, backup graphite
Graphite’s relay agent: original graphite, backup graphite. Pickles: [statsd.numStats, [1365603412, 73421]], [statsd.numStats, [1365603422, 82345]], [statsd.numStats, [1365603432, 80611]]
Graphite’s relay agent: original graphite, skyline. Pickles: [statsd.numStats, [1365603412, 73421]], [statsd.numStats, [1365603422, 82345]], [statsd.numStats, [1365603432, 80611]]
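A minimal sketch of what receiving those pickles could look like. It assumes the framing used by Graphite's pickle protocol: a 4-byte big-endian length header followed by a pickled list of (metric, (timestamp, value)) tuples. The function names are hypothetical, and unpickling data from an untrusted sender is unsafe, so treat this as an illustration only.

```python
import pickle
import struct

def encode_batch(datapoints):
    """Frame a batch the way a relay would send it: length header + pickle."""
    payload = pickle.dumps(datapoints, protocol=2)
    return struct.pack("!L", len(payload)) + payload

def decode_batch(message):
    """Read the 4-byte length header, then unpickle exactly that many bytes."""
    (length,) = struct.unpack("!L", message[:4])
    return pickle.loads(message[4:4 + length])

batch = [
    ("statsd.numStats", (1365603412, 73421)),
    ("statsd.numStats", (1365603422, 82345)),
    ("statsd.numStats", (1365603432, 80611)),
]
message = encode_batch(batch)
decoded = decode_batch(message)
```

Pointing the relay at a second destination that speaks this framing is what lets a listener receive the same stream the backup Graphite would.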
We import from Ganglia too.
Storing timeseries
Minimize I/O. Minimize memory.
redis.append(): strings, constant time, one operation per update
JSON?
=> get statsD.numStats ---------------------------- “[1358711400, 51],”
=> get statsD.numStats ---------------------------- “[1358711400, 51],[1358711410, 23],”
=> get statsD.numStats ---------------------------- “[1358711400, 51],[1358711410, 23],[1358711420, 45],”
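The append-only string layout above can be sketched without a Redis server. The class below is a hypothetical in-memory stand-in that mimics only the APPEND and GET string commands; decoding wraps the stored string in brackets and drops the trailing comma so it parses as one JSON array.

```python
import json

class FakeRedis:
    """Hypothetical in-memory stand-in for Redis string APPEND and GET."""

    def __init__(self):
        self._store = {}

    def append(self, key, value):
        # Like APPEND: create the key if missing, return the new length.
        self._store[key] = self._store.get(key, "") + value
        return len(self._store[key])

    def get(self, key):
        return self._store.get(key)

def decode_series(raw):
    """Turn '[t, v],[t, v],' into a list of [t, v] pairs."""
    return json.loads("[" + raw.rstrip(",") + "]")

store = FakeRedis()
for timestamp, value in [(1358711400, 51), (1358711410, 23), (1358711420, 45)]:
    # One write per datapoint, no read-modify-write cycle.
    store.append("statsD.numStats", "[%d, %d]," % (timestamp, value))

series = decode_series(store.get("statsD.numStats"))
```

The write path stays cheap; the cost moves to the reader, which has to parse the whole string on every analysis pass.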
OVER HALF of CPU time spent decoding JSON
[1,2]
[ 1 , 2 ]: stuff we care about vs. extra bullshit
MESSAGEPACK
MESSAGEPACK: a binary-based serialization protocol
\x92\x01\x02: array size header (one byte for short arrays; a 16 or 32 bit big-endian integer for longer ones), then the things we care about
\x92\x01\x02 \x92\x02\x03: array size header, then the things we care about, repeated for the next datapoint
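To make the byte layout concrete, here is a hand-rolled sketch that covers only a tiny subset of MessagePack: a two-element array of integers in 0..127. In the MessagePack format such an array starts with the single header byte 0x92 (0x90 plus the element count), and each small integer is stored as itself. This is not the msgpack library and the function names are hypothetical; it only shows why packed datapoints can be appended back to back and decoded as a stream.

```python
def pack_pair(a, b):
    """Encode [a, b] as a MessagePack fixarray of two positive fixints."""
    for n in (a, b):
        if not 0 <= n <= 127:
            raise ValueError("this sketch only handles integers in 0..127")
    return bytes([0x92, a, b])

def unpack_stream(buf):
    """Decode a concatenation of packed pairs, three bytes at a time."""
    pairs = []
    i = 0
    while i < len(buf):
        if buf[i] != 0x92:
            raise ValueError("expected a two-element array header")
        pairs.append([buf[i + 1], buf[i + 2]])
        i += 3
    return pairs

# Appending a new datapoint is plain byte concatenation.
stream = pack_pair(1, 2) + pack_pair(2, 3)
decoded = unpack_stream(stream)
```

There are no brackets, commas, or whitespace to scan past, which is where the JSON decoder was spending its time.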
CUT IN HALF: run time + memory used
ROOMBA.PY CLEANS THE DATA
“Wait...you wrote this in Python?”
Great statistics libraries. Not fun for parallelism.
The Analyzer: simple map/reduce design
The Analyzer: assign Redis keys to each process; each process decodes and analyzes
The Analyzer: anomalous metrics written as JSON; the front end retrieves them with setInterval()
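A rough sketch of the map/reduce split described above, under assumptions: keys are divided into contiguous slices, one per process, and each slice is analyzed independently before the anomalies are merged and serialized. The slices run sequentially here to keep the example small; in a real deployment each would presumably run in its own process. All names and the toy detector are hypothetical.

```python
import json
from math import ceil

def assigned_keys(keys, process_index, process_count):
    """Return the contiguous slice of keys owned by one process."""
    per_process = ceil(len(keys) / process_count)
    start = process_index * per_process
    return keys[start:start + per_process]

def analyze_slice(key_slice, series_by_key, is_anomalous):
    """Map step: check only the keys assigned to this process."""
    return [key for key in key_slice if is_anomalous(series_by_key[key])]

# Hypothetical data: only metric.7 ends in a spike.
series_by_key = {"metric.%d" % i: [1, 1, 1, 1] for i in range(10)}
series_by_key["metric.7"] = [1, 1, 1, 50]
keys = sorted(series_by_key)
process_count = 4

# Reduce step: merge the per-process results into one list.
anomalies = []
for index in range(process_count):
    key_slice = assigned_keys(keys, index, process_count)
    anomalies.extend(analyze_slice(key_slice, series_by_key, lambda s: s[-1] > 10))

# Serialized form that a front end could poll.
anomalies_json = json.dumps(anomalies)
```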
What does it mean to be anomalous?
Consensus model
[yes] [yes] [no] [no] [yes] [yes] = anomaly!
Helps correct model mismatches
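The voting idea can be sketched in a few lines. Each algorithm returns yes or no for a series, and the series is flagged only when enough of them agree. The threshold of four and the six toy detectors are assumptions chosen to reproduce the yes/yes/no/no/yes/yes example; they are not the project's real algorithms.

```python
def consensus_vote(series, algorithms, consensus):
    """Run every algorithm and flag the series if enough of them say yes."""
    votes = [bool(algorithm(series)) for algorithm in algorithms]
    return votes, sum(votes) >= consensus

# Hypothetical detectors, each looking at the latest datapoint differently.
algorithms = [
    lambda s: s[-1] > 30,
    lambda s: s[-1] > 35,
    lambda s: s[-1] > 50,
    lambda s: s[-1] > 100,
    lambda s: s[-1] > 2 * max(s[:-1]),
    lambda s: s[-1] - s[-2] > 20,
]

series = [10, 11, 9, 12, 40]
votes, anomalous = consensus_vote(series, algorithms, consensus=4)
```

A single detector whose model fits the data badly gets outvoted instead of paging someone.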
Implement everything you can get your hands on
Basic algorithm: “A metric is anomalous if its latest datapoint is over three standard deviations above its moving average.”
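A minimal sketch of that rule using only the standard library. The window length and the choice to exclude the latest datapoint from the average are assumptions; the slide does not specify either.

```python
from statistics import mean, pstdev

def over_three_sigma(values, window=30):
    """True if the latest value is more than 3 standard deviations
    above the mean of the preceding window."""
    if len(values) < 3:
        return False
    history = values[-(window + 1):-1]  # the window just before the latest point
    latest = values[-1]
    return latest > mean(history) + 3 * pstdev(history)

steady = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
spike = steady[:-1] + [40]
```

Here `over_three_sigma(steady)` is False and `over_three_sigma(spike)` is True.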
Histogram binning
Take some data
Find most recent datapoint: value is 40
Make a histogram
Check which bin contains most recent data
Check which bin contains most recent data: latest value is 40, tiny bin size, so... anomaly!
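The histogram steps above can be sketched as follows. The number of bins and the count below which a bin is considered "tiny" are assumptions, as is including the latest datapoint in the histogram.

```python
def bin_index(value, lowest, width, bins):
    """Index of the equal-width bin holding value (top edge goes in the last bin)."""
    return min(int((value - lowest) / width), bins - 1)

def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lowest = min(values)
    width = (max(values) - lowest) / bins
    counts = [0] * bins
    for v in values:
        counts[bin_index(v, lowest, width, bins)] += 1
    return counts, lowest, width

def in_tiny_bin(values, bins=15, max_count=2):
    """True if the latest value lands in a sparsely populated bin."""
    if max(values) == min(values):
        return False  # a flat series has nothing unusual in it
    counts, lowest, width = histogram(values, bins)
    return counts[bin_index(values[-1], lowest, width, bins)] <= max_count

data = [10, 11, 12, 10, 11, 12, 10, 11, 12, 10, 11, 12, 11, 10, 12, 40]
```

With this data the latest value, 40, sits alone in the top bin, so `in_tiny_bin(data)` is True; without the final point, `in_tiny_bin(data[:-1])` is False.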
Ordinary least squares
Fit a regression line
Find residuals
Three sigma winner!
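The three steps (fit a line, take residuals, apply three sigma) might look like this in plain Python. Fitting over the whole series including the latest point, and comparing the absolute value of the last residual, are assumptions made for the sketch.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def ols_anomalous(points):
    """True if the last residual is more than 3 sigma from the fitted line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in points]
    sigma = pstdev(residuals)
    if sigma == 0:
        return False
    return abs(residuals[-1]) > 3 * sigma

# Hypothetical trending series with small alternating noise.
clean = [(x, 2 * x + (1 if x % 2 == 0 else -1)) for x in range(30)]
spiky = clean[:-1] + [(29, clean[-1][1] + 100)]
```

Because the line follows the trend, a steadily rising metric is not flagged just for being higher than it was an hour ago.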
Median absolute deviation
Median absolute deviation (calculate residuals with respect to median instead of regression line)
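A small sketch of the median-based variant. The threshold of 6 is an assumption picked for illustration, not a value taken from the slides.

```python
from statistics import median

def mad_anomalous(values, threshold=6.0):
    """True if the latest value is far from the median, measured in
    units of the median absolute deviation."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        return False  # avoid dividing by zero on a flat series
    return deviations[-1] / mad > threshold

steady = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
spike = steady[:-1] + [40]
```

The median barely moves when a single outlier arrives, so the spike does not inflate its own baseline the way it does with a mean and standard deviation.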
Exponentially weighted moving average
Instead of:
Add a decay factor!
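One common way to write the decayed average, shown as a sketch: each new value is blended with the running average using a smoothing factor `alpha` between 0 and 1. The parameter name and the choice to seed the average with the first value are assumptions.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average; higher alpha forgets faster."""
    average = values[0]
    smoothed = [average]
    for value in values[1:]:
        # Older datapoints decay by a factor of (1 - alpha) at every step.
        average = alpha * value + (1 - alpha) * average
        smoothed.append(average)
    return smoothed

smoothed = ewma([10, 20, 30], alpha=0.5)
```

For this input the result is `[10, 15.0, 22.5]`, while a plain average of all three points would be 20: recent values count for more.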
These algorithms aren’t good enough.
A robust set of algorithms is the current focus of this project.
Q). How do you analyze a quarter million timeseries for correlations?
OCULUS
Image comparison is expensive and slow
Use raw timeseries instead of raw graphs: “[[975, 1365528530],[643, 1365528540],[750, 1365528550],[992, 1365528560],[580, 1365528570],[586, 1365528580],[649, 1365528590],[548, 1365528600],[901, 1365528610],[633, 1365528620]]”
Euclidean Distance
Dynamic Time Warping (helps with phase shifts)
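To show why warping helps with phase shifts, here is a textbook sketch of both measures in plain Python. It is the classic dynamic-programming form of DTW with absolute difference as the local cost, not the plugin code used in Oculus.

```python
from math import inf, sqrt

def euclidean(a, b):
    """Point-by-point distance between two equal-length series."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Dynamic time warping distance, filling an (n+1) x (m+1) cost table."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

bump = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 1, 2, 1, 0, 0, 0]  # same bump, one step earlier
```

The two series have a Euclidean distance of 2.0 but a DTW distance of 0.0, because warping lines the bumps up. The nested loops are also where the quadratic cost comes from.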
We’ve solved it!
O(N^2)
O(N^2) x 250k
Too slow!
No need to run DTW on all 250k.
Discard obviously dissimilar metrics.
Shape Description Alphabet: “975 643 643 750 992 992 992 580” becomes “sharpdecrement flat increment sharpincrement flat flat sharpdecrement”
Shape Description Alphabet: “975 643 643 750 992 992 992 580” becomes “sharpdecrement flat increment sharpincrement flat flat sharpdecrement” (normalization step: “24 4 4 11 25 25 25 0 1”)
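A sketch of how a series could be turned into such a fingerprint. Each step between consecutive values becomes one word. The cutoff between a plain and a "sharp" move is a hypothetical constant chosen so the example from the slide comes out the same; it operates on the raw values here and skips the normalization step.

```python
def describe_shape(values, sharp_threshold=200):
    """Map each consecutive difference to a word of the shape alphabet."""
    words = []
    for previous, current in zip(values, values[1:]):
        delta = current - previous
        if delta == 0:
            words.append("flat")
        elif delta > 0:
            words.append("sharpincrement" if delta > sharp_threshold else "increment")
        else:
            words.append("sharpdecrement" if -delta > sharp_threshold else "decrement")
    return " ".join(words)

fingerprint = describe_shape([975, 643, 643, 750, 992, 992, 992, 580])
```

Once a series is a string of words, finding similar shapes becomes a text search problem.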
Search for shape descriptionfingerprint in Elasticsearch
Run DTW on resultsas final polish
O(N^2) on ~10k metrics
Still too slow.
Fast DTW: O(N). Similar strategy: coarse, then refine.
Elasticsearch Details: phrase search for first pass scores across shape description fingerprints
Elasticsearch Details: phrase search for first pass scores across shape description fingerprints. Custom FastDTW and Euclidean distance plugins to score across the remaining filtered timeseries.
Elasticsearch Structure: {:id => "statsd.numStats", :fingerprint => "sdec inc sinc sdec", :values => "10 1 2 15 4"}
First pass query: :match => {:fingerprint => {:query => "sdec inc sinc sdec inc", :type => "phrase", :slop => 20}} (the query is the shape description fingerprint)
Refinement query: {:custom_score => {:query => <first_pass_query>, :script => "oculus_dtw", :params => {:query_value => "10 20 20 10 30", :query_field => "values.untouched"}}} (the query value is the raw timeseries)
KALE: Skyline, Elasticsearch, Resque, Sinatra, Ganglia, Graphite, StatsD, Flask
Populating Elasticsearch
ES index, resque workers
Too slow to update and search
New index, last index, webapp
Sinatra frontend: queries ES, renders results
Happy monitoring. @abestanway github.com/etsy/skyline github.com/etsy/oculus