A talk I gave for the NYC Data Engineering Meetup at eBay. Video: http://g33ktalk.com/etsy-a-deep-dive-into-monitoring-with-skyline/
A DEEP DIVE INTO Monitoring with Skyline
Abe Stanway (@abestanway), Jon Cowie (@jonlives)
We have a large stack.
- 41 shards
- 24 API servers
- 72 web servers
- 42 Gearman boxes
- 150 node Hadoop cluster
- 15 memcached boxes
- 60 search machines
(plus a lot more for various services)
Not to mention the app itself.
We practice continuous deployment.
de•ploy /diˈploi/ (verb): To release your code for the world to see, hopefully without breaking the Internet
Everyone deploys. 250+ committers.
Hundreds of boxes hosting constantly evolving code...
...it’s a miracle we stay up, right?
We optimize for quick recovery by anticipating problems...
...instead of fearing human error.
“Can’t fix what you don’t measure!” - W. Edwards Deming
StatsD, Graphite, Skyline, Oculus, Supergrep (homemade!); Nagios, Ganglia (not homemade)
Real time error logging
“Not all things that break throw errors.” - Oscar Wilde
StatsD
StatsD::increment("foo.bar")
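For a sense of what that one-liner does: a StatsD counter increment is a tiny plain-text datagram sent over UDP. A minimal Python sketch, assuming a StatsD daemon on the default port 8125; the helper names `counter_packet` and `increment` are made up for illustration and are not part of any client library.

```python
import socket

def counter_packet(name, delta=1):
    """Format a StatsD counter update as '<name>:<delta>|c'."""
    return f"{name}:{delta}|c".encode("ascii")

def increment(name, host="127.0.0.1", port=8125):
    """Fire and forget: send the counter update as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(counter_packet(name), (host, port))

packet = counter_packet("foo.bar")  # what increment("foo.bar") would send
```

UDP means the app never waits on the metrics pipeline, which is why sprinkling counters everywhere stays cheap.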
If it moves, graph it! (we would graph them ➞)
If it doesn’t move, graph it anyway (it might make a run for it)
DASHBOARDS!
[1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 60] [1358731200, 20] [1358731200, 20]
DASHBOARDS! x 250,000
lol nagios
Unknown anomalies
Kale.
Kale:
- leaves
- green stuff
- OCULUS
- SKYLINE
Q). How do you analyze a timeseries for anomalies in real time?
A). Lots of HTTP requests to Graphite’s API!
Q). How do you analyze a quarter million timeseries for anomalies in real time?
SKYLINE
A real time anomaly detection system
Real time?
Kinda.
StatsD: Ten second resolution
Ganglia: One minute resolution
Best case: ~10s, ~1min
Takes about 70 seconds with our throughput.
Still faster than you would have discovered it otherwise.
Memory > Disk
Q). How do you get a quarter million timeseries into Redis on time?
STREAM THAT SHIT!
Graphite’s relay agent ➞ original graphite, backup graphite
Pickles: [statsd.numStats, [1365603422, 82345]] [statsd.numStats, [1365603432, 80611]] [statsd.numStats, [1365603412, 73421]]
Graphite’s relay agent ➞ original graphite, skyline
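A rough sketch of what those pickles look like on the wire. Carbon's pickle protocol sends a pickled list of `(metric, (timestamp, value))` tuples, prefixed with the payload length as a 4-byte big-endian integer. The function names below are illustrative, not Skyline's actual listener code.

```python
import pickle
import struct

def frame_batch(datapoints):
    """Pickle a list of (metric, (timestamp, value)) tuples and prefix the
    payload with its length as a 4-byte big-endian unsigned int."""
    payload = pickle.dumps(datapoints, protocol=2)
    return struct.pack("!L", len(payload)) + payload

def read_batches(stream):
    """Yield each batch from a buffer of back-to-back framed pickles.
    (Only unpickle data from sources you trust.)"""
    offset = 0
    while offset < len(stream):
        (length,) = struct.unpack_from("!L", stream, offset)
        offset += 4
        yield pickle.loads(stream[offset:offset + length])
        offset += length

batch = [
    ("statsd.numStats", (1365603412, 73421)),
    ("statsd.numStats", (1365603422, 82345)),
    ("statsd.numStats", (1365603432, 80611)),
]
wire = frame_batch(batch) + frame_batch(batch[:1])
received = list(read_batches(wire))
```

Anything that can read this framing can pose as a Graphite backend, which is the trick the relay slide relies on.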
We import from Ganglia too.
Storing timeseries
Minimize I/O. Minimize memory.
redis.append()
- Strings
- Constant time
- One operation per update
JSON?
=> get statsD.numStats
----------------------------
"[1358711400, 51],"
=> get statsD.numStats
----------------------------
"[1358711400, 51],[1358711410, 23],"
=> get statsD.numStats
----------------------------
"[1358711400, 51],[1358711410, 23],[1358711420, 45],"
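The append-a-JSON-fragment idea can be sketched like this. A plain dict stands in for Redis here (an assumption, so the example runs anywhere), and `append` / `read_series` are hypothetical helper names.

```python
import json

store = {}  # stand-in for Redis: key -> string value

def append(key, timestamp, value):
    """Mimic a Redis APPEND: tack one serialized datapoint onto the string."""
    store[key] = store.get(key, "") + json.dumps([timestamp, value]) + ","

def read_series(key):
    """Wrap the fragments in brackets to get valid JSON, then decode it all."""
    raw = store.get(key, "")
    return json.loads("[" + raw.rstrip(",") + "]")

append("statsD.numStats", 1358711400, 51)
append("statsD.numStats", 1358711410, 23)
append("statsD.numStats", 1358711420, 45)
series = read_series("statsD.numStats")
```

Writes are one cheap operation each; the cost moves to the reader, which has to decode the entire string every time it wants the series.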
OVER HALF of CPU time spent decoding JSON
[1,2]
[ 1 , 2 ] Stuff we care about: 1 and 2. Extra bullshit: the brackets, comma, and spaces.
MESSAGEPACK: A binary-based serialization protocol
\x92\x01\x02 \x92\x02\x03
\x92: array header carrying the array size (a small array fits its size in this one byte; bigger arrays use a 16 or 32 bit big-endian integer)
\x01\x02: things we care about
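To make the byte layout concrete, here is a sketch that hand-rolls just enough of the MessagePack format to store `[timestamp, value]` pairs back to back. It assumes every timestamp fits in a uint32 and every value is stored as a float64; in practice you would use a MessagePack library rather than `struct`, and the names below are illustrative only.

```python
import struct

FIXARRAY_2 = 0x92  # fixarray header: 0x90 | number of elements
UINT32 = 0xCE      # type tag for a 32-bit big-endian unsigned int
FLOAT64 = 0xCB     # type tag for a 64-bit big-endian float
RECORD = struct.Struct(">BBIBd")  # 15 bytes per [timestamp, value] pair

def pack_datapoint(timestamp, value):
    """Encode [timestamp, value] as a two-element MessagePack array."""
    return RECORD.pack(FIXARRAY_2, UINT32, timestamp, FLOAT64, float(value))

def unpack_stream(buf):
    """Decode datapoints that were appended one after another."""
    points = []
    for offset in range(0, len(buf), RECORD.size):
        header, ts_tag, timestamp, val_tag, value = RECORD.unpack_from(buf, offset)
        if (header, ts_tag, val_tag) != (FIXARRAY_2, UINT32, FLOAT64):
            raise ValueError("not a datapoint written by pack_datapoint")
        points.append((timestamp, value))
    return points

stored = pack_datapoint(1358711400, 51) + pack_datapoint(1358711410, 23)
tiny = bytes([FIXARRAY_2, 1, 2])  # all of [1, 2]: header plus two one-byte ints
```

Because every record is self-delimiting, appended records decode as a stream with no brackets or commas to patch up first.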
CUT IN HALF: Run Time + Memory Used
ROOMBA.PY CLEANS THE DATA
“Wait...you wrote this in Python?”
Great statistics libraries. Not fun for parallelism.
The Analyzer:
- Simple map/reduce design
- Assign Redis keys to each process
- Process decodes and analyzes
- Anomalous metrics written as JSON
- Front end retrieves it with setInterval()
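One simple way to do the "assign keys to each process" step is to cut the key list into contiguous, near-equal slices. This is a sketch under that assumption; `assign_keys` is a hypothetical helper, and the real Analyzer may split work differently.

```python
import math

def assign_keys(keys, n_processes):
    """Split the metric keys into contiguous, near-equal slices, one per process."""
    per_process = math.ceil(len(keys) / n_processes)
    return [keys[i * per_process:(i + 1) * per_process] for i in range(n_processes)]

slices = assign_keys(["a", "b", "c", "d", "e", "f", "g"], 3)
```

Each worker then decodes and analyzes only its own slice (the map), and the anomalous metrics from all workers are merged into one list (the reduce).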
What does it mean to be anomalous?
Consensus model
[yes] [yes] [no] [no] [yes] [yes] = anomaly!
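The voting idea in a few lines of Python. The threshold of 4 is only an assumption that matches the six votes on the slide, and `consensus_vote` is an illustrative name.

```python
def consensus_vote(timeseries, algorithms, consensus):
    """Flag the series only if at least `consensus` algorithms vote yes."""
    votes = [bool(algorithm(timeseries)) for algorithm in algorithms]
    return sum(votes) >= consensus

# Toy detectors standing in for real algorithms: four say yes, two say no.
yes = lambda ts: True
no = lambda ts: False
algorithms = [yes, yes, no, no, yes, yes]
flagged = consensus_vote([20, 20, 60], algorithms, consensus=4)
```

No single model has to be right about every metric; a detector that is a poor fit for a given series just gets outvoted.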
Helps correct model mismatches
Implement everything you can get your hands on
Basic algorithm: “A metric is anomalous if its latest datapoint is over three standard deviations above its moving average.”
...(aka, the basic tenet of SPC) http://en.wikipedia.org/wiki/Statistical_process_control
[Figure: normal distribution centered on the mean, with bands of 34.1%, 13.6%, and 2.1% on each side; callout on the tail: “if your datapoint is in here, it’s an anomaly”]
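A minimal sketch of the three-sigma rule. It assumes the moving average is taken over a fixed window of points before the latest one; the `window` default is arbitrary, and `above_three_sigma` is an illustrative name rather than Skyline's implementation.

```python
from statistics import mean, pstdev

def above_three_sigma(values, window=30):
    """True if the latest value is more than three standard deviations
    above the average of the `window` points that came before it."""
    history, latest = values[-(window + 1):-1], values[-1]
    if len(history) < 2:
        return False  # not enough data to say anything
    return latest > mean(history) + 3 * pstdev(history)

spike = above_three_sigma([20] * 10 + [60])
calm = above_three_sigma([20, 21, 19, 20, 22, 20])
```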
Histogram binning
Take some data
Find most recent datapoint (value is 40)
Make a histogram
Check which bin contains most recent data: latest value is 40, tiny bin size, so... anomaly!
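The histogram steps above can be sketched in plain Python. The number of bins and the cutoff for a "tiny" bin are made-up parameters for illustration, not values from the talk.

```python
def histogram_bin_anomaly(values, n_bins=15, max_count=2):
    """True if the latest value lands in a sparsely populated histogram bin."""
    latest = values[-1]
    lo, hi = min(values), max(values)
    if lo == hi:
        return False  # flat series: everything shares one bin
    width = (hi - lo) / n_bins

    def bin_of(v):
        # The maximum value belongs to the last bin, not one past it.
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_of(v)] += 1
    return counts[bin_of(latest)] <= max_count

lonely = histogram_bin_anomaly([10, 11, 12] * 4 + [40])
crowded = histogram_bin_anomaly([10, 11, 12] * 4 + [40, 11])
```

This check makes no assumption about the shape of the distribution, only that normal values keep landing in well-populated bins.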
Ordinary least squares
Fit a regression line
Find residuals
Three sigma winner!
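Fit, residuals, three sigma: a sketch in plain Python, assuming evenly spaced datapoints so the x axis is just the index. `ols_anomaly` is an illustrative name.

```python
from statistics import mean, pstdev

def ols_anomaly(values):
    """Fit a least-squares line through (index, value), then apply the
    three-sigma test to the residual of the latest point."""
    n = len(values)
    if n < 3:
        return False
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, values)]
    sigma = pstdev(residuals)
    return sigma > 0 and abs(residuals[-1]) > 3 * sigma

trend_with_spike = ols_anomaly([2 * i for i in range(30)] + [200])
noisy_trend = ols_anomaly([2 * i + (1 if i % 2 else -1) for i in range(30)])
```

A steadily climbing metric would trip a plain moving-average test sooner or later; measuring against the fitted line means only departures from the trend count.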
Median absolute deviation (calculate residuals with respect to median instead of regression line)
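A sketch of the median absolute deviation test. The threshold of 6 is an assumed tuning parameter, and `mad_anomaly` is an illustrative name.

```python
from statistics import median

def mad_anomaly(values, threshold=6.0):
    """True if the latest value sits more than `threshold` median absolute
    deviations away from the median of the series."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        return False  # more than half the points equal the median
    return deviations[-1] / mad > threshold

far_out = mad_anomaly([10, 11, 12] * 4 + [40])
typical = mad_anomaly([10, 11, 12] * 4 + [12])
```

Medians barely move when one wild value shows up, so a single spike does not inflate the baseline the way it does with a mean and standard deviation.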
Exponentially weighted moving average
Instead of:
Add a decay factor!
Adding decay discounts older values.
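The decay factor in code: a sketch of the standard EWMA recurrence, where `alpha` is the smoothing factor (the default here is just a placeholder).

```python
def ewma(values, alpha=0.1):
    """s_0 = x_0, then s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = values[0]
    for x in values[1:]:
        smoothed = alpha * x + (1 - alpha) * smoothed
    return smoothed

recent_heavy = ewma([0, 0, 0, 10], alpha=0.5)
plain_mean = sum([0, 0, 0, 10]) / 4
```

With `alpha=0.5` the jump at the end pulls the EWMA to 5.0, while the plain mean only reaches 2.5: older points keep losing weight with every update.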
Four horsemen of the modelpocalypse
1. Seasonality
2. Spike influence
3. Normality
4. Parameters
Anomaly?
Nope.
Spikes artificially raise the moving average. [Figure labels: Anomaly detected (yay!), Bigger moving average, Anomaly missed :(]
Real world data doesn’t necessarily follow a perfect normal distribution.
!=
Simple systems, simple definitions of “anomalous”
Complex systems, complex definitions of “anomalous”
Not to mention that complex systems evolve
How to avoid false positives upon the evolution of the measured processes?
Ionno.
Parameters!
Parameters are cool! [Figure: predicted page views]
Cool model bro. (it’s a simplified Holt-Winters)
What are the parameters?
- Seasonality: 365 day
- Overall trend weight: .68
- Seasonal regression weight: .32
- EWMA smoothing factor: .1
Must train before discovering lowest error for parameters
Mad expensive, yo. (these people do not represent our CPUs)
No good anomalies without good models.
A robust set of algorithms is the current focus of this project.
Thanks! @abestanway github.com/etsy/skyline