A Deep Dive into Monitoring with Skyline

A talk I gave for the NYC Data Engineering Meetup at eBay. Video: http://g33ktalk.com/etsy-a-deep-dive-into-monitoring-with-skyline/

Abe Stanway

July 23, 2013

Transcript

  1. A Deep Dive into Monitoring with Skyline. Abe Stanway (@abestanway) and Jon Cowie (@jonlives)
  2. None
  3. We have a large stack.

  4. 41 shards, 24 API servers, 72 web servers, 42 Gearman boxes, 150-node Hadoop cluster, 15 memcached boxes, 60 search machines

  5. 41 shards, 24 API servers, 72 web servers, 42 Gearman boxes, 150-node Hadoop cluster, 15 memcached boxes, 60 search machines (plus a lot more for various services)
  6. Not to mention the app itself.

  7. We practice continuous deployment.

  8. de • ploy /diˈploi/ Verb: to release your code for the world to see, hopefully without breaking the Internet
  9. Everyone deploys. 250+ committers.

  10. Hundreds of boxes hosting constantly evolving code...

  11. ...it’s a miracle we stay up, right?

  12. We optimize for quick recovery by anticipating problems...

  13. ...instead of fearing human error.

  14. Can’t fix what you don’t measure! - W. Edwards Deming

  15. StatsD, Skyline, Oculus, Supergrep (homemade!); Graphite, Nagios, Ganglia (not homemade)

  16. Real time error logging

  17. “Not all things that break throw errors.” - Oscar Wilde

  18. StatsD

  19. StatsD::increment(“foo.bar”)
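
The `StatsD::increment` call above is from Etsy's PHP client; under the hood it is a fire-and-forget UDP packet in the StatsD wire format (`metric:1|c` for a counter increment). A minimal Python sketch, assuming a StatsD daemon on the default port 8125:

```python
import socket

def statsd_increment(metric, host="127.0.0.1", port=8125):
    """Build and send a StatsD counter increment (fire-and-forget UDP)."""
    payload = f"{metric}:1|c".encode()  # StatsD wire format for a counter
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))  # no ack: UDP keeps the app fast
    except OSError:
        pass  # no network available; the payload is still built
    return payload

packet = statsd_increment("foo.bar")
```

Because the packet is UDP, instrumented code never blocks on the metrics pipeline, which is what makes "if it moves, graph it" cheap enough to do everywhere.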

  20. If it moves, graph it!

  21. If it moves, graph it! we would graph them ➞

  22. If it doesn’t move, graph it anyway (it might make a run for it)
  23. DASHBOARDS!

  24. [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 60] [1358731200, 20] [1358731200, 20]
  25. DASHBOARDS! x 250,000

  26. None
  27. lol nagios

  28. Unknown anomalies

  29. Kale.

  30. Kale: - leaves - green stuff

  31. Kale: - leaves - green stuff OCULUS SKYLINE

  32. Q). How do you analyze a timeseries for anomalies in real time?
  33. A). Lots of HTTP requests to Graphite’s API!

  34. Q). How do you analyze a quarter million timeseries for anomalies in real time?
  35. SKYLINE

  36. SKYLINE

  37. A real time anomaly detection system

  38. Real time?

  39. Kinda.

  40. StatsD: ten second resolution

  41. Ganglia: one minute resolution

  42. Best case: ~10s to ~1min behind real time

  43. Takes about 70 seconds with our throughput.

  44. Still faster than you would have discovered it otherwise.

  45. Memory > Disk

  46. None
  47. Q). How do you get a quarter million timeseries into Redis on time?
  48. STREAM THAT SHIT!

  49. Graphite’s relay agent: original graphite, backup graphite

  50. Graphite’s relay agent: original graphite, backup graphite. pickles: [statsd.numStats, [1365603412, 73421]] [statsd.numStats, [1365603422, 82345]] [statsd.numStats, [1365603432, 80611]]

  51. Graphite’s relay agent: original graphite, skyline. pickles: [statsd.numStats, [1365603412, 73421]] [statsd.numStats, [1365603422, 82345]] [statsd.numStats, [1365603432, 80611]]
  52. We import from Ganglia too.

  53. Storing timeseries

  54. Minimize I/O Minimize memory

  55. redis.append() - Strings - Constant time - One operation per update
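
A sketch of that append-only storage idea. Redis's APPEND command concatenates bytes onto a string value in amortized constant time, so each update is one operation. A plain dict stands in for Redis here so the example is self-contained (real code would call `append` on a redis-py client), and the key name is made up:

```python
# Minimal sketch of Skyline-style append-only timeseries storage.
class FakeRedis:
    """Stand-in for a Redis client; only the APPEND semantics matter here."""

    def __init__(self):
        self.store = {}

    def append(self, key, value):
        # Like Redis APPEND: concatenate onto the existing string value.
        self.store[key] = self.store.get(key, b"") + value
        return len(self.store[key])

r = FakeRedis()
for ts, val in [(1358711400, 51), (1358711410, 23), (1358711420, 45)]:
    r.append("metrics.statsd.numStats", f"[{ts}, {val}],".encode())

blob = r.store["metrics.statsd.numStats"]
```

The trade-off, as the next slides show, is that the blob has to be decoded on every read, which is where the serialization format starts to matter.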
  56. JSON?

  57. “[1358711400, 51],” => get statsD.numStats ----------------------------

  58. “[1358711400, 51], [1358711410, 23],” => get statsD.numStats ----------------------------

  59. “[1358711400, 51], [1358711410, 23], [1358711420, 45],” => get statsD.numStats ----------------------------

  60. OVER HALF CPU time spent decoding JSON

  61. [1,2]

  62. [ 1 , 2 ] The digits are stuff we care about; the brackets and commas are extra bullshit
  63. MESSAGEPACK

  64. MESSAGEPACK A binary-based serialization protocol

  65. \x93\x01\x02 Array size (16 or 32 bit big endian integer) Things we care about

  66. \x93\x01\x02 Array size (16 or 32 bit big endian integer) Things we care about \x93\x02\x03
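
Those byte strings can be reproduced by hand. A hand-rolled sketch of MessagePack's encoding for a `[timestamp, value]` pair, just to show where the bytes come from (real code would use the msgpack library; this only covers non-negative integers up to 32 bits):

```python
def pack_pair(ts, val):
    """Encode [ts, val] in MessagePack: a fixarray header followed by ints."""
    def pack_int(n):
        if n < 128:
            return bytes([n])                  # positive fixint: the byte itself
        return b"\xce" + n.to_bytes(4, "big")  # 0xce marks a big-endian uint 32
    return bytes([0x90 | 2]) + pack_int(ts) + pack_int(val)  # 0x92: fixarray of 2

packed = pack_pair(1358711400, 51)  # 7 bytes vs. 18 characters of JSON
```

A Unix timestamp plus a small value fits in 7 bytes, with no quote, bracket, or comma characters to parse back out, which is where the "cut in half" run time and memory on the next slide come from.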
  67. CUT IN HALF Run Time + Memory Used

  68. ROOMBA.PY CLEANS THE DATA

  69. “Wait...you wrote this in Python?”

  70. Great statistics libraries. Not fun for parallelism.

  71. The Analyzer: a simple map/reduce design

  72. The Analyzer: assign Redis keys to each process; each process decodes and analyzes
  73. The Analyzer: anomalous metrics are written out as JSON, and the front end retrieves them with setInterval()
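
The Analyzer's map/reduce shape can be sketched as: shard the Redis keys across workers, let each worker decode and analyze its share, then merge the results. This runs sequentially for clarity where Skyline would hand each chunk to a `multiprocessing.Process`; the key names and the anomaly predicate are made-up stand-ins:

```python
# Hypothetical metric keys standing in for the quarter million in Redis.
ALL_KEYS = [f"metrics.host{i}.cpu" for i in range(20)]

def analyze_chunk(keys):
    # "map" step: each worker decodes its timeseries and runs the detectors.
    # A deterministic toy predicate marks some keys as anomalous.
    return [k for k in keys if sum(map(ord, k)) % 5 == 0]

n_workers = 4
chunks = [ALL_KEYS[i::n_workers] for i in range(n_workers)]  # assign keys round-robin
results = map(analyze_chunk, chunks)                         # would be one process per chunk
anomalies = sorted(k for chunk in results for k in chunk)    # "reduce": merge the votes
```

Because each timeseries is analyzed independently, the work is embarrassingly parallel; the only shared state is the Redis store on one side and the JSON results file on the other.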
  74. None
  75. What does it mean to be anomalous?

  76. Consensus model

  77. [yes] [yes] [no] [no] [yes] [yes] = anomaly!

  78. Helps correct model mismatches

  79. Implement everything you can get your hands on
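
The consensus model above can be sketched with a few toy detectors: run every algorithm you have, count the yes votes, and call the datapoint anomalous only when enough of them agree. The detectors and thresholds here are illustrative stand-ins, not Skyline's real algorithms:

```python
def over_fixed_ceiling(s):
    return s[-1] > 50                      # arbitrary illustrative threshold

def big_jump(s):
    return abs(s[-1] - s[-2]) > 10         # sudden step from the previous point

def far_above_mean(s):
    history = s[:-1]
    return s[-1] > 2 * sum(history) / len(history)

ALGORITHMS = [over_fixed_ceiling, big_jump, far_above_mean]
CONSENSUS = 2                               # votes needed to call it an anomaly

def is_anomalous(series):
    votes = sum(1 for algo in ALGORITHMS if algo(series))
    return votes >= CONSENSUS               # [yes] [yes] [no] = anomaly!
```

No single model fits every metric, so requiring agreement damps the false positives any one mismatched model would produce on its own.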

  80. Basic algorithm: “A metric is anomalous if its latest datapoint is over three standard deviations above its moving average.”

  81. ...(aka, the basic tenet of SPC) http://en.wikipedia.org/wiki/Statistical_process_control
  82. Mean 34.1% 34.1% 13.6% 13.6% 2.1% 2.1%

  83. Mean 34.1% 34.1% 13.6% 13.6% 2.1% 2.1% (if your datapoint is in here, it’s an anomaly)
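
The basic three-sigma rule is a few lines of code. In this sketch a plain mean of the history stands in for the moving average:

```python
from statistics import mean, stdev

def three_sigma(series):
    """A metric is anomalous if its latest datapoint is over three standard
    deviations above its (moving) average."""
    history, latest = series[:-1], series[-1]
    return latest - mean(history) > 3 * stdev(history)

quiet = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21]
```

On well-behaved data only about 0.3% of points fall outside three sigma, which is the SPC argument for the threshold.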
  84. Histogram binning

  85. Take some data

  86. Find most recent datapoint value is 40

  87. Make a histogram

  88. Check which bin contains most recent data

  89. Check which bin contains most recent data: latest value is 40, tiny bin size, so...anomaly!
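
The histogram-binning steps above can be sketched directly: bin the whole series, then flag the latest datapoint if it lands in a near-empty bin. The bin count and the "tiny bin" cutoff here are illustrative choices, not Skyline's settings:

```python
def histogram_bins(series, n_bins=15):
    """Flag the latest datapoint if it falls into a near-empty histogram bin."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1  # guard against a perfectly flat series
    counts = [0] * n_bins
    for x in series:
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    latest_bin = min(int((series[-1] - lo) / width), n_bins - 1)
    return counts[latest_bin] <= 1   # "tiny bin": little besides the point itself
```

Unlike the three-sigma test, this makes no normality assumption; it only asks whether the latest value is rare relative to the empirical distribution.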
  90. Ordinary least squares

  91. Take some data

  92. Fit a regression line

  93. Find residuals

  94. Three sigma winner!
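
Slides 91-94 in code: fit a regression line, find the residuals, and apply three sigma to them. One reasonable variant, fitting only the history so a spike cannot tilt its own trend line (an assumption of this sketch, not necessarily Skyline's exact formulation):

```python
def least_squares_check(series):
    """Fit OLS through the history; flag the newest point if it sits more than
    three sigma (of the fit's residuals) away from the line's prediction."""
    history, latest = series[:-1], series[-1]
    n = len(history)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(history) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, history))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, history)]
    sigma = (sum(r * r for r in residuals) / (n - 2)) ** 0.5  # regression std error
    prediction = slope * n + intercept
    return abs(latest - prediction) > 3 * sigma

quiet = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21]
```

Measuring against the trend line rather than a flat mean lets this detector tolerate metrics that drift steadily upward or downward.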

  95. Median absolute deviation

  96. Median absolute deviation (calculate residuals with respect to median instead of regression line)
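
In code, the median-based variant looks like this. The threshold of 6 is a commonly used cutoff for MAD tests, assumed here rather than quoted from the talk:

```python
from statistics import median

def mad_check(series, threshold=6):
    """Median absolute deviation: measure the latest point's deviation from
    the median, scaled by the median of all such deviations. The median
    baseline keeps past spikes from distorting it."""
    med = median(series)
    deviations = [abs(x - med) for x in series]
    mad = median(deviations)
    if mad == 0:
        return False  # perfectly flat series: nothing to scale by
    return deviations[-1] / mad > threshold

quiet = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21]
```

Because medians ignore extreme values, a single earlier spike barely moves the baseline, which addresses the "spike influence" horseman a few slides later.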
  97. Exponentially weighted moving average

  98. Instead of:

  99. Add a decay factor!

  100. Adding decay discounts older values.
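
The decay-factor idea is a one-line recurrence: instead of weighting every datapoint equally, each new point gets weight alpha and the running average keeps 1 - alpha, so old values fade geometrically. A minimal sketch (the default alpha of 0.1 matches the smoothing factor mentioned later in the talk):

```python
def ewma(series, alpha=0.1):
    """Exponentially weighted moving average with decay factor alpha."""
    avg = series[0]
    for x in series[1:]:
        avg = alpha * x + (1 - alpha) * avg  # new point in, old average discounted
    return avg
```

A smaller alpha means a smoother, slower-moving average; a larger alpha tracks the series more closely but forgives past spikes faster.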

  101. Four horsemen of the modelpocalypse

  102. 1. Seasonality 2. Spike influence 3. Normality 4. Parameters

  103. Anomaly?

  104. Nope.

  105. Spikes artificially raise the moving average: anomaly detected (yay!), but the next anomaly is missed :( because of the bigger moving average
  106. Real world data doesn’t necessarily follow a perfect normal distribution.

  107. !=

  108. Simple systems, simple definitions of “anomalous”

  109. Complex systems, complex definitions of “anomalous”

  110. Not to mention that complex systems evolve

  111. How do you avoid false positives when the measured processes evolve?
  112. Ionno.

  113. Parameters!

  114. Parameters are cool! Predicted page views

  115. Cool model bro. (it’s a simplified Holt-Winters)

  116. What are the parameters?

  117. Seasonality: 365 days. Overall trend weight: .68. Seasonal regression weight: .32. EWMA smoothing factor: .1
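
Those parameters read like a blend of an EWMA level and a per-phase seasonal term, weighted against each other. One plausible and much simplified reading, purely for illustration; the update rules below are assumptions of this sketch, not Etsy's actual model:

```python
def seasonal_blend_forecast(series, period, trend_w=0.68, seasonal_w=0.32, alpha=0.1):
    """Toy simplified-Holt-Winters-style forecast: an EWMA level (smoothing
    factor alpha) blended with a per-phase seasonal average, weighted by
    trend_w and seasonal_w."""
    level = series[0]
    seasonal = {}
    for i, x in enumerate(series):
        level = alpha * x + (1 - alpha) * level        # EWMA smoothing of the level
        seasonal.setdefault(i % period, []).append(x)  # group history by seasonal phase
    phase = len(series) % period                       # phase of the next point
    seasonal_part = sum(seasonal[phase]) / len(seasonal[phase])
    return trend_w * level + seasonal_w * seasonal_part
```

Even in this toy form, the next slide's point is visible: the weights and the smoothing factor have to be fit to each metric, and with a 365-day season that training needs at least a year of history per timeseries.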
  118. Must train to discover the lowest-error parameters

  119. Mad expensive, yo. these people do not represent our CPUs

  120. No good anomalies without good models.

  121. A robust set of algorithms is the current focus of this project.
  122. Thanks! @abestanway github.com/etsy/skyline