Bring the Noise: Continuously Deploying Under a Hailstorm of Metrics

This talk was given at Velocity '13 in Santa Clara, and an abbreviated version was given at BACON '13 in London. It offers an overview of Etsy's Kale stack.

BACON Video: devslovebacon.com/conferences/bacon-2013/talks/bring-the-noise-continuously-deploying-under-a-hailstorm-of-metrics

Velocity video forthcoming.

Abe Stanway

June 18, 2013

Transcript

  1. BRING THE NOISE! MAKING SENSE OF A HAILSTORM OF METRICS

    Abe Stanway @abestanway, Jon Cowie @jonlives
  2. Ninety minutes is a long time. This talk: - motivations (~10) - skyline (~25)

    - oculus (~30) - demo! (~10) - questions (~15)
  3. Ninety minutes is a long time. This talk: - motivations (~10) - skyline (~25)

    - oculus (~30) - demo! (~10) - questions (~15) But we have some sweet stuff to show you.
  4. Background and Motivations

  5. None
  6. 1.5 billion page views, $117 million of goods sold, 950

    thousand users
  7. 1.5 billion page views, $117 million of goods sold, 950

    thousand users (in December ‘12)
  8. We practice continuous deployment.

  9. de • ploy /diˈploi/ Verb To release your code for

    the world to see, hopefully without breaking the Internet
  10. Everyone deploys. 250+ committers.

  11. Day one: DEPLOY

  12. None
  13. 30+ DEPLOYS A DAY (~8 commits per deploy!)

  14. “30 deploys a day? Is that safe?”

  15. We optimize for quick recovery by anticipating problems...

  16. ...instead of fearing human error.

  17. Can’t fix what you don’t measure! - W. Edwards Deming

  18. homemade!: StatsD, Skyline, Oculus, Supergrep. not homemade: Graphite, Nagios, Ganglia

  19. Real time error logging

  20. “Not all things that break throw errors.” - Oscar Wilde

  21. StatsD

  22. StatsD::increment("foo.bar")
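
As a rough illustration of what a call like the one on slide 22 puts on the wire: StatsD counters travel as small plaintext UDP datagrams of the form `name:value|c`. The Python below is a minimal sketch, not Etsy's client. The helper names `counter_packet` and `increment` and the `localhost:8125` address are assumptions for illustration (8125 is the conventional StatsD port).

```python
import socket


def counter_packet(metric: str, delta: int = 1) -> bytes:
    """Build a StatsD counter datagram, e.g. b'foo.bar:1|c'."""
    return f"{metric}:{delta}|c".encode("ascii")


def increment(metric: str, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget: send one counter increment over UDP.

    host and port are assumed defaults, not values from the talk.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(counter_packet(metric), (host, port))
```

Because UDP needs no connection or acknowledgement, instrumenting a code path this way costs close to nothing, which is part of what makes "graph everything" practical.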

  23. If it moves, graph it!

  24. If it moves, graph it! we would graph them ➞

  25. If it doesn’t move, graph it anyway (it might make

    a run for it)
  26. DASHBOARDS!

  27. [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 20]

    [1358731200, 20] [1358731200, 20] [1358731200, 20] [1358731200, 60] [1358731200, 20] [1358731200, 20]
  28. DASHBOARDS! x 250,000

  29. None
  30. lol nagios

  31. “...but there are also unknown unknowns - there are things

    we do not know we don’t know.”
  32. Unknown anomalies

  33. Unknown correlations

  34. Kale.

  35. Kale: - leaves - green stuff

  36. Kale: - leaves - green stuff OCULUS SKYLINE

  37. Q). How do you analyze a timeseries for anomalies in

    real time?
  38. A). Lots of HTTP requests to Graphite’s API!

  39. Q). How do you analyze a quarter million timeseries for

    anomalies in real time?
  40. SKYLINE

  41. SKYLINE

  42. A real time anomaly detection system

  43. Real time?

  44. Kinda.

  45. StatsD Ten second resolution

  46. Ganglia One minute resolution

  47. Best case: ~10s (StatsD), ~1min (Ganglia)

  48. Takes about 90 seconds with our throughput.

  49. Still faster than you would have discovered it otherwise.

  50. Memory > Disk

  51. None
  52. Q). How do you get a quarter million timeseries into

    Redis on time?
  53. STREAM IT!

  54. Graphite’s relay agent original graphite backup graphite

  55. Graphite’s relay agent original graphite backup graphite [statsd.numStats, [1365603422, 82345]]

    pickles [statsd.numStats, [1365603432, 80611]] [statsd.numStats, [1365603412, 73421]]
  56. Graphite’s relay agent original graphite skyline [statsd.numStats, [1365603422, 82345]] pickles

    [statsd.numStats, [1365603432, 80611]] [statsd.numStats, [1365603412, 73421]]
  57. We import from Ganglia too.

  58. Storing timeseries

  59. Minimize I/O Minimize memory

  60. redis.append() - Strings - Constant time - One operation per

    update
  61. JSON?

  62. get statsD.numStats => "[1358711400, 51],"

  63. get statsD.numStats => "[1358711400, 51], [1358711410, 23],"

  64. get statsD.numStats => "[1358711400, 51], [1358711410, 23], [1358711420, 45],"

  65. OVER HALF of CPU time spent decoding JSON

  66. [1,2]

  67. [ 1 , 2 ] Stuff we care about Extra

    junk
  68. MESSAGEPACK

  69. MESSAGEPACK A binary-based serialization protocol

  70. \x92\x01\x02 Array header (one byte: type and length, for arrays of up to 15 elements)

    Things we care about
  71. \x92\x01\x02 Array header (one byte: type and length, for arrays of up to 15 elements)

    Things we care about \x92\x02\x03
  72. CUT IN HALF Run Time + Memory Used
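
To make the size difference concrete, here is a minimal hand-rolled sketch of the MessagePack byte layout for small arrays of non-negative integers, plus a decoder for records that were appended back to back. It is not the real `msgpack` library and covers only the formats needed here. The function names are illustrative, and a plain `bytes` value stands in for the Redis string that each update is appended to.

```python
import json
import struct

# MessagePack unsigned integer tags and their big-endian struct formats.
UINT_FORMATS = {0xCC: ">B", 0xCD: ">H", 0xCE: ">I", 0xCF: ">Q"}


def pack_uint(n: int) -> bytes:
    """Encode a non-negative integer in the smallest unsigned format."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if n <= 0x7F:
        return bytes([n])  # positive fixint: the byte is the value
    for tag, limit in ((0xCC, 0xFF), (0xCD, 0xFFFF), (0xCE, 0xFFFFFFFF)):
        if n <= limit:
            return bytes([tag]) + struct.pack(UINT_FORMATS[tag], n)
    return b"\xcf" + struct.pack(">Q", n)


def pack_array(items) -> bytes:
    """Encode up to 15 integers as a fixarray (header byte 0x90 | length)."""
    if len(items) > 15:
        raise ValueError("fixarray holds at most 15 elements")
    return bytes([0x90 | len(items)]) + b"".join(pack_uint(i) for i in items)


def unpack_uint(buf: bytes, pos: int):
    """Decode one unsigned integer at pos; return (value, next position)."""
    tag = buf[pos]
    if tag <= 0x7F:
        return tag, pos + 1
    if tag not in UINT_FORMATS:
        raise ValueError(f"unsupported type byte 0x{tag:02x}")
    fmt = UINT_FORMATS[tag]
    (value,) = struct.unpack_from(fmt, buf, pos + 1)
    return value, pos + 1 + struct.calcsize(fmt)


def unpack_stream(buf: bytes):
    """Decode fixarrays that were concatenated by repeated appends."""
    pos, records = 0, []
    while pos < len(buf):
        header = buf[pos]
        if header & 0xF0 != 0x90:
            raise ValueError("expected a fixarray header")
        pos += 1
        record = []
        for _ in range(header & 0x0F):
            value, pos = unpack_uint(buf, pos)
            record.append(value)
        records.append(record)
    return records


# Appending is plain concatenation; no re-encoding of earlier datapoints.
stream = pack_array([1358711400, 51]) + pack_array([1358711410, 23])
json_size = len(json.dumps([1358711400, 51]) + ",")
packed_size = len(pack_array([1358711400, 51]))
```

Each packed datapoint is self-delimiting, so a reader can walk the appended string without separators or a surrounding list, which suits the one-append-per-update pattern from slide 60.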

  73. ROOMBA.PY CLEANS THE DATA

  74. “Wait...you wrote this in Python?”

  75. Great statistics libraries Not fun for parallelism

  76. Assign Redis keys to each process Process decodes and analyzes

    The Analyzer
  77. Anomalous metrics written as JSON setInterval() retrieves from front end

    The Analyzer
  78. None
  79. What does it mean to be anomalous?

  80. Consensus model

  81. Implement everything you can get your hands on

  82. Basic algorithm: “A metric is anomalous if its latest datapoint

    is over three standard deviations above its moving average.”
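
The basic rule on slide 82 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Skyline's implementation: it uses a simple unweighted moving average, and the function name `is_anomalous` and the 30-point default window are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(series, window=30, sigmas=3.0):
    """Flag a series whose latest value sits more than `sigmas` standard
    deviations above the mean of the preceding `window` values.

    series is a list of (timestamp, value) pairs, oldest first.
    window and sigmas are assumed defaults for illustration.
    """
    values = [value for _timestamp, value in series]
    if len(values) < window + 1:
        return False  # not enough history to judge
    history = values[-(window + 1):-1]
    latest = values[-1]
    return latest - mean(history) > sigmas * stdev(history)


# Ten-second datapoints hovering around 20-22, then a jump to 60.
baseline = [(1358731200 + 10 * i, 20 + i % 3) for i in range(40)]
spiked = baseline + [(1358731200 + 400, 60)]
```

Note that the history window here excludes the latest point, so a single spike cannot inflate the average it is being compared against.
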
  83. Grubbs' test, ordinary least squares

  84. Histogram binning
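
One way to picture the consensus model from slide 80 is as a vote: several independent detectors each look at the same series, and the metric is flagged only when enough of them agree. The sketch below is a guess at the general shape, not Skyline's code. The two detectors, the threshold of 6 median absolute deviations, and the `required` vote count are all assumptions chosen for illustration.

```python
from statistics import mean, median, stdev


def three_sigma(values):
    """Latest value is over three standard deviations above the mean."""
    history, latest = values[:-1], values[-1]
    return latest - mean(history) > 3 * stdev(history)


def median_deviation(values, threshold=6):
    """Latest value is far from the median, measured in median absolute
    deviations. The threshold is an assumed value."""
    history, latest = values[:-1], values[-1]
    center = median(history)
    mad = median(abs(v - center) for v in history)
    return mad > 0 and abs(latest - center) / mad > threshold


def consensus(values, detectors, required):
    """Flag the series only if at least `required` detectors agree."""
    votes = sum(1 for detect in detectors if detect(values))
    return votes >= required


DETECTORS = [three_sigma, median_deviation]
steady = [20, 21, 22] * 10 + [22]
spiked = [20, 21, 22] * 10 + [60]
```

Requiring agreement trades sensitivity for fewer false alarms: any single detector can be fooled by a particular data shape, but it is less likely that several different ones are fooled at once.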

  85. Four horsemen of the modelpocalypse

  86. 1. Seasonality 2. Spike influence 3. Normality 4. Parameters

  87. Anomaly?

  88. Nope.

  89. Spikes artificially raise the moving average. Anomaly detected (yay!)

    Anomaly missed :( Bigger moving average
  90. Real world data doesn’t necessarily follow a perfect normal distribution.

  91. Too many metrics to fit parameters for them all!

  92. A robust set of algorithms is the current focus of

    this project.
  93. Q). How do you analyze a quarter million timeseries for

    correlations?
  94. OCULUS

  95. Image comparison is expensive and slow

  96. “[[975, 1365528530], [643, 1365528540], [750, 1365528550], [992, 1365528560], [580, 1365528570],

    [586, 1365528580], [649, 1365528590], [548, 1365528600], [901, 1365528610], [633, 1365528620]]” Use raw timeseries instead of raw graphs
  97. Naming Things Cache Invalidation Numerical Comparison? HARD PROBLEMS

  98. Naming Things Cache Invalidation Numerical Comparison? HARD PROBLEMS

  99. Euclidean Distance

  100. Dynamic Time Warping (helps with phase shifts)
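
A small Python sketch of the two measures from slides 99 and 100 shows why DTW helps with phase shifts. It uses the textbook dynamic-programming formulation of DTW with absolute difference as the local cost, which is an assumption here; the talk does not say which variant Oculus uses.

```python
import math


def euclidean(a, b):
    """Point-by-point distance; the series must have equal length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Dynamic time warping distance via a full cost matrix.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    Filling the whole table takes O(len(a) * len(b)) time and memory.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] also matches an earlier b point
                cost[i][j - 1],      # b[j-1] also matches an earlier a point
                cost[i - 1][j - 1],  # advance both series together
            )
    return cost[n][m]


# The same bump, shifted by one sample.
original = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 1, 2, 1, 0, 0, 0]
```

For these two series the Euclidean distance is 2.0 while the DTW distance is 0.0, because warping lets the bump in one series line up with the bump in the other. The nested loop over both series is also where the quadratic cost comes from.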

  101. We’ve solved it!

  102. O(N^2)

  103. O(N^2) x 250k

  104. Too slow!

  105. doesn’t

  106. No need to run DTW on all 250k.

  107. Discard obviously dissimilar metrics.

  108. “975 643 643 750 992 992 992 580” “sharpdecrement flat

    increment sharpincrement flat flat sharpdecrement” Shape Description Alphabet
  109. “975 643 643 750 992 992 992 580” “sharpdecrement flat

    increment sharpincrement flat flat sharpdecrement” Shape Description Alphabet “24 4 4 11 25 25 25 0 1” (normalization step)
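
The fingerprinting step can be sketched as follows. The 0 to 25 scale and the round-up rule are inferred from the normalized numbers on slide 109; the `sharp` threshold of 10 and the function names are purely hypothetical, picked so the slide's example comes out as shown. Treat this as an illustration of the idea rather than Oculus's actual alphabet.

```python
import math


def normalize(values, scale=25):
    """Rescale values onto integer levels 0..scale (rounding up)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant series has no shape
    return [math.ceil((v - lo) / (hi - lo) * scale) for v in values]


def fingerprint(values, sharp=10):
    """Describe each step between neighbouring levels with one word.

    sharp is an assumed threshold separating ordinary moves from
    sharp ones, measured in normalized levels.
    """
    levels = normalize(values)
    words = []
    for previous, current in zip(levels, levels[1:]):
        delta = current - previous
        if delta == 0:
            words.append("flat")
        elif delta > 0:
            words.append("sharpincrement" if delta >= sharp else "increment")
        else:
            words.append("sharpdecrement" if -delta >= sharp else "decrement")
    return " ".join(words)


example = [975, 643, 643, 750, 992, 992, 992, 580]
```

Turning numbers into words is what lets a text search engine do the first-pass filtering: two metrics with similar shapes produce similar word sequences regardless of their absolute scale.
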
  110. None
  111. Search for shape description fingerprint in Elasticsearch

  112. Run DTW on results as final polish

  113. O(N^2) on ~10k metrics

  114. Still too slow.

  115. FastDTW - O(N): coarsen, project, refine

  116. Elasticsearch Details Phrase search for first pass scores across shape

    description fingerprints
  117. Elasticsearch Details Phrase search for first pass scores across shape

    description fingerprints Custom FastDTW and Euclidean distance plugins to score across the remaining filtered timeseries
  118. Elasticsearch Structure { :id => "statsd.numStats", :fingerprint => "sdec inc

    sinc sdec", :values => "10 1 2 15 4" }
  119. Mappings Specify tokenizers “Untouched” fields

  120. First pass query :match => { :fingerprint => { :query

    => "sdec inc sinc sdec inc", :type => "phrase", :slop => 20 } } shape description fingerprint
  121. Refinement query { :custom_score => { :query => <first_pass_query>, :script =>

    "oculus_dtw", :params => { :query_value => "10 20 20 10 30", :query_field => "values.untouched" } } } raw timeseries
  122. Skyline Elasticsearch Resque Sinatra Ganglia Graphite StatsD KALE Flask

  123. Populating Elasticsearch

  124. ES Index resque workers

  125. Too slow to update and search

  126. New Index Last Index Webapp

  127. Sinatra frontend Queries ES Renders results

  128. Collections

  129. devops <3

  130. None
  131. Special thanks to: Dr. Neil Gunther, PerfDynamics Dr. Brian Whitman,

    Echonest Burc Arpat, Facebook Seth Walker, Etsy Rafe Colburn, Etsy Mike Rembetsy, Etsy John Allspaw, Etsy
  132. @abestanway @jonlives Thanks! github.com/etsy/skyline github.com/etsy/oculus