
Berlin 2013 - Kale Workshop - Abe Stanway

Monitorama
September 20, 2013


Transcript

  1. WELCOME TO BROOKLYN: A WORKSHOP ON KALE Abe Stanway @abestanway

  2. Disclaimer: still in beta

  3. Kale is composed of two sister services: Skyline and Oculus

  4. SKYLINE

  5. SKYLINE

  6. Q). How do you analyze a timeseries for anomalies in real time?

  7. A). Lots of HTTP requests to Graphite’s API!

  8. Q). How do you analyze a quarter million timeseries for anomalies in real time?

  9. Skyline!

  10. Real time?

  11. Kinda.

  12. StatsD: ten second resolution

  13. Ganglia: one minute resolution

  14. Best case: ~10s to ~1min

  15. Takes about 70 seconds with our throughput.

  16. Still faster than you would have discovered it otherwise.

  17. Memory > Disk

  18. (no text on slide)
  19. Q). How do you get a quarter million timeseries into Redis on time?

  20. STREAM THAT SHIT!

  21. Graphite’s relay agent -> original graphite, backup graphite

  22. Graphite’s relay agent -> original graphite, backup graphite. Pickles sent: [statsd.numStats, [1365603412, 73421]], [statsd.numStats, [1365603422, 82345]], [statsd.numStats, [1365603432, 80611]]

  23. Graphite’s relay agent -> original graphite, skyline. Pickles sent: [statsd.numStats, [1365603412, 73421]], [statsd.numStats, [1365603422, 82345]], [statsd.numStats, [1365603432, 80611]]
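
The pickles in the relay diagram follow Graphite's pickle protocol: a 4-byte big-endian length header followed by a pickled list of (metric, (timestamp, value)) tuples. Below is a minimal sketch of framing and unframing one batch with the standard library only. The function names are hypothetical, and this is not Skyline's actual listener code.

```python
import pickle
import struct


def encode_batch(datapoints):
    """Frame a list of (metric, (timestamp, value)) tuples for the wire."""
    payload = pickle.dumps(datapoints, protocol=2)
    header = struct.pack("!L", len(payload))  # 4-byte big-endian length
    return header + payload


def decode_batch(frame):
    """Inverse of encode_batch. Only unpickle data from a trusted relay."""
    (length,) = struct.unpack("!L", frame[:4])
    return pickle.loads(frame[4:4 + length])


batch = [
    ("statsd.numStats", (1365603412, 73421)),
    ("statsd.numStats", (1365603422, 82345)),
    ("statsd.numStats", (1365603432, 80611)),
]
frame = encode_batch(batch)
decoded = decode_batch(frame)
```

A listener that receives these frames can push each datapoint straight into its store without polling Graphite's HTTP API.
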
  24. We import from Ganglia too.

  25. Storing timeseries

  26. Minimize I/O, minimize memory

  27. redis.append(): strings, constant time, one operation per update
  28. JSON?

  29. => get statsD.numStats returns “[1358711400, 51],”

  30. => get statsD.numStats returns “[1358711400, 51], [1358711410, 23],”

  31. => get statsD.numStats returns “[1358711400, 51], [1358711410, 23], [1358711420, 45],”

  32. OVER HALF CPU time spent decoding JSON

  33. [1,2]

  34. [ 1 , 2 ] (labels: “Stuff we care about”, “Extra bullshit”)
  35. MESSAGEPACK

  36. MESSAGEPACK A binary-based serialization protocol

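  37. \x92\x01\x02 (\x92: array size header, a single byte for short arrays and a 16 or 32 bit big endian integer for longer ones; \x01\x02: things we care about)

  38. \x92\x01\x02 \x92\x02\x03 (\x92: array size header, a single byte for short arrays and a 16 or 32 bit big endian integer for longer ones; the rest: things we care about)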
  39. CUT IN HALF: run time + memory used
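
The append-only storage idea can be sketched as follows. In MessagePack, the two-element array [1, 2] encodes to the three bytes 0x92 0x01 0x02, and encoded values can be concatenated and decoded back one after another with no delimiters, which is what makes a single append per update workable. The encoder below is a hand-rolled subset written for illustration (fixed layout: uint32 timestamp, float64 value), and the `store` dict with its `append` helper is a hypothetical stand-in for a Redis string key. In practice the `msgpack` package's `Unpacker` offers streaming decoding of this kind.

```python
import struct

LAYOUT = ">BBIBd"  # array header, uint32 tag, timestamp, float64 tag, value


def pack_datapoint(timestamp, value):
    """Encode [timestamp, value] as a MessagePack array (hand-rolled subset).

    Bytes: 0x92 (array of 2), 0xce + uint32 timestamp, 0xcb + float64 value.
    """
    return struct.pack(LAYOUT, 0x92, 0xCE, timestamp, 0xCB, float(value))


def unpack_stream(blob):
    """Decode back-to-back datapoints written by pack_datapoint."""
    size = struct.calcsize(LAYOUT)  # 15 bytes per datapoint
    points = []
    for offset in range(0, len(blob), size):
        header, ts_tag, ts, val_tag, val = struct.unpack_from(LAYOUT, blob, offset)
        if (header, ts_tag, val_tag) != (0x92, 0xCE, 0xCB):
            raise ValueError("unexpected MessagePack type tag")
        points.append((ts, val))
    return points


# Hypothetical stand-in for Redis: APPEND concatenates bytes onto a string key.
store = {}


def append(key, chunk):
    store[key] = store.get(key, b"") + chunk
    return len(store[key])  # Redis APPEND also returns the new length


for ts, val in [(1358711400, 51), (1358711410, 23), (1358711420, 45)]:
    append("statsD.numStats", pack_datapoint(ts, val))

series = unpack_stream(store["statsD.numStats"])
```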

  40. ROOMBA.PY CLEANS THE DATA

  41. “Wait...you wrote this in Python?”

  42. Great statistics libraries. Not fun for parallelism.

  43. The Analyzer: simple map/reduce design

  44. The Analyzer: assign Redis keys to each process; process decodes and analyzes

  45. The Analyzer: anomalous metrics written as JSON; setInterval() retrieves from front end
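
One way to picture the Analyzer's map/reduce split is shown below: each worker takes a contiguous slice of the metric keys (map), and the flagged keys are merged and written out as JSON (reduce). The function names, the 1-based worker index, and the toy `is_anomalous` check are assumptions made for illustration, and the workers run sequentially here rather than as separate processes.

```python
import json
from math import ceil


def assign_keys(keys, process_index, process_count):
    """Return the contiguous slice of keys owned by one worker (1-based index)."""
    per_process = ceil(len(keys) / process_count)
    start = (process_index - 1) * per_process
    return keys[start:start + per_process]


def analyze(keys, is_anomalous):
    """Map step for one worker: return the keys it flags as anomalous."""
    return [key for key in keys if is_anomalous(key)]


metrics = [f"metrics.{i}" for i in range(10)]
shards = [assign_keys(metrics, i, 4) for i in range(1, 5)]

# Reduce step: merge every worker's findings and serialize them as JSON.
anomalies = [key for shard in shards
             for key in analyze(shard, lambda k: k.endswith("7"))]
anomalies_json = json.dumps(anomalies)
```
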
  46. (no text on slide)
  47. What does it mean to be anomalous?

  48. Consensus model

  49. [yes] [yes] [no] [no] [yes] [yes] = anomaly!

  50. Helps correct model mismatches
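
A consensus vote is simple to sketch: run every algorithm, count the "yes" votes, and flag the metric only when enough of them agree. The threshold of 4 is an assumption, chosen so that the six-vote example from the slide qualifies.

```python
def consensus(votes, threshold):
    """Flag an anomaly when at least `threshold` algorithms vote yes."""
    return sum(1 for vote in votes if vote) >= threshold


# [yes] [yes] [no] [no] [yes] [yes]
votes = [True, True, False, False, True, True]
is_anomaly = consensus(votes, threshold=4)
```

Raising the threshold trades sensitivity for fewer false alarms, since a single algorithm whose model does not fit the metric can no longer trigger an alert on its own.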

  51. Implement everything you can get your hands on

  52. Basic algorithm: “A metric is anomalous if its latest datapoint is over three standard deviations above its moving average.”
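
The basic rule translates almost directly into code. This is a minimal sketch: the window length and the use of the population standard deviation are assumptions, not values taken from the talk.

```python
from statistics import mean, pstdev


def three_sigma(values, window=30):
    """True if the latest value is more than three standard deviations
    above the mean of the (up to) `window` values that precede it."""
    history = values[-(window + 1):-1]
    if len(history) < 2:
        return False
    return values[-1] > mean(history) + 3 * pstdev(history)


steady = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
spike_detected = three_sigma(steady + [40])
normal_detected = three_sigma(steady + [11])
```
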
  53. Histogram binning

  54. Take some data

  55. Find most recent datapoint (value is 40)

  56. Make a histogram

  57. Check which bin contains most recent data

  58. Check which bin contains most recent data: latest value is 40, tiny bin size, so...anomaly!
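
The histogram steps above can be sketched with equal-width bins. The bin count and the `max_count` cutoff for what counts as a "tiny" bin are assumptions picked for the example.

```python
def histogram_bin_anomaly(values, bins=15, max_count=2):
    """True if the latest value falls in a bin holding at most `max_count`
    datapoints (the latest value itself is counted)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return False  # every value is identical, nothing stands out
    width = (hi - lo) / bins

    def bin_of(value):
        # The maximum value would land one past the end, so clamp it.
        return min(int((value - lo) / width), bins - 1)

    counts = [0] * bins
    for value in values:
        counts[bin_of(value)] += 1
    return counts[bin_of(values[-1])] <= max_count


steady = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10] * 3
spike_detected = histogram_bin_anomaly(steady + [40])
normal_detected = histogram_bin_anomaly(steady + [10])
```
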
  59. Ordinary least squares

  60. Take some data

  61. Fit a regression line

  62. Find residuals

  63. Three sigma winner!
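
A sketch of the least squares check: fit a line through the series (using the datapoint index as x), compute the residuals, and flag the metric when the latest residual is more than three standard deviations out. Using the index as x and the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def least_squares_anomaly(values):
    """True if the latest residual from a fitted line exceeds three sigma."""
    n = len(values)
    if n < 3:
        return False
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, values)]
    sigma = pstdev(residuals)
    return sigma > 0 and abs(residuals[-1]) > 3 * sigma


# A steadily rising metric with a little alternating noise.
trend = [i + (0.5 if i % 2 == 0 else -0.5) for i in range(30)]
spike_detected = least_squares_anomaly(trend + [70.0])
normal_detected = least_squares_anomaly(trend + [30.5])
```

Because the line follows the trend, a metric that is merely growing is not flagged, which a plain moving average would struggle with.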

  64. Median absolute deviation

  65. Median absolute deviation (calculate residuals with respect to median instead of regression line)
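
A sketch of the median absolute deviation test: measure how far the latest value sits from the median, in units of the median deviation. The threshold of 6 is an assumption.

```python
from statistics import median


def mad_anomaly(values, threshold=6.0):
    """True if the latest value deviates from the median by more than
    `threshold` times the median absolute deviation."""
    med = median(values)
    deviations = [abs(value - med) for value in values]
    mad = median(deviations)
    if mad == 0:
        return False  # avoid dividing by zero on flat series
    return deviations[-1] / mad > threshold


steady = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
spike_detected = mad_anomaly(steady + [40])
normal_detected = mad_anomaly(steady + [11])
```

Medians are barely moved by a single outlier, so one spike in the history does not inflate the baseline the way it inflates a mean and standard deviation.
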
  66. Exponentially weighted moving average

  67. Instead of:

  68. Add a decay factor!
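
The formulas on these slides did not survive in the transcript, so the sketch below uses the textbook definition of an exponentially weighted moving average, which may differ from what was shown. The smoothing factor `alpha` is an assumption.

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average.

    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    Older datapoints decay geometrically instead of weighing equally.
    """
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


smoothed = ewma([10, 10, 10, 20], alpha=0.5)
```

A larger `alpha` reacts faster to recent datapoints, while a smaller one remembers more history.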

  69. These algorithms aren’t good enough.

  70. A robust set of algorithms is the current focus of this project.

  71. Q). How do you analyze a quarter million timeseries for correlations?
  72. OCULUS

  73. Image comparison is expensive and slow

  74. Use raw timeseries instead of raw graphs: “[[975, 1365528530], [643, 1365528540], [750, 1365528550], [992, 1365528560], [580, 1365528570], [586, 1365528580], [649, 1365528590], [548, 1365528600], [901, 1365528610], [633, 1365528620]]”

  75. Euclidean Distance

  76. Dynamic Time Warping (helps with phase shifts)
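
The difference between the two distance measures shows up on a phase-shifted pair of series. Below is a minimal sketch of the classic quadratic-time DTW recurrence next to Euclidean distance. The absolute difference as the point cost is an assumption, and this is not the Oculus plugin code.

```python
def euclidean(a, b):
    """Point-by-point distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw(a, b):
    """Dynamic time warping cost, filling an (n+1) x (m+1) table."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance a only, advance b only, advance both.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]


original = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 1, 2, 1, 0, 0, 0]  # same bump, one step earlier
euclidean_distance = euclidean(original, shifted)
dtw_distance = dtw(original, shifted)
```

Euclidean distance penalizes the shift, while DTW stretches the time axis until the bumps line up and reports no difference at all.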

  77. We’ve solved it!

  78. O(N²)

  79. O(N²) x 250k

  80. Too slow!

  81. No need to run DTW on all 250k.

  82. Discard obviously dissimilar metrics.

  83. Shape Description Alphabet: “975 643 643 750 992 992 992 580” becomes “sharpdecrement flat increment sharpincrement flat flat sharpdecrement”

  84. Shape Description Alphabet: “975 643 643 750 992 992 992 580” becomes “sharpdecrement flat increment sharpincrement flat flat sharpdecrement”, via “24 4 4 11 25 25 25 0 1” (normalization step)
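
A sketch of how such a fingerprint might be produced: normalize the series onto a small integer scale, then describe each step between neighbouring values with a word. The 0 to 25 scale, the rounding, and the `sharp` cutoff are assumptions, so the intermediate numbers may not match the slide exactly, but the resulting words do for this series.

```python
def normalize(values, levels=25):
    """Rescale values onto the integers 0..levels."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0] * len(values)
    return [round((v - lo) / (hi - lo) * levels) for v in values]


def fingerprint(values, sharp=10):
    """Describe each step between consecutive normalized values with a word."""
    norm = normalize(values)
    words = []
    for prev, cur in zip(norm, norm[1:]):
        delta = cur - prev
        if delta == 0:
            words.append("flat")
        elif delta > 0:
            words.append("sharpincrement" if delta >= sharp else "increment")
        else:
            words.append("sharpdecrement" if -delta >= sharp else "decrement")
    return " ".join(words)


series = [975, 643, 643, 750, 992, 992, 992, 580]
shape = fingerprint(series)
```

Because the result is plain text, a full-text search engine can index it and match similar shapes with ordinary phrase queries.
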
  85. (no text on slide)
  86. Search for shape description fingerprint in Elasticsearch

  87. Run DTW on results as final polish

  88. O(N²) on ~10k metrics

  89. Still too slow.

  90. Fast DTW: O(N), similar strategy (coarse, then refine)

  91. Elasticsearch Details: phrase search for first pass scores across shape description fingerprints

  92. Elasticsearch Details: phrase search for first pass scores across shape description fingerprints. Custom FastDTW and Euclidean distance plugins to score across the remaining filtered timeseries

  93. Elasticsearch Structure: { :id => "statsd.numStats", :fingerprint => "sdec inc sinc sdec", :values => "10 1 2 15 4" }

  94. First pass query (shape description fingerprint): :match => { :fingerprint => { :query => "sdec inc sinc sdec inc", :type => "phrase", :slop => 20 } }

  95. Refinement query (raw timeseries): { :custom_score => { :query => <first_pass_query>, :script => "oculus_dtw", :params => { :query_value => "10 20 20 10 30", :query_field => "values.untouched" } } }
  96. KALE: Skyline, Elasticsearch, Resque, Sinatra, Ganglia, Graphite, StatsD, Flask

  97. Populating Elasticsearch

  98. ES Index resque workers

  99. Too slow to update and search

  100. New Index Last Index Webapp

  101. Sinatra frontend: queries ES, renders results

  102. Happy monitoring. @abestanway github.com/etsy/skyline github.com/etsy/oculus