Berlin 2013 - Kale Workshop - Abe Stanway

Monitorama
September 20, 2013

Transcript

  1. WELCOME TO
    BROOKLYN:
    A WORKSHOP
    ON KALE
    Abe Stanway
    @abestanway

  2. Disclaimer: still in beta

  3. Kale is composed of two sister
    services: Skyline and Oculus

  4. SKYLINE

  5. SKYLINE

  6. Q). How do you analyze a
    timeseries for anomalies
    in real time?

  7. A). Lots of HTTP requests to
    Graphite’s API!

  8. Q). How do you analyze a
    quarter million timeseries for
    anomalies in real time?

  9. Skyline!

  10. Real time?

  11. Kinda.

  12. StatsD
    Ten second resolution

  13. Ganglia
    One minute resolution

  14. Best case:
    ~ 10s
    ~ 1min

  15. Takes about 70 seconds
    with our throughput.

  16. Still faster than you would have
    discovered it otherwise.

  17. Memory > Disk

  18. (no text on this slide)

  19. Q). How do you get a
    quarter million timeseries
    into Redis on time?

  20. STREAM THAT SHIT!

  21. Graphite’s relay agent
    original graphite
    backup graphite

  22. Graphite’s relay agent
    original graphite
    backup graphite
    pickles:
    [statsd.numStats, [1365603422, 82345]]
    [statsd.numStats, [1365603432, 80611]]
    [statsd.numStats, [1365603412, 73421]]

  23. Graphite’s relay agent
    original graphite
    skyline
    pickles:
    [statsd.numStats, [1365603422, 82345]]
    [statsd.numStats, [1365603432, 80611]]
    [statsd.numStats, [1365603412, 73421]]
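
The pickles on these slides can be sketched in a few lines. Graphite's pickle protocol frames each batch as a 4-byte big-endian length header followed by a pickled list of `(metric, (timestamp, value))` tuples. The function names below are hypothetical, and this is only a rough illustration of the wire format, not Skyline's actual listener.

```python
import pickle
import struct


def encode_pickle_frame(datapoints):
    """Frame one batch: 4-byte big-endian length header, then the pickled list."""
    payload = pickle.dumps(datapoints, protocol=2)
    return struct.pack("!L", len(payload)) + payload


def decode_pickle_frames(stream):
    """Yield (metric, (timestamp, value)) tuples from a byte string of frames."""
    offset = 0
    while offset + 4 <= len(stream):
        (length,) = struct.unpack_from("!L", stream, offset)
        offset += 4
        # pickle.loads is unsafe on untrusted input; acceptable for a local sketch.
        batch = pickle.loads(stream[offset:offset + length])
        offset += length
        for metric, datapoint in batch:
            yield metric, datapoint


batch = [
    ("statsd.numStats", (1365603412, 73421)),
    ("statsd.numStats", (1365603422, 82345)),
    ("statsd.numStats", (1365603432, 80611)),
]
# Two frames back to back, as a relay would stream them.
frames = encode_pickle_frame(batch) + encode_pickle_frame(batch)
decoded = list(decode_pickle_frames(frames))
```

A real listener would read the same frames from a TCP socket instead of a byte string.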

  24. We import from Ganglia too.

  25. Storing timeseries

  26. Minimize I/O
    Minimize memory

  27. redis.append()
    - Strings
    - Constant time
    - One operation per update

  28. JSON?

  29. => get statsD.numStats
    ----------------------------
    “[1358711400, 51],”

  30. => get statsD.numStats
    ----------------------------
    “[1358711400, 51],
    [1358711410, 23],”

  31. => get statsD.numStats
    ----------------------------
    “[1358711400, 51],
    [1358711410, 23],
    [1358711420, 45],”
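
A minimal sketch of the append-only JSON layout shown here, assuming a plain dict as a hypothetical stand-in for Redis string keys. Each update appends one `[timestamp, value],` fragment. A reader drops the trailing comma and wraps the string in brackets before decoding.

```python
import json

store = {}  # hypothetical in-memory stand-in for Redis string keys


def append(key, timestamp, value):
    """Append one JSON fragment per datapoint, mirroring a string append."""
    store[key] = store.get(key, "") + json.dumps([timestamp, value]) + ","


def read_series(key):
    """Turn the concatenated fragments into a list of [timestamp, value] pairs."""
    raw = store.get(key, "")
    return json.loads("[" + raw.rstrip(",") + "]")


append("statsD.numStats", 1358711400, 51)
append("statsD.numStats", 1358711410, 23)
append("statsD.numStats", 1358711420, 45)
series = read_series("statsD.numStats")
```

Writes stay cheap, but every read has to parse the whole string again.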

  32. OVER HALF
    CPU time spent
    decoding JSON

  33. [1,2]

  34. [ 1 , 2 ]
    Stuff we care about
    Extra bullshit

  35. MESSAGEPACK

  36. MESSAGEPACK
    A binary-based
    serialization protocol

  37. \x93\x01\x02
    Array size
    (16 or 32 bit big
    endian integer)
    Things we care about

  38. \x93\x01\x02
    Array size
    (16 or 32 bit big
    endian integer)
    Things we care about
    \x93\x02\x03
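
To make the byte layout concrete, here is a hand-rolled sketch of the small MessagePack subset a `[timestamp, value]` pair needs, written against the MessagePack format rather than any particular library. In that format a two-element array gets the single header byte `0x92`, with the element count in the low four bits. The 16-bit and 32-bit size fields only apply to arrays of 16 or more elements, so the sketch emits `\x92` rather than the `\x93` shown on the slide. All function names are hypothetical. For brevity the encoder always uses `uint 32` for integers of 128 and above, where a real encoder would pick the smallest type.

```python
import struct


def pack_uint(n):
    """Encode a non-negative integer below 2**32 as MessagePack."""
    if 0 <= n < 0x80:
        return bytes([n])                      # positive fixint: the byte is the value
    if 0x80 <= n <= 0xFFFFFFFF:
        return b"\xce" + struct.pack(">I", n)  # uint 32: tag + 4 big-endian bytes
    raise ValueError("out of range for this sketch")


def pack_datapoint(timestamp, value):
    """Encode [timestamp, value] as a two-element fixarray (header byte 0x92)."""
    return b"\x92" + pack_uint(timestamp) + pack_uint(value)


def unpack_uint(buf, pos):
    """Decode one integer at pos; return (value, next position)."""
    tag = buf[pos]
    if tag < 0x80:
        return tag, pos + 1
    if tag == 0xCE:
        return struct.unpack_from(">I", buf, pos + 1)[0], pos + 5
    raise ValueError("unsupported type tag")


def unpack_stream(buf):
    """Decode back-to-back datapoints, as produced by repeated appends."""
    pos, series = 0, []
    while pos < len(buf):
        if buf[pos] != 0x92:
            raise ValueError("expected a two-element fixarray")
        timestamp, pos = unpack_uint(buf, pos + 1)
        value, pos = unpack_uint(buf, pos)
        series.append([timestamp, value])
    return series


packed = pack_datapoint(1358711400, 51) + pack_datapoint(1358711410, 23)
```

Here `[1358711400, 51]` takes 7 bytes, against 17 characters for the JSON fragment `[1358711400, 51],`. Because each encoded pair is self-delimiting, appended pairs need no separator.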

  39. CUT IN HALF
    Run Time + Memory Used

  40. ROOMBA.PY
    CLEANS THE DATA

  41. “Wait...you wrote this in Python?”

  42. Great statistics libraries
    Not fun for parallelism

  43. The Analyzer
    Simple map/reduce design

  44. The Analyzer
    Assign Redis keys to each process
    Process decodes and analyzes
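
One way to picture the key assignment step is as contiguous slices of the key list, one slice per worker. This is a sketch under that assumption only. The function name, metric names, and slicing scheme are hypothetical, and the talk does not say how Skyline actually divides the keys.

```python
from math import ceil


def assign_keys(keys, process_count):
    """Split keys into process_count contiguous slices (trailing ones may be empty)."""
    per_process = ceil(len(keys) / process_count)
    return [
        keys[i * per_process:(i + 1) * per_process]
        for i in range(process_count)
    ]


# Hypothetical metric names, used only to show the split.
keys = ["metrics.a", "metrics.b", "metrics.c", "metrics.d",
        "metrics.e", "metrics.f", "metrics.g"]
assignments = assign_keys(keys, 3)
```

Each process would then decode and analyze only the series in its own slice, which keeps the workers independent.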

  45. The Analyzer
    Anomalous metrics written as JSON
    Front end retrieves them with setInterval()

  46. (no text on this slide)

  47. What does it mean
    to be anomalous?

  48. Consensus model

  49. [yes] [yes] [no] [no] [yes] [yes]
    =
    anomaly!
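
The voting on this slide can be sketched as a simple count. The `consensus` threshold is a hypothetical parameter. The slide only shows that four "yes" votes out of six were enough.

```python
def is_anomalous(votes, consensus):
    """Flag the metric when at least `consensus` algorithms voted yes."""
    return sum(1 for vote in votes if vote) >= consensus


# The six votes from the slide: yes, yes, no, no, yes, yes.
votes = [True, True, False, False, True, True]
flagged = is_anomalous(votes, consensus=4)
```

Raising the threshold trades missed anomalies for fewer false alarms.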

  50. Helps correct
    model mismatches

  51. Implement everything you
    can get your hands on

  52. Basic algorithm:
    “A metric is anomalous if its
    latest datapoint is over three
    standard deviations above
    its moving average.”
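
A minimal sketch of that rule, assuming the moving average and standard deviation are taken over a window of points preceding the latest one. The window length and function name are hypothetical choices, not values from the talk.

```python
from statistics import mean, stdev


def three_sigma_above_average(values, window=30):
    """True if the latest point exceeds the window mean by over three sigma."""
    history = values[-(window + 1):-1]  # up to `window` points before the latest
    if len(history) < 2:
        return False  # not enough data to estimate a deviation
    return values[-1] > mean(history) + 3 * stdev(history)


baseline = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
spike = three_sigma_above_average(baseline + [20])
normal = three_sigma_above_average(baseline + [11])
```

The latest point is left out of the window so that a large spike does not inflate the deviation it is being compared against.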

  53. Histogram binning

  54. Take some data

  55. Find most recent datapoint
    value is 40

  56. Make a histogram

  57. Check which bin contains most recent data

  58. Check which bin contains most recent data
    latest value is 40, tiny
    bin size, so...anomaly!
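
The histogram steps above can be sketched as follows. The bin count and the `max_count` cutoff for a "tiny" bin are hypothetical parameters, since the slides do not give the values Skyline uses.

```python
def histogram_bin_anomaly(values, bins=15, max_count=2):
    """True if the latest value falls into a sparsely populated histogram bin."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return False  # a flat series has a single bin
    width = (hi - lo) / bins

    def bin_index(value):
        # Clamp so the maximum value lands in the last bin.
        return min(int((value - lo) / width), bins - 1)

    counts = [0] * bins
    for value in values:
        counts[bin_index(value)] += 1
    return counts[bin_index(values[-1])] <= max_count


usual = [10, 11, 12, 10, 11, 12, 10, 11, 12, 11, 10, 12, 11]
spike = histogram_bin_anomaly(usual + [40])
normal = histogram_bin_anomaly(usual + [11])
```

With the latest value at 40, its bin holds only that one point, so the series is flagged.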

  59. Ordinary least squares

  60. Take some data

  61. Fit a regression line

  62. Find residuals

  63. Three sigma
    winner!
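
The three steps (fit a line, find residuals, apply three sigma) can be sketched in plain Python. This assumes evenly spaced points indexed 0, 1, 2, ... and a test on the latest residual. The function name and the minimum series length are hypothetical.

```python
from statistics import mean, pstdev


def least_squares_anomaly(values):
    """True if the latest residual from a fitted line exceeds three sigma."""
    n = len(values)
    if n < 3:
        return False
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, values)]
    sigma = pstdev(residuals)
    return sigma > 0 and abs(residuals[-1]) > 3 * sigma


# A steady upward trend with small periodic noise.
trend = [2 * i + (i % 3) for i in range(60)]
spike = least_squares_anomaly(trend + [220])
normal = least_squares_anomaly(trend + [121])
```

Fitting a line first means a steadily growing metric is not flagged just because its latest value is the highest so far.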

  64. Median absolute deviation

  65. Median absolute deviation
    (calculate residuals with respect to median instead of regression line)
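
A sketch of the median-based variant: take each point's absolute deviation from the series median, then compare the latest deviation with the median of all deviations. The `threshold` multiplier is a hypothetical parameter, not a value stated in the talk.

```python
from statistics import median


def median_absolute_deviation_anomaly(values, threshold=6):
    """True if the latest deviation from the median exceeds threshold * MAD."""
    center = median(values)
    deviations = [abs(value - center) for value in values]
    mad = median(deviations)
    if mad == 0:
        return False  # over half the points equal the median; ratio undefined
    return deviations[-1] / mad > threshold


baseline = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
spike = median_absolute_deviation_anomaly(baseline + [40])
normal = median_absolute_deviation_anomaly(baseline + [11])
```

Medians barely move when a single extreme value arrives, so this check is less distorted by the outlier itself than a mean-based one.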

  66. Exponentially weighted
    moving average

  67. Instead of:

  68. Add a decay factor!
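
A minimal sketch of the decay idea, assuming the common recursive form in which each smoothed value is `alpha * x + (1 - alpha) * previous`. The `alpha` smoothing factor and its default are hypothetical, not values from the talk.

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average; higher alpha forgets faster."""
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


averages = ewma([10, 20, 20], alpha=0.5)
```

Unlike a plain average, where every point counts equally, recent points dominate here and older ones fade geometrically.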

  69. These algorithms
    aren’t good enough.

  70. A robust set of algorithms is the
    current focus of this project.

  71. Q). How do you analyze a
    quarter million timeseries
    for correlations?

  72. OCULUS

  73. Image comparison is expensive and slow

  74. “[[975, 1365528530],
    [643, 1365528540],
    [750, 1365528550],
    [992, 1365528560],
    [580, 1365528570],
    [586, 1365528580],
    [649, 1365528590],
    [548, 1365528600],
    [901, 1365528610],
    [633, 1365528620]]”
    Use raw timeseries instead of raw graphs

  75. Euclidean Distance

  76. Dynamic Time Warping
    (helps with phase shifts)
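
To illustrate the phase-shift point, here is a sketch of both distances using the textbook dynamic-programming form of DTW. It is not Oculus's implementation, and the function names are hypothetical. The nested loop over both series is where the quadratic cost comes from.

```python
import math


def euclidean_distance(a, b):
    """Point-by-point distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Classic dynamic time warping cost, O(len(a) * len(b))."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


bump = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 1, 2, 1, 0, 0, 0]  # the same bump, one step earlier
straight = euclidean_distance(bump, shifted)
warped = dtw_distance(bump, shifted)
```

The two series have the same shape, so the DTW cost is zero while the Euclidean distance is not.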

  77. We’ve solved it!

  78. O(N^2)

  79. O(N^2) x 250k

  80. Too slow!

  81. No need to run DTW on all 250k.

  82. Discard obviously dissimilar metrics.

  83. “975 643 643 750 992 992 992 580”
    Shape Description Alphabet
    “sharpdecrement flat increment
    sharpincrement flat flat
    sharpdecrement”

  84. “975 643 643 750 992 992 992 580”
    (normalization step)
    “24 4 4 11 25 25 25 0 1”
    Shape Description Alphabet
    “sharpdecrement flat increment
    sharpincrement flat flat
    sharpdecrement”
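
A rough sketch of how such a fingerprint could be produced. The scaling to 0..25 with rounding up and the cutoff of 10 for a "sharp" change are assumptions reverse-engineered from the numbers on the slide, not Oculus's documented rules, and the function names are hypothetical.

```python
def normalize(values, scale=25):
    """Rescale integers to 0..scale, rounding up (assumed from the slide)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0] * len(values)
    # -(-x // y) is integer ceiling division.
    return [-(-(value - lo) * scale // span) for value in values]


def fingerprint(values, sharp=10):
    """Describe each step between neighbouring normalized values with a word."""
    normalized = normalize(values)
    words = []
    for previous, current in zip(normalized, normalized[1:]):
        delta = current - previous
        if delta == 0:
            words.append("flat")
        else:
            direction = "increment" if delta > 0 else "decrement"
            prefix = "sharp" if abs(delta) >= sharp else ""
            words.append(prefix + direction)
    return " ".join(words)


raw = [975, 643, 643, 750, 992, 992, 992, 580]
scaled = normalize(raw)
shape = fingerprint(raw)
```

Under these assumptions the eight raw values reproduce the first eight normalized values on the slide and the same seven-word description. Series with similar shapes then share long runs of words, which is what makes a text search engine usable as a first filter.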

  85. (no text on this slide)

  86. Search for shape description
    fingerprint in Elasticsearch

  87. Run DTW on results
    as final polish

  88. O(N^2) on ~10k metrics

  89. Still too slow.

  90. Fast DTW - O(N)
    similar strategy -
    coarse, then refine

  91. Elasticsearch Details
    Phrase search for first
    pass scores across shape
    description fingerprints

  92. Elasticsearch Details
    Phrase search for first pass scores
    across shape description fingerprints
    Custom FastDTW and Euclidean
    distance plugins to score across the
    remaining filtered timeseries

  93. Elasticsearch Structure
    {
      :id => "statsd.numStats",
      :fingerprint => "sdec inc sinc sdec",
      :values => "10 1 2 15 4"
    }

  94. First pass query
    (shape description fingerprint)
    :match => {
      :fingerprint => {
        :query => "sdec inc sinc sdec inc",
        :type => "phrase",
        :slop => 20
      }
    }

  95. Refinement query
    (raw timeseries)
    {
      :custom_score => {
        :query => <first_pass_query>,
        :script => "oculus_dtw",
        :params => {
          :query_value => "10 20 20 10 30",
          :query_field => "values.untouched"
        }
      }
    }

  96. Skyline
    Elasticsearch
    Resque
    Sinatra
    Ganglia
    Graphite
    StatsD
    KALE
    Flask

  97. Populating Elasticsearch

  98. resque workers
    ES Index

  99. Too slow to
    update and search

  100. New Index
    Last Index
    Webapp

  101. Sinatra frontend
    Queries ES
    Renders results

  102. Happy monitoring.
    @abestanway
    github.com/etsy/skyline
    github.com/etsy/oculus
