A Deep Dive into Monitoring with Skyline

A talk I gave for the NYC Data Engineering Meetup at eBay. Video: http://g33ktalk.com/etsy-a-deep-dive-into-monitoring-with-skyline/

Abe Stanway

July 23, 2013

Transcript

  1. A DEEP DIVE INTO Monitoring with Skyline
    Abe Stanway (@abestanway)
    Jon Cowie (@jonlives)

  3. We have a large stack.

  4. 41 shards
    24 api servers
    72 web servers
    42 Gearman boxes
    150 node Hadoop cluster
    15 memcached boxes
    60 search machines

  5. 41 shards
    24 api servers
    72 web servers
    42 Gearman boxes
    150 node Hadoop cluster
    15 memcached boxes
    60 search machines
    (plus a lot more for
    various services)

  6. Not to mention the app itself.

  7. We practice continuous
    deployment.

  8. de • ploy /diˈploi/
    Verb
    To release your code for the
    world to see, hopefully without
    breaking the Internet

  9. Everyone deploys.
    250+ committers.

  10. Hundreds of boxes hosting
    constantly evolving code...

  11. ...it’s a miracle we stay up, right?

  12. We optimize for quick recovery
    by anticipating problems...

  13. ...instead of fearing human error.

  14. Can’t fix what you
    don’t measure!
    - W. Edwards Deming

  15. Homemade: StatsD, Skyline, Oculus, Supergrep
    Not homemade: Graphite, Nagios, Ganglia

  16. Real time error logging

  17. “Not all things that
    break throw errors.”
    - Oscar Wilde

  18. StatsD

  19. StatsD::increment("foo.bar")
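
Behind a call like this, a StatsD client typically just fires a small UDP datagram at the daemon. A minimal Python sketch, assuming the usual StatsD counter line format `<bucket>:<value>|c`; the helper names, the host, and port 8125 (the conventional default) are illustrative and not taken from the talk:

```python
import socket


def counter_payload(bucket: str, delta: int = 1) -> bytes:
    """Build a StatsD counter line such as b"foo.bar:1|c"."""
    return f"{bucket}:{delta}|c".encode("ascii")


def increment(bucket: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget send: UDP gives no reply, so the caller never blocks on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(counter_payload(bucket), (host, port))
```

Calling `increment("foo.bar")` would send the single line `foo.bar:1|c`; because nothing is read back, instrumenting code this way stays cheap.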

  20. If it moves,
    graph it!

  21. If it moves,
    graph it!
    we would graph them ➞

  22. If it doesn’t move,
    graph it anyway
    (it might make a run for it)

  23. DASHBOARDS!

  24. [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 60]
    [1358731200, 20]
    [1358731200, 20]

  25. DASHBOARDS! x 250,000

  27. lol nagios

  28. Unknown anomalies

  29. Kale.

  30. Kale:
    - leaves
    - green stuff

  31. Kale:
    - leaves
    - green stuff
    OCULUS
    SKYLINE

  32. Q). How do you analyze a
    timeseries for anomalies
    in real time?

  33. A). Lots of HTTP requests to
    Graphite’s API!

  34. Q). How do you analyze a
    quarter million timeseries for
    anomalies in real time?

  35. SKYLINE

  36. SKYLINE

  37. A real time
    anomaly detection
    system

  38. Real time?

  39. Kinda.

  40. StatsD
    Ten second resolution

  41. Ganglia
    One minute resolution

  42. Best case:
    ~ 10s
    ~ 1min

  43. Takes about 70 seconds
    with our throughput.

  44. Still faster than you would have
    discovered it otherwise.

  45. Memory > Disk

  47. Q). How do you get a
    quarter million timeseries
    into Redis on time?

  48. STREAM THAT SHIT!

  49. Graphite’s relay agent
    original graphite
    backup graphite

  50. Graphite’s relay agent
    original graphite
    backup graphite
    pickles:
    [statsd.numStats, [1365603412, 73421]]
    [statsd.numStats, [1365603422, 82345]]
    [statsd.numStats, [1365603432, 80611]]

  51. Graphite’s relay agent
    original graphite
    skyline
    pickles:
    [statsd.numStats, [1365603412, 73421]]
    [statsd.numStats, [1365603422, 82345]]
    [statsd.numStats, [1365603432, 80611]]
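
The pickles in the diagram are batches of datapoints. A sketch of how such a batch might be framed and unframed, assuming Graphite's documented pickle protocol (a pickled list of `(metric, (timestamp, value))` tuples prefixed with a 4-byte big-endian length). The function names are hypothetical, and a real listener should not unpickle data from untrusted senders:

```python
import pickle
import struct


def frame_batch(datapoints):
    """Pickle a list of (metric, (timestamp, value)) tuples and length-prefix it."""
    payload = pickle.dumps(datapoints, protocol=2)
    return struct.pack("!L", len(payload)) + payload


def unframe_batch(message):
    """Inverse of frame_batch for one complete message already held in memory."""
    (size,) = struct.unpack("!L", message[:4])
    return pickle.loads(message[4:4 + size])


# The three datapoints shown on the slide.
batch = [
    ("statsd.numStats", (1365603412, 73421)),
    ("statsd.numStats", (1365603422, 82345)),
    ("statsd.numStats", (1365603432, 80611)),
]
message = frame_batch(batch)
```

Pointing the relay at a second destination that speaks this framing is what lets the same stream reach another consumer without extra API requests.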

  52. We import from Ganglia too.

  53. Storing timeseries

  54. Minimize I/O
    Minimize memory

  55. redis.append()
    - Strings
    - Constant time
    - One operation per update

  56. JSON?

  57. “[1358711400, 51],”
    => get statsD.numStats
    ----------------------------

  58. => get statsD.numStats
    ----------------------------
    "[1358711400, 51],
    [1358711410, 23],"

  59. => get statsD.numStats
    ----------------------------
    "[1358711400, 51],
    [1358711410, 23],
    [1358711420, 45],"
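
The append pattern on these slides can be sketched without a Redis server. `FakeStore` below is a hypothetical in-memory stand-in that mimics only the `GET` and `APPEND` string semantics; it is not the redis-py client, and the way the comma-terminated fragments are parsed back is an assumption:

```python
import json


class FakeStore:
    """In-memory stand-in for a Redis string keyspace (illustration only)."""

    def __init__(self):
        self._data = {}

    def append(self, key, fragment):
        # Like Redis APPEND: creates the key if missing, returns the new length.
        self._data[key] = self._data.get(key, "") + fragment
        return len(self._data[key])

    def get(self, key):
        return self._data.get(key)


def record(store, metric, timestamp, value):
    """One append per update; every fragment ends with a comma."""
    store.append(metric, json.dumps([timestamp, value]) + ",")


def load(store, metric):
    """Drop the final comma, wrap the fragments in brackets, then parse."""
    raw = store.get(metric) or ""
    return json.loads("[" + raw.rstrip(",") + "]")


store = FakeStore()
record(store, "statsD.numStats", 1358711400, 51)
record(store, "statsD.numStats", 1358711410, 23)
record(store, "statsD.numStats", 1358711420, 45)
```

Writes never read the existing value, which is the point of the design; the cost moves to `load`, where the whole string has to be decoded.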

  60. OVER HALF
    CPU time spent
    decoding JSON

  61. [1,2]

  62. [ 1 , 2 ]
    Stuff we care about: 1 and 2
    Extra bullshit: the brackets, comma, and spaces

  63. MESSAGEPACK

  64. MESSAGEPACK
    A binary-based
    serialization protocol

  65. \x92\x01\x02
    Array size: \x92
    (one header byte for up to 15 elements; a 16 or 32 bit big
    endian integer for larger arrays)
    Things we care about: \x01\x02

  66. \x92\x01\x02
    Array size: \x92
    (one header byte for up to 15 elements; a 16 or 32 bit big
    endian integer for larger arrays)
    Things we care about: \x01\x02
    \x92\x02\x03
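
MessagePack is normally used through a library. The hand-rolled encoder below exists only to keep the example dependency-free, and it covers non-negative integers only. It follows the MessagePack specification, under which a two-element array starts with the single header byte `\x92` (`0x90` plus the length); the function names are hypothetical:

```python
import struct


def pack_uint(n):
    """Encode a non-negative integer in MessagePack's smallest unsigned form."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    if n <= 0x7F:
        return bytes([n])                      # positive fixint: the value itself
    if n <= 0xFF:
        return b"\xcc" + struct.pack(">B", n)  # uint 8
    if n <= 0xFFFF:
        return b"\xcd" + struct.pack(">H", n)  # uint 16
    if n <= 0xFFFFFFFF:
        return b"\xce" + struct.pack(">I", n)  # uint 32
    return b"\xcf" + struct.pack(">Q", n)      # uint 64


def pack_datapoint(timestamp, value):
    """Encode [timestamp, value] as a fixarray: header byte 0x90 | length."""
    return bytes([0x90 | 2]) + pack_uint(timestamp) + pack_uint(value)


# Packed datapoints can be appended back to back with no separators.
stream = pack_datapoint(1, 2) + pack_datapoint(2, 3)
json_size = len("[1358711400, 51],")
packed_size = len(pack_datapoint(1358711400, 51))
```

For the datapoint from the earlier JSON slides, the packed form is 7 bytes against 17 for the JSON fragment, and there is no text to tokenize when reading it back.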

  67. CUT IN HALF
    Run Time + Memory Used

  68. ROOMBA.PY
    CLEANS THE DATA

  69. “Wait...you wrote this in Python?”

  70. Great statistics libraries
    Not fun for parallelism

  71. The Analyzer
    Simple map/reduce design

  72. The Analyzer
    Assign Redis keys to each process
    Process decodes and analyzes
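
One way the key assignment could look, as a sketch rather than Skyline's actual code: split the metric names into near-equal slices, map each slice onto a worker process, and reduce the results into one list. All names are hypothetical, and `analyze_slice` is a placeholder that flags names by suffix instead of fetching and analyzing real series:

```python
from multiprocessing import Pool


def assign_keys(keys, process_count):
    """Split keys into process_count contiguous, near-equal slices."""
    per_process, extra = divmod(len(keys), process_count)
    slices, start = [], 0
    for i in range(process_count):
        # The first `extra` slices take one additional key each.
        end = start + per_process + (1 if i < extra else 0)
        slices.append(keys[start:end])
        start = end
    return slices


def analyze_slice(metric_names):
    """Placeholder worker. A real one would fetch, decode, and test each series."""
    return [name for name in metric_names if name.endswith(".errors")]


def run_analysis(keys, process_count):
    """Map slices onto worker processes, then reduce to one flat list."""
    with Pool(process_count) as pool:
        per_process = pool.map(analyze_slice, assign_keys(keys, process_count))
    return [name for names in per_process for name in names]
```

`run_analysis` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Separate processes sidestep the parallelism limits the earlier slide complains about.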

  73. The Analyzer
    Anomalous metrics written as JSON
    Front end retrieves them with setInterval()

  75. What does it mean
    to be anomalous?

  76. Consensus model

  77. [yes] [yes] [no] [no] [yes] [yes]
    =
    anomaly!
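
A consensus vote can be sketched in a few lines. The talk does not say how many votes are required, so the `consensus` threshold below is an assumption, as are the function name and the stand-in detectors:

```python
def consensus_vote(series, algorithms, consensus):
    """Flag the series when at least `consensus` algorithms vote yes."""
    votes = [bool(algorithm(series)) for algorithm in algorithms]
    return sum(votes) >= consensus


# Stand-in detectors that ignore the data and return a fixed vote,
# arranged like the six votes on the slide.
def yes(series):
    return True


def no(series):
    return False


votes_from_slide = [yes, yes, no, no, yes, yes]
```

With four of six detectors agreeing, `consensus_vote(series, votes_from_slide, 4)` flags an anomaly, while a stricter threshold of five would not.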

  78. Helps correct
    model mismatches

  79. Implement everything you
    can get your hands on

  80. Basic algorithm:
    “A metric is anomalous if its
    latest datapoint is over three
    standard deviations above
    its moving average.”
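
A sketch of that rule using only the standard library. The window length, and the choice to leave the latest point out of its own baseline, are assumptions made here and are not stated in the talk:

```python
from statistics import mean, pstdev


def three_sigma(values, window=30):
    """True if the latest value is over 3 stddevs above the mean of the prior window."""
    baseline = values[:-1][-window:]
    if len(baseline) < 2:
        return False  # not enough history to estimate a spread
    return values[-1] > mean(baseline) + 3 * pstdev(baseline)


history = [20, 21, 19, 20, 22, 18, 20, 21, 19, 20]
```

Here `three_sigma(history + [60])` is true and `three_sigma(history + [22])` is false: the baseline mean is 20 with a standard deviation of about 1.1, so the cutoff sits near 23.3.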

  81. ...(aka, the basic tenet of SPC)
    http://en.wikipedia.org/wiki/Statistical_process_control


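  82. [Figure: normal distribution centered on the mean, with 34.1%, 13.6%,
    and 2.1% of values in the first, second, and third standard deviation
    on each side]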

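  83. [Figure: normal distribution centered on the mean, with 34.1%, 13.6%,
    and 2.1% of values in the first, second, and third standard deviation
    on each side, annotated:]
    if your datapoint is in here, it’s an anomaly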

  84. Histogram binning

  85. Take some data

  86. Find most recent datapoint
    value is 40

  87. Make a histogram

  88. Check which bin contains most recent data

  89. Check which bin contains most recent data
    latest value is 40, tiny
    bin size, so...anomaly!
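
The histogram steps above can be sketched as follows. The number of bins and the count below which a bin is considered tiny are assumptions chosen for the example, not values from the talk:

```python
def in_sparse_bin(values, bins=10, max_count=1):
    """True if the latest value falls in a histogram bin holding <= max_count points."""
    low, high = min(values), max(values)
    if high == low:
        return False  # flat series: every point shares a single bin
    width = (high - low) / bins

    def bin_of(v):
        # The maximum value would land one past the end, so clamp it into the last bin.
        return min(int((v - low) / width), bins - 1)

    counts = [0] * bins
    for v in values:
        counts[bin_of(v)] += 1
    return counts[bin_of(values[-1])] <= max_count


spiking = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 40]
recovered = [10, 11, 12, 11, 10, 12, 11, 10, 12, 40, 11]
```

In `spiking` the latest value, 40, sits alone in the top bin and is flagged; in `recovered` the latest value, 11, shares the bottom bin with eight other points and is not.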

  90. Ordinary least squares

  91. Take some data

  92. Fit a regression line

  93. Find residuals

  94. Three sigma
    winner!
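
The regression steps can be sketched with the standard library alone. This version fits the line to everything except the latest point and then applies three sigma to that point's residual; fitting on the history only is a choice made here, not something the slides specify, and the function names are hypothetical:

```python
from statistics import mean, pstdev


def fit_line(ys):
    """Least-squares slope and intercept for ys against x = 0, 1, 2, ..."""
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residual_three_sigma(ys, sigmas=3.0):
    """True if the latest residual exceeds `sigmas` stddevs of the earlier residuals."""
    history = ys[:-1]
    if len(history) < 3:
        return False  # too few points for a meaningful fit
    slope, intercept = fit_line(history)
    residuals = [y - (slope * x + intercept) for x, y in enumerate(history)]
    latest = ys[-1] - (slope * len(history) + intercept)
    return abs(latest) > sigmas * pstdev(residuals)


# A rising trend with alternating noise of one unit around it.
noisy_trend = [x + (1 if x % 2 else -1) for x in range(20)]
```

A steadily rising metric trips a plain moving-average test sooner or later; measuring against the fitted line means `noisy_trend + [20]` passes while `noisy_trend + [40]` is flagged.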

  95. Median absolute deviation

  96. Median absolute deviation
    (calculate residuals with respect to median instead of regression line)
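
A sketch of a median-based test. The cutoff of six deviations is an assumption for illustration, as is returning false when the median absolute deviation is zero (the ratio is undefined in that case):

```python
from statistics import median


def mad_anomaly(values, threshold=6.0):
    """True if the latest value sits more than `threshold` MADs from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return False  # ratio undefined; this sketch chooses not to flag
    return abs(values[-1] - center) / mad > threshold


steady = [10, 12, 11, 13, 12, 11, 10, 12, 11]
```

Because medians barely move when one extreme value is added, `mad_anomaly(steady + [50])` is true while `mad_anomaly(steady + [12])` is false, and a spike does not inflate the baseline it is measured against.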

  97. Exponentially weighted
    moving average

  98. Instead of:

  99. Add a decay factor!

  100. Adding decay discounts older values.
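
The formulas from these slides did not survive in the transcript, so the following is the textbook recurrence rather than a copy of what was shown: each smoothed value is `alpha` times the new datapoint plus `1 - alpha` times the previous smoothed value. The function name and the seeding with the first datapoint are choices made for this sketch:

```python
def ewma(values, alpha):
    """Exponentially weighted moving average, seeded with the first datapoint."""
    if not values:
        return []
    smoothed = [values[0]]
    for x in values[1:]:
        # A larger alpha favors recent data; a smaller alpha remembers more history.
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed
```

For example, `ewma([10, 20, 30], 0.5)` gives `[10, 15.0, 22.5]`; a datapoint's influence halves with every step that follows it.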

  101. Four horsemen of the modelpocalypse

  102. 1. Seasonality
    2. Spike influence
    3. Normality
    4. Parameters

  103. Anomaly?

  104. Nope.

  105. Spikes artificially raise the moving average
    Anomaly detected (yay!)
    Anomaly missed :(
    Bigger moving average

  106. Real world data doesn’t
    necessarily follow a perfect
    normal distribution.

  107. !=

  108. Simple systems, simple
    definitions of “anomalous”

  109. Complex systems, complex
    definitions of “anomalous”

  110. Not to mention that
    complex systems evolve

  111. How to avoid false positives
    upon the evolution of the
    measured processes?

  112. Ionno.

  113. Parameters!

  114. Parameters are cool!
    Predicted page views

  115. Cool model bro.
    (it’s a simplified Holt-Winters)

  116. What are the parameters?

  117. Seasonality: 365 day
    Overall trend weight: .68
    Seasonal regression weight: .32
    EWMA smoothing factor: .1

  118. Must train before discovering
    lowest error for parameters

  119. Mad expensive, yo.
    these people do not represent our CPUs

  120. No good anomalies
    without good models.

  121. A robust set of algorithms is the
    current focus of this project.

  122. Thanks!
    @abestanway
    github.com/etsy/skyline
