Bring the Noise: Continuously Deploying Under a Hailstorm of Metrics

This talk was given at Velocity '13 in Santa Clara, and an abbreviated version was given at BACON '13 in London. It offers an overview of Etsy's Kale stack.

BACON Video: devslovebacon.com/conferences/bacon-2013/talks/bring-the-noise-continuously-deploying-under-a-hailstorm-of-metrics

Velocity video forthcoming.

Abe Stanway

June 18, 2013

Transcript

  1. BRING THE NOISE!
    MAKING SENSE OF
    A HAILSTORM
    OF METRICS
    Abe Stanway
    @abestanway
    Jon Cowie
    @jonlives

    View Slide

  2. Ninety minutes is a
    long time.
    This talk:
    - motivations ~10
    - skyline ~25
    - oculus ~30
    - demo! ~10
    - questions ~15

    View Slide

  3. Ninety minutes is a
    long time.
    But we have some
    sweet stuff to show
    you.

    View Slide

  4. Background and Motivations

    View Slide

  5. View Slide

  6. 1.5 billion page views
    $117 million of goods sold
    950 thousand users

    View Slide

  7. 1.5 billion page views
    $117 million of goods sold
    950 thousand users
    (in december ‘12)

    View Slide

  8. We practice continuous
    deployment.

    View Slide

  9. de • ploy /diˈploi/
    Verb
    To release your code for the
    world to see, hopefully without
    breaking the Internet

    View Slide

  10. Everyone deploys.
    250+ committers.

    View Slide

  11. Day one:
    DEPLOY

    View Slide

  12. View Slide

  13. 30+ DEPLOYS A DAY
    (~8 commits per deploy!)

    View Slide

  14. “30 deploys a day? Is that safe?”

    View Slide

  15. We optimize for quick recovery
    by anticipating problems...

    View Slide

  16. ...instead of fearing human error.

    View Slide

  17. Can’t fix what you
    don’t measure!
    - W. Edwards Deming

    View Slide

  18. StatsD
    graphite
    Skyline
    Oculus
    Supergrep
    homemade!
    not homemade
    Nagios
    Ganglia

    View Slide

  19. Real time error logging

    View Slide

  20. “Not all things that
    break throw errors.”
    - Oscar Wilde

    View Slide

  21. StatsD

    View Slide

  22. StatsD::increment("foo.bar")

    View Slide
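
A counter increment like the one on this slide ends up as a tiny UDP datagram. The sketch below shows the StatsD wire format for a counter (`name:delta|c`). It is a minimal illustration, not Etsy's client: the function names are made up here, and the host and port are assumptions (8125 is the usual StatsD default).

```python
import socket


def format_counter(metric: str, delta: int = 1) -> bytes:
    """Build a StatsD counter datagram, e.g. 'foo.bar:1|c'."""
    return f"{metric}:{delta}|c".encode("ascii")


def increment(metric: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget: send one counter increment over UDP.

    The host and port are assumptions for this sketch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(metric), (host, port))


if __name__ == "__main__":
    print(format_counter("foo.bar"))
```

Because the transport is UDP, the caller never waits on the metrics pipeline, which is what makes it cheap to instrument everything.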

  23. If it moves,
    graph it!

    View Slide

  24. If it moves,
    graph it!
    we would graph them ➞

    View Slide

  25. If it doesn’t move,
    graph it anyway
    (it might make a run for it)

    View Slide

  26. DASHBOARDS!

    View Slide

  27. [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 20]
    [1358731200, 60]
    [1358731200, 20]
    [1358731200, 20]

    View Slide

  28. DASHBOARDS! x 250,000

    View Slide

  29. View Slide

  30. lol nagios

    View Slide

  31. “...but there are also
    unknown unknowns -
    there are things we do
    not know we don’t
    know.”

    View Slide

  32. Unknown
    anomalies

    View Slide

  33. Unknown
    correlations

    View Slide

  34. Kale.

    View Slide

  35. Kale:
    - leaves
    - green stuff

    View Slide

  36. Kale:
    - leaves
    - green stuff
    OCULUS
    SKYLINE

    View Slide

  37. Q). How do you analyze a
    timeseries for anomalies
    in real time?

    View Slide

  38. A). Lots of HTTP requests
    to Graphite’s API!

    View Slide

  39. Q). How do you analyze a
    quarter million timeseries
    for anomalies in real time?

    View Slide

  40. SKYLINE

    View Slide

  41. SKYLINE

    View Slide

  42. A real time
    anomaly detection
    system

    View Slide

  43. Real time?

    View Slide

  44. Kinda.

    View Slide

  45. StatsD
    Ten second resolution

    View Slide

  46. Ganglia
    One minute resolution

    View Slide

  47. Best case:
    ~ 10s
    ~ 1min

    View Slide

  48. Takes about 90 seconds
    with our throughput.

    View Slide

  49. Still faster than you would
    have discovered it otherwise.

    View Slide

  50. Memory > Disk

    View Slide

  51. View Slide

  52. Q). How do you get a
    quarter million timeseries
    into Redis on time?

    View Slide

  53. STREAM IT!

    View Slide

  54. Graphite’s relay agent
    original graphite
    backup graphite

    View Slide

  55. Graphite’s relay agent
    original graphite
    backup graphite
    pickles:
    [statsd.numStats, [1365603412, 73421]]
    [statsd.numStats, [1365603422, 82345]]
    [statsd.numStats, [1365603432, 80611]]

    View Slide

  56. Graphite’s relay agent
    original graphite
    skyline
    pickles:
    [statsd.numStats, [1365603412, 73421]]
    [statsd.numStats, [1365603422, 82345]]
    [statsd.numStats, [1365603432, 80611]]

    View Slide
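
The pickles on these slides are batches of `(metric, (timestamp, value))` tuples. Graphite's pickle protocol frames each batch with a 4-byte big-endian length header followed by the pickled list. The sketch below encodes and decodes one such message. The helper names are hypothetical, and a real listener would read from a socket rather than from a bytes object.

```python
import pickle
import struct


def encode_batch(datapoints):
    """Frame a list of (metric, (timestamp, value)) tuples for the pickle protocol."""
    payload = pickle.dumps(datapoints, protocol=2)
    header = struct.pack("!L", len(payload))  # 4-byte big-endian length
    return header + payload


def decode_batch(message):
    """Reverse of encode_batch. Only unpickle data from a source you trust."""
    (length,) = struct.unpack("!L", message[:4])
    return pickle.loads(message[4:4 + length])


if __name__ == "__main__":
    batch = [
        ("statsd.numStats", (1365603412, 73421)),
        ("statsd.numStats", (1365603422, 82345)),
        ("statsd.numStats", (1365603432, 80611)),
    ]
    for metric, (timestamp, value) in decode_batch(encode_batch(batch)):
        print(metric, timestamp, value)
```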

  57. We import from Ganglia too.

    View Slide

  58. Storing timeseries

    View Slide

  59. Minimize I/O
    Minimize memory

    View Slide

  60. redis.append()
    - Strings
    - Constant time
    - One operation per update

    View Slide

  61. JSON?

    View Slide

  62. => get statsD.numStats
    ----------------------------
    “[1358711400, 51],”

    View Slide

  63. => get statsD.numStats
    ----------------------------
    “[1358711400, 51],
    [1358711410, 23],”

    View Slide

  64. => get statsD.numStats
    ----------------------------
    “[1358711400, 51],
    [1358711410, 23],
    [1358711420, 45],”

    View Slide

  65. OVER HALF
    CPU time spent
    decoding JSON

    View Slide

  66. [1,2]

    View Slide

  67. [ 1 , 2 ]
    Stuff we care about
    Extra junk

    View Slide

  68. MESSAGEPACK

    View Slide

  69. MESSAGEPACK
    A binary-based
    serialization protocol

    View Slide

  70. \x92\x01\x02
    Array size
    (16 or 32 bit big
    endian integer)
    Things we care about

    View Slide

  71. \x92\x01\x02
    Array size
    (16 or 32 bit big
    endian integer)
    Things we care about
    \x92\x02\x03

    View Slide

  72. CUT IN HALF
    Run Time + Memory Used

    View Slide
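
The append-only storage idea can be sketched without Redis or a MessagePack library. Below, a dict of byte strings stands in for Redis `APPEND`, and a hand-rolled encoder covers only the small subset of MessagePack needed for `[timestamp, value]` pairs of non-negative integers. Everything here is an assumption for illustration: the function names, the `store` dict, and the choice to encode every integer above 127 as a uint 32 (a real encoder would pick the smallest type and would handle floats).

```python
import struct


def pack_uint(n: int) -> bytes:
    """Encode an integer in [0, 2**32) as a MessagePack integer."""
    if n < 0 or n > 0xFFFFFFFF:
        raise ValueError("this sketch only handles 0 <= n < 2**32")
    if n <= 0x7F:
        return bytes([n])                      # positive fixint: one byte
    return b"\xce" + struct.pack(">I", n)      # uint 32, big-endian


def pack_point(timestamp: int, value: int) -> bytes:
    """Encode [timestamp, value] as a two-element fixarray."""
    return b"\x92" + pack_uint(timestamp) + pack_uint(value)


def unpack_points(blob: bytes):
    """Decode a concatenation of packed points back into a list of pairs."""
    points, i = [], 0
    while i < len(blob):
        if blob[i] != 0x92:
            raise ValueError("expected a two-element array header")
        i += 1
        pair = []
        for _ in range(2):
            if blob[i] <= 0x7F:
                pair.append(blob[i])
                i += 1
            elif blob[i] == 0xCE:
                pair.append(struct.unpack(">I", blob[i + 1:i + 5])[0])
                i += 5
            else:
                raise ValueError("type not supported by this sketch")
        points.append(pair)
    return points


store = {}  # stand-in for Redis: key -> byte string


def append(key: str, timestamp: int, value: int) -> None:
    """One constant-time append per update, like redis.append()."""
    store[key] = store.get(key, b"") + pack_point(timestamp, value)


if __name__ == "__main__":
    append("statsD.numStats", 1358711400, 51)
    append("statsD.numStats", 1358711410, 23)
    print(len(store["statsD.numStats"]), "bytes for 2 points")
    print(unpack_points(store["statsD.numStats"]))
```

Because each packed point is self-delimiting, the stored string never needs separators or a closing bracket, so an update never has to read or rewrite what is already there.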

  73. ROOMBA.PY
    CLEANS THE DATA

    View Slide

  74. “Wait...you wrote this in Python?”

    View Slide

  75. Great statistics libraries
    Not fun for parallelism

    View Slide

  76. Assign Redis keys to each process
    Process decodes and analyzes
    The Analyzer

    View Slide
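
One simple way to assign keys to each process is a strided split, so every worker gets a disjoint, roughly equal share. This is a sketch of that idea only: the function name and the striding scheme are assumptions, not necessarily how Skyline divides the work.

```python
def assign_keys(keys, num_processes):
    """Split keys into num_processes disjoint, roughly equal slices."""
    return [keys[i::num_processes] for i in range(num_processes)]


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    metrics = ["metric.a", "metric.b", "metric.c", "metric.d", "metric.e"]
    for worker, share in enumerate(assign_keys(metrics, 2)):
        print(worker, share)
```

Each worker can then decode and analyze its own share in a separate process, which sidesteps contention between threads in a single interpreter.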

  77. Anomalous metrics written as JSON
    setInterval() retrieves from front end
    The Analyzer

    View Slide

  78. View Slide

  79. What does it mean
    to be anomalous?

    View Slide

  80. Consensus model

    View Slide

  81. Implement everything you
    can get your hands on

    View Slide
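
A consensus model can be sketched as a vote: run several independent detectors and flag the metric only if enough of them agree. The three detectors and the thresholds below are illustrative stand-ins chosen for this sketch, not Skyline's actual algorithm set.

```python
from statistics import mean, median, stdev


def stddev_from_average(series):
    """Latest point is more than three standard deviations from the mean."""
    *history, latest = series
    return abs(latest - mean(history)) > 3 * stdev(history)


def median_absolute_deviation(series):
    """Latest point is far from the median, measured in units of the MAD."""
    *history, latest = series
    center = median(history)
    mad = median(abs(v - center) for v in history)
    if mad == 0:
        return latest != center
    return abs(latest - center) > 6 * mad


def above_history_max(series):
    """Latest point is higher than anything seen before."""
    *history, latest = series
    return latest > max(history)


DETECTORS = [stddev_from_average, median_absolute_deviation, above_history_max]


def is_anomalous(series, detectors=DETECTORS, consensus=2):
    """Anomalous when at least `consensus` detectors vote yes."""
    votes = sum(1 for detect in detectors if detect(series))
    return votes >= consensus


if __name__ == "__main__":
    print(is_anomalous([20, 21, 19, 20, 22, 18, 20, 21, 60]))
    print(is_anomalous([20, 21, 19, 20, 22, 18, 20, 21, 21]))
```

Requiring agreement trades a little sensitivity for fewer false alarms, since no single detector's blind spot decides the outcome.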

  82. Basic algorithm:
    “A metric is anomalous if its
    latest datapoint is over three
    standard deviations above
    its moving average.”

    View Slide
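
The basic algorithm on this slide fits in a few lines. In this sketch the "moving average" is taken over all points before the latest one, and the latest point is excluded from the average and standard deviation. Both choices are assumptions, as is the function name.

```python
from statistics import mean, stdev


def three_sigma(series, sigmas=3.0):
    """True if the latest datapoint is more than `sigmas` standard
    deviations above the average of the preceding datapoints."""
    *history, latest = series
    if len(history) < 2:
        return False  # not enough data to estimate a deviation
    return latest - mean(history) > sigmas * stdev(history)


if __name__ == "__main__":
    print(three_sigma([20, 21, 19, 20, 22, 18, 20, 21, 60]))
    print(three_sigma([20, 21, 19, 20, 22, 18, 20, 21, 21]))
```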

  83. Grubbs’ test, ordinary least squares

    View Slide

  84. Histogram binning

    View Slide

  85. Four horsemen of the modelpocalypse

    View Slide

  86. 1. Seasonality
    2. Spike influence
    3. Normality
    4. Parameters

    View Slide

  87. Anomaly?

    View Slide

  88. Nope.

    View Slide

  89. Spikes artificially raise the moving average
    Anomaly detected (yay!)
    Bigger moving average
    Anomaly missed :(
    View Slide
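
The spike problem is easy to reproduce with a three-sigma check. In the sketch below, the same later jump to 60 is flagged when the window is clean and missed when the window still contains an earlier spike, because that spike inflates both the average and the standard deviation. The data and the function name are made up for illustration.

```python
from statistics import mean, stdev


def three_sigma(series, sigmas=3.0):
    """Latest point is more than `sigmas` deviations above the prior average."""
    *history, latest = series
    return latest - mean(history) > sigmas * stdev(history)


clean_window = [20, 21, 19, 20, 20, 21, 19, 20]
spiky_window = [20, 21, 19, 20, 200, 20, 21, 19, 20]  # earlier spike to 200

if __name__ == "__main__":
    print("spike itself detected:", three_sigma([20, 21, 19, 20, 200]))
    print("clean window, jump to 60:", three_sigma(clean_window + [60]))
    print("spiky window, jump to 60:", three_sigma(spiky_window + [60]))
```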

  90. Real world data doesn’t
    necessarily follow a perfect
    normal distribution.

    View Slide

  91. Too many metrics to fit
    parameters for them all!

    View Slide

  92. A robust set of algorithms is the
    current focus of this project.

    View Slide

  93. Q). How do you analyze a
    quarter million timeseries
    for correlations?

    View Slide

  94. OCULUS

    View Slide

  95. Image comparison is expensive and slow

    View Slide

  96. “[[975, 1365528530],
    [643, 1365528540],
    [750, 1365528550],
    [992, 1365528560],
    [580, 1365528570],
    [586, 1365528580],
    [649, 1365528590],
    [548, 1365528600],
    [901, 1365528610],
    [633, 1365528620]]”
    Use raw timeseries instead of raw graphs

    View Slide

  97. Naming Things
    Cache Invalidation
    Numerical Comparison?
    HARD PROBLEMS

    View Slide

  98. Naming Things
    Cache Invalidation
    Numerical Comparison?
    HARD PROBLEMS

    View Slide

  99. Euclidean Distance

    View Slide
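
Euclidean distance compares two equal-length series point by point. A minimal sketch (the function name is just for illustration):

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    bump = [0, 0, 1, 0, 0]
    shifted_bump = [0, 1, 0, 0, 0]
    # Same shape, one step out of phase, yet the distance is not zero.
    print(euclidean_distance(bump, shifted_bump))
```

Because it only compares values at the same index, two series with the same shape but a small time offset look different to it.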

  100. Dynamic Time Warping
    (helps with phase shifts)

    View Slide
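
Dynamic time warping lets one series stretch or compress in time to line up with the other, which is why it tolerates phase shifts. The sketch below is the textbook dynamic-programming form, using absolute difference as the point cost (an assumption; other cost functions work too). It fills an N by M table, which is where the quadratic cost comes from.

```python
def dtw_distance(a, b):
    """Classic DTW: cost of the cheapest monotonic alignment of a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    bump = [0, 0, 1, 0, 0]
    shifted_bump = [0, 1, 0, 0, 0]
    print(dtw_distance(bump, shifted_bump))
```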

  101. We’ve solved it!

    View Slide

  102. O(N²)

    View Slide

  103. O(N²) x 250k

    View Slide

  104. Too slow!

    View Slide

  105. doesn’t

    View Slide

  106. No need to run DTW on all 250k.

    View Slide

  107. Discard obviously dissimilar metrics.

    View Slide

  108. Shape Description Alphabet
    “975 643 643 750 992 992 992 580”
    “sharpdecrement flat increment
    sharpincrement flat flat
    sharpdecrement”

    View Slide

  109. Shape Description Alphabet
    “975 643 643 750 992 992 992 580”
    (normalization step)
    “24 4 4 11 25 25 25 0 1”
    “sharpdecrement flat increment
    sharpincrement flat flat
    sharpdecrement”

    View Slide
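
A shape description can be sketched in two steps: rescale the series to a small integer range, then name each step between neighbouring points. The 0 to 25 scale, the round-up rule, the threshold of 10 for a "sharp" move, and the function names are all guesses made for this sketch, not Oculus's actual parameters.

```python
def normalize(values, scale=25):
    """Rescale integers to the range 0..scale, rounding up."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    span = hi - lo
    # Ceiling division done with integer floor division on the negated value.
    return [-((lo - v) * scale // span) for v in values]


def describe(values, sharp=10):
    """Turn a series into a string of shape tokens, one per step."""
    norm = normalize(values)
    tokens = []
    for prev, cur in zip(norm, norm[1:]):
        delta = cur - prev
        if delta == 0:
            tokens.append("flat")
        elif delta >= sharp:
            tokens.append("sharpincrement")
        elif delta > 0:
            tokens.append("increment")
        elif delta <= -sharp:
            tokens.append("sharpdecrement")
        else:
            tokens.append("decrement")
    return " ".join(tokens)


if __name__ == "__main__":
    series = [975, 643, 643, 750, 992, 992, 992, 580]
    print(normalize(series))
    print(describe(series))
```

Normalizing first means two metrics with the same shape but very different magnitudes produce the same token string, so a text search can match them.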

  110. View Slide

  111. Search for shape description
    fingerprint in Elasticsearch

    View Slide

  112. Run DTW on results
    as final polish

    View Slide
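
The two-pass search can be sketched in plain Python, with a list of dicts standing in for the Elasticsearch index. The first pass keeps only documents whose fingerprint contains the query phrase, and the second ranks the survivors by DTW distance. The exact-phrase matcher (no slop), the metric names, and the function names are assumptions for illustration.

```python
def phrase_match(fingerprint, phrase):
    """True if the phrase's tokens appear as a contiguous run in the fingerprint."""
    doc, query = fingerprint.split(), phrase.split()
    return any(doc[i:i + len(query)] == query
               for i in range(len(doc) - len(query) + 1))


def dtw_distance(a, b):
    """Textbook quadratic DTW with absolute difference as the point cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def search(index, phrase, values, limit=10):
    """First pass: cheap phrase filter. Second pass: DTW ranking."""
    candidates = [doc for doc in index if phrase_match(doc["fingerprint"], phrase)]
    ranked = sorted(candidates, key=lambda doc: dtw_distance(values, doc["values"]))
    return [doc["id"] for doc in ranked[:limit]]


if __name__ == "__main__":
    index = [  # hypothetical documents
        {"id": "metric.a", "fingerprint": "sdec inc sinc sdec", "values": [10, 1, 2, 15, 4]},
        {"id": "metric.b", "fingerprint": "sdec inc sinc sdec", "values": [20, 2, 4, 30, 8]},
        {"id": "metric.c", "fingerprint": "flat flat inc flat", "values": [5, 5, 5, 6, 6]},
    ]
    print(search(index, "inc sinc", [10, 1, 2, 15, 4]))
```

The expensive comparison only runs on whatever survives the text filter, which is the whole point of the first pass.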

  113. O(N²) on ~10k metrics

    View Slide

  114. Still too slow.

    View Slide

  115. Fast DTW - O(N)
    coarsen
    project
    refine

    View Slide
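
The coarsen, project, refine steps can be sketched as a short recursion: halve the resolution, solve the smaller problem, then rerun DTW at full resolution only in a narrow band around the projected path. This is a simplified reading of FastDTW written for this transcript, not the plugin used in Oculus. The function names are made up, the point cost is absolute difference, and the sketch assumes `radius >= 1`. The result is an approximation that can never be lower than the exact DTW distance.

```python
def dtw(x, y, window=None):
    """DTW restricted to the cells in `window`; returns (distance, path)."""
    if window is None:
        window = [(i, j) for i in range(len(x)) for j in range(len(y))]
    inf = float("inf")
    best = {(-1, -1): (0.0, None)}  # cell -> (cost so far, predecessor)
    for i, j in sorted(window):
        step = abs(x[i] - y[j])
        prev = min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                   key=lambda cell: best.get(cell, (inf, None))[0])
        best[i, j] = (best.get(prev, (inf, None))[0] + step, prev)
    path, cell = [], (len(x) - 1, len(y) - 1)
    while cell != (-1, -1):
        path.append(cell)
        cell = best[cell][1]
    path.reverse()
    return best[len(x) - 1, len(y) - 1][0], path


def coarsen(series):
    """Halve the resolution by averaging adjacent pairs."""
    return [(series[i] + series[i + 1]) / 2 for i in range(0, len(series) - 1, 2)]


def project(path, len_x, len_y, radius):
    """Map a coarse path onto the finer grid, widened by `radius` coarse cells."""
    coarse = {(i + di, j + dj) for i, j in path
              for di in range(-radius, radius + 1)
              for dj in range(-radius, radius + 1)}
    fine = {(2 * i + a, 2 * j + b) for i, j in coarse for a in (0, 1) for b in (0, 1)}
    return [(i, j) for i, j in fine if 0 <= i < len_x and 0 <= j < len_y]


def fast_dtw(x, y, radius=1):
    """Approximate DTW: coarsen, solve recursively, project, refine."""
    if radius < 1:
        raise ValueError("this sketch assumes radius >= 1")
    if len(x) < radius + 2 or len(y) < radius + 2:
        return dtw(x, y)  # small enough to solve exactly
    _, coarse_path = fast_dtw(coarsen(x), coarsen(y), radius)
    return dtw(x, y, project(coarse_path, len(x), len(y), radius))


if __name__ == "__main__":
    a = [0, 0, 1, 2, 1, 0, 0, 0]
    b = [0, 0, 0, 1, 2, 1, 0, 0]
    print("exact:", dtw(a, b)[0], "approximate:", fast_dtw(a, b)[0])
```

The band around the projected path stays a fixed width at every resolution, so the number of cells evaluated grows roughly in proportion to the series length instead of its square.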

  116. Elasticsearch Details
    Phrase search for first
    pass scores across shape
    description fingerprints

    View Slide

  117. Elasticsearch Details
    Phrase search for first pass scores
    across shape description fingerprints
    Custom FastDTW and Euclidean
    distance plugins to score across the
    remaining filtered timeseries

    View Slide

  118. Elasticsearch Structure
    {
      :id => "statsd.numStats",
      :fingerprint => "sdec inc sinc sdec",
      :values => "10 1 2 15 4"
    }

    View Slide

  119. Mappings
    Specify tokenizers
    “Untouched” fields

    View Slide

  120. First pass query
    {
      :match => {
        :fingerprint => {
          # shape description fingerprint
          :query => "sdec inc sinc sdec inc",
          :type => "phrase",
          :slop => 20
        }
      }
    }

    View Slide

  121. Refinement query
    {
      :custom_score => {
        :query => first_pass_query, # the first pass query
        :script => "oculus_dtw",
        :params => {
          # raw timeseries
          :query_value => "10 20 20 10 30",
          :query_field => "values.untouched"
        }
      }
    }

    View Slide

  122. Skyline
    Elasticsearch
    Resque
    Sinatra
    Ganglia
    Graphite
    StatsD
    KALE
    Flask

    View Slide

  123. Populating Elasticsearch

    View Slide

  124. ES
    Index
    resque workers

    View Slide

  125. Too slow to
    update and search

    View Slide

  126. New
    Index
    Last
    Index
    Webapp

    View Slide
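
One reading of this diagram is a double-buffered index: import workers fill a new index while the web app keeps searching the last complete one, and the two are swapped once the import finishes. The sketch below models that with two dicts. The class and method names are hypothetical, and the real setup uses Elasticsearch rather than in-memory structures.

```python
class RotatingIndex:
    """Serve searches from the last complete index while a new one fills."""

    def __init__(self):
        self.last = {}  # read by the web app
        self.new = {}   # written by the import workers

    def add(self, metric_id, document):
        """Workers only ever write to the new index."""
        self.new[metric_id] = document

    def search(self, predicate):
        """Searches only ever read the last complete index."""
        return [metric_id for metric_id, doc in self.last.items() if predicate(doc)]

    def rotate(self):
        """Promote the freshly built index and start an empty one."""
        self.last, self.new = self.new, {}


if __name__ == "__main__":
    index = RotatingIndex()
    index.add("statsd.numStats", {"fingerprint": "sdec inc sinc sdec"})
    print(index.search(lambda doc: "sinc" in doc["fingerprint"]))  # not visible yet
    index.rotate()
    print(index.search(lambda doc: "sinc" in doc["fingerprint"]))
```

Searches never compete with bulk writes, at the cost of results lagging by one import cycle.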

  127. Sinatra frontend
    Queries ES
    Renders results

    View Slide

  128. Collections

    View Slide

  129. devops <3

    View Slide

  130. View Slide

  131. Special thanks to:
    Dr. Neil Gunther, PerfDynamics
    Dr. Brian Whitman, Echonest
    Burc Arpat, Facebook
    Seth Walker, Etsy
    Rafe Colburn, Etsy
    Mike Rembetsy, Etsy
    John Allspaw, Etsy

    View Slide

  132. @abestanway @jonlives
    Thanks!
    github.com/etsy/skyline
    github.com/etsy/oculus

    View Slide