Beyond The Mean Time To Recover (MTTR)

j.hand
September 13, 2017


Transcript

  1. BEYOND THE
    MTTR
    (Mean Time To Recover)

  2. @JASONHAND
    JASON HAND
    VICTOROPS

  3. The most relevant
    metric in evaluating the
    effectiveness of
    emergency response is
    how quickly the
    response team can
    bring the system back
    to health 
    -- that is, the MTTR.
    — Benjamin Treynor Sloss

  4. HIGH
    AVAILABILITY &
    RELIABILITY

  5. 99.999 %
    UPTIME

  7. PREDICT &
    PREVENT

  8. WHAT ABOUT
    MTBF?
    (Mean Time Between Failure)

  9. COMPLEX
    SYSTEMS

  10. FAILURE

  11. Failure cares not about the architecture
    designs you slave over, the code you write
    and review, or the alerts and metrics you
    meticulously pore through ...
    ... Failure happens. This is a foregone
    conclusion when working with complex
    systems.
    — John Allspaw (Former CTO Etsy)

  13. AVAILABILITY =
    MTBF/(MTBF + MTTR)
    A commonly used measurement of how
    often a system or service is available
    compared to the total time it should be
    usable.5
    5 Effective DevOps (Jennifer Davis, Katherine Daniels)
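
As a rough sketch of how that formula behaves (the figures below are hypothetical, not taken from the deck), here is the availability calculation in Python, together with the downtime budget implied by a target such as the 99.999% mentioned earlier:

```python
# Availability as a function of MTBF and MTTR (both in hours), per the formula above.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical service: roughly one failure a month (MTBF ~730 h), one hour to recover.
print(f"{availability(730, 1):.5%}")  # ~99.863% -- well short of five nines

# Downtime budget implied by an availability target, over one year.
def downtime_minutes_per_year(target: float) -> float:
    return (1 - target) * 365.25 * 24 * 60

print(f"{downtime_minutes_per_year(0.99999):.2f} min/year")  # ~5.26 minutes for 99.999%
```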

  14. AVAILABILITY & RELIABILITY:
    THE RESULT OF A TEAM'S ABILITY TO...

  15. RESPOND
    & RECOVER
    QUICKLY

  18. HIGH-PERFORMING
    ORGANIZATIONS
    resolve production incidents 168 times faster than their
    peers 7
    7 State of DevOps

  24. WHAT IS THE
    ROI
    OF DEVOPS?

  26. HOW MUCH DID THAT
    OUTAGE
    COST THE COMPANY?

  34. LET'S CALCULATE

  35. COST OF DOWNTIME =
    Deployment Frequency
    × Change Failure Rate
    × Mean Time To Recover (MTTR)
    × Hourly Cost of Outage
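
A minimal sketch of the calculation the next few slides break down, assuming the four factors simply multiply together as written above; the example figures, especially the hourly outage cost, are hypothetical:

```python
# Cost of downtime = deploys/year x change failure rate x MTTR (hours) x hourly cost of outage
def cost_of_downtime(deploys_per_year: float,
                     change_failure_rate: float,
                     mttr_hours: float,
                     hourly_outage_cost: float) -> float:
    return deploys_per_year * change_failure_rate * mttr_hours * hourly_outage_cost

# Hypothetical example: 1,460 deploys/year, 10% change failure rate,
# 1 hour MTTR, $10,000/hour outage cost.
print(f"${cost_of_downtime(1460, 0.10, 1, 10_000):,.0f} per year")  # $1,460,000
```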

  36. DEPLOYMENT
    FREQUENCY

  37. DEPLOYMENT
    FREQUENCY
    How many times per year your own org
    deploys changes1
    High performers = 1,460/year
    Low performers = 7/year
    1 State of DevOps Report (2016), Puppet Labs

  38. CHANGE
    FAILURE RATE

  39. CHANGE
    FAILURE RATE
    Percentage of changes that cause an
    outage in an organization
    High performers = 0-15%
    Low performers = 16-30%

  40. MTTR
    & HOURLY COST OF OUTAGE

  42. WHAT IS
    MTTR
    EXACTLY?

  43. MTTR
    (Mean Time To Recover)
    How long does it generally take to
    restore service when a service incident
    occurs (e.g., unplanned outage, service
    impairment)?11
    11 State of DevOps Report (2016)

  44. BUT WHAT DOES
    MEAN
    MEAN?

  45. ARITHMETIC MEAN
    (ĂRˌĬTH-MĔTˈĬK MĒN)
    n. The value obtained by dividing the sum of a set of
    quantities by the number of quantities in the set
    ALSO CALLED AVERAGE
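
Written as a formula (standard notation, not from the slide), the mean of n recovery times t_1, ..., t_n is:

```latex
\mathrm{MTTR} = \bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i
```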

  46. AVERAGE
    (AS A METRIC)
    TELLS YOU NOTHING ABOUT THE
    ACTUAL INCIDENTS

  47. LIMITATIONS OF
    ARITHMETIC MEAN
    In data sets that are skewed or where outliers are
    present, calculating the arithmetic mean often provides
    a misleading result.3
    3 Arithmetic Mean
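
A quick, made-up illustration of that point using recovery times in minutes: one long SEV1-style outage (like the one described a few slides later) drags the mean well above what a typical incident looked like, while the median barely moves:

```python
from statistics import mean, median

# Hypothetical recovery times in minutes: nine routine incidents and one long outage.
recovery_minutes = [8, 10, 12, 9, 11, 10, 13, 9, 12, 480]

print(f"mean   (MTTR): {mean(recovery_minutes):.1f} min")    # ~57.4 -- dominated by the outlier
print(f"median       : {median(recovery_minutes):.1f} min")  # 10.5  -- closer to a typical incident
print(f"maximum      : {max(recovery_minutes)} min")         # 480   -- the outage itself
```

This is the same reason the deck later suggests tracking median and maximum time to recover alongside the mean.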

  48. DISTORTED
    BIG PICTURE

  49. AVERAGES...
    The number of data points can vary
    greatly depending on the complexity
    and scale of systems.
    Furthermore, averages assume there is
    a normal event or that your data is a
    normal distribution.13
    "Average is a horrible metric for
    optimizing performance."
    — Richard Thaler
    13 Misbehaving: The Making of Behavioral Economics

  50. 5,600
    ALERTS PER DAY
    Customers w/ 50 seats or more8
    8 Average skewed by some high volume customers

  51. 66
    INCIDENTS / DAY
    Critical Alerts 12
    12 Average based on customers w/ 50+ seats.
    NOTE: Again, an average ¯\_(ツ)_/¯

  53. MEAN TIME TO
    WTF

  54. REASONS INCLUDE:
    Incidents are auto-resolving?
    Engineers are re-routing alerts?
    Incidents remain unresolved until postmortem?
    Engineers are constantly hitting “resolve” because ...

  55. IS DOWN
    AGAIN!

  56. SEV1 OR SEV2
    OUTAGE
    Just one extended outage can severely
    tip the scales, skewing your data points
    away from anything like a normal distribution

  57. WHAT ELSE SHOULD WE
    MEASURE?

  58. MEDIAN
    TIME TO RECOVER
    More robust to outliers.

  59. MAXIMUM
    TIME TO RECOVER

  60. VOLUME
    OF ALERTS
    BY SEVERITY
    AND TOTAL

  61. TOTAL NUMBER
    OF OUTAGES
    to determine total downtime
    (Related: Service Level Agreements)
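
One way to act on that, sketched with invented outage durations: sum the downtime for a period and compare it against the budget an SLA allows (the 99.9% monthly target below is only an example, not from the deck):

```python
# Hypothetical outage durations (minutes) for one month.
outages_minutes = [12, 3, 45, 7]

total_downtime = sum(outages_minutes)
minutes_in_month = 30 * 24 * 60
sla_target = 0.999                                 # example SLA, not from the deck
allowed_downtime = (1 - sla_target) * minutes_in_month

print(f"{len(outages_minutes)} outages, {total_downtime} min total downtime")
print(f"SLA budget: {allowed_downtime:.0f} min -> "
      f"{'OK' if total_downtime <= allowed_downtime else 'SLA breached'}")
```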

  62. NOISY HOSTS OR SERVICES

  63. ALERT
    ACTIONABILITY

  64. ALERT TYPES
    INFO, WARNING, CRITICAL

  65. ALERT
    VOLUME
    / day
    (pssst ...careful of avg)

  66. ALERT TIMES

  67. MTTA
    MEAN TIME TO
    ACKNOWLEDGE
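
For reference, a small sketch of how MTTA and MTTR fall out of the same incident records; the field names and timestamps are invented and not tied to VictorOps or any particular tool:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the alert fired, was acknowledged, and was resolved.
incidents = [
    {"alerted": "2017-09-01 02:14", "acked": "2017-09-01 02:21", "resolved": "2017-09-01 02:55"},
    {"alerted": "2017-09-03 11:02", "acked": "2017-09-03 11:04", "resolved": "2017-09-03 11:30"},
]

def minutes(start: str, end: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

mtta = mean(minutes(i["alerted"], i["acked"]) for i in incidents)     # alert -> acknowledge
mttr = mean(minutes(i["alerted"], i["resolved"]) for i in incidents)  # alert -> recover

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```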

  70. BIOMETRIC DATA

  71. SLEEP DISRUPTION
    (MEAN TIME TO SLEEP)15
    15 https://www.slideshare.net/lozzd/mean-time-to-sleep-quantifying-the-oncall-experience

  72. FINAL THOUGHTS:
    > Identify the noisiest alerts and address them now
    > Bring more into the fold (Devs on-call)
    > Shift observability left
    > Share the pain
    > Share the information

  73. YOUR CHALLENGE:
    Examine your own
    Mean Time To Recover
    Discuss additional methods of
    understanding data

  74. INCENTIVE
    STRUCTURES
    Be mindful of incentive structures intended
    to "encourage" a reduction in MTTR
    We may believe that our efforts are
    improving when the truth is they
    aren't.

  75. WHAT DOES
    YOUR
    DOWNTIME COST?
    How is MTTR impacting it?
    What else is impacting it?

  76. DEVOPS
    ROI

  77. ABOVE
    ALL...

  78. CONTINUOUSLY
    IMPROVE

  79. BEYOND THE
    MEAN TIME TO RECOVER
    @JASONHAND
    VICTOROPS

  80. THANK
    YOU

  81. Abstract
    Mean Time To Repair (MTTR) has long been the de facto metric for those tasked with the responsibility of uptime. It's the
    cornerstone measurement of how well teams respond to service disruptions and a key performance metric that nearly
    all in IT should aim to consistently improve. Swift recovery provides far more benefit than attempts to engineer failure
    out of complex systems.
    As important as MTTR has become, the mean time to repair is no more than an average of how long it took to manage
    individual incidents (from acknowledgement to resolution) over a period of time. The number of data points during that time
    can vary greatly depending on the complexity and scale of systems. Furthermore, averages assume there is a normal
    event or that your data follows a normal distribution. Anyone who has been on-call can attest that some incidents take
    longer to resolve than others, and that variance is something you shouldn't ignore.
    Within any time-series dataset there are high and low values hidden in the data. These outliers may indicate that
    while we think the time it takes to recover from failure is good, bad, or otherwise, the many high values in our average
    distort or hide the lower values, and vice versa. We may believe that our efforts to reduce the time it takes to recover
    from failure are working when the truth is they are not.
    In this talk, we'll discuss the metric of Mean Time To Repair as well as additional methods of understanding data related
    to building and maintaining reliable systems at scale. MTTR must be a priority for any IT team that habitually
    follows old-view approaches to incident response; however, a deeper understanding of the data provides much higher
    fidelity regarding the true health of your systems and the teams that support them.

  82. Additional resources:
    VictorOps.com (http://www.victorops.com)
    Kitchen Soap: Blogs by John Allspaw (http://www.kitchensoap.com)
    Signalvnoise.com (https://m.signalvnoise.com/)