
Beyond The Mean Time To Recover (MTTR)

j.hand
September 13, 2017


Transcript

  1. BEYOND THE MTTR (Mean Time To Recover) 1 — @jasonhand

  2. @JASONHAND JASON HAND VICTOROPS 2 — @jasonhand

  3. The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health -- that is, the MTTR. — Benjamin Treynor Sloss 3 — @jasonhand
  4. HIGH AVAILABILITY & RELIABILITY 4 — @jasonhand

  5. 99.999 % UPTIME 5 — @jasonhand

  6. 6 — @jasonhand

  7. PREDICT & PREVENT 7 — @jasonhand

  8. WHAT ABOUT MTBF? (Mean Time Between Failure) 8 — @jasonhand

  9. COMPLEX SYSTEMS 9 — @jasonhand

  10. FAILURE 10 — @jasonhand

  11. Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore through… Failure happens. This is a foregone conclusion when working with complex systems. — John Allspaw (Former CTO, Etsy) 11 — @jasonhand
  12. 12 — @jasonhand

  13. AVAILABILITY = MTBF / (MTBF + MTTR) A commonly used measurement of how often a system or service is available compared to the total time it should be usable. [5: Effective DevOps (Jennifer Davis, Katherine Daniels)] 13 — @jasonhand
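
Below is a minimal sketch of this availability formula in Python. The MTBF and MTTR figures are assumptions invented for illustration, not numbers from the talk.

```python
# A minimal availability calculation: Availability = MTBF / (MTBF + MTTR).
# All figures are hypothetical, for illustration only.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of total time the service is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 720.0  # assumed: roughly one failure a month (~720 h between failures)
mttr = 1.5    # assumed: 90 minutes to recover, on average

a = availability(mtbf, mttr)
print(f"availability: {a:.5%}")                      # ~99.79210%
print(f"downtime/year: {(1 - a) * 365 * 24:.1f} h")  # ~18.2 h

# For contrast, the "five nines" of slide 5 (99.999%) leave only ~5.3 min/year:
print(f"five-nines budget: {(1 - 0.99999) * 365 * 24 * 60:.1f} min/year")
```
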
  14. AVAILABILITY & RELIABILITY: THE RESULT OF A TEAM'S ABILITY TO... 14 — @jasonhand
  15. RESPOND & RECOVER QUICKLY 15 — @jasonhand

  16. 16 — @jasonhand

  17. 17 — @jasonhand

  18. HIGH-PERFORMING ORGANIZATIONS resolve production incidents 168 times faster than their peers [7: State of DevOps] 18 — @jasonhand
  19. 19 — @jasonhand

  20. 20 — @jasonhand

  21. 21 — @jasonhand

  22. 22 — @jasonhand

  23. 23 — @jasonhand

  24. WHAT IS THE ROI OF DEVOPS? 24 — @jasonhand

  25. 25 — @jasonhand

  26. HOW MUCH DID THAT OUTAGE COST THE COMPANY? 26 — @jasonhand
  27. 27 — @jasonhand

  28. 28 — @jasonhand

  29. 29 — @jasonhand

  30. 30 — @jasonhand

  31. 31 — @jasonhand

  32. 32 — @jasonhand

  33. 33 — @jasonhand

  34. LET'S CALCULATE 34 — @jasonhand

  35. COST OF DOWNTIME = Deployment Frequency × Change Failure Rate × Mean Time To Recover (MTTR) × Hourly Cost of Outage 35 — @jasonhand
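
A minimal sketch of this calculation; every input below is an invented example, not a number from the talk.

```python
# Cost of downtime = deployment frequency x change failure rate x MTTR x hourly cost,
# as on the slide above. All inputs are hypothetical.

def cost_of_downtime(deploys_per_year: float,
                     change_failure_rate: float,
                     mttr_hours: float,
                     hourly_cost: float) -> float:
    """Expected yearly cost of downtime caused by failed changes."""
    return deploys_per_year * change_failure_rate * mttr_hours * hourly_cost

# Hypothetical org: 200 deploys/year, 10% change failure rate,
# 2 h mean time to recover, $10,000/h cost of an outage.
print(f"${cost_of_downtime(200, 0.10, 2.0, 10_000):,.0f}/year")  # $400,000/year
```
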
  36. DEPLOYMENT FREQUENCY 36 — @jasonhand

  37. DEPLOYMENT FREQUENCY How many times per year your own org deploys changes. High performers = 1,460/year; Low performers = 7/year [1: State of DevOps (2016), Puppet Labs] 37 — @jasonhand
  38. CHANGE FAILURE RATE 38 — @jasonhand

  39. CHANGE FAILURE RATE Percentage of changes that cause an outage in an organization. High performers = 0-15%; Low performers = 16-30% 39 — @jasonhand
  40. MTTR & HOURLY COST OF OUTAGE 40 — @jasonhand

  41. 41 — @jasonhand

  42. WHAT IS MTTR EXACTLY? 42 — @jasonhand

  43. MTTR (Mean Time To Recover) How long does it generally take to restore service when a service incident occurs (e.g., unplanned outage, service impairment)? [11: State of DevOps Report (2016)] 43 — @jasonhand
  44. BUT WHAT DOES MEAN MEAN? 44 — @jasonhand

  45. ARITHMETIC MEAN (ĂRˌĬTH-MĔTˈĬK MĒN) n. The value obtained by dividing the sum of a set of quantities by the number of quantities in the set. ALSO CALLED AVERAGE 45 — @jasonhand
  46. AVERAGE (AS A METRIC) TELLS YOU NOTHING ABOUT THE ACTUAL INCIDENTS 46 — @jasonhand
  47. LIMITATIONS OF ARITHMETIC MEAN In data sets that are skewed or where outliers are present, calculating the arithmetic mean often provides a misleading result. [3: Arithmetic Mean] 47 — @jasonhand
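
To see that limitation concretely, here is a small sketch with invented recovery times, where a single extended outage drags the mean far from the typical incident:

```python
from statistics import mean, median

# Ten hypothetical recovery times (minutes): nine routine incidents
# and one extended SEV1 outage. Invented numbers, purely illustrative.
recovery_minutes = [8, 5, 12, 9, 7, 11, 6, 10, 9, 480]

print(f"mean:   {mean(recovery_minutes):.1f} min")    # 55.7, dominated by the outlier
print(f"median: {median(recovery_minutes):.1f} min")  # 9.0, the typical incident
```
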
  48. DISTORTED BIG PICTURE 48 — @jasonhand

  49. AVERAGES... The number of data points can vary greatly depending on the complexity and scale of systems. Furthermore, averages assume there is a normal event or that your data is a normal distribution. "Average is a horrible metric for optimizing performance." — Richard Thaler [13: Misbehaving: The Making of Behavioral Economics] 49 — @jasonhand
  50. 5,600 ALERTS PER DAY Customers w/ 50 seats or more [8: Average skewed by some high-volume customers] 50 — @jasonhand
  51. 66 INCIDENTS / DAY Critical Alerts [12: Average based on customers w/ 50+ seats. NOTE: Again, average ¯\_(ツ)_/¯] 51 — @jasonhand
  52. 52 — @jasonhand

  53. MEAN TIME TO WTF 53 — @jasonhand

  54. REASONS INCLUDE: Incidents are auto-resolving? Engineers are re-routing alerts? Incidents remain unresolved until postmortem? Engineers are constantly hitting “resolve” because ... 54 — @jasonhand
  55. IS DOWN AGAIN! 55 — @jasonhand

  56. SEV1 OR SEV2 OUTAGE Just one extended outage can severely tip the scale on data points and normalization 56 — @jasonhand
  57. WHAT ELSE SHOULD WE MEASURE? 57 — @jasonhand

  58. MEDIAN TIME TO RECOVER More robust to outliers. 58 — @jasonhand
  59. MAXIMUM TIME TO RECOVER 59 — @jasonhand
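
Extending the invented data from the earlier sketch: the maximum (and a high percentile) surface the worst case that both mean and median smooth over.

```python
from statistics import quantiles

# Same invented recovery times as in the earlier sketch; now the worst-case view.
recovery_minutes = [8, 5, 12, 9, 7, 11, 6, 10, 9, 480]

print(f"max: {max(recovery_minutes)} min")   # 480, the extended SEV1
p95 = quantiles(recovery_minutes, n=20)[-1]  # 95th percentile
print(f"p95: {p95:.0f} min")
```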

  60. VOLUME OF ALERTS BY SEVERITY AND TOTAL 60 — @jasonhand

  61. TOTAL NUMBER OF OUTAGES Determine total downtime (Related: Service Level Agreements) 61 — @jasonhand
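
A small sketch of turning an outage log into total downtime and an uptime figure to weigh against an SLA target; the durations are invented.

```python
# Total downtime across all outages, compared to an SLA budget.
# Outage durations (minutes) over a 30-day window; invented numbers.
outage_minutes = [12, 45, 7, 90, 3]

window = 30 * 24 * 60           # minutes in the window
downtime = sum(outage_minutes)  # 157 min
uptime = 100 * (1 - downtime / window)

print(f"{len(outage_minutes)} outages, {downtime} min total downtime")
print(f"uptime: {uptime:.3f}% (a 99.9% SLA allows {window * 0.001:.0f} min)")
```
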
  62. NOISY HOSTS OR SERVICES 62 — @jasonhand

  63. ALERT ACTIONABILITY 63 — @jasonhand

  64. ALERT TYPES INFO, WARNING, CRITICAL 64 — @jasonhand

  65. ALERT VOLUME / day (pssst ...careful of avg) 65 — @jasonhand
  66. ALERT TIMES 66 — @jasonhand

  67. MTTA MEAN TIME TO ACKNOWLEDGE 67 — @jasonhand
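
One way to compute MTTA next to MTTR from per-incident timestamps; the timestamps below are invented, and MTTR is measured acknowledgement-to-resolution as the abstract describes.

```python
from datetime import datetime
from statistics import mean

# Per-incident timestamps: (triggered, acknowledged, resolved). Invented data.
incidents = [
    ("2017-09-01 02:14", "2017-09-01 02:19", "2017-09-01 02:51"),
    ("2017-09-03 14:02", "2017-09-03 14:03", "2017-09-03 14:20"),
    ("2017-09-07 23:40", "2017-09-08 00:02", "2017-09-08 03:15"),
]

def ts(s: str) -> datetime:
    return datetime.strptime(s, "%Y-%m-%d %H:%M")

def minutes(a: str, b: str) -> float:
    return (ts(b) - ts(a)).total_seconds() / 60

tta = [minutes(trig, ack) for trig, ack, _ in incidents]  # time to acknowledge
ttr = [minutes(ack, res) for _, ack, res in incidents]    # acknowledge-to-resolve

print(f"MTTA: {mean(tta):.1f} min")  # 9.3: how quickly someone responds
print(f"MTTR: {mean(ttr):.1f} min")  # 80.7: how long recovery takes after that
```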

  68. 68 — @jasonhand

  69. 69 — @jasonhand

  70. BIOMETRIC DATA 70 — @jasonhand

  71. SLEEP DISRUPTION (MEAN TIME TO SLEEP) [15: https://www.slideshare.net/lozzd/mean-time-to-sleep-quantifying-the-oncall-experience] 71 — @jasonhand
  72. FINAL THOUGHTS: > Identify noisiest alerts and address them now > Bring more into the fold (Devs on-call) > Shift observability left > Share the pain > Share the information 72 — @jasonhand
  73. YOUR CHALLENGE: Examine your own Mean Time To Recover. Discuss additional methods of understanding data. 73 — @jasonhand
  74. INCENTIVE STRUCTURES Be mindful of incentive structures to "encourage" a reduction of MTTR. We may believe that our efforts are improving when the truth is they aren't. 74 — @jasonhand
  75. WHAT DOES YOUR DOWNTIME COST? How is MTTR impacting it? What else is impacting it? 75 — @jasonhand
  76. DEVOPS ROI 76 — @jasonhand

  77. ABOVE ALL... 77 — @jasonhand

  78. CONTINUOUSLY IMPROVE 78 — @jasonhand

  79. BEYOND THE MEAN TIME TO RECOVER @JASONHAND VICTOROPS 79 — @jasonhand
  80. THANK YOU 80 — @jasonhand

  81. Abstract Mean Time To Repair (MTTR) has long been the de facto metric for those tasked with the responsibility of uptime. It’s the cornerstone measurement of how well teams respond to service disruptions and a key performance metric that nearly all in IT should aim to consistently improve. Swift recovery provides far more benefits than attempts to engineer failure out of complex systems. As important as MTTR has become, the mean time to repair is no more than an average of how long it took to manage individual incidents (from acknowledgement to resolution) over the course of time. The number of data points during that time can vary greatly depending on the complexity and scale of systems. Furthermore, averages assume there is a normal event or that your data is a normal distribution. Anyone who has been on-call can attest that some incidents take longer to resolve than others, and that variance is something you shouldn’t ignore. Within any time-series dataset there are in fact high and low values hidden within the data. These outliers may indicate that while we think the time it takes to recover from failure is good, bad, or otherwise, many high values in our average distort or hide lower values, and vice versa. We may believe that our efforts to reduce the time it takes to recover from failure are in fact working when the truth is they’re not. In this talk, we’ll discuss the metric of Mean Time To Repair as well as additional methods of understanding data related to building and maintaining reliable systems at scale. MTTR must be made a priority for any IT team that habitually follows old-view approaches to incident response; however, a deeper understanding of the data provides much higher fidelity regarding the true health of your systems and the teams that support them. 81 — @jasonhand
  82. Additional resources: VictorOps.com (http://www.victorops.com) Kitchen Soap: Blogs by John Allspaw (http://www.kitchensoap.com) Signalvnoise.com (https://m.signalvnoise.com/) 82 — @jasonhand