Beyond The Mean Time To Recover (MTTR)

September 13, 2017




  1. "The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health -- that is, the MTTR." -- Benjamin Treynor Sloss [@jasonhand]
  2. "Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore through... Failure happens. This is a foregone conclusion when working with complex systems." -- John Allspaw (former CTO, Etsy)
  3. AVAILABILITY = MTBF / (MTBF + MTTR): A commonly used measurement of how often a system or service is available compared to the total time it should be usable [5]. [5]: Effective DevOps (Jennifer Davis, Katherine Daniels)
  4. COST OF DOWNTIME = Deployment Frequency × Change Failure Rate × Mean Time To Recover (MTTR) × Hourly Cost of Outage
  5. DEPLOYMENT FREQUENCY: How many times per year your own org deploys changes [1]. High performers = 1,460/year. Low performers = 7/year. [1]: State of DevOps (2016), Puppet Labs
  6. CHANGE FAILURE RATE: Percentage of changes that cause an outage in an organization. High performers = 0-15%. Low performers = 16-30%.
  7. MTTR (Mean Time To Recover): How long does it generally take to restore service when a service incident occurs (e.g., unplanned outage, service impairment)? [11] [11]: State of DevOps Report (2016)
  8. ARITHMETIC MEAN (ĂRˌĬTH-MĔTˈĬK MĒN) n. The value obtained by dividing the sum of a set of quantities by the number of quantities in the set. Also called AVERAGE.
  9. LIMITATIONS OF ARITHMETIC MEAN: In data sets that are skewed or where outliers are present, calculating the arithmetic mean often provides a misleading result [3]. [3]: Arithmetic Mean
  10. AVERAGES... The number of data points can vary greatly depending on the complexity and scale of systems. Furthermore, averages assume there is a normal event or that your data is a normal distribution. "Average is a horrible metric for optimizing performance" [13] -- Richard Thaler. [13]: Misbehaving: The Making of Behavioral Economics
  11. 5,600 ALERTS PER DAY: Customers w/ 50 seats or more [8]. [8]: Average skewed by some high-volume customers
  12. 66 INCIDENTS / DAY: Critical alerts [12]. [12]: Average based on customers w/ 50+ seats. NOTE: Again, average ¯\_(ツ)_/¯
  13. REASONS INCLUDE: Incidents are auto-resolving? Engineers are re-routing alerts? Incidents remain unresolved until postmortem? Engineers are constantly hitting “resolve” because ...
  14. SEV1 OR SEV2 OUTAGE: Just one extended outage can severely tip the scale on data points and normalization.
  15. FINAL THOUGHTS: > Identify noisiest alerts and address them now > Bring more into the fold (devs on-call) > Shift observability left > Share the pain > Share the information
  16. YOUR CHALLENGE: Examine your own Mean Time To Recover. Discuss additional methods of understanding data.
  17. INCENTIVE STRUCTURES: Be mindful of incentive structures to "encourage" a reduction of MTTR. We may believe that our efforts are improving when the truth is they aren't.
  18. WHAT DOES YOUR DOWNTIME COST? How is MTTR impacting it? What else is impacting it?
  19. Abstract: Mean Time To Repair (MTTR) has long been the de facto metric for those tasked with the responsibility of uptime. It’s the cornerstone measurement of how well teams respond to service disruptions and a key performance metric that nearly all in IT should aim to consistently improve. Swift recovery provides far more benefits than attempts to engineer failure out of complex systems. As important as MTTR has become, the mean time to repair is no more than an average of how long it took to manage individual incidents (from acknowledgement to resolution) over the course of time. The number of data points during that time can vary greatly depending on the complexity and scale of systems. Furthermore, averages assume there is a normal event or that your data is a normal distribution. Anyone who has been on-call can attest that some incidents take longer to resolve than others and that variance is something you shouldn’t ignore. Within any time-series dataset there are in fact high and low values hidden within the data. These outliers may mean that, whether we think the time it takes to recover from failure is good, bad, or otherwise, many high values in our average distort or hide lower values and vice versa. We may believe that our efforts to reduce the time it takes to recover from failure are in fact working when the truth is they’re not. In this talk, we’ll discuss the metric of Mean Time To Repair as well as additional methods of understanding data related to building and maintaining reliable systems at scale. MTTR must be made a priority for any IT team that habitually follows old-view approaches to incident response; however, a deeper understanding of the data provides much higher fidelity regarding the true health of your systems and the teams that support them.
  20. Additional resources: VictorOps.com (http://www.victorops.com), Kitchen Soap: Blogs by John Allspaw (http://www.kitchensoap.com), Signalvnoise.com (https://m.signalvnoise.com/)
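
The slides argue that a single extended outage can distort an average, and they challenge the reader to look at recovery data in other ways. The following is a minimal Python sketch of that idea, not something taken from the talk: the recovery times are made-up numbers, the function names are hypothetical, and the cost-of-downtime function assumes the four factors on that slide are multiplied together. It compares the mean with the median and a nearest-rank percentile, and also evaluates the availability formula from the slides.

```python
import math
import statistics

# Hypothetical recovery times in minutes (illustrative data only).
# The single 480-minute value stands in for one extended SEV1 outage.
recovery_minutes = [12, 9, 15, 11, 8, 14, 10, 13, 480, 9]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the data is less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), as stated on the slides."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def cost_of_downtime(deploys_per_year, change_failure_rate,
                     mttr_hours, hourly_cost):
    """Assumes the slide's four factors are multiplied together."""
    return deploys_per_year * change_failure_rate * mttr_hours * hourly_cost


def summarize(values):
    """Return the mean next to statistics that are less outlier-sensitive."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": percentile(values, 90),
        "max": max(values),
    }


if __name__ == "__main__":
    for name, value in summarize(recovery_minutes).items():
        print(f"{name}: {value}")
    print("availability:", availability(mtbf_hours=999, mttr_hours=1))
```

With this sample data the mean time to recover is 58.1 minutes, while the median is 11.5 minutes and nine of the ten incidents were resolved within 15 minutes. The mean alone would suggest a typical incident takes about an hour, which matches none of the incidents in the list.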