Beyond The Mean Time To Recover (MTTR)

j.hand
September 13, 2017


Transcript

  1. BEYOND THE
    MTTR
    (Mean Time To Recover)

  2. @JASONHAND
    JASON HAND
    VICTOROPS

  3. The most relevant
    metric in evaluating the
    effectiveness of
    emergency response is
    how quickly the
    response team can
    bring the system back
    to health 
    -- that is, the MTTR.
    — Benjamin Treynor Sloss

  4. HIGH
    AVAILABILITY &
    RELIABILITY

  5. 99.999 %
    UPTIME

  7. PREDICT &
    PREVENT

  8. WHAT ABOUT
    MTBF?
    (Mean Time Between Failure)

  9. COMPLEX
    SYSTEMS

  10. FAILURE

  11. Failure cares not about the architecture
    designs you slave over, the code you write
    and review, or the alerts and metrics you
    meticulously pore through ...
    ... Failure happens. This is a foregone
    conclusion when working with complex
    systems.
    — John Allspaw (Former CTO Etsy)

  13. AVAILABILITY =
    MTBF/(MTBF + MTTR)
    A commonly used measurement of how
    often a system or service is available
    compared to the total time it should be
    usable.5
    5 Effective DevOps (Jennifer Davis, Katherine Daniels)
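
As a rough sketch of how that formula behaves (the figures below are hypothetical, not taken from the deck), here is the availability calculation in Python, together with the downtime budget implied by a target such as the 99.999% mentioned earlier:

```python
# Availability as a function of MTBF and MTTR (both in hours), per the formula above.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical service: roughly one failure a month (MTBF ~730 h), one hour to recover.
print(f"{availability(730, 1):.5%}")  # ~99.863% -- well short of five nines

# Downtime budget implied by an availability target, over one year.
def downtime_minutes_per_year(target: float) -> float:
    return (1 - target) * 365.25 * 24 * 60

print(f"{downtime_minutes_per_year(0.99999):.2f} min/year")  # ~5.26 minutes for 99.999%
```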

  14. AVAILABILITY & RELIABILITY:
    THE RESULT OF A TEAM'S ABILITY TO...

  15. RESPOND
    & RECOVER
    QUICKLY

  18. HIGH-PERFORMING
    ORGANIZATIONS
    resolve production incidents 168 times faster than their
    peers 7
    7 State of DevOps

  24. WHAT IS THE
    ROI
    OF DEVOPS?

  26. HOW MUCH DID THAT
    OUTAGE
    COST THE COMPANY?

  34. LET'S CALCULATE

  35. COST OF DOWNTIME =
    Deployment Frequency
    × Change Failure Rate
    × Mean Time To Recover (MTTR)
    × Hourly Cost of Outage
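
A minimal sketch of the calculation the next few slides break down, assuming the four factors simply multiply together as written above; the example figures, especially the hourly outage cost, are hypothetical:

```python
# Cost of downtime = deploys/year x change failure rate x MTTR (hours) x hourly cost of outage
def cost_of_downtime(deploys_per_year: float,
                     change_failure_rate: float,
                     mttr_hours: float,
                     hourly_outage_cost: float) -> float:
    return deploys_per_year * change_failure_rate * mttr_hours * hourly_outage_cost

# Hypothetical example: 1,460 deploys/year, 10% change failure rate,
# 1 hour MTTR, $10,000/hour outage cost.
print(f"${cost_of_downtime(1460, 0.10, 1, 10_000):,.0f} per year")  # $1,460,000
```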

  36. DEPLOYMENT
    FREQUENCY

  37. DEPLOYMENT
    FREQUENCY
    How many times per year your own org
    deploys changes1
    High performers = 1,460/year
    Low performers = 7/year
    1 State of DevOps Report (2016), Puppet Labs

  38. CHANGE
    FAILURE RATE

  39. CHANGE
    FAILURE RATE
    Percentage of changes that cause an
    outage in an organization
    High performers = 0-15%
    Low performers = 16-30%

  40. MTTR
    & HOURLY COST OF OUTAGE

  42. WHAT IS
    MTTR
    EXACTLY?

  43. MTTR
    (Mean Time To Recover)
    How long does it generally take to
    restore service when a service incident
    occurs (e.g., unplanned outage, service
    impairment)?11
    11 State of DevOps Report (2016)

  44. BUT WHAT DOES
    MEAN
    MEAN?

  45. ARITHMETIC MEAN
    (ĂRˌĬTH-MĔTˈĬK MĒN)
    n. The value obtained by dividing the sum of a set of
    quantities by the number of quantities in the set
    ALSO CALLED AVERAGE
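
Written as a formula (standard notation, not from the slide), the mean of n recovery times t_1, ..., t_n is:

```latex
\mathrm{MTTR} = \bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i
```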

  46. AVERAGE
    (AS A METRIC)
    TELLS YOU NOTHING ABOUT THE
    ACTUAL INCIDENTS

  47. LIMITATIONS OF
    ARITHMETIC MEAN
    In data sets that are skewed or where outliers are
    present, calculating the arithmetic mean often provides
    a misleading result.3
    3 Arithmetic Mean
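
A quick, made-up illustration of that point using recovery times in minutes: one long SEV1-style outage (like the one described a few slides later) drags the mean well above what a typical incident looked like, while the median barely moves:

```python
from statistics import mean, median

# Hypothetical recovery times in minutes: nine routine incidents and one long outage.
recovery_minutes = [8, 10, 12, 9, 11, 10, 13, 9, 12, 480]

print(f"mean   (MTTR): {mean(recovery_minutes):.1f} min")    # ~57.4 -- dominated by the outlier
print(f"median       : {median(recovery_minutes):.1f} min")  # 10.5  -- closer to a typical incident
print(f"maximum      : {max(recovery_minutes)} min")         # 480   -- the outage itself
```

This is the same reason the deck later suggests tracking median and maximum time to recover alongside the mean.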

  48. DISTORTED
    BIG PICTURE

  49. AVERAGES...
    The number of data points can vary
    greatly depending on the complexity
    and scale of systems.
    Furthermore, averages assume there is
    a normal event or that your data is a
    normal distribution.13
    "Average is a horrible metric for
    optimizing performance."
    — Richard Thaler
    13 Misbehaving: The Making of Behavioral Economics

  50. 5,600
    ALERTS PER DAY
    Customers w/ 50 seats or more8
    8 Average skewed by some high volume customers

  51. 66
    INCIDENTS / DAY
    Critical Alerts 12
    12 Average based on customers w/ 50+ seats.
    NOTE: Again, an average ¯\_(ツ)_/¯

  53. MEAN TIME TO
    WTF

  54. REASONS INCLUDE:
    Incidents are auto-resolving?
    Engineers are re-routing alerts?
    Incidents remain unresolved until postmortem?
    Engineers are constantly hitting “resolve” because ...

  55. IS DOWN
    AGAIN!

  56. SEV1 OR SEV2
    OUTAGE
    Just one extended outage can severely
    tip the scales, skewing your data points
    away from anything like a normal distribution

  57. WHAT ELSE SHOULD WE
    MEASURE?

  58. MEDIAN
    TIME TO RECOVER
    More robust to outliers.

  59. MAXIMUM
    TIME TO RECOVER

  60. VOLUME
    OF ALERTS
    BY SEVERITY
    AND TOTAL

  61. TOTAL NUMBER
    OF OUTAGES
    to determine total downtime
    (Related: Service Level Agreements)
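
One way to act on that, sketched with invented outage durations: sum the downtime for a period and compare it against the budget an SLA allows (the 99.9% monthly target below is only an example, not from the deck):

```python
# Hypothetical outage durations (minutes) for one month.
outages_minutes = [12, 3, 45, 7]

total_downtime = sum(outages_minutes)
minutes_in_month = 30 * 24 * 60
sla_target = 0.999                                 # example SLA, not from the deck
allowed_downtime = (1 - sla_target) * minutes_in_month

print(f"{len(outages_minutes)} outages, {total_downtime} min total downtime")
print(f"SLA budget: {allowed_downtime:.0f} min -> "
      f"{'OK' if total_downtime <= allowed_downtime else 'SLA breached'}")
```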

  62. NOISY HOSTS OR SERVICES

  63. ALERT
    ACTIONABILITY

  64. ALERT TYPES
    INFO, WARNING, CRITICAL

  65. ALERT
    VOLUME
    / day
    (pssst ...careful of avg)

  66. ALERT TIMES

  67. MTTA
    MEAN TIME TO
    ACKNOWLEDGE
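
For reference, a small sketch of how MTTA and MTTR fall out of the same incident records; the field names and timestamps are invented and not tied to VictorOps or any particular tool:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the alert fired, was acknowledged, and was resolved.
incidents = [
    {"alerted": "2017-09-01 02:14", "acked": "2017-09-01 02:21", "resolved": "2017-09-01 02:55"},
    {"alerted": "2017-09-03 11:02", "acked": "2017-09-03 11:04", "resolved": "2017-09-03 11:30"},
]

def minutes(start: str, end: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

mtta = mean(minutes(i["alerted"], i["acked"]) for i in incidents)     # alert -> acknowledge
mttr = mean(minutes(i["alerted"], i["resolved"]) for i in incidents)  # alert -> recover

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```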

  70. BIOMETRIC DATA

  71. SLEEP DISRUPTION
    (MEAN TIME TO SLEEP)15
    15 https://www.slideshare.net/lozzd/mean-time-to-sleep-quantifying-the-oncall-experience

  72. FINAL THOUGHTS:
    > Identify the noisiest alerts and address them now
    > Bring more into the fold (Devs on-call)
    > Shift observability left
    > Share the pain
    > Share the information

  73. YOUR CHALLENGE:
    Examine your own
    Mean Time To Recover
    Discuss additional methods of
    understanding data

  74. INCENTIVE
    STRUCTURES
    Be mindful of incentive structures intended
    to "encourage" a reduction in MTTR
    We may believe that our efforts are
    improving when the truth is they
    aren't.

  75. WHAT DOES
    YOUR
    DOWNTIME COST?
    How is MTTR impacting it?
    What else is impacting it?

  76. DEVOPS
    ROI

  77. ABOVE
    ALL...

  78. CONTINUOUSLY
    IMPROVE

  79. BEYOND THE
    MEAN TIME TO RECOVER
    @JASONHAND
    VICTOROPS

  80. THANK
    YOU

  81. Abstract
    Mean Time To Repair (MTTR) has long been the de facto metric for those tasked with the responsibility of uptime. It's the
    cornerstone measurement of how well teams respond to service disruptions and a key performance metric that nearly
    all in IT should aim to consistently improve. Swift recovery provides far more benefit than attempts to engineer failure
    out of complex systems.
    As important as MTTR has become, the mean time to repair is no more than an average of how long it took to manage
    individual incidents (from acknowledgement to resolution) over a period of time. The number of data points during that time
    can vary greatly depending on the complexity and scale of systems. Furthermore, averages assume there is a normal
    event or that your data follows a normal distribution. Anyone who has been on-call can attest that some incidents take
    longer to resolve than others, and that variance is something you shouldn't ignore.
    Within any time-series dataset there are high and low values hidden in the data. These outliers may indicate that
    while we think the time it takes to recover from failure is good, bad, or otherwise, the many high values in our average
    distort or hide the lower values, and vice versa. We may believe that our efforts to reduce the time it takes to recover
    from failure are working when the truth is they are not.
    In this talk, we'll discuss the metric of Mean Time To Repair as well as additional methods of understanding data related
    to building and maintaining reliable systems at scale. MTTR must be a priority for any IT team that habitually
    follows old-view approaches to incident response; however, a deeper understanding of the data provides much higher
    fidelity regarding the true health of your systems and the teams that support them.

  82. Additional resources:
    VictorOps.com (http://www.victorops.com)
    Kitchen Soap: Blogs by John Allspaw (http://www.kitchensoap.com)
    Signalvnoise.com (https://m.signalvnoise.com/)