The Formula for Faster Outage Recovery

Production outages can be extremely costly. For a global business, the cost of just one minute of downtime can exceed an engineer’s annual compensation. So, how can engineering leaders strengthen incident response across their organizations?

After over a decade of building and operating systems used by millions of people, I’ve distilled an approach that helps engineering leaders strengthen their teams’ incident management practices, based on a simple formula:

Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair.

To resolve outages quickly, we need to be efficient in all three stages. But shortening the time of each stage requires a coordinated mix of technical, process, and cultural changes. This is where engineering leaders can truly enable and empower their teams.

We’ll unpack each component and examine practical strategies, from tooling investments and observability practices to cultural habits and on-call readiness, that can dramatically shorten outage duration. By the end of the talk, you’ll know how to set up your teams for success when they face an outage.

More engineering insight in my book: Engineering Manager’s Compass: Insights for building effective engineering organizations

Maxim Schepelin

May 08, 2026

Transcript

  1. The formula

    Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair

    Detect: Time from the moment something breaks to when you know it.
    Acknowledge: Time from knowing about the problem to someone jumping in to handle it.
    Repair: Time from acknowledgment to full resolution.
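As a rough illustration of the formula, the three stages can be derived from four timestamps recorded for an incident. This is only a sketch: the `IncidentTimeline` class, its field names, and the sample timestamps are assumptions made for the example, not something from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the four timestamps that bound an outage."""
    started: datetime       # the moment something breaks
    detected: datetime      # the moment you know about it
    acknowledged: datetime  # the moment someone jumps in to handle it
    resolved: datetime      # the moment of full resolution

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_acknowledge(self) -> timedelta:
        return self.acknowledged - self.detected

    @property
    def time_to_repair(self) -> timedelta:
        return self.resolved - self.acknowledged

    @property
    def outage_duration(self) -> timedelta:
        # Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair
        return self.time_to_detect + self.time_to_acknowledge + self.time_to_repair


# Made-up sample incident: 12 min to detect, 8 min to acknowledge, 45 min to repair.
incident = IncidentTimeline(
    started=datetime(2026, 1, 10, 3, 0),
    detected=datetime(2026, 1, 10, 3, 12),
    acknowledged=datetime(2026, 1, 10, 3, 20),
    resolved=datetime(2026, 1, 10, 4, 5),
)
print(incident.outage_duration)  # 1:05:00
```

Tracking the stages separately shows where the time actually goes. In this made-up incident, repair dominates, so that is where improvement would pay off most.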
  2. Time to Detect

    Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair
  3. Define Service-Level Objectives (SLOs)

    “X must be true Y percentage of the time”

    Examples
    Technical: 99% of page loads must be under 300ms.
    Technical: 99.9% of API requests must succeed.
  4. Include business metrics in your SLOs

    Why do users interact with that system? What do they need from it?

    Examples
    Technical: 99% of page loads must be under 300ms.
    Technical: 99.9% of API requests must succeed.
    Business: 99.9% of money transfers must be processed successfully.
    Business: 99% of marketing emails must be sent within 15 minutes.
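One way to make "X must be true Y percentage of the time" checkable is to count good events against total events and compare the ratio with the target. The sketch below assumes a hypothetical `SLO` class and made-up sample numbers. It evaluates one technical and one business objective from the examples above, and is not tied to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO: 'X must be true Y percent of the time'."""
    name: str
    target_percent: float

    def compliance(self, good_events: int, total_events: int) -> float:
        """Percentage of events that satisfied the objective."""
        if total_events == 0:
            return 100.0  # assumption: no events counts as compliant
        return 100.0 * good_events / total_events

    def is_met(self, good_events: int, total_events: int) -> bool:
        return self.compliance(good_events, total_events) >= self.target_percent


page_load_slo = SLO("page loads under 300ms", target_percent=99.0)
transfer_slo = SLO("money transfers processed successfully", target_percent=99.9)

# Made-up sample: page load times in milliseconds (one of ten is too slow).
page_load_ms = [120, 180, 250, 310, 90, 140, 200, 275, 160, 220]
fast_loads = sum(1 for ms in page_load_ms if ms < 300)
page_ok = page_load_slo.is_met(fast_loads, len(page_load_ms))

# Made-up sample: 99,950 of 100,000 transfers succeeded (99.95%).
transfers_ok = transfer_slo.is_met(good_events=99_950, total_events=100_000)

print(page_ok, transfers_ok)  # False True
```

An alert tied to `is_met` turning false fires when users are affected, which is the point of putting business metrics next to the technical ones.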
  5. Time to Acknowledge

    Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair
  6. Your on-call process is the bottleneck

    When an alert fires:
    1. Who responds to it?
    2. What if it fires outside office hours?
    3. Do people even know they’re expected to respond?
  7. On-call process: Identify critical services

    Focus on services where an outage directly impacts customers or revenue.

    | Service | Impact of downtime |
    | --- | --- |
    | checkout-service (Payment & order finalization) | Users cannot complete purchases → direct revenue loss. |
    | payment-gateway (Card processing & fraud checks) | All transactions declined → complete store halt. |
    | basket-api (Shopping cart state) | Unable to add/remove items → purchase funnel blocked. |
  8. On-call process: Map ownership

    Every critical service must be owned by a team.

    | Service | Impact of downtime | Owners |
    | --- | --- | --- |
    | checkout-service (Payment & order finalization) | Users cannot complete purchases → direct revenue loss. | Team X |
    | payment-gateway (Card processing & fraud checks) | All transactions declined → complete store halt. | Team Y |
    | basket-api (Shopping cart state) | Unable to add/remove items → purchase funnel blocked. | Team Z |
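The ownership mapping can also be kept as data, so a missing owner is caught before an alert has nowhere to go. The following is a minimal sketch under that assumption. The service and team names come from the slide. The helper functions and the extra `search-api` service are hypothetical.

```python
from typing import Optional

# Critical services and owners as listed on the slide.
CRITICAL_SERVICES = ["checkout-service", "payment-gateway", "basket-api"]

OWNERS = {
    "checkout-service": "Team X",
    "payment-gateway": "Team Y",
    "basket-api": "Team Z",
}


def unowned_services(critical: list[str], owners: dict[str, str]) -> list[str]:
    """Return critical services that have no owning team."""
    return [service for service in critical if not owners.get(service)]


def owner_for_alert(service: str, owners: dict[str, str]) -> Optional[str]:
    """Return the team that should receive an alert for this service, if any."""
    return owners.get(service)


print(unowned_services(CRITICAL_SERVICES, OWNERS))  # []
print(owner_for_alert("payment-gateway", OWNERS))   # Team Y

# "search-api" is a hypothetical service added without an owner.
print(unowned_services(CRITICAL_SERVICES + ["search-api"], OWNERS))  # ['search-api']
```

A check like `unowned_services` could run whenever the service list changes, which keeps the rule "every critical service must be owned by a team" from drifting.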
  9. On-call process: Create schedule

    Assign people to time blocks, so there’s always someone ready. Use tooling like PagerDuty, OpsGenie, or Spike to manage the schedule.
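To illustrate what "always someone ready" means, the sketch below models a schedule as a list of shifts, looks up who is on call at a given moment, and reports uncovered time. Tools such as PagerDuty or OpsGenie handle this for you. The code does not use their APIs, and the `Shift` class, function names, and engineer names are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def on_call_at(shifts: list[Shift], moment: datetime) -> Optional[str]:
    """Return the engineer on call at `moment`, or None if nobody is."""
    for shift in shifts:
        if shift.start <= moment < shift.end:
            return shift.engineer
    return None


def coverage_gaps(shifts: list[Shift], window_start: datetime,
                  window_end: datetime) -> list[tuple[datetime, datetime]]:
    """Return the intervals inside the window that no shift covers."""
    gaps = []
    cursor = window_start  # everything before the cursor is already accounted for
    for shift in sorted(shifts, key=lambda s: s.start):
        if shift.start > cursor:
            gaps.append((cursor, min(shift.start, window_end)))
        cursor = max(cursor, shift.end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Hypothetical schedule: two shifts that leave the early morning uncovered.
shifts = [
    Shift("alice", datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 17, 0)),
    Shift("bob", datetime(2026, 1, 5, 17, 0), datetime(2026, 1, 6, 1, 0)),
]

print(on_call_at(shifts, datetime(2026, 1, 5, 18, 0)))  # bob
print(on_call_at(shifts, datetime(2026, 1, 6, 3, 0)))   # None
gaps = coverage_gaps(shifts, datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 6, 9, 0))
```

Here `gaps` contains a single interval, from 01:00 to 09:00 on January 6. An alert fired in that interval would reach nobody.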
  10. On-Call Only Works With Buy-In

    Forcing people into on-call never works. Being on-call is a choice every engineer makes.

    Check contracts: Ensure employment agreements include on-call responsibilities.
    Compensate fairly: Give engineers a real reason to be on-call: compensation, time-off, recognition, or a shared purpose.
  11. Recap: Time to Acknowledge

    1. Identify business-critical services: Focus on services where an outage directly impacts customers or revenue. Not everything is equally important.
    2. Map service ownership: Every critical service must be owned by a team. Build a pool of people capable of operating each service.
    3. Build an on-call schedule: Assign people to time blocks, so there’s always someone ready. Use tooling like PagerDuty, OpsGenie, or Spike to manage it.
  12. Time to Repair

    Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair
  13. Invest in observability

    You need metrics to understand what’s going on with your system.

    Image source: https://grafana.com/grafana/dashboards/16694-kubernetes-overview/
  14. Training: ensure engineers know what to do

    When you receive an alert at 3 a.m., you’re in no state to learn new things.

    1. How to revert changes?
    2. How to scale workloads?
    3. How to fail over to another region?
  15. Plan ahead: prepare for typical failure modes

    Certain things will happen. Write down how to act when they do:

    1. Sudden traffic spikes.
    2. Slow or partitioned network.
    3. Dependency failure.
  16. Recap: Time to Repair

    1. Invest in observability: Collect metrics to understand what’s going on with your systems.
    2. Train engineers: Invest upfront in knowledge sharing and training: show how to use the tools. Let people practice in a safe environment.
    3. Write playbooks for typical failure modes: Document known solutions to known problems to improve efficiency.
  17. The Formula in Practice

    Detect: Define SLOs that reflect what your system is meant to do, not just infrastructure metrics.
    Acknowledge: Build a structured on-call process: identify critical services, map ownership, create a schedule.
    Repair: Invest in observability, train engineers to use tools, plan for common failure modes.
  18. Learn from every outage

    Conduct a postmortem after every major incident to improve your response.

    1. What happened? What was the impact?
    2. How was the problem detected?
    3. How efficient was the incident response?
    4. Can this issue happen again?
  19. You can’t control when an outage hits

    But you can control how quickly you recover

    Maxim Schepelin