The Formula for Faster Outage Recovery

The anatomy of outages
The formula for faster recovery
Maxim Schepelin · Booking.com

The formula
Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair
Detect: time from the moment something breaks to when you know about it.
Acknowledge: time from knowing about the problem to someone jumping in to handle it.
Repair: time from acknowledgment to full resolution.
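As a worked illustration of the formula, the sketch below splits one incident into its three phases. The `Incident` class, its field names, and the timestamps are hypothetical and chosen only for this example; the talk does not prescribe a data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Illustrative timestamps for the four moments the formula relies on.
    broke_at: datetime         # something breaks
    detected_at: datetime      # you know about it
    acknowledged_at: datetime  # someone starts handling it
    resolved_at: datetime      # full resolution

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.broke_at

    @property
    def time_to_acknowledge(self) -> timedelta:
        return self.acknowledged_at - self.detected_at

    @property
    def time_to_repair(self) -> timedelta:
        return self.resolved_at - self.acknowledged_at

    @property
    def outage_duration(self) -> timedelta:
        # Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair
        return self.time_to_detect + self.time_to_acknowledge + self.time_to_repair


# Made-up incident: breaks at 03:00, detected 03:12, acknowledged 03:20, fixed 04:05.
incident = Incident(
    broke_at=datetime(2025, 1, 1, 3, 0),
    detected_at=datetime(2025, 1, 1, 3, 12),
    acknowledged_at=datetime(2025, 1, 1, 3, 20),
    resolved_at=datetime(2025, 1, 1, 4, 5),
)

print("detect:     ", incident.time_to_detect)
print("acknowledge:", incident.time_to_acknowledge)
print("repair:     ", incident.time_to_repair)
print("total:      ", incident.outage_duration)
```

Here the three terms are 12, 8, and 45 minutes, so the outage lasts 65 minutes. Breaking the total down this way shows which phase offers the most room to improve.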

Time to Detect
Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair

Define Service-Level Objectives (SLOs)
“X must be true Y percentage of the time”
Examples
Technical: 99% of page loads must be under 300ms.
Technical: 99.9% of API requests must succeed.
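The "X must be true Y percentage of the time" pattern can be checked with a few lines of code. This is a minimal sketch, assuming you already collect raw measurements; the function name `slo_met`, the sample latencies, and the request counts are made up for illustration.

```python
def slo_met(good_events: int, total_events: int, target_percent: float) -> bool:
    """Return True if at least target_percent of events were good."""
    if total_events == 0:
        # Assumption for this sketch: no traffic means no violation.
        return True
    return 100.0 * good_events / total_events >= target_percent


# Hypothetical page load times in milliseconds.
page_loads_ms = [120, 180, 250, 90, 310, 200, 150, 280, 100, 220]
fast_loads = sum(1 for ms in page_loads_ms if ms < 300)

# "99% of page loads must be under 300ms."
latency_slo_ok = slo_met(fast_loads, len(page_loads_ms), target_percent=99.0)

# "99.9% of API requests must succeed." (hypothetical request counts)
availability_slo_ok = slo_met(good_events=99_950, total_events=100_000, target_percent=99.9)

print("latency SLO met:     ", latency_slo_ok)
print("availability SLO met:", availability_slo_ok)
```

In this sample only 9 of 10 page loads are under 300ms, so the latency SLO is missed, while 99.95% of requests succeed and the availability SLO holds.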

Measure your SLOs

Include business metrics in your SLOs
Why do users interact with that system? What do they need from it?
Examples
Technical: 99% of page loads must be under 300ms.
Technical: 99.9% of API requests must succeed.
Business: 99.9% of money transfers must be processed successfully.
Business: 99% of marketing emails must be sent within 15 minutes.
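A business SLO is measured on what the user needed, not on whether the servers were up. The sketch below checks the email example against a 15-minute deadline; the `(queued_at, sent_at)` record shape and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timedelta

DEADLINE = timedelta(minutes=15)  # "sent within 15 minutes"
TARGET_PERCENT = 99.0             # "99% of marketing emails"


def percent_sent_on_time(emails):
    """emails: list of (queued_at, sent_at) pairs; sent_at is None if never sent."""
    if not emails:
        return 100.0  # assumption: nothing queued means nothing violated
    on_time = sum(
        1
        for queued_at, sent_at in emails
        if sent_at is not None and sent_at - queued_at <= DEADLINE
    )
    return 100.0 * on_time / len(emails)


# Hypothetical batch: two on time, one late, one never sent.
t0 = datetime(2025, 1, 1, 9, 0)
emails = [
    (t0, t0 + timedelta(minutes=2)),
    (t0, t0 + timedelta(minutes=14)),
    (t0, t0 + timedelta(minutes=20)),
    (t0, None),
]

on_time_percent = percent_sent_on_time(emails)
print(f"{on_time_percent:.1f}% on time, SLO met: {on_time_percent >= TARGET_PERCENT}")
```

Note that the email that was never sent counts against the objective even though no request failed from the infrastructure's point of view.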

Your system is fine, your users are not
schepelin.com/go/slo

Time to Acknowledge
Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair

Your on-call process is the bottleneck
When an alert fires:
1. Who responds to it?
2. What if it fires outside office hours?
3. Do people even know they’re expected to respond?

On-call process: Identify critical services
Focus on services where an outage directly impacts customers or revenue.

Service | Impact of downtime
checkout-service (Payment & order finalization) | Users cannot complete purchases → direct revenue loss.
payment-gateway (Card processing & fraud checks) | All transactions declined → complete store halt.
basket-api (Shopping cart state) | Unable to add/remove items → purchase funnel blocked.

On-call process: Map ownership
Every critical service must be owned by a team.

Service | Impact of downtime | Owners
checkout-service (Payment & order finalization) | Users cannot complete purchases → direct revenue loss. | Team X
payment-gateway (Card processing & fraud checks) | All transactions declined → complete store halt. | Team Y
basket-api (Shopping cart state) | Unable to add/remove items → purchase funnel blocked. | Team Z
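One way to keep the ownership map honest is to store it as data and check it automatically. This is a minimal sketch: the service and team names come from the slide, while `unowned_services` and the extra `search-api` entry are hypothetical.

```python
# Critical services and their owners, as listed on the slide.
CRITICAL_SERVICES = {"checkout-service", "payment-gateway", "basket-api"}
OWNERS = {
    "checkout-service": "Team X",
    "payment-gateway": "Team Y",
    "basket-api": "Team Z",
}


def unowned_services(critical, owners):
    """Return the critical services that have no owning team, sorted by name."""
    return sorted(service for service in critical if not owners.get(service))


print(unowned_services(CRITICAL_SERVICES, OWNERS))

# Hypothetical: a new critical service is added without assigning an owner.
print(unowned_services(CRITICAL_SERVICES | {"search-api"}, OWNERS))
```

A check like this could run in CI, so a critical service without an owner is caught before an alert fires with nobody to receive it.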

On-call process: Create schedule
Assign people to time blocks, so there’s always someone ready.
Use tooling like PagerDuty, OpsGenie, or Spike to manage the schedule.
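"Always someone ready" means the time blocks must leave no gaps. The sketch below models a schedule as plain data and looks for uncovered windows. It does not use the PagerDuty, OpsGenie, or Spike APIs; the shift data, engineer names, and function names are all hypothetical.

```python
from datetime import datetime

# Hypothetical shifts: (start, end, engineer); the end time is exclusive.
SHIFTS = [
    (datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 17), "alice"),
    (datetime(2025, 1, 6, 17), datetime(2025, 1, 7, 1), "bob"),
    (datetime(2025, 1, 7, 2), datetime(2025, 1, 7, 9), "carol"),
]


def on_call_at(shifts, moment):
    """Return the engineer on call at the given moment, or None if nobody is."""
    for start, end, engineer in shifts:
        if start <= moment < end:
            return engineer
    return None


def coverage_gaps(shifts):
    """Return (gap_start, gap_end) pairs where no shift covers the time."""
    ordered = sorted(shifts, key=lambda shift: shift[0])
    if not ordered:
        return []
    gaps = []
    covered_until = ordered[0][1]
    for start, end, _ in ordered[1:]:
        if start > covered_until:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps


print(on_call_at(SHIFTS, datetime(2025, 1, 6, 18)))
print(coverage_gaps(SHIFTS))
```

In this sample schedule nobody is on call between 01:00 and 02:00 on 7 January, which is exactly the kind of hole an alert outside office hours would fall into.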

On-Call Only Works With Buy-In
Forcing people into on-call never works. Being on-call is a choice every engineer makes.
Check contracts: Ensure employment agreements include on-call responsibilities.
Compensate fairly: Give engineers a real reason to be on-call: compensation, time-off, recognition, or a shared purpose.

Recap: Time to Acknowledge
1. Identify business-critical services
Focus on services where an outage directly impacts customers or revenue. Not everything is equally important.
2. Map service ownership
Every critical service must be owned by a team. Build a pool of people capable of operating each service.
3. Build an on-call schedule
Assign people to time blocks, so there’s always someone ready. Use tooling like PagerDuty, OpsGenie, or Spike to manage it.

Time to Repair
Outage Duration = Time to Detect + Time to Acknowledge + Time to Repair

Invest in observability
You need metrics to understand what’s going on with your system.
Image source: https://grafana.com/grafana/dashboards/16694-kubernetes-overview/

Training: ensure engineers know what to do
When you receive an alert at 3 a.m., you're in no shape to learn new things.
1. How to revert changes?
2. How to scale workloads?
3. How to fail over to another region?

Plan ahead: prepare for typical failure modes
Certain things will happen. Write down how to act when they do:
1. Sudden traffic spikes.
2. Slow or partitioned network.
3. Dependency failure.

Recap: Time to Repair
1. Invest in observability
Collect metrics to understand what’s going on with your systems.
2. Train engineers
Invest upfront in knowledge sharing and training: show how to use the tools. Let people practice in a safe environment.
3. Write playbooks for typical failure modes
Document known solutions to known problems so responders can act quickly.

The Formula in Practice
Detect: Define SLOs that reflect what your system is meant to do, not just infrastructure metrics.
Acknowledge: Build a structured on-call process: identify critical services, map ownership, create a schedule.
Repair: Invest in observability, train engineers to use the tools, plan for common failure modes.

Learn from every outage
Conduct a postmortem after every major incident to improve your response.
1. What happened? What was the impact?
2. How was the problem detected?
3. How efficient was the incident response?
4. Can this issue happen again?

You can’t control when an outage hits
But you can control how quickly you recover
Maxim Schepelin

Q&A and shameless self-promotion
For more insights on engineering leadership, check out my book: schepelin.com/go/book
