Designing an Incident Response Process at Scale

Opeyemi Onikute

Hi! My name is Opeyemi, and I’m an SRE at Cloudflare.
I have extensive experience in building resilience into engineering organisations, and I spend a lot of time thinking about observability and performance. In my spare time, I take on many hobbies, such as trying to convince my friends to get me a panda.

What this talk will cover
1. A theoretical model of Resilience
2. Introduction to Incident Response - What/Why?
3. Designing an Incident Response process
4. Optimizing the process

What this talk won’t cover
1. Security Incident Response
2. Introduction to Site Reliability Engineering
3. Setting up Monitoring/Observability
4. Troubleshooting techniques

"The thing that amazes you is not that your system goes down sometimes, it’s that it is up at all."
- Dr. Richard Cook

A theoretical model of Resilience
- Developed by Jens Rasmussen in the 1980s.
- Used to depict how close a system is to an accident.
- Explains how to find a balance between economic and engineering goals.

A theoretical model of Resilience
- Economic-Failure Boundary: Business concerns, e.g. cutting costs, race to launch before competitors.
- Unacceptable-Workload Boundary: Least-effort concerns, e.g. cutting corners to meet business demands.
- Accident Boundary: Where accidents happen.

What is an Incident?
- Any situation that adversely affects a company’s ability to serve its customers.
- Serious incidents have an impact on company revenue.
- Example: https://blog.cloudflare.com/october-2021-facebook-outage/

What is an Incident?
https://engineering.fb.com/2021/10/04/networking-traffic/outage/

Case Study/Example - Company Y
- Small startup
- Single engineer (e.g. CTO)
- Customers notify the company during serious issues
- MTTR is very high, e.g. 2 hours
- Always close to the accident boundary
[Figure labels: Operating Point, Economic-failure Boundary, Unacceptable-workload Boundary, Accident Boundary, Pressure for least-effort, Pressure for efficiency]

Case Study/Example - Company Y
- Hires x more engineers
- MTTR goes even higher. Why?
[Figure labels: Operating Point, Economic-failure Boundary, Unacceptable-workload Boundary, Accident Boundary, Pressure for efficiency, Pressure for least-effort]

What happens during incidents after the team grows?
Disjointed Effort
• Haphazard approach by engineers
• Duplicated workstreams
Inadequate Communication
• Difficult to provide updates to customers
Longer Resolution Time
• Disjointed effort prolongs the response time
Negative public perception
• Lack of communication frustrates customers
Revenue Loss
• Some customers even decide to leave as they believe they’ve been patient enough

How to do Incident Response?
1. Understand the nature of incidents 🧠
2. Decide on Incident Severities 🚦
3. Tune Monitoring and Observability to match 🚨
4. Establish Incident Responder Roles
5. Set up on-call schedules 📅
6. Set up communication channels ☎
7. Establish set of procedures for Incident Response 📝
8. Learn and optimise the process

#1 Understand the nature of Incidents - Severities
The Incident severity determines the level of attention required, e.g. how many teams need to be involved?

#1 Understand the nature of Incidents - States
Understanding Incident states helps teams communicate clearly with stakeholders.

#2 Decide on Incident Severities
What scenarios can trigger the most severe incidents? Some examples:
- Degraded availability of the entire system - SEV-1.
- Increased errors in only one product - SEV-2.
- Degraded availability for users in a specific region - SEV-3.
- Single degraded system in a highly-available cluster - SEV-4.
- Rare edge case from odd customer interaction with the system - SEV-5.
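
The severity examples above could be written down as a small lookup so that everyone classifies incidents the same way. This is only a sketch: the `Impact` flags and the `classify` helper are hypothetical names, and the order of the checks simply mirrors the SEV-1 to SEV-5 examples on the slide.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means a more severe incident."""
    SEV1 = 1  # degraded availability of the entire system
    SEV2 = 2  # increased errors in only one product
    SEV3 = 3  # degraded availability for users in a specific region
    SEV4 = 4  # single degraded system in a highly-available cluster
    SEV5 = 5  # rare edge case from odd customer interaction


@dataclass(frozen=True)
class Impact:
    """Hypothetical description of what is currently affected."""
    entire_system_degraded: bool = False
    product_errors_elevated: bool = False
    region_degraded: bool = False
    redundant_node_degraded: bool = False


def classify(impact: Impact) -> Severity:
    """Return the most severe level that matches the observed impact."""
    if impact.entire_system_degraded:
        return Severity.SEV1
    if impact.product_errors_elevated:
        return Severity.SEV2
    if impact.region_degraded:
        return Severity.SEV3
    if impact.redundant_node_degraded:
        return Severity.SEV4
    return Severity.SEV5


if __name__ == "__main__":
    print(classify(Impact(region_degraded=True)).name)
```

Checking the widest impact first means an incident that matches several rows is always given the more severe level.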

#3 Tune Monitoring and Alerting
The monitoring and alerting should reflect the severities. Some rules:
1. All alerts should have the right severity.
2. Only SEV-1 to SEV-3 levels should kick off a full response process.
3. Each service should be catalogued along with its priority and runbooks.
4. Alerting should catch general availability issues and assign the highest priority.
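
One way to picture these rules is a small routing function that sits between the alerting system and the responders. Everything named here (`Service`, `Alert`, `CATALOGUE`, `route`, the example services and runbook URLs) is made up for illustration, and sending SEV-4 and SEV-5 alerts to a ticket is an assumption; the slide only says they should not start a full response.

```python
from dataclasses import dataclass
from typing import Optional

# Rule 2: only SEV-1 to SEV-3 kick off the full response process.
FULL_RESPONSE_SEVERITIES = {1, 2, 3}


@dataclass(frozen=True)
class Service:
    name: str
    priority: int
    runbook: str


# Rule 3: a (hypothetical) service catalogue with priorities and runbooks.
CATALOGUE = {
    "payments-api": Service("payments-api", 1, "https://runbooks.example.com/payments-api"),
    "report-export": Service("report-export", 3, "https://runbooks.example.com/report-export"),
}


@dataclass(frozen=True)
class Alert:
    service: str
    severity: Optional[int]
    general_availability: bool = False


def route(alert: Alert) -> dict:
    """Decide what an alert should trigger."""
    # Rule 1: refuse alerts that do not carry a valid severity.
    if alert.severity not in range(1, 6):
        raise ValueError(f"alert for {alert.service} has no valid severity")
    # Rule 4: general availability issues always get the highest priority.
    severity = 1 if alert.general_availability else alert.severity
    service = CATALOGUE.get(alert.service)
    return {
        "severity": severity,
        "action": "full_response" if severity in FULL_RESPONSE_SEVERITIES else "ticket",
        "runbook": service.runbook if service else None,
    }


if __name__ == "__main__":
    print(route(Alert("payments-api", 2)))
```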

#4 Establishing Incident Responder Roles

#5 Set up on-call schedules
An on-call schedule rotates the workload and prevents individual burnout.
Source: https://support.pagerduty.com/docs/schedule-basics
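
As a rough illustration of what "rotating the workload" means, here is a minimal round-robin rotation. It is not how PagerDuty or any other tool implements schedules; the `on_call_for` function, the weekly shift length and the engineer names are all assumptions.

```python
from datetime import date


def on_call_for(day: date, engineers: list, start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day`, rotating through `engineers` in order."""
    if not engineers:
        raise ValueError("the rotation needs at least one engineer")
    if day < start:
        raise ValueError("day is before the rotation started")
    # Count how many full shifts have passed, then wrap around the list.
    shifts_elapsed = (day - start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]


if __name__ == "__main__":
    rota = ["ada", "bola", "chidi"]  # hypothetical engineers
    print(on_call_for(date(2024, 1, 10), rota, start=date(2024, 1, 1)))
```

With three engineers and weekly shifts, each person is on call one week in three, which is the burnout protection the slide describes.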

#6 Set up communication channels
Internal Communication: Chat, Video
- Incident Announcements Channel, Dedicated Incidents Channel with threads, etc.
Source: https://slack.com/intl/en-gb/resources/using-slack/slack-for-incident-management

#6 Set up communication channels
External Communication: Status Page
Source: https://status.paystack.com

Setting up the Process - what do you actually do?
1. Determine Severity
   • Pick the correct severity to determine the attention required.
   • Don’t be afraid to get it wrong.
2. Notify
   • Establish internal/external communication channels, e.g. Chat, Video Call, Status Page, Incident JIRA ticket, etc.
3. Communicate
   • Occurs throughout the incident, in every step.
   • Information is recorded by the scribe and reflected in the incident status.
4. Triage
   • IC coordinates with the SMEs to determine the root cause.
5. Remediate
   • SMEs push a fix.
   • Use break-glass release process if necessary.
   • Speed up releases using known mechanisms.
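
To make the scribe's job concrete, the steps above could be tracked with a small incident record that keeps a timeline and produces the latest status line. This is a sketch only: the `Incident` class, its fields and the status format are invented for illustration and are not taken from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The response steps named on the slide.
STEPS = ("determine_severity", "notify", "communicate", "triage", "remediate")


@dataclass
class Incident:
    title: str
    severity: int
    timeline: list = field(default_factory=list)

    def record(self, step: str, note: str, at: datetime = None) -> None:
        """Scribe entry: what happened, in which step, and when."""
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        at = at or datetime.now(timezone.utc)
        self.timeline.append((at, step, note))

    def status_update(self) -> str:
        """One-line status built from the most recent timeline entry."""
        if not self.timeline:
            return f"[SEV-{self.severity}] {self.title}: no updates yet"
        _, step, note = self.timeline[-1]
        return f"[SEV-{self.severity}] {self.title} ({step}): {note}"


if __name__ == "__main__":
    incident = Incident("Elevated API errors", severity=2)
    incident.record("notify", "Incident channel and status page opened")
    incident.record("triage", "IC and SMEs investigating a recent deploy")
    print(incident.status_update())
```

Keeping the timeline as data means the same entries can later feed the incident report.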

#7 Setting up the procedure
Setting up the Process - what do you actually do?
6. Incident Report
   • Summary of the incident.
   • Internal and external report.
7. Incident Reviews
   • Regular reviews of incidents over a cycle.
   • Indicates long-running problems that need prioritization.
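
A periodic incident review can start from two simple numbers: how long incidents took to resolve, and which services keep coming back. The sketch below assumes MTTR is read as mean time to resolve and that each incident is a dict with `service`, `started` and `resolved` keys; both the field names and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list) -> timedelta:
    """Average of (resolved - started) across the review cycle."""
    if not incidents:
        raise ValueError("no incidents in this review cycle")
    durations = [i["resolved"] - i["started"] for i in incidents]
    return sum(durations, timedelta()) / len(durations)


def recurring_services(incidents: list, threshold: int = 2) -> list:
    """Services with at least `threshold` incidents in the cycle."""
    counts = Counter(i["service"] for i in incidents)
    return sorted(name for name, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    cycle = [
        {"service": "payments-api", "started": datetime(2024, 1, 3, 10, 0),
         "resolved": datetime(2024, 1, 3, 12, 0)},
        {"service": "payments-api", "started": datetime(2024, 1, 17, 9, 0),
         "resolved": datetime(2024, 1, 17, 10, 0)},
        {"service": "report-export", "started": datetime(2024, 1, 20, 14, 0),
         "resolved": datetime(2024, 1, 20, 14, 30)},
    ]
    print(mean_time_to_resolve(cycle), recurring_services(cycle))
```

Services that show up repeatedly are candidates for the "long-running problems that need prioritization" mentioned above.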

Optimising the Process
1. Practice
2. Automate all the things
3. Set up process reminders

Optimising the Process - #1 Practice

Optimising the Process - #2 Automate the things

Optimising the Process - #3 Set up Process Reminders

Company Y now has a process
- MTTR is reduced
- Customers appreciate knowing what is going on
- Constant feedback cycle helps with prioritisation of reliability
- Fewer incidents
- Farther away from the accident boundary
[Figure labels: Operating Point, Economic-failure Boundary, Unacceptable-workload Boundary, Accident Boundary, Pressure for efficiency, Pressure for least-effort, Incident prevention techniques]

Summary
1. Admit you have a problem.
2. Decide on a process that works for you.
3. Preach.
4. Practice, practice.
5. Use the process and iterate.

Opeyemi Onikute - @opeyemi.jpg, @Ope__O
