Designing an Incident Response Process at Scale

An incident is any situation that negatively affects customers' use of your product/service. Without a well-designed process, responding to these can be a nightmare. This talk will help you understand how to build an incident response process that scales - including responder roles, communication strategies, optimization tips, and resilience theory.

As we build more complex systems, we have started to wonder not why our systems fail sometimes, but how they even stay up at all. Companies are increasingly tasked with creating a process that reduces the Mean Time to Resolve (MTTR) and subsequent loss of revenue.

Along with learning about what your Incident Response process should look like, you will also learn what it takes to make this process work for your team and how to convince relevant stakeholders within your organization that this is worth doing.

Opeyemi Onikute

September 16, 2023

Transcript

  1. Hi! My name is Opeyemi, I’m an SRE at Cloudflare.

    I have extensive experience in building resilience into engineering organisations, and I spend a lot of time thinking about observability and performance. In my spare time, I take on many hobbies such as trying to convince my friends to get me a Panda.
  2. What this talk will cover

    1. A theoretical model of Resilience
    2. Introduction to Incident Response - What/Why?
    3. Designing an Incident Response process
    4. Optimizing the process
  3. What this talk won’t cover

    1. Security Incident Response
    2. Introduction to Site Reliability Engineering
    3. Setting up Monitoring/Observability
    4. Troubleshooting techniques
  4. "The thing that amazes you is not that your system goes down sometimes - it’s that it is up at all."

    Dr. Richard Cook
  5. A theoretical model of Resilience

    - Developed by Jens Rasmussen in the 1980s.
    - Used to depict how close a system is to an accident.
    - Explains how to find a balance between economic and engineering goals.
  6. A theoretical model of Resilience

    - Economic-Failure Boundary: Business concerns, e.g. cutting costs, racing to launch before competitors.
    - Unacceptable-Workload Boundary: Least-effort concerns, e.g. cutting corners to meet business demands.
    - Accident Boundary: Where accidents happen.
  7. What is an Incident?

    - Any situation that adversely affects a company’s ability to serve its customers.
    - Serious incidents have an impact on company revenue.
    - Example: https://blog.cloudflare.com/october-2021-facebook-outage/
  8. Case Study/Example - Company Y

    - Small startup
    - Single engineer (e.g. CTO)
    - Customers notify the company during serious issues
    - MTTR is very high, e.g. 2 hours
    - Always close to the accident boundary

    [Figure: the Operating Point sits between the Economic-failure Boundary, the Unacceptable-workload Boundary and the Accident Boundary, under Pressure for efficiency and Pressure for least-effort.]
  9. Case Study/Example - Company Y

    - Hires x more engineers
    - MTTR goes even higher. Why?

    [Figure: the Operating Point sits between the Economic-failure Boundary, the Unacceptable-workload Boundary and the Accident Boundary, under Pressure for efficiency and Pressure for least-effort.]
  10. What happens during incidents after the team grows?

    Disjointed Effort
    • Haphazard approach by engineers
    • Duplicated workstreams

    Inadequate Communication
    • Difficult to provide updates to customers

    Longer Resolution Time
    • Disjointed effort prolongs the response time

    Negative public perception
    • Lack of communication frustrates customers

    Revenue Loss
    • Some customers even decide to leave as they believe they’ve been patient enough
  11. How to do Incident Response?

    1. Understand the nature of incidents 🧠
    2. Decide on Incident Severities 🚦
    3. Tune Monitoring and Observability to match 🚨
    4. Establish Incident Responder Roles
    5. Set up on-call schedules 📅
    6. Set up communication channels ☎
    7. Establish set of procedures for Incident Response 📝
    8. Learn and optimise the process
  12. #1 Understand the nature of Incidents - Severities

    The incident severity determines the level of attention required, e.g. how many teams need to be involved.
  13. #1 Understand the nature of Incidents - States

    Understanding incident states helps teams communicate clearly with stakeholders.
  14. #2 Decide on Incident Severities

    What scenarios can trigger the most severe incidents? Some examples:
    - Degraded availability of the entire system - SEV-1.
    - Increased errors in only one product - SEV-2.
    - Degraded availability for users in a specific region - SEV-3.
    - Single degraded system in a highly-available cluster - SEV-4.
    - Rare edge case from odd customer interaction with the system - SEV-5.
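The severity examples above can be written down as a small lookup table. This is only a sketch: the scenario keys and the `classify` helper are hypothetical names, and defaulting unknown scenarios to SEV-3 is an assumption made here so that a human still looks at them.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe, mirroring SEV-1 to SEV-5 above."""
    SEV1 = 1  # degraded availability of the entire system
    SEV2 = 2  # increased errors in only one product
    SEV3 = 3  # degraded availability for users in a specific region
    SEV4 = 4  # single degraded system in a highly-available cluster
    SEV5 = 5  # rare edge case from odd customer interaction


# Hypothetical scenario labels; a real catalogue is organisation-specific.
SCENARIO_SEVERITY = {
    "entire_system_degraded": Severity.SEV1,
    "single_product_errors": Severity.SEV2,
    "regional_degradation": Severity.SEV3,
    "single_node_degraded_in_ha_cluster": Severity.SEV4,
    "rare_edge_case": Severity.SEV5,
}


def classify(scenario: str) -> Severity:
    """Look up a scenario; unknown scenarios fall back to SEV-3 (assumption)."""
    return SCENARIO_SEVERITY.get(scenario, Severity.SEV3)
```

Using an `IntEnum` keeps comparisons such as "at least as severe as SEV-3" readable, since `Severity.SEV1 < Severity.SEV3` holds.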
  15. #3 Tune Monitoring and Alerting

    The monitoring and alerting should reflect the severities. Some rules:
    1. All alerts should have the right severity.
    2. Only SEV-1 to SEV-3 levels should kick off a full response process.
    3. Each service should be catalogued along with its priority and runbooks.
    4. Alerting should catch general availability issues and assign them the highest priority.
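As a rough illustration of rules 1 to 3, the sketch below checks that an alert carries a valid severity and a runbook, then decides whether it starts the full response process. The `Alert` fields, the example URL and the two routing labels are assumptions for illustration, not part of any particular alerting tool.

```python
from dataclasses import dataclass

# Rule 2: only SEV-1 to SEV-3 kick off a full response process.
FULL_RESPONSE_MAX_SEV = 3


@dataclass(frozen=True)
class Alert:
    service: str
    severity: int      # 1 (most severe) to 5 (least severe)
    runbook_url: str   # rule 3: every catalogued service has a runbook


def validate(alert: Alert) -> None:
    """Reject alerts without a valid severity (rule 1) or a runbook (rule 3)."""
    if not 1 <= alert.severity <= 5:
        raise ValueError(f"{alert.service}: severity must be between 1 and 5")
    if not alert.runbook_url:
        raise ValueError(f"{alert.service}: missing runbook")


def route(alert: Alert) -> str:
    """Return a hypothetical routing label for the alert."""
    validate(alert)
    if alert.severity <= FULL_RESPONSE_MAX_SEV:
        return "start_full_response"
    return "create_ticket"
```

For example, `route(Alert("checkout", 2, "https://runbooks.example/checkout"))` starts a full response, while the same alert at severity 4 only creates a ticket.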
  16. #5 Set up on-call schedules

    An on-call schedule rotates the workload and prevents individual burnout.
    Source: https://support.pagerduty.com/docs/schedule-basics
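A simple round-robin rotation shows the idea behind such a schedule. This sketch is not how any scheduling product works internally; the engineer names, the start date and the weekly shift length are all assumptions.

```python
from datetime import date
from typing import Sequence


def on_call_for(day: date, engineers: Sequence[str],
                rotation_start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day` in a fixed-length round-robin rotation."""
    if not engineers:
        raise ValueError("the rotation needs at least one engineer")
    if shift_days <= 0:
        raise ValueError("shift_days must be positive")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Count completed shifts since the start, then wrap around the roster.
    shifts_elapsed = (day - rotation_start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]
```

With a roster of three engineers and weekly shifts, each person is on call one week in three, which is the workload spreading the slide describes.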
  17. #6 Set up communication channels

    Internal Communication: Chat, Video - Incident Announcements Channel, Dedicated Incidents Channel with threads, etc.
    Source: https://slack.com/intl/en-gb/resources/using-slack/slack-for-incident-management
  18. Setting up the Process - what do you actually do?

    1. Determine Severity
    • A correct severity determines the required attention.
    • Don’t be afraid to get it wrong.

    2. Notify
    • Establish internal/external communication channels, e.g. Chat, Video Call, Status Page, Incident JIRA ticket, etc.

    3. Communicate
    • Occurs throughout the incident, in every step.
    • The scribe records information and the incident status.

    4. Triage
    • IC coordinates with the SMEs to determine the root cause.

    5. Remediate
    • SMEs push a fix.
    • Use break-glass release process if necessary.
    • Speed up releases using known mechanisms.
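One way to picture these steps is an incident record that moves through them in order while a scribe appends to a timeline at every step. The sketch below is an assumption-heavy illustration: the `Incident` class, its field names and the step labels are hypothetical, and communication is modelled as the `log` call rather than as a separate step because it occurs throughout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple

# Ordered steps; communication is not listed because it happens in every step.
STEPS = ["determine_severity", "notify", "triage", "remediate"]


@dataclass
class Incident:
    title: str
    severity: int
    step_index: int = 0
    timeline: List[Tuple[datetime, str, str]] = field(default_factory=list)

    @property
    def step(self) -> str:
        return STEPS[self.step_index]

    def log(self, message: str) -> None:
        """Scribe entry: timestamp, current step and a free-text note."""
        self.timeline.append((datetime.now(timezone.utc), self.step, message))

    def advance(self) -> str:
        """Move to the next step and record the transition."""
        if self.step_index == len(STEPS) - 1:
            raise ValueError("already at the final step")
        self.step_index += 1
        self.log(f"moved to {self.step}")
        return self.step
```

Keeping the timeline on the incident itself means the later incident report can be assembled from the scribe's entries instead of from memory.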
  19. #7 Setting up the procedure

    Setting up the Process - what do you actually do?

    6. Incident Report
    • Summary of the incident.
    • Internal and external report.

    7. Incident Reviews
    • Regular reviews of incidents over a cycle.
    • Indicates long-running problems that need prioritization.
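A periodic review can start from two numbers: the Mean Time to Resolve over the cycle and the services that keep reappearing. The sketch below assumes each incident record carries a start time, a resolution time and a service name; those fields and the `repeat_offenders` helper are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class IncidentRecord:
    started: datetime
    resolved: datetime
    service: str


def mean_time_to_resolve(records: Sequence[IncidentRecord]) -> timedelta:
    """Average of (resolved - started) across the review cycle."""
    if not records:
        raise ValueError("no incidents in this cycle")
    total = sum((r.resolved - r.started for r in records), timedelta())
    return total / len(records)


def repeat_offenders(records: Sequence[IncidentRecord],
                     min_count: int = 2) -> List[str]:
    """Services with at least `min_count` incidents, most frequent first."""
    counts = Counter(r.service for r in records)
    return [svc for svc, n in counts.most_common() if n >= min_count]
```

Services returned by `repeat_offenders` are candidates for the long-running problems that the review is meant to surface.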
  20. Company Y now has a process

    - MTTR is reduced
    - Customers appreciate knowing what is going on
    - Constant feedback cycle helps with prioritisation of reliability
    - Fewer incidents
    - Farther away from the accident boundary

    [Figure: the Operating Point sits between the Economic-failure Boundary, the Unacceptable-workload Boundary and the Accident Boundary, under Pressure for efficiency and Pressure for least-effort, with Incident prevention techniques also labelled.]
  21. Summary

    1. Admit you have a problem.
    2. Decide on a process that works for you.
    3. Preach.
    4. Practice, practice.
    5. Use the process and iterate.