Designing an Incident Response Process at Scale

An incident is any situation that negatively affects customers' use of your product/service. Without a well-designed process, responding to these can be a nightmare. This talk will help you understand how to build an incident response process that scales - including responder roles, communication strategies, optimization tips, and resilience theory.

As we build more complex systems, we have started to wonder not why our systems fail sometimes, but how they even stay up at all. Companies are increasingly tasked with creating a process that reduces the Mean Time to Resolve (MTTR) and subsequent loss of revenue.

Along with learning about what your Incident Response process should look like, you will also learn what it takes to make this process work for your team and how to convince relevant stakeholders within your organization that this is worth doing.

Opeyemi Onikute

September 16, 2023

Transcript

  1. Designing an Incident
    Response Process at
    Scale.
    Opeyemi Onikute

  2. Hi! My name is Opeyemi, I’m an SRE
    at Cloudflare.
    I have extensive experience in building
    resilience into engineering organisations,
    and I spend a lot of time thinking about
    observability and performance.
    In my spare time, I take on many hobbies
    such as trying to convince my friends to
    get me a Panda.

  3. What this talk will cover
    1. A theoretical model of Resilience
    2. Introduction to Incident Response - What/Why?
    3. Designing an Incident Response process
    4. Optimizing the process

  4. What this talk won’t cover
    1. Security Incident Response
    2. Introduction to Site Reliability Engineering
    3. Setting up Monitoring/Observability
    4. Troubleshooting techniques

  5. The thing that amazes you is not that your system goes
    down sometimes — it’s that it is up at all
    Dr. Richard Cook

  6. A theoretical
    model of
    Resilience
    - Developed by Jens Rasmussen in the
    1980s.
    - Used to depict how close a system is to
    an accident.
    - Explains how to find a balance between
    economic and engineering goals.

  7. A theoretical model of Resilience
    - Economic-Failure Boundary: Business concerns, e.g. cutting costs, race to launch before competitors.
    - Unacceptable-Workload Boundary: Least-effort concerns, e.g. cutting corners to meet business demands.
    - Accident Boundary: Where accidents happen.

  8. What is an
    Incident?
    - Any situation that adversely affects a company’s
    ability to serve its customers.
    - Serious incidents have an impact on company
    revenue.
    - Example: https://blog.cloudflare.com/october-2021-facebook-outage/

  9. What is an
    Incident?
    https://engineering.fb.com/2021/10/04/networking-traffic/outage/

  10. Case Study/Example - Company Y
    - Small startup
    - Single engineer (e.g. CTO)
    - Customers notify during serious issues
    - MTTR is very high, e.g. 2 hours
    - Always close to the accident boundary
    [Figure: the operating point relative to the economic-failure, unacceptable-workload, and accident boundaries, under pressure for least-effort and pressure for efficiency.]

  11. Case Study/Example - Company Y
    - Hires x more engineers
    - MTTR goes even higher. Why?
    [Figure: the operating point relative to the economic-failure, unacceptable-workload, and accident boundaries, under pressure for efficiency and pressure for least-effort.]

  12. What happens during incidents after the team grows?
    Disjointed Effort
    ● Haphazard approach by engineers
    ● Duplicated workstreams
    Inadequate Communication
    ● Difficult to provide updates to customers
    Longer Resolution Time
    ● Disjointed effort prolongs the response time
    Negative public perception
    ● Lack of communication frustrates customers
    Revenue Loss
    ● Some customers even decide to leave as they believe they’ve been patient enough

  13. How to do
    Incident
    Response?
    1. Understand the nature of incidents 🧠
    2. Decide on Incident Severities 🚦
    3. Tune Monitoring and Observability to match 🚨
    4. Establish Incident Responder Roles
    5. Set up on-call schedules 📅
    6. Set up communication channels ☎
    7. Establish set of procedures for Incident Response 📝
    8. Learn and optimise the process

  14. #1 Understand the nature of Incidents - Severities
    The Incident severity determines the level of attention required, e.g. how many teams need to be involved.

  15. #1 Understand the nature of Incidents - States
    Understanding Incident states helps teams communicate clearly with stakeholders.

  16. #2 Decide on Incident Severities
    What scenarios can trigger the most severe incidents?
    Some examples:
    - Degraded availability of the entire system - SEV-1.
    - Increased errors in only one product - SEV-2.
    - Degraded availability for users in a specific region - SEV-3.
    - Single degraded system in a highly-available cluster - SEV-4.
    - Rare edge case from odd customer interaction with the system - SEV-5.
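
The example severities above could be encoded so that alerts and tooling share one definition. This is a minimal sketch, not a prescribed scheme: the scenario labels and the `classify` helper are hypothetical, and defaulting unknown scenarios to SEV-3 is an assumption made here so that a human still looks at them.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe, mirroring SEV-1 to SEV-5 on the slide."""
    SEV1 = 1  # degraded availability of the entire system
    SEV2 = 2  # increased errors in only one product
    SEV3 = 3  # degraded availability for users in a specific region
    SEV4 = 4  # single degraded system in a highly-available cluster
    SEV5 = 5  # rare edge case from odd customer interaction


# Hypothetical scenario labels; a real team would define its own.
SCENARIO_SEVERITY = {
    "system_wide_degradation": Severity.SEV1,
    "single_product_errors": Severity.SEV2,
    "regional_degradation": Severity.SEV3,
    "single_node_degraded_in_ha_cluster": Severity.SEV4,
    "rare_edge_case": Severity.SEV5,
}


def classify(scenario: str) -> Severity:
    """Return the severity for a known scenario.

    Unknown scenarios fall back to SEV-3 (an assumption of this sketch).
    """
    return SCENARIO_SEVERITY.get(scenario, Severity.SEV3)
```

Using an `IntEnum` keeps comparisons such as `severity <= Severity.SEV3` readable when deciding how much attention an incident needs.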

  17. #3 Tune Monitoring and Alerting
    The monitoring and alerting should reflect the severities.
    Some rules:
    1. All alerts should have the right severity.
    2. Only SEV-1 to SEV-3 levels should kick off a full response process.
    3. Each service should be catalogued along with its priority and runbooks.
    4. Alerting should catch general availability issues and assign the highest priority.
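
Rules 2 and 3 above can be sketched as a small routing function. Everything named here is hypothetical (the `Service` and `Alert` types, the catalogue entries, the `runbooks.example.com` URLs); it only illustrates the idea that an alert carries a severity, that SEV-1 to SEV-3 start the full response, and that the catalogue supplies the runbook.

```python
from dataclasses import dataclass

FULL_RESPONSE_MAX_SEV = 3  # rule 2: only SEV-1 to SEV-3 kick off a full response


@dataclass(frozen=True)
class Service:
    name: str
    priority: int  # 1 = most critical (assumed convention)
    runbook_url: str


@dataclass(frozen=True)
class Alert:
    service: str
    severity: int  # 1..5, matching SEV-1 to SEV-5
    summary: str


# Hypothetical service catalogue (rule 3).
CATALOGUE = {
    "checkout-api": Service("checkout-api", 1, "https://runbooks.example.com/checkout-api"),
    "report-export": Service("report-export", 3, "https://runbooks.example.com/report-export"),
}


def route(alert: Alert) -> dict:
    """Decide whether an alert starts the full response and which runbook applies."""
    if not 1 <= alert.severity <= 5:
        raise ValueError(f"severity must be 1..5, got {alert.severity}")
    service = CATALOGUE.get(alert.service)
    return {
        "full_response": alert.severity <= FULL_RESPONSE_MAX_SEV,
        "catalogued": service is not None,
        "runbook": service.runbook_url if service else None,
    }
```

An alert for an uncatalogued service still routes, but `catalogued` is false, which is a useful signal that the catalogue needs updating.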

  18. #4 Establishing Incident Responder Roles

  19. #5 Set up on-call schedules
    An on-call schedule rotates the workload and prevents individual burnout.
    Source: https://support.pagerduty.com/docs/schedule-basics
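
To make the rotation idea concrete, here is a minimal round-robin sketch. It is not how any particular paging tool works; the function name, the weekly shift length, and the engineer names in the usage note are assumptions for illustration.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str], start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day` for a simple round-robin rotation.

    `start` is the first day of the first shift; each shift lasts `shift_days`.
    """
    if not engineers:
        raise ValueError("at least one engineer is required")
    if shift_days <= 0:
        raise ValueError("shift_days must be positive")
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // shift_days
    return engineers[shift_index % len(engineers)]
```

For example, with `engineers = ["ada", "bayo", "chi"]` and a rotation starting on Monday 4 September 2023, `on_call_for(date(2023, 9, 11), engineers, date(2023, 9, 4))` falls in the second weekly shift and returns `"bayo"`.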

  20. #6 Set up communication channels
    Internal Communication: Chat, Video
    - Incident Announcements Channel, Dedicated Incidents Channel with threads etc
    Source: https://slack.com/intl/en-gb/resources/using-slack/slack-for-incident-management

  21. #6 Set up communication channels
    External Communication: Status Page
    Source: https://status.paystack.com

  22. Setting up the Process - what do you actually do?
    Determine Severity
    ● Set the correct severity to determine the required attention.
    ● Don’t be afraid to get it wrong.
    Notify
    ● Establish internal/external communication channels e.g. Chat, Video Call, Status Page, Incident JIRA ticket etc.
    Communicate
    ● Occurs throughout the incident, in every step.
    ● Information is recorded by the scribe and in the incident status.
    Triage
    ● IC coordinates with the SMEs to determine the root cause.
    Remediate
    ● SMEs push a fix.
    ● Use break-glass release process if necessary.
    ● Speed up releases using known mechanisms.

  23. #7 Setting up the procedure
    Setting up the Process - what do you actually do?
    6. Incident Report
    ● Summary of the incident.
    ● Internal and external report.
    7. Incident Reviews
    ● Regular reviews of incidents over a cycle.
    ● Indicates long-running problems that need prioritization.
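
A periodic review can start from two simple numbers: the Mean Time to Resolve for the cycle, and which services keep appearing. The sketch below assumes each incident is a `(started, resolved, service)` tuple; that record shape, the function names, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents) -> timedelta:
    """Average of (resolved - started) over (started, resolved, service) records."""
    durations = [resolved - started for started, resolved, _ in incidents]
    if not durations:
        raise ValueError("no incidents in this review cycle")
    return sum(durations, timedelta()) / len(durations)


def recurring_services(incidents, min_count: int = 2) -> list[str]:
    """Services with at least `min_count` incidents, most frequent first."""
    counts = Counter(service for _, _, service in incidents)
    return [service for service, n in counts.most_common() if n >= min_count]
```

Three incidents lasting 2 hours, 1 hour, and 30 minutes give an MTTR of 70 minutes; a service that shows up repeatedly in `recurring_services` is a candidate for the long-running problems the review is meant to surface.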

  24. Optimising the Process
    1. Practice
    2. Automate all the things
    3. Set up process reminders

  25. Optimising the Process
    #1 Practice

  26. Optimising the Process
    #2 Automate the things

  27. Optimising the Process
    #3 Set up Process Reminders

  28. Company Y now has a process
    - MTTR is reduced
    - Customers appreciate knowing what
    is going on
    - Constant feedback cycle helps with
    prioritisation of reliability
    - Fewer incidents
    - Farther away from the accident
    boundary
    [Figure: the operating point relative to the economic-failure, unacceptable-workload, and accident boundaries, under pressure for efficiency and pressure for least-effort, with incident prevention techniques marked.]

  29. Summary
    1. Admit you have a problem.
    2. Decide on a process that works for you.
    3. Preach.
    4. Practice, practice.
    5. Use the process and iterate.

  30. @opeyemi.jpg
    @Ope__O
    Opeyemi Onikute