[DevDojo] Mercari's Incident Management Process

mercari
December 23, 2022

In this course, we explain incident management at Mercari and its best practices. We share a complete incident journey covering three phases: before, during, and after the incident. We also cover how we conduct incident reviews and improve retrospective quality throughout the company.

Transcript

  1. Mercari’s Incident Management Process
    Fall 2022 Dev Dojo
    @maruti

  2. 2
    Agenda
    01 Incident Management Background
    02 Incident Management Best Practices

  3. 3
    01. Background : What is an incident?
    Incident: “An unplanned interruption to an IT service or reduction in the quality of an IT service.”
    -- Schnepp, Rob. Incident Management for Operations (p. 1). O'Reilly Media. Kindle Edition.

  4. 4
    01. Background : Normal Operation vs Incident
    Incidents are NOT Day to Day Operation
    ● Declare an Incident explicitly!
    ● Incident state requires a special set of rules
    ● Declare when the Incident is over (Resolved)
    Goals
    ● Return to normal operation with as little impact as possible
    ● As fast as possible
    ● Follow up with a Postmortem shortly after

  5. 5
    01. Background : Incident States

  6. 6
    01. Background : Incident severity levels
    Check the detailed flow to estimate incident severity.
    Severity  General description
    SEV1      Highly critical issue that warrants public notification and liaison with executive teams
    SEV2      Critical issue actively impacting many customers' ability to use the product
    ---- Anything above this line is considered a Major Incident. ----
    SEV3      Customer-impacting issues that require immediate attention from service owners
    ---- Anything above this line is considered an Incident impacting customers. ----
    SEV4      Issues requiring action, but not affecting customers' ability to use the product. This includes internal tool incidents, cron job failure incidents, or potential risks that can lead to incidents if no action is taken.
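
The severity table can be read as a top-down decision: check the most severe condition first and fall through to SEV4. The sketch below is only an illustration of that reading. The function name, the three boolean inputs, and the helper predicates are assumptions made for this example; the "detailed flow" mentioned on the slide is not reproduced here and may use different criteria.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the slide; a lower number means more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def estimate_severity(needs_public_notification: bool,
                      many_customers_blocked: bool,
                      customer_impacting: bool) -> Severity:
    """Hypothetical estimator: the three inputs are assumed simplifications
    of the general descriptions in the table, checked from most to least severe."""
    if needs_public_notification:
        return Severity.SEV1
    if many_customers_blocked:
        return Severity.SEV2
    if customer_impacting:
        return Severity.SEV3
    return Severity.SEV4


def is_major_incident(sev: Severity) -> bool:
    # "Anything above this line is considered a Major Incident": SEV1 and SEV2.
    return sev <= Severity.SEV2


def is_customer_impacting_incident(sev: Severity) -> bool:
    # "Anything above this line ... impacting customers": SEV1 through SEV3.
    return sev <= Severity.SEV3


# Example: a failed internal cron job that no customer notices.
sev = estimate_severity(False, False, False)
print(sev.name, is_major_incident(sev), is_customer_impacting_incident(sev))
```

Using an ordered enum keeps the two "above this line" rules as simple comparisons instead of hard-coded lists of levels.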

  7. 7
    01. Background : Incident Roles

  8. 8
    01. Background : Incident Commander
    ● Takes complete ownership of the outcome of the incident
    ● Not necessarily the most senior person
    ● Should not be casually replaced during an incident
    ● Assembles the team and delegates responsibilities as appropriate
    ● Single source of truth for what’s happening and what’s planned
    ● Develops and maintains the IAP (Incident Action Plan)
    ● Manages and updates the Conditions, Actions, Needs (CAN) report

  9. 9
    01. Background : Communications Lead
    ● Communicates with entities beyond the response team
    ● Similar to Public Information Officer
    ● “Voice of the Incident Commander”
    ● Passes information from outside the incident to the Incident Commander or Technical Lead

  10. 10
    01. Background : Technical Lead
    ● Expected to be an SME (Subject Matter Expert)
    ● Responsible for the execution of technical tasks
    ● Advises the Incident Commander on technical decisions and gives updates
    ● “The hands of the Incident Commander”
    ● Defers to the Incident Commander for policy and planning decisions

  11. 11
    01. Background : CAN report
    ● Conditions
      ○ Type of Incident
      ○ Current Status of incident including State and Severity Level
      ○ Summary
      ○ Blast Radius (Customer Impact)
    ● Actions
      ○ What is being done
      ○ Who is doing it
    ● Needs
      ○ Additional personnel or resources
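
The CAN outline above maps naturally onto a small record type. The following is a minimal sketch, assuming a plain-text rendering is wanted for a chat channel; the class names, field names, and the sample values (including the state label "Investigating") are hypothetical and not taken from Mercari's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    what: str  # what is being done
    who: str   # who is doing it


@dataclass
class CanReport:
    # Conditions
    incident_type: str
    state: str
    severity: str
    summary: str
    blast_radius: str  # customer impact
    # Actions and Needs
    actions: list[Action] = field(default_factory=list)
    needs: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as plain text in Conditions / Actions / Needs order."""
        lines = [
            "Conditions",
            f"  Type: {self.incident_type}",
            f"  Status: {self.state} / {self.severity}",
            f"  Summary: {self.summary}",
            f"  Blast radius: {self.blast_radius}",
            "Actions",
        ]
        lines += [f"  {a.what} ({a.who})" for a in self.actions] or ["  (none)"]
        lines.append("Needs")
        lines += [f"  {n}" for n in self.needs] or ["  (none)"]
        return "\n".join(lines)


# Illustrative values only.
report = CanReport(
    incident_type="Service degradation",
    state="Investigating",
    severity="SEV3",
    summary="Elevated error rate on checkout",
    blast_radius="Some customers cannot complete purchases",
    actions=[Action("Rolling back the latest release", "Technical Lead")],
)
print(report.render())
```

Keeping the report as structured data lets the Incident Commander update one field at a time while the rendered text stays in a consistent order.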

  12. 12
    01. Background : Incident Timeline
    MTTA : Mean Time To Acknowledge
    MTTR : Mean Time To Resolve
    MTRS : Mean Time To Restore Service
    MTTR = Incident impact duration
    MTRS = Customer impact duration
    [Figure: timeline. Labels along the Time axis: Normal Operation, Start of an Incident, Incident Acknowledged, Incident Resolved (= End of an incident), Normal Operation, Next Incident (distant future :). Marked spans: MTTA, MTTR, MTRS.]

  13. 13
    01. Background : MTTA
    MTTA: Mean Time To Acknowledge
    MTTA is the time taken to acknowledge an incident after the incident has actually started.

  14. 14
    01. Background : MTTR
    MTTR : Mean Time To Resolve
    MTTR is the time taken to resolve an incident after the incident is acknowledged.
    It is equal to the incident impact duration, that is, the time teams and members spend resolving the incident.

  15. 15
    01. Background : MTRS
    MTRS : Mean Time To Restore Service
    MTRS = MTTA + MTTR
    MTRS is the total time taken to resolve an incident after the incident has started.
    It is also equivalent to the customer impact duration.
    Customer impact duration = Time to acknowledge (MTTA) + Incident impact duration by teams (MTTR)
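
As a worked example of these definitions, the sketch below derives the three durations for each incident from its start, acknowledgment, and resolution timestamps, then averages them. The `IncidentTimes` class, the `mean_minutes` helper, and the two sample incidents are made up for illustration. Because the mean is linear, MTRS = MTTA + MTTR holds as long as all three are computed over the same set of incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentTimes:
    started: datetime
    acknowledged: datetime
    resolved: datetime

    @property
    def time_to_acknowledge(self) -> timedelta:
        return self.acknowledged - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        # Measured from acknowledgment, as defined on the MTTR slide.
        return self.resolved - self.acknowledged

    @property
    def time_to_restore(self) -> timedelta:
        # Customer impact duration: from start to resolution.
        return self.resolved - self.started


def mean_minutes(durations) -> float:
    """Average a collection of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


# Two made-up incidents.
incidents = [
    IncidentTimes(datetime(2022, 12, 1, 10, 0),
                  datetime(2022, 12, 1, 10, 5),
                  datetime(2022, 12, 1, 10, 45)),
    IncidentTimes(datetime(2022, 12, 8, 14, 0),
                  datetime(2022, 12, 8, 14, 15),
                  datetime(2022, 12, 8, 15, 15)),
]

mtta = mean_minutes(i.time_to_acknowledge for i in incidents)
mttr = mean_minutes(i.time_to_resolve for i in incidents)
mtrs = mean_minutes(i.time_to_restore for i in incidents)
print(f"MTTA={mtta} min, MTTR={mttr} min, MTRS={mtrs} min")
```

For the sample data the per-incident values are 5/40/45 and 15/60/75 minutes, so the means come out to 10, 50, and 60 minutes respectively.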

  16. 16
    02. Best Practices : Incident response
    ● Prioritize : Stop the bleeding, restore service, and preserve the evidence for root causing
    ● Prepare : Develop and document your incident management procedures in advance...
    ● Trust : Give full autonomy within the assigned role to all incident participants.
    ● Introspect : Pay attention to your emotional state while responding to an incident…
    ● Consider alternatives : Periodically consider your options and re-evaluate whether it still makes sense to continue
    ● Practice : Use the process routinely so it becomes second nature.
    ● Change it around : Were you incident commander last time? Take on a different role this time.
    Ref :
    -- Stribblehill, Andrew. “Chapter 14.” Site Reliability Engineering, edited by Kavita Guliani,
    http://landing.google.com/sre/sre-book/chapters/managing-incidents/#id-MJbuNS0Fd

  17. 17
    02. Best Practices : Decreasing MTTA
    ● Improve Monitoring
    ● “Panic Button” for Customer Support
    ● Automatic Incident Triggering
    ● Automatic Response Team Alerting (Paging)
    ● Automatic Construction of Communication channels (chat, voice bridge)
    ● Established procedures THAT ARE PRACTICED!

  18. 18
    02. Best Practices : Decreasing MTTR
    ● Codified Incident Process
    ● If only we could orchestrate parallel paths of investigation
    ● Multiple SMEs running multiple “swimlanes”
    ● Discipline in following process (or Consistency & Dedication)

  19. 19
    02. Best Practices : Decreasing MTTR
    ● Codified Incident Process
    ● Proper Training for your Incident Response Team
    ● Practice, Practice, Practice
    ● Discipline or, if you prefer... Consistency & Dedication
    ● Archive, Analyze and Learn from your Postmortems
    ● Did I forget to mention Discipline?

  20. 20
    02. Best Practices : Incident management process
    ● Predictable
    ● Repeatable
    ● Optimized
    ● Clear
    ● Evaluated
    ● Scalable
    ● Sustainable

  21. 21
    02. Best Practices : Incident postmortem
    “Without a predictable way to respond to incidents, any organization — growing or mature — is at risk.”
    -- Schnepp, Rob. Incident Management for Operations. O'Reilly Media. Kindle Edition.

  22. 22
    02. Best Practices : Incident postmortem process
    ● Assign Postmortem owner
    ● Complete the timeline
    ● Schedule meeting to collaborate on postmortem
    ● Discuss & assign actionable follow-up actions
    ● Complete follow-up actions
    ● Share the learnings out

  23. 23
    02. Best Practices : Postmortem Time Consumption

  24. 24
    02. Best Practices : Successful Postmortem
    ● Clear ownership
    ● Context & Key Details
    ● On Time Completion
    ● Tracked follow-up actions
    ● Blameless language
    ● Referenceability

  25. 25
    02. Best Practices : Results of Successful Postmortems
    ● Less blame
    ● Less toil
    ● Less panic
    ● Continuous improvement & faster delivery
    ● Happy & successful customers

  26. 26
    Thank you!
