[DevDojo] Mercari's Incident Management Process

mercari
December 23, 2022

In this course, we explain incident management at Mercari and its best practices. We share a complete incident journey covering three phases: before, during, and after the incident. We also cover how we conduct incident reviews and improve retrospective quality throughout the company.

Transcript

  1. Mercari’s Incident management process Fall 2022 Dev Dojo @maruti

  2. Agenda

    01 Incident Management Background
    02 Incident Management Best Practices
  3. 01. Background : What is an incident?

    “An unplanned interruption to an IT service or reduction in the quality of an IT service.”
    -- Schnepp, Rob. Incident Management for Operations (p. 1). O'Reilly Media. Kindle Edition.

  4. 01. Background : Normal Operation vs Incident

    Incidents are NOT day-to-day operation
    • Declare an incident explicitly!
    • The incident state requires a special set of rules
    • Declare when the incident is over (Resolved)

    Goals
    • Return to normal operation with as little impact as possible
    • As fast as possible
    • Follow up with a postmortem shortly after
  5. 01. Background : Incident States

  6. 01. Background : Incident severity levels

    Check the detailed flow to estimate incident severity.

    Severity   General description
    SEV1       Highly critical issue that warrants public notification and liaison with executive teams
    SEV2       Critical issue actively impacting many customers' ability to use the product
               (Anything above this line is considered a Major Incident.)
    SEV3       Customer-impacting issues that require immediate attention from service owners
               (Anything above this line is considered an Incident impacting customers.)
    SEV4       Issues requiring action, but not affecting customers' ability to use the product. This includes internal tool incidents, cron job failure incidents, or potential risks that can lead to incidents if no action is taken.
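
The two dividing lines in the severity table can be read as simple threshold checks. The following is a minimal Python sketch of that reading, not Mercari's actual tooling; the names `Severity`, `is_major_incident`, and `is_customer_impacting` are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the slide; a lower number means more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def is_major_incident(sev: Severity) -> bool:
    # SEV1 and SEV2 sit above the "Major Incident" line.
    return sev <= Severity.SEV2


def is_customer_impacting(sev: Severity) -> bool:
    # SEV1 to SEV3 sit above the "Incident impacting customers" line.
    return sev <= Severity.SEV3


if __name__ == "__main__":
    for sev in Severity:
        print(sev.name, is_major_incident(sev), is_customer_impacting(sev))
```

Using an ordered enum keeps both rules as single comparisons, so a future change to either line only touches one threshold.
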
  7. 01. Background : Incident Roles

  8. 01. Background : Incident Commander

    • Takes complete ownership of the outcome of the incident
    • Not necessarily the most senior person
    • Should not be casually replaced during an incident
    • Assembles the team and delegates responsibilities as appropriate
    • Single source of truth for what’s happening and what’s planned
    • Develops and maintains the IAP (Incident Action Plan)
    • Manages and updates the Conditions, Actions, Needs (CAN) report

  9. 01. Background : Communications Lead

    • Communicates with entities beyond the response team
    • Similar to a Public Information Officer
    • “Voice of the Incident Commander”
    • Passes information from outside the incident to the Incident Commander or Technical Lead
  10. 01. Background : Technical Lead

    • Expected to be an SME (Subject Matter Expert)
    • Responsible for the execution of technical tasks
    • Advises the Incident Commander on technical decisions and gives updates
    • “The hands of the Incident Commander”
    • Defers to the Incident Commander for policy and planning decisions
  11. 01. Background : CAN report

    • Conditions
      ◦ Type of incident
      ◦ Current status of the incident, including state and severity level
      ◦ Summary
      ◦ Blast radius (customer impact)
    • Actions
      ◦ What is being done
      ◦ Who is doing it
    • Needs
      ◦ Additional personnel or resources
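
The CAN report fields listed above map naturally onto a small data structure. Below is a rough Python sketch under that assumption; the class names `CanReport` and `Action`, the `render` method, and all sample values (including the state name "Investigating") are hypothetical and do not come from the deck.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    what: str  # what is being done
    who: str   # who is doing it


@dataclass
class CanReport:
    # Conditions
    incident_type: str
    state: str
    severity: str
    summary: str
    blast_radius: str
    # Actions and Needs
    actions: List[Action] = field(default_factory=list)
    needs: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the report as plain text for a chat update."""
        lines = [
            "Conditions",
            f"  Type: {self.incident_type}",
            f"  Status: {self.state} / {self.severity}",
            f"  Summary: {self.summary}",
            f"  Blast radius: {self.blast_radius}",
            "Actions",
        ]
        lines += [f"  {a.what} ({a.who})" for a in self.actions]
        lines.append("Needs")
        lines += [f"  {n}" for n in self.needs]
        return "\n".join(lines)


# Hypothetical sample values, for illustration only.
report = CanReport(
    incident_type="Checkout errors",
    state="Investigating",
    severity="SEV2",
    summary="Elevated error rate on payment API",
    blast_radius="Some customers cannot complete purchases",
    actions=[Action("Rolling back latest deploy", "Technical Lead")],
    needs=["Database SME"],
)

if __name__ == "__main__":
    print(report.render())
```

Keeping the report as structured data lets the Incident Commander re-render it on every update instead of rewriting it by hand.
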
  12. 01. Background : Incident Timeline

    MTTA : Mean Time To Acknowledge
    MTTR : Mean Time To Resolve
    MTRS : Mean Time To Restore Service

    [Timeline figure: Normal Operation, Start of an Incident, Incident Acknowledged, Incident Resolved (= End of an incident), Normal Operation, Next Incident (distant future :)]
    • MTTA spans from the start of an incident to its acknowledgement
    • MTTR = Incident impact duration, from acknowledgement to resolution
    • MTRS = Customer impact duration, from the start of an incident to resolution
  13. 01. Background : MTTA

    MTTA : Mean Time To Acknowledge
    MTTA is the time taken to acknowledge an incident after the incident has actually started.
  14. 01. Background : MTTR

    MTTR : Mean Time To Resolve
    MTTR is the time taken to resolve an incident after the incident is acknowledged. It equals the incident impact duration, that is, the time teams and members spend resolving the incident.
  15. 01. Background : MTRS

    MTRS : Mean Time To Restore Service
    MTRS = MTTA + MTTR
    MTRS is the total time taken to resolve an incident after the incident has started. It is also equivalent to the customer impact duration.
    Customer impact duration = Time to acknowledge (MTTA) + Incident impact duration by teams (MTTR)
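
To make the relation MTRS = MTTA + MTTR concrete, here is a small Python sketch that computes the three metrics from incident timestamps. The `Incident` record, the `incident_metrics` helper, and the sample timestamps are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Incident:
    started: datetime
    acknowledged: datetime
    resolved: datetime


def mean(durations: List[timedelta]) -> timedelta:
    return sum(durations, timedelta()) / len(durations)


def incident_metrics(
    incidents: List[Incident],
) -> Tuple[timedelta, timedelta, timedelta]:
    """Return (MTTA, MTTR, MTRS) as defined on the slides."""
    mtta = mean([i.acknowledged - i.started for i in incidents])
    mttr = mean([i.resolved - i.acknowledged for i in incidents])
    mtrs = mean([i.resolved - i.started for i in incidents])
    return mtta, mttr, mtrs


# Two made-up incidents on the same day.
sample = [
    Incident(datetime(2022, 10, 1, 10, 0),
             datetime(2022, 10, 1, 10, 5),
             datetime(2022, 10, 1, 10, 45)),
    Incident(datetime(2022, 10, 1, 14, 0),
             datetime(2022, 10, 1, 14, 15),
             datetime(2022, 10, 1, 14, 45)),
]

if __name__ == "__main__":
    mtta, mttr, mtrs = incident_metrics(sample)
    print(mtta, mttr, mtrs)  # 0:10:00 0:35:00 0:45:00
```

For the sample data, the times to acknowledge are 5 and 15 minutes (mean 10), the times to resolve are 40 and 30 minutes (mean 35), and the customer impact is 45 minutes in both cases, so MTRS equals MTTA + MTTR.
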
  16. 02. Best Practices : Incident response

    • Prioritize : Stop the bleeding, restore service, and preserve the evidence for root causing
    • Prepare : Develop and document your incident management procedures in advance...
    • Trust : Give full autonomy within the assigned role to all incident participants.
    • Introspect : Pay attention to your emotional state while responding to an incident…
    • Consider alternatives : Periodically consider your options and re-evaluate whether it still makes sense to continue
    • Practice : Use the process routinely so it becomes second nature.
    • Change it around : Were you incident commander last time? Take on a different role this time.

    Ref : Stribblehill, Andrew. “Chapter 14.” Site Reliability Engineering, edited by Kavita Guliani, http://landing.google.com/sre/sre-book/chapters/managing-incidents/#id-MJbuNS0Fd
  17. 02. Best Practices : Decreasing MTTA

    • Improve monitoring
    • “Panic Button” for Customer Support
    • Automatic incident triggering
    • Automatic response team alerting (paging)
    • Automatic construction of communication channels (chat, voice bridge)
    • Established procedures THAT ARE PRACTICED!
  18. 02. Best Practices : Decreasing MTTR

    • Codified incident process
    • If only we could orchestrate parallel paths of investigation
    • Multiple SMEs running multiple “swimlanes”
    • Discipline in following the process (or Consistency & Dedication)
  19. 02. Best Practices : Decreasing MTTR

    • Codified incident process
    • Proper training for your incident response team
    • Practice, Practice, Practice
    • Discipline or, if you prefer... Consistency & Dedication
    • Archive, analyze, and learn from your postmortems
    • Did I forget to mention Discipline?
  20. 02. Best Practices : Incident management process

    • Predictable
    • Repeatable
    • Optimized
    • Clear
    • Evaluated
    • Scalable
    • Sustainable
  21. 02. Best Practices : Incident postmortem

    “Without a predictable way to respond to incidents, any organization -- growing or mature -- is at risk.”
    -- Schnepp, Rob. Incident Management for Operations. O'Reilly Media. Kindle Edition.
  22. 02. Best Practices : Incident postmortem process

    • Assign a postmortem owner
    • Complete the timeline
    • Schedule a meeting to collaborate on the postmortem
    • Discuss and assign actionable follow-up actions
    • Complete follow-up actions
    • Share the learnings
  23. 02. Best Practices : Postmortem Time Consumption

  24. 02. Best Practices : Successful Postmortem

    • Clear ownership
    • Context & key details
    • On-time completion
    • Tracked follow-up actions
    • Blameless language
    • Referenceability
  25. 02. Best Practices : Results of Successful Postmortems

    • Less blame
    • Less toil
    • Less panic
    • Continuous improvement & faster delivery
    • Happy & successful customers
  26. Thank you!