[DevDojo] Mercari's Incident Management Process

Mercari’s Incident Management Process
Fall 2022 Dev Dojo
@maruti

2 Agenda
01 Incident Management Background
02 Incident Management Best Practices

3 01. Background : What is an incident?
“An unplanned interruption to an IT service or reduction in the quality of an IT service.”
-- Schnepp, Rob. Incident Management for Operations (p. 1). O'Reilly Media. Kindle Edition.

4 01. Background : Normal Operation vs Incident
Incidents are NOT day-to-day operation
• Declare an Incident explicitly!
• The Incident state requires a special set of rules
• Declare when the Incident is over (Resolved)
Goals
• Return to normal operation with as little impact as possible
• As fast as possible
• Follow up with a Postmortem shortly after

5 01. Background : Incident States

6 01. Background : Incident severity levels
Check the detailed flow to estimate Incident Severity.
Severity | General description
SEV1 | Highly critical issue that warrants public notification and liaison with executive teams
SEV2 | Critical issue actively impacting many customers' ability to use the product
(Anything above this line is considered a Major Incident.)
SEV3 | Customer-impacting issues that require immediate attention from service owners
(Anything above this line is considered an Incident impacting customers.)
SEV4 | Issues requiring action, but not affecting customers' ability to use the product. This includes internal tool incidents, cron job failure incidents, or potential risks that can lead to incidents if no action is taken.
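The two threshold lines in the severity levels above can be expressed as simple comparisons. The following is a minimal Python sketch, not Mercari's tooling: the names `Severity`, `is_major_incident`, `is_customer_impacting`, and `estimate_severity` are hypothetical, and the three yes/no questions in `estimate_severity` are an assumed stand-in for the detailed flow, which the slide only references.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the slide; a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def is_major_incident(sev: Severity) -> bool:
    # SEV1 and SEV2 sit above the "Major Incident" line.
    return sev <= Severity.SEV2


def is_customer_impacting(sev: Severity) -> bool:
    # SEV1 through SEV3 sit above the "impacting customers" line.
    return sev <= Severity.SEV3


def estimate_severity(needs_public_notification: bool,
                      many_customers_blocked: bool,
                      customers_affected: bool) -> Severity:
    # Hypothetical questions, checked from most to least severe.
    # This is an assumption, not the detailed flow the slide refers to.
    if needs_public_notification:
        return Severity.SEV1
    if many_customers_blocked:
        return Severity.SEV2
    if customers_affected:
        return Severity.SEV3
    return Severity.SEV4
```

For example, a failed internal cron job with no customer effect would come out of `estimate_severity(False, False, False)` as SEV4, which is neither a Major Incident nor customer-impacting.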

7 01. Background : Incident Roles

8 01. Background : Incident Commander
• Takes complete ownership of the outcome of the incident
• Not necessarily the most senior person
• Should not be casually replaced during an incident
• Assembles the team and delegates responsibilities as appropriate
• Single source of truth for what’s happening and what’s planned
• Develops and maintains the IAP (Incident Action Plan)
• Manages and updates the Conditions, Actions, Needs (CAN) report

9 01. Background : Communications Lead
• Communicates with entities beyond the response team
• Similar to a Public Information Officer
• “Voice of the Incident Commander”
• Passes information from outside the incident to the Incident Commander or Technical Lead

10 01. Background : Technical Lead
• Expected to be an SME (Subject Matter Expert)
• Responsible for the execution of technical tasks
• Advises the Incident Commander on technical decisions and gives updates
• “The hands of the Incident Commander”
• Defers to the Incident Commander for policy and planning decisions

11 01. Background : CAN report
• Conditions
  ◦ Type of Incident
  ◦ Current status of the incident, including State and Severity Level
  ◦ Summary
  ◦ Blast Radius (Customer Impact)
• Actions
  ◦ What is being done
  ◦ Who is doing it
• Needs
  ◦ Additional personnel or resources
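One way to keep CAN reports consistent is to give them a fixed structure. The following is a minimal Python sketch under that assumption: the `CanReport` class, its field names, and the rendered layout are hypothetical and only mirror the Conditions / Actions / Needs items listed on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class CanReport:
    # Conditions
    incident_type: str
    state: str
    severity: str
    summary: str
    blast_radius: str
    # Actions: maps what is being done to who is doing it
    actions: dict[str, str] = field(default_factory=dict)
    # Needs: additional personnel or resources
    needs: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the report as plain text, one item per line."""
        lines = [
            "Conditions",
            f"  Type: {self.incident_type}",
            f"  Status: {self.state} / {self.severity}",
            f"  Summary: {self.summary}",
            f"  Blast radius: {self.blast_radius}",
            "Actions",
        ]
        lines += [f"  {what} ({who})" for what, who in self.actions.items()]
        lines.append("Needs")
        lines += [f"  {need}" for need in self.needs] or ["  (none)"]
        return "\n".join(lines)


# Illustrative values only; none of these come from a real incident.
report = CanReport(
    incident_type="Elevated API error rate",
    state="Acknowledged",
    severity="SEV2",
    summary="Checkout requests are failing intermittently",
    blast_radius="Some customers cannot complete purchases",
    actions={"Rolling back the latest deploy": "Technical Lead"},
)
print(report.render())
```

Because the Incident Commander updates the report throughout the incident, a structure like this makes it easy to re-render the current state each time a field changes.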

12 01. Background : Incident Timeline
MTTA : Mean Time To Acknowledge
MTTR : Mean Time To Resolve
MTRS : Mean Time To Restore Service
[Figure: timeline running from Normal Operation through Start of an Incident, Incident Acknowledged, and Incident Resolved (= End of an incident) back to Normal Operation, with the Next Incident in the distant future. MTTA spans start to acknowledgement; MTTR (= Incident impact duration) spans acknowledgement to resolution; MTRS (= Customer impact duration) spans start to resolution.]

13 01. Background : MTTA
MTTA : Mean Time To Acknowledge
MTTA is the time taken to acknowledge an incident after the incident has actually started.

14 01. Background : MTTR
MTTR : Mean Time To Resolve
MTTR is the time taken to resolve an incident after the incident is acknowledged. It equals the Incident impact duration: the time teams and members spend resolving the incident.

15 01. Background : MTRS
MTRS : Mean Time To Restore Service
MTRS = MTTA + MTTR
MTRS is the total time taken to resolve an incident after the incident has started. It is equivalent to the Customer impact duration.
Customer impact duration = Time to acknowledge (MTTA) + Incident impact duration by teams (MTTR)
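As a worked example of MTRS = MTTA + MTTR, the following Python sketch averages the three durations over a list of incidents. The `IncidentTimes` record, the `incident_metrics` function, and the timestamps are hypothetical; they assume each incident records when it started, was acknowledged, and was resolved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimes:
    started: datetime
    acknowledged: datetime
    resolved: datetime


def mean(deltas: list[timedelta]) -> timedelta:
    # Sum with a timedelta start value, then divide by the count.
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents: list[IncidentTimes]):
    """Return (MTTA, MTTR, MTRS) averaged over the given incidents."""
    mtta = mean([i.acknowledged - i.started for i in incidents])
    mttr = mean([i.resolved - i.acknowledged for i in incidents])
    mtrs = mean([i.resolved - i.started for i in incidents])
    return mtta, mttr, mtrs


# Made-up timestamps for illustration.
incidents = [
    # 5 minutes to acknowledge, 40 minutes to resolve
    IncidentTimes(datetime(2022, 10, 1, 10, 0),
                  datetime(2022, 10, 1, 10, 5),
                  datetime(2022, 10, 1, 10, 45)),
    # 15 minutes to acknowledge, 20 minutes to resolve
    IncidentTimes(datetime(2022, 10, 8, 14, 0),
                  datetime(2022, 10, 8, 14, 15),
                  datetime(2022, 10, 8, 14, 35)),
]

mtta, mttr, mtrs = incident_metrics(incidents)
print(mtta, mttr, mtrs)  # 0:10:00 0:30:00 0:40:00
```

Here MTTA is 10 minutes and MTTR is 30 minutes, so the customer impact duration (MTRS) averages 40 minutes. Shortening either phase shortens the time customers are affected.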

16 02. Best Practices : Incident response
• Prioritize : Stop the bleeding, restore service, and preserve the evidence for root causing
• Prepare : Develop and document your incident management procedures in advance...
• Trust : Give full autonomy within the assigned role to all incident participants.
• Introspect : Pay attention to your emotional state while responding to an incident…
• Consider alternatives : Periodically consider your options and re-evaluate whether it still makes sense to continue
• Practice : Use the process routinely so it becomes second nature.
• Change it around : Were you incident commander last time? Take on a different role this time.
Ref : -- Stribblehill, Andrew. “Chapter 14.” Site Reliability Engineering, edited by Kavita Guliani, http://landing.google.com/sre/sre-book/chapters/managing-incidents/#id-MJbuNS0Fd

17 02. Best Practices : Decreasing MTTA
• Improve Monitoring
• “Panic Button” for Customer Support
• Automatic Incident Triggering
• Automatic Response Team Alerting (Paging)
• Automatic Construction of Communication channels (chat, voice bridge)
• Established procedures THAT ARE PRACTICED!

18 02. Best Practices : Decreasing MTTR
• Codified Incident Process
• If only we could orchestrate parallel paths of investigation
• Multiple SMEs running multiple “swimlanes”
• Discipline in following process (or Consistency & Dedication)

19 02. Best Practices : Decreasing MTTR
• Codified Incident Process
• Proper Training for your Incident Response Team
• Practice, Practice, Practice
• Discipline or, if you prefer... Consistency & Dedication
• Archive, Analyze and Learn from your Postmortems
• Did I forget to mention Discipline?

20 02. Best Practices : Incident management process
• Predictable
• Repeatable
• Optimized
• Clear
• Evaluated
• Scalable
• Sustainable

21 02. Best Practices : Incident postmortem
“Without a predictable way to respond to incidents, any organization — growing or mature — is at risk.”
-- Schnepp, Rob. Incident Management for Operations. O'Reilly Media. Kindle Edition.

22 02. Best Practices : Incident postmortem process
• Assign a Postmortem owner
• Complete the timeline
• Schedule a meeting to collaborate on the postmortem
• Discuss & assign actionable follow-up actions
• Complete follow-up actions
• Share the learnings

23 02. Best Practices : Postmortem Time Consumption

24 02. Best Practices : Successful Postmortem
• Clear ownership
• Context & Key Details
• On-Time Completion
• Tracked follow-up actions
• Blameless language
• Referenceability

25 02. Best Practices : Results of Successful Postmortems
• Less blame
• Less toil
• Less panic
• Continuous improvement & faster delivery
• Happy & successful customers

26 Thank you!
