Slide 1

Slide 1 text

Mercari’s Incident Management Process
Fall 2022 Dev Dojo
@maruti

Slide 2

Slide 2 text

Agenda
01 Incident Management Background
02 Incident Management Best Practices

Slide 3

Slide 3 text

01. Background: What is an incident?
“An unplanned interruption to an IT service or reduction in the quality of an IT service.”
-- Schnepp, Rob. Incident Management for Operations (p. 1). O'Reilly Media. Kindle Edition.


Slide 4

Slide 4 text

01. Background: Normal Operation vs Incident
Incidents are NOT day-to-day operation
● Declare an Incident explicitly!
● Incident state requires a special set of rules
● Declare when the Incident is over (Resolved)
Goals
● Return to normal operation with as little impact as possible
● As fast as possible
● Follow up with a Postmortem shortly after

Slide 5

Slide 5 text

01. Background: Incident States

Slide 6

Slide 6 text

01. Background: Incident severity levels
Check the detailed flow to estimate Incident Severity.

Severity | General description
SEV1 | Highly critical issue that warrants public notification and liaison with executive teams
SEV2 | Critical issue actively impacting many customers' ability to use the product
(Anything above this line is considered a Major Incident.)
SEV3 | Customer-impacting issues that require immediate attention from service owners
(Anything above this line is considered an Incident impacting customers.)
SEV4 | Issues requiring action, but not affecting customers' ability to use the product. This includes internal tool incidents, cron job failure incidents, or potential risks that can lead to incidents if no action is taken.
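
The two threshold lines in the severity levels above can be expressed as simple comparisons. The following is a minimal sketch, not Mercari's actual tooling: the names `Severity`, `is_major_incident`, and `is_customer_impacting` are hypothetical, and the "detailed flow" for estimating severity is not reproduced here.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the slide; a lower number means more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def is_major_incident(sev: Severity) -> bool:
    # SEV1 and SEV2 sit above the "Major Incident" line.
    return sev <= Severity.SEV2


def is_customer_impacting(sev: Severity) -> bool:
    # SEV1 through SEV3 sit above the "impacting customers" line.
    return sev <= Severity.SEV3
```

With this encoding, SEV3 is customer-impacting but not major, and SEV4 is neither.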

Slide 7

Slide 7 text

01. Background: Incident Roles

Slide 8

Slide 8 text

01. Background: Incident Commander
● Takes complete ownership of the outcome of the incident
● Not necessarily the most senior person
● Should not be casually replaced during an incident
● Assembles the team and delegates responsibilities as appropriate
● Single source of truth of what’s happening and what’s planned
● Develops and maintains the IAP (Incident Action Plan)
● Manages/Updates the Conditions Actions Needs (CAN) report


Slide 9

Slide 9 text

01. Background: Communications Lead
● Communicates with entities beyond the response team
● Similar to Public Information Officer
● “Voice of the Incident Commander”
● Passes info from outside of the incident to the Incident Commander or Technical Lead

Slide 10

Slide 10 text

01. Background: Technical Lead
● Expected to be an SME (Subject Matter Expert)
● Responsible for the execution of technical tasks
● Advises the Incident Commander on technical decisions and gives updates
● “The hands of the Incident Commander”
● Defers to the Incident Commander for policy and planning decisions

Slide 11

Slide 11 text

01. Background: CAN report
● Conditions
  ○ Type of Incident
  ○ Current Status of incident including State and Severity Level
  ○ Summary
  ○ Blast Radius (Customer Impact)
● Actions
  ○ What is being done
  ○ Who is doing it
● Needs
  ○ Additional personnel or resources
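
As an illustration, the CAN report fields listed above map naturally onto a small record type. This is a hedged sketch only: `CanReport`, `Action`, the field names, and the `render` layout are all hypothetical and are not taken from the deck.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    what: str  # what is being done
    who: str   # who is doing it


@dataclass
class CanReport:
    # Conditions
    incident_type: str
    state: str         # free-form here; the deck does not list the states in text
    severity: str      # e.g. "SEV2"
    summary: str
    blast_radius: str  # customer impact
    # Actions
    actions: List[Action] = field(default_factory=list)
    # Needs (additional personnel or resources)
    needs: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the report as plain text for posting to a channel."""
        lines = [
            "Conditions",
            f"  Type: {self.incident_type}",
            f"  Status: {self.state}, {self.severity}",
            f"  Summary: {self.summary}",
            f"  Blast radius: {self.blast_radius}",
            "Actions",
        ]
        lines += [f"  {a.who}: {a.what}" for a in self.actions] or ["  (none)"]
        lines.append("Needs")
        lines += [f"  {n}" for n in self.needs] or ["  (none)"]
        return "\n".join(lines)
```

Keeping the report in one structure fits the Incident Commander's role as the single source of truth: each update replaces the previous report rather than scattering status across messages.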

Slide 12

Slide 12 text

01. Background: Incident Timeline
MTTA: Mean Time To Acknowledge
MTTR: Mean Time To Resolve
MTRS: Mean Time To Restore Service
[Figure: timeline running Normal Operation, Start of an Incident, Incident Acknowledged, Incident Resolved (= End of an incident), Normal Operation, Next Incident (distant future :). MTTA spans start to acknowledgement; MTTR (= Incident impact duration) spans acknowledgement to resolution; MTRS (= Customer impact duration) spans start to resolution.]

Slide 13

Slide 13 text

01. Background: MTTA
MTTA: Mean Time To Acknowledge
MTTA is the time taken to acknowledge an incident after the incident has actually started.

Slide 14

Slide 14 text

01. Background: MTTR
MTTR: Mean Time To Resolve
MTTR is the time taken to resolve an incident after the incident is acknowledged. It equals the Incident impact duration, i.e. the time teams/members spend resolving the incident.

Slide 15

Slide 15 text

01. Background: MTRS
MTRS: Mean Time To Restore Service
MTRS = MTTA + MTTR
MTRS is the total time taken to resolve an incident after the incident has started. It is also equivalent to the Customer impact duration.
Customer impact duration = Time to acknowledge (MTTA) + Incident impact duration by teams (MTTR)
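
The three metrics can be computed from per-incident timestamps. Below is a minimal sketch under the definitions above; the `IncidentTimes` record, the function names, and the sample timestamps are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, List


@dataclass
class IncidentTimes:
    started: datetime       # start of the incident
    acknowledged: datetime  # incident acknowledged
    resolved: datetime      # incident resolved (end of the incident)


def _mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    # Average a series of durations by averaging their lengths in seconds.
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mtta(incidents: List[IncidentTimes]) -> timedelta:
    # Start of incident until acknowledgement.
    return _mean_delta(i.acknowledged - i.started for i in incidents)


def mttr(incidents: List[IncidentTimes]) -> timedelta:
    # Acknowledgement until resolution (incident impact duration).
    return _mean_delta(i.resolved - i.acknowledged for i in incidents)


def mtrs(incidents: List[IncidentTimes]) -> timedelta:
    # Start of incident until resolution (customer impact duration).
    return _mean_delta(i.resolved - i.started for i in incidents)


# Two made-up incidents for illustration.
sample = [
    IncidentTimes(datetime(2022, 10, 1, 10, 0),
                  datetime(2022, 10, 1, 10, 5),
                  datetime(2022, 10, 1, 10, 35)),
    IncidentTimes(datetime(2022, 10, 2, 12, 0),
                  datetime(2022, 10, 2, 12, 15),
                  datetime(2022, 10, 2, 13, 5)),
]
print(mtta(sample), mttr(sample), mtrs(sample))  # 0:10:00 0:40:00 0:50:00
```

In the sample, acknowledgement takes 5 and 15 minutes (mean 10) and resolution takes 30 and 50 minutes (mean 40), so MTRS is 50 minutes, matching MTRS = MTTA + MTTR.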

Slide 16

Slide 16 text

02. Best Practices: Incident response
● Prioritize: Stop the bleeding, restore service, and preserve the evidence for root causing
● Prepare: Develop and document your incident management procedures in advance...
● Trust: Give full autonomy within the assigned role to all incident participants.
● Introspect: Pay attention to your emotional state while responding to an incident…
● Consider alternatives: Periodically consider your options and re-evaluate whether it still makes sense to continue
● Practice: Use the process routinely so it becomes second nature.
● Change it around: Were you incident commander last time? Take on a different role this time.
Ref: Stribblehill, Andrew. “Chapter 14.” Site Reliability Engineering, edited by Kavita Guliani, http://landing.google.com/sre/sre-book/chapters/managing-incidents/#id-MJbuNS0Fd

Slide 17

Slide 17 text

02. Best Practices: Decreasing MTTA
● Improve Monitoring
● “Panic Button” for Customer Support
● Automatic Incident Triggering
● Automatic Response Team Alerting (Paging)
● Automatic Construction of Communication channels (chat, voice bridge)
● Established procedures THAT ARE PRACTICED!

Slide 18

Slide 18 text

02. Best Practices: Decreasing MTTR
● Codified Incident Process
● If only we could orchestrate parallel paths of investigation
● Multiple SMEs running multiple “swimlanes”
● Discipline in following process (or Consistency & Dedication)

Slide 19

Slide 19 text

02. Best Practices: Decreasing MTTR
● Codified Incident Process
● Proper Training for your Incident Response Team
● Practice, Practice, Practice
● Discipline or, if you prefer... Consistency & Dedication
● Archive, Analyze and Learn from your Postmortems
● Did I forget to mention Discipline?

Slide 20

Slide 20 text

02. Best Practices: Incident management process
● Predictable
● Repeatable
● Optimized
● Clear
● Evaluated
● Scalable
● Sustainable

Slide 21

Slide 21 text

02. Best Practices: Incident postmortem
“Without a predictable way to respond to incidents, any organization — growing or mature — is at risk.”
-- Schnepp, Rob. Incident Management for Operations. O'Reilly Media. Kindle Edition.

Slide 22

Slide 22 text

02. Best Practices: Incident postmortem process
● Assign Postmortem owner
● Complete the timeline
● Schedule meeting to collaborate on postmortem
● Discuss & assign actionable follow-up actions
● Complete follow-up actions
● Share the learnings out

Slide 23

Slide 23 text

02. Best Practices: Postmortem Time Consumption

Slide 24

Slide 24 text

02. Best Practices: Successful Postmortem
● Clear ownership
● Context & Key Details
● On Time Completion
● Tracked follow-up actions
● Blameless language
● Referenceability

Slide 25

Slide 25 text

02. Best Practices: Results of Successful Postmortems
● Less blame
● Less toil
● Less panic
● Continuous improvement & faster delivery
● Happy & successful customers

Slide 26

Slide 26 text

Thank you!