[DevDojo] Mercari Incident Management - 2024

1 Mercari's Incident Management Process
Account Moderation & Privacy Tech Engineering Team, @maruti

2 Agenda
01 Incident Management Background
02 Incident Management Best Practices

3 01. Background: What is an incident?
"An unplanned interruption to an IT service or reduction in the quality of an IT service."
Schnepp, Rob. Incident Management for Operations (p. 1). O'Reilly Media. Kindle Edition.
In product terms: an unexpected service interruption or drop in quality that affects users of the product.

4 01. Background: Normal Operation vs Incident
Incidents are NOT day-to-day operation.
• Declare an incident explicitly!
• The incident state requires a special set of rules.
• Declare when the incident is over (Resolved).

5 01. Background: Normal Operation vs Incident
Goals
• Return to normal operation with as little impact as possible
• As fast as possible
• Follow up with a postmortem shortly after

6 01. Background : Incident States

7 01. Background: Incident severity levels
Severity: General description
SEV1: Highly critical issue that warrants public notification and liaison with executive teams.
SEV2: Critical issue actively impacting many customers' ability to use the product.
(Anything above this line is considered a Major Incident.)
SEV3: Customer-impacting issues that require immediate attention from service owners.
(Anything above this line is considered an incident impacting customers.)
SEV4: Issues requiring action, but not affecting customers' ability to use the product. This includes internal tool incidents, cron job failures, and potential risks that can lead to incidents if no action is taken.
Note: Cosmetic issues or bugs that do not lead to incidents can be categorized as SEV5 and do not need to be registered in Blameless.
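The two dividing lines in the severity table can be expressed as simple comparisons. The following is a minimal sketch, not Mercari's tooling: the names `Severity`, `is_major_incident`, `is_customer_impacting`, and `needs_registration` are assumptions made for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the slide; a lower number means a more severe issue."""
    SEV1 = 1  # public notification and liaison with executive teams
    SEV2 = 2  # many customers cannot use the product
    SEV3 = 3  # customer-impacting, needs immediate attention from service owners
    SEV4 = 4  # action required, but no customer impact
    SEV5 = 5  # cosmetic issue or bug that does not lead to an incident


def is_major_incident(sev: Severity) -> bool:
    # SEV1 and SEV2 sit above the "Major Incident" line.
    return sev <= Severity.SEV2


def is_customer_impacting(sev: Severity) -> bool:
    # SEV1 to SEV3 sit above the "impacting customers" line.
    return sev <= Severity.SEV3


def needs_registration(sev: Severity) -> bool:
    # Per the note on the slide, SEV5 does not need to be registered.
    return sev <= Severity.SEV4
```

Using an `IntEnum` keeps the levels ordered, so each line in the table becomes a single comparison.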

8 01. Background : Incident Roles

9 01. Background: Incident Commander
• Takes complete ownership of the outcome of the incident
• Not necessarily the most senior person
• Should not be casually replaced during an incident
• Assembles the team and delegates responsibilities as appropriate
• Is the single source of truth for what is happening and what is planned
• Develops and maintains the IAP (Incident Action Plan)
• Manages and updates the Conditions Actions Needs (CAN) report
• Adds the PM responsible for the affected feature to the communication loop in the first stage. If the PM is unknown, escalates to the CPO

10 01. Background: Communications Lead
• Communicates with entities beyond the response team
• Similar to a Public Information Officer
• "Voice of the Incident Commander"
• Passes information from outside the incident to the Incident Commander or Technical Lead

11 01. Background: Technical Lead
• Expected to be an SME (Subject Matter Expert)
• Responsible for the execution of technical tasks
• Advises the Incident Commander on technical decisions and gives updates
• "The hands of the Incident Commander"
• Defers to the Incident Commander for policy and planning decisions

12 01. Background: CAN report
Conditions Actions Needs (CAN) report
• Conditions
  ◦ Type of incident
  ◦ Current status of the incident, including state and severity level
  ◦ Summary
  ◦ Blast radius (customer impact)
• Actions
  ◦ What is being done
  ◦ Who is doing it
• Needs
  ◦ Additional personnel or resources
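The CAN report fields listed above map naturally onto a small record type. This is a sketch only: the class names `Action` and `CanReport`, the `render` method, and the text layout it produces are assumptions for illustration, not a format defined in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    what: str  # what is being done
    who: str   # who is doing it


@dataclass
class CanReport:
    # Conditions
    incident_type: str
    state: str
    severity: str
    summary: str
    blast_radius: str  # customer impact
    # Actions and Needs
    actions: list[Action] = field(default_factory=list)
    needs: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the report as plain text for posting to the incident channel."""
        lines = [
            "CONDITIONS",
            f"  Type: {self.incident_type}",
            f"  Status: {self.state} / {self.severity}",
            f"  Summary: {self.summary}",
            f"  Blast radius: {self.blast_radius}",
            "ACTIONS",
        ]
        lines += [f"  {a.what} ({a.who})" for a in self.actions] or ["  none"]
        lines.append("NEEDS")
        lines += [f"  {n}" for n in self.needs] or ["  none"]
        return "\n".join(lines)
```

Keeping the report as structured data lets the Incident Commander update one field at a time and re-render, so the report stays the single source of truth.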

13 01. Background: Incident Timeline
MTTA: Mean Time To Acknowledge
MTTR: Mean Time To Resolve
MTRS: Mean Time To Restore Service
[Figure: timeline along the time axis. Normal operation, start of an incident, incident acknowledged, incident resolved (end of the incident), normal operation, next incident (distant future :). MTTA runs from the start to the acknowledgement. MTTR runs from the acknowledgement to the resolution and equals the incident impact duration. MTRS runs from the start to the resolution and equals the customer impact duration.]

14 01. Background: MTTA
MTTA: Mean Time To Acknowledge
MTTA is the time taken to acknowledge an incident after it has actually started.

15 01. Background: MTTR
MTTR: Mean Time To Resolve
MTTR is the time taken to resolve an incident after it has been acknowledged. It equals the incident impact duration, that is, the time teams and members spend resolving the incident.

16 01. Background: MTRS
MTRS: Mean Time To Restore Service
MTRS = MTTA + MTTR
MTRS is the total time taken to resolve an incident after it has started. It is equivalent to the customer impact duration:
Customer impact duration = time to acknowledge (MTTA) + incident impact duration by teams (MTTR)
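The three metrics can be computed from three timestamps per incident and then averaged across incidents. The sketch below assumes a hypothetical `Incident` record and made-up timestamps; it only illustrates the definitions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime       # when the incident actually started
    acknowledged: datetime  # when the incident was acknowledged
    resolved: datetime      # when the incident was resolved

    @property
    def time_to_acknowledge(self) -> timedelta:
        return self.acknowledged - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.acknowledged

    @property
    def time_to_restore(self) -> timedelta:
        return self.resolved - self.started


def mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


# Two made-up incidents, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 5),
             datetime(2024, 1, 10, 9, 45)),
    Incident(datetime(2024, 2, 3, 14, 0), datetime(2024, 2, 3, 14, 15),
             datetime(2024, 2, 3, 15, 15)),
]
mtta = mean_minutes([i.time_to_acknowledge for i in incidents])  # (5 + 15) / 2 = 10
mttr = mean_minutes([i.time_to_resolve for i in incidents])      # (40 + 60) / 2 = 50
mtrs = mean_minutes([i.time_to_restore for i in incidents])      # (45 + 75) / 2 = 60
```

With these sample values, MTTA + MTTR equals MTRS, matching the identity on the slide.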

17 02. Best Practices: Incident response
• Prioritize: Stop the bleeding, restore service, and preserve the evidence for root-cause analysis.
• Prepare: Develop and document your incident management procedures in advance.
• Trust: Give full autonomy within the assigned role to all incident participants.
• Introspect: Pay attention to your emotional state while responding to an incident.
• Consider alternatives: Periodically consider your options and re-evaluate whether it still makes sense to continue.
• Practice: Use the process routinely so it becomes second nature.
• Change it around: Were you Incident Commander last time? Take on a different role this time.

18 02. Best Practices: Decreasing MTTA
• Improve monitoring
• "Panic button" for Customer Support
• Automatic incident triggering
• Automatic response team alerting (paging)
• Automatic construction of communication channels (chat, voice bridge)
• Established procedures THAT ARE PRACTICED!
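Automatic triggering, paging, and channel creation can be chained from a single monitoring signal. The sketch below is purely illustrative: the error-rate threshold, the `maybe_declare_incident` function, the `DeclaredIncident` record, and the channel naming scheme are all assumptions, and the actual calls to a paging or chat service are deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DeclaredIncident:
    service: str
    declared_at: datetime
    channel: str                                    # chat channel to create
    paged: list[str] = field(default_factory=list)  # responders to page


def maybe_declare_incident(
    service: str,
    error_rate: float,
    threshold: float,
    on_call: list[str],
    now: Optional[datetime] = None,
) -> Optional[DeclaredIncident]:
    """Declare an incident when error_rate exceeds threshold, else return None."""
    if error_rate <= threshold:
        return None
    now = now or datetime.now(timezone.utc)
    # Hypothetical naming scheme: timestamp plus service name.
    channel = f"#inc-{now:%Y%m%d-%H%M}-{service}"
    return DeclaredIncident(service, now, channel, list(on_call))
```

Because the declaration, the channel name, and the paging list come out of one function, no human step sits between detection and acknowledgement, which is where MTTA is spent.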

19 02. Best Practices: Decreasing MTTR
• Codified incident process
• If only we could orchestrate parallel paths of investigation
• Multiple SMEs running multiple "swimlanes"
• Discipline in following the process (or Consistency & Dedication)

20 02. Best Practices: Decreasing MTTR
• Codified incident process
• Proper training for your incident response team
• Practice, practice, practice
• Discipline or, if you prefer... Consistency & Dedication
• Archive, analyze, and learn from your postmortems
• Did I forget to mention Discipline?

21 02. Best Practices: Incident postmortem
"Without a predictable way to respond to incidents, any organization, growing or mature, is at risk."
Schnepp, Rob. Incident Management for Operations. O'Reilly Media. Kindle Edition.

22 02. Best Practices: Incident postmortem process
• Assign a postmortem owner
• Complete the incident timeline
• Schedule a meeting to collaborate on the postmortem
• Discuss and assign actionable follow-up actions
• Complete the follow-up actions
• Share the learnings

23 02. Best Practices : Postmortem Time Consumption

24 02. Best Practices: Successful Postmortem
• Clear ownership
• Context and key details
• On-time completion
• Tracked follow-up actions
• Blameless language
• Referenceability (easy to refer back to later)

25 02. Best Practices: Results of Successful Postmortems
• Less blame
• Less toil
• Less panic
• Continuous improvement and faster delivery
• Happy and successful customers

26 Thank you!
