[DevDojo] Mercari Incident Management - 2024

Slide 1

Slide 1 text

Mercari’s Incident Management Process
Account Moderation & Privacy Tech Engineering Team
@maruti

Slide 2

Slide 2 text

Agenda
01 Incident Management Background
02 Incident Management Best Practices

Slide 3

Slide 3 text

01. Background : What is an incident?
“An unplanned interruption to an IT service or reduction in the quality of an IT service.”
-- Schnepp, Rob. Incident Management for Operations (p. 1). O'Reilly Media. Kindle Edition.

Slide 4

Slide 4 text

01. Background : Normal Operation vs Incident
Incidents are NOT day-to-day operation.
● Declare an incident explicitly!
● The incident state requires a special set of rules
● Declare when the incident is over (Resolved)

Slide 5

Slide 5 text

01. Background : Normal Operation vs Incident
Goals
● Return to normal operation with as little impact as possible
● As fast as possible
● Follow up with a postmortem shortly after

Slide 6

Slide 6 text

01. Background : Incident States

Slide 7

Slide 7 text

01. Background : Incident severity levels

Severity  General Description
SEV1      Highly critical issue that warrants public notification and liaison with executive teams.
SEV2      Critical issue actively impacting many customers' ability to use the product.
          (Anything above this line is considered a Major Incident.)
SEV3      Customer-impacting issues that require immediate attention from service owners.
          (Anything above this line is considered an incident impacting customers.)
SEV4      Issues requiring action, but not affecting customers' ability to use the product. This includes internal tool incidents, cron job failure incidents, and potential risks that can lead to incidents if no action is taken.

Note : Cosmetic issues or bugs that do not lead to incidents can be categorized as SEV5 and do not need to be registered in Blameless.
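The two threshold lines in the severity table can be read as simple comparisons on the severity number. The following is a minimal sketch, not Mercari's actual tooling: the `Severity` enum and both helper functions are hypothetical names introduced only for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding of the four registered severity levels."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def is_major_incident(sev: Severity) -> bool:
    # SEV1 and SEV2 sit above the "Major Incident" line.
    return sev <= Severity.SEV2


def is_customer_impacting(sev: Severity) -> bool:
    # SEV1 through SEV3 sit above the "impacting customers" line.
    return sev <= Severity.SEV3


if __name__ == "__main__":
    for sev in Severity:
        print(sev.name, is_major_incident(sev), is_customer_impacting(sev))
```

Using an ordered enum keeps the "anything above this line" rule as a single comparison instead of a list of special cases.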

Slide 8

Slide 8 text

01. Background : Incident Roles

Slide 9

Slide 9 text

01. Background : Incident Commander
● Takes complete ownership of the outcome of the incident
● Not necessarily the most senior person
● Should not be casually replaced during an incident
● Assembles the team and delegates responsibilities as appropriate
● Single source of truth for what’s happening and what’s planned
● Develops and maintains the IAP (Incident Action Plan)
● Manages and updates the Conditions Actions Needs (CAN) report
● Adds the PM responsible for the affected feature to the communication loop in the first stage. If the PM is unknown, escalates to the CPO

Slide 10

Slide 10 text

01. Background : Communications Lead
● Communicates with entities beyond the response team
● Similar to a Public Information Officer
● “Voice of the Incident Commander”
● Passes information from outside the incident to the Incident Commander or Technical Lead

Slide 11

Slide 11 text

01. Background : Technical Lead
● Expected to be an SME (Subject Matter Expert)
● Responsible for the execution of technical tasks
● Advises the Incident Commander on technical decisions and gives updates
● “The hands of the Incident Commander”
● Defers to the Incident Commander for policy and planning decisions

Slide 12

Slide 12 text

01. Background : CAN report
Conditions Actions Needs (CAN) report
● Conditions
  ○ Type of incident
  ○ Current status of the incident, including State and Severity Level
  ○ Summary
  ○ Blast Radius (Customer Impact)
● Actions
  ○ What is being done
  ○ Who is doing it
● Needs
  ○ Additional personnel or resources
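The CAN report outline above maps naturally onto a small record type. The sketch below is one possible shape, assuming a plain-text rendering: the class names, field names, and the sample values are all hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    what: str  # what is being done
    who: str   # who is doing it


@dataclass
class CanReport:
    # Conditions
    incident_type: str
    state: str
    severity: str
    summary: str
    blast_radius: str
    # Actions and Needs
    actions: List[Action] = field(default_factory=list)
    needs: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as indented plain text."""
        lines = [
            "Conditions",
            f"  Type: {self.incident_type}",
            f"  Status: {self.state} / {self.severity}",
            f"  Summary: {self.summary}",
            f"  Blast radius: {self.blast_radius}",
            "Actions",
        ]
        lines += [f"  {a.what} ({a.who})" for a in self.actions] or ["  (none)"]
        lines.append("Needs")
        lines += [f"  {n}" for n in self.needs] or ["  (none)"]
        return "\n".join(lines)


# Illustrative values only.
report = CanReport(
    incident_type="Checkout errors",
    state="Investigating",
    severity="SEV2",
    summary="Elevated error rate on checkout",
    blast_radius="Some customers cannot complete purchases",
    actions=[Action("Rolling back the latest release", "Technical Lead")],
)

if __name__ == "__main__":
    print(report.render())
```

Keeping the report as structured data lets the Incident Commander update one field at a time while the rendered text stays consistent between updates.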

Slide 13

Slide 13 text

01. Background : Incident Timeline
MTTA : Mean Time To Acknowledge
MTTR : Mean Time To Resolve
MTRS : Mean Time To Restore Service
[Timeline figure: Normal Operation → Start of an Incident → Incident Acknowledged → Incident Resolved (end of the incident) → Normal Operation → Next Incident (distant future :). MTTA runs from the start to acknowledgement; MTTR (incident impact duration) runs from acknowledgement to resolution; MTRS (customer impact duration) runs from the start to resolution.]

Slide 14

Slide 14 text

01. Background : MTTA
MTTA : Mean Time To Acknowledge
MTTA is the time taken to acknowledge an incident after it has actually started.

Slide 15

Slide 15 text

01. Background : MTTR
MTTR : Mean Time To Resolve
MTTR is the time taken to resolve an incident after it has been acknowledged. It equals the incident impact duration, that is, the time teams and members spend resolving the incident.

Slide 16

Slide 16 text

01. Background : MTRS
MTRS : Mean Time To Restore Service
MTRS = MTTA + MTTR
MTRS is the total time taken to resolve an incident after it has started. It is equivalent to the customer impact duration.
Customer impact duration = Time to acknowledge (MTTA) + Incident impact duration by teams (MTTR)
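The relation MTRS = MTTA + MTTR can be checked directly from incident timestamps. This is a minimal sketch under the definitions above: the `Incident` record, the `mean_minutes` helper, and the two sample incidents are hypothetical and exist only to illustrate the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime
    acknowledged: datetime
    resolved: datetime

    @property
    def time_to_acknowledge(self) -> timedelta:
        return self.acknowledged - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.acknowledged

    @property
    def time_to_restore(self) -> timedelta:
        return self.resolved - self.started


def mean_minutes(durations) -> float:
    """Average a collection of timedeltas, expressed in minutes."""
    return mean(d.total_seconds() / 60 for d in durations)


# Two made-up incidents for illustration.
incidents = [
    Incident(datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 10, 10, 5),
             datetime(2024, 1, 10, 10, 35)),
    Incident(datetime(2024, 2, 3, 14, 0), datetime(2024, 2, 3, 14, 15),
             datetime(2024, 2, 3, 15, 5)),
]

mtta = mean_minutes(i.time_to_acknowledge for i in incidents)
mttr = mean_minutes(i.time_to_resolve for i in incidents)
mtrs = mean_minutes(i.time_to_restore for i in incidents)

if __name__ == "__main__":
    print(f"MTTA={mtta} min, MTTR={mttr} min, MTRS={mtrs} min")
```

With these sample values the acknowledge times are 5 and 15 minutes and the resolve times are 30 and 50 minutes, so MTTA is 10, MTTR is 40, and MTRS is 50 minutes.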

Slide 17

Slide 17 text

02. Best Practices : Incident response
● Prioritize : Stop the bleeding, restore service, and preserve the evidence for root-cause analysis.
● Prepare : Develop and document your incident management procedures in advance.
● Trust : Give full autonomy within the assigned role to all incident participants.
● Introspect : Pay attention to your emotional state while responding to an incident.
● Consider alternatives : Periodically consider your options and re-evaluate whether it still makes sense to continue.
● Practice : Use the process routinely so it becomes second nature.
● Change it around : Were you Incident Commander last time? Take on a different role this time.

Slide 18

Slide 18 text

02. Best Practices : Decreasing MTTA
● Improve monitoring
● “Panic Button” for Customer Support
● Automatic incident triggering
● Automatic response team alerting (paging)
● Automatic construction of communication channels (chat, voice bridge)
● Established procedures THAT ARE PRACTICED!

Slide 19

Slide 19 text

02. Best Practices : Decreasing MTTR
● Codified incident process
● If only we could orchestrate parallel paths of investigation
● Multiple SMEs running multiple “swimlanes”
● Discipline in following the process (or Consistency & Dedication)

Slide 20

Slide 20 text

02. Best Practices : Decreasing MTTR
● Codified incident process
● Proper training for your Incident Response Team
● Practice, Practice, Practice
● Discipline or, if you prefer... Consistency & Dedication
● Archive, analyze, and learn from your postmortems
● Did I forget to mention Discipline?

Slide 21

Slide 21 text

02. Best Practices : Incident postmortem
“Without a predictable way to respond to incidents, any organization, growing or mature, is at risk.”
-- Schnepp, Rob. Incident Management for Operations. O'Reilly Media. Kindle Edition.

Slide 22

Slide 22 text

02. Best Practices : Incident postmortem process
● Assign a postmortem owner
● Complete the timeline
● Schedule a meeting to collaborate on the postmortem
● Discuss and assign actionable follow-up actions
● Complete the follow-up actions
● Share the learnings

Slide 23

Slide 23 text

02. Best Practices : Postmortem Time Consumption

Slide 24

Slide 24 text

02. Best Practices : Successful Postmortem
● Clear ownership
● Context & key details
● On-time completion
● Tracked follow-up actions
● Blameless language
● Referenceability

Slide 25

Slide 25 text

02. Best Practices : Results of Successful Postmortems
● Less blame
● Less toil
● Less panic
● Continuous improvement & faster delivery
● Happy & successful customers

Slide 26

Slide 26 text

Thank you!