Building a Minimum Viable Incident Response Plan

A minimum viable product (MVP) lets you gather rapid feedback and make continuous, iterative improvements. Once an MVP is in production, however, you need to take the next step and make sure you can respond effectively when problems inevitably arise. You need a minimum viable response plan: one that keeps your MVP operational without burdening your team.

In this talk, we'll discuss the full lifecycle of an incident and what it means to be robust versus resilient when building a response plan. We'll help you determine who is on-call when, and for which kinds of problems. We'll also talk about notifications, escalations, and how to enable learning from each problem your service encounters. This will give you the basis of a minimum viable response plan, so you and your team have a baseline to start from and continuously improve upon.

Jason Hand

April 09, 2019

Transcript

  3. August 1st 2012

  4. $162,962.96

  5. $325,925.93

  6. $488,888.89

  7. $651,851.85

  8. $814,814.81

  9. $977,777.78

  10. $1,140,740.74
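
The dollar figures on the preceding slides climb by the same amount each time. A minimal sketch of that arithmetic follows. It assumes, which the slides themselves do not state, that the increment is a per-second loss derived from the widely reported Knight Capital figures of roughly $440 million lost in about 45 minutes on August 1st 2012. The names `TOTAL_LOSS_USD`, `OUTAGE_MINUTES`, `running_loss`, and `format_usd` are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed inputs (not stated on the slides): widely reported Knight Capital figures.
TOTAL_LOSS_USD = Decimal("440000000")
OUTAGE_MINUTES = 45

# Average loss per second over the whole outage window.
LOSS_PER_SECOND = TOTAL_LOSS_USD / (OUTAGE_MINUTES * 60)


def running_loss(seconds: int) -> Decimal:
    """Cumulative loss after `seconds`, rounded to whole cents."""
    return (LOSS_PER_SECOND * seconds).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )


def format_usd(amount: Decimal) -> str:
    """Format an amount as dollars with thousands separators."""
    return f"${amount:,.2f}"


if __name__ == "__main__":
    # Print the running total for the first seven seconds.
    for second in range(1, 8):
        print(format_usd(running_loss(second)))
```

Under those assumptions, the seven printed values match the seven amounts shown on the slides, which suggests each slide represents one more second of an unaddressed incident.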

  12. "Computers do what they're told. If they're told to do the wrong thing, they're going to do it and they're going to do it really, really well." - Lawrence Pingree, Gartner

  13. How Will You Respond?

  14. Jason Hand (jasonhand), Senior Cloud Advocate

  17. Complex Systems

  18. Cynefin Framework

  19. Complicated: Sense-Analyze-Respond (Good Practice); Simple: Sense-Categorize-Respond (Best Practice); Chaotic: Act-Sense-Respond (Novel); Complex: Probe-Sense-Respond (Emergent); Disorder

  20. Failure Is Unavoidable

  21. Richard I. Cook, MD - 1998 jhand.co/HowComplexSystemsFail

  22. some think... Control & Process can prevent incidents

  23. jhand.co/RF_Control SREWeekly.com

  25. "Incidents aren't deviations from some idyllic norm: they are the norm." - Rob England, itskeptic. Part of the job

  26. Human Factors & System Safety

  27. Fire & Rescue

  29. React vs Respond

  30. Robustness

  31. Resiliency

  32. "While software can be robust, only humans can be truly resilient." - Ryn Daniels jhand.co/Resilient_Ryn

  33. Minimum Viable (incident response) Plan

  34. Incidents

  36. Lifecycle of an incident

  40. Detection
       - First phase of an incident
       - Tooling has identified a problem
       - Notification has been triggered

  45. Response
       - Notification acknowledged
       - Engineer is explicitly on-call
       - Troubleshooting, querying, diagnosing, and triaging
       - Formulate theories around remediation steps

  48. Remediation
       - Action taken
       - Service restored

  52. Analysis
       - Retrospective discussion of the timeline
       - Opportunity for actionable improvement
       - Deeper learning on "How the system actually works"

  56. Readiness
       - Fewer Unknown Unknowns
       - Learn & improve (people, process, tech)
       - Implement work

  57. How can we shorten each phase?
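
One way to reason about shortening each phase is to timestamp the transitions between phases and measure how long each one took. The sketch below assumes a hypothetical `IncidentTimeline` record; its field names, the mapping of timestamps to phases, and the sample times are illustrative and do not come from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass(frozen=True)
class IncidentTimeline:
    """Hypothetical timestamps marking the transitions of one incident."""
    started: datetime       # problem begins (often only known in hindsight)
    notified: datetime      # tooling identified the problem and sent a notification
    acknowledged: datetime  # on-call engineer acknowledged the notification
    action_taken: datetime  # first remediation action
    restored: datetime      # service restored

    def phase_durations(self) -> Dict[str, timedelta]:
        """Time spent in the detection, response, and remediation phases."""
        return {
            "detection": self.notified - self.started,
            "response": self.action_taken - self.notified,
            "remediation": self.restored - self.action_taken,
        }

    def time_to_acknowledge(self) -> timedelta:
        """How long the notification waited before someone acknowledged it."""
        return self.acknowledged - self.notified

    def longest_phase(self) -> str:
        """Name of the phase that took the most time."""
        durations = self.phase_durations()
        return max(durations, key=durations.get)


# Made-up sample incident for illustration.
sample = IncidentTimeline(
    started=datetime(2019, 4, 9, 10, 0),
    notified=datetime(2019, 4, 9, 10, 12),
    acknowledged=datetime(2019, 4, 9, 10, 15),
    action_taken=datetime(2019, 4, 9, 10, 40),
    restored=datetime(2019, 4, 9, 10, 45),
)

if __name__ == "__main__":
    for phase, duration in sample.phase_durations().items():
        print(f"{phase}: {duration}")
    print("longest phase:", sample.longest_phase())
```

Analysis and readiness are left out of the record because they happen after service is restored; a team could track them separately, for example as the time from restoration to the retrospective.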

  59. 1. Monitoring

  60. Tools: Prometheus, Grafana, DataDog, Splunk, iCinga, Zabbix, Azure Monitoring

  61. 2. Response Plan

  62. Tools: VictorOps, PagerDuty, OpsGenie, Azure Monitoring*

  63. Rotations

  66. Standard
       - Good starting point
       - 24x7

  70. Follow the Sun
       - Multiple shifts
       - Great for distributed teams
       - On-call during "office hours"

  73. Custom
       - Complex scenarios
       - Great for weekend coverage
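
The difference between a standard and a follow-the-sun rotation can be sketched in a few lines. Everything here is an assumption made for illustration: the region names, the eight-hour UTC shift boundaries, the weekly handoff, and the function names are hypothetical and are not taken from the talk or from any on-call tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Sequence


@dataclass(frozen=True)
class Shift:
    team: str
    start_hour: int  # inclusive, in UTC
    end_hour: int    # exclusive, in UTC


# Hypothetical follow-the-sun schedule: three regions, each covering
# eight hours that roughly line up with local office hours.
FOLLOW_THE_SUN = (
    Shift("APAC", 0, 8),
    Shift("EMEA", 8, 16),
    Shift("AMER", 16, 24),
)


def follow_the_sun_on_call(now: datetime,
                           shifts: Sequence[Shift] = FOLLOW_THE_SUN) -> str:
    """Return the team whose shift covers `now` (a timezone-aware datetime)."""
    hour = now.astimezone(timezone.utc).hour
    for shift in shifts:
        if shift.start_hour <= hour < shift.end_hour:
            return shift.team
    raise LookupError(f"no shift covers {hour:02d}:00 UTC")


def standard_on_call(now: datetime, engineers: Sequence[str]) -> str:
    """Standard rotation: one engineer covers 24x7 for a whole ISO week."""
    week = now.isocalendar()[1]
    return engineers[week % len(engineers)]


if __name__ == "__main__":
    now = datetime(2019, 4, 9, 9, 30, tzinfo=timezone.utc)
    print(follow_the_sun_on_call(now))
    print(standard_on_call(now, ["ana", "ben", "chi"]))
```

A custom rotation would typically layer extra rules on top of one of these, for example a separate weekend shift.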

  74. Who Is On-call?

  75. Primary

  76. Secondary

  77. Incident Commander

  78. Escalation Paths
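
An escalation path over the three roles above can be expressed as an ordered list of who gets paged and when. This is a minimal sketch under stated assumptions: the `EscalationStep` type, the `roles_paged` function, and the 5 and 15 minute thresholds are hypothetical and should be tuned to your own team rather than read as a recommendation from the talk.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationStep:
    role: str
    page_after_minutes: int  # minutes without acknowledgement before paging this role


# Hypothetical path: primary is paged immediately, secondary after 5 minutes
# without an acknowledgement, the incident commander after 15.
ESCALATION_PATH = (
    EscalationStep("primary", 0),
    EscalationStep("secondary", 5),
    EscalationStep("incident_commander", 15),
)


def roles_paged(minutes_unacknowledged: int,
                path: Sequence[EscalationStep] = ESCALATION_PATH) -> List[str]:
    """Return every role that should have been paged by this point."""
    return [
        step.role
        for step in path
        if minutes_unacknowledged >= step.page_after_minutes
    ]


if __name__ == "__main__":
    for minutes in (0, 5, 15):
        print(minutes, roles_paged(minutes))
```

Once someone acknowledges the notification, escalation stops; that state change is left out here to keep the sketch short.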

  79. Alerting

  80. Context

  81. Rotation Handoffs & Debriefings

  82. Game Day Exercises

  83. Prioritize

  86. Prepare
       - Prioritize for failure
       - Instrument for observability

  91. Create On-Call
       - Roles & Rotations (Primary, Secondary, IC)
       - Contextual, Actionable Alerting
       - Escalation Paths

  95. Create Routine
       - Handoffs
       - Practice, Practice, Practice

  96. Continuous Improvement

  97. Resources
       - Check out the market leaders in this space: VictorOps, PagerDuty, OpsGenie

  98. Resources
       - Knight Capital Story
       - Control
       - "How Complex Systems Fail"
       - "Shit Happens..."

  99. Thank

  100. @jasonhand