
How To Establish A High Severity Incident Management Program


Tammy Bryant Butow

May 22, 2018



  2. What do jelly beans have to do with incident management?

  3. Insert kid crying

  4. Insert kid running around

  5. Insert calm kid calling on the phone

  6. Insert Jelly Beans

  7. Insert photo of my mum and me

  8. Hi I’m Tammy Butow, SRE @ gremlin.com. I’ve worked on high severity incidents my entire life, and I’ve gotten better at it!
  9. 10+ years.

  10. Gremlin, Dropbox, DigitalOcean, National Australia Bank, Queensland University of Technology, my home in Eastwood, NSW, Australia
  11. How do you empower everyone in your company to identify problems and get help?
  12. Empower Everyone.

  13. Insert illustration of a building

  14. Has that ever happened where you’ve worked?


  16. One common misconception…

  17. All people who resolve incidents are heroes.

  18. Hero vs Helper

  19. I’m a helper.

  21. What is High Severity Incident Management?

  22. SEVs

  23. What are the 4 most common types of SEVs?

  24. 1. The Availability Drop

  26. 2. The Broken Feature

  28. 3. The Loss of Data

  29. Cry baby

  30. 4. The Security Risk

  32. Let’s take a journey together outside this room

  33. Put on your SEV backpack

  34. Monday 7pm

  35. You’re out on a date enjoying a lovely dinner

  36. You start getting errors from the database for your service: “MySQL server has gone away”.
  37. You use the SEV tool to get help

  38. Getting errors, app having issues too. Not sure what’s happening yet. MySQL?

    SEV Reported by you: Current SEV Level: 1
  39. IMOC is auto-paged and on the case

  40. The SEV is automatically named

  41. SEV 1 Fast Frog

  42. The IMOC finds a TLOC to resolve the issue

  43. Tons of teams across the company are getting alerts. It’s an alert storm!
  44. Insert storm pic

  45. Everyone across the company looks in #sevs on Slack and checks the sevs@ mailing list for updates
  46. Threads running is high, the database is hot!

  48. Database is being hammered!

  49. What’s happening?

  50. TLOC is looking at the database queries

  52. Normal queries, nothing has changed

  53. More queries than usual

  54. Where are they coming from?

  55. Our queries have metadata for the service

  56. 1. It’s the API


  58. Alarm! Availability SLA is breached for WWW and API

  59. SEV is upgraded to a SEV 0

  60. SEV 0 Fast Frog

  61. Automation in full-force

  62. Executive Leadership Team are auto-emailed

  63. We have only 15 min remaining to resolve the SEV

  64. 15 MINUTES

  65. Keep going!

  66. Start killing queries to restore service

  68. Are the queries in the slow log from one user or many users?
  69. 2. It’s mostly one user


  71. Is the one user legitimate?

  72. What kind of workload are they performing?

  73. 3 — It’s a heavy workload, heavier than we usually


  75. Do we have rate limiting and throttling?

  76. 4 — It isn’t working well in this situation


  78. Let’s temporarily kill queries for this user. We can use a query kill loop or use the support app. Then service will return to normal for everyone.
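A query kill loop like the one mentioned above could be sketched as follows. This is a minimal illustration, not the tooling used in the talk. It assumes rows shaped like the output of MySQL's `SHOW FULL PROCESSLIST`; the user name `batch_customer`, the `min_seconds` threshold, and the function name are hypothetical. It only builds `KILL QUERY` statements and leaves out connecting to a server and repeating the pass on a timer.

```python
def build_kill_statements(processlist, user, min_seconds=0):
    """Return KILL QUERY statements for one user's running queries.

    `processlist` is a list of dicts using the column names MySQL reports
    in SHOW FULL PROCESSLIST (Id, User, Command, Time, Info).
    """
    statements = []
    for row in processlist:
        if row["User"] != user:
            continue  # leave every other user's work alone
        if row["Command"] != "Query":
            continue  # skip sleeping or idle connections
        if row["Time"] < min_seconds:
            continue  # skip queries that have only just started
        statements.append(f"KILL QUERY {row['Id']};")
    return statements


# Hypothetical processlist snapshot, for illustration only.
sample = [
    {"Id": 101, "User": "batch_customer", "Command": "Query", "Time": 42, "Info": "SELECT ..."},
    {"Id": 102, "User": "batch_customer", "Command": "Sleep", "Time": 3, "Info": None},
    {"Id": 103, "User": "www", "Command": "Query", "Time": 1, "Info": "SELECT ..."},
    {"Id": 104, "User": "batch_customer", "Command": "Query", "Time": 2, "Info": "SELECT ..."},
]

for statement in build_kill_statements(sample, user="batch_customer", min_seconds=5):
    print(statement)
```

`KILL QUERY` ends the running statement but keeps the connection open, which suits a temporary mitigation: the customer's client stays connected while the database recovers.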
  79. SLA is back on-track MITIGATED the SEV 0 in 5

  80. Let’s open up our evidence backpack

  81. Our Evidence Backpack

    It’s the API
    It’s one user
    It’s a heavier workload
    Our rate limiting & throttling can’t handle this workload
    We temp resolved by killing queries from this customer
  82. Let’s check what rate limiting and throttling is currently set

  83. We need to fix that, add an action item.
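The deck does not say how its rate limiting works. As one common approach that an action item like this might lead to, here is a minimal per-user token bucket sketch. The class name, the rate and capacity values, and the caller-supplied `now` timestamp are all assumptions made for illustration.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full so normal traffic is not penalised
        self.last_refill = 0.0

    def allow(self, now, cost=1.0):
        """Return True if a request costing `cost` tokens may proceed at time `now`."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerUserLimiter:
    """Keep one bucket per user so a single heavy user cannot starve the rest."""

    def __init__(self, rate_per_sec, capacity):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.buckets = {}

    def allow(self, user, now, cost=1.0):
        bucket = self.buckets.setdefault(
            user, TokenBucket(self.rate_per_sec, self.capacity)
        )
        return bucket.allow(now, cost)
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the sketch easy to test. A heavier `cost` could be charged for batch-style API calls.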

  84. Let’s also reach out to the customer and understand this heavy workload they are performing
  85. They do batch-style processing using our API. They plan to do it Monday 7pm every week. How can we better support it long-term?
  86. That’s what a SEV 0 looks like

  87. What are SEV levels?

  88. How To Establish SEV levels - Diabetes

    SEV Level | Description | Target resolution time | Who is notified
    SEV 0 | Catastrophic Service Impact | Resolve within 10 min | Ambulance
    SEV 1 | Critical Service Impact | Resolve within 8 hours | Neighbour & Best Friend
    SEV 2 | High Service Impact | Resolve within 24 hours | Best Friend
  89. How To Establish SEV levels

    SEV Level | Description | Target resolution time | Who is notified
    SEV 0 | Catastrophic Service Impact | Resolve within 15 min | Entire company
    SEV 1 | Critical Service Impact | Resolve within 8 hours | Teams working on SEV & CTO
    SEV 2 | High Service Impact | Resolve within 24 hours | Teams working on SEV
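The SEV level table above could be kept as data so that tooling can look up targets and notification lists. This is a sketch only: the values come from the slide, while the `SevLevel` class, the field names, and the `minutes_remaining` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SevLevel:
    level: int
    description: str
    target_minutes: int  # target resolution time from the slide, in minutes
    notify: str          # who is notified, as worded on the slide


SEV_LEVELS = {
    0: SevLevel(0, "Catastrophic Service Impact", 15, "Entire company"),
    1: SevLevel(1, "Critical Service Impact", 8 * 60, "Teams working on SEV & CTO"),
    2: SevLevel(2, "High Service Impact", 24 * 60, "Teams working on SEV"),
}


def minutes_remaining(level, minutes_elapsed):
    """Minutes left before the target resolution time; negative once overdue."""
    return SEV_LEVELS[level].target_minutes - minutes_elapsed


print(minutes_remaining(0, 5))
```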
  90. How do your resolution times impact SLOs/SLAs?

  91. What is an SLA of 99.99%?

  92. Daily: 8.6s Weekly: 1m 0.5s Monthly: 4m 23.0s Yearly: 52m
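The downtime figures above follow from multiplying each period by the allowed failure fraction. A minimal sketch, assuming an average year of 365.2425 days and a month of one twelfth of that (an assumption chosen because it reproduces the monthly figure on the slide); the function names are illustrative.

```python
DAY = 24 * 60 * 60       # seconds in a day
WEEK = 7 * DAY
YEAR = 365.2425 * DAY    # assumption: average Gregorian year
MONTH = YEAR / 12        # assumption: one twelfth of that year


def downtime_budget_seconds(sla_percent, period_seconds):
    """Seconds of downtime allowed in a period at the given availability SLA."""
    return period_seconds * (1 - sla_percent / 100)


def format_duration(seconds):
    """Render seconds as 'Xm Y.Ys', dropping the minutes part when it is zero."""
    minutes, rest = divmod(seconds, 60)
    if minutes:
        return f"{int(minutes)}m {rest:.1f}s"
    return f"{rest:.1f}s"


for name, period in [("Daily", DAY), ("Weekly", WEEK), ("Monthly", MONTH), ("Yearly", YEAR)]:
    print(name, format_duration(downtime_budget_seconds(99.99, period)))
```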

  93. What is 52 minutes in a year? Less than 1

  94. How can you be ready to sprint to mitigation at any moment?
  95. What is the full lifecycle of a SEV?

  97. How are SEVs measured?

  98. % loss * outage duration

  99. How do you create SEV levels for your company?

  100. SEV levels for data loss

    SEV Level | Data Loss Impact
    SEV 0 | Loss of customer data
    SEV 1 | Loss of primary backup
    SEV 2 | Loss of secondary backup
  102. What does a SEV look like?

  104. We measure this SEV as:

    0.2% * 30 min (6) for WWW
    0.11% * 30 min (3.3) for API
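The measurement above is the "% loss * outage duration" formula applied per service. A small sketch of that arithmetic, using the figures from the slide; the function name and the dictionary layout are illustrative.

```python
def sev_impact(percent_loss, duration_minutes):
    """Impact score: percentage of availability lost times outage minutes."""
    return percent_loss * duration_minutes


# Figures from the slide: (percent loss, outage duration in minutes) per service.
outage = {"WWW": (0.2, 30), "API": (0.11, 30)}

for service, (loss, minutes) in outage.items():
    print(service, round(sev_impact(loss, minutes), 2))
```

Scoring each service separately keeps a broad, shallow outage comparable with a narrow, deep one.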
  105. How do you ensure your team operates effectively during a SEV 0?
  106. Incident Manager On-Call (IMOC)

  107. Small Rotation of Engineering Leaders

  108. One person is on-call in this role at any point in time
  109. Can be paged by emailing imoc-pager@

  110. Wide knowledge of services and engineering teams
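A small rotation with exactly one IMOC on call at a time could be scheduled with something like the sketch below. The weekly handoff, the start date, and the leader names are assumptions; the deck does not state how the rotation is scheduled.

```python
from datetime import date


def imoc_on_call(rotation, day, rotation_start):
    """Return the single leader on call on `day`, assuming weekly handoffs."""
    weeks_since_start = (day - rotation_start).days // 7
    return rotation[weeks_since_start % len(rotation)]


# Hypothetical rotation of engineering leaders and a hypothetical start date.
leaders = ["leader_a", "leader_b", "leader_c"]
start = date(2018, 5, 21)

print(imoc_on_call(leaders, date(2018, 5, 29), start))
```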

  111. Tech Lead On-Call (TLOC)

  112. The engineer responsible for resolving the SEV

  113. Deep knowledge of own service area

  114. Deep knowledge of upstream and downstream dependencies

  115. How do you set up IMOCs for success during SEV 0s?

  116. How do you categorise SEVs?

  118. How do you empower everyone in your company to fix things that are broken?
  121. How should you name SEVs?

  122. 0086343430

  123. SEV 0 Fast Frog
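Automatic naming in the style of "Fast Frog", in place of a numeric ID, could be sketched as below. The word lists and the hashing scheme are hypothetical; the deck does not describe how its SEV tool picks names.

```python
import hashlib

# Hypothetical word lists; a real tool would want many more entries
# so that names rarely collide.
ADJECTIVES = ["Fast", "Brave", "Quiet", "Sunny"]
ANIMALS = ["Frog", "Otter", "Heron", "Lynx"]


def sev_name(sev_id):
    """Map a SEV id to a stable, memorable 'Adjective Animal' name."""
    digest = hashlib.sha256(sev_id.encode("utf-8")).digest()
    adjective = ADJECTIVES[digest[0] % len(ADJECTIVES)]
    animal = ANIMALS[digest[1] % len(ANIMALS)]
    return f"{adjective} {animal}"


print("SEV 0", sev_name("0086343430"))
```

Deriving the name from the ID keeps it stable across tools, so Slack, email, and dashboards can all refer to the same SEV by the same name.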

  125. What causes SEVs?

  126. Pareto Principle

  127. Technical & Cultural Issues

  128. What are some of the expected issues you are likely to experience?
  129. Technical Issues: Dependency failure; Region/Zone failure; Provider failure; Overheating; PDU failure; Network upgrades; Rack failures; Core switch failures; Connectivity issues; Flaky DNS; Misconfigured machines; Bugs; Corrupt or unavailable backups

    Cultural Issues: Lack of knowledge sharing; Lack of knowledge handover; Lack of on-call training; Lack of chaos engineering; Lack of an incident management program; Lack of documentation and playbooks; Lack of alerts and pages; Lack of effective alerting thresholds; Lack of backup strategy
  130. How do you prevent SEVs from repeating?

  131. Let’s look at high impact practices….

  132. An Incident Management Program

  133. A helpful IMOC Rotation

  134. Automation Tooling For Incident Management

  135. Chaos Engineering

  136. Insert calm kid calling on the phone. Calling for help when an incident happens is awesome!
  137. Calling for help when an incident happens is awesome!

  138. Create Your Own Incident Management Program

    1. Determine how you will measure SEVs
    2. Determine your SEV Levels
    3. Set your SLOs
    4. Create your IMOC rotation
    5. Start using automation tooling for SEVs
    6. Build a critical service dashboard
  139. It’s a beautiful day to start

  140. Learn from and help others on this journey: Join the Chaos & Reliability Community

    gremlin.com/community
    Thank you
    @tammybutow
    tammy@gremlin.com
    gremlin.com/slack