Velocity 2018 - How To Establish A High Severity Incident Management Program

Tammy Bütow
June 12, 2018

https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/66481

How to establish a high-severity incident management program
Tammy Butow (Gremlin)
9:00am–12:30pm Tuesday, June 12, 2018
Location: 230 A
Level: Beginner
Secondary topics: Resilient, Performant & Secure Distributed Systems

Transcript

  1. HOW TO ESTABLISH A HIGH SEVERITY INCIDENT MANAGEMENT PROGRAM. @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
  2. AGENDA
    09:00 - WELCOME & INTRODUCTIONS
    09:30 - ESTABLISHING YOUR SEV PROGRAM
    10:30 - MORNING BREAK ☕
    11:00 - MEASURING SUCCESS
    11:30 - HANDS-ON PRACTICE
    12:15 - Q & A
    12:30 - THANKS & CYA LATER!
    @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
  3. TAMMY BÜTOW, Principal SRE, Gremlin (@tammybutow)
    ANA MEDINA, Chaos Engineer, Gremlin (@ana_m_medina)
  4. INTRODUCTIONS

  5. SURVIVAL SKILLS FROM THE OUTBACK TO THE CITY.

  6. “Many fears cloud people’s engagement with our wilderness. The fear of snakes, spiders, becoming lost and being alone are all common fears. Survival skills can replace fear with respect for, and trust in, nature. Such knowledge enables people to walk freely and feel safer in our natural environment.”
  7. HOW TO SURVIVE A SNAKE BITE
    1. TRUST (NOT FEAR!)
    2. CALL FOR HELP
    3. BANDAGE & IMMOBILISE LIMB
    4. STOP SPREAD OF POISON
    5. VENOM DETECTION KIT
    6. ANTIVENOM
    “SURVIVAL IS A MIND GAME” - BOB COOPER
  8. HOW TO SURVIVE A SEV
    1. TRUST (NOT FEAR!)
    2. CALL FOR HELP
    3. APPLY BANDAGE
    4. STOP SPREAD
    5. DIAGNOSIS OF ISSUE
    6. TECHNICAL RESOLUTION
    “SURVIVAL IS A MIND GAME” - BOB COOPER
  9. KNOW THE 5 KEYS TO WILDERNESS SURVIVAL
    1. KNOW HOW TO BUILD A SHELTER
    2. KNOW HOW TO SIGNAL FOR HELP
    3. KNOW WHAT TO EAT & HOW TO FIND IT
    4. KNOW HOW TO BUILD AND MAINTAIN A FIRE
    5. KNOW HOW TO FIND WATER AND PREPARE SAFE WATER TO DRINK
  10. KNOW THE 5 KEYS TO SEV SURVIVAL
    1. KNOW HOW TO FIND SHELTER & WIFI
    2. KNOW HOW TO SIGNAL FOR HELP
    3. KNOW YOUR CRITICAL SYSTEMS & HOW TO ASSESS THEIR HEALTH
    4. KNOW HOW TO BANDAGE ISSUES AND STOP THEIR SPREAD
    5. KNOW HOW TO PERFORM TECHNICAL EMERGENCY RESOLUTION
  11. THE PRIMARY OBJECTIVE OF THIS WORKSHOP IS TO PROVIDE AN UNDERSTANDING OF HIGH SEVERITY INCIDENT MANAGEMENT AND ITS RELATED PRACTICES IN AN EASY AND SYSTEMATIC WAY, INCLUDING PRACTICE AS WELL AS THEORY.
  12. SUCCESS IS BASED ON FOUR ASPECTS: TRUST, KNOWLEDGE, PRACTICE & MEASUREMENT
  13. SURVEY: CURRENT STATE OF INCIDENT MANAGEMENT https://goo.gl/Yma4d2

  14. HOW DO YOU EMPOWER EVERYONE IN YOUR COMPANY TO IDENTIFY PROBLEMS AND SIGNAL FOR HELP?
  15. Insert illustration of a building

  16. HAS THAT EVER HAPPENED WHERE YOU’VE WORKED?

  17. EMPOWER EVERYONE.

  18. #velocityconf

  19. AGENDA
    09:00 - WELCOME & INTRODUCTIONS
    09:30 - ESTABLISHING YOUR SEV PROGRAM
    10:30 - MORNING BREAK ☕
    11:00 - MEASURING SUCCESS
    11:30 - HANDS-ON PRACTICE
    12:15 - Q & A
    12:30 - THANKS & CYA LATER!
    @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
  20. @TAMMYBUTOW @ANA_M_MEDINA GREMLIN HOW TO ESTABLISH A HIGH SEVERITY INCIDENT MANAGEMENT PROGRAM
  21. What is High Severity Incident Management?

  22. None
  23. SEVs

  24. What are the 4 most common types of SEVs?

  25. 1. The Availability Drop

  26. None
  27. 2. The Broken Feature

  28. None
  29. 3. The Loss of Data

  30. Cry baby

  31. 4. The Security Risk

  32. None
  33. Let’s take a journey together outside this room

  34. Put on your SEV backpack

  35. Monday 7pm

  36. You’re out having dinner

  37. You start getting errors from the database for your service. “MySQL server has gone away”
  38. You use your SEV tool to get help
  39. Getting errors, app having issues too. Not sure what’s happening yet. MySQL?
    SEV Reported by you: Current SEV Level: 1
  40. IMOC is auto-paged and on the case

  41. The SEV is automatically named

  42. SEV 1 Fast Frog

  43. The IMOC finds a TLOC to resolve the issue

  44. Tons of teams across the company getting alerts. It’s an alert storm!
  45. Insert storm pic

  46. Everyone across the company looks in #sevs on Slack and checks the sevs@ mailing list for updates
  47. Threads running is high, the database is hot!

  48. None
  49. Database is being hammered!

  50. What’s happening?

  51. TLOC is looking at the database queries

  52. None
  53. Normal queries, nothing has changed

  54. More queries than usual

  55. Where are they coming from?

  56. Our queries have metadata for the service

  57. 1. It’s the API

  58. PUT THAT EVIDENCE IN YOUR BACKPACK

  59. Alarm! Availability SLA is breached for WWW and API

  60. SEV is upgraded to a SEV 0

  61. SEV 0 Fast Frog

  62. Automation in full-force

  63. Executive Leadership Team are auto-emailed

  64. We have only 15 min remaining to resolve the SEV 0
  65. 15 MINUTES

  66. Keep going!

  67. None
  68. Start killing queries to restore service

  69. None
  70. Are the queries in the slow log from one user or many users?
  71. 2. It’s mostly one user

  72. PUT THAT EVIDENCE IN YOUR BACKPACK

  73. Is the one user legitimate?

  74. What kind of workload are they performing?

  75. 3. It’s a heavy workload, heavier than we usually get.
  76. PUT THAT EVIDENCE IN YOUR BACKPACK

  77. Do we have rate limiting and throttling?

  78. 4. It isn’t working well in this situation

  79. PUT THAT EVIDENCE IN YOUR BACKPACK

  80. Let’s temporarily kill queries for this user. We can use a query kill loop or use the support app. Then service will return to normal for everyone.
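A minimal sketch of one pass of the "query kill loop" mentioned on slide 80. It assumes the heavy user maps to a MySQL account name and that rows come from `SHOW FULL PROCESSLIST`; the function name, the sample rows, and the `batch_customer` account are hypothetical. The sketch only builds `KILL QUERY` statements and leaves executing them (and repeating the pass) to whichever client the team already uses.

```python
from typing import Dict, Iterable, List


def kill_statements(processlist: Iterable[Dict[str, object]],
                    user: str,
                    min_seconds: int = 0) -> List[str]:
    """Build KILL QUERY statements for one user's running queries."""
    statements = []
    for row in processlist:
        is_target = row["User"] == user
        is_query = row["Command"] == "Query"  # skip sleeping connections
        is_old_enough = int(row["Time"]) >= min_seconds
        if is_target and is_query and is_old_enough:
            statements.append(f"KILL QUERY {int(row['Id'])};")
    return statements


# Hypothetical rows shaped like SHOW FULL PROCESSLIST output.
sample_rows = [
    {"Id": 101, "User": "batch_customer", "Command": "Query", "Time": 42},
    {"Id": 102, "User": "www", "Command": "Query", "Time": 1},
    {"Id": 103, "User": "batch_customer", "Command": "Sleep", "Time": 300},
]

for statement in kill_statements(sample_rows, "batch_customer"):
    print(statement)
```

`KILL QUERY` ends the running statement but keeps the connection open, which fits a temporary mitigation like the one on the slide.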
  81. SLA is back on track. MITIGATED the SEV 0 in 5 minutes!
  82. Let’s open up our evidence backpack

  83. Our Evidence Backpack
    It’s the API
    It’s one user
    It’s a heavier workload
    Our rate limiting & throttling can’t handle this workload
    We temp resolved by killing queries from this customer
  84. Let’s check what rate limiting and throttling is currently set to
  85. We need to fix that, add an action item.

  86. Let’s also reach out to the customer and understand this heavy workload they are performing
  87. They do batch-style processing using our API. They plan to do it Monday 7pm every week. How can we better support it long-term?
  88. That’s what a SEV 0 looks like

  89. What are SEV levels?

  90. How To Establish SEV levels
    SEV Level | Description                 | Target resolution time  | Who is notified
    SEV 0     | Catastrophic Service Impact | Resolve within 15 min   | Entire company
    SEV 1     | Critical Service Impact     | Resolve within 8 hours  | Teams working on SEV & CTO
    SEV 2     | High Service Impact         | Resolve within 24 hours | Teams working on SEV
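The SEV level table on slide 90 could be encoded as data so tooling can check target resolution times. This is only a sketch: the `SevLevel` class, the `SEV_LEVELS` mapping, and `is_past_target` are hypothetical names, and the values are copied from the slide.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SevLevel:
    level: int
    description: str
    target_resolution: timedelta
    notify: str


# Values taken from the slide; the structure itself is an assumption.
SEV_LEVELS = {
    0: SevLevel(0, "Catastrophic Service Impact", timedelta(minutes=15), "Entire company"),
    1: SevLevel(1, "Critical Service Impact", timedelta(hours=8), "Teams working on SEV & CTO"),
    2: SevLevel(2, "High Service Impact", timedelta(hours=24), "Teams working on SEV"),
}


def is_past_target(level: int, elapsed: timedelta) -> bool:
    """Return True when a SEV has been open longer than its target resolution time."""
    return elapsed > SEV_LEVELS[level].target_resolution


print(is_past_target(0, timedelta(minutes=20)))
print(is_past_target(1, timedelta(hours=2)))
```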
  91. How do your resolution times impact SLOs/SLAs?

  92. What is an SLA of 99.99%?

  93. Daily: 8.6s Weekly: 1m 0.5s Monthly: 4m 23.0s Yearly: 52m 35.7s
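The figures on slide 93 can be reproduced with a short calculation, reading them as the allowed downtime at 99.99% availability. The sketch assumes an average year of 365.2425 days and a month of one twelfth of that; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.2425  # assumption: average Gregorian year


def downtime_budget_seconds(availability_percent: float, period_days: float) -> float:
    """Seconds of downtime allowed in a period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


def format_duration(seconds: float) -> str:
    """Format seconds as 'Xm Y.Ys', dropping the minutes part when it is zero."""
    minutes, remainder = divmod(seconds, 60)
    if minutes:
        return f"{int(minutes)}m {remainder:.1f}s"
    return f"{remainder:.1f}s"


periods = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": DAYS_PER_YEAR / 12,
    "Yearly": DAYS_PER_YEAR,
}
for name, days in periods.items():
    print(f"{name}: {format_duration(downtime_budget_seconds(99.99, days))}")
```

Under these assumptions the loop prints the same four values as the slide.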
  94. What is 52 minutes in a year? Less than 1 meeting
  95. How can you be ready to sprint to mitigation at any moment?
  96. What should a SEV not look like?

  97. None
  98. What is the full lifecycle of a SEV?

  99. None
  100. How are SEVs measured?

  101. % loss * outage duration

  102. How do you create SEV levels for your company?

  103. SEV levels for data loss
    SEV Level | Data Loss Impact
    SEV 0     | Loss of customer data
    SEV 1     | Loss of primary backup
    SEV 2     | Loss of secondary backup
  104. None
  105. What does a SEV look like?

  106. None
  107. We measure this SEV as:
    0.2% * 30 min (6) for WWW
    0.11% * 30 min (3.3) for API
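The measurement on slides 101 and 107 (% loss * outage duration) works out as follows. The function name is hypothetical; the inputs are the WWW and API figures from the slide.

```python
def sev_impact(percent_loss: float, outage_minutes: float) -> float:
    """Impact score: percentage of availability lost multiplied by outage minutes."""
    return percent_loss * outage_minutes


www = sev_impact(0.2, 30)
api = sev_impact(0.11, 30)

# Round for display, since floating-point products can carry tiny errors.
print(f"WWW: {round(www, 2)}")
print(f"API: {round(api, 2)}")
```

A single number per service makes it possible to compare SEVs of different lengths and depths on one scale.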
  108. How do you ensure your team operates effectively during a SEV 0?
  109. Incident Manager On-Call (IMOC)

  110. Small Rotation of Engineering Leaders

  111. One person is on-call in this role at any point in time
  112. Can be paged by emailing imoc-pager@

  113. Wide knowledge of services and engineering teams

  114. Tech Lead On-Call (TLOC)

  115. The engineer responsible for resolving the SEV

  116. Deep knowledge of own service area

  117. Deep knowledge of upstream and downstream dependencies

  118. How do you set up IMOCs for success during SEV 0s?

  119. How do you categorise SEVs?

  120. None
  121. How do you empower everyone in your company to fix things that are broken?
  122. gremlin.com/community

  123. gremlin.com/community

  124. How should you name SEVs?

  125. 0086343430

  126. SEV 0 Fast Frog
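A sketch of automatic SEV naming in the style of "SEV 0 Fast Frog". The slides do not describe how the names are generated, so the word lists and both functions are assumptions. Keeping the nickname stable while the level changes mirrors the walkthrough, where "SEV 1 Fast Frog" becomes "SEV 0 Fast Frog".

```python
import random

# Hypothetical word lists; any memorable adjective and animal pairs would do.
ADJECTIVES = ["Fast", "Quiet", "Brave", "Lucky"]
ANIMALS = ["Frog", "Wombat", "Koala", "Emu"]


def sev_name(level: int, rng: random.Random) -> str:
    """Pick a memorable name for a new SEV at the given level."""
    return f"SEV {level} {rng.choice(ADJECTIVES)} {rng.choice(ANIMALS)}"


def rename_for_level(name: str, new_level: int) -> str:
    """Keep the nickname when a SEV is upgraded or downgraded."""
    _, _, nickname = name.split(" ", 2)
    return f"SEV {new_level} {nickname}"


print(sev_name(1, random.Random()))
print(rename_for_level("SEV 1 Fast Frog", 0))
```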

  127. What causes SEVs?

  128. Pareto Principle

  129. Technical & Cultural Issues

  130. What are some of the expected issues you are likely to experience?
  131. Technical Issues
      Dependency Failure
      Region/Zone Failure
      Provider Failure
      Overheating
      PDU failure
      Network upgrades
      Rack failures
      Core Switch failures
      Connectivity issues
      Flaky DNS
      Misconfigured machines
      Bugs
      Corrupt or unavailable backups
    Cultural Issues
      Lack of knowledge sharing
      Lack of knowledge handover
      Lack of on-call training
      Lack of chaos engineering
      Lack of an incident management program
      Lack of documentation and playbooks
      Lack of alerts and pages
      Lack of effective alerting thresholds
      Lack of backup strategy
  132. How do you prevent SEVs from repeating?

  133. Let’s look at high-impact practices…

  134. An Incident Management Program

  135. A helpful IMOC Rotation

  136. Automation Tooling For Incident Management

  137. Chaos Engineering

  138. [Insert calm kid calling on the phone] Calling for help when an incident happens is awesome!
  139. HANDS-ON EXERCISE (GROUPS OF 3 OR 4)

  140. CREATE YOUR OWN INCIDENT MANAGEMENT PROGRAM
    1. DETERMINE HOW YOU WILL MEASURE SEVS
    2. DETERMINE SEV LEVELS
    3. SET YOUR SLOS
    4. CREATE YOUR IMOC ROTATION
    5. START USING AUTOMATION TOOLING FOR SEVS
    6. BUILD A CRITICAL SERVICE DASHBOARD
  141. AGENDA
    09:00 - WELCOME & INTRODUCTIONS
    09:30 - ESTABLISHING YOUR SEV PROGRAM
    10:30 - MORNING BREAK ☕
    11:00 - MEASURING SUCCESS
    11:30 - HANDS-ON PRACTICE
    12:15 - Q & A
    12:30 - THANKS & CYA LATER!
    @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
  142. ENJOY YOUR MORNING BREAK ☕ @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
  143. AGENDA
    09:00 - WELCOME & INTRODUCTIONS
    09:30 - ESTABLISHING YOUR SEV PROGRAM
    10:30 - MORNING BREAK ☕
    11:00 - MEASURING SUCCESS
    11:30 - HANDS-ON PRACTICE
    12:15 - Q & A
    12:30 - THANKS & CYA LATER!
    @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
  144. MEASURING THE SUCCESS OF YOUR INCIDENT MANAGEMENT PROGRAM @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
  145. MEASURE YOUR INCIDENT MANAGEMENT PROGRAM
    1. ENSURING YOUR TEAM OPERATES EFFECTIVELY DURING A SEV 0
    2. SETTING UP IMOCS FOR SUCCESS DURING SEV 0s
    3. EMPOWERING EVERYONE IN YOUR COMPANY TO REPORT SEVs
    4. SEV CAUSES
    5. CATEGORISING SEVs
    6. PREVENTING SEVs FROM REPEATING
    7. USING CHAOS ENGINEERING FOR SEV PREVENTION
  146. GOAL: ENSURING YOUR TEAM OPERATES EFFECTIVELY DURING A SEV 0
    MEASURED BY: SURVEY FEEDBACK & TTR
  147. GOAL: SETTING UP IMOCS FOR SUCCESS DURING SEV 0s
    MEASURED BY: IMOC & TLOC SURVEYS
  148. GOAL: EMPOWERING EVERYONE IN YOUR COMPANY TO RECORD SEVS
    MEASURED BY: TTD & COMPANY-WIDE SURVEY
  149. GOAL: UNDERSTAND SEV CAUSES
    MEASURED BY: TAG SEVS BY CAUSES
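Once SEVs are tagged by cause, a Pareto-style count (the deck mentions the Pareto Principle on slide 128) shows which few causes account for most SEVs. This is a sketch under assumptions: the tag strings are made up for illustration, and the 80% cut-off is the conventional Pareto share rather than a figure from the slides.

```python
from collections import Counter
from typing import Iterable, List


def top_causes(cause_tags: Iterable[str], share: float = 0.8) -> List[str]:
    """Return the most frequent causes that together cover `share` of all SEVs."""
    counts = Counter(cause_tags)
    total = sum(counts.values())
    selected, covered = [], 0
    for cause, count in counts.most_common():
        if covered >= share * total:
            break
        selected.append(cause)
        covered += count
    return selected


# Hypothetical cause tags, one per SEV.
tags = (["dependency-failure"] * 5 + ["flaky-dns"] * 3
        + ["bugs"] + ["misconfigured-machine"])
print(top_causes(tags))
```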

  150. GOAL: CATEGORISE SEVS
    MEASURED BY: TAGS FOR SERVICE, TEAM, DEPARTMENT ETC.
  151. GOAL: PREVENT SEVS FROM REPEATING
    MEASURED BY: TBF FOR SEVS
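The slides use TTD, TTR and TBF without expanding them. The sketch below assumes the usual readings: time to detect, time to resolve (measured here from the start of the SEV), and time between failures (from one SEV's resolution to the next SEV's start). The `Sev` record and the sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Sev:
    started: datetime
    detected: datetime
    resolved: datetime


def ttd(sev: Sev) -> timedelta:
    """Time to detect: from the start of impact to detection."""
    return sev.detected - sev.started


def ttr(sev: Sev) -> timedelta:
    """Time to resolve: from the start of impact to resolution."""
    return sev.resolved - sev.started


def tbf(sevs: List[Sev]) -> List[timedelta]:
    """Time between failures: gap from each resolution to the next start."""
    ordered = sorted(sevs, key=lambda s: s.started)
    return [later.started - earlier.resolved
            for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical SEVs one week apart.
first = Sev(datetime(2018, 6, 4, 19, 0), datetime(2018, 6, 4, 19, 5),
            datetime(2018, 6, 4, 19, 20))
second = Sev(datetime(2018, 6, 11, 19, 0), datetime(2018, 6, 11, 19, 2),
             datetime(2018, 6, 11, 19, 10))
print(ttd(first), ttr(first), tbf([first, second]))
```

A growing TBF for SEVs with the same cause tag is one signal that a fix has held.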

  152. GOAL: USE CHAOS ENGINEERING TO EMPOWER YOUR TEAMS TO PREVENT SEVS
    MEASURED BY: SEVS WHICH HAVE BEEN REPRODUCED THROUGH CHAOS ENGINEERING.
  153. WHAT ELSE CAN YOU MEASURE?

  154. METRICS FOR YOUR INCIDENT MANAGEMENT PROGRAM
    1. A SEV DASHBOARD
    2. CREATE & SEND SEV REPORTS
    3. CREATE & SEND KPI REPORTS
    4. SET GOALS FOR SEV REDUCTION
    5. ESTABLISH A MONTHLY SRE SYNC
    6. SEV TRAINING INCLUDING GAMEDAYS & CHAOSDAYS
  155. AGENDA
    09:00 - WELCOME & INTRODUCTIONS
    09:30 - ESTABLISHING YOUR SEV PROGRAM
    10:30 - MORNING BREAK ☕
    11:00 - MEASURING SUCCESS
    11:30 - HANDS-ON PRACTICE
    12:15 - Q & A
    12:30 - THANKS & CYA LATER!
    @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
  156. HANDS-ON PRACTICE @TAMMYBUTOW @ANA_M_MEDINA GREMLIN

  157. BRAINSTORM: HOW DO YOU REDUCE TTD FOR YOUR TOP 5 CRITICAL SERVICES?
  158. • Step 0 - Incident classification, including: SEV descriptions and levels, the SEV timeline and the TTD timeline
    • Step 1 - Organization-wide critical service monitoring, including: key dashboards and KPI metrics emails
    • Step 2 - Service ownership and metrics, including: measuring TTD by service, service triage, service ownership, building a service ownership service (SOS) and service alerting
    • Step 3 - On-call principles, including: Pareto principle, rotation structure, alert threshold maintenance and escalation practices
    • Step 4 - Chaos Engineering, including: chaos days and continuous chaos
    • Step 5 - Self-healing systems, including: when automation incidents occur, monitoring and metrics for self-healing system automation
  159. PRACTICE A SEV REVIEW

  160. BAD POST-SEV REVIEW EXAMPLE

  161. None
  162. GOOD POST-SEV REVIEW EXAMPLE

  163. None
  164. DETERMINE HOW YOU WOULD CREATE THE FOLLOWING:
    1. A SEV DASHBOARD
    2. CREATE & SEND SEV REPORTS
    3. CREATE & SEND KPI REPORTS
    4. SET GOALS FOR SEV REDUCTION
    5. ESTABLISH A MONTHLY SRE SYNC
    6. SEV TRAINING INCLUDING GAMEDAYS & CHAOSDAYS
  165. SHARE SOMETHING YOU WILL TAKE BACK TO YOUR COMPANY WITH EVERYONE
  166. TODAY IS A BEAUTIFUL DAY TO START A HIGH SEVERITY INCIDENT MANAGEMENT PROGRAM
  167. Learn from & help others on this journey: Join the Chaos & Reliability Community
    gremlin.com/community
    Thank you
    tammy@gremlin.com ana@gremlin.com
    gremlin.com/slack
    @TAMMYBUTOW @ANA_M_MEDINA GREMLIN