
How To Establish A High Severity Incident Management Program


Tammy Bryant Butow

May 22, 2018

Transcript

  1. Hi, I'm Tammy Butow, SRE @ gremlin.com. I've worked on high severity incidents my entire life, and I've gotten better at it!
  2. Getting errors, app having issues too. Not sure what's happening yet. MySQL? SEV reported by: you. Current SEV level: 1.
  3. Everyone across the company looks in #sevs on Slack and checks the sevs@ mailing list for updates.
  4. Let's temporarily kill queries for this user. We can use a query kill loop or use the support app. Then service will return to normal for everyone.
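A query kill loop like the one mentioned above could be as simple as the following sketch (my own illustration, assuming MySQL and the mysql-connector-python driver; the host, credentials, target user, and loop interval are hypothetical):

```python
# Minimal sketch of a query kill loop, assuming MySQL and mysql-connector-python.
# Connection details, target user, and loop interval are hypothetical.
import time
import mysql.connector

TARGET_USER = "batch_customer"  # hypothetical: the customer whose queries we kill

conn = mysql.connector.connect(host="db.internal", user="ops", password="***")
cursor = conn.cursor()

while True:
    # Find statements currently running as the target user.
    cursor.execute(
        "SELECT id FROM information_schema.processlist "
        "WHERE user = %s AND command = 'Query'",
        (TARGET_USER,),
    )
    for (query_id,) in cursor.fetchall():
        # KILL QUERY terminates the statement but keeps the connection open.
        cursor.execute(f"KILL QUERY {int(query_id)}")
    time.sleep(1)
```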
  5. Our evidence backpack: it's the API; it's one user; it's a heavier workload; our rate limiting & throttling can't handle this workload; we temporarily resolved it by killing queries from this customer.
  6. They do batch-style processing using our API. They plan to do it Monday 7pm every week. How can we better support it long-term?
  7. How To Establish SEV levels - Diabetes:
     SEV Level | Description                 | Target resolution time  | Who is notified
     SEV 0     | Catastrophic service impact | Resolve within 10 min   | Ambulance
     SEV 1     | Critical service impact     | Resolve within 8 hours  | Neighbour & best friend
     SEV 2     | High service impact         | Resolve within 24 hours | Best friend
  8. How To Establish SEV levels:
     SEV Level | Description                 | Target resolution time  | Who is notified
     SEV 0     | Catastrophic service impact | Resolve within 15 min   | Entire company
     SEV 1     | Critical service impact     | Resolve within 8 hours  | Teams working on the SEV & CTO
     SEV 2     | High service impact         | Resolve within 24 hours | Teams working on the SEV
  9. SEV levels for data loss:
     SEV Level | Data loss impact
     SEV 0     | Loss of customer data
     SEV 1     | Loss of primary backup
     SEV 2     | Loss of secondary backup
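Encoded as data, SEV level definitions like the ones on these slides might look like the following sketch (the structure and field names are my own, not from the talk):

```python
# Hypothetical encoding of the SEV levels above; field names and structure are assumptions.
SEV_LEVELS = {
    0: {"description": "Catastrophic service impact",
        "target_resolution": "15 min",
        "notify": ["entire company"]},
    1: {"description": "Critical service impact",
        "target_resolution": "8 hours",
        "notify": ["teams working on the SEV", "CTO"]},
    2: {"description": "High service impact",
        "target_resolution": "24 hours",
        "notify": ["teams working on the SEV"]},
}
```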
  10. We measure this SEV as: 0.2% × 30 min (score: 6) for WWW, and 0.11% × 30 min (score: 3.3) for API.
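One way to read those figures, as a quick calculation (an assumption on my part: the percentage of affected traffic multiplied by the incident duration in minutes):

```python
# Hypothetical reading of the slide's arithmetic: percent affected x duration in minutes.
def sev_impact_score(percent_affected: float, duration_minutes: float) -> float:
    return round(percent_affected * duration_minutes, 2)

print(sev_impact_score(0.2, 30))   # WWW -> 6.0
print(sev_impact_score(0.11, 30))  # API -> 3.3
```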
  11. Technical issues: dependency failure, region/zone failure, provider failure, overheating, PDU failure, network upgrades, rack failures, core switch failures, connectivity issues, flaky DNS, misconfigured machines, bugs, corrupt or unavailable backups. Cultural issues: lack of knowledge sharing, lack of knowledge handover, lack of on-call training, lack of chaos engineering, lack of an incident management program, lack of documentation and playbooks, lack of alerts and pages, lack of effective alerting thresholds, lack of backup strategy.
  12. [Image: calm kid calling on the phone] Calling for help when an incident happens is awesome!
  13. Create Your Own Incident Management Program: 1. Determine how you will measure SEVs. 2. Determine your SEV levels. 3. Set your SLOs. 4. Create your IMOC (Incident Manager On-Call) rotation. 5. Start using automation tooling for SEVs (a sketch follows below). 6. Build a critical service dashboard.
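As an illustration of step 5, here is a minimal sketch of SEV automation that opens a dedicated incident channel and announces it in #sevs, assuming the slack_sdk Python client; the helper function, channel naming convention, and token handling are my own, not the speaker's tooling:

```python
# Minimal SEV automation sketch using slack_sdk (assumed dependency).
# The open_sev helper and channel naming convention are hypothetical.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_sev(sev_id: int, level: int, summary: str) -> None:
    # Create a dedicated channel for this incident, e.g. #sev-42.
    channel = client.conversations_create(name=f"sev-{sev_id}")["channel"]["id"]
    # Announce the SEV in the company-wide #sevs channel.
    client.chat_postMessage(
        channel="#sevs",
        text=f"SEV {level} opened: {summary}. Follow along in <#{channel}>.",
    )

open_sev(sev_id=42, level=1, summary="Elevated MySQL errors on WWW and API")
```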
  14. Learn from and help others on this journey: join the Chaos & Reliability Community at gremlin.com/community. Thank you! @tammybutow, [email protected], gremlin.com/slack