Tammy Bryant Butow
June 12, 2018

Velocity 2018 - How To Establish A High Severity Incident Management Program

https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/66481

How to establish a high-severity incident management program
Tammy Butow (Gremlin)
9:00am–12:30pm Tuesday, June 12, 2018
Location: 230 A
Level: Beginner
Secondary topics: Resilient, Performant & Secure Distributed Systems

Transcript

  1. AGENDA
    09:00 - WELCOME & INTRODUCTIONS
    09:30 - ESTABLISHING YOUR SEV PROGRAM
    10:30 - MORNING BREAK ☕
    11:00 - MEASURING SUCCESS
    11:30 - HANDS-ON PRACTICE
    12:15 - Q & A
    12:30 - THANKS & CYA LATER!
    @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
  2. “Many fears cloud people’s engagement with our wilderness. The fear of snakes, spiders, becoming lost and being alone are all common fears. Survival skills can replace fear with respect for, and trust in, nature. Such knowledge enables people to walk freely and feel safer in our natural environment.”
  3. HOW TO SURVIVE A SNAKE BITE
    1. TRUST (NOT FEAR!)
    2. CALL FOR HELP
    3. BANDAGE & IMMOBILISE LIMB
    4. STOP SPREAD OF POISON
    5. VENOM DETECTION KIT
    6. ANTIVENOM
    “SURVIVAL IS A MIND GAME” - BOB COOPER
  4. HOW TO SURVIVE A SEV
    1. TRUST (NOT FEAR!)
    2. CALL FOR HELP
    3. APPLY BANDAGE
    4. STOP SPREAD
    5. DIAGNOSIS OF ISSUE
    6. TECHNICAL RESOLUTION
    “SURVIVAL IS A MIND GAME” - BOB COOPER
  5. KNOW THE 5 KEYS TO WILDERNESS SURVIVAL
    1. KNOW HOW TO BUILD A SHELTER
    2. KNOW HOW TO SIGNAL FOR HELP
    3. KNOW WHAT TO EAT & HOW TO FIND IT
    4. KNOW HOW TO BUILD AND MAINTAIN A FIRE
    5. KNOW HOW TO FIND WATER AND PREPARE SAFE WATER TO DRINK
  6. KNOW THE 5 KEYS TO SEV SURVIVAL
    1. KNOW HOW TO FIND SHELTER & WIFI
    2. KNOW HOW TO SIGNAL FOR HELP
    3. KNOW YOUR CRITICAL SYSTEMS & HOW TO ASSESS THEIR HEALTH
    4. KNOW HOW TO BANDAGE ISSUES AND STOP THEIR SPREAD
    5. KNOW HOW TO PERFORM TECHNICAL EMERGENCY RESOLUTION
  7. THE PRIMARY OBJECTIVE OF THIS WORKSHOP IS TO PROVIDE AN UNDERSTANDING OF HIGH SEVERITY INCIDENT MANAGEMENT AND ITS RELATED PRACTICES IN AN EASY AND SYSTEMATIC WAY, INCLUDING PRACTICE AS WELL AS THEORY.
  9. Getting errors, app having issues too. Not sure what’s happening yet. MySQL?
    SEV Reported by you: Current SEV Level: 1
  10. Everyone across the company looks in #sevs on Slack and checks the sevs@ mailing list for updates
  11. Let’s temporarily kill queries for this user. We can use a query kill loop or use the support app. Then service will return to normal for everyone.
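The query kill loop mentioned on this slide could look roughly like the sketch below. It assumes process rows shaped like MySQL's `information_schema.PROCESSLIST` (`ID`, `USER`, `COMMAND`, `TIME`). The function name, the user name `batch_customer` and the 5-second threshold are hypothetical, and the sketch only builds `KILL QUERY` statements; it does not connect to a server or run them.

```python
from typing import Iterable, List, Mapping


def build_kill_statements(
    processlist: Iterable[Mapping], user: str, min_seconds: int = 5
) -> List[str]:
    """Return KILL QUERY statements for one user's long-running queries."""
    statements = []
    for row in processlist:
        is_target_user = row["USER"] == user
        is_running_query = row["COMMAND"] == "Query"
        is_long_running = row["TIME"] >= min_seconds
        if is_target_user and is_running_query and is_long_running:
            # KILL QUERY stops the running statement but keeps the connection.
            statements.append(f"KILL QUERY {int(row['ID'])};")
    return statements


# Hypothetical snapshot of the process list during the SEV.
sample_rows = [
    {"ID": 101, "USER": "batch_customer", "COMMAND": "Query", "TIME": 42},
    {"ID": 102, "USER": "batch_customer", "COMMAND": "Sleep", "TIME": 300},
    {"ID": 103, "USER": "other_user", "COMMAND": "Query", "TIME": 60},
    {"ID": 104, "USER": "batch_customer", "COMMAND": "Query", "TIME": 1},
]

print(build_kill_statements(sample_rows, "batch_customer"))
```

In a real loop, the snapshot would be refreshed and the statements executed every few seconds until the workload subsides; idle connections and other users are left alone.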
  12. Our Evidence Backpack
    It’s the API
    It’s one user
    It’s a heavier workload
    Our rate limiting & throttling can’t handle this workload
    We temp resolved by killing queries from this customer
  13. They do batch-style processing using our API. They plan to do it Monday 7pm every week. How can we better support it long-term?
  14. How To Establish SEV levels
    SEV Level | Description | Target resolution time | Who is notified
    SEV 0 | Catastrophic Service Impact | Resolve within 15 min | Entire company
    SEV 1 | Critical Service Impact | Resolve within 8 hours | Teams working on SEV & CTO
    SEV 2 | High Service Impact | Resolve within 24 hours | Teams working on SEV
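The SEV levels on this slide can be kept as data so that tooling can check resolution times against them. The sketch below is one possible encoding: the descriptions, targets and notification lists come from the slide, while the `SevLevel` class and `missed_target` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SevLevel:
    level: int
    description: str
    target_resolution: timedelta
    notified: str


# Values taken from the slide's SEV level table.
SEV_LEVELS = {
    0: SevLevel(0, "Catastrophic Service Impact", timedelta(minutes=15), "Entire company"),
    1: SevLevel(1, "Critical Service Impact", timedelta(hours=8), "Teams working on SEV & CTO"),
    2: SevLevel(2, "High Service Impact", timedelta(hours=24), "Teams working on SEV"),
}


def missed_target(level: int, time_to_resolve: timedelta) -> bool:
    """Return True if the SEV took longer than its target resolution time."""
    return time_to_resolve > SEV_LEVELS[level].target_resolution


# A SEV 0 that took 20 minutes misses the 15-minute target.
print(missed_target(0, timedelta(minutes=20)))
```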
  15. SEV levels for data loss
    SEV Level | Data Loss Impact
    SEV 0 | Loss of customer data
    SEV 1 | Loss of primary backup
    SEV 2 | Loss of secondary backup
  16. We measure this SEV as:
    0.2% * 30 min (6) for WWW
    0.11% * 30 min (3.3) for API
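The arithmetic on this slide is an impact percentage multiplied by the duration in minutes. A small sketch of that calculation follows. The slide does not say what the percentage measures, so treating it as the share of affected requests is an assumption, and the function name is hypothetical.

```python
def sev_impact_score(impact_percent: float, duration_minutes: float) -> float:
    """Multiply the impact percentage by the incident duration in minutes."""
    return impact_percent * duration_minutes


# Figures from the slide: 0.2% for WWW and 0.11% for API, both over 30 minutes.
www_score = sev_impact_score(0.2, 30)
api_score = sev_impact_score(0.11, 30)

print(f"WWW: {www_score:.1f}")
print(f"API: {api_score:.1f}")
```

A single number per surface makes SEVs of different lengths and sizes comparable: a short, wide outage and a long, narrow one can land on the same score.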
  17. Technical Issues: Dependency Failure, Region/Zone Failure, Provider Failure, Overheating, PDU failure, Network upgrades, Rack failures, Core Switch failures, Connectivity issues, Flaky DNS, Misconfigured machines, Bugs, Corrupt or unavailable backups
    Cultural Issues: Lack of knowledge sharing, Lack of knowledge handover, Lack of on-call training, Lack of chaos engineering, Lack of an incident management program, Lack of documentation and playbooks, Lack of alerts and pages, Lack of effective alerting thresholds, Lack of backup strategy
  18. Insert calm kid calling on the phone
    Calling for help when an incident happens is awesome!
  19. CREATE YOUR OWN INCIDENT MANAGEMENT PROGRAM
    1. DETERMINE HOW YOU WILL MEASURE SEVS
    2. DETERMINE SEV LEVELS
    3. SET YOUR SLOS
    4. CREATE YOUR IMOC ROTATION
    5. START USING AUTOMATION TOOLING FOR SEVS
    6. BUILD A CRITICAL SERVICE DASHBOARD
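Step 4 above, the IMOC rotation, could be sketched as a simple weekly round-robin. This is only an illustration: the names, the start date and the weekly hand-off are assumptions, not details from the workshop.

```python
from datetime import date
from typing import Sequence


def imoc_for(day: date, rotation: Sequence[str], rotation_start: date) -> str:
    """Return who is IMOC on a given day, rotating once every 7 days."""
    weeks_elapsed = (day - rotation_start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


# Hypothetical rotation of three people, starting on a Monday.
ROTATION = ["ana", "tammy", "sam"]
ROTATION_START = date(2018, 6, 11)

print(imoc_for(date(2018, 6, 12), ROTATION, ROTATION_START))
```

Counting weeks from a fixed start date, rather than using the calendar week number, keeps the order stable across year boundaries.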
  22. MEASURE YOUR INCIDENT MANAGEMENT PROGRAM
    1. ENSURING YOUR TEAM OPERATES EFFECTIVELY DURING A SEV 0
    2. SETTING UP IMOCS FOR SUCCESS DURING SEV 0s
    3. EMPOWERING EVERYONE IN YOUR COMPANY TO REPORT SEVs
    4. SEV CAUSES
    5. CATEGORISING SEVs
    6. PREVENTING SEVs FROM REPEATING
    7. USING CHAOS ENGINEERING FOR SEV PREVENTION
  23. GOAL: USE CHAOS ENGINEERING TO EMPOWER YOUR TEAMS TO PREVENT SEVS
    MEASURE BY: SEVS WHICH HAVE BEEN REPRODUCED THROUGH CHAOS ENGINEERING.
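One way to track the measure on this slide is the share of past SEVs that have been reproduced in a chaos experiment. The sketch below assumes each SEV record carries a boolean `reproduced_with_chaos` flag; that field name, the function name and the sample records are hypothetical.

```python
from typing import Iterable, Mapping


def chaos_reproduction_rate(sevs: Iterable[Mapping]) -> float:
    """Return the fraction of SEVs reproduced through chaos engineering."""
    sevs = list(sevs)
    if not sevs:
        return 0.0  # no SEVs recorded yet, so nothing to reproduce
    reproduced = sum(1 for sev in sevs if sev["reproduced_with_chaos"])
    return reproduced / len(sevs)


# Hypothetical SEV records for one quarter.
sample_sevs = [
    {"id": "SEV-1", "reproduced_with_chaos": True},
    {"id": "SEV-2", "reproduced_with_chaos": False},
    {"id": "SEV-3", "reproduced_with_chaos": True},
    {"id": "SEV-4", "reproduced_with_chaos": True},
]

print(f"{chaos_reproduction_rate(sample_sevs):.0%}")
```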
  24. METRICS FOR YOUR INCIDENT MANAGEMENT PROGRAM
    1. A SEV DASHBOARD
    2. CREATE & SEND SEV REPORTS
    3. CREATE & SEND KPI REPORTS
    4. SET GOALS FOR SEV REDUCTION
    5. ESTABLISH A MONTHLY SRE SYNC
    6. SEV TRAINING INCLUDING GAMEDAYS & CHAOSDAYS
  26. • Step 0 - Incident classification, including: SEV descriptions and levels, the SEV timeline and the TTD timeline
    • Step 1 - Organization-wide critical service monitoring, including: key dashboards and KPI metrics emails
    • Step 2 - Service ownership and metrics, including: measuring TTD by service, service triage, service ownership, building a service ownership service (SOS) and service alerting
    • Step 3 - On-Call Principles, including: Pareto principle, rotation structure, alert threshold maintenance and escalation practices
    • Step 4 - Chaos Engineering, including: chaos days and continuous chaos
    • Step 5 - Self-Healing Systems, including: when automation incidents occur, monitoring and metrics for self-healing system automation
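"Measuring TTD by service" in Step 2 could be sketched as below. The slide does not expand TTD; this example assumes it means time to detection, measured from when an incident started to when it was detected. The record fields `service`, `started_at` and `detected_at` and the sample incidents are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Mapping


def mean_ttd_minutes_by_service(incidents: Iterable[Mapping]) -> Dict[str, float]:
    """Return the mean time to detection, in minutes, for each service."""
    ttd_by_service = defaultdict(list)
    for incident in incidents:
        delta = incident["detected_at"] - incident["started_at"]
        ttd_by_service[incident["service"]].append(delta.total_seconds() / 60)
    return {service: mean(values) for service, values in ttd_by_service.items()}


# Hypothetical incident records.
sample_incidents = [
    {"service": "www", "started_at": datetime(2018, 6, 4, 19, 0), "detected_at": datetime(2018, 6, 4, 19, 10)},
    {"service": "www", "started_at": datetime(2018, 6, 11, 19, 0), "detected_at": datetime(2018, 6, 11, 19, 20)},
    {"service": "api", "started_at": datetime(2018, 6, 11, 19, 0), "detected_at": datetime(2018, 6, 11, 19, 5)},
]

print(mean_ttd_minutes_by_service(sample_incidents))
```

Grouping by service shows which owners need better alerting, which ties into the service ownership and alerting items in the same step.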
  27. DETERMINE HOW YOU WOULD CREATE THE FOLLOWING:
    1. A SEV DASHBOARD
    2. CREATE & SEND SEV REPORTS
    3. CREATE & SEND KPI REPORTS
    4. SET GOALS FOR SEV REDUCTION
    5. ESTABLISH A MONTHLY SRE SYNC
    6. SEV TRAINING INCLUDING GAMEDAYS & CHAOSDAYS
  28. Learn from & help others on this journey: Join the Chaos & Reliability Community
    gremlin.com/community
    Thank you
    [email protected] gremlin.com/slack
    @TAMMYBUTOW @ANA_M_MEDINA GREMLIN
    [email protected]