Resiliency During High Severity Incidents

I have been the incident manager on-call when AWS was down for hours while I worked at Dropbox. I was the engineer responsible for bringing back the mortgage broking systems at the National Australia Bank. I have been the manager responsible when entire datacenters went down impacting thousands of customers. I hope this deck will help you pick up some tips for how you can develop your resiliency. I've done formal resiliency training at the Resilience Institute, participated in hundreds of hours of on-call training, gamedays, chaos engineering experiments and disaster recovery tests over the years. It has all helped. This deck was presented at the Facebook Developer Circle in Melbourne, Australia on Jan 23, 2018.

Tammy Bütow

January 23, 2018

Transcript

  1. Resiliency During High Severity Incidents. TAMMY BUTOW, PRINCIPAL SRE @ GREMLIN
  2. FIND ME ON TWITTER. Me: @tammybutow. Gremlin: @gremlininc

  3. WHERE HAVE I WORKED?

  4. EXPERIENCES • BUILDING TOOLS • AUTOMATION • INCIDENT RESPONSE & DRTs • OBSERVABILITY • HARDWARE ENGINEERING • CHAOS ENGINEERING & GAMEDAYS • TEAM LEADERSHIP • SECURITY & PRODUCT ENGINEERING
  6. WHAT IS GREMLIN? Gremlin helps developers build resilient systems using our control plane and API. Join us: http://gremlin.com/community Twitter: @gremlininc
  7. RESILIENCY DURING HIGH SEVERITY INCIDENTS

  8. Hello. Let’s walk through how to prepare for, respond to and prevent incidents. Let’s discuss methods that can be used to grow as a team and develop resiliency in the face of difficult challenges. Let’s also explore the difficult decisions that need to be made: who should be on-call, and when?
  10. High severity incident management. The management of high severity incidents (SEVs) encompasses SEV detection, diagnosis, mitigation, prevention, and closure. SEV prevention includes SEV review and SEV correlation.
  11. What are SEVs? SEV is a term used to refer to an incident. It is derived from the word "severity".
  12. What are common types of SEVs? • Availability Drop • Product Issue / Feature Broken • Data Loss • Security Risk
  13. What are examples of SEVs?

    1. Nintendo Switch users unable to download games via the eShop, Christmas Day 2017.
    2. Commonwealth Bank ATMs, online banking and EFTPOS machines stopped working for hours, preventing all customers from accessing cash.
  14. What are SEV levels?

    SEV Level | Description                 | Target resolution time  | Who is notified
    SEV 0     | Catastrophic Service Impact | Resolve within 15 min   | Entire company
    SEV 1     | Critical Service Impact     | Resolve within 8 hours  | Teams working on SEV & CTO
    SEV 2     | High Service Impact         | Resolve within 24 hours | Teams working on SEV
  18. WHO’S WORKED ON AN INCIDENT? “strap in for the ride!”

  19. HOW TO PREPARE

    1. RUNBOOKS / PLAYBOOKS
    2. ON-CALL TRAINING
    3. KEEP UP WITH WHAT’S NEW
    4. FOCUS ON LEARNING
    5. FAILURE IS OK WHEN LEARNING
    6. COMPUTER GAMES + SKATEBOARDING
  20. HOW TO GROW AS A TEAM AND DEVELOP RESILIENCY. DO YOU HAVE A LEARNING CULTURE OR A PERFORMANCE CULTURE? IN A LEARNING CULTURE FAILURE IS *TOTALLY* OK. Elissa Steamer: https://youtu.be/pbbBa5lPPeU?t=1m21s
  23. HOW TO RESPOND

    1. ONE FOR ALL, AND ALL FOR ONE
    2. KEEP MOVING FORWARD
    3. CALM UNDER PRESSURE
    4. KNOW YOUR CRAFT BY HEART
    5. SPEED AND ACCURACY HELP
    6. BE WHO YOU’D WANT TO WORK WITH
  24. HOW TO RESPOND • ONE FOR ALL, AND ALL FOR ONE • KEEP MOVING FORWARD • CALM UNDER PRESSURE
  25. HOW TO PREVENT

    • SEV REVIEWS (THE NEW POSTMORTEM)
    • DON’T CREATE A CULTURE OF BLAME
    • FOCUS ON MTTP & MTBF METRICS (MTTP = Mean Time To Prevention, MTBF = Mean Time Between Failures)
    • CHAOS ENGINEERING & GAMEDAYS
  27. HOW TO GROW AS A TEAM AND DEVELOP RESILIENCY. DO YOU HAVE A LEARNING CULTURE WHICH FOCUSES ON PRACTICE? OR A PERFORMANCE CULTURE? IN A LEARNING CULTURE FAILURE IS *TOTALLY* OK. Elissa Steamer: https://youtu.be/pbbBa5lPPeU?t=1m21s
  28. RESILIENCE TRAINING

  29. THE DIFFICULT DECISIONS. WHO’S ON-CALL? WHAT’S THE ON-CALL SCHEDULE? DO YOU HAVE SECONDARIES? DO YOU HAVE SUPER PRIMARIES?
  30. What’s Next?

    Join our Chaos Engineering Slack: gremlin.com/community
    Find me on Twitter: twitter.com/tammybutow
    SRE & Chaos Engineering info: twitter.com/gremlininc
    Chaos Eng workshops: meetup.com/pro/chaos
  31. Thank you. Enjoy the journey! @tammybutow
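
The SEV levels table on slide 14 and the MTBF metric named on slide 25 can be sketched in a few lines of Python. This is only an illustration, not tooling from the talk: the names `SevLevel`, `SEV_LEVELS`, `resolution_deadline` and `mean_time_between_failures` are hypothetical, and MTBF is simplified here to the average gap between consecutive failure start times. MTTP is left out because the deck gives no formula for it.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SevLevel:
    """One row of the SEV levels table (hypothetical structure)."""
    description: str
    target_resolution: timedelta
    notified: str


# Values copied from the SEV levels table on slide 14.
SEV_LEVELS = {
    0: SevLevel("Catastrophic Service Impact", timedelta(minutes=15), "Entire company"),
    1: SevLevel("Critical Service Impact", timedelta(hours=8), "Teams working on SEV & CTO"),
    2: SevLevel("High Service Impact", timedelta(hours=24), "Teams working on SEV"),
}


def resolution_deadline(sev: int, detected_at: datetime) -> datetime:
    """Return the time by which a SEV of this level should be resolved."""
    return detected_at + SEV_LEVELS[sev].target_resolution


def mean_time_between_failures(failure_starts: list[datetime]) -> timedelta:
    """Average gap between consecutive failure start times.

    Assumption: this ignores repair time, so it is a simplified MTBF.
    """
    if len(failure_starts) < 2:
        raise ValueError("need at least two failures to measure a gap")
    ordered = sorted(failure_starts)
    # The sum of consecutive gaps equals last minus first.
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


if __name__ == "__main__":
    detected = datetime(2018, 1, 23, 10, 0)
    print("SEV 0 deadline:", resolution_deadline(0, detected))
    print("Notify:", SEV_LEVELS[0].notified)
```

Keeping the table as data means the target resolution times and notification lists can be reviewed and changed in one place, for example after a SEV review.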