Resiliency During High Severity Incidents

I was the incident manager on-call at Dropbox when AWS was down for hours. I was the engineer responsible for bringing back the mortgage broking systems at the National Australia Bank. I have been the manager responsible when entire datacenters went down, impacting thousands of customers. I hope this deck helps you pick up some tips for developing your resiliency. I've done formal resiliency training at the Resilience Institute and participated in hundreds of hours of on-call training, gamedays, chaos engineering experiments and disaster recovery tests over the years. It has all helped. This deck was presented at the Facebook Developer Circle in Melbourne, Australia on Jan 23, 2018.

Tammy Bryant Butow

January 23, 2018

Transcript

  1. EXPERIENCES
    • BUILDING TOOLS
    • AUTOMATION
    • INCIDENT RESPONSE & DRTs
    • OBSERVABILITY
    • HARDWARE ENGINEERING
    • CHAOS ENGINEERING & GAMEDAYS
    • TEAM LEADERSHIP
    • SECURITY & PRODUCT ENGINEERING
  2. • BUILDING TOOLS
    • AUTOMATION
    • INCIDENT RESPONSE & DRTs
    • OBSERVABILITY
    • HARDWARE ENGINEERING
    • CHAOS ENGINEERING & GAMEDAYS
    • TEAM LEADERSHIP
    • SECURITY & PRODUCT ENGINEERING
  3. WHAT IS GREMLIN?
    Gremlin helps developers build resilient systems using our control plane and API.
    Join us: http://gremlin.com/community
    Twitter: @gremlininc
  4. Hello
    Let’s walk through how to prepare for, respond to and prevent incidents. Let’s discuss methods that can be used to grow as a team and develop resiliency in the face of difficult challenges. Let’s also explore the difficult decisions that need to be made: who should be on-call? When?
  5. High severity incident management
    The management of high severity incidents encompasses high severity incident (SEV) detection, diagnosis, mitigation, prevention, and closure. SEV prevention includes SEV review and SEV correlation.
  6. What are SEVs?
    SEV is a term used to refer to an incident; it is derived from the word severity.
  7. What are common types of SEVs?
    • Availability Drop
    • Product Issue / Feature Broken
    • Data Loss
    • Security Risk
  8. What are examples of SEVs?
    1. Nintendo Switch users unable to download games via the eShop, Christmas Day 2017
    2. Commonwealth Bank ATMs, online banking and EFTPOS machines stopped working for hours, preventing all customers from accessing cash
  9. What are SEV levels?

    SEV Level   Description                   Target resolution time    Who is notified
    SEV 0       Catastrophic Service Impact   Resolve within 15 min     Entire company
    SEV 1       Critical Service Impact       Resolve within 8 hours    Teams working on SEV & CTO
    SEV 2       High Service Impact           Resolve within 24 hours   Teams working on SEV
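The SEV levels on this slide could be encoded as data so that tooling can look up targets and notification scope. The following is a minimal sketch, not something from the deck: the `SevLevel` class, the `SEV_LEVELS` mapping and the `is_past_target` helper are hypothetical names introduced for illustration, and only the descriptions, times and notification groups come from the slide.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SevLevel:
    """One SEV level from the slide (class name is hypothetical)."""
    name: str
    description: str
    target_resolution: timedelta
    notify: str


# Values copied from the slide; the dictionary layout is an assumption.
SEV_LEVELS = {
    0: SevLevel("SEV 0", "Catastrophic Service Impact",
                timedelta(minutes=15), "Entire company"),
    1: SevLevel("SEV 1", "Critical Service Impact",
                timedelta(hours=8), "Teams working on SEV & CTO"),
    2: SevLevel("SEV 2", "High Service Impact",
                timedelta(hours=24), "Teams working on SEV"),
}


def is_past_target(level: int, elapsed: timedelta) -> bool:
    """Return True if an open SEV has exceeded its target resolution time."""
    return elapsed > SEV_LEVELS[level].target_resolution
```

For example, `is_past_target(0, timedelta(minutes=20))` would flag a SEV 0 that has been open for 20 minutes, while a SEV 1 open for two hours would still be within its target.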
  10. HOW TO PREPARE
    1. RUNBOOKS / PLAYBOOKS
    2. ON-CALL TRAINING
    3. KEEP UP WITH WHAT’S NEW
    4. FOCUS ON LEARNING
    5. FAILURE IS OK WHEN LEARNING
    6. COMPUTER GAMES + SKATEBOARDING
  11. HOW TO GROW AS A TEAM AND DEVELOP RESILIENCY
    DO YOU HAVE A LEARNING CULTURE OR A PERFORMANCE CULTURE?
    IN A LEARNING CULTURE FAILURE IS *TOTALLY* OK.
    Elissa Steamer: https://youtu.be/pbbBa5lPPeU?t=1m21s
  14. HOW TO RESPOND
    1. ONE FOR ALL, AND ALL FOR ONE
    2. KEEP MOVING FORWARD
    3. CALM UNDER PRESSURE
    4. KNOW YOUR CRAFT BY HEART
    5. SPEED AND ACCURACY HELP
    6. BE WHO YOU’D WANT TO WORK WITH
  15. HOW TO RESPOND
    ONE FOR ALL, AND ALL FOR ONE
    KEEP MOVING FORWARD
    CALM UNDER PRESSURE
  16. HOW TO PREVENT
    SEV REVIEWS (THE NEW POSTMORTEM)
    DON’T CREATE A CULTURE OF BLAME
    FOCUS ON MTTP & MTBF METRICS
    MTTP = Mean Time To Prevention
    MTBF = Mean Time Between Failures
    CHAOS ENGINEERING & GAMEDAYS
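As an illustration of the MTBF metric named on this slide, here is a minimal sketch that uses one common definition: operational time in an observation window divided by the number of failures. The function name, the window and the incident timestamps are hypothetical and made up for the example. The deck only expands the MTTP acronym without giving a formula, so MTTP is not computed here.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(window_start, window_end, incidents):
    """Uptime in the window divided by the number of incidents.

    `incidents` is a list of (start, end) datetime pairs that fall inside
    the window. Returns None when there were no incidents.
    """
    if not incidents:
        return None
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)


# Made-up example: a 30-day window with a 1-hour and a 3-hour incident.
example_incidents = [
    (datetime(2018, 1, 5, 10), datetime(2018, 1, 5, 11)),
    (datetime(2018, 1, 20, 2), datetime(2018, 1, 20, 5)),
]
mtbf = mean_time_between_failures(
    datetime(2018, 1, 1), datetime(2018, 1, 31), example_incidents
)
```

Tracking this value across review periods is one way to see whether prevention work is lengthening the time between SEVs.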
  17. HOW TO GROW AS A TEAM AND DEVELOP RESILIENCY
    DO YOU HAVE A LEARNING CULTURE WHICH FOCUSES ON PRACTICE? … OR A PERFORMANCE CULTURE?
    IN A LEARNING CULTURE FAILURE IS *TOTALLY* OK.
    Elissa Steamer: https://youtu.be/pbbBa5lPPeU?t=1m21s
  18. THE DIFFICULT DECISIONS
    WHO’S ON-CALL?
    WHAT’S THE ON-CALL SCHEDULE?
    DO YOU HAVE SECONDARIES?
    DO YOU HAVE SUPER PRIMARIES?
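One possible way to answer the schedule questions on this slide is a simple weekly rotation with a primary and a secondary. This is only a sketch under assumptions: the function name, the engineer names and the one-week shift length are hypothetical, and "super primaries" are not modelled because the deck does not define them.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Build (week_start, primary, secondary) tuples for a weekly rotation.

    The secondary for a given week becomes the primary the following week.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Hypothetical team of three, starting the week of Jan 22, 2018.
schedule = weekly_rotation(["ana", "ben", "chen"], date(2018, 1, 22), 4)
```

Rotating the secondary into the primary slot is a design choice in this sketch: the incoming primary has just spent a week as backup and so has recent context on open issues.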
  19. What’s Next?
    Join our Chaos Engineering Slack: gremlin.com/community
    Find me on Twitter: twitter.com/tammybutow
    SRE & Chaos Engineering info: twitter.com/gremlininc
    Chaos Eng workshops: meetup.com/pro/chaos