Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures

A core concept in SRE is that we learn from major system failures, using the experience gained to improve the resiliency of our systems. If we are successful at this, we avoid repeating the same customer impact the next time our systems fail in a similar way. This is wonderful, but there is a frightening corollary: when the next big failure happens, it will often be a novel problem. This talk will focus on how to prepare for novel large-scale failures.

I will start by summarizing common methods of incident training. These include simulated disaster scenarios and live system exercises that test the response of our systems and engineering teams to controlled but real production system failures. I will outline the benefits of each approach and our experience in employing them over the years as our company has grown. Our SRE team has grown from about 40 three years ago to 120 today, and the methods we used in the past became less effective as both our systems and team organization grew more complex and distributed.

While simple playbooks and fallbacks once worked, we have found that with complexity came a greater need for creativity and coordination of larger teams to fight problems effectively. High trust, communication, and psychological safety are now central ingredients of an effective response, leading us to seek more novel forms of offline training. This talk will wrap up with a summary of one such large-scale incident exercise we ran involving a hundred people, an office building, and 20,000 pieces of LEGO.

John Arthorne

October 04, 2019

Transcript

  1. Expect the Unexpected

  2. Help > About

    John Arthorne • Developer/Manager/SRE at Shopify

    Shopify • Software for Commerce • 30 → 120 SREs • Average $1700 GMV / second
  3. What We’ll Cover

  4. high failure novelty rate

  5. None

  6. Transparent Response

  7. Incident Simulation

  8. Game Days

  9. Turn Rusty Knobs

  10. Automated Failure Tests

  11. [Chart: Software Change Rate vs. People Change Rate, each axis running from Low to High]

  12. [Chart: Software Change Rate vs. People Change Rate, each axis running from Low to High, plotting Incident Transparency, Automated Tests, Game Days, Incident Simulations, and Rusty Knobs]

  13. novel failures?

  14. Magic Recipe for Novel Failures

  15. Training Exercise Formula

  16. None
  17. None
  18. None
  19. None
  20. Summing Up

  21. Thank You! github.com/jarthorn/lego-incident-response