Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures

A core concept in SRE is that we learn from major system failures and use that experience to improve the resiliency of our systems. If we succeed, we avoid repeating the same customer impact the next time our systems fail in a similar way. This is wonderful, but there is a frightening corollary: when the next big failure happens, it will often be a novel problem.

This talk focuses on how to prepare for novel large-scale failures. I will start by summarizing common methods of incident training, including simulated disaster scenarios and live system exercises that test how our systems and engineering teams respond to controlled but real production failures. I will outline the benefits of each approach and our experience employing them as our company has grown. Our SRE team has grown from about 40 three years ago to 120 today, and the methods we used in the past became less effective as both our systems and our team organization grew more complex and distributed.

Simple playbooks and fallbacks once worked, but we have found that complexity brings a greater need for creativity and for coordination of larger teams to fight problems effectively. High trust, communication, and psychological safety are now central ingredients of an effective response, which has led us to seek more novel forms of offline training. The talk wraps up with a summary of one such large-scale incident exercise we ran, involving a hundred people, an office building, and 20,000 pieces of LEGO.

John Arthorne

October 04, 2019

Transcript

Expect the Unexpected
Help > About
John Arthorne
• Developer/Manager/SRE at Shopify
Shopify
• Software for Commerce
• 30 → 120 SREs
• Average $1700 GMV / second
What We’ll Cover
high failure novelty rate
Transparent Response
Incident Simulation
Game Days
Turn Rusty Knobs
Automated Failure Tests
[Chart: Software Change Rate (Low to High) vs. People Change Rate (Low to High)]
[Chart: the same axes with the training methods plotted: Incident Transparency, Automated Tests, Game Days, Incident Simulations, Rusty Knobs]
novel failures?
Magic Recipe for Novel Failures
Training Exercise Formula
Summing Up
Thank You! github.com/jarthorn/lego-incident-response