Building robust software systems means anticipating how failures may occur with components and subsystems and developing answers to the question:
“What is needed for the design of systems that prevents or limits catastrophic failure?” Investing in, developing, and sustaining the adaptive capacity to cope with unexpected situations is at the core of Resilience Engineering. In the software community, this means developing (continually!) ever-better answers to the question:
“When our preventative designs fail us, what are ways that teams of engineers successfully anticipate, resolve, and learn from those catastrophes?”
The Resilience Engineering community has been studying how people in high-consequence/high-tempo domains answer this latter question. Applying Resilience Engineering thinking and paradigms to the world of software engineering and operations is still in its infancy, but we have some promising routes for making progress. This talk will outline productive avenues to locate, amplify, support, and build this capacity that exists (sometimes invisibly) in the expertise of your organization. Spoiler: looking closely at the origins, handling, and perception of incidents is part of this story.