Incident insights from NASA, NTSB, and the CDC

Emil Stolarsky
September 30, 2017

Full talk can be found @ https://youtu.be/ODYO2MPymJ4

All complex systems eventually fail. With that inevitability, understanding recovery is paramount. Beyond the investment to keep a system running, we must know how to recover effectively upon failure and ensure we don't encounter the same failure twice. The stakes are high: in a connected future, one with self-driving cars and fully automated economies, outages won't only damage customer trust and the bottom line; they could cost lives.

Luckily, software isn't the only industry that deals with the failure of complex systems. Instead of reinventing the wheel, we should take a cross-disciplinary approach and draw inspiration from decades of experience in other fields. Lessons from industries dealing with similar challenges abound: medicine with surgery, transportation with air travel, and aerospace with rockets.

In this talk, I'll share my research into the incident handling and postmortem practices of other fields, surfacing the lessons we can take away. Questions we'll answer include: what has the NTSB learned from investigating 140,000 transport accidents? How does the CDC prevent epidemics from becoming pandemics in the midst of chaos? What can we learn from NASA's postmortem culture?

Still in its early days, SRE has figured out incident management and analysis through trial and error and tribal knowledge. As the field matures and the world relies more heavily on our systems, we can craft best practices by learning from others rather than from inevitable catastrophe.

  1. Crew Resource Management
     • Opening or attention getter
     • State your concern
     • State the problem as you see it
     • State a solution
     • Obtain agreement (or buy-in)
  2. Crew Resource Management
     • "Hey Captain..."
     • "I'm concerned that we may not have enough fuel"
     • "We're showing only 40 minutes of fuel left"
     • "Let's divert to another airport and refuel"
     • "Does that sound good to you, Captain?"
  3. “If you get there and the Waffle House is closed? That's really bad.”
     - Craig Fugate, Director of FEMA (2009–2017)