Incident insights from NASA, NTSB, and the CDC

Emil Stolarsky
September 30, 2017

The full talk is available at https://youtu.be/ODYO2MPymJ4

All complex systems eventually fail. Given that inevitability, understanding recovery is paramount. Beyond the investment to keep a system running, we must know how to recover effectively from failure and ensure we don't encounter the same failure twice. The stakes are high: in a connected future, one with self-driving cars and fully automated economies, outages won't just damage customer trust and the bottom line; they could cost lives.

Luckily, software isn't the only industry that deals with the failure of complex systems. Instead of reinventing the wheel, we should take a cross-disciplinary approach and draw inspiration from decades of experience in other fields. Lessons from industries dealing with similar challenges abound: medicine with surgery, transportation with air travel, and aerospace with rockets.

In this talk, I'll share my research into the incident handling and postmortem practices of other fields, surfacing the lessons we can take away. Questions we'll answer include: what has the NTSB learned from investigating 140,000 transport accidents? How does the CDC prevent epidemics from becoming pandemics in the midst of chaos? What can we learn from NASA's postmortem culture?

Still in its early days, SRE has figured out incident management and analysis through trial and error and tribal knowledge. As the field matures and the world relies more heavily on our systems, we can craft best practices by learning from others rather than from inevitable catastrophe.


  1. Incident Insights: Learning from NASA, NTSB, and the CDC | Emil Stolarsky | emil@shopify.com | @EmilStolarsky
  2. April 9, 2014

  3. None
  4. None
  5. None
  6. None
  7. Tombstone Mentality

  8. Proactive Reactive

  9. Aerospace, Energy, Medicine, Natural Disasters

  10. $whoami

  11. Mitigation, Preparedness, Response, Recovery

  12. Mitigation

  13. None
  14. Probabilistic Risk Assessment

  15. Fault Tree Analysis

  16. None
  17. Preparedness

  18. Laguna Fire


  20. Incident Command System

  21. Response

  22. None
  23. None
  24. None
  25. Thought Automation

  26. None
  27. EAL 401, UA 2860, Tenerife Disaster, UA 173

  28. Crew Resource Management

  29. Crew Resource Management
      • Opening or attention getter
      • State your concern
      • State the problem as you see it
      • State a solution
      • Obtain agreement (or buy-in)
  30. Crew Resource Management
      • "Hey Captain..."
      • "I'm concerned that we may not have enough fuel"
      • "We're showing only 40 minutes of fuel left"
      • "Let's divert to another airport and refuel"
      • "Does that sound good to you, Captain?"
  31. Recovery

  32. Bert Brugghemans, Fire Chief (Antwerp, Belgium) "Never waste a good

  33. Not why, but how

  34. “human error”

  35. Causal Factor Trees

  36. Causal Factor Tree Example

  37. Aviation Safety Reporting System

  38. Future is coming faster than we think.

  39. What does proactive look like?

  40. None
  41. None
  42. “If you get there and the Waffle House is closed? That's really bad.” - Craig Fugate, Director of FEMA (2009–2017)
  43. None
  44. Thank you.