Incident insights from NASA, NTSB, and the CDC

Emil Stolarsky
September 30, 2017

The full talk is available at https://youtu.be/ODYO2MPymJ4

All complex systems eventually fail. With that inevitability, understanding recovery is paramount. Beyond the investment to keep a system running, we must know how to effectively recover upon failure, and ensure we don't encounter the same failure twice. The stakes are high: in a connected future, one with self-driving cars and fully-automated economies, outages won't only damage customer trust and the bottom line, but could cost lives.

Luckily, software isn't the only industry that deals with the failure of complex systems. Instead of reinventing the wheel, we should take a cross-disciplinary approach and draw inspiration from decades of experience in other fields. Lessons from industries dealing with similar challenges abound: medicine with surgery, transportation with air travel, and aerospace with rockets.

In this talk, I'll share my research into the incident handling and postmortem practices of other fields, surfacing the lessons we can take away. Questions we'll answer include: what has the NTSB learned from investigating 140,000 transport accidents? How does the CDC prevent epidemics from becoming pandemics in the midst of chaos? What can we learn from NASA's postmortem culture?

Still in its early days, SRE has figured out incident management and analysis through trial and error and tribal knowledge. As the field matures, and the world relies more heavily on our systems, we can craft best practices by learning from others rather than from inevitable catastrophe.

Transcript

  1. Incident Insights
    Learning from NASA, NTSB, and the CDC
    Emil Stolarsky | @EmilStolarsky

  2. April 9, 2014

  7. Tombstone Mentality

  8. Proactive
    Reactive

  9. Aerospace, Energy, Medicine, Natural Disasters

  10. $whoami

  11. Mitigation, Preparedness, Response, Recovery

  12. Mitigation

  14. Probabilistic Risk
    Assessment

  15. Fault Tree
    Analysis

  17. Preparedness

  18. Laguna Fire

  19. FIRESCOPE

  20. Incident
    Command System

  21. Response

  25. Thought
    Automation

  27. EAL 401
    UA 2860
    Tenerife Disaster
    UA 173

  28. Crew Resource Management

  29. Crew Resource Management
    • Opening or attention getter
    • State your concern
    • State the problem as you see it
    • State a solution
    • Obtain agreement (or buy-in)

  30. Crew Resource Management
    • "Hey Captain..."
    • "I'm concerned that we may not have enough fuel"
    • "We're showing only 40 minutes of fuel left"
    • "Let's divert to another airport and refuel"
    • "Does that sound good to you, Captain?"

  31. Recovery

  32. Bert Brugghemans, Fire Chief (Antwerp, Belgium)
    "Never waste a good crisis"

  33. Not why, but how

  34. “human error”

  35. Causal Factor
    Trees

  36. Causal Factor
    Tree Example

  37. Aviation Safety
    Reporting System

  38. Future is coming
    faster than we think.

  39. What does
    proactive look like?

  42. - Craig Fugate, Director of FEMA (2009–2017)
    “If you get there and the Waffle House is closed? That's really bad.”

  44. Thank you.
