Slide 12
Slide 12 text
HTTPS://HOW.COMPLEXSYSTEMS.FAIL/
HOW COMPLEX SYSTEMS FAIL:
BEING A SHORT TREATISE ON THE NATURE OF
FAILURE; HOW FAILURE IS EVALUATED; HOW
FAILURE IS ATTRIBUTED TO PROXIMATE
CAUSE; AND THE RESULTING NEW
UNDERSTANDING OF PATIENT SAFETY
I wish I had the time to fully explore this material, but I want everyone to jot down the URL how [dot] complex systems [dot] fail, or you can find it in the slides online after
the talk. This page summarizes some research done by Dr. Richard Cook. He’s a medical doctor who researches failure modes of complex systems, whether they are the
electrical grid, hospitals, or production distributed systems. If you’ve got a handful of webservers, a load balancer, a network filesystem, a database, and a monitoring
system, you’ve got a complex system.
One of the major observations of Dr. Cook’s research is that complex systems already do a lot of work to handle failures, so any operational failure will have
multiple contributing factors. Because of this, it’s likely that whatever changed to initiate the operational failure merely triggered a set of pre-existing, latent bugs. It’s
critical we regard the latent bugs as being as much a part of the cause of the incident as the proximate change which made them active.
It’s also important that we consider things that may not be bugs, but which nonetheless are contributing factors to our incident.
Now go forth and read this website, ideally after I’m done talking. But it’s pretty good stuff, so I won’t blame you if you start reading it right away.