of interconnected, interdependent systems. They cannot, and must not, allow bugs to cause a chain of failures. Bugs will happen. They cannot be eliminated, so they must be survived instead. Production is the only place to learn how the software will respond to real-world stimuli. Release 1.0 is the beginning of your software’s life, not the end of the project.
customers • Know why an issue happened • Don’t depend on somebody looking at an error log, a daily email, ... • Avoid differences between dev/testing and production conditions
to ensure the continuing function of a piece of software in spite of unforeseeable usage of said software. The idea can be viewed as reducing or eliminating the prospect of Murphy's Law having effect. A resilient system stays responsive in the face of failure; any system that is not resilient will be unresponsive after a failure. Resilience is achieved by replication, containment, isolation and delegation.
number-one killer of systems. • A subsystem should be as isolated as possible • Consider health checks • Design and architecture decisions are also financial decisions.
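A health check can be as small as a function that runs one probe per dependency and reports an overall status. The following is a minimal sketch; the function name `run_health_checks`, the probe names, and the "ok"/"degraded" vocabulary are illustrative assumptions, not part of the original slides.

```python
def run_health_checks(checks):
    """Run each named probe and summarize the result.

    `checks` maps a dependency name to a zero-argument callable that
    returns a truthy value when the dependency is healthy (assumed contract).
    """
    results = {}
    for name, probe in checks.items():
        try:
            healthy = bool(probe())
        except Exception:
            # A probe that blows up counts as a failing dependency,
            # it must never take the health endpoint down with it.
            healthy = False
        results[name] = "ok" if healthy else "failing"
    overall = "ok" if all(s == "ok" for s in results.values()) else "degraded"
    return {"status": overall, "checks": results}


if __name__ == "__main__":
    # Hypothetical probes standing in for real dependency checks.
    print(run_health_checks({"database": lambda: True, "cache": lambda: False}))
```

Because each probe is isolated in its own try/except, one broken subsystem shows up as "failing" without hiding the state of the others.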
changing code (no release needed) • Do “Dark launches” when possible • When replacing old code, always keep it until you know the new code works fine • Configuration files (overriding & hot reloading) Feature Disabling
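The points about hot-reloaded configuration and keeping old code around can be combined into a feature flag that is read from a file at run time. This is a minimal sketch, assuming a JSON file of boolean flags; the class `FeatureFlags`, the flag name `new_checkout`, and the two checkout functions are hypothetical names used only for illustration.

```python
import json
import os


class FeatureFlags:
    """Boolean flags loaded from a JSON file and re-read when the file changes."""

    def __init__(self, path, defaults=None):
        self._path = path
        self._defaults = dict(defaults or {})
        self._flags = dict(self._defaults)
        self._mtime = None

    def _reload_if_changed(self):
        try:
            mtime = os.stat(self._path).st_mtime_ns
        except OSError:
            return  # file missing: keep the last known flags
        if mtime == self._mtime:
            return
        try:
            with open(self._path, encoding="utf-8") as fh:
                overrides = json.load(fh)
        except (OSError, ValueError):
            return  # unreadable or malformed: keep the last known good flags
        if not isinstance(overrides, dict):
            return
        # Values from the file override the built-in defaults.
        self._flags = {**self._defaults, **overrides}
        self._mtime = mtime

    def is_enabled(self, name):
        self._reload_if_changed()
        return bool(self._flags.get(name, False))


def old_checkout(cart):
    return sum(cart)  # existing code path, kept until the new one is proven


def new_checkout(cart):
    return sum(cart)  # replacement code path, switched on by configuration


def checkout(cart, flags):
    if flags.is_enabled("new_checkout"):
        return new_checkout(cart)
    return old_checkout(cart)
```

Editing the JSON file switches the code path without a release, and a broken edit falls back to the last good configuration instead of failing.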
composed of many methods • Each one of them can fail for many reasons • Depending on the underlying tech, not all of the failures may be catchable • How do you detect something is failing?
events started and ended tells you something is wrong • Integration points without timeouts are a surefire way to create cascading failures • Consider fail fast
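One way to put a timeout around an integration point, and to record how long each call took, is to run the call in a worker thread and stop waiting after a deadline. This is a sketch under stated assumptions: `call_with_timeout` and `IntegrationTimeout` are hypothetical names, and the approach only stops the caller from waiting; it does not kill the slow call itself.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

log = logging.getLogger("integration")
_pool = ThreadPoolExecutor(max_workers=4)


class IntegrationTimeout(Exception):
    """Raised when a remote call does not answer within its deadline."""


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) and give up waiting after timeout_s seconds."""
    started = time.monotonic()
    future = _pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the call is still queued; a running call
        # keeps its worker thread busy until it returns on its own.
        future.cancel()
        raise IntegrationTimeout(f"no answer within {timeout_s}s") from None
    finally:
        # The elapsed time between start and end is itself a health signal.
        elapsed = time.monotonic() - started
        log.info("call to %s took %.3fs", getattr(fn, "__name__", fn), elapsed)
```

When the client library offers its own timeout parameter (sockets and most HTTP clients do), that is usually preferable, because it also releases the underlying resource.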
protected against other systems’ failures or service degradation • Be careful with operations that make changes • Don’t make requests too quickly • Check if an operation is pending, even if the previous call failed • Circuit Breaker • Limit the number of retries & log them
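The circuit breaker and the bounded, logged retries from the list above could look roughly like this. It is a minimal single-threaded sketch, not a production implementation; the names `CircuitBreaker`, `CircuitOpenError`, `call_with_retries`, and all default thresholds are assumptions chosen for illustration.

```python
import logging
import time

log = logging.getLogger("integration")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling a dependency that keeps failing."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retries(breaker, fn, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Try fn through the breaker at most `attempts` times, logging each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # never retry against an open circuit
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            sleep(delay_s * attempt)  # wait a little longer before each retry
```

Retrying is only safe for operations that can be repeated without side effects; for calls that make changes, check whether the previous attempt is still pending before sending it again.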
third-party inputs & outputs • As humans read (or even just scan) log files for a new system, they are learning what “normal” means for that system • Reserve “ERROR” for a serious system problem • Don’t leave log files on production systems. Copy them to a staging area for analysis • Log file rotation
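Log file rotation is available in Python's standard library through `logging.handlers.RotatingFileHandler`. The sketch below shows one possible setup; the function name `build_logger`, the logger name, the size limits, and the message format are illustrative assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(path, name="app", max_bytes=1_000_000, backups=5):
    """Return a logger that writes to `path` and rotates it by size."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


if __name__ == "__main__":
    logger = build_logger("app.log")
    # A recoverable condition is a warning, not an error.
    logger.warning("payment gateway slow, retrying")
    # ERROR is reserved for a serious system problem.
    logger.error("database unreachable, requests are being rejected")
```

With `backupCount=5` the handler keeps `app.log` plus at most five older files (`app.log.1` to `app.log.5`), so log files cannot fill the disk.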
long in production • Good data enables good decision making • Logging and monitoring are both good for exposing and understanding the immediate behavior of an application or system