of interconnected, interdependent systems. They cannot—must not—allow bugs to cause a chain of failures. Bugs will happen. They cannot be eliminated, so they must be survived instead. Production is the only place to learn how the software will respond to real-world Release 1.0 is the beginning of your software’s life, not the end of the project.
to ensure the continuing function of a piece of software in spite of unforeseeable usage of said software. The idea can be viewed as reducing or eliminating the prospect of Murphy's Law having effect. Resilient system stays responsive in the face of failure, any system that is not resilient will be unresponsive after a failure. Resilience is achieved by replication, containment, isolation and delegation.
changing code (no release needed) • Do “Dark launches” when possible • When replacing old code, always keep it until you know new one works fine • Configuration files (overriding & hot reloading) Feature Disabling
protected over other systems failures or service degradation • Be careful with operations that make changes • Don’t make request too quick • Check if operation is pending, even if previous call failed • Circuit Breaker • Limit the number of retries & Log them
third parties input & outputs • As humans read (or even just scan) log files for a new system, they are learning what “normal” means for that system • Reserve “ERROR” for a serious system problem • Don’t leave log files on production systems. Copy them to a staging area for analysis • Log file rotation