system probably has latent bugs: behaviour that is obviously incorrect once observed, but simply hasn't been triggered or noticed yet. New changes may also invalidate previously valid constraints, which are not always explicit and not always known.
to have opinions on, but often low value changes*. Ask questions that are hard but necessary and that provide high value: deployment strategy, interaction with existing systems, monitoring, documentation. * Do this during pairing, mentoring, onboarding. Have a "house style" or at least a "repo style".
"type system" to "but we implemented bounded eventual consistency" to "my PaaS takes care of that" “All human actions are equivalent and all are on principle doomed to failure.” – Jean-Paul Sartre
a race condition! That's OK, we have an automatic fail-safe. A bug in the fail-safe code used an empty config! No problem, there's a canary. A previously unknown bug ignored the canary! All of GCE goes off the internet.
config change to bring down a global service. Two of these bugs were in systems designed to automatically correct or avert failures. Reverted to known-good config before root-causing the outage.
Resolving Internet Service Outages" In-depth examination of an outage at Etsy, concentrating on human-human interactions while dealing with the failure. It also has a great literature review. Seriously go read it.
Etsy homepage. Systems involved: personalized homepage module (two separate submodules), caching layer, and specific production data. The team focused on mitigation (which worked this time) and investigation of the live system.
change
2. Widen the scope of potential signals you're looking at (look for anything unusual).
3. Look for recurrences of "familiar" failures
3.a) Investigate specific past problem systems
3.b) Investigate specific recent problem systems
levels & layers of the system. Neither of these failures were related to "new code" or other aspects of deployment as typically understood by developers.
caused later. How can you facilitate this? Logging. Machine-parsable, easily greppable (see trentm/node-bunyan). Application-level monitors. Core files.
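As a rough illustration of "machine-parsable, easily greppable" logging, here is a minimal sketch in TypeScript using bunyan; the service name, request/order fields, and the simulated failure are invented for the example, not taken from the talk.

```typescript
// Minimal structured-logging sketch with bunyan (trentm/node-bunyan).
// Service and field names here are illustrative only.
import * as bunyan from "bunyan";

const log = bunyan.createLogger({
  name: "checkout-service",           // hypothetical service name
  serializers: bunyan.stdSerializers, // built-in err/req/res serializers
});

function placeOrder(reqId: string, orderId: string): void {
  // Each record is a single JSON line: easy to grep, easy to feed into a
  // log pipeline, and it carries its context as fields rather than prose.
  log.info({ req_id: reqId, orderId }, "placing order");

  try {
    throw new Error("inventory service timed out"); // stand-in failure
  } catch (err) {
    // Log the error as structured data, not as an interpolated string.
    log.error({ req_id: reqId, orderId, err }, "order placement failed");
  }
}

placeOrder("r-1001", "o-42");
```

The point is not bunyan specifically: because every record keeps its context as fields, whoever is debugging later can filter on them (for example with `bunyan -c 'this.orderId === "o-42"'`) instead of parsing free-form text.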
debugging methodologies, like Gregg's USE method, to limit the scope of inquiries or redirect away from "it could be anything". Have simple and discoverable tools for at least initial investigations.
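As a concrete sketch of what a simple, discoverable starting point can look like, here is the USE method expressed as a checklist in TypeScript; the metric names are common examples for each resource, not an exhaustive or prescribed list.

```typescript
// Sketch of Brendan Gregg's USE method as a checklist: for each resource,
// check Utilization, Saturation, and Errors before guessing anywhere else.
interface UseCheck {
  resource: string;
  utilization: string; // how busy the resource is
  saturation: string;  // extra work queued that it can't service yet
  errors: string;      // error events for that resource
}

const checklist: UseCheck[] = [
  { resource: "CPU",     utilization: "per-CPU busy %",           saturation: "run-queue length",        errors: "CPU error counters" },
  { resource: "Memory",  utilization: "used vs. total",           saturation: "swapping / OOM kills",    errors: "failed allocations" },
  { resource: "Disk",    utilization: "device busy %",            saturation: "I/O wait queue length",   errors: "device I/O errors" },
  { resource: "Network", utilization: "throughput vs. bandwidth", saturation: "dropped/overrun packets", errors: "interface error counters" },
];

// Walking one resource at a time keeps the investigation scoped to a single
// question instead of "it could be anything".
for (const check of checklist) {
  console.log(
    `${check.resource}: utilization=${check.utilization}; ` +
    `saturation=${check.saturation}; errors=${check.errors}`
  );
}
```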
exert a minimal human cost in order to successfully scale. Understanding how people recover from production failure is crucial to producing operable software.