test. Add log statements until you can verify what causes the broken state. If the code worked at some point, find the last point at which it still worked. Write tests that capture the configuration and flow of the fixed state.
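One way to "find the point at which it did work" is to binary-search an ordered history rather than checking every change by hand. The sketch below is only an illustration: `first_bad`, `revisions`, and `is_good` are hypothetical names, and it assumes the history flips from working to broken exactly once, with the oldest entry known good and the newest known bad. `git bisect` automates the same search over commits.

```python
import logging
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
log = logging.getLogger("bisect_sketch")  # hypothetical logger name


def first_bad(revisions: Sequence[T], is_good: Callable[[T], bool]) -> T:
    """Return the earliest revision for which is_good() returns False.

    Assumption: revisions are ordered oldest to newest, the first one is
    good, the last one is bad, and good turns to bad exactly once.
    """
    if len(revisions) < 2:
        raise ValueError("need at least one good and one bad revision")
    if not is_good(revisions[0]) or is_good(revisions[-1]):
        raise ValueError("expected first revision good and last revision bad")

    lo, hi = 0, len(revisions) - 1  # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        good = is_good(revisions[mid])
        # Log each probe so the search itself leaves a trail to verify.
        log.info("revision %r is %s", revisions[mid], "good" if good else "bad")
        if good:
            lo = mid
        else:
            hi = mid
    return revisions[hi]
```

For example, with `revisions = list(range(1, 11))` and an `is_good` that returns `True` for revisions below 7, `first_bad` returns 7 after three probes. Once the breaking change is known, the `is_good` check is a natural starting point for the regression test that pins down the fixed state.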
not connect to server: Connection refused was bubbling up all over the place. Jobs won’t run, emails won’t send, every submit button on the site fatal errored.” (on-call log, 24 June 2014) WILDLY CHAOTIC
Julia Evans • Systems Performance, Brendan Gregg • Site Reliability Engineering, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy • “Why Do Computers Stop and What Can Be Done About It?”, Jim Gray • “Debug Patterns for Efficient High-level SystemC Debugging”, Frank Rogin, Erhard Fehlauer, Christian Haufe, Sebastian Ohnewald