Roles
Describe Practices for operationalizing roles
Intermission for Praxis
Offer Tools that support roles and practices
Laundry list Everything Else and Adjourn
(big melons) Business Continuity Event
• A feature is inaccessible to a small number of users
• Website is running slowly
• Something is spelled wrong
• Expectations don’t match reality
• Entire service is temporarily unavailable
• Temporary infrastructure or networking failure
• Significant misconfiguration
• Everything gets deleted
• Major contract compliance failure
• Unrecoverable data
• Critical service partner (AWS, Twilio, GitHub) has business continuity-level failure
Note: “Events” are ambiguous because all systems run with a certain level of accepted failure (P2s).
reporting on the state of the incident
Establishes impact and manages expectations of recovery
Requests resources and sets objectives.
Maintains a regular cadence (30 min) of outbound updates
Coordinates messaging with internal-external partners (e.g. client/partner success teams)
Investigates and remediates the problem.
In practice:
“I am IC”
To IC: “SitRep?”
“I’ll take comms”
“I can help”
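The 30-minute cadence of outbound updates can be made hard to forget with a small timer check. The following is only a minimal sketch: the names `UPDATE_INTERVAL`, `next_update_due`, and `is_update_overdue` are hypothetical and do not come from any particular incident tool.

```python
from datetime import datetime, timedelta

# Assumption: the cadence is the 30 minutes named on the slide.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Return the time by which the next outbound update should go out."""
    return last_update + interval


def is_update_overdue(last_update: datetime, now: datetime,
                      interval: timedelta = UPDATE_INTERVAL) -> bool:
    """True once the cadence interval has fully elapsed since the last update."""
    return now >= next_update_due(last_update, interval)
```

Whoever holds comms might call `is_update_overdue` from a chat bot or a calendar reminder; the point is that the cadence is checked by a clock rather than remembered under stress.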
in progress. “GetCalfresh.org is unavailable.”
• Declaratively narrate actions, assumption of roles, and handoffs: “I am logging into the production server”; “I am IC”; “You have comms”; “I have comms”
• Establish command before requesting “all hands on deck”
• Be aware of your limitations (experience, physical, external obligations) and no-shame escalate.
• Do NOT gripe (now) about how previous bad decisions led to this
• Write down the steps you take/took to recover, in a reliable place, for next time
Act to restore confidence in the system-as-a-whole.
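The narration and handoff practices above can be sketched as a timestamped log that also tracks who currently holds each role. This is an illustrative sketch only: `IncidentLog` and its methods are hypothetical names, and the people in the usage note are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Hypothetical record of narrated actions and current role holders."""
    roles: dict = field(default_factory=dict)    # role name -> current holder
    entries: list = field(default_factory=list)  # timestamped narration lines

    def narrate(self, who: str, statement: str) -> None:
        """Record a declarative statement such as 'I am logging into prod'."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.entries.append(f"{stamp} {who}: {statement}")

    def assume(self, who: str, role: str) -> None:
        """Take an unheld role out loud: 'I am IC' or 'I have comms'."""
        self.roles[role] = who
        phrase = f"I am {role}" if role == "IC" else f"I have {role}"
        self.narrate(who, phrase)

    def hand_off(self, giver: str, receiver: str, role: str) -> None:
        """Two-sided handoff: giver says 'You have X', receiver confirms."""
        if self.roles.get(role) != giver:
            raise ValueError(f"{giver} does not currently hold {role}")
        self.narrate(giver, f"You have {role}")
        self.roles[role] = receiver
        self.narrate(receiver, f"I have {role}")
```

For example, `log = IncidentLog(); log.assume("alex", "IC"); log.assume("sam", "comms"); log.hand_off("sam", "riley", "comms")` leaves `log.roles` mapping IC to alex and comms to riley, with four narration lines in `log.entries` that can be kept as the written record of the recovery.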
Pritchard, M; Broom, T. “The Concrete Sumo: Exigent Decision-Making in Engineering”. Science and Engineering Ethics 1999 October; 5(4): 541-567.
Intermission
after another, methodically looking for a solution until you run out of oxygen. We practice the “warn, gather, work” protocol for responding to fire alarms so frequently that it doesn’t just become second nature; it actually supplants our natural instincts. So when we heard the alarm on the Station, instead of rushing to don masks and arm ourselves with extinguishers, one astronaut calmly got on the intercom to warn that a fire alarm was going off – maybe the Russians couldn’t hear it in their module – while another went to the computer to see which smoke detector was going off. No one was moving in a leisurely fashion, but the response was one of focused curiosity; as though we were dealing with an abstract puzzle rather than an imminent threat to our survival. To an observer it might have looked a little bizarre, actually: no agitation, no barked commands, no haste."
Chris Hadfield, “An Astronaut’s Guide to Life on Earth”