With information bombarding us every minute of our lives, it can be tough to know what warrants triggering a page. Alert fatigue is a real danger, but ignoring real problems is dangerous too. What lessons can we learn from other fields, such as public policy, public health, or clinical medicine, to reduce the risk of alert fatigue while also keeping our systems as healthy as possible?
The STAT framework (Supported, Trustworthy, Actionable, and Triaged) gives us a rapid diagnostic test for identifying the portions of our alerting systems that cause alert fatigue, along with some strategies for reducing it.
Warning: This Talk Contains Content Known to the State of California to Reduce Alert Fatigue
Observability Engineer at Stripe
Why we can learn from clinical healthcare
•Direct personal contact
•Systems which are difficult to control
Alert Fatigue and Decision Fatigue
Alert fatigue: when the frequency or severity of alerts causes the responder either to ignore important alerts or to make mistakes more frequently.
Decision fatigue: when the frequency or complexity of decision points causes a person to avoid decisions or make mistakes more frequently.
Alert Fatigue deals with the observability of systems
Decision Fatigue deals with the controllability of systems
72-99% of clinical alarms are false positives
…but certain patterns of alerts and decisions contribute disproportionately to fatigue!
Four Steps to Reducing Alert Fatigue: STAT
(Supported, Trustworthy, Actionable, Triaged)
Supported
•Who owns this monitor?
•Who has the right or authority to change it?
An alerting system includes the people who participate in responding to alerts, not just the software that generates alerts
The person responding to an alert always has the right to change it, whether we realize it or not
Responders must feel ownership over the end result
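The two "Supported" questions can be checked mechanically once monitors are described as data. The following is a minimal sketch, not a real monitoring API: the `Monitor` record, its `owner` and `editors` fields, and the example monitor names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Monitor:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    owner: Optional[str] = None                        # who owns this monitor?
    editors: List[str] = field(default_factory=list)   # who may change it?


def unsupported(monitors):
    """Return names of monitors with no owner or nobody allowed to change them."""
    return [m.name for m in monitors if not m.owner or not m.editors]


monitors = [
    Monitor("api-latency", owner="payments", editors=["payments-oncall"]),
    Monitor("disk-full", owner=None, editors=["infra"]),
    Monitor("queue-depth", owner="infra"),
]
print(unsupported(monitors))  # ['disk-full', 'queue-depth']
```

A check like this only finds missing metadata. Whether responders actually feel ownership is a question about people that no script answers.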
Trustworthy
•Do I trust this alert to notify me when a problem happens?
•Do I trust this alert to stay silent when all is well?
•Do I trust this alert to give me sufficient information to diagnose problems?
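The first two trust questions map roughly onto recall and precision. The sketch below assumes you can label past evaluation windows after the fact as "alert fired" and "real problem". That labeled history, the function name, and the sample data are all hypothetical. The third question, about diagnostic information, is not something this measures.

```python
def alert_trust(history):
    """Compute (precision, recall) for one alert.

    history: list of (alert_fired, real_problem) booleans, one per window.
    Returns None for a ratio whose denominator is zero.
    """
    tp = sum(1 for fired, real in history if fired and real)
    fp = sum(1 for fired, real in history if fired and not real)
    fn = sum(1 for fired, real in history if not fired and real)
    precision = tp / (tp + fp) if tp + fp else None  # stays silent when all is well?
    recall = tp / (tp + fn) if tp + fn else None     # notifies when a problem happens?
    return precision, recall


# Made-up history: 4 firings, only 1 of them real, plus 1 missed problem.
history = [
    (True, True), (True, False), (True, False), (True, False),
    (False, True), (False, False),
]
precision, recall = alert_trust(history)
print(precision, recall)  # 0.25 0.5
```

A precision of 0.25 means 75% of firings were false positives, which is the same kind of figure as the clinical alarm statistic quoted earlier.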
Anomaly detection and opaque algorithms
If you don’t understand why an alert is firing, you can’t tell whether it’s real
When to use modeling for monitors
•Does the model represent the interconnectedness of your systems?
•Can the thresholds be adjusted?
•Are the model parameters and outputs human-interpretable?
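As one illustration of the last two questions, here is a sketch of a deliberately simple model: a threshold at the baseline mean plus `k` standard deviations, which reports its own reasoning. The function name, the default `k`, and the sample data are assumptions for illustration. It does not address the first question, since it looks at a single metric in isolation.

```python
import statistics


def check(baseline, current, k=3.0):
    """Fire when `current` exceeds mean + k * stdev of `baseline`.

    k is the adjustable threshold parameter. The returned reason spells
    out every number that went into the decision.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    threshold = mean + k * stdev
    firing = current > threshold
    reason = (
        f"current={current:.1f}, threshold={threshold:.1f} "
        f"(mean {mean:.1f} + {k} x stdev {stdev:.1f})"
    )
    return firing, reason


baseline = [100, 102, 98, 101, 99]  # made-up recent values of some metric
firing, reason = check(baseline, 120)
print(firing, reason)
```

Because the reason string exposes the mean, the spread, and the multiplier, a responder can see why the alert fired and which parameter to adjust if it was wrong.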
Actionable
•At most one decision required to respond
•Alerts that are difficult to act on become alerts that are ignored
Making alerts more actionable
Vague words to avoid: “investigate”, “something”, “somewhere”, “someone”
Decision trees, interactive tooling, making the alerts specific
If it’s unclear who should be taking action, the alert is not actionable
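The four vague words above lend themselves to a simple lint over alert text. This is a minimal sketch with hypothetical alert messages. It matches exact words only, so variants such as "investigating" pass unnoticed, and a clean result does not prove an alert is actionable.

```python
import re

VAGUE_WORDS = ("investigate", "something", "somewhere", "someone")


def vague_terms(alert_text):
    """Return the vague words that appear as whole words in the alert text."""
    words = re.findall(r"[a-z]+", alert_text.lower())
    return [w for w in VAGUE_WORDS if w in words]


# Hypothetical alert messages.
vague = "Something is wrong somewhere in checkout; someone should investigate"
specific = "checkout-api p99 latency > 2s for 5m; payments on-call: roll back latest deploy"

print(vague_terms(vague))     # ['investigate', 'something', 'somewhere', 'someone']
print(vague_terms(specific))  # []
```

The second message names the system, the symptom, the responder, and a single action, which is the direction the slide points toward.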
Triaged
•Meticulously triage alerts
•Alert type should reflect urgency
•Urgency of alerts can change
Steps for triaging
•Commonly understood tiers
•Regular, periodic re-evaluation process
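Both triage steps can be encoded next to the alert definitions. The sketch below is one possible shape under stated assumptions: the three tier names, the 90-day review interval, and the example alerts are all hypothetical, not something the talk prescribes.

```python
from datetime import date, timedelta
from enum import Enum


class Tier(Enum):
    # Assumed tiers; use whatever names your organization already understands.
    PAGE = "page"      # interrupt a human now
    TICKET = "ticket"  # handle during working hours
    LOG = "log"        # record only, nobody is notified


REVIEW_INTERVAL = timedelta(days=90)  # assumed re-evaluation cadence


def needs_review(last_reviewed, today):
    """True when an alert's tier has not been re-evaluated within the interval."""
    return today - last_reviewed > REVIEW_INTERVAL


# Hypothetical alerts: name -> (tier, date the tier was last reviewed).
alerts = {
    "db-primary-down": (Tier.PAGE, date(2024, 1, 10)),
    "cert-expires-30d": (Tier.TICKET, date(2024, 5, 1)),
}
today = date(2024, 6, 1)
stale = [name for name, (_, reviewed) in alerts.items() if needs_review(reviewed, today)]
print(stale)  # ['db-primary-down']
```

Storing the review date with the tier makes "urgency of alerts can change" something a periodic job can surface, rather than something a responder discovers at 3 a.m.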
What’s wrong with Prop 65 warnings?
STAT is just the beginning
•Alert fatigue and decision fatigue deplete executive function
•Tackle alert fatigue and decision fatigue in tandem
•Use STAT as a quick check to evaluate alerting systems
•Regularly re-evaluate your alerts and alerting systems