Tackling Alert Fatigue

Caitie McCaffrey

June 26, 2016

Transcript

  1. Tackling Alert Fatigue, Monitorama 2016

  2. Caitie McCaffrey (@caitie), Distributed Systems Engineer, CaitieM.com

  3. “When alerts are more often false than true, the on-call’s sense of urgency in responding to alerts is diminished … the simple burden of alerts desensitizes the on-call to alerts.”

  4. “When alarms are more often false than true, the nursing staff’s sense of urgency in responding to alarms is diminished … the simple burden of alerts desensitizes caregivers to alarms.” (Novel Approach to Cardiac Alarm Management on Telemetry Units)

  5. The High Cost of Alert Fatigue: Ignored Alerts; Unreliable Systems; Unhappy Customers

  6. The High Cost of Alert Fatigue: Unplanned Work; Inability to Complete Planned Work; Less Time to Focus on Core Business

  7. The High Cost of Alert Fatigue: Fatigue; Fire-Fighting; Burnout

  8. Tackling Alert Fatigue: Increase thresholds for patient vitals; Only Crisis Alarms would emit audible alerts; Nursing staff required to tune false positive alerts in hospitals (Novel Approach to Cardiac Alarm Management on Telemetry Units)

  9. Observability at Twitter (diagram labels): Cmd Line Tool; Viz / Dashboard; Alerting Svc; Cuckoo-Read; Cuckoo-Write; Indexing Svc; Relay Svc; Twitter Front End; Twitter Svc; Twitter Statsite; Scribe; Collection Agent; HDFS; Manhattan Database; Public Cloud

  10. Runbook & Alert Audits

  12. Runbook & Alert Audits

  13. Runbook & Alert Audits

  14. Runbook & Alert Audits

  19. Empower the On-call: Tune Alert Thresholds; Disable or Delete Inactionable Alerts

  20. Business Hours Alerts

  21. Weekly On-Call Retro: Hand off ongoing issues; Review alerts fired in the previous week; Schedule work to improve on-call or reliability

  22. “The goal is not to never get paged, the goal is to never get paged for the same thing twice.” Astrid Atkinson, Engineering for the Long Game

  23. 50% Reduction of Alerts In One Quarter

  24. On-call slept through the night; More time to do scheduled work while on-call; Faster to ramp up new teammates

  25. Alerts Per Service (chart labels): Q1-Q3 2015; Q4 2015; Improved Visibility; Q1 2016

  26. Prevention

  27. Critical Alerts Need to Be Actionable

  28. Do Not Alert on Machine Specific Metrics

  29. The Tech Lead or Engineering Manager should be on-call

  30. Cultural Change

  31. The goal is to build systems that can scale linearly with machines & sub-linearly with people

  32. Benefits of Tackling Alert Fatigue: More Reliable Systems; Less Unplanned Work; Happier Developers

  33. Thank you! @caitie References: https://github.com/CaitieM20/Monitorama2016
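
Illustrative sketch (not from the talk): the slides name "Business Hours Alerts" (slide 20) and "Critical Alerts Need to Be Actionable" (slide 27) without showing an implementation. One way those two ideas might combine into a routing rule is sketched below. The `Alert` fields, the 09:00-17:00 weekday window, and the route names ("page", "notify", "hold", "review") are all assumptions made for illustration; they do not describe Twitter's alerting service.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed business-hours window; a real team would pick its own.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


@dataclass
class Alert:
    name: str
    critical: bool    # hypothetical flag: does this need a human right now?
    actionable: bool  # hypothetical flag: is there a runbook step to take?


def in_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between the assumed start and end times."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route(alert: Alert, now: datetime) -> str:
    """Decide what to do with an alert at a given moment."""
    if not alert.actionable:
        # Candidate for tuning, disabling, or deleting at the next audit.
        return "review"
    if alert.critical:
        # Critical and actionable: page the on-call at any hour.
        return "page"
    # Non-critical: deliver only during business hours, otherwise hold.
    return "notify" if in_business_hours(now) else "hold"
```

Under these assumptions, a non-critical alert firing at 03:00 on a Monday is held until the workday rather than waking the on-call, while a critical, actionable alert pages immediately.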