Tackling Alert Fatigue

Caitie McCaffrey

June 26, 2016

Transcript

  1. Tackling Alert Fatigue
    Monitorama 2016

  2. CaitieM.com
    Distributed Systems Engineer
    Caitie McCaffrey
    @caitie

  3. “When alerts are more often
    false than true, the on-call’s
    sense of urgency in
    responding to alerts is
    diminished … the simple burden
    of alerts desensitizes the on-call
    to alerts.”

  4. “When alarms are more often
    false than true, the nursing
    staff’s sense of urgency in
    responding to alarms is
    diminished … the simple burden
    of alerts desensitizes caregivers
    to alarms.”
    Novel Approach to Cardiac Alarm Management on Telemetry Units

  5. The High Cost of:
    Alert Fatigue
    Ignored Alerts
    Unreliable Systems
    Unhappy Customers

  6. The High Cost of:
    Alert Fatigue
    Unplanned Work
    Inability to Complete Planned Work
    Less Time to Focus on Core Business

  7. The High Cost of:
    Alert Fatigue
    Fatigue
    Firefighting
    Burnout

  8. Tackling Alert Fatigue
    Increase thresholds for patient vitals
    Only Crisis Alarms would emit audible alerts
    Nursing staff required to tune false positive alerts
    in hospitals
    Novel Approach to Cardiac Alarm Management on Telemetry Units

  9. Observability at Twitter
    [Architecture diagram. Labeled components: Cmd Line Tool, Viz / Dashboard,
    Alerting Svc, Cuckoo-Read, Cuckoo-Write, Indexing Svc, Relay Svc,
    Twitter Front End, Twitter Svc (several), Statsite, Scribe,
    Collection Agent, HDFS, Manhattan Database, Public Cloud]

  10. Runbook & Alert Audits

  12. Runbook & Alert Audits

  13. Runbook & Alert Audits

  14. Runbook & Alert Audits

  19. Empower the Oncall
    Tune Alert Thresholds
    Disable or Delete Inactionable Alerts
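
One way an on-call engineer might decide which alerts to tune, disable, or delete is to track how often each alert's firings actually required action. The following is a minimal sketch under stated assumptions: the `(alert_name, was_actionable)` record format, the 50% cutoff, and the function name are hypothetical and do not come from the talk.

```python
from collections import defaultdict

def find_inactionable_alerts(firings, min_actionable_ratio=0.5):
    """Return alert names whose firings rarely required action, sorted by name.

    firings: iterable of (alert_name, was_actionable) pairs (hypothetical format).
    min_actionable_ratio: assumed cutoff below which an alert is flagged.
    """
    totals = defaultdict(int)
    actionable = defaultdict(int)
    for name, was_actionable in firings:
        totals[name] += 1
        if was_actionable:
            actionable[name] += 1
    return sorted(
        name for name in totals
        if actionable[name] / totals[name] < min_actionable_ratio
    )

# Illustrative data only: high_cpu needed action in 1 of 3 firings.
firings = [
    ("disk_full", True),
    ("high_cpu", False),
    ("high_cpu", False),
    ("high_cpu", True),
]
candidates = find_inactionable_alerts(firings)
```

Alerts returned here are only candidates for review; whether to raise a threshold, disable, or delete remains the on-call's judgment.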

  20. Business Hours Alerts
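
The slide gives no implementation detail, so the following is only one possible reading of "business hours alerts": critical alerts page immediately, while everything else is held until the next business-hours window. The 09:00 to 17:00 weekday window, the severity names, and the function names are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

BUSINESS_START_HOUR = 9   # assumed start of the working day
BUSINESS_END_HOUR = 17    # assumed end of the working day

def is_business_hours(ts):
    """True on Monday through Friday between the assumed start and end hours."""
    return ts.weekday() < 5 and BUSINESS_START_HOUR <= ts.hour < BUSINESS_END_HOUR

def next_delivery_time(severity, fired_at):
    """Critical alerts are delivered immediately; others wait for business hours."""
    if severity == "critical" or is_business_hours(fired_at):
        return fired_at
    candidate = fired_at.replace(
        hour=BUSINESS_START_HOUR, minute=0, second=0, microsecond=0
    )
    if candidate <= fired_at:
        # Today's window has already started or passed; try tomorrow.
        candidate += timedelta(days=1)
    while candidate.weekday() >= 5:
        # Skip Saturday and Sunday.
        candidate += timedelta(days=1)
    return candidate

friday_night = datetime(2016, 6, 24, 22, 0)
deferred = next_delivery_time("warning", friday_night)
immediate = next_delivery_time("critical", friday_night)
```

The sketch ignores time zones and holidays, which a real routing policy would need to handle.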

  21. Weekly On-Call Retro
    Hand off ongoing issues
    Review alerts fired in the previous week
    Schedule work to improve on-call or reliability
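
To support the "review alerts fired in the previous week" step, a team could summarize firings per alert before the retro. This is a minimal sketch, assuming alert events are available as `(alert_name, fired_at)` pairs; that format and the function name are hypothetical, not something the talk specifies.

```python
from collections import Counter
from datetime import datetime, timedelta

def weekly_alert_summary(events, now, days=7):
    """Count firings per alert within the last `days` days, most frequent first.

    events: iterable of (alert_name, fired_at) pairs (hypothetical format).
    """
    cutoff = now - timedelta(days=days)
    counts = Counter(
        name for name, fired_at in events if cutoff <= fired_at <= now
    )
    return counts.most_common()

# Illustrative data only.
now = datetime(2016, 6, 27, 10, 0)
events = [
    ("high_latency", datetime(2016, 6, 21, 3, 0)),
    ("high_latency", datetime(2016, 6, 23, 4, 0)),
    ("high_latency", datetime(2016, 6, 25, 2, 0)),
    ("disk_full", datetime(2016, 6, 22, 12, 0)),
    ("disk_full", datetime(2016, 6, 10, 12, 0)),  # older than one week, excluded
]
summary = weekly_alert_summary(events, now)
```

Alerts at the top of the summary are natural candidates for the scheduled reliability work the slide mentions.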

  22. “The goal is not to
    never get paged, the
    goal is to never get
    paged for the same
    thing twice”
    –Astrid Atkinson, Engineering for the Long Game

  23. 50% Reduction of Alerts
    In One Quarter

  24. On-call slept through the night
    More time to do scheduled work while on-call
    Faster to ramp up new teammates

  25. Alerts Per Service
    [Chart. Time periods: Q1-Q3 2015, Q4 2015, Q1 2016. Annotation: Improved Visibility]

  26. Prevention

  27. Critical Alerts Need to
    Be Actionable

  28. Do Not Alert on Machine-Specific Metrics
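
One interpretation of this advice is to alert on a service-wide aggregate instead of on any single host. The sketch below assumes per-host `(errors, requests)` counts and a 5% error-rate threshold; the data shape, the threshold, and the function names are hypothetical and are not taken from the talk.

```python
def service_error_rate(per_host_counts):
    """Aggregate error rate across all hosts.

    per_host_counts: dict mapping host name -> (errors, requests) (hypothetical format).
    """
    total_errors = sum(errors for errors, _ in per_host_counts.values())
    total_requests = sum(requests for _, requests in per_host_counts.values())
    if total_requests == 0:
        return 0.0
    return total_errors / total_requests

def should_alert(per_host_counts, threshold=0.05):
    """Alert only when the service as a whole exceeds the assumed threshold."""
    return service_error_rate(per_host_counts) > threshold

# One unhealthy host out of three: the service-wide rate stays under the threshold.
one_bad_host = {"host-a": (50, 100), "host-b": (0, 1000), "host-c": (0, 1000)}
# Every host degraded: the service-wide rate crosses the threshold.
all_degraded = {"host-a": (100, 1000), "host-b": (100, 1000), "host-c": (100, 1000)}

page_for_one_host = should_alert(one_bad_host)
page_for_all_hosts = should_alert(all_degraded)
```

With this shape, a single misbehaving machine does not page anyone, while a degradation that customers would notice still does.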

  29. The Tech Lead or Engineering
    Manager should be on-call

  30. Cultural Change

  31. The goal is to build
    systems that can scale
    linearly with machines &
    sub-linearly with people

  32. Benefits of:
    Tackling Alert Fatigue
    More Reliable Systems
    Less Unplanned Work
    Happier Developers

  33. Thank you!
    @caitie
    References: https://github.com/CaitieM20/Monitorama2016
