Slide 1

Slide 1 text

Tackling Alert Fatigue (Monitorama 2016)

Slide 2

Slide 2 text

Caitie McCaffrey, Distributed Systems Engineer. CaitieM.com, @caitie

Slide 3

Slide 3 text

“When alerts are more often false than true, the on-call’s sense of urgency in responding to alerts is diminished … the simple burden of alerts desensitizes the on-call to alerts.”

Slide 4

Slide 4 text

“When alarms are more often false than true, the nursing staff’s sense of urgency in responding to alarms is diminished … the simple burden of alerts desensitizes caregivers to alarms.” Source: Novel Approach to Cardiac Alarm Management on Telemetry Units

Slide 5

Slide 5 text

The High Cost of Alert Fatigue: Ignored Alerts; Unreliable Systems; Unhappy Customers

Slide 6

Slide 6 text

The High Cost of Alert Fatigue: Unplanned Work; Inability to Complete Planned Work; Less Time to Focus on Core Business

Slide 7

Slide 7 text

The High Cost of Alert Fatigue: Fatigue; Fire-Fighting; Burnout

Slide 8

Slide 8 text

Tackling Alert Fatigue: Increase thresholds for patient vitals; Only Crisis Alarms would emit audible alerts; Nursing staff required to tune false positive alerts in hospitals. Source: Novel Approach to Cardiac Alarm Management on Telemetry Units

Slide 9

Slide 9 text

Observability at Twitter (architecture diagram). Components labeled: Cmd Line Tool, Viz / Dashboard, Alerting Svc, Cuckoo-Read, Cuckoo-Write, Indexing Svc, Relay Svc, Twitter Front End, Twitter Svc (several), Statsite, Scribe, Collection Agent, HDFS, Manhattan Database, Public Cloud

Slide 10

Slide 10 text

Runbook & Alert Audits

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Runbook & Alert Audits

Slide 13

Slide 13 text

Runbook & Alert Audits

Slide 14

Slide 14 text

Runbook & Alert Audits

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Empower the On-Call: Tune Alert Thresholds; Disable or Delete Inactionable Alerts

Slide 20

Slide 20 text

Business Hours Alerts
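
One way to read this slide is that only critical alerts page at any hour, while everything else waits for the working day. The sketch below illustrates that idea and is not taken from the talk: the severity names, the Monday to Friday 09:00-17:00 window, and the function names are all assumptions.

```python
from datetime import datetime

# Assumed policy: business hours are Mon-Fri, 09:00-17:00 local time.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def is_business_hours(now: datetime) -> bool:
    """Return True on weekdays between the assumed start and end hours."""
    return now.weekday() < 5 and BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR


def route_alert(severity: str, now: datetime) -> str:
    """Page for critical alerts at any time; hold everything else until business hours."""
    if severity == "critical":
        return "page"
    return "notify" if is_business_hours(now) else "hold"
```

With this routing, a warning that fires at 03:00 is held for the morning instead of waking the on-call.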

Slide 21

Slide 21 text

Weekly On-Call Retro: Hand off ongoing issues; Review alerts fired in the previous week; Schedule work to improve on-call or reliability
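
The review step could be supported by a small tally of last week's alerts. This is a minimal sketch under assumptions: the `FiredAlert` record, its `action_taken` flag, and the two output buckets are hypothetical and do not describe any tooling mentioned in the talk.

```python
from collections import Counter
from typing import NamedTuple


class FiredAlert(NamedTuple):
    name: str           # alert identifier (hypothetical)
    action_taken: bool  # did the on-call have to do anything?


def weekly_review(fired: list[FiredAlert]) -> dict[str, list[str]]:
    """Group last week's alerts into candidates for tuning and for follow-up work."""
    total = Counter(a.name for a in fired)
    acted = Counter(a.name for a in fired if a.action_taken)
    # Never acted on: candidates to tune, disable, or delete.
    tune_or_delete = sorted(n for n in total if acted[n] == 0)
    # Acted on more than once: candidates for a permanent fix.
    fix_root_cause = sorted(n for n in total if acted[n] > 1)
    return {"tune_or_delete": tune_or_delete, "fix_root_cause": fix_root_cause}
```

Alerts that required action exactly once land in neither bucket; the retro can discuss them case by case.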

Slide 22

Slide 22 text

“The goal is not to never get paged, the goal is to never get paged for the same thing twice.” Astrid Atkinson, Engineering for the Long Game

Slide 23

Slide 23 text

50% Reduction of Alerts In One Quarter

Slide 24

Slide 24 text

On-call slept through the night; More time to do scheduled work while on-call; Faster to ramp up new teammates

Slide 25

Slide 25 text

Alerts Per Service (chart). Labels shown: Q1-Q3 2015, Q4 2015, Improved Visibility, Q1 2016

Slide 26

Slide 26 text

Prevention

Slide 27

Slide 27 text

Critical Alerts Need to Be Actionable

Slide 28

Slide 28 text

Do Not Alert on Machine Specific Metrics
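
A common alternative to per-machine alerts is to page on a service-level aggregate. The sketch below assumes each host reports a `(successes, total_requests)` pair and that the paging threshold is a 99% success rate; both the data shape and the threshold are illustrative assumptions, not details from the talk.

```python
def fleet_success_rate(per_host: dict[str, tuple[int, int]]) -> float:
    """Aggregate (successes, total requests) across hosts into one service-level rate."""
    successes = sum(s for s, _ in per_host.values())
    total = sum(t for _, t in per_host.values())
    # With no traffic at all there is nothing failing, so report a perfect rate.
    return successes / total if total else 1.0


def should_page(per_host: dict[str, tuple[int, int]], threshold: float = 0.99) -> bool:
    """Page only when the whole service, not a single machine, falls below the threshold."""
    return fleet_success_rate(per_host) < threshold
```

A single failing host that serves little traffic barely moves the aggregate, so it does not page; a service-wide drop does.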

Slide 29

Slide 29 text

The Tech Lead or Engineering Manager should be on-call

Slide 30

Slide 30 text

Cultural Change

Slide 31

Slide 31 text

The goal is to build systems that can scale linearly with machines & sub-linearly with people

Slide 32

Slide 32 text

Benefits of Tackling Alert Fatigue: More Reliable Systems; Less Unplanned Work; Happier Developers

Slide 33

Slide 33 text

Thank you! @caitie References: https://github.com/CaitieM20/Monitorama2016