What you should monitor and alert on in a production system

What you should Monitor and Alert on in a Production System

Arup Chakrabarti, OPERATIONS ENGINEERING, @arupchak, [email protected]

Disclaimer

What we are going to cover
• Story Time
• What is the problem?
• How did we get into this mess?
• How can we avoid the problem?
• What can you start doing today to fix it?
• What is coming in the future?
• Q and A

Story Time: The sad tale of John

Meet John
• Enjoys his job
• Runs a 24/7 service
• Participates in on-call one week out of every four
• Has proper monitoring and alerting for his service
• Has proper de-duping on his alerts

And then...
• Service is scaling up
• Outages increase
• More alerts are created for ‘proactive’ alerting

Love is in the air
• John gets married
• Wife is not happy
• Unhappy wife = Unhappy life

John sleeps in the basement during on-call

What is the problem? (besides sleeping in the basement)

Too many alerts!

Too many alerts!
• Each outage created a new set of alerts
• No push back
• No analysis
• Alerts became meaningless

Meaningless Alerts: Another quick story

How did we get into this mess? (It’s not just Big Data’s fault)

Computing getting cheaper and automation is easier
• Infrastructure Automation enables this
  • aws ec2 run-instances
  • apt-get install nagios
• Businesses want to collect EVERYTHING

Linearly scaling alerts with data is a bad idea
[Chart: $$ plotted against Data]

How some companies deal with this (no insults intended)
• Staff up with 24/7 NOCs
• Engineers never sleep
• Bring up ‘follow the sun’ teams

Visa NOC

How can we avoid the problem? (Going forward at least)

Keep it short for Alerts on New Services
• Only a few metrics to alert on
  • Availability (%)
  • Error Rate (%)
  • Latency (Performance)
• At small scale
  • Host level metrics

Availability Alerting for new services
• Availability
  • Customer cannot access service
• What does your business need?

Availability %   Downtime/Week
99%              1.68 hrs
99.9%            10.1 min
99.99%           1.01 min
99.999%          6.05 sec
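The downtime figures in the table follow from one multiplication: the unavailable fraction times the length of a week. A minimal sketch of that arithmetic, where the function name `weekly_downtime_seconds` is only an illustration and not something from the talk:

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds in a week


def weekly_downtime_seconds(availability_percent: float) -> float:
    """Allowed downtime per week, in seconds, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return SECONDS_PER_WEEK * unavailable_fraction


if __name__ == "__main__":
    for target in (99, 99.9, 99.99, 99.999):
        seconds = weekly_downtime_seconds(target)
        print(f"{target}% -> {seconds:.2f} s "
              f"= {seconds / 60:.2f} min = {seconds / 3600:.2f} hrs per week")
```

Running the same calculation against the number your business actually needs gives the weekly budget that an availability alert should protect.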

Error Rate Alerting: Not the inverse of Availability
• Do not treat Error Rates as (1 - Availability)
• Your web server is serving 500’s
• Customer gets a response, just not what they wanted
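One way to keep the two numbers apart is to compute them over different denominators: availability over every request attempted, error rate only over requests that got a response. The sketch below assumes a made-up encoding in which `None` stands for "the customer got no response at all"; the function name and sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def availability_and_error_rate(
    statuses: Sequence[Optional[int]],
) -> Tuple[float, float]:
    """Return (availability %, error rate %) for a batch of requests.

    Each entry is an HTTP status code, or None when no response came back
    (an assumed encoding for this sketch).
    """
    if not statuses:
        raise ValueError("need at least one request")
    answered = [s for s in statuses if s is not None]
    availability = 100 * len(answered) / len(statuses)
    # Error rate looks only at requests that did receive a response.
    server_errors = sum(1 for s in answered if s >= 500)
    error_rate = 100 * server_errors / len(answered) if answered else 0.0
    return availability, error_rate


if __name__ == "__main__":
    sample = [200] * 90 + [500] * 8 + [None] * 2
    availability, error_rate = availability_and_error_rate(sample)
    print(f"availability={availability:.1f}% error_rate={error_rate:.2f}%")
```

In the sample, the service answers 98 of 100 requests yet still hands a 500 to 8 customers, which is why the two metrics deserve separate alerts.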

Performance Alerting: More complicated
• Avoid using averages
• Use percentile based alerts instead
• Focus on worst experience
• % of customers had this or better

Normally Distributed Data: When averages make sense
[Chart: # Requests against Latency, with the Average marked]

Bi-Normally Distributed Data: When averages are misleading
[Chart: # Requests against Latency, with the Average marked]

Normally Distributed Data: Percentiles still work
[Chart: # Requests against Latency, with the 95% and 99% marks]

Bi-Normally Distributed Data: Percentiles work even better
[Chart: # Requests against Latency, with the 95% and 99% marks]
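The contrast in the charts can be reproduced with a few lines. This is a sketch on synthetic data: the two-mode latency mix (90% of requests around 200 ms, 10% around 800 ms) is invented for illustration, and `percentile` is a simple nearest-rank helper, not any particular monitoring tool's definition.

```python
import math
import random
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical latencies in ms: most requests fast, a minority slow.
latencies = ([rng.gauss(200, 30) for _ in range(9000)]
             + [rng.gauss(800, 50) for _ in range(1000)])

mean_latency = statistics.mean(latencies)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)

if __name__ == "__main__":
    print(f"mean={mean_latency:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
```

The mean lands between the two groups and describes almost nobody's experience, while the 95th and 99th percentiles sit inside the slow group, which is the worst experience the slide says to focus on.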

What can you start doing today for existing services?

Turn off the alerts!

Ok ok, it’s not that easy

Think Different about newly created alerts
• Only alert on metrics your customers care about
• Resist the temptation to alert on everything
• Customers do not care who gets paged - alerts go to the right people
• Talk to your business guys for help

Turn off existing alerts and sleep easy
• Collect alerting data (PagerDuty!)
• How many times are you getting paged per week?
• Questions to ask per alert:
  • Was any action taken?
  • Was a customer affected?
  • Was this fully within my control?
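Once the answers to those three questions are recorded per page, finding alerts to turn off becomes a counting exercise. The sketch below is one possible way to do that; the `Page` record, its field names, and the rule "an alert that never passed all three questions is a candidate" are assumptions for illustration, not a PagerDuty export format or the speaker's exact procedure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    """One page received on-call, with the three review answers (hypothetical)."""
    alert_name: str
    action_taken: bool
    customer_affected: bool
    within_my_control: bool


def review(pages):
    """Per alert: how many pages fired and how many passed all three questions."""
    summary = defaultdict(lambda: {"pages": 0, "useful": 0})
    for page in pages:
        entry = summary[page.alert_name]
        entry["pages"] += 1
        if page.action_taken and page.customer_affected and page.within_my_control:
            entry["useful"] += 1
    return dict(summary)


def candidates_to_turn_off(pages):
    """Alerts that paged someone but never passed all three questions."""
    return sorted(name for name, entry in review(pages).items()
                  if entry["useful"] == 0)


if __name__ == "__main__":
    week = [
        Page("disk_80_percent", False, False, True),
        Page("disk_80_percent", False, False, True),
        Page("checkout_5xx", True, True, True),
        Page("upstream_dns_blip", False, True, False),
    ]
    print(review(week))
    print(candidates_to_turn_off(week))
```

The list is only a starting point for discussion; an alert that fails the questions may need to be rerouted or downgraded rather than deleted.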

Fix problems quickly before you lose the sense of urgency
• Alert comes in at 3am
• Engineer validates it is not an issue
• Fixes root cause following day
• Longer term, start tagging alerts for data collection

Improve the signal to noise and avoid thundering herds
• Aggregate alerts together
• Avoid ‘Everything is OK’ type alarms
• Use the right monitoring tools
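Aggregation can be as simple as folding every alert for the same check into one open incident while alerts keep arriving. The sketch below assumes alerts arrive as `(timestamp_seconds, check, host)` tuples and uses a five-minute quiet window; both the tuple shape and the window are illustrative choices, not how any specific tool behaves.

```python
def aggregate(alerts, window_seconds=300):
    """Fold alerts for the same check into one incident.

    alerts: iterable of (timestamp_seconds, check, host) tuples.
    A new incident for a check opens only after that check has been
    quiet for longer than window_seconds.
    """
    incidents = []
    open_by_check = {}
    for timestamp, check, host in sorted(alerts):
        incident = open_by_check.get(check)
        if incident is None or timestamp - incident["last_seen"] > window_seconds:
            incident = {"check": check, "first_seen": timestamp,
                        "last_seen": timestamp, "hosts": set()}
            incidents.append(incident)
            open_by_check[check] = incident
        incident["last_seen"] = timestamp
        incident["hosts"].add(host)
    return incidents


if __name__ == "__main__":
    storm = [
        (0, "http_5xx", "web-1"),
        (20, "http_5xx", "web-2"),
        (45, "http_5xx", "web-3"),
        (60, "queue_depth", "worker-1"),
        (4000, "http_5xx", "web-1"),
    ]
    for incident in aggregate(storm):
        print(incident["check"], sorted(incident["hosts"]),
              incident["first_seen"], incident["last_seen"])
```

Here five raw alerts become three incidents, so a fleet-wide failure pages once per check instead of once per host.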

Use the right tools: There is no single solution yet
• Metrics belong in metrics data stores
• Event based monitoring does not
• Visualizing Metrics
• Sampling Rates Matter

Numbers that do not change often

Numbers that do change often

What’s coming in the future (I swear that I can predict it)

Non-Threshold based alerting (aka non-brittle alerting)
Moving Baseline (Avg), Rapid Change (2σ)
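The two labels on this slide suggest comparing each new sample with a moving average and flagging it when it is more than two standard deviations away. The following is a rough sketch of that idea under stated assumptions: the class name `MovingBaseline`, the window size, and the warm-up count are all made up here, and real systems would likely need seasonality handling that this ignores.

```python
import statistics
from collections import deque


class MovingBaseline:
    """Flag samples that are far from the recent moving average."""

    def __init__(self, window=60, sigmas=2.0, min_samples=10):
        self.history = deque(maxlen=window)  # oldest samples fall off
        self.sigmas = sigmas
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if value is more than `sigmas` standard deviations
        from the mean of the current window, then record the value."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history)
            # A perfectly flat window has stdev 0; this sketch stays quiet then.
            anomalous = stdev > 0 and abs(value - mean) > self.sigmas * stdev
        self.history.append(value)
        return anomalous


if __name__ == "__main__":
    detector = MovingBaseline(window=20)
    for sample in [100, 102] * 10 + [150]:
        if detector.observe(sample):
            print(f"rapid change: {sample}")
```

Because the baseline moves with the data, there is no fixed threshold to retune as traffic grows, which is what makes this kind of alert less brittle.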

Correlation based alerting (aka Only wake me up when everything breaks)

Prescriptive alerting: All hail our platform overlords
• Shared Computing Infrastructure
• Architectures are looking similar
• Best practices emerging
• Platform takes care of headache

Thank you. We are Hiring!
http://pagerduty.com
Arup Chakrabarti, OPERATIONS ENGINEERING, @arupchak, [email protected]
Someone you know!

Q and A. I have t-shirts
http://pagerduty.com
Arup Chakrabarti, OPERATIONS ENGINEERING, @arupchak, [email protected]
