What you should monitor and alert on in a product system

This is a talk that I gave at Nagios World about some of the alerting best practices that I have seen over the years.

Here is the video:

https://www.youtube.com/watch?v=OCtGlp2M6wI

Arup Chakrabarti

October 02, 2013

Transcript

  1. @arupchak What we are going to cover
    • Story Time
    • What is the problem?
    • How did we get into this mess?
    • How can we avoid the problem?
    • What can you start doing today to fix it?
    • What is coming in the future?
    • Q and A
  2. Meet John
    • Enjoys his job
    • Runs a 24/7 service
    • Participates in on-call one week out of every four
    • Has proper monitoring and alerting for his service
    • Has proper de-duping on his alerts
  3. And then...
    • Service is scaling up
    • Outages increase
    • More alerts are created for ‘proactive’ alerting
  4. Love is in the air
    • John gets married
    • Wife is not happy
    • Unhappy wife = Unhappy life
  5. Too many alerts!
    • Each outage created a new set of alerts
    • No push back
    • No analysis
    • Alerts became meaningless
  6. Computing getting cheaper and automation is easier
    • Infrastructure Automation enables this
    • aws ec2 run-instances
    • apt-get install nagios
    • Businesses want to collect EVERYTHING
  7. How some companies deal with this (no insults intended)
    • Staff up with 24/7 NOCs
    • Engineers never sleep
    • Bring up ‘follow the sun’ teams
  8. Keep it short: For Alerts on New Services
    • Only a few metrics to alert on
      • Availability (%)
      • Error Rate (%)
      • Latency (Performance)
    • At small scale
      • Host level metrics
  9. Availability Alerting: For new services
    • Availability
      • Customer cannot access service
    • What does your business need?

    Availability    Downtime/Week
    99%             1.68 hrs
    99.9%           10.1 min
    99.99%          1.01 min
    99.999%         6.05 sec
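
The downtime figures in the table above follow from simple arithmetic. A minimal sketch, assuming a 7-day week of 604,800 seconds; the function name is illustrative and not from the talk:

```python
# Downtime budget per week implied by an availability target.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def downtime_per_week_seconds(availability_percent: float) -> float:
    """Return the seconds of downtime per week that a target allows."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_WEEK * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        seconds = downtime_per_week_seconds(target)
        print(f"{target}% allows {seconds:.2f} s of downtime per week")
```

Picking the target first and deriving the budget from it keeps the alert threshold tied to what the business actually needs.
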
  10. Error Rate Alerting: Not the inverse of Availability
    • Do not treat Error Rates as (1 - Availability)
    • Your web server is serving 500s
    • Customer gets a response, just not what they wanted
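
One way to keep the two metrics separate is sketched below. It assumes a request log where `None` means the customer got no response at all and an integer is the HTTP status returned; this representation and the function names are assumptions for illustration, not something the talk prescribes:

```python
from typing import Optional, Sequence


def availability_percent(statuses: Sequence[Optional[int]]) -> float:
    """Share of requests that received any response at all."""
    if not statuses:
        raise ValueError("no requests to measure")
    answered = sum(1 for status in statuses if status is not None)
    return 100.0 * answered / len(statuses)


def error_rate_percent(statuses: Sequence[Optional[int]]) -> float:
    """Share of answered requests that came back as a 5xx."""
    answered = [status for status in statuses if status is not None]
    if not answered:
        return 0.0
    errors = sum(1 for status in answered if 500 <= status <= 599)
    return 100.0 * errors / len(answered)


if __name__ == "__main__":
    # Hypothetical sample: one request timed out, two were answered with 5xx.
    sample = [200, 200, 500, 200, None, 503, 200, 200, 200, 200]
    print(f"availability: {availability_percent(sample):.1f}%")
    print(f"error rate:   {error_rate_percent(sample):.1f}%")
```

In this sample the service answered most requests, yet a noticeable share of the answers were errors, so the two numbers do not sum to 100%.
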
  11. Performance Alerting: More complicated
    • Avoid using averages
    • Use percentile-based alerts instead
    • Focus on worst experience
    • % of customers had this or better
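
A minimal sketch of a percentile-based alert check, using the nearest-rank definition of a percentile. The 99th percentile and the 500 ms threshold are placeholder values, and the function names are illustrative:

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least
    p percent of the samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def should_alert(latencies_ms: Sequence[float], p: int = 99,
                 threshold_ms: float = 500.0) -> bool:
    """Alert when the p-th percentile latency exceeds the threshold."""
    return percentile(latencies_ms, p) > threshold_ms


if __name__ == "__main__":
    # Hypothetical window: 98 fast requests and 2 very slow ones.
    window = [100.0] * 98 + [900.0] * 2
    print("average:", sum(window) / len(window))
    print("p99:", percentile(window, 99))
    print("alert:", should_alert(window))
```

Here the average stays well under the threshold while the 99th percentile does not, which is the worst-experience signal the slide asks for.
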
  12. Normally Distributed Data: When averages make sense
    [Chart: # Requests by Latency, with the Average marked]
  13. Bi-Normally Distributed Data: When averages are misleading
    [Chart: # Requests by Latency, with the Average marked]
  14. Normally Distributed Data: Percentiles still work
    [Chart: # Requests by Latency, with the 95% and 99% percentiles marked]
  15. Bi-Normally Distributed Data: Percentiles work even better
    [Chart: # Requests by Latency, with the 95% and 99% percentiles marked]
  16. Think Different about newly created alerts
    • Only alert on metrics your customers care about
    • Resist the temptation to alert on everything
    • Customers do not care who gets paged - Alerts go to the right people
    • Talk to your business guys for help
  17. Turn off existing alerts and sleep easy
    • Collect alerting data (PagerDuty!)
    • How many times are you getting paged per week?
    • Questions to ask per alert:
      • Was any action taken?
      • Was a customer affected?
      • Was this fully within my control?
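
The per-alert questions above can be tallied from exported paging data. A minimal sketch, assuming each page has been annotated by hand with the three answers; the `Page` record and its field names are hypothetical and do not reflect any PagerDuty schema. The rule used to flag candidates (no action ever taken and no customer ever affected) is one possible reading of the slide, not the speaker's stated criterion:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Page:
    alert_name: str          # hypothetical fields, filled in during review
    action_taken: bool
    customer_affected: bool
    within_control: bool


def summarize(pages: Iterable[Page]) -> Dict[str, Dict[str, int]]:
    """Count pages and 'yes' answers per alert name."""
    summary: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"pages": 0, "action_taken": 0,
                 "customer_affected": 0, "within_control": 0})
    for page in pages:
        row = summary[page.alert_name]
        row["pages"] += 1
        row["action_taken"] += page.action_taken
        row["customer_affected"] += page.customer_affected
        row["within_control"] += page.within_control
    return dict(summary)


def turn_off_candidates(summary: Dict[str, Dict[str, int]]) -> List[str]:
    """Alerts that never led to action and never affected a customer."""
    return sorted(name for name, row in summary.items()
                  if row["action_taken"] == 0
                  and row["customer_affected"] == 0)


if __name__ == "__main__":
    pages = [
        Page("disk_80_percent", False, False, True),
        Page("disk_80_percent", False, False, True),
        Page("api_5xx_rate", True, True, True),
        Page("upstream_dns", False, True, False),
    ]
    print(turn_off_candidates(summarize(pages)))
```
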
  18. Fix problems quickly before you lose the sense of urgency
    • Alert comes in at 3am
    • Engineer validates it is not an issue
    • Fixes root cause following day
    • Longer term, start tagging alerts for data collection
  19. Improve the signal to noise and avoid thundering herds
    • Aggregate alerts together
    • Avoid ‘Everything is OK’ type alarms
    • Use the right monitoring tools
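
A minimal sketch of aggregating alerts, assuming alerts are grouped when they share a service and check name and arrive within a fixed window of the first alert in the group. The `Alert` record, the grouping key, and the 300-second window are illustrative choices, not a description of any particular tool:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Alert:
    timestamp: float  # seconds since some epoch
    service: str
    check: str
    host: str


def aggregate_alerts(alerts: Iterable[Alert],
                     window_seconds: float = 300.0) -> List[dict]:
    """Collapse alerts sharing (service, check) into one group per window."""
    open_groups: Dict[Tuple[str, str], dict] = {}
    groups: List[dict] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        group = open_groups.get(key)
        expired = (group is not None and
                   alert.timestamp - group["first_seen"] > window_seconds)
        if group is None or expired:
            # Start a new group; this is the only point that would page.
            group = {"key": key, "first_seen": alert.timestamp, "hosts": []}
            open_groups[key] = group
            groups.append(group)
        group["hosts"].append(alert.host)
    return groups


if __name__ == "__main__":
    burst = [Alert(t, "web", "http_5xx", f"web{t}") for t in (0, 10, 20)]
    for group in aggregate_alerts(burst):
        print(group["key"], "affected hosts:", group["hosts"])
```

With this grouping, a failure that hits every host behind a service produces one notification listing the affected hosts instead of one page per host.
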
  20. Use the right tools: There is no single solution yet
    • Metrics belong in metrics data stores
    • Event based monitoring does not
    • Visualizing Metrics
    • Sampling Rates Matter
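
One reading of "Sampling Rates Matter" is that a coarse sampling interval can miss short spikes entirely. A minimal sketch with made-up numbers: a per-second latency series containing a 20-second spike, sampled once per second and once per minute:

```python
from typing import List, Sequence


def sample(series: Sequence[float], interval: int) -> List[float]:
    """Keep every interval-th point, starting with the first."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return list(series[::interval])


if __name__ == "__main__":
    # Hypothetical data: 5 minutes of 100 ms latency with a 20 s spike.
    latency_ms = [100.0] * 300
    for second in range(130, 150):
        latency_ms[second] = 2000.0

    print("max at 1 s sampling: ", max(sample(latency_ms, 1)))
    print("max at 60 s sampling:", max(sample(latency_ms, 60)))
```

At one-minute sampling the samples land at seconds 0, 60, 120, 180 and 240, so the spike between seconds 130 and 149 never shows up in the stored metric.
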
  21. Prescriptive alerting: All hail our platform overlords
    • Shared Computing Infrastructure
    • Architectures are looking similar
    • Best practices emerging
    • Platform takes care of headache