What you should monitor and alert on in a production system

This is a talk that I gave at Nagios World about some of the alerting best practices that I have seen over the years.

Here is the video:

https://www.youtube.com/watch?v=OCtGlp2M6wI

Arup Chakrabarti

October 02, 2013

Transcript

  1. What you should Monitor and Alert on in a Production System
  2. Arup Chakrabarti, Operations Engineering • @arupchak • arup@pagerduty.com

  3. Disclaimer

  4. What we are going to cover
    • Story Time
    • What is the problem?
    • How did we get into this mess?
    • How can we avoid the problem?
    • What can you start doing today to fix it?
    • What is coming in the future?
    • Q and A
  5. Story Time: The sad tale of John

  6. Meet John
    • Enjoys his job
    • Runs a 24/7 service
    • Participates in on-call one week out of every four
    • Has proper monitoring and alerting for his service
    • Has proper de-duping on his alerts
  7. And then...
    • Service is scaling up
    • Outages increase
    • More alerts are created for ‘proactive’ alerting
  8. Love is in the air
    • John gets married
    • Wife is not happy
    • Unhappy wife = Unhappy life
  9. John sleeps in the basement during on-call

  10. What is the problem? (besides sleeping in the basement)

  11. Too many alerts!

  12. Too many alerts!
    • Each outage created a new set of alerts
    • No push back
    • No analysis
    • Alerts became meaningless
  13. Meaningless Alerts: Another quick story

  14. How did we get into this mess? It’s not just Big Data’s fault
  15. Computing getting cheaper and automation is easier
    • Infrastructure Automation enables this
    • aws ec2 run-instances
    • apt-get install nagios
    • Businesses want to collect EVERYTHING
  16. Linearly scaling alerts with data is a bad idea
    [Chart: $$ vs. Data]
  17. How some companies deal with this (no insults intended)
    • Staff up with 24/7 NOCs
    • Engineers never sleep
    • Bring up ‘follow the sun’ teams
  18. Visa NOC

  19. How can we avoid the problem? Going forward at least
  20. Keep it short: For Alerts on New Services
    • Only a few metrics to alert on
    • Availability (%)
    • Error Rate (%)
    • Latency (Performance)
    • At small scale
    • Host level metrics
  21. Availability Alerting: For new services
    • Availability
    • Customer cannot access service
    • What does your business need?

    Availability (%)   Downtime/Week
    99%                1.68 hrs
    99.9%              10.1 min
    99.99%             1.01 min
    99.999%            6.05 sec
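The downtime figures in the table on slide 21 follow from simple arithmetic on a one-week window. A minimal sketch, assuming a 7-day week of 604,800 seconds; the function name `allowed_downtime_seconds` is illustrative and not from the talk:

```python
# Seconds in one week; the slide's table uses a weekly window.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent, window_seconds=SECONDS_PER_WEEK):
    """Return the downtime budget, in seconds, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The budget is the unavailable fraction of the window.
    return window_seconds * (100 - availability_percent) / 100


# The four targets listed on the slide.
for target in (99, 99.9, 99.99, 99.999):
    seconds = allowed_downtime_seconds(target)
    print(f"{target}% -> {seconds:.2f} s ({seconds / 60:.2f} min) per week")
```

For 99% this gives 6,048 seconds, which is the 1.68 hours shown on the slide. Passing a different `window_seconds` gives the budget for a month or a quarter instead.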
  22. Error Rate Alerting: Not the inverse of Availability
    • Do not treat Error Rates as (1 - Availability)
    • Your web server is serving 500s
    • Customer gets a response, just not what they wanted
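One way to see why error rate is not (1 - Availability) is to compute the two from the same request log. This is a sketch under assumed definitions that the talk does not spell out: availability is the share of requests that received any response, and error rate is the share of responses with a 5xx status. The function name and the sample data are hypothetical.

```python
def availability_and_error_rate(outcomes):
    """Compute (availability_percent, error_rate_percent).

    outcomes: list of HTTP status codes, with None meaning the customer
    got no response at all (assumed encoding, for illustration only).
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("need at least one request")
    responses = [code for code in outcomes if code is not None]
    availability = 100 * len(responses) / total
    errors = sum(1 for code in responses if 500 <= code <= 599)
    error_rate = 100 * errors / len(responses) if responses else 0.0
    return availability, error_rate


# Hypothetical sample: every request got a response, but 2 in 10 were 500s.
sample = [200, 200, 500, 200, 200, 200, 500, 200, 200, 200]
availability, error_rate = availability_and_error_rate(sample)
print(f"availability={availability}%  error_rate={error_rate}%")
```

In this sample the service is fully available by the assumed definition, yet one in five customers did not get what they wanted, so the two numbers need separate alerts.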
  23. Performance Alerting: More complicated
    • Avoid using averages
    • Use percentile based alerts instead
    • Focus on worst experience
    • X% of customers had this experience or better
  24. When averages make sense: Normally Distributed Data
    [Histogram with axes Latency and # Requests; the Average is marked]
  25. When averages are misleading: Bi-Normally Distributed Data
    [Histogram with axes Latency and # Requests; the Average is marked]
  26. Percentiles still work: Normally Distributed Data
    [Histogram with axes Latency and # Requests; 95% and 99% are marked]
  27. Percentiles work even better: Bi-Normally Distributed Data
    [Histogram with axes Latency and # Requests; 95% and 99% are marked]
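The contrast in slides 23 to 27 can be reproduced with synthetic data. The sketch below is illustrative only: the latencies are generated, not taken from the talk, and the nearest-rank `percentile` helper is one of several common percentile definitions.

```python
import math
import random
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic latencies in milliseconds (an assumption, not data from the talk):
# 800 fast requests around 200 ms and 200 slow requests around 800 ms.
rng = random.Random(0)
latencies = [rng.gauss(200, 30) for _ in range(800)]
latencies += [rng.gauss(800, 30) for _ in range(200)]

mean_latency = statistics.mean(latencies)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)

print(f"mean={mean_latency:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
```

The mean lands between the two humps, at a latency that almost no request actually experienced, while the 95th and 99th percentiles sit inside the slow hump and describe what the worst-served customers saw.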
  28. What can you start doing today for existing services?

  29. Turn off the alerts!

  30. Ok ok, it’s not that easy

  31. Think Different about newly created alerts
    • Only alert on metrics your customers care about
    • Resist the temptation to alert on everything
    • Customers do not care who gets paged; make sure alerts go to the right people
    • Talk to your business guys for help
  32. Turn off existing alerts and sleep easy
    • Collect alerting data (PagerDuty!)
    • How many times are you getting paged per week?
    • Questions to ask per alert:
    • Was any action taken?
    • Was a customer affected?
    • Was this fully within my control?
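The three questions on slide 32 can be tallied per alert once pages are recorded. A minimal sketch, assuming each page has been annotated by hand with the three answers; the `Page` record, the alert names, and the removal rule (never acted on and never customer-affecting) are hypothetical and not something the talk prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    alert_name: str          # hypothetical alert identifier
    action_taken: bool       # Was any action taken?
    customer_affected: bool  # Was a customer affected?
    within_control: bool     # Was this fully within my control?


def review_alerts(pages):
    """Group pages by alert name and count the 'yes' answers to each question."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page.alert_name].append(page)
    report = {}
    for name, group in grouped.items():
        report[name] = {
            "pages": len(group),
            "action_taken": sum(p.action_taken for p in group),
            "customer_affected": sum(p.customer_affected for p in group),
            "within_control": sum(p.within_control for p in group),
        }
    return report


def removal_candidates(report):
    """Alerts that never led to action and never affected a customer."""
    return sorted(
        name
        for name, stats in report.items()
        if stats["action_taken"] == 0 and stats["customer_affected"] == 0
    )


# Hypothetical week of pages.
pages = [
    Page("disk_80_percent", False, False, True),
    Page("disk_80_percent", False, False, True),
    Page("checkout_5xx", True, True, True),
    Page("upstream_dns", False, True, False),
]
report = review_alerts(pages)
print(removal_candidates(report))
```

The report also gives the pages-per-week count for each alert, which answers the slide's first question directly.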
  33. Fix problems quickly before you lose the sense of urgency
    • Alert comes in at 3am
    • Engineer validates it is not an issue
    • Fixes root cause the following day
    • Longer term, start tagging alerts for data collection
  34. Improve the signal to noise and avoid thundering herds
    • Aggregate alerts together
    • Avoid ‘Everything is OK’ type alarms
    • Use the right monitoring tools
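Aggregating alerts, as slide 34 suggests, can be as simple as collapsing alerts that share a service and check within a short window into one incident. This is a sketch of one possible grouping rule, not the method of any particular tool; the tuple layout, the `Incident` record, and the 300-second window are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: tuple        # (service, check)
    first_seen: float # timestamp of the first alert in the incident
    hosts: list = field(default_factory=list)


def aggregate(alerts, window_seconds=300):
    """Merge alerts into incidents.

    alerts: iterable of (timestamp, service, check, host) tuples, sorted by
    timestamp. Alerts with the same (service, check) that arrive within
    window_seconds of an incident's first alert join that incident.
    """
    open_incidents = {}
    incidents = []
    for timestamp, service, check, host in alerts:
        key = (service, check)
        current = open_incidents.get(key)
        if current is None or timestamp - current.first_seen > window_seconds:
            current = Incident(key=key, first_seen=timestamp)
            open_incidents[key] = current
            incidents.append(current)
        current.hosts.append(host)
    return incidents


# Hypothetical alert stream: two web hosts fail together, then a third later.
alerts = [
    (0, "web", "http_5xx", "web1"),
    (10, "web", "http_5xx", "web2"),
    (20, "db", "replication_lag", "db1"),
    (400, "web", "http_5xx", "web3"),
]
incidents = aggregate(alerts)
for incident in incidents:
    print(incident.key, incident.hosts)
```

Four raw alerts become three incidents here, so the on-call engineer is notified once for the two web hosts that failed together.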
  35. Use the right tools: There is no single solution yet
    • Metrics belong in metrics data stores
    • Event based monitoring does not
    • Visualizing Metrics
    • Sampling Rates Matter
  36. Numbers that do not change often

  37. Numbers that do change often

  38. What’s coming in the future: I swear that I can predict it
  39. Non-Threshold based alerting, aka non-brittle alerting
    • Moving Baseline (Avg)
    • Rapid Change (2σ)
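The two labels on slide 39 suggest comparing each new sample with a moving average and flagging it when it deviates by more than two standard deviations. The sketch below is one reading of that idea, not a description of any product; the trailing window of 20 samples, the function name, and the sample series are assumptions.

```python
import statistics
from collections import deque


def rapid_change_points(samples, window=20, sigmas=2.0):
    """Return indexes of samples that deviate from the trailing-window mean
    by more than `sigmas` standard deviations of that window."""
    history = deque(maxlen=window)
    flagged = []
    for index, value in enumerate(samples):
        # Only judge a sample once a full window of history exists.
        if len(history) == window:
            baseline = statistics.mean(history)
            spread = statistics.pstdev(history)
            # A perfectly flat window has zero spread; nothing is flagged then.
            if spread > 0 and abs(value - baseline) > sigmas * spread:
                flagged.append(index)
        # The sample joins the baseline afterwards, including any spike.
        history.append(value)
    return flagged


# Hypothetical metric: steady around 100, with one spike to 180.
samples = [100, 102, 98, 101, 99] * 4 + [100, 180, 101]
print(rapid_change_points(samples))
```

No fixed threshold is configured: the same code flags a jump from 100 to 180 on this series and would flag a jump from 1,000 to 1,800 on a series ten times larger, which is what makes the approach less brittle than a hard-coded limit.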
  40. Correlation based alerting, aka Only wake me up when everything breaks
  41. Prescriptive alerting: All hail our platform overlords
    • Shared Computing Infrastructure
    • Architectures are looking similar
    • Best practices emerging
    • Platform takes care of headache
  42. Thank you. We are Hiring! Someone you know! http://pagerduty.com • Arup Chakrabarti, Operations Engineering • @arupchak • arup@pagerduty.com
  43. Q and A. I have t-shirts. http://pagerduty.com • Arup Chakrabarti, Operations Engineering • @arupchak • arup@pagerduty.com