Alerting: more signal, less noise, less pain

Velocity 2013

Alexis Lê-Quôc

October 15, 2013

Transcript

  1. Alerting: More Signal Less Noise Less Pain Alexis Lê-Quôc (@alq)

  2. Is this talk for me? ✓ I am or will be on-call ✓ I don’t like being alerted ✓ I want the pain to go away
  3. The next 40 minutes: 1. Alerts == pain? 2. Measure alerts 3. Concrete (& fun) steps
  4. Alleviate the pain

  5. None
  6. None
  7. Pain

  8. Man vs Machine

  9. “too frequently” “odd hours” “always the same”

  10. 3 simple things to measure

  11. “Always the same”

  12. Steps • Group alert stream by “alert signature” • Rank by occurrences • Graph
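
The grouping and ranking steps on this slide can be sketched with the standard library. This is a minimal illustration, not the speaker's code: the sample `alerts` list and the helper names `rank_signatures` and `top_n_share` are assumptions made up for the example, and the "signature" is simply the alert header, as in the naive approach the talk mentions. The graphing step is left out.

```python
from collections import Counter

# Hypothetical sample: each alert reduced to its header (a naive signature).
alerts = [
    "Root disk space", "redis-queue", "Root disk space", "Zombies",
    "Root disk space", "redis-queue", "SSH",
]


def rank_signatures(signatures):
    """Return (signature, count) pairs, most frequent first."""
    return Counter(signatures).most_common()


def top_n_share(signatures, n=5):
    """Percentage of total alert volume covered by the n most frequent signatures."""
    ranked = rank_signatures(signatures)
    total = sum(count for _, count in ranked)
    if total == 0:
        return 0.0
    return 100.0 * sum(count for _, count in ranked[:n]) / total


ranking = rank_signatures(alerts)
for name, count in ranking:
    print(f"{name:<20} | {count}")
print(f"top 2 share: {top_n_share(alerts, n=2):.1f}%")
```

With real data, `top_n_share(alerts, n=5)` would give the kind of "top 5 in volume" percentage discussed in the following slides.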
  13. Alert Signatures (example)

        name                | count
        --------------------+-------
        Root disk space     |    88
        redis-queue         |    71
        Zombies             |    50
        Total Processes     |    47
        dispatcher          |    37
        pgsql backends      |    35
        cassandra JVM Heap  |    32
        SSH                 |    30

    Naive: alert headers
  14. Case 1: Top 5 = 25% in volume. [Plot: alert count by signature, in % of volume, with a zoom on the top 10. Sample size: 1123 alerts, 6 months.]
  15. Case 2: Top 5 = 38% in volume. [Plot: alert count by signature, in % of volume, with a zoom on the top 10. Sample size: 2324 alerts, 6 months.]
  16. Solve the top 5 alerts and drop the volume by 20-80%. [Plot: % in volume of top 5 against alert sample size, plus a frequency histogram of % in volume of top 5. Top 5 over 103 alert streams, min. 100 alerts per stream. One outlier, due to naive signature.]
  17. “Odd hours”

  18. Steps • Group alert stream by signature, • ... day of week, hour of day • Graph
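
As a rough sketch of this grouping, the snippet below buckets alerts by signature, day of week, and hour of day in UTC. The record layout (signature plus ISO 8601 timestamp) and the function name `bucket_by_day_and_hour` are assumptions for illustration only; the talk does not show its own code on this slide.

```python
from collections import Counter
from datetime import datetime, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical alert records: (signature, timestamp with UTC offset).
alerts = [
    ("Root disk space", "2013-10-13T03:15:00+00:00"),  # a Sunday
    ("Root disk space", "2013-10-13T03:45:00+00:00"),  # same Sunday, same hour
    ("redis-queue", "2013-10-14T14:05:00+00:00"),      # a Monday
    ("Root disk space", "2013-10-15T22:30:00+00:00"),  # a Tuesday
]


def bucket_by_day_and_hour(records):
    """Count alerts per (signature, day of week, hour of day), all in UTC."""
    counts = Counter()
    for signature, stamp in records:
        when = datetime.fromisoformat(stamp).astimezone(timezone.utc)
        counts[(signature, DAYS[when.weekday()], when.hour)] += 1
    return counts


counts = bucket_by_day_and_hour(alerts)
for (signature, day, hour), n in sorted(counts.items()):
    print(f"{signature:<16} {day:<9} {hour:02d}h  {n}")
```

Converting the hour to the on-call engineer's local time zone, rather than UTC, would probably make the "odd hours" easier to read off the resulting graph.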
  19. [Plot: alert count by hour of day (UTC), one panel per day of week, Sunday through Saturday. Annotated regions: “Work days”, “Evenings”, “Saturday”, “Sunday”, “ZZZzzz”.]
  20. “Too frequently”

  21. Steps • Group alerts by signature • Measure time elapsed between first and last occurrence & average/%-ile time elapsed between occurrences • Graph
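
The measurements listed on this slide can be approximated as follows. This is a sketch under assumptions: the sample records and the helper `signature_stats` are invented for the example, the median stands in for the percentile the slide mentions, and "age" follows the definition used on the next slide (days between first and last occurrence).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

# Hypothetical alert records: (signature, timestamp).
alerts = [
    ("Zombies", "2013-04-01T00:00:00"),
    ("Zombies", "2013-04-03T00:00:00"),
    ("Zombies", "2013-04-07T00:00:00"),
    ("SSH", "2013-05-10T12:00:00"),
]

SECONDS_PER_DAY = 86400


def signature_stats(records):
    """Per signature: occurrence count, age in days, mean and median gap in days."""
    by_signature = defaultdict(list)
    for signature, stamp in records:
        by_signature[signature].append(datetime.fromisoformat(stamp))

    stats = {}
    for signature, times in by_signature.items():
        times.sort()
        # Gaps between consecutive occurrences, in days.
        gaps = [(later - earlier).total_seconds() / SECONDS_PER_DAY
                for earlier, later in zip(times, times[1:])]
        stats[signature] = {
            "count": len(times),
            "age_days": (times[-1] - times[0]).total_seconds() / SECONDS_PER_DAY,
            # A signature seen only once has no gap to measure.
            "mean_gap_days": mean(gaps) if gaps else None,
            "median_gap_days": median(gaps) if gaps else None,
        }
    return stats


stats = signature_stats(alerts)
for signature, values in stats.items():
    print(signature, values)
```

Plotting `count` against `age_days`, and `count` against `mean_gap_days`, would give rough equivalents of the two graphs that follow.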
  22. [Plot: alert count against alert age in days. Age = days between first and last occurrence. Annotated regions: “Old and frequent”, “New and frequent”.]
  23. [Plot: occurrences per alert against average period between occurrences. Annotated regions: “Occur every 2-3 days on average”, “Once in a blue moon”.]
  24. “too frequently” “odd hours” “always the same”

  25. “too frequently” “odd hours” “always the same” Quantified

  26. Concrete steps

  27. Measure your alerts 1. Collect 2. Massage 3. Visualize 4. Learn
  28. Collect your alerts • From PagerDuty (OpsGenie, Nagios, etc.) • Import with Python (pygerduty) • Store in PostgreSQL
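
To illustrate the "store" step, here is a small sketch that loads alert rows into a SQL table and counts them by signature. It deliberately swaps in the standard library's `sqlite3` with an in-memory database so the example runs anywhere; the talk itself names PostgreSQL. The table layout, column names, and sample rows are assumptions, and the fetch from PagerDuty is not shown.

```python
import sqlite3

# Hypothetical rows, shaped like (alert id, signature, creation time in UTC).
rows = [
    ("A1", "Root disk space", "2013-06-01T02:10:00"),
    ("A2", "redis-queue", "2013-06-01T09:30:00"),
    ("A3", "Root disk space", "2013-06-02T02:12:00"),
]

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute(
    """
    CREATE TABLE alerts (
        id         TEXT PRIMARY KEY,
        signature  TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
    """
)

# Importing twice shows that re-running the import does not duplicate alerts:
# rows whose id already exists are skipped.
for _ in range(2):
    conn.executemany("INSERT OR IGNORE INTO alerts VALUES (?, ?, ?)", rows)
conn.commit()

signature_counts = conn.execute(
    """
    SELECT signature, COUNT(*) AS occurrences
    FROM alerts
    GROUP BY signature
    ORDER BY occurrences DESC, signature
    """
).fetchall()
print(signature_counts)
```

`INSERT OR IGNORE` is SQLite syntax; on PostgreSQL the comparable idiom is `INSERT ... ON CONFLICT DO NOTHING`.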
  29. Massage your alerts • Use any of • SQL (windowing functions) • R (reshape) • Python (pandas)
  30. Visualize • R (or d3.js, Excel, etc.) • Key is quick feedback

  31. Slides, Code & Data https://github.com/alq666/velocity-ny-2013

  32. Enjoyed it? Hated it? Don’t care? --- Let me know @alq