Anomaly Detection using Prometheus

I discuss some techniques for anomaly detection using Prometheus queries and recording rules.

Andrew Newdigate

June 05, 2019

Transcript

  1. @suprememoocow Anomaly Detection using Prometheus

    Andrew Newdigate, Software Engineer, Infrastructure @ GitLab
  2. What is anomaly detection?

  3. What is anomaly detection?

    GitLab.com Pages service RPS over 48 hours
  4. Why is anomaly detection useful?

    • Diagnosing incidents / improving MTTD
    • Detecting application performance regressions
    • Abuse
    • Security issues
  5. Getting Started - example metric

    http_requests_total{
      job="apiserver",
      method="GET",
      controller="ProjectsController",
      status_code="200",
      environment="prod"
    }
  6. Aggregating your metrics

    # Too much aggregation!
    - record: job:http_requests:rate5m
      expr: sum(rate(http_requests_total[5m]))  # No labels
    # --> job:http_requests:rate5m{} 12389
  7. Aggregating your metrics - too little

    - record: instance_method_controller:http_requests:rate5m
      expr: sum without(status_code) (rate(http_requests_total[5m]))
    # --> instance_method_controller:http_requests:rate5m{
    #       instance="api-01:8080",
    #       job="apiserver",
    #       method="GET",
    #       controller="ProjectsController",
    #       environment="prod"} 21321
    # --> instance_method_controller:http_requests:rate5m{
    #       instance="api-01:8080",
    #       job="apiserver",
    #       method="POST",
    #       controller="ProjectsController",
    #       environment="prod"} 2133
    # ... 400 more series ...
  8. Aggregating your metrics - just right

    - record: job:http_requests:rate5m
      expr: sum by (job, environment) (rate(http_requests_total[5m]))
    # --> job:http_requests:rate5m{job="apiserver", environment="prod"} 21321
    # --> job:http_requests:rate5m{job="gitserver", environment="prod"} 2212
    # --> job:http_requests:rate5m{job="webserver", environment="prod"} 53091
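The three aggregation levels above differ only in which labels survive the sum. A minimal Python sketch of that grouping, outside Prometheus, may make the trade-off easier to see. The sample series, their values, and the `sum_by` helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical per-series request rates, each paired with its label set.
series = [
    ({"job": "apiserver", "environment": "prod", "method": "GET"}, 120.0),
    ({"job": "apiserver", "environment": "prod", "method": "POST"}, 30.0),
    ({"job": "gitserver", "environment": "prod", "method": "GET"}, 45.0),
]


def sum_by(series, keep):
    """Sum values across series, grouping on the label names listed in `keep`."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


# Grouping on job and environment keeps one series per service.
by_job_env = sum_by(series, ("job", "environment"))

# Grouping on nothing collapses everything into a single unlabelled series.
no_labels = sum_by(series, ())
```

With no labels kept, a drop in one service can be masked by the others; with every label kept, each series carries too little traffic to have a stable baseline.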
  9. Statistics 101: Normal Distributions

  10. Statistics 101: Z-Scores

  11. Anomaly detection using z-scores

    # Long-term average value for the series
    - record: job:http_requests:rate5m:avg_over_time_1w
      expr: avg_over_time(job:http_requests:rate5m[1w])

    # Long-term standard deviation for the series
    - record: job:http_requests:rate5m:stddev_over_time_1w
      expr: stddev_over_time(job:http_requests:rate5m[1w])
  12. Anomaly detection using z-scores

    # Z-Score for aggregation
    (
      job:http_requests:rate5m - job:http_requests:rate5m:avg_over_time_1w
    ) / job:http_requests:rate5m:stddev_over_time_1w
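The z-score query is the current value minus the long-term mean, divided by the long-term standard deviation. The sketch below computes the same quantity in Python over a plain list of samples. The function name, the sample values, and the NaN result for a flat series are assumptions made for illustration, not part of the talk.

```python
import math
import statistics


def z_score(current, history):
    """Number of standard deviations `current` lies from the mean of `history`."""
    mean = statistics.fmean(history)
    # Population standard deviation, which is what stddev_over_time computes.
    stddev = statistics.pstdev(history)
    if stddev == 0:
        # Assumption: a perfectly flat series has no meaningful z-score.
        return math.nan
    return (current - mean) / stddev


# Hypothetical week of request rates.
history = [10.0, 12.0, 14.0, 16.0, 18.0]
score = z_score(25.0, history)
```

A value within roughly ±3 is unremarkable for normally distributed data; anything far beyond that is a candidate anomaly.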
  13. Visualizing z-scores

    GitLab.com Pages service RPS over 48 hours, with ±3 z-score region in green
  14. Visualizing z-scores

    GitLab.com Pages service RPS over 48 hours, with ±3 z-score region in green

    job:http_requests:rate5m
    ZSCORE: 0   job:http_requests:rate5m:avg_over_time_1w
    ZSCORE: 3   job:http_requests:rate5m:avg_over_time_1w + 3 * job:http_requests:rate5m:stddev_over_time_1w
    ZSCORE: -3  job:http_requests:rate5m:avg_over_time_1w - 3 * job:http_requests:rate5m:stddev_over_time_1w
  15. Don’t let the statistics lie...

  16. Basic normal distribution test

    (
      max_over_time(job:http_requests:rate5m[1w]) - avg_over_time(job:http_requests:rate5m[1w])
    ) / stddev_over_time(job:http_requests:rate5m[1w])
    # --> {job="apiserver", environment="prod"} 4.01
    # --> {job="gitserver", environment="prod"} 3.96
    # --> {job="webserver", environment="prod"} 2.96

    (
      min_over_time(job:http_requests:rate5m[1w]) - avg_over_time(job:http_requests:rate5m[1w])
    ) / stddev_over_time(job:http_requests:rate5m[1w])
    # --> {job="apiserver", environment="prod"} -3.8
    # --> {job="gitserver", environment="prod"} -4.1
    # --> {job="webserver", environment="prod"} -3.2
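The two queries above express the week's maximum and minimum as z-scores. The outputs shown sit between roughly -4 and +4; a much larger magnitude on either side would suggest the data is skewed rather than normally distributed. A rough Python equivalent, using made-up sample data and a hypothetical helper name:

```python
import statistics


def extreme_z_scores(samples):
    """Return (z-score of the minimum, z-score of the maximum) for `samples`.

    Assumes the samples are not all identical, so the deviation is non-zero.
    """
    mean = statistics.fmean(samples)
    stddev = statistics.pstdev(samples)
    return (min(samples) - mean) / stddev, (max(samples) - mean) / stddev


# Symmetric data: the two extremes mirror each other.
symmetric = extreme_z_scores([1.0, 2.0, 3.0, 4.0, 5.0])

# Mostly flat data with a single large spike: the maximum is far outside ±4.
skewed = extreme_z_scores([1.0] * 99 + [100.0])
```

For the spiky series the maximum lands many standard deviations out while the minimum barely moves, which is the kind of asymmetry this test is meant to expose.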
  17. Seasonality

    Gitaly requests-per-second, Monday-Sunday, 4 consecutive weeks

  18. Leveraging seasonality with offset

    Gitaly service RPS (blue), with 7-day rolling average (yellow), over two week period
  19. Seasonality with Prometheus, v1

    - record: job:http_requests:rate5m_prediction
      expr: |
        job:http_requests:rate5m offset 1w                     # Value from last period
        + job:http_requests:rate5m:avg_over_time_1w            # Add 1w growth trend
        - job:http_requests:rate5m:avg_over_time_1w offset 1w
  20. Seasonality with Prometheus, v2

    - record: job:http_requests:rate5m_prediction
      expr: |
        avg_over_time(job:http_requests:rate5m[4h] offset 166h)  # Rounded value from last period
        + job:http_requests:rate5m:avg_over_time_1w              # Add 1w growth trend
        - job:http_requests:rate5m:avg_over_time_1w offset 1w
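Both versions of the prediction take the value from one week earlier and shift it by the change in the weekly average. The Python sketch below shows that arithmetic on an evenly spaced list of samples. It is a simplification under stated assumptions: the helper name and data are hypothetical, the 4-hour smoothing from v2 is left out, and the trailing average stops just before the point being predicted.

```python
import statistics


def seasonal_prediction(history, period, now):
    """Predict the sample at index `now` from the previous season.

    `history` is an evenly spaced list of samples, `period` is the number of
    samples per season (one week in the slides), and `now` must be at least
    2 * period so that two full seasons are available.
    """
    last_period_value = history[now - period]
    this_avg = statistics.fmean(history[now - period:now])
    prev_avg = statistics.fmean(history[now - 2 * period:now - period])
    # Value from the last period, plus the growth in the seasonal average.
    return last_period_value + (this_avg - prev_avg)


# Two hypothetical "weeks" of four samples each, the second 10 higher.
history = [1.0, 2.0, 3.0, 4.0, 11.0, 12.0, 13.0, 14.0]
prediction = seasonal_prediction(history, period=4, now=8)
```

Here the same pattern repeats with a steady upward shift, so the prediction continues that shift into the next period.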
  21. Seasonal predictions

    Gitaly service RPS (yellow) vs prediction (dotted red), over two weeks.
  22. Avoiding last week's anomalies...

    Gitaly service RPS over two week period, from Monday 29 April 2019

    1 May 2019, International Labour Day: no work today...
  23. Avoiding last week's anomalies...

    Gitaly service RPS (yellow), with prediction (blue), over two week period, from Monday 29 April 2019
  24. Use multiple predictions...

    avg_over_time(job:http_requests:rate5m[4h] offset 166h)  # 1 week - 2 hours
    + job:http_requests:rate5m:avg_over_time_1w
    - job:http_requests:rate5m:avg_over_time_1w offset 1w

    avg_over_time(job:http_requests:rate5m[4h] offset 334h)  # 2 weeks - 2 hours
    + job:http_requests:rate5m:avg_over_time_1w
    - job:http_requests:rate5m:avg_over_time_1w offset 2w

    avg_over_time(job:http_requests:rate5m[4h] offset 502h)  # 3 weeks - 2 hours
    + job:http_requests:rate5m:avg_over_time_1w
    - job:http_requests:rate5m:avg_over_time_1w offset 3w
  25. Three predictions from three Wednesdays

    3 predictions vs actual Gitaly RPS, Wednesday 8 May (1 week following Intl Labour Day)
  26.

    - record: job:http_requests:rate5m_prediction
      expr: >
        quantile(0.5,
          label_replace(
            avg_over_time(job:http_requests:rate5m[4h] offset 166h)
            + job:http_requests:rate5m:avg_over_time_1w
            - job:http_requests:rate5m:avg_over_time_1w offset 1w
            , "offset", "1w", "", "")
          or
          label_replace(
            avg_over_time(job:http_requests:rate5m[4h] offset 334h)
            + job:http_requests:rate5m:avg_over_time_1w
            - job:http_requests:rate5m:avg_over_time_1w offset 2w
            , "offset", "2w", "", "")
          or
          label_replace(
            avg_over_time(job:http_requests:rate5m[4h] offset 502h)
            + job:http_requests:rate5m:avg_over_time_1w
            - job:http_requests:rate5m:avg_over_time_1w offset 3w
            , "offset", "3w", "", "")
        )
        without (offset)
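The rule above builds one prediction from each of the last three weeks and takes their median, so a single anomalous week cannot drag the prediction with it. A small Python sketch of the same idea follows. The helper name and sample data are hypothetical, and the 4-hour smoothing window is omitted to keep the example short.

```python
import statistics


def median_prediction(history, period, now, weeks=(1, 2, 3)):
    """Median of per-week seasonal predictions for the sample at index `now`.

    `history` is an evenly spaced list of samples and `period` is the number
    of samples per week. Requires now >= (max(weeks) + 1) * period.
    """
    this_avg = statistics.fmean(history[now - period:now])
    predictions = []
    for k in weeks:
        past_value = history[now - k * period]
        past_avg = statistics.fmean(history[now - (k + 1) * period:now - k * period])
        # Value from k weeks ago, adjusted by the change in the weekly average.
        predictions.append(past_value + this_avg - past_avg)
    return statistics.median(predictions)


# Four hypothetical weeks of four samples; the last week starts with a
# holiday-style drop to zero.
normal_week = [10.0, 20.0, 30.0, 40.0]
history = normal_week * 3 + [0.0, 20.0, 30.0, 40.0]

single_week = median_prediction(history, period=4, now=16, weeks=(1,))
three_weeks = median_prediction(history, period=4, now=16)
```

Using only the most recent week reproduces the holiday dip in the prediction, while the median of three weeks stays close to the usual value.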
  27. One prediction from three weeks…

    Median predictions vs actual Gitaly RPS, Wednesday 8 May (1 week following Intl Labour Day)
  28. Visualising seasonal anomalies

    Predicted normal range ± 1.5σ for Gitaly Service
    https://dashboards.gitlab.com/d/26q8nTzZz/service-platform-metrics?var-type=gitaly&orgId=1
  29. Alerting

    - alert: RequestRateOutsideNormalRange
      expr: >
        abs(
          (
            job:http_requests:rate5m - job:http_requests:rate5m_prediction
          ) / job:http_requests:rate5m:stddev_over_time_1w
        ) > 2
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Requests for job {{ $labels.job }} are outside of expected operating parameters
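The alert fires when the actual rate stays more than two standard deviations away from the prediction for ten minutes. The sketch below mimics that condition in Python. It assumes one rule evaluation per minute, so `for: 10m` is approximated as ten consecutive evaluations; the function names and sample values are hypothetical.

```python
def is_outside_normal_range(actual, predicted, stddev, threshold=2.0):
    """True when `actual` deviates from `predicted` by more than `threshold` sigmas."""
    return abs((actual - predicted) / stddev) > threshold


def should_fire(samples, threshold=2.0, for_samples=10):
    """True when the most recent `for_samples` evaluations all breach the threshold.

    `samples` is a list of (actual, predicted, stddev) tuples, oldest first,
    one per rule evaluation (assumed here to be one per minute).
    """
    if len(samples) < for_samples:
        return False
    recent = samples[-for_samples:]
    return all(is_outside_normal_range(a, p, s, threshold) for a, p, s in recent)


# Ten consecutive evaluations at 2.5 sigmas above the prediction.
sustained = [(150.0, 100.0, 20.0)] * 10

# The same run interrupted by one in-range evaluation.
interrupted = sustained[:5] + [(110.0, 100.0, 20.0)] + sustained[:4]
```

The sustained run would trigger the alert, while the interrupted one would not, which is how the `for` clause filters out brief spikes.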
  30. Using z-scores for anomaly triage

    GitLab.com Triage with RPS (left) and RPS z-scores (right)
    https://dashboards.gitlab.com/d/XufqmIGWk/platform-triage
  31. Conclusion

    • Anomaly detection is possible in Prometheus
    • The right level of aggregation is the key to anomaly detection
    • Z-scores will only work with normally distributed data
    • Seasonal metrics are good for anomaly detection
  32. Thank You!

    • GitLab Public Grafana: http://dashboards.gitlab.com
    • GitLab Prometheus Recording Rules for Anomaly Detection: https://gitlab.com/gitlab-com/runbooks
    • #talk-andrew-newdigate
    • All the code snippets: https://gitlab.com/snippets/1863717