Anomaly Detection using Prometheus

I discuss some techniques for anomaly detection using Prometheus queries and recording rules.

Andrew Newdigate

June 05, 2019

Transcript

  1. @suprememoocow Why is anomaly detection useful?

    • Diagnosing incidents / improving MTTD
    • Detecting application performance regressions
    • Abuse
    • Security issues
  2. Aggregating your metrics

        # Too much aggregation!
        - record: job:http_requests:rate5m
          expr: sum(rate(http_requests_total[5m]))  # No labels

        # --> job:http_requests:rate5m{} 12389
  3. Aggregating your metrics - too little

        - record: instance_method_controller:http_requests:rate5m
          expr: sum without(status_code) (rate(http_requests_total[5m]))

        # --> instance_method_controller:http_requests:rate5m{
        #       instance="api-01:8080",
        #       job="apiserver",
        #       method="GET",
        #       controller="ProjectsController",
        #       environment="prod"} 21321
        # --> instance_method_controller:http_requests:rate5m{
        #       instance="api-01:8080",
        #       job="apiserver",
        #       method="POST",
        #       controller="ProjectsController",
        #       environment="prod"} 2133
        # ... 400 more series ...
  4. Aggregating your metrics - just right

        - record: job:http_requests:rate5m
          expr: sum by (job, environment) (rate(http_requests_total[5m]))

        # --> job:http_requests:rate5m{job="apiserver", environment="prod"} 21321
        # --> job:http_requests:rate5m{job="gitserver", environment="prod"} 2212
        # --> job:http_requests:rate5m{job="webserver", environment="prod"} 53091
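To make the three aggregation levels concrete, here is a minimal Python sketch of what happens when series are summed while keeping only the `job` and `environment` labels, as in the "just right" rule. The `sum_by` helper and all sample values are hypothetical and only illustrate the grouping; they are not taken from the talk.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Sum sample values, grouping only on the label names listed in `keep`."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical per-instance request rates: (labels, requests per second).
samples = [
    ({"job": "apiserver", "environment": "prod",
      "instance": "api-01:8080", "method": "GET"}, 120.0),
    ({"job": "apiserver", "environment": "prod",
      "instance": "api-02:8080", "method": "POST"}, 30.0),
    ({"job": "gitserver", "environment": "prod",
      "instance": "git-01:8080", "method": "GET"}, 50.0),
]

# "Just right": one series per (job, environment) pair.
per_job = sum_by(samples, keep=("job", "environment"))

# "Too much": every label is dropped, leaving a single series.
everything = sum_by(samples, keep=())
```

With these inputs `per_job` holds two series (one per job) and `everything` holds one, which mirrors the difference between slides 2 and 4.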
  5. Anomaly detection using z-scores

        # Long-term average value for the series
        - record: job:http_requests:rate5m:avg_over_time_1w
          expr: avg_over_time(job:http_requests:rate5m[1w])

        # Long-term standard deviation for the series
        - record: job:http_requests:rate5m:stddev_over_time_1w
          expr: stddev_over_time(job:http_requests:rate5m[1w])
  6. Anomaly detection using z-scores

        # Z-Score for aggregation
        (
          job:http_requests:rate5m
          - job:http_requests:rate5m:avg_over_time_1w
        ) / job:http_requests:rate5m:stddev_over_time_1w
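The z-score arithmetic can be sketched in a few lines of Python. This is only an illustration of the formula, not the recording rules themselves: the `z_score` helper and the sample history are hypothetical. It uses the population standard deviation, which is what `stddev_over_time` computes.

```python
import statistics


def z_score(current, history):
    """Return how many standard deviations `current` is from the mean of `history`."""
    mean = statistics.fmean(history)
    # Population standard deviation, matching Prometheus' stddev_over_time.
    stddev = statistics.pstdev(history)
    if stddev == 0:
        # A choice made for this sketch: a flat history has no defined z-score.
        return None
    return (current - mean) / stddev


# Hypothetical week of request rates: mean 100, standard deviation 10.
history = [90.0, 110.0, 90.0, 110.0]
score = z_score(130.0, history)
```

Here `score` comes out at 3, so a current rate of 130 sits three standard deviations above the long-term average.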
  7. Visualizing z-scores

    Figure: GitLab.com Pages service RPS over 48 hours, with ±3 z-score region in green

        job:http_requests:rate5m

        ZSCORE: 0
        job:http_requests:rate5m:avg_over_time_1w

        ZSCORE: 3
        job:http_requests:rate5m:avg_over_time_1w + 3 * job:http_requests:rate5m:stddev_over_time_1w

        ZSCORE: -3
        job:http_requests:rate5m:avg_over_time_1w - 3 * job:http_requests:rate5m:stddev_over_time_1w
  8. Basic normal distribution test

        (
          max_over_time(job:http_requests:rate5m[1w])
          - avg_over_time(job:http_requests:rate5m[1w])
        ) / stddev_over_time(job:http_requests:rate5m[1w])
        # --> {job="apiserver", environment="prod"} 4.01
        # --> {job="gitserver", environment="prod"} 3.96
        # --> {job="webserver", environment="prod"} 2.96

        (
          min_over_time(job:http_requests:rate5m[1w])
          - avg_over_time(job:http_requests:rate5m[1w])
        ) / stddev_over_time(job:http_requests:rate5m[1w])
        # --> {job="apiserver", environment="prod"} -3.8
        # --> {job="gitserver", environment="prod"} -4.1
        # --> {job="webserver", environment="prod"} -3.2
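The test above computes the z-scores of the weekly maximum and minimum. A rough Python equivalent is sketched below. The `limit` of 4.5 is an assumption chosen to sit just above the roughly ±4 values shown on the slide; it is not a threshold stated in the talk, and both sample series are made up.

```python
import statistics


def z_score_extremes(history):
    """Return the z-scores of the largest and smallest values in `history`.

    Assumes `history` is not constant, so the standard deviation is non-zero.
    """
    mean = statistics.fmean(history)
    stddev = statistics.pstdev(history)
    return (max(history) - mean) / stddev, (min(history) - mean) / stddev


def looks_roughly_normal(history, limit=4.5):
    """Heuristic only: both extremes must fall within +/- `limit` z-scores."""
    highest, lowest = z_score_extremes(history)
    return highest <= limit and lowest >= -limit


# Hypothetical well-behaved series: values cycle evenly between 97 and 103.
steady = [100.0 + (i % 7) - 3 for i in range(70)]

# Hypothetical long-tailed series: one huge spike among flat values.
spiky = [100.0] * 99 + [1000.0]
```

For `steady` the extremes land at ±1.5, while the single spike in `spiky` produces a maximum z-score close to 10, which suggests z-scores would be a poor fit for that series.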
  9. Leveraging seasonality with offset

    Figure: Gitaly service RPS (blue), with 7-day rolling average (yellow), over a two-week period
  10. Seasonality with Prometheus, v1

        - record: job:http_requests:rate5m_prediction
          expr: |
            job:http_requests:rate5m offset 1w            # Value from last period
            + job:http_requests:rate5m:avg_over_time_1w   # Add 1w growth trend
            - job:http_requests:rate5m:avg_over_time_1w offset 1w
  11. Seasonality with Prometheus, v2

        - record: job:http_requests:rate5m_prediction
          expr: |
            avg_over_time(job:http_requests:rate5m[4h] offset 166h)  # Rounded value from last period
            + job:http_requests:rate5m:avg_over_time_1w              # Add 1w growth trend
            - job:http_requests:rate5m:avg_over_time_1w offset 1w
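The v2 prediction is the smoothed value from the same time last week plus the change in the weekly average. The 166h offset is one week (168h) minus two hours, so the 4h window is centred on the same moment one week earlier. A minimal Python sketch of that arithmetic follows; the `predict` helper and its inputs are hypothetical.

```python
import statistics


def predict(last_week_window, avg_this_week, avg_last_week):
    """Seasonal prediction: smoothed value from last week plus the growth trend.

    last_week_window: samples from the 4h window centred on this time last week.
    avg_this_week:    average over the most recent week.
    avg_last_week:    average over the week before that.
    """
    smoothed = statistics.fmean(last_week_window)
    growth = avg_this_week - avg_last_week
    return smoothed + growth


# Hypothetical numbers: last week's window averaged 210 RPS, and the
# weekly average has grown from 140 to 150 RPS since then.
prediction = predict([200.0, 220.0, 210.0, 210.0],
                     avg_this_week=150.0,
                     avg_last_week=140.0)
```

With these inputs the prediction is 220 RPS: the smoothed 210 from last week, shifted up by the 10 RPS of growth.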
  12. Avoiding last week's anomalies...

    Figure: Gitaly service RPS over a two-week period, from Monday 29 April 2019

    Annotation: 1 May 2019, International Labour Day: no work today...
  13. Avoiding last week's anomalies...

    Figure: Gitaly service RPS (yellow), with prediction (blue), over a two-week period, from Monday 29 April 2019
  14. Use multiple predictions...

        avg_over_time(job:http_requests:rate5m[4h] offset 166h)  # 1 week - 2 hours
        + job:http_requests:rate5m:avg_over_time_1w
        - job:http_requests:rate5m:avg_over_time_1w offset 1w

        avg_over_time(job:http_requests:rate5m[4h] offset 334h)  # 2 weeks - 2 hours
        + job:http_requests:rate5m:avg_over_time_1w
        - job:http_requests:rate5m:avg_over_time_1w offset 2w

        avg_over_time(job:http_requests:rate5m[4h] offset 502h)  # 3 weeks - 2 hours
        + job:http_requests:rate5m:avg_over_time_1w
        - job:http_requests:rate5m:avg_over_time_1w offset 3w
  15. Three predictions from three Wednesdays

    Figure: 3 predictions vs actual Gitaly RPS, Wednesday 8 May (1 week following Intl Labour Day)
  16. Median of the three predictions

        - record: job:http_requests:rate5m_prediction
          expr: >
            quantile(0.5,
              label_replace(
                avg_over_time(job:http_requests:rate5m[4h] offset 166h)
                + job:http_requests:rate5m:avg_over_time_1w
                - job:http_requests:rate5m:avg_over_time_1w offset 1w
                , "offset", "1w", "", "")
              or
              label_replace(
                avg_over_time(job:http_requests:rate5m[4h] offset 334h)
                + job:http_requests:rate5m:avg_over_time_1w
                - job:http_requests:rate5m:avg_over_time_1w offset 2w
                , "offset", "2w", "", "")
              or
              label_replace(
                avg_over_time(job:http_requests:rate5m[4h] offset 502h)
                + job:http_requests:rate5m:avg_over_time_1w
                - job:http_requests:rate5m:avg_over_time_1w offset 3w
                , "offset", "3w", "", "")
            )
            without (offset)
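The rule above tags each weekly prediction with a distinct `offset` label so that `or` keeps all three series (it would otherwise drop series whose label sets match), then takes the 0.5 quantile across them and discards the label. In plain terms this is the median of three predictions. The Python sketch below shows why the median helps; the prediction values are hypothetical, with the "1w" entry standing in for a week distorted by a public holiday.

```python
import statistics


def median_prediction(predictions_by_offset):
    """Combine per-week predictions by taking their median."""
    return statistics.median(predictions_by_offset.values())


# Hypothetical predictions: last week was a holiday, so its value is far too low.
predictions = {"1w": 40.0, "2w": 210.0, "3w": 205.0}

combined = median_prediction(predictions)
mean_for_comparison = statistics.fmean(predictions.values())
```

The median lands on 205 and ignores the holiday outlier, whereas the mean of the same three values would be dragged down to about 152.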
  17. One prediction from three weeks…

    Figure: Median predictions vs actual Gitaly RPS, Wednesday 8 May (1 week following Intl Labour Day)
  18. Visualising seasonal anomalies

    Figure: Predicted normal range ± 1.5σ for Gitaly Service
    https://dashboards.gitlab.com/d/26q8nTzZz/service-platform-metrics?var-type=gitaly&orgId=1
  19. Alerting

        - alert: RequestRateOutsideNormalRange
          expr: >
            abs(
              (
                job:http_requests:rate5m
                - job:http_requests:rate5m_prediction
              ) / job:http_requests:rate5m:stddev_over_time_1w
            ) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Requests for job {{ $labels.job }} are outside of expected operating parameters
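The alert condition is a z-score computed against the seasonal prediction instead of the weekly average, and it must hold for ten minutes before firing. The Python sketch below models that logic. It assumes a one-minute evaluation interval and approximates `for: 10m` as ten consecutive true evaluations; Prometheus' actual pending-to-firing timing depends on the configured evaluation interval. Both helper names are hypothetical.

```python
def is_anomalous(actual, predicted, stddev, threshold=2.0):
    """True when `actual` deviates from `predicted` by more than `threshold` deviations."""
    return abs((actual - predicted) / stddev) > threshold


def should_fire(evaluations, required_consecutive=10):
    """Approximate `for:` by requiring the most recent evaluations to all be true.

    evaluations: booleans, oldest first, one per rule evaluation.
    """
    if len(evaluations) < required_consecutive:
        return False
    return all(evaluations[-required_consecutive:])


# Hypothetical values: prediction 100 RPS, weekly standard deviation 10 RPS.
spike = is_anomalous(actual=130.0, predicted=100.0, stddev=10.0)   # 3 deviations
normal = is_anomalous(actual=110.0, predicted=100.0, stddev=10.0)  # 1 deviation
```

Requiring the condition to persist filters out short blips: a single normal evaluation inside the window resets the sequence and the alert does not fire.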
  20. Using z-scores for anomaly triage

    Figure: GitLab.com Triage with RPS (left) and RPS z-scores (right)
    https://dashboards.gitlab.com/d/XufqmIGWk/platform-triage
  21. Conclusion

    • Anomaly detection is possible in Prometheus
    • The right level of aggregation is the key to anomaly detection
    • Z-scores will only work with normally distributed data
    • Seasonal metrics are good for anomaly detection
  22. Thank You!

    • GitLab Public Grafana: http://dashboards.gitlab.com
    • GitLab Prometheus Recording Rules for Anomaly Detection: https://gitlab.com/gitlab-com/runbooks
    • #talk-andrew-newdigate
    • All the code snippets: https://gitlab.com/snippets/1863717