Slide 1

Slide 1 text

Anomaly Detection using Prometheus
Andrew Newdigate (@suprememoocow)
Software Engineer, Infrastructure @ GitLab

Slide 2

Slide 2 text

What is anomaly detection?

Slide 3

Slide 3 text

What is anomaly detection?
GitLab.com Pages service RPS over 48 hours

Slide 4

Slide 4 text

Why is anomaly detection useful?
● Diagnosing incidents / improving MTTD
● Detecting application performance regressions
● Detecting abuse
● Detecting security issues

Slide 5

Slide 5 text

Getting Started - example metric

    http_requests_total{
      job="apiserver",
      method="GET",
      controller="ProjectsController",
      status_code="200",
      environment="prod"
    }

Slide 6

Slide 6 text

Aggregating your metrics - too much

    # Too much aggregation!
    - record: job:http_requests:rate5m
      expr: sum(rate(http_requests_total[5m]))  # No labels

    # --> job:http_requests:rate5m{} 12389

Slide 7

Slide 7 text

Aggregating your metrics - too little

    - record: instance_method_controller:http_requests:rate5m
      expr: sum without(status_code) (rate(http_requests_total[5m]))

    # --> instance_method_controller:http_requests:rate5m{
    #       instance="api-01:8080",
    #       job="apiserver",
    #       method="GET",
    #       controller="ProjectsController",
    #       environment="prod"} 21321
    # --> instance_method_controller:http_requests:rate5m{
    #       instance="api-01:8080",
    #       job="apiserver",
    #       method="POST",
    #       controller="ProjectsController",
    #       environment="prod"} 2133
    # ... 400 more series ...

Slide 8

Slide 8 text

Aggregating your metrics - just right

    - record: job:http_requests:rate5m
      expr: sum by (job, environment) (rate(http_requests_total[5m]))

    # --> job:http_requests:rate5m{job="apiserver", environment="prod"} 21321
    # --> job:http_requests:rate5m{job="gitserver", environment="prod"} 2212
    # --> job:http_requests:rate5m{job="webserver", environment="prod"} 53091
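The same per-job aggregation can also be written by naming the labels to drop instead of the labels to keep. This is only a sketch: it assumes that `instance`, `method`, `controller` and `status_code` are the only labels on `http_requests_total` besides `job` and `environment`.

```yaml
# Sketch: drop the per-request labels and keep everything else.
# Assumption: the four labels listed here are the only ones to aggregate away.
- record: job:http_requests:rate5m
  expr: sum without (instance, method, controller, status_code) (rate(http_requests_total[5m]))
```

The trade-off is that `without` keeps any label added to the metric later, which can quietly increase the number of series, whereas `by` discards every label that is not listed.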

Slide 9

Slide 9 text

Statistics 101: Normal Distributions
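The slide image is not part of this transcript. As a reminder of the property the rest of the talk relies on, for a normally distributed value X with mean μ and standard deviation σ, the share of observations falling within one, two and three standard deviations of the mean is approximately:

```latex
\begin{align*}
P(|X - \mu| \le 1\sigma) &\approx 0.683 \\
P(|X - \mu| \le 2\sigma) &\approx 0.954 \\
P(|X - \mu| \le 3\sigma) &\approx 0.997
\end{align*}
```

This is why a value more than three standard deviations from the mean is a reasonable candidate for an anomaly, provided the data really is close to normal.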

Slide 10

Slide 10 text

Statistics 101: Z-Scores
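The slide image is not part of this transcript. The standard definition of a z-score, which the PromQL on the following slides implements, is the distance of a value from the mean measured in standard deviations:

```latex
z = \frac{x - \mu}{\sigma}
```

Here x is the current value of the series, μ is its long-term mean and σ is its long-term standard deviation. In the rules that follow, μ and σ are estimated with `avg_over_time` and `stddev_over_time` over a one-week window.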

Slide 11

Slide 11 text

Anomaly detection using z-scores

    # Long-term average value for the series
    - record: job:http_requests:rate5m:avg_over_time_1w
      expr: avg_over_time(job:http_requests:rate5m[1w])

    # Long-term standard deviation for the series
    - record: job:http_requests:rate5m:stddev_over_time_1w
      expr: stddev_over_time(job:http_requests:rate5m[1w])
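The slides show individual `- record:` entries. To load them, Prometheus expects them inside a rule group in a rules file. A minimal sketch of such a file follows; the group name and the evaluation interval are assumptions chosen for illustration, not values from the talk.

```yaml
groups:
  - name: http_requests_anomaly_detection  # hypothetical group name
    interval: 1m                           # assumed evaluation interval
    rules:
      # Per-job request rate (the "just right" aggregation)
      - record: job:http_requests:rate5m
        expr: sum by (job, environment) (rate(http_requests_total[5m]))

      # Long-term average value for the series
      - record: job:http_requests:rate5m:avg_over_time_1w
        expr: avg_over_time(job:http_requests:rate5m[1w])

      # Long-term standard deviation for the series
      - record: job:http_requests:rate5m:stddev_over_time_1w
        expr: stddev_over_time(job:http_requests:rate5m[1w])
```

Rules within a group are evaluated in order, so the two one-week rules can build on the `rate5m` series recorded just before them.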

Slide 12

Slide 12 text

Anomaly detection using z-scores

    # Z-score for the aggregation
    (
      job:http_requests:rate5m - job:http_requests:rate5m:avg_over_time_1w
    ) / job:http_requests:rate5m:stddev_over_time_1w
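If the z-score is going to be graphed or alerted on repeatedly, it may be convenient to record it as its own series. This is a sketch only; the rule name `job:http_requests:rate5m:z_score` is hypothetical and does not appear in the talk.

```yaml
# Sketch: record the z-score so dashboards and alerts can reuse it.
# The record name is an assumption made for this example.
- record: job:http_requests:rate5m:z_score
  expr: >
    (
      job:http_requests:rate5m - job:http_requests:rate5m:avg_over_time_1w
    ) / job:http_requests:rate5m:stddev_over_time_1w
```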

Slide 13

Slide 13 text

Visualizing z-scores
GitLab.com Pages service RPS over 48 hours, with ±3 z-score region in green

Slide 14

Slide 14 text

Visualizing z-scores
GitLab.com Pages service RPS over 48 hours, with ±3 z-score region in green

    job:http_requests:rate5m

    ZSCORE: 0
      job:http_requests:rate5m:avg_over_time_1w

    ZSCORE: 3
      job:http_requests:rate5m:avg_over_time_1w + 3 * job:http_requests:rate5m:stddev_over_time_1w

    ZSCORE: -3
      job:http_requests:rate5m:avg_over_time_1w - 3 * job:http_requests:rate5m:stddev_over_time_1w

Slide 15

Slide 15 text

Don’t let the statistics lie...

Slide 16

Slide 16 text

Basic normal distribution test

    # Z-score of the largest value seen in the last week
    (
      max_over_time(job:http_requests:rate5m[1w]) - avg_over_time(job:http_requests:rate5m[1w])
    ) / stddev_over_time(job:http_requests:rate5m[1w])

    # --> {job="apiserver", environment="prod"} 4.01
    # --> {job="gitserver", environment="prod"} 3.96
    # --> {job="webserver", environment="prod"} 2.96

    # Z-score of the smallest value seen in the last week
    (
      min_over_time(job:http_requests:rate5m[1w]) - avg_over_time(job:http_requests:rate5m[1w])
    ) / stddev_over_time(job:http_requests:rate5m[1w])

    # --> {job="apiserver", environment="prod"} -3.8
    # --> {job="gitserver", environment="prod"} -4.1
    # --> {job="webserver", environment="prod"} -3.2
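Extremes that sit within a few standard deviations of the mean, as in the sample output on this slide, are consistent with roughly normal data; extremes that sit much further out suggest the series is not normally distributed, in which case z-scores will mislead. To keep an eye on this over time, the two checks could be recorded as series. This is a sketch only; both record names are hypothetical.

```yaml
# Sketch: record the z-scores of the weekly maximum and minimum.
# Both record names are assumptions made for this example.
- record: job:http_requests:rate5m:max_z_score_1w
  expr: >
    (
      max_over_time(job:http_requests:rate5m[1w]) - job:http_requests:rate5m:avg_over_time_1w
    ) / job:http_requests:rate5m:stddev_over_time_1w

- record: job:http_requests:rate5m:min_z_score_1w
  expr: >
    (
      min_over_time(job:http_requests:rate5m[1w]) - job:http_requests:rate5m:avg_over_time_1w
    ) / job:http_requests:rate5m:stddev_over_time_1w
```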

Slide 17

Slide 17 text

Seasonality
Gitaly requests-per-second, Monday-Sunday, 4 consecutive weeks

Slide 18

Slide 18 text

Leveraging seasonality with offset
Gitaly service RPS (blue), with 7-day rolling average (yellow), over a two-week period

Slide 19

Slide 19 text

Seasonality with Prometheus, v1

    - record: job:http_requests:rate5m_prediction
      expr: |
        job:http_requests:rate5m offset 1w                    # Value from last period
        + job:http_requests:rate5m:avg_over_time_1w           # Add 1w growth trend
        - job:http_requests:rate5m:avg_over_time_1w offset 1w

Slide 20

Slide 20 text

Seasonality with Prometheus, v2

    - record: job:http_requests:rate5m_prediction
      expr: |
        avg_over_time(job:http_requests:rate5m[4h] offset 166h) # Smoothed value from last period
        + job:http_requests:rate5m:avg_over_time_1w             # Add 1w growth trend
        - job:http_requests:rate5m:avg_over_time_1w offset 1w
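The 166h offset in v2 is what centres the 4-hour averaging window on the same moment one week earlier. The arithmetic, with t as the evaluation time:

```latex
\begin{align*}
1\,\text{w} &= 7 \times 24\,\text{h} = 168\,\text{h} \\
\text{window} &= [\,t - 166\,\text{h} - 4\,\text{h},\; t - 166\,\text{h}\,]
              = [\,t - 170\,\text{h},\; t - 166\,\text{h}\,] \\
\text{midpoint} &= t - 168\,\text{h} = t - 1\,\text{w}
\end{align*}
```

More generally, for a window of width w centred k weeks back, the offset is 168k − w/2 hours.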

Slide 21

Slide 21 text

Seasonal predictions
Gitaly service RPS (yellow) vs prediction (dotted red), over two weeks

Slide 22

Slide 22 text

Avoiding last week's anomalies...
Gitaly service RPS over a two-week period, from Monday 29 April 2019
1 May 2019, International Labour Day: no work today...

Slide 23

Slide 23 text

Avoiding last week's anomalies...
Gitaly service RPS (yellow), with prediction (blue), over a two-week period, from Monday 29 April 2019

Slide 24

Slide 24 text

Use multiple predictions...

    avg_over_time(job:http_requests:rate5m[4h] offset 166h) # 1 week - 2 hours
    + job:http_requests:rate5m:avg_over_time_1w
    - job:http_requests:rate5m:avg_over_time_1w offset 1w

    avg_over_time(job:http_requests:rate5m[4h] offset 334h) # 2 weeks - 2 hours
    + job:http_requests:rate5m:avg_over_time_1w
    - job:http_requests:rate5m:avg_over_time_1w offset 2w

    avg_over_time(job:http_requests:rate5m[4h] offset 502h) # 3 weeks - 2 hours
    + job:http_requests:rate5m:avg_over_time_1w
    - job:http_requests:rate5m:avg_over_time_1w offset 3w

Slide 25

Slide 25 text

Three predictions from three Wednesdays
3 predictions vs actual Gitaly RPS, Wednesday 8 May (1 week following Intl Labour Day)

Slide 26

Slide 26 text

    - record: job:http_requests:rate5m_prediction
      expr: >
        quantile(0.5,
          label_replace(
            avg_over_time(job:http_requests:rate5m[4h] offset 166h)
            + job:http_requests:rate5m:avg_over_time_1w
            - job:http_requests:rate5m:avg_over_time_1w offset 1w
            , "offset", "1w", "", "")
          or
          label_replace(
            avg_over_time(job:http_requests:rate5m[4h] offset 334h)
            + job:http_requests:rate5m:avg_over_time_1w
            - job:http_requests:rate5m:avg_over_time_1w offset 2w
            , "offset", "2w", "", "")
          or
          label_replace(
            avg_over_time(job:http_requests:rate5m[4h] offset 502h)
            + job:http_requests:rate5m:avg_over_time_1w
            - job:http_requests:rate5m:avg_over_time_1w offset 3w
            , "offset", "3w", "", "")
        )
        without (offset)
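The point of `quantile(0.5, ...)` in this rule is that the median of three predictions ignores a single outlier week. With exactly three inputs the 0.5 quantile is the middle value, so no interpolation is involved. A worked example with made-up numbers (these are hypothetical, not measurements from the talk), where the prediction based on two weeks ago is depressed by a public holiday:

```latex
\operatorname{median}\{p_{1w},\, p_{2w},\, p_{3w}\}
  = \operatorname{median}\{1000,\ 400,\ 1050\}
  = 1000
```

An average of the same three values would give about 817, which shows how much a single anomalous week would otherwise pull the prediction down.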

Slide 27

Slide 27 text

One prediction from three weeks…
Median prediction vs actual Gitaly RPS, Wednesday 8 May (1 week following Intl Labour Day)

Slide 28

Slide 28 text

Visualising seasonal anomalies
Predicted normal range ± 1.5σ for Gitaly Service
https://dashboards.gitlab.com/d/26q8nTzZz/service-platform-metrics?var-type=gitaly&orgId=1
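One way to draw a band like this is to record its upper and lower edges from the prediction and the weekly standard deviation. The sketch below is an assumption about how that could be done, not the rules behind the dashboard; both record names are hypothetical, and the lower edge is clamped at zero on the assumption that a request rate cannot be negative.

```yaml
# Sketch: upper and lower edges of the predicted normal range (prediction ± 1.5σ).
# Record names are assumptions made for this example.
- record: job:http_requests:rate5m_prediction:upper
  expr: >
    job:http_requests:rate5m_prediction
    + 1.5 * job:http_requests:rate5m:stddev_over_time_1w

- record: job:http_requests:rate5m_prediction:lower
  expr: >
    clamp_min(
      job:http_requests:rate5m_prediction
      - 1.5 * job:http_requests:rate5m:stddev_over_time_1w,
      0
    )
```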

Slide 29

Slide 29 text

Alerting

    - alert: RequestRateOutsideNormalRange
      expr: >
        abs(
          (
            job:http_requests:rate5m - job:http_requests:rate5m_prediction
          ) / job:http_requests:rate5m:stddev_over_time_1w
        ) > 2
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Requests for job {{ $labels.job }} are outside of expected operating parameters

Slide 30

Slide 30 text

Using z-scores for anomaly triage
GitLab.com Triage with RPS (left) and RPS z-scores (right)
https://dashboards.gitlab.com/d/XufqmIGWk/platform-triage

Slide 31

Slide 31 text

Conclusion
● Anomaly detection is possible in Prometheus
● The right level of aggregation is the key to anomaly detection
● Z-scores will only work with normally distributed data
● Seasonal metrics are good for anomaly detection

Slide 32

Slide 32 text

Thank You!
● GitLab Public Grafana: http://dashboards.gitlab.com
● GitLab Prometheus Recording Rules for Anomaly Detection: https://gitlab.com/gitlab-com/runbooks
● #talk-andrew-newdigate
● All the code snippets: https://gitlab.com/snippets/1863717