Slide 1

Alerting with Time Series
Fabian Reinartz, CoreOS
github.com/fabxc | @fabxc

Slide 2

Time Series
Stream of pairs associated with an identifier:

http_requests_total{job="nginx",instance="1.2.3.4:80",path="/status",status="200"}
  1348 @ 1480502384
  1899 @ 1480502389
  2023 @ 1480502394
http_requests_total{job="nginx",instance="1.2.3.1:80",path="/settings",status="201"}
http_requests_total{job="nginx",instance="1.2.3.5:80",path="/",status="500"}
...

Slide 3

Time Series
Stream of pairs associated with an identifier:

sum by(path) (rate(http_requests_total{job="nginx"}[5m]))

{path="/status",status="200"}   32.13 @ 1480502384
{path="/status",status="500"}   19.133 @ 1480502394
{path="/profile",status="200"}  44.52 @ 1480502389

Slide 4

[diagram: Service Discovery (Kubernetes, AWS, Consul, custom...) tells Prometheus which Targets to scrape; Grafana and the built-in UI query Prometheus through its HTTP API]

Slide 5

A lot of traffic to monitor
Monitoring traffic should not be proportional to user traffic

Slide 6

A lot of targets to monitor
A single host can run hundreds of machines/procs/containers/...

Slide 7

Targets constantly change
Deployments, scaling up, scaling down, and rescheduling

Slide 8

Need a fleet-wide view
What’s my 99th percentile request latency across all frontends?
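
One PromQL expression can answer this for the whole fleet. A minimal sketch, assuming request latencies are exported as a histogram metric named http_request_duration_seconds by a job called "frontend" (both names are illustrative, not from the talk):

histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="frontend"}[5m]))
)

Summing the per-bucket rates by le before taking the quantile yields one fleet-wide estimate instead of one per instance.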

Slide 9

Drill-down for investigation
Which pod/node/... has turned unhealthy? How and why?

Slide 10

Monitor all levels, with the same system
Query and correlate metrics across the stack

Slide 11

Translate that to Meaningful Alerting

Slide 12

Anomaly Detection
Automated Alert Correlation
Self-Healing
Machine Learning

Slide 13

Anomaly Detection
If you are actually monitoring at scale, something will always correlate. It takes huge effort to eliminate the huge number of false positives, with a huge chance of introducing false negatives.

Slide 14

Prometheus Alerts
current state != desired state  =>  alerts

Slide 15

Symptom-based pages
Urgent issues – does it hurt your user?
[diagram: the user-facing system and its dependencies]

Slide 16

Four Golden Signals: Latency

Slide 17

Four Golden Signals: Traffic

Slide 18

Four Golden Signals: Errors

Slide 19

Cause-based warnings
Helpful context, non-urgent problems

Slide 20

Four Golden Signals: Saturation / Capacity

Slide 21

etcd_has_leader{job="etcd", instance="A"}  0
etcd_has_leader{job="etcd", instance="B"}  0
etcd_has_leader{job="etcd", instance="C"}  1

Slide 22

Prometheus Alerts

ALERT ...
  IF ...
  FOR ...
  LABELS { ... }
  ANNOTATIONS { ... }

Each result entry is one alert:

Slide 23

requests_total{instance="web-1", path="/index", method="GET"}              8913435
requests_total{instance="web-1", path="/index", method="POST"}               34845
requests_total{instance="web-3", path="/api/profile", method="GET"}         654118
requests_total{instance="web-2", path="/api/profile", method="GET"}         774540
...
request_errors_total{instance="web-1", path="/index", method="GET"}          84513
request_errors_total{instance="web-1", path="/index", method="POST"}           434
request_errors_total{instance="web-3", path="/api/profile", method="GET"}     6562
request_errors_total{instance="web-2", path="/api/profile", method="GET"}     3571
...

Slide 24

Prometheus Alerts

ALERT EtcdNoLeader
  IF etcd_has_leader == 0
  FOR 1m
  LABELS { severity="page" }

{job="etcd",instance="A"}  0.0
{job="etcd",instance="B"}  0.0

{job="etcd",alertname="EtcdNoLeader",severity="page",instance="A"}
{job="etcd",alertname="EtcdNoLeader",severity="page",instance="B"}

Slide 25

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{}  534

Slide 26

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{}  534

WRONG: an absolute-threshold alerting rule needs constant tuning as traffic changes

Slide 27

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{}  534

[chart: traffic changes over days]

Slide 28

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{}  534

[chart: traffic changes over months]

Slide 29

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{}  534

[chart: traffic when you release awesome feature X]

Slide 30

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m]))
       / sum(rate(requests_total[5m])) > 0.01

{}  1.8354

Slide 31

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m]))
       / sum(rate(requests_total[5m])) > 0.01

{}  1.8354

Slide 32

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m]))
       / sum(rate(requests_total[5m])) > 0.01

{}  1.8354

WRONG: no dimensionality in the result; loss of detail and signal cancellation

Slide 33

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m]))
       / sum(rate(requests_total[5m])) > 0.01

{}  1.8354

[chart: a high-error/low-traffic series and a low-error/high-traffic series cancel out in the total sum]

Slide 34

ALERT HighErrorRate
  IF sum by(instance, path) (rate(request_errors_total[5m]))
       / sum by(instance, path) (rate(requests_total[5m])) > 0.01

{instance="web-2", path="/api/comments"}  0.02435
{instance="web-1", path="/api/comments"}  0.01055
{instance="web-2", path="/api/profile"}   0.34124

Slide 35

ALERT HighErrorRate
  IF sum by(instance, path) (rate(request_errors_total[5m]))
       / sum by(instance, path) (rate(requests_total[5m])) > 0.01

{instance="web-2", path="/api/v1/comments"}  0.022435
...

WRONG: wrong dimensions; alerting per instance pages on the very dimension that provides fault tolerance and should be aggregated away

Slide 36

ALERT HighErrorRate
  IF sum by(instance, path) (rate(request_errors_total[5m]))
       / sum by(instance, path) (rate(requests_total[5m])) > 0.01

{instance="web-2", path="/api/v1/comments"}  0.02435
...

[chart: a single failing instance vs. healthy instances 2..1000]

Slide 37

ALERT HighErrorRate
  IF sum without(instance) (rate(request_errors_total[5m]))
       / sum without(instance) (rate(requests_total[5m])) > 0.01

{method="GET", path="/api/v1/comments"}   0.02435
{method="POST", path="/api/v1/comments"}  0.015
{method="POST", path="/api/v1/profile"}   0.34124

Slide 38

ALERT DiskWillFillIn4Hours
  IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
  FOR 5m
  ...

[chart: free space over the last 1h, extrapolated linearly 4h ahead, crossing 0]

Slide 39

ALERT DiskWillFillIn4Hours
  IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
  FOR 5m
  ANNOTATIONS {
    summary = "device filling up",
    description = "{{$labels.device}} mounted on {{$labels.mountpoint}} on {{$labels.instance}} will fill up within 4 hours."
  }

Slide 40

Alertmanager
Aggregate, deduplicate, and route alerts

Slide 41

[diagram: Service Discovery (Kubernetes, AWS, Consul, custom...) tells Prometheus which Targets to scrape; Prometheus sends alerts to the Alertmanager, which notifies Email, Slack, PagerDuty, OpsGenie, ...]

Slide 42

[diagram: many alerting rules, each firing raw alerts]

04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/profile, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighErrorRate, service="X", zone="eu-west", path=/user/settings, method=POST
04:12 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=GET
04:13 hey, HighLatency, service="X", zone="eu-west", path=/index, method=POST
04:13 hey, CacheServerSlow, service="X", zone="eu-west", path=/user/profile, method=POST
...
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/comments, method=GET
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=POST

Slide 43

[diagram: alerting rules send their alerts to the Alertmanager, which notifies Chat, JIRA, PagerDuty, ...]

You have 15 alerts for Service X in zone eu-west
  3x HighLatency
  10x HighErrorRate
  2x CacheServerSlow
Individual alerts: ...
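
A grouped notification like this comes out of the Alertmanager's routing tree. A minimal configuration sketch, assuming a single receiver named team-x-pager and grouping by service and zone (the receiver name and timings are illustrative, not from the talk):

route:
  group_by: ['service', 'zone']
  group_wait: 30s       # collect alerts of a new group briefly before the first notification
  group_interval: 5m    # batch further alerts that join an existing group
  repeat_interval: 4h   # re-notify while the group keeps firing
  receiver: team-x-pager

receivers:
- name: team-x-pager
  pagerduty_configs:
  - service_key: <your-pagerduty-key>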

Slide 44

Inhibition

{alertname="LatencyHigh", severity="page", ..., zone="eu-west"}
...
{alertname="LatencyHigh", severity="page", ..., zone="eu-west"}
{alertname="ErrorsHigh", severity="page", ..., zone="eu-west"}
...
{alertname="ServiceDown", severity="page", ..., zone="eu-west"}

{alertname="DatacenterOnFire", severity="huge-page", zone="eu-west"}
  if active, mute everything else in the same zone
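
This muting maps to an Alertmanager inhibit rule. A sketch of the corresponding configuration (label values taken from the example above, the structure is standard Alertmanager config):

inhibit_rules:
- source_match:
    alertname: DatacenterOnFire
  target_match:
    severity: page
  equal: ['zone']   # while DatacenterOnFire fires, mute page-level alerts in the same zone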

Slide 45

Anomaly Detection

Slide 46

Practical Example 1

job:requests:rate5m =
  sum by(job) (rate(requests_total[5m]))

job:requests:holt_winters_rate1h =
  holt_winters(job:requests:rate5m[1h], 0.6, 0.4)

Slide 47

Practical Example 1

ALERT AbnormalTraffic
  IF abs(
       job:requests:rate5m
     - job:requests:holt_winters_rate1h offset 7d
     ) > 0.2 * job:requests:holt_winters_rate1h offset 7d
  FOR 10m
  ...

Slide 48

Practical Example 2

  instance:latency_seconds:mean5m
> on (job) group_left()
  (
      avg by (job) (instance:latency_seconds:mean5m)
    + on (job)
      2 * stddev by (job) (instance:latency_seconds:mean5m)
  )

Slide 49

Practical Example 2

(
    instance:latency_seconds:mean5m
  > on (job) group_left()
    (
        avg by (job) (instance:latency_seconds:mean5m)
      + on (job)
        2 * stddev by (job) (instance:latency_seconds:mean5m)
    )
)
> on (job) group_left()
  1.2 * avg by (job) (instance:latency_seconds:mean5m)

Slide 50

Practical Example 2

(
    instance:latency_seconds:mean5m
  > on (job) group_left()
    (
        avg by (job) (instance:latency_seconds:mean5m)
      + on (job)
        2 * stddev by (job) (instance:latency_seconds:mean5m)
    )
)
> on (job) group_left()
  1.2 * avg by (job) (instance:latency_seconds:mean5m)
and on (job)
  avg by (job) (instance:latency_seconds_count:rate5m) > 1

Slide 51

Self Healing

Slide 52

[diagram: Prometheus scrapes a node, an alert fires and is sent to the Alertmanager, which notifies a webhook (wh) that takes corrective action on the node]
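
One way to wire this up is the Alertmanager's generic webhook receiver: firing alerts are POSTed as JSON (labels and annotations included) to an HTTP endpoint, and a small service behind that URL performs the corrective action. A sketch under those assumptions, with the route and remediation endpoint being hypothetical:

route:
  receiver: team-x-pager          # default receiver, as in the earlier sketch
  routes:
  - match:
      alertname: DiskWillFillIn4Hours
    receiver: auto-remediation

receivers:
- name: auto-remediation
  webhook_configs:
  # Alertmanager POSTs the firing alerts to this URL; the service behind it reads
  # the alert labels (instance, device, ...) and triggers a cleanup on that node.
  - url: http://remediator.example.com/clean-disk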

Slide 53

Conclusion

- Symptom-based pages + cause-based warnings provide good coverage and insight into service availability
- Design alerts that are adaptive to change, preserve as many dimensions as possible, and aggregate away dimensions of fault tolerance
- Use linear prediction for capacity planning and saturation detection
- Advanced alerting expressions allow for well-scoped and practical anomaly detection
- Raw alerts are not meant for human consumption
- The Alertmanager aggregates, silences, and routes groups of alerts as meaningful notifications

Slide 54

Join us! careers: coreos.com/careers (now in Berlin!)