Monitoring at Hellofresh
By Rafael Jesus
~ Prometheus Meetup - Berlin - 2019 ~
Slide 2
HelloFresh
Slide 3
Agenda
Why we care about monitoring
What we monitor
How we monitor
What comes next
Slide 4
Why monitoring
Debugging certain bugs might take months, even years, without proper monitoring.
Has the API success rate slightly dropped over time?
What if a portion of customers are getting logged out?
What if the CDN caches 4xx/5xx response codes for 1m due to a misconfig?
What if users from Japan/Australia complain about the website home page being slow?
Slide 5
Why monitoring
In the event of an incident, engineers need to rely on data to drive decisions; otherwise they will always depend exclusively on the opinions and feelings of the most senior engineer around.
You don't have software ownership if you don't monitor it.
Slide 6
Why monitoring
Slide 7
What we monitor
Infrastructure
Reverse Proxies & Edge Services
Storage Systems
MongoDB, RDS, Elasticsearch, Redis
Observability Services
Jaeger, Graylog, Prometheus
Kubernetes Clusters
k8s nodes, CoreDNS, pods
Slide 8
Infrastructure Engineering
CoreDNS
Slide 9
Infrastructure Engineering
CoreDNS
Slide 10
What we monitor
Product Engineering
APIs and AMQP consumers
Slide 11
How we monitor
Slide 12
How we monitor
Our Monitoring Stack
Prometheus
Prometheus Operator
Thanos
Grafana
Legacy
Graphite & StatsD
Slide 13
How we monitor
Prometheus Operator
Makes running Prometheus on top of Kubernetes as easy as possible.
Introduces additional custom resources in Kubernetes: Prometheus, ServiceMonitor, Alertmanager and PrometheusRule.
Example
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
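To show how scrape targets get wired up, here is a minimal ServiceMonitor sketch; the service name, labels and port are illustrative assumptions, not taken from the slides:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout                 # hypothetical service
  labels:
    team: product
spec:
  selector:
    matchLabels:
      app: checkout              # assumes the Service carries this label
  endpoints:
    - port: web                  # named Service port exposing /metrics
      interval: 30s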
Slide 14
How we monitor
Prometheus Operator
Slide 15
How we monitor
Prometheus Operator Helm Chart
Initial installation and configuration of:
Prometheus servers
Alertmanager
Grafana
Host node_exporter
kube-state-metrics
Default set of Alerts and Dashboards
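A sketch of chart values enabling the components listed above; the key names follow the stable/prometheus-operator chart of that era and may differ between chart versions, so treat them as assumptions:

# values.yaml (illustrative)
alertmanager:
  enabled: true
grafana:
  enabled: true
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
defaultRules:
  create: true                   # default set of alerts and recording rules
prometheus:
  prometheusSpec:
    retention: 24h               # short local retention; long-term storage is Thanos' job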
Slide 16
How we monitor
Thanos
Global View
Long term storage
Why?
We had 1 year of retention on Graphite
Going global with Prometheus federation might be hard for us
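A sketch of how the sidecar could be wired to object storage for long-term retention (flag names per Thanos v0.x); the bucket and endpoint are placeholders:

# Thanos sidecar container (args only)
args:
  - sidecar
  - --prometheus.url=http://localhost:9090
  - --tsdb.path=/prometheus
  - --objstore.config-file=/etc/thanos/objstore.yml

# objstore.yml
type: S3
config:
  bucket: hf-thanos-metrics              # hypothetical bucket
  endpoint: s3.eu-west-1.amazonaws.com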
Slide 17
How we monitor
Without Thanos
Slide 18
How we monitor
Without Thanos
Federation allows you to have a global Prometheus that pulls aggregated metrics from your datacenter Prometheus servers
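A sketch of a federation scrape job on the global Prometheus; the match[] selector and target addresses are illustrative:

scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.+"}'              # pull only aggregated (recording rule) series
    static_configs:
      - targets:
          - prometheus-dc1.example.com:9090   # hypothetical datacenter Prometheus
          - prometheus-dc2.example.com:9090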
Slide 19
How we monitor
Slide 20
How we monitor
Thanos
Global View
Slide 21
How we monitor
Thanos
Storing
Slide 22
How we monitor
Thanos
Querying Sidecar
Slide 23
How we monitor
Thanos
Querying Store
Slide 24
How we monitor
Thanos
Compactor
Slide 25
How we monitor
How we started
Prometheus Operator Helm chart
Full k8s cluster monitoring
Golden Signals w/ Ingress and cAdvisor metrics
Thanos for global view & LTS
The Compactor doesn't like empty blocks? OK, we removed them all from S3
It couldn't compact Istio metrics? OK, we go without the Compactor then
Slide 26
How we monitor
The need for the Compactor
Without downsampling, long-term queries can pull huge amounts of data, affecting overall Store availability.
Product engineering services need historical data for data-driven decisions.
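A sketch of the Compactor with its downsampling and retention flags (Thanos v0.x flag names); the retention windows are illustrative, not HelloFresh's actual values:

# Thanos compact container (args only)
args:
  - compact
  - --wait                                   # run continuously instead of one-shot
  - --data-dir=/var/thanos/compact
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d           # raw samples
  - --retention.resolution-5m=180d           # 5m downsampled samples
  - --retention.resolution-1h=1y             # 1h downsampled samples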
Slide 27
How we monitor
Compactor + Ruler
Long term recording rules for calculating SLOs.
Slide 28
How we monitor
Compactor Gotchas
There's no way to remove individual metrics with high cardinality or with too many time series from the S3 bucket
If the object store has metrics with high cardinality or too many series and the Compactor fails, purging data might be the only solution
Slide 29
How we monitor
msg="critical error detected; halting"
err="overlaps found while gathering blocks.
Slide 30
How we monitor
Good luck purging teams' data
Just announce it to the Engineering Teams (~300 people) and run. Running away is inevitable.
Slide 31
How we monitor
Adopting SRE practices to operate
Prometheus
Monitoring Prometheus
Defining Reliability Targets
Being On-Call
Managing Overload
Slide 32
How we monitor
Monitoring Prometheus
prometheus_tsdb_head_series
Slide 33
How we monitor
Monitoring Prometheus
prometheus_tsdb_head_samples_appended_total
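A hedged alert sketch on this metric (not from the slides); it reuses the job="kube-prometheus" selector that appears later in the deck, while the threshold and duration are illustrative:

- alert: PrometheusNotIngestingSamples
  expr: rate(prometheus_tsdb_head_samples_appended_total{job="kube-prometheus"}[5m]) <= 0
  for: 10m
  labels:
    severity: page
  annotations:
    summary: 'Prometheus is not ingesting samples'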
Slide 34
How we monitor
Monitoring Prometheus
avg(avg_over_time(up{job=~"$job"}[5m])) by (pod)
Slide 35
How we monitor
Adopting SRE practices to operate
Prometheus
Defining querier SLO rule
- record: thanos_query:availability_slo:rate2w
  expr: |
    sum(thanos_query:successful_queries:rate2w)
    /
    sum(thanos_query:total_queries:rate2w)
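The two-week inputs above are not shown on the slides; a possible sketch (not the authors' actual rules) reuses the 5m rates referenced by the Thanos alert later in the deck, and assumes matching labels between the two series:

- record: thanos_query:total_queries:rate2w
  expr: |
    avg_over_time(thanos_query:grpc_queries:rate5m[2w])
- record: thanos_query:successful_queries:rate2w
  expr: |
    avg_over_time(thanos_query:grpc_queries:rate5m[2w])
    -
    avg_over_time(thanos_query:grpc_queries_errors:rate5m[2w])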
Slide 36
How we monitor
Adopting SRE practices to operate
Prometheus
Defining Prometheus SLO rule
- record: kube_prometheus:uptime:max_avg5m
  expr: |
    max(
      avg_over_time(
        up{job="kube-prometheus"}[5m]
      )
    )
Slide 37
How we monitor
Adopting SRE practices to operate
Prometheus
Defining Prometheus SLO rule
- record: kube_prometheus:uptime:avg2w
  expr: |
    avg_over_time(
      kube_prometheus:uptime:max_avg5m[2w]
    )
Slide 38
How we monitor
Adopting SRE practices to operate
Prometheus
Composing SLOs
- record: prometheus:slo:rate2w
  expr: |
    (
      kube_prometheus:uptime:avg2w +
      prox_prometheus:uptime:avg2w +
      thanos_query:availability_slo:rate2w
    ) / 3 * 100
Slide 39
How we monitor
Adopting SRE practices to operate
Prometheus
Alerting over SLO violations
rules:
  - alert: PrometheusQueriesSLOViolation
    annotations:
      summary: 'Prometheus queries availability SLO is affected'
      description: 'Prometheus has query availability SLO violatio
      runbook_url: https://github.com/hellofresh/runbooks/blob/mas
    expr: prometheus:slo:rate2w < 99.9
    labels:
      severity: page
      slack: platform-alerts
Slide 40
How we monitor
Adopting SRE practices to operate
Prometheus
Alerting when SLI is in danger
rules:
  - alert: TooManyThanosQueriesErrors
    annotations:
      summary: Percentage of gRPC errors is too high on Thanos
      description: 'Thanos Query has too many gRPC queries errors
      runbook_url: https://github.com/hellofresh/runbooks/blob/mas
    expr: |
      sum(thanos_query:grpc_queries_errors:rate5m)
      /
      sum(thanos_query:grpc_queries:rate5m) > 0.05
Slide 41
How we monitor
Adopting SRE practices to operate
Prometheus
Being On-Call
Slide 42
How we monitor
Adopting SRE practices to operate
Prometheus
Being On-Call
Slide 43
How we monitor
Managing Overload
Detecting a Problem
Check if the total number of time series has increased
prometheus_tsdb_head_series
Check if Prometheus is ingesting metrics
prometheus_tsdb_head_samples_appended_total
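Hedged example queries for those two checks; the time windows are illustrative:

# Growth in the number of in-memory series over the last hour
delta(prometheus_tsdb_head_series[1h])

# Ingestion rate; a drop towards zero means Prometheus stopped appending samples
rate(prometheus_tsdb_head_samples_appended_total[5m])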
Slide 44
How we monitor
Managing Overload
Find Expensive Metrics, Rules and Targets
Find rules that take too long to evaluate
prometheus_rule_group_iterations_missed_total
Find metrics that have the most time series
topk(10, count by (__name__, job) ({__name__=~".+"}))
Find expensive recording rules
prometheus_rule_group_last_duration_seconds
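For the targets side, a small sketch using the per-scrape series Prometheus records automatically:

# Targets returning the most samples per scrape
topk(10, scrape_samples_scraped)

# Slowest scrapes
topk(10, scrape_duration_seconds)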
Slide 45
How we monitor
Managing Overload
Reducing Load
Drop labels and metrics that introduce high cardinality or that you don't use
Increase scrape_interval and evaluation_interval
sample_limit causes a scrape to fail if more than the given number of time series is returned
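A sketch of those knobs in a Prometheus scrape configuration; the job name and limits are illustrative:

global:
  scrape_interval: 60s
  evaluation_interval: 60s

scrape_configs:
  - job_name: my-service               # hypothetical job
    scrape_interval: 60s
    sample_limit: 10000                # fail the scrape above 10k series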
Slide 46
Managing Overload
nginx_ingress_controller_request_duration_seconds_bucket
Up to 50k time series
Product Eng. services rely on it for SLOs
Queries towards it were expensive on both Prom and Thanos
metric_relabel_configs:
  - separator: ;
    regex: (controller_namespace|controller_pod|endpoint|exporte
    replacement: $1
    action: labeldrop
How we monitor
Results
From 2h 22m of downtime/unavailability weekly
Slide 49
How we monitor
Results
To 18.1s of downtime/unavailability weekly
Slide 50
How we monitor
Instrumenting Applications
Prometheus SDKs
OpenCensus for more observability
Automating away metrics for PHP services
Parsing NGINX logs with OpenResty
RED metrics out of the box
Common dashboard for everyone
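A hedged sketch of the RED queries such a common dashboard could be built on, using the NGINX Ingress metrics mentioned elsewhere in the deck; the ingress grouping label is an assumption:

# Rate: requests per second per ingress
sum by (ingress) (rate(nginx_ingress_controller_requests[5m]))

# Errors: ratio of 5xx responses
sum by (ingress) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
  /
sum by (ingress) (rate(nginx_ingress_controller_requests[5m]))

# Duration: 95th percentile request latency
histogram_quantile(0.95,
  sum by (le, ingress) (rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])))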
Slide 51
How we monitor
Empowering Engineering Teams
Defining alerts and recording rules in the app source code
Provisioned Grafana dashboards
Exposing common dashboards for everyone
The 4 golden signals FTW
Slide 52
How we monitor
Empowering Engineering Teams
On applications source code
provisionDashboards:
  enabled: true
  dashboardLabel: grafana_dashboard
prometheusRules:
  - name: svc-recording-rules
    rules: ...
  - name: svc-alerting-rules
    rules: ...
How we monitor
Empowering Engineering Teams
{{- if .Values.provisionDashboards }}
{{- if .Values.provisionDashboards.enabled }}
{{- $root := . }}
{{- range $path, $bytes := .Files.Glob "dashboards/*" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  ...
  labels:
    {{ $.Values.provisionDashboards.dashboardLabel }}: "1"
data:
  {{ base $path }}: |-
{{ $root.Files.Get $path | indent 4 }}
{{- end }}
{{- end }}
{{- end }}
Slide 55
Empowering Engineering Teams
NGINX Ingress
Slide 56
Empowering Engineering Teams
NGINX Ingress
Slide 57
Infrastructure Engineering
AWS RDS Metrics
Slide 58
How we alert
At the Edge
Symptom-Based Alerts
Write Runbooks
Slide 59
How we alert
At the Edge
As close to the user as possible. If one backend goes away, you still have metrics to alert on.
Slide 60
How we alert
Symptom
Not only 5xx; high rates of 4xx should also page (see the sketch below)
When SLO is in danger
On user journeys (soon)
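A hedged sketch of such a symptom-based alert at the edge, using the NGINX Ingress request counter; the threshold, window and labels are illustrative:

- alert: HighEdgeErrorRate
  expr: |
    sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m]))
      /
    sum(rate(nginx_ingress_controller_requests[5m])) > 0.05
  for: 10m
  labels:
    severity: page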
Slide 61
How we alert
Sharing responsibility with runbooks
Slide 62
What comes next?
Run chaos experiments on Prometheus
Experiment 1: Query Availability
Attack: Thanos Sidecar Pod Deletion
Scope: Multiple pods
Expected Results:
Rate of good Thanos Querier gRPC queries is not affected
TooManyQueriesErrors alert should not fire
Slide 63
What comes next?
More vertical sharding
Storage Systems
Observability Services
Prometheus per k8s namespace
Istio for out of the box metrics and tracing
Slide 64
What comes next?
Query Federation
Data engineering services don't run on k8s