Monitoring at Hellofresh
By Rafael Jesus
~ Prometheus Meetup - Berlin - 2019 ~
Slide 2
HelloFresh
Slide 3
Agenda
Why we care about monitoring
What we monitor
How we monitor
What comes next
Slide 4
Why monitoring
Debugging certain bugs might take months, even years, without proper monitoring.
Has the API success rate slightly dropped over time?
What if a portion of customers are getting logged out?
What if the CDN caches 4xx/5xx response codes for 1m due to a misconfig?
What if users from Japan/Australia complain about the website home page being slow?
Slide 5
Why monitoring
In the event of an incident, engineers need to rely on data to drive decisions; otherwise they will always depend exclusively on the opinions and feelings of the most senior engineer around.
You don't have software ownership if you don't monitor it.
Slide 6
Why monitoring
Slide 7
What we monitor
Infrastructure
Reverse Proxies & Edge Services
Storage Systems
MongoDB, RDS, Elasticsearch, Redis
Observability Services
Jaeger, Graylog, Prometheus
Kubernetes Clusters
k8s nodes, CoreDNS, pods
Slide 8
Infrastructure Engineering
CoreDNS
Slide 9
Infrastructure Engineering
CoreDNS
Slide 10
What we monitor
Product Engineering
APIs and AMQP consumers
Slide 11
How we monitor
Slide 12
How we monitor
Our Monitoring Stack
Prometheus
Prometheus Operator
Thanos
Grafana
Legacy
Graphite & StatsD
Slide 13
How we monitor
Prometheus Operator
Makes running Prometheus on top of Kubernetes as easy as possible.
Introduces additional custom resources in Kubernetes: Prometheus, ServiceMonitor, Alertmanager and PrometheusRule.
Example
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
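To show how scrape targets get wired up, here is a minimal ServiceMonitor sketch; the service name, labels and port are illustrative assumptions, not taken from the slides:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout                 # hypothetical service
  labels:
    team: product
spec:
  selector:
    matchLabels:
      app: checkout              # assumes the Service carries this label
  endpoints:
    - port: web                  # named Service port exposing /metrics
      interval: 30s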
Slide 14
How we monitor
Prometheus Operator
Slide 15
How we monitor
Prometheus Operator Helm Chart
Initial installation and configuration of:
Prometheus servers
Alertmanager
Grafana
Host node_exporter
kube-state-metrics
Default set of Alerts and Dashboards
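A sketch of chart values enabling the components listed above; the key names follow the stable/prometheus-operator chart of that era and may differ between chart versions, so treat them as assumptions:

# values.yaml (illustrative)
alertmanager:
  enabled: true
grafana:
  enabled: true
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
defaultRules:
  create: true                   # default set of alerts and recording rules
prometheus:
  prometheusSpec:
    retention: 24h               # short local retention; long-term storage is Thanos' job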
Slide 16
How we monitor
Thanos
Global View
Long term storage
Why?
We had 1 year of retention on Graphite
Going global with Prometheus federation might be hard for us
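A sketch of how the sidecar could be wired to object storage for long-term retention (flag names per Thanos v0.x); the bucket and endpoint are placeholders:

# Thanos sidecar container (args only)
args:
  - sidecar
  - --prometheus.url=http://localhost:9090
  - --tsdb.path=/prometheus
  - --objstore.config-file=/etc/thanos/objstore.yml

# objstore.yml
type: S3
config:
  bucket: hf-thanos-metrics              # hypothetical bucket
  endpoint: s3.eu-west-1.amazonaws.com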
Slide 17
How we monitor
Without Thanos
Slide 18
How we monitor
Without Thanos
Federation allows you to have a global Prometheus that pulls aggregated metrics from your datacenter Prometheus servers
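A sketch of a federation scrape job on the global Prometheus; the match[] selector and target addresses are illustrative:

scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.+"}'              # pull only aggregated (recording rule) series
    static_configs:
      - targets:
          - prometheus-dc1.example.com:9090   # hypothetical datacenter Prometheus
          - prometheus-dc2.example.com:9090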
Slide 19
How we monitor
Slide 20
How we monitor
Thanos
Global View
Slide 21
How we monitor
Thanos
Storing
Slide 22
How we monitor
Thanos
Querying Sidecar
Slide 23
How we monitor
Thanos
Querying Store
Slide 24
How we monitor
Thanos
Compactor
Slide 25
How we monitor
How we started
Prometheus Operator Helm chart
Full k8s cluster monitoring
Golden Signals w/ Ingress and cAdvisor metrics
Thanos for global view & LTS
The Compactor doesn't like empty blocks? OK, we removed them all from S3
It couldn't compact Istio metrics? OK, we go without the Compactor then
Slide 26
How we monitor
The need for the Compactor
Without downsampling, long-term queries can pull huge amounts of data, affecting overall Store availability.
Product engineering services need historical data for data-driven decisions.
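A sketch of the Compactor with its downsampling and retention flags (Thanos v0.x flag names); the retention windows are illustrative, not HelloFresh's actual values:

# Thanos compact container (args only)
args:
  - compact
  - --wait                                   # run continuously instead of one-shot
  - --data-dir=/var/thanos/compact
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d           # raw samples
  - --retention.resolution-5m=180d           # 5m downsampled samples
  - --retention.resolution-1h=1y             # 1h downsampled samples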
Slide 27
How we monitor
Compactor + Ruler
Long term recording rules for calculating SLOs.
Slide 28
How we monitor
Compactor Gotchas
There's no way to remove individual metrics with high cardinality or with too many time series from the S3 bucket
If the object store has metrics with high cardinality or too many series and the Compactor fails, purging data might be the only solution
Slide 29
How we monitor
msg="critical error detected; halting"
err="overlaps found while gathering blocks.
Slide 30
How we monitor
Good luck purging teams' data
Just announce it to the Engineering Teams (~300 people) and run. Running away is inevitable.
Slide 31
How we monitor
Adopting SRE practices to operate
Prometheus
Monitoring Prometheus
Defining Reliability Targets
Being On-Call
Managing Overload
Slide 32
How we monitor
Monitoring Prometheus
prometheus_tsdb_head_series
Slide 33
How we monitor
Monitoring Prometheus
prometheus_tsdb_head_samples_appended_total
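A hedged alert sketch on this metric (not from the slides); it reuses the job="kube-prometheus" selector that appears later in the deck, while the threshold and duration are illustrative:

- alert: PrometheusNotIngestingSamples
  expr: rate(prometheus_tsdb_head_samples_appended_total{job="kube-prometheus"}[5m]) <= 0
  for: 10m
  labels:
    severity: page
  annotations:
    summary: 'Prometheus is not ingesting samples'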
Slide 34
How we monitor
Monitoring Prometheus
avg(avg_over_time(up{job=~"$job"}[5m])) by (pod)
Slide 35
How we monitor
Adopting SRE practices to operate
Prometheus
Defining querier SLO rule
- record: thanos_query:availability_slo:rate2w
  expr: |
    sum(thanos_query:successful_queries:rate2w)
    /
    sum(thanos_query:total_queries:rate2w)
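The two-week inputs above are not shown on the slides; a possible sketch (not the authors' actual rules) reuses the 5m rates referenced by the Thanos alert later in the deck, and assumes matching labels between the two series:

- record: thanos_query:total_queries:rate2w
  expr: |
    avg_over_time(thanos_query:grpc_queries:rate5m[2w])
- record: thanos_query:successful_queries:rate2w
  expr: |
    avg_over_time(thanos_query:grpc_queries:rate5m[2w])
    -
    avg_over_time(thanos_query:grpc_queries_errors:rate5m[2w])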
Slide 36
How we monitor
Adopting SRE practices to operate
Prometheus
Defining Prometheus SLO rule
- record: kube_prometheus:uptime:max_avg5m
  expr: |
    max(
      avg_over_time(
        up{job="kube-prometheus"}[5m]
      )
    )
Slide 37
How we monitor
Adopting SRE practices to operate
Prometheus
Defining Prometheus SLO rule
- record: kube_prometheus:uptime:avg2w
  expr: |
    avg_over_time(
      kube_prometheus:uptime:max_avg5m[2w]
    )
Slide 38
How we monitor
Adopting SRE practices to operate
Prometheus
Composing SLOs
- record: prometheus:slo:rate2w
  expr: |
    (
      kube_prometheus:uptime:avg2w +
      prox_prometheus:uptime:avg2w +
      thanos_query:availability_slo:rate2w
    ) / 3 * 100
Slide 39
How we monitor
Adopting SRE practices to operate
Prometheus
Alerting over SLO violations
rules:
  - alert: PrometheusQueriesSLOViolation
    annotations:
      summary: 'Prometheus queries availability SLO is affected'
      description: 'Prometheus has query availability SLO violatio
      runbook_url: https://github.com/hellofresh/runbooks/blob/mas
    expr: prometheus:slo:rate2w < 99.9
    labels:
      severity: page
      slack: platform-alerts
Slide 40
How we monitor
Adopting SRE practices to operate
Prometheus
Alerting when SLI is in danger
rules:
  - alert: TooManyThanosQueriesErrors
    annotations:
      summary: Percentage of gRPC errors is too high on Thanos
      description: 'Thanos Query has too many gRPC queries errors
      runbook_url: https://github.com/hellofresh/runbooks/blob/mas
    expr: |
      sum(thanos_query:grpc_queries_errors:rate5m)
      /
      sum(thanos_query:grpc_queries:rate5m) > 0.05
Slide 41
How we monitor
Adopting SRE practices to operate
Prometheus
Being On-Call
Slide 42
How we monitor
Adopting SRE practices to operate
Prometheus
Being On-Call
Slide 43
How we monitor
Managing Overload
Detecting a Problem
Check if the total number of time series has increased
prometheus_tsdb_head_series
Check if Prometheus is ingesting metrics
prometheus_tsdb_head_samples_appended_total
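Hedged example queries for those two checks; the time windows are illustrative:

# Growth in the number of in-memory series over the last hour
delta(prometheus_tsdb_head_series[1h])

# Ingestion rate; a drop towards zero means Prometheus stopped appending samples
rate(prometheus_tsdb_head_samples_appended_total[5m])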
Slide 44
How we monitor
Managing Overload
Find Expensive Metrics, Rules and Targets
Find rules that take too long to evaluate
prometheus_rule_group_iterations_missed_total
Find metrics that have the most time series
topk(10, count by (__name__, job) ({__name__=~".+"}))
Find expensive recording rules
prometheus_rule_group_last_duration_seconds
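For the targets side, a small sketch using the per-scrape series Prometheus records automatically:

# Targets returning the most samples per scrape
topk(10, scrape_samples_scraped)

# Slowest scrapes
topk(10, scrape_duration_seconds)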
Slide 45
How we monitor
Managing Overload
Reducing Load
Drop labels and metrics that introduce high cardinality or that you don't use
Increase scrape_interval and evaluation_interval
sample_limit causes a scrape to fail if more than the given number of time series is returned
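A sketch of those knobs in a Prometheus scrape configuration; the job name and limits are illustrative:

global:
  scrape_interval: 60s
  evaluation_interval: 60s

scrape_configs:
  - job_name: my-service               # hypothetical job
    scrape_interval: 60s
    sample_limit: 10000                # fail the scrape above 10k series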
Slide 46
Managing Overload
nginx_ingress_controller_request_duration_seconds_bucket
Up to 50k time series
Product Eng. services rely on it for SLOs
Queries towards it were expensive on both Prom and Thanos
metric_relabel_configs:
  - separator: ;
    regex: (controller_namespace|controller_pod|endpoint|exporte
    replacement: $1
    action: labeldrop
How we monitor
Results
From 2h 22m of downtime/unavailability weekly
Slide 49
How we monitor
Results
To 18.1s of downtime/unavailability weekly
Slide 50
How we monitor
Instrumenting Applications
Prometheus SDKs
OpenCensus for more observability
Automating away metrics for PHP services
Parsing NGINX logs with OpenResty
RED metrics out of the box
Common dashboard for everyone
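A hedged sketch of the RED queries such a common dashboard could be built on, using the NGINX Ingress metrics mentioned elsewhere in the deck; the ingress grouping label is an assumption:

# Rate: requests per second per ingress
sum by (ingress) (rate(nginx_ingress_controller_requests[5m]))

# Errors: ratio of 5xx responses
sum by (ingress) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
  /
sum by (ingress) (rate(nginx_ingress_controller_requests[5m]))

# Duration: 95th percentile request latency
histogram_quantile(0.95,
  sum by (le, ingress) (rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])))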
Slide 51
How we monitor
Empowering Engineering Teams
Defining alerts and recording rules in the app source code
Provisioned Grafana dashboards
Exposing common dashboards for everyone
The 4 golden signals FTW
Slide 52
How we monitor
Empowering Engineering Teams
On applications source code
provisionDashboards:
  enabled: true
  dashboardLabel: grafana_dashboard
prometheusRules:
  - name: svc-recording-rules
    rules: ...
  - name: svc-alerting-rules
    rules: ...
How we monitor
Empowering Engineering Teams
{{- if .Values.provisionDashboards }}
{{- if .Values.provisionDashboards.enabled }}
{{- $root := . }}
{{- range $path, $bytes := .Files.Glob "dashboards/*" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  ...
  labels:
    {{ $.Values.provisionDashboards.dashboardLabel }}: "1"
data:
  {{ base $path }}: |-
{{ $root.Files.Get $path | indent 4 }}
{{- end }}
{{- end }}
{{- end }}
Slide 55
Empowering Engineering Teams
NGINX Ingress
Slide 56
Empowering Engineering Teams
NGINX Ingress
Slide 57
Infrastructure Engineering
AWS RDS Metrics
Slide 58
How we alert
At the Edge
Symptom-Based Alerts
Write Runbooks
Slide 59
How we alert
At the Edge
As close to the user as possible. If one backend goes away, you still have metrics to alert on.
Slide 60
How we alert
Symptom
Not only 5xx; high rates of 4xx should also page (see the sketch below)
When SLO is in danger
On user journeys (soon)
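A hedged sketch of such a symptom-based alert at the edge, using the NGINX Ingress request counter; the threshold, window and labels are illustrative:

- alert: HighEdgeErrorRate
  expr: |
    sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m]))
      /
    sum(rate(nginx_ingress_controller_requests[5m])) > 0.05
  for: 10m
  labels:
    severity: page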
Slide 61
How we alert
Sharing responsibility with runbooks
Slide 62
What comes next?
Run chaos experiments on Prometheus
Experiment 1: Query Availability
Attack: Thanos Sidecar Pod Deletion
Scope: Multiple pods
Expected Results:
Rate of good Thanos Querier gRPC queries is not affected
TooManyQueriesErrors alert should not fire
Slide 63
What comes next?
More vertical sharding
Storage Systems
Observability Services
Prometheus per k8s namespace
Istio for out of the box metrics and tracing
Slide 64
What comes next?
Query Federation
Data engineering services don't run on k8s