without proper monitoring. Has the API success rate slightly dropped over time? What if a portion of customers are getting logged out? What if the CDN caches 4xx/5xx response codes for 1 minute due to a misconfig? What if users from Japan or Australia complain about the website home page being slow?
on data to drive decisions; otherwise they will always have to depend exclusively on the opinions and feelings of the most senior engineer around. You don't have software ownership if you don't monitor it.
on top of Kubernetes as easy as possible. It introduces additional resources (CRDs) in Kubernetes: Prometheus, ServiceMonitor, Alertmanager and PrometheusRule. Example:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
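A minimal sketch of such a Prometheus resource; the name, namespace, replica count and serviceMonitorSelector label below are assumptions for illustration, not taken from the original deployment:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                      # hypothetical name
  namespace: monitoring          # hypothetical namespace
spec:
  replicas: 2                    # HA pair, assumption
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: platform             # hypothetical selector label
  resources:
    requests:
      memory: 400Mi
The operator then deploys Prometheus from this spec and wires in every ServiceMonitor matching the selector.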
Full k8s cluster monitoring. Golden Signals with Ingress and cAdvisor metrics. Thanos for a global view and long-term storage (LTS). The compactor doesn't like empty blocks? OK, we've removed them all from S3. It couldn't compact Istio metrics? OK, we go without the compactor then.
term queries can bring back huge amounts of data, affecting overall store availability. Product engineering services need historical data for data-driven decisions.
metrics with high cardinality or with too many time series from the S3 bucket. If the object store has metrics with high cardinality or too many series and the compactor fails, purging that data might be the only solution.
when the SLI is in danger:
rules:
- alert: TooManyThanosQueriesErrors
  annotations:
    summary: Percentage of gRPC errors is too high on Thanos
    description: 'Thanos Query has too many gRPC query errors'
    runbook_url: https://github.com/hellofresh/runbooks/blob/mas
  expr: |
    sum(thanos_query:grpc_queries_errors:rate5m)
      /
    sum(thanos_query:grpc_queries:rate5m)
    > 0.05
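The alert relies on two recording rules. A minimal sketch of what they could look like, assuming the underlying counter is the standard grpc_client_handled_total exposed by Thanos Query and a hypothetical job label of thanos-query:
groups:
- name: thanos-query.rules
  rules:
  - record: thanos_query:grpc_queries:rate5m
    expr: sum(rate(grpc_client_handled_total{job="thanos-query"}[5m]))
  - record: thanos_query:grpc_queries_errors:rate5m
    expr: sum(rate(grpc_client_handled_total{job="thanos-query", grpc_code!="OK"}[5m]))
Precomputing the rates keeps the alert expression cheap to evaluate on every rule iteration.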
Targets
Find rules that take too long to evaluate: prometheus_rule_group_iterations_missed_total
Find metrics that have the most time series: topk(10, count by (__name__, job) ({__name__=~".+"}))
Find expensive recording rules: prometheus_rule_group_last_duration_seconds
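For example, a hedged way to surface slow rule groups with these metrics, assuming the standard prometheus_rule_group_* gauges and counters are scraped:
# Rule groups that miss iterations (evaluation slower than the interval)
rate(prometheus_rule_group_iterations_missed_total[5m]) > 0

# Rule groups whose last evaluation took more than half of their interval
prometheus_rule_group_last_duration_seconds
  > 0.5 * prometheus_rule_group_interval_seconds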
which introduce high cardinality or that you don't use. Increase scrape_interval and evaluation_interval. sample_limit causes a scrape to fail if more than the given number of time series is returned.
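A minimal scrape_config sketch illustrating both knobs; the job name, target and concrete values are assumptions:
global:
  scrape_interval: 60s        # raised from the usual 15s/30s default to reduce load
  evaluation_interval: 60s

scrape_configs:
  - job_name: my-service      # hypothetical job
    sample_limit: 10000       # the scrape fails if more series than this are returned
    static_configs:
      - targets: ['my-service:9090']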
Eng. services rely on it for SLOs; queries towards it go through both Prom and Thanos.
metric_relabel_configs:
- separator: ;
  regex: (controller_namespace|controller_pod|endpoint|exporte
  replacement: $1
  action: labeldrop
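With the Prometheus Operator, the same label drop can be expressed on the ServiceMonitor instead. A sketch assuming a hypothetical ingress-nginx ServiceMonitor, with the regex shortened for illustration:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx                          # hypothetical name
  namespace: monitoring                        # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx    # hypothetical selector
  endpoints:
  - port: metrics
    metricRelabelings:
    - action: labeldrop
      regex: (controller_namespace|controller_pod|endpoint)
Dropping these labels before ingestion is what actually reduces the series count; relabeling at query time would not help.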
Query Availability
Attack: Thanos Sidecar pod deletion
Scope: multiple pods
Expected results: the rate of good Thanos Querier gRPC queries is not affected; the TooManyQueriesErrors alert should not re…