Monitoring at HelloFresh

Rafael Jesus

April 17, 2019

Transcript

  1. Why monitoring: Debugging certain bugs might take months, even years,
     without proper monitoring. Has the API success rate slightly dropped over
     time? What if a portion of customers are getting logged out? What if the
     CDN caches 4xx-5xx response codes for 1m due to a misconfig? What if users
     from Japan or Australia complain that the website home page is slow?
  2. Why monitoring: In the event of an incident, engineers need to rely on
     data to drive decisions; otherwise they will always depend exclusively on
     the opinions and feelings of the most senior engineer around. You don't
     have software ownership if you don't monitor it.
  3. What we monitor:
     Infrastructure: reverse proxies & edge services
     Storage systems: MongoDB, RDS, Elasticsearch, Redis
     Observability services: Jaeger, Graylog, Prometheus
     Kubernetes clusters: k8s nodes, CoreDNS, pods
  4. How we monitor: Prometheus Operator. It serves to make running Prometheus
     on top of Kubernetes as easy as possible and introduces additional custom
     resources in Kubernetes: Prometheus, ServiceMonitor, Alertmanager and
     PrometheusRule. Example:

       apiVersion: monitoring.coreos.com/v1
       kind: Prometheus
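     A fuller sketch of such a Prometheus resource is shown below; the name,
     namespace and selector labels are illustrative assumptions, not the
     actual HelloFresh configuration:

       apiVersion: monitoring.coreos.com/v1
       kind: Prometheus
       metadata:
         name: kube-prometheus        # hypothetical name
         namespace: monitoring        # hypothetical namespace
       spec:
         replicas: 2
         retention: 24h               # short local retention; Thanos handles LTS
         # Pick up every ServiceMonitor carrying this (illustrative) label
         serviceMonitorSelector:
           matchLabels:
             prometheus: kube-prometheus
         # Pick up PrometheusRule objects the same way
         ruleSelector:
           matchLabels:
             prometheus: kube-prometheus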
  5. How we monitor: Prometheus Operator Helm chart. Initial installation and
     configuration of: Prometheus servers, Alertmanager, Grafana, host
     node_exporter, kube-state-metrics, and a default set of alerts and
     dashboards.
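     For illustration, a values override along these lines toggles the bundled
     components; the keys follow the public prometheus-operator chart, and the
     concrete values are assumptions:

       defaultRules:
         create: true        # default set of alerts and recording rules
       grafana:
         enabled: true
       alertmanager:
         enabled: true
       nodeExporter:
         enabled: true       # host-level metrics
       kubeStateMetrics:
         enabled: true
       prometheus:
         prometheusSpec:
           retention: 24h    # illustrative; long-term storage lives in Thanos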
  6. How we monitor: Thanos for a global view and long-term storage. Why? We
     had 1 year of retention on Graphite, and going global with Prometheus
     federation might be hard for us.
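     The sidecars upload TSDB blocks to object storage through an objstore
     config; a minimal S3 sketch, with bucket and endpoint made up for
     illustration:

       type: S3
       config:
         bucket: example-thanos-metrics          # placeholder bucket
         endpoint: s3.eu-west-1.amazonaws.com    # placeholder endpoint
         # Credentials typically come from the pod's IAM role, so
         # access_key/secret_key are omitted here.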
  7. How we monitor: Without Thanos, federation allows you to have a global
     Prometheus that pulls aggregated metrics from your datacenter Prometheus
     servers.
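     For reference, federation on the global Prometheus typically looks like
     the sketch below; the job name, match expression and targets are
     illustrative only:

       scrape_configs:
         - job_name: 'federate'
           honor_labels: true
           metrics_path: '/federate'
           params:
             'match[]':
               - '{__name__=~"job:.*"}'   # pull only aggregated recording rules
           static_configs:
             - targets:
                 - 'prometheus-dc1.example.com:9090'
                 - 'prometheus-dc2.example.com:9090'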
  8. How we monitor: How we started. Prometheus Operator Helm chart, full k8s
     cluster monitoring, golden signals with Ingress and cAdvisor metrics,
     Thanos for global view & LTS. The compactor didn't like empty blocks, so
     we removed them all from S3; it couldn't compact Istio metrics, so we went
     without the compactor.
  9. How we monitor: The need for the Compactor. Without downsampling,
     long-term queries can pull huge amounts of data, affecting overall store
     availability. Product engineering services need historical data for
     data-driven decisions.
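     The compactor both compacts and downsamples blocks in the bucket; a
     sketch of its container args, with retention values that are assumptions
     rather than the ones used here:

       args:
         - compact
         - --wait                                     # run continuously
         - --data-dir=/var/thanos/compact
         - --objstore.config-file=/etc/thanos/objstore.yaml
         - --retention.resolution-raw=30d             # raw samples
         - --retention.resolution-5m=180d             # 5m downsampled blocks
         - --retention.resolution-1h=2y               # 1h downsampled blocks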
  10. How we monitor: Compactor gotchas. There's no way to remove metrics with
      high cardinality or too many time series from the S3 bucket. If the
      object store has such metrics and the compactor fails, purging data
      might be the only solution.
  11. How we monitor: Good luck purging teams' data. Just announce it to the
      Engineering teams (~300 ppl) and run. Running away is inevitable.
  12. How we monitor: Adopting SRE practices to operate Prometheus. Monitoring
      Prometheus, defining reliability targets, being on-call, managing
      overload.
  13. How we monitor: Adopting SRE practices to operate Prometheus. Defining
      the querier SLO rule:

        - record: thanos_query:availability_slo:rate2w
          expr: |
            sum(thanos_query:successful_queries:rate2w)
            /
            sum(thanos_query:total_queries:rate2w)
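      One plausible way to produce thanos_query:total_queries:rate2w and
      thanos_query:successful_queries:rate2w, assuming (not shown here) that
      they are derived from Thanos Query's HTTP query handler metrics:

        # Assumed upstream rules; the source metric and labels are a guess.
        - record: thanos_query:total_queries:rate2w
          expr: |
            sum(rate(http_requests_total{job="thanos-query", handler="query"}[2w]))
        - record: thanos_query:successful_queries:rate2w
          expr: |
            sum(rate(http_requests_total{job="thanos-query", handler="query", code!~"5.."}[2w]))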
  14. How we monitor: Adopting SRE practices to operate Prometheus. Defining
      the Prometheus SLO rule:

        - record: kube_prometheus:uptime:max_avg5m
          expr: |
            max(
              avg_over_time(
                up{job="kube-prometheus"}[5m]
              )
            )
  15. How we monitor: Adopting SRE practices to operate Prometheus. Defining
      the Prometheus SLO rule, averaged over the two-week SLO window:

        - record: kube_prometheus:uptime:avg2w
          expr: |
            avg_over_time(
              kube_prometheus:uptime:max_avg5m[2w]
            )
  16. How we monitor: Adopting SRE practices to operate Prometheus. Composing
      SLOs:

        - record: prometheus:slo:rate2w
          expr: |
            (
              kube_prometheus:uptime:avg2w
              + prox_prometheus:uptime:avg2w
              + thanos_query:availability_slo:rate2w
            ) / 3 * 100
  17. How we monitor: Adopting SRE practices to operate Prometheus. Alerting on
      SLO violations:

        rules:
          - alert: PrometheusQueriesSLOViolation
            annotations:
              summary: 'Prometheus queries availability SLO is affected'
              description: 'Prometheus has query availability SLO violation'
              runbook_url: https://github.com/hellofresh/runbooks/blob/mas
            expr: prometheus:slo:rate2w < 99.9
            labels:
              severity: page
              slack: platform-alerts
  18. How we monitor: Adopting SRE practices to operate Prometheus. Alerting
      when the SLI is in danger:

        rules:
          - alert: TooManyThanosQueriesErrors
            annotations:
              summary: Percentage of gRPC errors is too high on Thanos
              description: 'Thanos Query has too many gRPC query errors'
              runbook_url: https://github.com/hellofresh/runbooks/blob/mas
            expr: |
              sum(thanos_query:grpc_queries_errors:rate5m)
              /
              sum(thanos_query:grpc_queries:rate5m) > 0.05
  19. How we monitor: Managing overload, detecting a problem. Check if the
      total number of time series has increased (prometheus_tsdb_head_series).
      Check if Prometheus is still ingesting samples
      (prometheus_tsdb_head_samples_appended_total).
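      Queries along these lines, against Prometheus's own metrics, surface both
      signals; the 1h window and the growth threshold are arbitrary examples:

        # Did the head grow noticeably over the last hour?
        prometheus_tsdb_head_series
          - prometheus_tsdb_head_series offset 1h > 100000

        # Has ingestion stalled?
        rate(prometheus_tsdb_head_samples_appended_total[5m]) == 0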
  20. How we monitor: Managing overload, finding expensive metrics, rules and
      targets. Find rules that take too long to evaluate
      (prometheus_rule_group_iterations_missed_total). Find metrics that have
      the most time series:

        topk(10, count by (__name__, job) ({__name__=~".+"}))

      Find expensive recording rules (prometheus_rule_group_last_duration_seconds).
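      Using those rule-group metrics, a query like the following flags groups
      whose evaluation eats a large share of their interval; the 0.5 ratio is
      an arbitrary example:

        # Rule groups spending more than half their interval evaluating
        prometheus_rule_group_last_duration_seconds
          / prometheus_rule_group_interval_seconds > 0.5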
  21. How we monitor: Managing overload, reducing load. Drop labels and metrics
      that introduce high cardinality or that you don't use. Increase
      scrape_interval and evaluation_interval. sample_limit causes a scrape to
      fail if more than the given number of time series is returned.
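      A sketch of how those knobs appear in a scrape config; the job name,
      limits and dropped metric pattern are illustrative:

        - job_name: 'example-service'
          scrape_interval: 30s        # scrape less often to reduce load
          sample_limit: 10000         # fail the scrape if a target explodes in series
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: 'example_debug_.*'   # hypothetical unused metrics
              action: drop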
  22. Managing overload: nginx_ingress_controller_request_duration_seconds_bucket,
      up to 50k time series. Product engineering services rely on it for SLOs,
      and queries against it were overloading Prom and Thanos:

        metric_relabel_configs:
          - separator: ;
            regex: (controller_namespace|controller_pod|endpoint|exporte
            replacement: $1
            action: labeldrop
  23. How we monitor: Sharding Prometheus for reverse proxies & edge services.

        ┌────────────┬─────────┐     ┌────────────┬─────────┐
        │ Prox Prom  │ Sidecar │ ... │ Prox Prom  │ Sidecar │
        └────────────┴─────────┘     └────────────┴─────────┘

      NGINX Ingress, OpenResty, ELB, CDN; Istio soon.
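      Each shard usually carries distinguishing external labels so Thanos Query
      can tell shards apart and deduplicate replicas; a sketch with assumed
      label names and values:

        global:
          external_labels:
            prometheus: prox-prom     # which shard this is
            replica: prox-prom-0      # matched by thanos query --query.replica-label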
  24. How we monitor: Instrumenting applications. Prometheus SDKs, OpenCensus
      for more observability. Automating away metrics for PHP services by
      parsing nginx logs with OpenResty: RED metrics out of the box and a
      common dashboard for everyone.
  25. How we monitor: Empowering engineering teams. Defining alerts and
      recording rules in app source code, provisioned Grafana dashboards,
      exposing common dashboards for everyone. The 4 golden signals FTW.
  26. How we monitor: Empowering engineering teams. In the application's source
      code:

        provisionDashboards:
          enabled: true
          dashboardLabel: grafana_dashboard
        prometheusRules:
          - name: svc-recording-rules
            rules: ...
          - name: svc-alerting-rules
            rules: ...
  27. How we monitor: Empowering engineering teams.

        {{ if .Values.prometheusRules }}
        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule  # operator CRD
        metadata:
          labels:
            app: {{ template "svc.name" $ }}
            chart: {{ template "svc.chart" $ }}
            tribe: {{ $.Values.tribe }}
            prometheus: kube-prometheus
          name: {{ template "svc.name" $ }}
        spec:
          groups:
        {{ toYaml .Values.prometheusRules | indent 4 }}
        {{ end }}
  28. How we monitor: Empowering engineering teams.

        {{- if .Values.provisionDashboards }}
        {{- if .Values.provisionDashboards.enabled }}
        {{- $root := . }}
        {{- range $path, $bytes := .Files.Glob "dashboards/*" }}
        ---
        kind: ConfigMap
        metadata:
          ...
          {{ $.Values.provisionDashboards.dashboardLabel }}: "1"
        data:
          {{ base $path }}: |-
        {{ $root.Files.Get $path | indent 4 }}
        {{ end }}
        {{- end }}
        {{- end }}
  29. How we alert: At the edge, as close to the user as possible. If one
      backend goes away, you still have metrics to alert on.
  30. How we alert: On symptoms. Not only 5xx; high rates of 4xx should also
      page. When the SLO is in danger. On user journeys (soon).
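      As an illustration of paging on 4xx at the edge, an alert along these
      lines could sit on the NGINX Ingress metrics; the name, threshold and
      labels are assumptions:

        - alert: TooManyEdge4xxResponses
          expr: |
            sum(rate(nginx_ingress_controller_requests{status=~"4.."}[5m]))
            /
            sum(rate(nginx_ingress_controller_requests[5m])) > 0.05
          for: 5m
          labels:
            severity: page
          annotations:
            summary: High rate of 4xx responses at the edge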
  31. What comes next? Run chaos experiments on Prometheus. Experiment 1: query
      availability. Attack: Thanos sidecar pod deletion. Scope: multiple pods.
      Expected results: the rate of good Thanos querier gRPC queries is not
      affected and the TooManyQueriesErrors alert should not fire.
  32. What comes next? More vertical sharding (storage systems, observability
      services), Prometheus per k8s namespace, Istio for out-of-the-box metrics
      and tracing.