Slide 1

Monitoring with Prometheus & Thanos

Slide 2

Agenda
- Why Monitoring
- Monitoring Stack
- Why Thanos
- Thanos Components
- Managing Overload

Slide 3

Why Do SREs Care About Monitoring?

Slide 4

How We Monitor
Monitoring stack on k8s: Prometheus (via the Prometheus Operator), Thanos, Grafana
Legacy: Graphite & StatsD, Prometheus on EC2

Slide 5

Deployment
Prometheus Operator Helm chart:
- Prometheus, Alertmanager, Grafana servers
- Host node_exporter
- kube-state-metrics
- Default set of alerts and dashboards
The Helm chart IS NOT kube-prometheus
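
For reference, installing the stack is a plain Helm release against those values; a minimal sketch, assuming the stable/prometheus-operator chart and a release name of kube-prometheus (both illustrative, not from the deck):

# hypothetical invocation; chart and release names are illustrative
helm upgrade --install kube-prometheus stable/prometheus-operator \
  --namespace monitoring \
  -f kube-prometheus/values.yaml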

Slide 6

Configuration
kube-prometheus/values.yaml

prometheus:
  ingress: ...
  prometheusSpec:
    replicas: 2
    retention: 48h
    storageSpec: ...
alertmanager:
  ingress: ...
  alertmanagerSpec: ...
  config:
    route:
      routes: ...
    receivers: ...
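
The operator materializes these values as Prometheus and Alertmanager custom resources, so the rendered state can be inspected directly; a quick sketch (namespace as in the deck):

# list the operator-managed custom resources
kubectl get prometheus,alertmanager -n monitoring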

Slide 7

Scraping targets
kube-prometheus/values.yaml

prometheus:
  prometheusSpec:
    serviceMonitorNamespaceSelector:
      any: true
    # scrape any ServiceMonitor that has the metadata label `prometheus=my-prometheus`
    serviceMonitorSelector:
      matchExpressions:
        - key: prometheus
          operator: In
          values:
            - my-prometheus
    ...

Slide 8

Scraping targets
kubectl get servicemonitor random-svc-k8s-sm -o yaml

kind: ServiceMonitor
metadata:
  labels:
    prometheus: my-prometheus
  name: random-svc-k8s-sm
spec:
  namespaceSelector:
    matchNames:
      - tribe-name
  selector:
    matchLabels:
      app: random-svc-k8s
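
For that selector to match anything, the target Service must carry the app: random-svc-k8s label; a minimal sketch of such a Service (the port name and number are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: random-svc-k8s
  namespace: tribe-name
  labels:
    app: random-svc-k8s   # matched by the ServiceMonitor's selector
spec:
  selector:
    app: random-svc-k8s
  ports:
    - name: metrics       # ServiceMonitor endpoints reference ports by name
      port: 9090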

Slide 9

Namespaced Prometheus
- HA Prometheus pods per namespace
- Only for product engineering teams
- Reduce coupling between teams

Slide 10

Provisioning
cluster-namespaces/charts/helmfile.yaml

- name: {{ requiredEnv "NS_NAME" }}-monitoring
  namespace: monitoring
  chart: ./ns-monitoring
  version: 0.1.0
  values:
    - ../namespaces/{{ requiredEnv "NS_NAME" }}/{{ requiredEnv "ENV_NAME" }}/values.yaml
    - ns-monitoring/values-{{ requiredEnv "ENV_NAME" }}.yaml.gotmpl
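
Because the release is templated with requiredEnv, provisioning a team's stack boils down to exporting the two variables before running helmfile; a sketch with illustrative values:

# hypothetical run for the payments namespace in the live environment
NS_NAME=payments ENV_NAME=live helmfile -f cluster-namespaces/charts/helmfile.yaml apply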

Slide 11

Configuration
cluster-namespaces/charts/ns-monitoring/values-live.yaml.gotmpl

prometheusSpec:
  ruleSelector:
    matchExpressions:
      - key: app.kubernetes.io/instance
        operator: In
        values:
          - {{ requiredEnv "NS_NAME" }}-init
  ruleNamespaceSelector:
    matchExpressions:
      - key: app.kubernetes.io/instance
        operator: In
        values:
          - {{ requiredEnv "NS_NAME" }}-init
  serviceMonitorSelector:
    matchExpressions: ...

Slide 12

Listing Running Prometheus
kgp -l app=prometheus -n monitoring | awk '{print $1}'   # kgp: alias for `kubectl get pods`

prometheus-consumer-core-monitoring-prometheus-0
prometheus-consumer-core-monitoring-prometheus-1
prometheus-conversions-monitoring-prometheus-0
prometheus-conversions-monitoring-prometheus-1
prometheus-crm-data-monitoring-prometheus-0
prometheus-crm-data-monitoring-prometheus-1
prometheus-discovery-monitoring-prometheus-0
prometheus-discovery-monitoring-prometheus-1
prometheus-payments-monitoring-prometheus-0
prometheus-payments-monitoring-prometheus-1
prometheus-platform-monitoring-prometheus-0
prometheus-platform-monitoring-prometheus-1
...

Slide 13

http://grafana.live-k8s.hellofresh.io/d/prometheus/prometheus?orgId=1&refresh=1m

Slide 14

Limitations
Alerting over longer periods is not possible; retention is too low.

Slide 15

Wrapping Up
- HA Prometheus pairs per namespace are provisioned automatically, although they all run in the monitoring namespace.
- The Platform on-call team is paged if something goes wrong.
- Responsibility is shared between the Cloud Runtime and SRE squads.
- Namespaced Prometheus pods have 12h of retention.

Slide 16

Routing Alerts
charts/external/kube-prometheus/values-live-eks.yaml

alertmanager:
  config:
    route:
      receiver: 'default-receiver'
      routes:
        - match:
            tribe: platform
          receiver: 'platform-opsgenie'
        # we need to do the step above for every tribe
        - match:
            tribe: page
          receiver: 'opsgenie'
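
Routing trees like this can be dry-run with amtool against the rendered Alertmanager config; a sketch, assuming the config has been dumped to a local alertmanager.yaml (path and label set are illustrative):

# show which receiver an alert labelled tribe=platform would reach
amtool config routes test --config.file=alertmanager.yaml tribe=platform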

Slide 17

Why Thanos?
- Prometheus federation is hard
- Global view
- Long-term storage (LTS)
- Downsampling

Slide 18

Prometheus Federation
A global Prometheus that pulls aggregated metrics from datacenter Prometheus servers.

Slide 19

No content

Slide 20

Prometheus Federation
prometheus.yaml

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'kube-prometheus:9090'
          - 'platform-prometheus:9090'
          - 'consumer-core-prometheus:9090'
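
The match[] selector only federates series whose names start with job:, i.e. pre-aggregated recording rules. A minimal sketch of such a rule on a datacenter Prometheus (the underlying metric name is hypothetical):

groups:
  - name: federation-aggregates
    rules:
      # aggregate away per-instance detail before it crosses the federation boundary
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))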

Slide 21

Thanos Architecture

Slide 22

Thanos Querier

Slide 23

Thanos Sidecar

Slide 24

Thanos Store

Slide 25

Thanos Compactor

Slide 26

Thanos Ruler

Slide 27

HA Ruler
Two different deployments:

rule:
  replicas: 1
  persistence:
    enabled: true
    size: 10Gi
    storageClass: default
ruleReplica:
  replicas: 1
  ...

Slide 28

Deployment Models
- Federation, Global View
- Monitoring Cluster

Slide 29

Federation with Thanos

Slide 30

No content

Slide 31

Managing Overload
High-cardinality metrics are the most likely cause of performance problems.

Slide 32

TSDB (Time Series Database), Briefly

Slide 33

Detecting a Problem
The prometheus_tsdb_head_series metric reports the total number of time series in the head block; check whether it has increased.
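
A hedged way to watch for a cardinality jump is to compare the series count against an earlier baseline; the one-day offset and 1.5x threshold below are arbitrary choices:

# fires when the head holds 50% more series than a day ago
prometheus_tsdb_head_series / prometheus_tsdb_head_series offset 1d > 1.5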

Slide 34

Detecting a Problem
The rate of prometheus_tsdb_head_samples_appended_total gives the ingestion rate (samples appended per second).
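
Since it is a counter, the ingestion rate is read with rate(); for example:

# samples ingested per second, averaged over five minutes
rate(prometheus_tsdb_head_samples_appended_total[5m])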

Slide 35

Detecting a Problem
Find expensive metrics:
topk(10, count by (__name__, job) ({__name__=~".+"}))
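
On Prometheus 2.10+, the TSDB status endpoint exposes similar cardinality statistics without running this fairly expensive query; a sketch, assuming port-forwarded access on localhost:9090:

# top metric names by series count
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'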

Slide 36

Reduce Load
Dropping labels
kubectl edit servicemonitors service-monitor-name

endpoints:
  ...
  metricRelabelings:
    - action: labeldrop
      regex: (id|path|status)

Note: after a labeldrop, the remaining label set must still be unique per series, or the scrape will fail with duplicate samples.

Slide 37

Reduce Load
Dropping metrics

endpoints:
  ...
  metricRelabelings:
    - action: drop
      regex: component_(expensive_metric_foo|expensive_metric_bar)
      sourceLabels:
        - __name__
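
Once the rule is active, a quick check that the metrics are gone (they disappear from instant queries after the ~5m staleness window):

# should return nothing once the drop has taken effect
count({__name__=~"component_expensive_metric_.*"})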

Slide 38

Reduce Load
Increase scrape_interval and evaluation_interval.
On a ServiceMonitor:
kubectl edit servicemonitors service-monitor-name

endpoints:
  interval: 45s
  ...

PS: Keep in mind that it's not practical to increase these beyond 2 minutes: Prometheus marks a series stale 5 minutes after its last sample, so a single failed scrape would already leave gaps.
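
Whether the new interval is actually being honored can be read back from Prometheus's self-metrics; for example:

# observed interval between scrapes, worst case per configured interval
max by (interval) (prometheus_target_interval_length_seconds{quantile="0.99"})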

Slide 39

Reduce Load
On the Prometheus Operator, globally:

prometheusSpec:
  scrapeInterval: 45s
  evaluationInterval: 45s
  ...

Slide 40

Sample Limit
On a ServiceMonitor:

spec:
  sampleLimit: 150000

On the Thanos Store Helm chart config:

containers:
  - name: thanos-store
    args:
      - "--store.grpc.series-sample-limit={{ .Values.store.sampleLimit }}"
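
To pick a limit that will not immediately trip, it helps to see how many samples the heaviest targets expose today, using Prometheus's per-target synthetic metric:

# ten heaviest targets by samples per scrape
topk(10, scrape_samples_scraped)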

Slide 41

Thank you