
monitoring-prometehus-thanos.pdf


Rafael Jesus

November 07, 2019



Transcript

  1. How we monitor. Monitoring stack: Prometheus Operator on k8s, Thanos,
     Grafana. Legacy stack: Graphite & StatsD, Prometheus on EC2.
  2. Deployment. The Prometheus Operator helm chart installs Prometheus, Alertmanager,
     and Grafana servers, the host node_exporter, kube-state-metrics, and a default
     set of alerts and dashboards. The helm chart IS NOT kube-prometheus.
  3. Configuration (kube-prometheus/values.yaml):

```yaml
prometheus:
  ingress: ...
  prometheusSpec:
    replicas: 2
    retention: 48h
    storageSpec: ...
alertmanager:
  ingress: ...
  alertmanagerSpec: ...
  config:
    route:
      routes: ...
    receivers: ...
```
  4. Scraping targets (kube-prometheus/values.yaml):

```yaml
prometheus:
  prometheusSpec:
    serviceMonitorNamespaceSelector:
      any: true
    # scrape any ServiceMonitor which has the metadata label `prometheus=my-prometheus`
    serviceMonitorSelector:
      matchExpressions:
      - key: prometheus
        operator: In
        values:
        - my-prometheus
    ...
```
  5. Scraping targets (kubectl get servicemonitor random-svc-k8s-sm -o yaml):

```yaml
kind: ServiceMonitor
metadata:
  labels:
    prometheus: my-prometheus
  name: random-svc-k8s-sm
spec:
  namespaceSelector:
    matchNames:
    - tribe-name
  selector:
    matchLabels:
      app: random-svc-k8s
```
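For context, the ServiceMonitor above selects Services by namespace and label. A Service it would match might look like the following sketch (the port name and number are assumptions, not from the deck):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: random-svc-k8s      # hypothetical Service picked up by the ServiceMonitor
  namespace: tribe-name     # must be listed in spec.namespaceSelector.matchNames
  labels:
    app: random-svc-k8s     # must match spec.selector.matchLabels
spec:
  ports:
  - name: metrics           # ServiceMonitor endpoints reference Service ports by name
    port: 9090
  selector:
    app: random-svc-k8s
```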
  6. Namespaced Prometheus: HA Prometheus pods per namespace, only for product
     engineering teams, to reduce coupling between teams.
  7. Provisioning (cluster-namespaces/charts/helmfile.yaml):

```yaml
- name: {{ requiredEnv "NS_NAME" }}-monitoring
  namespace: monitoring
  chart: ./ns-monitoring
  version: 0.1.0
  values:
  - ../namespaces/{{ requiredEnv "NS_NAME" }}/{{ requiredEnv "ENV_NAME" }}/values.yaml
  - ns-monitoring/values-{{ requiredEnv "ENV_NAME" }}.yaml.gotmpl
```
  8. Configuration (cluster-namespaces/charts/ns-monitoring/values-live.yaml.gotmpl):

```yaml
prometheusSpec:
  matchExpressions:
  - key: app.kubernetes.io/instance
    operator: In
    values:
    - {{ requiredEnv "NS_NAME" }}-init
ruleNamespaceSelector:
  matchExpressions:
  - key: app.kubernetes.io/instance
    operator: In
    values:
    - {{ requiredEnv "NS_NAME" }}-init
serviceMonitorSelector:
  matchExpressions: ...
```
  9. Listing running Prometheus:

```shell
kgp -l app=prometheus -n monitoring | awk '{print $1}'
prometheus-consumer-core-monitoring-prometheus-0
prometheus-consumer-core-monitoring-prometheus-1
prometheus-conversions-monitoring-prometheus-0
prometheus-conversions-monitoring-prometheus-1
prometheus-crm-data-monitoring-prometheus-0
prometheus-crm-data-monitoring-prometheus-1
prometheus-discovery-monitoring-prometheus-0
prometheus-discovery-monitoring-prometheus-1
prometheus-payments-monitoring-prometheus-0
prometheus-payments-monitoring-prometheus-1
prometheus-platform-monitoring-prometheus-0
prometheus-platform-monitoring-prometheus-1
...
```
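kgp is most likely a shell alias for `kubectl get pods`; the awk step just keeps the first column (the pod name). The extraction can be sketched offline against a canned line instead of live cluster output (the pod name is taken from the listing above):

```shell
# Feed awk one simulated `kubectl get pods` row and keep the NAME column,
# as the `| awk '{print $1}'` step on the slide does
printf 'prometheus-platform-monitoring-prometheus-0 2/2 Running 0 3d\n' \
  | awk '{print $1}'
```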
  10. Wrapping up: HA Prometheus pairs per namespace are provisioned automatically,
      although they all run in the monitoring namespace. The platform on-call team is
      paged if something goes wrong. Responsibility is shared between the Cloud Runtime
      and SRE squads. Namespaced Prometheus pods have 12h of retention.
  11. Routing alerts (charts/external/kube-prometheus/values-live-eks.yaml):

```yaml
alertmanager:
  config:
    route:
      receiver: 'default-receiver'
      routes:
      - match:
          tribe: platform
        receiver: 'platform-opsgenie'
      # we need to do the step above for every tribe
      - match:
          tribe: page
        receiver: 'opsgenie'
```
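The receivers named in the routes are defined elsewhere in the Alertmanager config. A minimal sketch of one, assuming the standard opsgenie_configs integration (the key placeholder is hypothetical; in practice it would come from a secret):

```yaml
receivers:
- name: 'platform-opsgenie'
  opsgenie_configs:
  - api_key: <opsgenie-api-key>   # hypothetical placeholder, not a real key
    priority: P2
```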
  12. Prometheus federation (prometheus.yaml):

```yaml
scrape_configs:
- job_name: 'federate'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
    - '{__name__=~"job:.*"}'
  static_configs:
  - targets:
    - 'kube-prometheus:9090'
    - 'platform-prometheus:9090'
    - 'consumer-core-prometheus:9090'
```
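The match[] selector above only federates series whose names start with `job:`, i.e. aggregated recording rules rather than raw series. A sketch of a rule that would produce such a series (the metric and group names are hypothetical):

```yaml
groups:
- name: federation-rules
  rules:
  # hypothetical example: per-job request rate, named to match {__name__=~"job:.*"}
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))
```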
  13. HA Ruler: two different deployments.

```yaml
rule:
  replicas: 1
  persistence:
    enabled: true
    size: 10Gi
    storageClass: default
ruleReplica:
  replicas: 1
  ...
```
  14. Managing overload: high-cardinality metrics are the most likely cause of
      performance problems.
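To find the worst offenders, one option (a sketch, not from the deck) is to ask Prometheus for the metric names with the most series:

```promql
# top 10 metric names by number of series
topk(10, count by (__name__) ({__name__=~".+"}))
```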
  15. Reduce load by dropping labels (kubectl edit servicemonitors service-monitor-name):

```yaml
endpoints:
  ...
  metricRelabelings:
  - action: labeldrop
    regex: (id|path|status)
```
  16. Reduce load by dropping metrics:

```yaml
endpoints:
  ...
  metricRelabelings:
  - action: drop
    regex: component_(expensive_metric_foo|expensive_metric_bar)
    sourceLabels:
    - __name__
```
  17. Reduce load by increasing scrape_interval and evaluation_interval. On the
      service monitor (kubectl edit servicemonitors service-monitor-name):

```yaml
endpoints:
  interval: 45s
  ...
```

PS: keep in mind that it is not practical to increase these beyond 2 minutes (Prometheus marks a series stale after 5 minutes without a sample, so longer intervals leave little headroom for a failed scrape).
  18. Sample limit. On the service monitor:

```yaml
endpoints:
  sampleLimit: 150000
```

On the Thanos Store helm chart config:

```yaml
containers:
- name: thanos-store
  args:
  - "--store.grpc.series-sample-limit={{ .Values.store.sampleLimit }}"
```
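When a target exceeds its sample limit, Prometheus fails the whole scrape, so it is worth alerting on it. A sketch of such a rule (the alert and group names are hypothetical; the counter is one Prometheus exposes about itself):

```yaml
groups:
- name: overload
  rules:
  - alert: ScrapeSampleLimitExceeded   # hypothetical alert name
    expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 0
    for: 10m
    labels:
      severity: warning
```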