Monitoring your Kubernetes cluster with Prometheus [TDC POA 2018]

Presented at The Developers Conference - Porto Alegre 2018

Alexandre Cisneiros

December 08, 2018

Transcript

  1. A rewards program entirely different from anything else in the domestic market: 100% digital, simple, and with points that never expire.
  2. Our take on a bank account: a simple, smart way to store and manage your money, with daily yields.
  3. Kubernetes, from the Greek for "helmsman". An open source project that provisions containers on nodes, based on a desired state. Based on Google's environment.
  4. Kubernetes structure (diagram): masters running etcd, the API, and the Scheduler; nodes, each running a kubelet and its pods (Pod 1, Pod 2, Pod 3).
  5. Kubernetes structure (diagram): the Scheduler, and the kubelet and pods on each node.
  6. Prometheus is an open source system for collecting metrics, evaluating queries, and firing alerts. It originated at SoundCloud.
  7. Metrics collection: Prometheus sends HTTP GET /metrics to the application, which answers with:

        # TYPE requests_total counter
        requests_total 1040
        # TYPE request_latency gauge
        request_latency{path="a"} 50
        request_latency{path="b"} 23

  8. Metrics collection: the application's response (the same payload as on slide 7) travels back to Prometheus.
  9. Metrics collection: Prometheus stores the collected samples (Store).
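The payload returned by /metrics is plain text, so it is easy to read by hand or by program. The following is a minimal sketch, assuming a simplified version of the text format (no HELP lines, timestamps, or escaped quotes in label values); `parse_metrics` is a hypothetical helper written for this illustration, not part of Prometheus.

```python
import re

# Payload from the slides, with the comment marker written as "# TYPE".
PAYLOAD = """\
# TYPE requests_total counter
requests_total 1040
# TYPE request_latency gauge
request_latency{path="a"} 50
request_latency{path="b"} 23
"""

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return (types, samples) parsed from a simplified metrics payload."""
    types = {}    # metric name -> declared type
    samples = []  # (metric name, labels dict, float value)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split()
            # Only "# TYPE <name> <type>" comments are kept in this sketch.
            if len(parts) == 4 and parts[1] == "TYPE":
                types[parts[2]] = parts[3]
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return types, samples


types, samples = parse_metrics(PAYLOAD)
for name, labels, value in samples:
    print(name, labels, value)
```

Each line becomes one sample: a metric name, a set of labels, and a numeric value. That triple is the unit Prometheus stores and queries.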
  10. In real life... Prometheus finds applications through Service Discovery (DNS, AWS, K8S) and then sends HTTP GET /metrics to each one.
  11. Metric types: Counter. A number that always grows (or stays the same). Always. Examples: number of requests; number of credit card transactions; number of exceptions thrown.
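Because a counter only goes up, a drop between two readings can only mean the process restarted and the counter began again from zero. The sketch below shows how an increase can be computed with that rule. It is an illustration only: `counter_increase` and the readings are made up for this example, and Prometheus's own `increase()` additionally extrapolates to the edges of the time window, which is ignored here.

```python
def counter_increase(readings):
    """Total increase across raw counter readings, tolerating resets."""
    total = 0.0
    for previous, current in zip(readings, readings[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The value went down: the counter was reset to zero, so
            # everything counted since the reset is new.
            total += current
    return total


# Hypothetical scrapes; the process restarted between 1100 and 15.
readings = [1000, 1040, 1100, 15, 60]
print(counter_increase(readings))
```

This is why counters suit totals such as requests or exceptions: the absolute value matters less than how fast it grows.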
  12. Metric types: Gauge. A number that can go up or down. Examples: outbound data rate (MB/s); number of online users; number of unprocessed dead letters.
  13. Metric types: Summary. Computes predefined percentiles of a metric. Examples: latency, 50th percentile (median); latency, 95th percentile; latency, 99th percentile.
  14. Metric types: Histogram. Counts occurrences of a metric in predefined intervals. Examples: latency between 0-100 ms; latency between 101-500 ms; latency between 501-∞ ms.
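The intervals on the histogram slide can be sketched in a few lines of Python. One detail worth knowing: Prometheus histograms expose cumulative buckets (every observation less than or equal to an upper bound), and the per-interval counts are obtained by subtracting neighbouring buckets. The function names and the latency values below are hypothetical, chosen only for illustration.

```python
import math


def observe_all(latencies_ms, upper_bounds):
    """Count observations per cumulative bucket (value <= upper bound)."""
    bounds = sorted(upper_bounds) + [math.inf]
    counts = {bound: 0 for bound in bounds}
    for value in latencies_ms:
        for bound in bounds:
            if value <= bound:
                counts[bound] += 1
    return counts


def per_interval(cumulative):
    """Turn cumulative counts into the per-interval counts from the slide."""
    result = {}
    previous = 0
    for bound in sorted(cumulative):
        result[bound] = cumulative[bound] - previous
        previous = cumulative[bound]
    return result


latencies = [12, 80, 95, 130, 480, 700, 2500]  # hypothetical latencies in ms
cumulative = observe_all(latencies, [100, 500])
intervals = per_interval(cumulative)
print(cumulative)
print(intervals)
```

With bounds at 100 ms and 500 ms this reproduces the three intervals from the slide: up to 100 ms, 101-500 ms, and everything above 500 ms.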
  15. Queries: services_http_requests_total

        Timeseries                                                                                           Value
        services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="2XX"}      19202
        services_http_requests_total{service="cca", env="prod", path="/api/account/:id", status="2XX"}       92838
        services_http_requests_total{service="billing", env="staging", path="/api/bill/:id", status="3XX"}   1020
        services_http_requests_total{service="auth", env="prod", path="/api/user/:id", status="4XX"}         2938
        services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="5XX"}      127
  16. Queries: services_http_requests_total{service="billing", env="prod"}

        Timeseries                                                                                         Value
        services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="2XX"}    19202
        services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="3XX"}    2732
        services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="4XX"}    2023
        services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="5XX"}    127
        services_http_requests_total{service="billing", env="prod", path="/api/close/:id", status="2XX"}   49283
  17. Queries: sum(services_http_requests_total{service="billing", env="prod"}) by (path)

        Timeseries                      Value
        {path="/api/bill/:id"}          109202
        {path="/api/close/:id"}         12732
        {path="/api/history/:id"}       90223
        {path="/api/revert/:id"}        13270
        {path="/api/reprocess/:id"}     50233
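The two steps in these queries, filtering series by label and then aggregating by one label, can be sketched in Python as follows. The rows are the handful of series visible on slides 15 and 16, so the totals differ from slide 17, which aggregates more series than the slides list. `select` and `sum_by` are hypothetical helpers, and only exact-match label selectors are modelled.

```python
from collections import defaultdict

# (labels, value) pairs copied from the slides.
SERIES = [
    ({"service": "billing", "env": "prod", "path": "/api/bill/:id", "status": "2XX"}, 19202),
    ({"service": "billing", "env": "prod", "path": "/api/bill/:id", "status": "3XX"}, 2732),
    ({"service": "billing", "env": "prod", "path": "/api/bill/:id", "status": "4XX"}, 2023),
    ({"service": "billing", "env": "prod", "path": "/api/bill/:id", "status": "5XX"}, 127),
    ({"service": "billing", "env": "prod", "path": "/api/close/:id", "status": "2XX"}, 49283),
    ({"service": "cca", "env": "prod", "path": "/api/account/:id", "status": "2XX"}, 92838),
    ({"service": "billing", "env": "staging", "path": "/api/bill/:id", "status": "3XX"}, 1020),
]


def select(series, **matchers):
    """Keep the series whose labels equal every matcher, like {service="billing"}."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(name) == wanted for name, wanted in matchers.items())
    ]


def sum_by(series, *group_labels):
    """Add up values per combination of the grouping labels, like sum(...) by (path)."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(labels.get(name, "") for name in group_labels)
        totals[key] += value
    return dict(totals)


billing_prod = select(SERIES, service="billing", env="prod")
by_path = sum_by(billing_prod, "path")
print(by_path)
```

The aggregation drops every label except the ones named in `by`, which is why the result on slide 17 shows only `path`.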
  18. What to monitor? Instances (VMs/servers). (Diagram: a master with etcd, the API, and the Scheduler; a node with its kubelet and Pods 1-3.)
  19. What to monitor? Instances (VMs/servers) and Kubernetes components.
  20. What to monitor? Instances (VMs/servers), Kubernetes components, and application containers.
  21. Monitoring instances: Node Exporter. Covers CPU, memory, disk, network, ARP, Bcache, Bonding, boot time, Conntrack, error detection, entropy, exec stats, file descriptors, Hwmon, IPVS, load average, Mdadm, net class, NFS, NFSD, sockets, time, Uname, XFS, ZFS.
  22. Monitoring instances: Node Exporter runs as a DaemonSet, with one Node Exporter on each master and node (diagram).
  23. Monitoring Kubernetes components: Kube State Metrics. Covers Pods, Deployments, Services, ReplicaSets, ReplicationControllers, DaemonSets, Jobs, Nodes, AutoScaler, PersistentVolumes, Namespaces, Secrets, ConfigMaps, ResourceQuotas.
  24. Monitoring Kubernetes components: Kube State Metrics runs as a Deployment (diagram: masters and nodes, each already running a Node Exporter).
  25. Monitoring Kubernetes components: Kube State Metrics (KSM) instances placed in the cluster by the Deployment (diagram: the same cluster, now with KSM next to the Node Exporters).
  26. Monitoring application containers: Kubelet + cAdvisor. Already running on the instances! (Diagram: masters and nodes with Node Exporter and KSM.)
  28. Configuring monitoring. Service Discovery: YAML

        scrape_configs:
        - job_name: 'kubernetes-apiservers'
          kubernetes_sd_configs:
          - role: endpoints
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

        - job_name: 'kubernetes-nodes'
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
          - role: node
          relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
  29. Configuring monitoring. Service Discovery: YAML

          - source_labels: [__meta_kubernetes_service_name]
            target_label: kubernetes_name

        - job_name: 'kubernetes-ingresses'
          metrics_path: /probe
          params:
            module: [http_2xx]
          kubernetes_sd_configs:
          - role: ingress
          relabel_configs:
          - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
            regex: (.+);(.+);(.+)
            replacement: ${1}://${2}${3}
            target_label: __param_target
          - target_label: __address__
            replacement: blackbox-exporter.example.com:9115
          - source_labels: [__param_target]
            target_label: instance
          - action: labelmap
            regex: __meta_kubernetes_ingress_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_ingress_name]
            target_label: kubernetes_name

        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
          - role: pod
          relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
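The relabel rules in the configuration above are easier to follow once executed by hand. The sketch below models two of the actions used on the slides, `replace` and `labelmap`, in Python. It rests on a few assumptions: the helper functions are hypothetical, Python's `re` stands in for the RE2 engine Prometheus uses, and the target labels (`shop.example.com`, `/checkout`, `team=payments`) are invented for the example.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Apply one 'replace' rule and return the updated label set."""
    # The source label values are joined with the separator before matching.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels stay unchanged
    # Translate ${1}-style references into Python's \g<1> syntax.
    template = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


def relabel_labelmap(labels, regex):
    """Copy every label whose name matches, renamed to the first capture group."""
    updated = dict(labels)
    for name, value in labels.items():
        match = re.fullmatch(regex, name)
        if match:
            updated[match.group(1)] = value
    return updated


# Hypothetical labels, as service discovery might attach them to an ingress target.
target = {
    "__meta_kubernetes_ingress_scheme": "https",
    "__address__": "shop.example.com",
    "__meta_kubernetes_ingress_path": "/checkout",
    "__meta_kubernetes_ingress_label_team": "payments",
}

target = relabel_replace(
    target,
    ["__meta_kubernetes_ingress_scheme", "__address__", "__meta_kubernetes_ingress_path"],
    r"(.+);(.+);(.+)",
    "${1}://${2}${3}",
    "__param_target",
)
target = relabel_labelmap(target, r"__meta_kubernetes_ingress_label_(.+)")
print(target["__param_target"], target["team"])
```

In the `kubernetes-ingresses` job this is how the probe URL is assembled into `__param_target` before `__address__` is pointed at the blackbox exporter.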
  30. What is an alert? (Chart from 20:00 to 20:25 with three series: http_error_ratio{service="billing"}, http_error_ratio{service="auth"}, and http_error_ratio{service="cca"}.)
  31. What is an alert? avg(http_error_ratio{env="prod"}) by (service) > 5% for 10 minutes (same chart).
  32. What is an alert? avg(http_error_ratio{env="prod"}) by (service) > 5% for 10 minutes (same chart, with the 5% threshold marked).
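The "for 10 minutes" part of the rule means a single bad reading is not enough: the condition has to hold continuously before the alert fires. A minimal sketch of that behaviour follows, assuming one evaluation every 5 minutes as on the chart's time axis; `alert_states` and the error ratios are hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) pairs for one series checked against value > threshold."""
    active_since = None
    states = []
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held = timestamp - active_since
            states.append((timestamp, "firing" if held >= for_seconds else "pending"))
        else:
            active_since = None  # condition broken: the wait starts over
            states.append((timestamp, "inactive"))
    return states


MINUTE = 60
# Hypothetical error ratios; timestamps are seconds after 20:00.
samples = [
    (0 * MINUTE, 0.02),
    (5 * MINUTE, 0.07),
    (10 * MINUTE, 0.08),
    (15 * MINUTE, 0.09),
    (20 * MINUTE, 0.03),
    (25 * MINUTE, 0.06),
]
states = alert_states(samples, threshold=0.05, for_seconds=10 * MINUTE)
for timestamp, state in states:
    print(timestamp // MINUTE, state)
```

The series crosses 5% at 20:05 but only fires at 20:15, and the dip at 20:20 resets the wait, so the reading at 20:25 is pending again.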
  34. Defining alerts. Rules YAML

        groups:
        - name: kubernetes_alerts
          rules:
          - alert: k8s_container_is_frequently_restarting
            expr: round(increase(kube_pod_container_status_restarts_total[30m])) > 5
            for: 10m
            labels:
              severity: warning
              service: kubernetes
              squad: platform
            annotations:
              description: Pod {{$labels.namespace}}/{{$labels.pod}} was restarted {{$value}} times within the
  35. Defining alerts. Rules YAML

              severity: warning
              service: kubernetes
              squad: platform
            annotations:
              description: Pod {{$labels.namespace}}/{{$labels.pod}} was restarted {{$value}} times within the last hour

          - alert: k8s_instance_disk_will_fill_in_few_hours
            expr: predict_linear(node_filesystem_free{job="kubernetes-node-exporter", mountpoint="/"}[1h], 3600 * 10) < 0 and on(instance, job) (time() - node_boot_time > 3 * 3600)
            for: 30m
            labels:
              severity: critical
              service: kubernetes
              squad: platform
            annotations:
              description: Instance disk {{$labels.instance}} will fill in about 10 hours
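The disk alert relies on `predict_linear`, which fits a straight line through the samples of the last hour and extends it 10 hours into the future. The idea can be sketched with an ordinary least-squares fit. This is a simplified illustration, not Prometheus's implementation: the function below is a hypothetical Python stand-in, it treats the last sample's timestamp as "now", and the free-space readings (in GB) are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate the least-squares line through (timestamp, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    now = samples[-1][0]  # assumption: the last sample marks the evaluation time
    return intercept + slope * (now + seconds_ahead)


# Hypothetical free disk space in GB, sampled every 15 minutes over one hour.
shrinking_slowly = [(0, 50), (900, 49), (1800, 48), (2700, 47), (3600, 46)]
almost_full = [(0, 30), (900, 29), (1800, 28), (2700, 27), (3600, 26)]

for name, samples in [("shrinking_slowly", shrinking_slowly), ("almost_full", almost_full)]:
    predicted = predict_linear(samples, 3600 * 10)
    print(name, round(predicted, 2), "alert" if predicted < 0 else "ok")
```

Both disks lose space at the same rate, but only the second one is predicted to drop below zero within 10 hours, which is the condition the rule checks with `< 0`.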
  36. Routing alerts. Alertmanager YAML

        route:
          receiver: default_receiver
          group_by: [alertname, squad, env]
          group_wait: 15s
          repeat_interval: 1h
          routes:
          - receiver: platform_slack
            repeat_interval: 30m
            match_re:
              squad: platform
              severity: critical|warning|info
            continue: true
          - receiver: platform_opsgenie
            match_re:
              squad: platform
              severity: critical
            continue: true
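The routing tree above can be traced by hand with a small model. The sketch below covers only what this configuration uses: child routes checked in order, `match_re` patterns anchored to the whole label value, `continue: true` letting an alert carry on to the next sibling, and a fallback to the top-level receiver when no child matches. Grouping and timing options are left out, and `receivers_for` is a hypothetical helper.

```python
import re

DEFAULT_RECEIVER = "default_receiver"

# Child routes from the slide, in the order they are declared.
ROUTES = [
    {
        "receiver": "platform_slack",
        "match_re": {"squad": "platform", "severity": "critical|warning|info"},
        "continue": True,
    },
    {
        "receiver": "platform_opsgenie",
        "match_re": {"squad": "platform", "severity": "critical"},
        "continue": True,
    },
]


def matches(route, labels):
    """True when every match_re pattern matches the whole label value."""
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in route["match_re"].items()
    )


def receivers_for(labels):
    """Return the receivers an alert with these labels is delivered to."""
    selected = []
    for route in ROUTES:
        if matches(route, labels):
            selected.append(route["receiver"])
            if not route["continue"]:
                break  # without continue, the first match ends the search
    return selected or [DEFAULT_RECEIVER]


print(receivers_for({"squad": "platform", "severity": "critical"}))
print(receivers_for({"squad": "platform", "severity": "warning"}))
print(receivers_for({"squad": "billing", "severity": "critical"}))
```

A critical platform alert reaches both Slack and OpsGenie, a warning reaches Slack only, and alerts from other squads fall back to the default receiver.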
  37. Routing alerts. Alertmanager YAML

            match_re:
              squad: platform
              severity: critical
            continue: true

        receivers:
        - name: platform_slack
          slack_configs:
          - channel: '#platform-alerts'
            send_resolved: true
            text: '{{ template "slack.service.text" . }}'
            color: '{{ template "nu.slack.color" . }}'
            title: '{{ template "slack.service.title" . }}'
            actions:
            - type: 'button'
              text: ':prometheus: See on Prometheus'
              url: '{{ template "nu.prometheus.url" . }}'
              style: 'primary'
            - type: 'button'
              text: ':ledger: Open Playbook'
              url: '{{ template "nu.playbook.url" .}}'
            - type: 'button'
              text: ':mute: Silence this Alert'
  39. Routing alerts. Alertmanager YAML

          slack_configs:
          - channel: '#platform-alerts'
            send_resolved: true
            text: '{{ template "slack.service.text" . }}'
            color: '{{ template "nu.slack.color" . }}'
            title: '{{ template "slack.service.title" . }}'
            actions:
            - type: 'button'
              text: ':prometheus: See on Prometheus'
              url: '{{ template "nu.prometheus.url" . }}'
              style: 'primary'
            - type: 'button'
              text: ':ledger: Open Playbook'
              url: '{{ template "nu.playbook.url" .}}'
            - type: 'button'
              text: ':mute: Silence this Alert'
              url: '{{ template "nu.silence.url" .}}'
              style: 'danger'
        - name: platform_opsgenie
          opsgenie_configs:
          - api_key: {{PLATFORM_OPSGENIE_API_KEY}}
            message: '{{ template "opsgenie.service.message" . }}'
            priority: '{{ template "opsgenie.alert_priority" . }}'
  41. Kubernetes maintains the desired state of your applications. Prometheus monitors your infrastructure without adding dependencies to it.
  42. Kubernetes maintains the desired state of your applications. Prometheus monitors your infrastructure without adding dependencies to it. Automatically routed alerts notify you when problems occur.