Monitoring your Kubernetes cluster with Prometheus [TDC POA 2018]

Alexandre Cisneiros Software Engineer

The largest digital bank in the world outside Asia.

We fight complexity to empower people!

We move fast, grow, and change often.

A credit card with a 100% digital experience, no fees, and no branches.

A rewards program completely different from anything on the domestic market.
100% digital, simple, and with points that never expire.

Our take on a bank account: a simple, smart way to save and manage
your money, with daily interest.

Kubernetes

Kubernetes, from the Greek for "helmsman". An open source project that provisions
containers on nodes based on a desired state. Based on Google's environment.

Kubernetes structure
[Diagram: a Master (etcd, API, Scheduler) and a Node (kubelet, Pod 1, Pod 2, Pod 3). Later builds show several Masters and many Nodes, each Node running a kubelet and its Pods, with the Scheduler highlighted.]

Prometheus

Prometheus is an open source system for collecting metrics,
evaluating queries, and firing alerts. It originated at SoundCloud.

Metrics collection
Prometheus sends an HTTP GET /metrics request to the application. The application responds with its current metrics in plain text:

    # TYPE requests_total counter
    requests_total 1040
    # TYPE request_latency gauge
    request_latency{path="a"} 50
    request_latency{path="b"} 23

Prometheus then stores the samples.
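To make the application side of this exchange concrete, here is a minimal sketch of how the body of a /metrics response could be rendered. The helper `render_metrics` and the `METRICS` structure are hypothetical names used only for illustration, and the output covers only a simplified subset of the text format (no HELP lines, no escaping, no timestamps). In practice an official Prometheus client library would normally do this for you.

```python
def render_metrics(metrics):
    """Render metrics in a simplified Prometheus text format.

    `metrics` maps a metric name to (type, samples), where samples is a
    list of (labels_dict, value) pairs. Hypothetical helper, not a
    client-library API.
    """
    lines = []
    for name, (metric_type, samples) in metrics.items():
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            if labels:
                rendered = ",".join(
                    f'{key}="{val}"' for key, val in sorted(labels.items())
                )
                lines.append(f"{name}{{{rendered}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# The same values shown on the slide.
METRICS = {
    "requests_total": ("counter", [({}, 1040)]),
    "request_latency": ("gauge", [({"path": "a"}, 50), ({"path": "b"}, 23)]),
}

print(render_metrics(METRICS), end="")
```

Serving this string as the response to GET /metrics is all Prometheus needs from the application; the scraping, storage, and querying happen on the Prometheus side.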

In real life…
Service discovery tells Prometheus which applications to scrape:
• DNS
• AWS
• K8S
Prometheus then sends HTTP GET /metrics to each discovered application.

Metric types: Counter
A number that only ever goes up (or stays the same). Always.
• Number of requests
• Number of credit card transactions
• Number of exceptions thrown

Metric types: Gauge
A number that can go up or down.
• Outgoing data volume (MB/s)
• Number of users online
• Number of unprocessed dead letters
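The difference between the two types can be sketched in a few lines. The `Counter` and `Gauge` classes below are illustrative stand-ins written for this example, not the API of any Prometheus client library: the counter rejects negative increments, while the gauge can be set, raised, or lowered freely.

```python
class Counter:
    """A value that can only increase (or stay the same)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """A value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


requests_total = Counter()
requests_total.inc()       # one request
requests_total.inc(3)      # three more requests

users_online = Gauge()
users_online.set(10)       # ten users connected
users_online.dec(4)        # four of them left

print(requests_total.value, users_online.value)
```

Because a counter never decreases, a drop in its value can only mean the process restarted, which is what lets query functions such as rate() detect and compensate for resets.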

Metric types: Summary
Computes predefined percentiles of a metric.
• Latency, 50th percentile (median)
• Latency, 95th percentile
• Latency, 99th percentile
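As a rough illustration of what "computing a percentile" means, the sketch below uses the nearest-rank method over a complete list of observations. This is a simplification chosen for clarity: real summary implementations typically estimate quantiles over a sliding time window with streaming algorithms rather than keeping every value. The function name and the latency values are made up for the example.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the
    observations at or below it. Illustrative only."""
    if not values:
        raise ValueError("no observations")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 20, 22, 30, 45, 80, 120, 300, 950]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

Note how a single slow request dominates the high percentiles in a small sample, which is why p95 and p99 are useful for spotting tail latency that the median hides.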

Metric types: Histogram
Counts occurrences of a metric in predefined intervals.
• Latency between 0 and 100 ms
• Latency between 101 and 500 ms
• Latency between 501 ms and ∞
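A histogram can be sketched as a list of counters, one per interval. The class below is a hypothetical illustration using the slide's boundaries (100 ms and 500 ms). One detail worth knowing: Prometheus exposes histogram buckets cumulatively, each one counting every observation less than or equal to its upper bound, so the sketch also shows that view.

```python
import bisect


class Histogram:
    """Counts observations per bucket. Illustrative, not a library API."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot for the implicit +Inf bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # Index of the first bound that is >= value (or the +Inf slot).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.total += value

    def cumulative(self):
        """Return (upper_bound, count of observations <= upper_bound)."""
        running = 0
        result = []
        bounds = self.upper_bounds + [float("inf")]
        for bound, count in zip(bounds, self.counts):
            running += count
            result.append((bound, running))
        return result


latency_ms = Histogram([100, 500])
for observed in (40, 100, 250, 501, 2000):  # made-up latencies
    latency_ms.observe(observed)

print(latency_ms.counts)
print(latency_ms.cumulative())
```

Unlike a summary, the buckets are plain counters, so they can be summed across instances before percentiles are estimated on the Prometheus side.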

Metric format

    services_http_errors_total{service="billing", shard="s0"} 100.0
    services_http_requests_total{service="billing", shard="s1"} 13370.0

Each line has an identifier (the metric name), dimensions (labels), and a value.
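The three parts of a sample line can be pulled apart with a small parser. This is a simplified sketch under stated assumptions: `parse_sample` is a hypothetical helper, it ignores optional timestamps and comment lines, and it does not unescape characters inside label values.

```python
import re

# metric_name{label="value", ...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'services_http_errors_total{service="billing", shard="s0"} 100.0'
)
print(name)    # identifier
print(labels)  # dimensions
print(value)   # value
```

Each distinct combination of metric name and label values is its own time series, which is why labels are described as dimensions.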

Queries

    services_http_requests_total

Timeseries  Value
services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="2XX"}  19202
services_http_requests_total{service="cca", env="prod", path="/api/account/:id", status="2XX"}  92838
services_http_requests_total{service="billing", env="staging", path="/api/bill/:id", status="3XX"}  1020
services_http_requests_total{service="auth", env="prod", path="/api/user/:id", status="4XX"}  2938
services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="5XX"}  127

Queries

    services_http_requests_total{service="billing", env="prod"}

Timeseries  Value
services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="2XX"}  19202
services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="3XX"}  2732
services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="4XX"}  2023
services_http_requests_total{service="billing", env="prod", path="/api/bill/:id", status="5XX"}  127
services_http_requests_total{service="billing", env="prod", path="/api/close/:id", status="2XX"}  49283
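A label selector is essentially a filter over the set of series. The sketch below models each series as a (labels, value) pair and keeps only those whose labels equal every matcher. It handles equality matchers only; PromQL also supports negative and regular-expression matchers, which are left out here. The `select` function is a hypothetical helper, and the data is the unfiltered result from the first query slide.

```python
SERIES = [
    ({"service": "billing", "env": "prod",
      "path": "/api/bill/:id", "status": "2XX"}, 19202),
    ({"service": "cca", "env": "prod",
      "path": "/api/account/:id", "status": "2XX"}, 92838),
    ({"service": "billing", "env": "staging",
      "path": "/api/bill/:id", "status": "3XX"}, 1020),
    ({"service": "auth", "env": "prod",
      "path": "/api/user/:id", "status": "4XX"}, 2938),
    ({"service": "billing", "env": "prod",
      "path": "/api/bill/:id", "status": "5XX"}, 127),
]


def select(series, **matchers):
    """Keep series whose labels equal every given matcher."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(name) == wanted for name, wanted in matchers.items())
    ]


for labels, value in select(SERIES, service="billing", env="prod"):
    print(labels["path"], labels["status"], value)
```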

Queries

    sum(services_http_requests_total{service="billing", env="prod"})

Timeseries  Value
{}  928399

Queries

    sum(services_http_requests_total{service="billing", env="prod"}) by (path)

Timeseries  Value
{path="/api/bill/:id"}  109202
{path="/api/close/:id"}  12732
{path="/api/history/:id"}  90223
{path="/api/revert/:id"}  13270
{path="/api/reprocess/:id"}  50233
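Aggregation with `by` can be read as "group the series by these labels, then sum each group". The sketch below implements that idea with a hypothetical `sum_by` helper. The input is the five billing/prod series from the selector example, so the totals are computed from those values and are not meant to reproduce the illustrative numbers in the result tables.

```python
from collections import defaultdict

BILLING_PROD = [
    ({"path": "/api/bill/:id", "status": "2XX"}, 19202),
    ({"path": "/api/bill/:id", "status": "3XX"}, 2732),
    ({"path": "/api/bill/:id", "status": "4XX"}, 2023),
    ({"path": "/api/bill/:id", "status": "5XX"}, 127),
    ({"path": "/api/close/:id", "status": "2XX"}, 49283),
]


def sum_by(series, *by):
    """Sum values grouped by the given label names.

    With no label names, everything collapses into a single group,
    like a plain sum(...).
    """
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


print(sum_by(BILLING_PROD))           # one series with no labels
print(sum_by(BILLING_PROD, "path"))   # one series per path
```

Only the labels named in `by` survive the aggregation, which is why the result rows carry just `path`.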

Queries

    sum(rate(services_http_requests_total{service="billing", env="prod"}[5m])) by (path)
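Since a counter only grows, its raw value is rarely interesting; what matters is how fast it grows. `rate(...[5m])` gives the average per-second increase over the last five minutes. The sketch below is a simplified version of that idea: it adds up the increases between consecutive samples, treats a drop as a counter reset, and divides by the elapsed time. The real function also extrapolates to the window boundaries, which is omitted here. `simple_rate` and the sample values are assumptions made for this example.

```python
def simple_rate(samples):
    """Average per-second increase of a counter.

    `samples` is a list of (timestamp_seconds, counter_value) pairs,
    oldest first. Returns None when a rate cannot be computed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter restarted from zero, so the whole new value
            # counts as increase.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrapes, one per minute over five minutes.
steady = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 340), (300, 400)]
print(simple_rate(steady))

# The same idea with a process restart between the 2nd and 3rd scrape.
with_reset = [(0, 100), (60, 130), (120, 10), (180, 40)]
print(simple_rate(with_reset))
```

Taking the rate before summing, as the query does, matters: each series handles its own resets first, and only then are the per-path rates added together.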

Monitoring Kubernetes and the things running on it

What to monitor?
• Instances (VMs/servers)
• Kubernetes components
• Application containers

Monitoring instances: Node Exporter
• CPU
• Memory
• Disk
• Network
Also: ARP, Bcache, Bonding, Boot time, Conntrack, Error detection, Entropy, Exec stats, File descriptors, Hwmon, IPVS, Load avg, Mdadm, Net class, NFS, NFSD, Sockets, Time, Uname, XFS, ZFS.
[Diagram: Node Exporter deployed as a DaemonSet, with one Node Exporter running on every Master and every Node.]

Monitoring Kubernetes components: Kube State Metrics
• Pods
• Deployments
• Services
Also: ReplicaSets, ReplicationControllers, DaemonSets, Jobs, Nodes, AutoScaler, PersistentVolumes, Namespaces, Secrets, ConfigMaps, ResourceQuotas.
[Diagram: Kube State Metrics (KSM) deployed as a Deployment, running on some of the Nodes alongside the Node Exporter that runs on every instance.]

Monitoring application containers: Kubelet + cAdvisor
• CPU
• Memory
• Disk
• Network
Already running on the instances!
[Diagram: Masters and Nodes, each with a Node Exporter, and KSM on some Nodes.]

Monitoring application containers: Prometheus exporter
Requests? Latency? Errors? You decide!

Configuring the monitoring
Service Discovery: YAML

scrape_configs:
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    # (the slide cuts off here)

  # (the next slide resumes in the middle of another job)
  - source_labels: [__meta_kubernetes_service_name]
    target_label: kubernetes_name
- job_name: 'kubernetes-ingresses'
  metrics_path: /probe
  params:
    module: [http_2xx]
  kubernetes_sd_configs:
  - role: ingress
  relabel_configs:
  - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
    regex: (.+);(.+);(.+)
    replacement: ${1}://${2}${3}
    target_label: __param_target
  - target_label: __address__
    replacement: blackbox-exporter.example.com:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_ingress_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_ingress_name]
    target_label: kubernetes_name
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
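Most of this configuration is relabeling: rewriting the labels that service discovery attaches to each target before it is scraped. The sketch below imitates two of the actions used in the YAML, `labelmap` and `replace`, to show what they do to a label set. It is an approximation for illustration only: the function names mirror the action names but are not a Prometheus API, Python writes capture groups as `\1` where Prometheus uses `${1}`, and the example hostnames and label values are made up.

```python
import re


def labelmap(labels, regex, replacement=r"\1"):
    """Copy every label whose NAME fully matches `regex` to a new label
    named by `replacement`."""
    pattern = re.compile(regex)
    result = dict(labels)
    for name, value in labels.items():
        match = pattern.fullmatch(name)
        if match:
            result[match.expand(replacement)] = value
    return result


def replace(labels, source_labels, target_label,
            regex="(.*)", replacement=r"\1", separator=";"):
    """Join the source label VALUES, match them against `regex`, and
    write the expanded replacement into `target_label`."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)
    result = dict(labels)
    if match:
        result[target_label] = match.expand(replacement)
    return result


pod = {
    "__meta_kubernetes_pod_label_app": "billing",
    "__meta_kubernetes_namespace": "default",
}
print(labelmap(pod, r"__meta_kubernetes_pod_label_(.+)"))

ingress = {
    "__meta_kubernetes_ingress_scheme": "https",
    "__address__": "example.com",
    "__meta_kubernetes_ingress_path": "/health",
}
probed = replace(
    ingress,
    ["__meta_kubernetes_ingress_scheme", "__address__",
     "__meta_kubernetes_ingress_path"],
    "__param_target",
    regex=r"(.+);(.+);(.+)",
    replacement=r"\1://\2\3",
)
print(probed["__param_target"])
```

This is how a Kubernetes label such as `app` ends up as a plain `app` label on every metric scraped from that pod, without the application knowing anything about it.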

Alerting automatically

What is an alert?
[Chart: http_error_ratio from 20:00 to 20:25 for service="billing", service="auth", and service="cca", with a threshold line at 5%.]

    avg(http_error_ratio{env="prod"}) by (service) > 5% for 10 minutes
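The "for 10 minutes" part is what separates an alert from a simple threshold check: the condition has to hold continuously before anything fires. The sketch below walks through a series of evaluations and reports the state at each one, using the state names Prometheus uses (inactive, pending, firing). The function and the error-ratio values are hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) for each evaluation.

    `samples` is a list of (timestamp_seconds, value) pairs in order.
    An alert is pending while the condition holds but has not yet
    held for `for_seconds`, and firing once it has.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            # Any evaluation below the threshold resets the clock.
            active_since = None
            state = "inactive"
        states.append((timestamp, state))
    return states


# Hypothetical error ratios, evaluated every 5 minutes.
ratios = [(0, 0.02), (300, 0.06), (600, 0.07),
          (900, 0.08), (1200, 0.03), (1500, 0.09)]

for timestamp, state in alert_states(ratios, threshold=0.05, for_seconds=600):
    print(f"t+{timestamp // 60:>2} min: {state}")
```

A short spike that recovers before the duration elapses never leaves the pending state, which keeps brief blips from paging anyone.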

Defining alerts
Rules YAML

groups:
- name: kubernetes_alerts
  rules:
  - alert: k8s_container_is_frequently_restarting
    expr: round(increase(kube_pod_container_status_restarts_total[30m])) > 5
    for: 10m
    labels:
      severity: warning
      service: kubernetes
      squad: platform
    annotations:
      description: Pod {{$labels.namespace}}/{{$labels.pod}} was restarted {{$value}} times within the last hour
  - alert: k8s_instance_disk_will_fill_in_few_hours
    expr: predict_linear(node_filesystem_free{job="kubernetes-node-exporter", mountpoint="/"}[1h], 3600 * 10) < 0 and on(instance, job) (time() - node_boot_time > 3 * 3600)
    for: 30m
    labels:
      severity: critical
      service: kubernetes
      squad: platform
    annotations:
      description: Instance disk {{$labels.instance}} will fill in about 10 hours
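The disk alert relies on `predict_linear`, which fits a straight line through the samples in the window using simple linear regression and extends it into the future. The sketch below shows that calculation with ordinary least squares. It is a simplified stand-in: it predicts relative to the last sample rather than the evaluation time, and the free-space figures are invented for the example.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a least-squares line fitted to (timestamp, value)
    pairs to `seconds_ahead` after the last sample."""
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Hypothetical free disk space in GB, sampled every 10 minutes for 1 hour.
free_gb = [(0, 100.0), (600, 90.0), (1200, 80.0), (1800, 70.0),
           (2400, 60.0), (3000, 50.0), (3600, 40.0)]

predicted = predict_linear(free_gb, 3600 * 10)
print(f"free space in 10 hours: {predicted:.0f} GB")
print("alert condition holds:", predicted < 0)
```

A negative prediction means the trend line crosses zero before the horizon, so the alert warns hours before the disk is actually full instead of after.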

Routing the alerts
Alertmanager YAML

route:
  receiver: default_receiver
  group_by: [alertname, squad, env]
  group_wait: 15s
  repeat_interval: 1h
  routes:
  - receiver: platform_slack
    repeat_interval: 30m
    match_re:
      squad: platform
      severity: critical|warning|info
    continue: true
  - receiver: platform_opsgenie
    match_re:
      squad: platform
      severity: critical
    continue: true

receivers:
- name: platform_slack
  slack_configs:
  - channel: '#platform-alerts'
    send_resolved: true
    text: '{{ template "slack.service.text" . }}'
    color: '{{ template "nu.slack.color" . }}'
    title: '{{ template "slack.service.title" . }}'
    actions:
    - type: 'button'
      text: ':prometheus: See on Prometheus'
      url: '{{ template "nu.prometheus.url" . }}'
      style: 'primary'
    - type: 'button'
      text: ':ledger: Open Playbook'
      url: '{{ template "nu.playbook.url" .}}'
    - type: 'button'
      text: ':mute: Silence this Alert'
      url: '{{ template "nu.silence.url" .}}'
      style: 'danger'
- name: platform_opsgenie
  opsgenie_configs:
  - api_key: {{PLATFORM_OPSGENIE_API_KEY}}
    message: '{{ template "opsgenie.service.message" . }}'
    priority: '{{ template "opsgenie.alert_priority" . }}'
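The routing tree decides which receivers get each alert based on its labels. The sketch below models only the matching logic of the two child routes shown in the YAML: routes are tried in order, label values must fully match the regular expression, `continue: true` lets later routes be tried as well, and the top-level receiver is used when nothing matches. Grouping, timing, and nested routes are left out, and `route_alert` is a hypothetical function written for this example.

```python
import re

DEFAULT_RECEIVER = "default_receiver"
ROUTES = [
    {
        "receiver": "platform_slack",
        "match_re": {"squad": "platform",
                     "severity": "critical|warning|info"},
        "continue": True,
    },
    {
        "receiver": "platform_opsgenie",
        "match_re": {"squad": "platform", "severity": "critical"},
        "continue": True,
    },
]


def route_alert(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receivers that should be notified for an alert."""
    receivers = []
    for route in routes:
        matches = all(
            re.fullmatch(pattern, labels.get(name, "")) is not None
            for name, pattern in route["match_re"].items()
        )
        if matches:
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # stop at the first match unless told to continue
    return receivers or [default]


print(route_alert({"squad": "platform", "severity": "critical"}))
print(route_alert({"squad": "platform", "severity": "warning"}))
print(route_alert({"squad": "payments", "severity": "critical"}))
```

With this setup every platform alert reaches Slack, while only critical ones also page through OpsGenie.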

Summing up…

Kubernetes keeps your applications in their desired state.
Prometheus monitors your infrastructure without adding dependencies to it.
Automatically routed alerts warn you when something goes wrong.

More ideas?

Declarative infrastructure: generate the alert and routing files automatically.

Automated resolution: alerts that trigger remediation actions.

High availability: multiple Prometheis* aggregated by Thanos.
* yes, apparently that is the plural

We're hiring! sou.nu/vagasnu

TDC POA 2018
Thank you, TDC! <3
Alexandre Cisneiros [email protected] @Cisneiros
nubank.engineering sou.nu/vagasnu
