Monitoring Kubernetes with Datadog


Kubernetes Meetup Tokyo #6


Seigo Uchida

June 20, 2016

Transcript

  1. Monitoring Kubernetes with Datadog, Kubernetes Meetup Tokyo, June 20, 2016

  2. @spesnova, Ops at Increments, Inc.

  3. Qiita: a service like a "Medium for developers" in Japan

  4. Slide mode has been released: you can create a simple slide deck using Markdown on Qiita
  5. Theme: how to monitor Kubernetes. "Monitoring Kubernetes" has two meanings:
     • Monitoring containers on Kubernetes
     • Monitoring the Kubernetes cluster itself
  6. Monitoring Theory

  7. Datadog Monitoring Theory

  8. Collecting the right data

  9. Work metrics: work metrics indicate the top-level health of your system
  10. Resource metrics: a resource metric measures how much of something is consumed by your system
  11. Events: events capture what happened, at a point in time, with optional additional info
  12. Alert on actionable work metrics

  13. Investigating recursively: a work metric can be considered a resource metric for another system
  14. Monitoring Containers

  15. Collecting Overview

  16. Collecting in servers / DC

  17. Collecting in servers / DC: scaling doesn't happen very often, so a static config works
  18. Collecting in VMs / Cloud

  19. Collecting in VMs / Cloud: scaling happens often, making a static config impossible to manage
  20. Collecting in VMs / Cloud: new concepts emerged

  21. Collecting in VMs / Cloud: new concepts emerged
     • Auto registration & de-registration
     • Role-based aggregation
  22. Collecting in containers

  23. Collecting in containers
     • Dedicated monitoring-agent container
     • Dynamically locating targets
  24. Pattern 1. Agent as a side-car container

  25. Pattern 1. Agent as a side-car container
     • Co-locate the agent and its target
     • Group them as a Pod
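  The side-car pattern above can be sketched as a Pod manifest. This is a minimal sketch, not the speaker's actual config: the app image name and the API key placeholder are assumptions.

```yaml
# sidecar-pod.yaml: one app container plus a monitoring-agent container in the same Pod
apiVersion: v1
kind: Pod
metadata:
  name: app-with-agent
spec:
  containers:
  - name: app                               # the monitored application (hypothetical image)
    image: your-repo/your-app:latest
  - name: dd-agent                          # the monitoring agent co-located as a side-car
    image: datadog/docker-dd-agent:kubernetes
    env:
    - name: API_KEY
      value: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```

  Because both containers share the Pod, the agent sees the same network namespace as the app, which is what makes this pattern simple (at the cost of one agent per target).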
  26. Pattern 2. Agent with service discovery

  27. Pattern 2. Agent with service discovery

  28. Pattern 2. Agent with service discovery
     • Locate one agent per host
     • Get container info from Kubernetes
  29. Side-cars vs. Service Discovery
     • Side-cars: pro: simple; con: poor efficiency
     • Service Discovery: pro: efficient; con: not simple
  30. Native Solution

  31. Native solution

  32. Native solution
     1. cAdvisor collects data on each host
     2. The kubelet fetches data from cAdvisor
     3. Heapster gathers and aggregates data from the kubelets
     4. Heapster pushes aggregated data to InfluxDB
     5. InfluxDB stores the data
     6. Grafana fetches data from InfluxDB and visualizes it
     7. kubedash fetches data from Heapster and visualizes it
  33. cAdvisor
     • The kubelet binary includes cAdvisor
     • Collects basic resource metrics by default
     • CPU, memory, disk I/O, network I/O
  34. Collecting custom metrics. NOTE: custom metrics support is in Alpha (June 20, 2016)
  35. Collecting custom metrics. NOTE: custom metrics support is in Alpha (June 20, 2016)
     • Add an endpoint that exposes custom metrics
     • User app: use the Prometheus client library
     • Third-party app: use a Prometheus exporter
     • Metrics format: Prometheus metrics
     • Configure cAdvisor to scrape those endpoints
  36. Prometheus: an OSS monitoring tool
     • Inspired by Google's Borgmon monitoring system
     • Natively supported by Kubernetes and cAdvisor
     • Kubernetes API: /metrics exposes Prometheus metrics
     • cAdvisor API: /metrics exposes Prometheus metrics
     • The second official project hosted by the CNCF
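  The two /metrics endpoints above can be inspected with curl. A sketch under assumptions typical of that era: the kubelet exposing cAdvisor on port 4194, and the apiserver reachable on its insecure local port 8080; the hosts and ports here are illustrative, not from the talk.

```shell
# Prometheus-format metrics exposed by cAdvisor (via the kubelet, assumed port 4194)
curl http://<node-ip>:4194/metrics

# Prometheus-format metrics exposed by the Kubernetes API server (assumed insecure port 8080)
curl http://<master-ip>:8080/metrics
```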
  37. Monitoring with Datadog

  38. Why Datadog? Picking some reasons relevant to Kubernetes monitoring:
     • Docker, Kubernetes, and etcd integrations
     • Long data retention
     • Events timeline
     • Query-based monitoring
  39. Long data retention: you should care about the "roll-up" policy, not only the retention period. "Pro and Enterprise data retention is for 13 months at full resolution (maximum is one point per second)"
  40. Events timeline
     • Events are much more helpful when investigating issues in Kubernetes
     • Many things are operated automatically in Kubernetes
     • Events are key to understanding what happened
  41. Query-based monitoring
     • Dynamic location: you can't track containers with a host-centric monitoring model
     • You will want to view pods from many angles:
       • replicaSet
       • namespace
       • labels
       • cluster-wide
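  As an illustration, a query-based view of running containers per pod in one namespace might look like the fragment below; the metric and tag names are assumed from the dd-agent Docker/Kubernetes integrations of that time, not taken from the slides.

```
sum:docker.containers.running{kube_namespace:default} by {pod_name}
```

  The point is that the query selects by tags (namespace, pod), not by hostname, so it keeps working as pods move between nodes.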
  42. How to set it up?

  43. dd-agent container
     • dd-agent uses the Service Discovery pattern
     • Deploy dd-agent on all nodes with a DaemonSet
  44. dd-agent container: how dd-agent collects basic resource metrics

      # dd-agent.ds.yaml
      apiVersion: extensions/v1beta1
      kind: DaemonSet
      metadata:
        name: dd-agent
      spec:
        …
        spec:
          containers:
          - image: datadog/docker-dd-agent:kubernetes
            imagePullPolicy: Always
            name: dd-agent
            ports:
            - containerPort: 8125
              name: dogstatsdport
            env:
            - name: API_KEY
              value: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  45. dd-agent container: how dd-agent collects basic resource metrics

      $ kubectl create -f ops.namespace.yaml
      $ kubectl create -f dd-agent.ds.yaml --namespace=ops
      $ kubectl get nodes --no-headers=true | wc -l
      3
      $ kubectl get ds --namespace=ops
      NAME       DESIRED   CURRENT   NODE-SELECTOR   AGE
      dd-agent   3         3         <none>          6d
  46. Collecting basic resource metrics: how dd-agent collects basic resource metrics

  47. Collecting basic resource metrics: how dd-agent collects basic resource metrics
     • Kubernetes is API-driven and everything is pluggable, so dd-agent replaces the pipeline behind cAdvisor
     • Fetches the pod list from the kubelet to get metadata
     • Fetches pod metrics from cAdvisor
  48. Collecting custom metrics How dd-agent collects custom metrics (dd-agent’s Service

    Discovery feature)
  49. Collecting custom metrics: how dd-agent collects custom metrics (dd-agent's Service Discovery feature)
     1. The user sets a "config template" for an image in the KV store
     2. dd-agent fetches the "config template" from the KV store
     3. dd-agent fetches the pod list from the kubelet
     4. dd-agent creates a monitoring config for the image
     5. dd-agent monitors containers that are created from that image and are on the same host as dd-agent
  50. Collecting custom metrics: how dd-agent collects custom metrics (dd-agent's Service Discovery feature)

      /datadog/
        check_configs/
          docker_image_0/
            - check_names: ["check_name_0"]
            - init_configs: [{init_config}]
            - instances: [{instance_config}]
          docker_image_1/
            - check_names: ["check_name_1"]
            - init_configs: [{init_config}]
            - instances: [{instance_config}]
          …
  51. Collecting custom metrics: how dd-agent collects custom metrics (dd-agent's Service Discovery feature)

      $ etcdctl mkdir /datadog/check_configs/nginx-example
      $ etcdctl set /datadog/check_configs/nginx-example/check_names '["nginx"]'
      $ etcdctl set /datadog/check_configs/nginx-example/init_configs '[{}]'
      $ etcdctl set /datadog/check_configs/nginx-example/instances '[{"nginx_status_url": "http://%%host%%:%%port%%/nginx_status/", "tags": "%%tags%%"}]'
  52. Collecting custom metrics: how dd-agent collects custom metrics (dd-agent's Service Discovery feature)

      # dd-agent.ds.yaml
      env:
      - name: SD_BACKEND
        value: "docker"
      - name: SD_CONFIG_BACKEND
        value: "etcd"
      - name: SD_BACKEND_HOST
        value: "<your-etcd-hostname>"
      - name: SD_BACKEND_PORT
        value: "<your-etcd-port>"
  53. Collecting custom metrics: how dd-agent collects custom metrics (dd-agent's Service Discovery feature)
     • Currently the full image name format "repo/user/image_name:tag" is not supported (this will be updated soon)
     • For managing config templates in the KV store, git2etcd or git2consul is useful
     • dd-agent watches the KV store, so a new config template is applied immediately
  54. Monitoring Cluster

  55. What are the work metrics for Kubernetes? How do you measure Kubernetes health and performance?
     • AFAIK there is no endpoint that exposes the health of the entire cluster
     • Kubernetes' responsibility:
       • Scheduling pods
       • Running services
     • At least if pods and services are healthy, I can say "Kubernetes is working", so monitor pods and services
     • But this approach makes investigation harder…
  56. What are the work metrics for Kubernetes? How do you measure Kubernetes health and performance?
     • Kubernetes is composed of many services
     • There is no "top-level" system like in a traditional web app; one service can be a resource for other services
     • Kubernetes' work metrics are a collection of each service's work metrics
  57. Kubernetes components • kubelet • etcd • pods in "kube-system"

  58. Monitoring kubelet: Datadog has some checks for the kubelet
     • Check that the kubelet is running (with a ping)
     • Check that Docker is running
     • Check the sync loop
  59. Monitoring etcd: Datadog has an etcd integration
     • etcd has a /stats endpoint exposing statistics; Datadog uses this endpoint. Useful for work metrics (throughput, success, error, latency)
     • etcd also has a /metrics endpoint exposing Prometheus metrics, including work and resource metrics (internal metrics)
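  For example, with etcd's v2 HTTP API you can read those statistics directly; the host and port below are placeholders, not values from the talk.

```shell
# work-metric style statistics (latency, request counts, leader state)
curl http://<your-etcd-host>:2379/v2/stats/self
curl http://<your-etcd-host>:2379/v2/stats/store

# Prometheus-format internal metrics
curl http://<your-etcd-host>:2379/metrics
```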
  60. Monitoring pods in "kube-system"
     • You can see how many pods are stopped, and their health:

       # query
       sum:docker.containers.stopped{kube_namespace:kube-system} by {kubernetescluster}

     • componentstatus API
  61. Monitoring apiserver: many services use the apiserver as a resource
     • The /healthz/ping endpoint is for health checks; you can use Datadog's http_check for it
     • The /metrics endpoint has Prometheus metrics about the API; currently dd-agent doesn't use it, but it will be useful for collecting work metrics
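  An http_check against the health endpoint could be configured roughly as below. This is a minimal sketch: the check name, URL, and timeout are assumptions, not the speaker's config.

```yaml
# http_check.yaml: a Datadog http_check instance for the apiserver health endpoint
init_config:

instances:
  - name: kube-apiserver-health          # hypothetical check name
    url: https://<your-apiserver-host>/healthz/ping
    timeout: 5                           # seconds before the check reports a failure
```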
  62. Monitoring nodes
     • The dd-agent container also monitors the host
     • You can filter nodes by the "kubernetes" tag
  63. Recap

  64. Recap
     • The Datadog monitoring theory is useful whatever you monitor
     • Side-cars or Service Discovery
     • Query-based monitoring
     • Monitor each component for cluster monitoring
  65. Questions

  66. Questions from me!
     • Labeling best practices: what kind of labels should I add?
     • How do you create and manage your k8s cluster?
     • Do you separate clusters by environment, like production and staging?
  67. Thanks