Monitoring Kubernetes with Datadog

Kubernetes Meetup Tokyo #6

Seigo Uchida

June 20, 2016


  1. Slide mode has been released

    You can create a simple slide by using markdown on Qiita
  2. Theme: How to monitor Kubernetes

    "Monitoring Kubernetes" has two meanings:
    • Monitoring containers on Kubernetes
    • Monitoring the Kubernetes cluster
  3. Collecting in VMs / Cloud

    New concepts came out:
    • Auto registration & de-registration
    • Role based aggregation
  4. Pattern 2. Agent with service discovery

    • Locate one agent per host
    • Get container info from Kubernetes
  5. Side-cars vs Service Discovery

    • Side-cars:
      • pros: simple
      • cons: bad efficiency
    • Service Discovery:
      • pros: efficiency
      • cons: not simple
  6. Native solution

    1. cAdvisor collects data on the host
    2. kubelet fetches data from cAdvisor
    3. Heapster gathers and aggregates data from kubelet
    4. Heapster pushes aggregated data to InfluxDB
    5. InfluxDB stores the data
    6. Grafana fetches data from InfluxDB and visualizes it
    7. kubedash fetches data from Heapster and visualizes it
  7. cAdvisor

    • The kubelet binary includes cAdvisor
    • Collects basic resource metrics by default
      • CPU, Mem, DiskIO, NetworkIO
  8. Collecting custom metrics

    • Add an endpoint which exposes custom metrics
      • User app: using a Prometheus client library
      • Third-party app: using a Prometheus exporter
      • Metrics format: Prometheus metrics
    • Configure cAdvisor to scrape those endpoints
    NOTE: Custom metrics support is in alpha (as of June 20, 2016)
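
As a minimal sketch of what such an endpoint returns, the following renders one gauge in the Prometheus text exposition format using only the standard library. The metric name `app_queue_depth` and its label are hypothetical, and label-value escaping is left out; a real application would normally use a Prometheus client library, as the slide says.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    NOTE: label values are not escaped here; this is an illustration only.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric, for illustration only.
body = render_gauge(
    "app_queue_depth",
    "Jobs waiting in the queue.",
    [({"queue": "mail"}, 3)],
)
print(body)
```

Serving `body` from an HTTP path such as `/metrics` is what makes the application scrapeable.
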
  9. Prometheus

    An OSS monitoring tool
    • Inspired by Google's Borgmon monitoring system
    • Kubernetes and cAdvisor natively support it
      • Kubernetes API: /metrics has Prometheus metrics
      • cAdvisor API: /metrics has Prometheus metrics
    • The second official component of the CNCF
  10. Why Datadog?

    Some reasons to pick it for Kubernetes monitoring:
    • Docker, Kubernetes and etcd integration
    • Long data retention
    • Events timeline
    • Query based monitoring
  11. Long data retention

    You should care about the "roll-up" policy, not only the retention period.
    "Pro and Enterprise data retention is for 13 months at full resolution (maximum is one point per second)"
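
To see why the roll-up policy matters as much as the retention period, here is a minimal sketch of an averaging roll-up. The 60-second interval and the sample data are assumptions for illustration, not any vendor's actual policy.

```python
from collections import defaultdict


def roll_up(points, interval):
    """Average (timestamp, value) points into buckets of `interval` seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % interval].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# One point per second for two minutes, with a single 1-second spike at t=30.
raw = [(t, 100.0 if t == 30 else 10.0) for t in range(120)]
rolled = roll_up(raw, 60)
print(rolled)  # the spike is flattened into the first bucket's average
```

The raw series peaks at 100.0, but after the roll-up the highest value is 11.5, so a short spike becomes almost invisible even though the data is still "retained".
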
  12. Events timeline

    Events are much more helpful when investigating issues in Kubernetes.
    Many things are operated automatically in Kubernetes, so events are key to understanding what happened.
  13. Query based monitoring

    • Dynamic location
    • You will want to view pods from many angles
      • replicaSet
      • namespace
      • labels
      • cluster wide
    You can’t track containers with a host-centric monitoring model
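
A minimal sketch of the idea, assuming metrics arrive as (tags, value) pairs: the query filters and groups by tags, and never needs to know which host a container landed on. The tag names and values are hypothetical.

```python
from collections import defaultdict


def query(points, scope, group_by):
    """Sum values of points whose tags match `scope`, grouped by the `group_by` tag."""
    totals = defaultdict(float)
    for tags, value in points:
        if all(tags.get(k) == v for k, v in scope.items()):
            totals[tags.get(group_by, "<untagged>")] += value
    return dict(totals)


# Hypothetical container CPU samples; hosts differ but the query never mentions them.
samples = [
    ({"host": "node-1", "namespace": "web", "app": "nginx"}, 0.5),
    ({"host": "node-2", "namespace": "web", "app": "nginx"}, 0.25),
    ({"host": "node-2", "namespace": "web", "app": "api"}, 1.0),
    ({"host": "node-3", "namespace": "ops", "app": "dd-agent"}, 0.125),
]
by_app = query(samples, {"namespace": "web"}, "app")
print(by_app)
```

If a pod is rescheduled to another node, only its `host` tag changes and the query result stays the same.
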
  14. dd-agent container

    How dd-agent collects basic resource metrics

        # dd-agent.ds.yaml
        apiVersion: extensions/v1beta1
        kind: DaemonSet
        metadata:
          name: dd-agent
        spec:
          …
            spec:
              containers:
              - image: datadog/docker-dd-agent:kubernetes
                imagePullPolicy: Always
                name: dd-agent
                ports:
                - containerPort: 8125
                  name: dogstatsdport
                env:
                - name: API_KEY
                  value: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  15. dd-agent container

    How dd-agent collects basic resource metrics

        $ kubectl create -f ops.namespace.yaml
        $ kubectl create -f dd-agent.ds.yaml --namespace=ops
        $ kubectl get nodes --no-headers=true | wc -l
        3
        $ kubectl get ds --namespace=ops
        NAME       DESIRED   CURRENT   NODE-SELECTOR   AGE
        dd-agent   3         3         <none>          6d
  16. Collecting basic resource metrics

    How dd-agent collects basic resource metrics
    • Kubernetes is an API promise; everything is pluggable
      What sits behind cAdvisor can be replaced
    • Fetch the pod list from kubelet to get metadata
    • Fetch pod metrics from cAdvisor
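
The two fetches can be pictured as a join: metadata from the pod list becomes tags on the per-container metrics. This is a minimal sketch under assumed data shapes; the field names (`container_ids`, `container_id`) and tag names are hypothetical and do not reflect the real kubelet or cAdvisor payloads.

```python
def tag_metrics(pods, container_metrics):
    """Attach pod metadata as tags to per-container metrics, joined on container ID."""
    tags_by_id = {}
    for pod in pods:
        tags = [f"kube_namespace:{pod['namespace']}", f"pod_name:{pod['name']}"]
        tags += [f"{k}:{v}" for k, v in sorted(pod.get("labels", {}).items())]
        for container_id in pod["container_ids"]:
            tags_by_id[container_id] = tags
    return [
        {
            "metric": m["metric"],
            "value": m["value"],
            # Containers unknown to the pod list simply get no tags.
            "tags": tags_by_id.get(m["container_id"], []),
        }
        for m in container_metrics
    ]


# Hypothetical inputs standing in for the kubelet pod list and cAdvisor stats.
pods = [{"namespace": "web", "name": "nginx-1", "labels": {"app": "nginx"}, "container_ids": ["abc"]}]
stats = [
    {"container_id": "abc", "metric": "cpu.usage", "value": 0.5},
    {"container_id": "zzz", "metric": "cpu.usage", "value": 0.1},
]
tagged = tag_metrics(pods, stats)
print(tagged[0]["tags"])
```
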
  17. Collecting custom metrics

    How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)
    1. The user stores a "config template" for an image in the KV store
    2. dd-agent fetches the "config template" from the KV store
    3. dd-agent fetches the pod list from kubelet
    4. dd-agent creates a monitoring config for the image
    5. dd-agent monitors containers
      • which are created from the image
      • which are on the same host as dd-agent
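
The steps above can be sketched as a lookup from each local container's image to its stored template. This is an illustration under assumed data shapes, not dd-agent's actual code; the `host`, `containers`, `name` and `image` fields are hypothetical.

```python
def build_check_configs(templates, pods, my_host):
    """Return one check config per local container whose image has a template.

    templates: {image_name: {"check_names": [...], "init_configs": [...], "instances": [...]}}
    pods: list of {"host": str, "containers": [{"name": str, "image": str}]}
    """
    configs = []
    for pod in pods:
        if pod["host"] != my_host:  # step 5: only containers on the agent's own host
            continue
        for container in pod["containers"]:
            template = templates.get(container["image"])
            if template is None:  # no template stored for this image
                continue
            configs.append({"container": container["name"], **template})
    return configs


# Hypothetical KV-store contents and pod list.
templates = {"nginx-example": {"check_names": ["nginx"], "init_configs": [{}], "instances": [{}]}}
pods = [
    {"host": "node-1", "containers": [{"name": "web", "image": "nginx-example"}]},
    {"host": "node-1", "containers": [{"name": "db", "image": "postgres"}]},
    {"host": "node-2", "containers": [{"name": "web2", "image": "nginx-example"}]},
]
configs = build_check_configs(templates, pods, "node-1")
print([c["container"] for c in configs])
```

Only `web` is monitored by the agent on `node-1`: `db` has no template, and `web2` belongs to the agent on `node-2`.
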
  18. Collecting custom metrics

    How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)

        /datadog/
          check_configs/
            docker_image_0/
              - check_names: ["check_name_0"]
              - init_configs: [{init_config}]
              - instances: [{instance_config}]
            docker_image_1/
              - check_names: ["check_name_1"]
              - init_configs: [{init_config}]
              - instances: [{instance_config}]
            …
          …
  19. Collecting custom metrics

    How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)

        $ etcdctl mkdir /datadog/check_configs/nginx-example
        $ etcdctl set /datadog/check_configs/nginx-example/check_names '["nginx"]'
        $ etcdctl set /datadog/check_configs/nginx-example/init_configs '[{}]'
        $ etcdctl set /datadog/check_configs/nginx-example/instances '[{"nginx_status_url": "http://%%host%%:%%port%%/nginx_status/", "tags": "%%tags%%"}]'
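
The `%%host%%`, `%%port%%` and `%%tags%%` placeholders are filled in per container. A minimal sketch of that substitution follows; the values are hypothetical, and treating `%%tags%%` as a plain string is an assumption, since the slides do not show how the agent resolves it.

```python
import json


def render_instance(template_json, variables):
    """Replace %%name%% placeholders in a JSON config template, then parse it."""
    rendered = template_json
    for name, value in variables.items():
        rendered = rendered.replace(f"%%{name}%%", str(value))
    return json.loads(rendered)


template = '[{"nginx_status_url": "http://%%host%%:%%port%%/nginx_status/", "tags": "%%tags%%"}]'
# Hypothetical values discovered for one running container.
instances = render_instance(template, {"host": "10.0.0.5", "port": 80, "tags": "app:nginx"})
print(instances[0]["nginx_status_url"])  # http://10.0.0.5:80/nginx_status/
```
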
  20. Collecting custom metrics

    How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)

        # dd-agent.ds.yaml
        env:
        - name: SD_BACKEND
          value: "docker"
        - name: SD_CONFIG_BACKEND
          value: "etcd"
        - name: SD_BACKEND_HOST
          value: "<your-etcd-hostname>"
        - name: SD_BACKEND_PORT
          value: "<your-etcd-port>"
  21. Collecting custom metrics

    How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)
    • Current restriction on the image name format (this will be updated soon)
      NG: "repo/user/image_name:tag"
    • Managing config templates in the KV store
      git2etcd or git2consul is useful
    • dd-agent watches the KV store
      A new config template will be applied immediately
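
Assuming the template key has to be the bare image name (one reading of the NG example above, not confirmed by the slides), a small helper like this hypothetical `bare_image_name` could derive the key from a full image reference.

```python
def bare_image_name(image):
    """Strip registry/user prefix and tag or digest from a Docker image reference."""
    last = image.rsplit("/", 1)[-1]  # drop "repo/user/" (and any registry:port)
    last = last.split("@", 1)[0]     # drop an "@sha256:..." digest
    return last.split(":", 1)[0]     # drop ":tag"


print(bare_image_name("repo/user/image_name:tag"))  # image_name
```
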
  22. What are work metrics for Kubernetes?

    How do you measure Kubernetes health and performance?
    • AFAIK there is no endpoint which reports the health of the entire cluster
    • Kubernetes' responsibilities:
      • Scheduling pods
      • Running services
    At least if pods and services are healthy, I can say "Kubernetes is working"
    Monitor pods and services
    But this approach makes investigation harder…
  23. What are work metrics for Kubernetes?

    How do you measure Kubernetes health and performance?
    [Figure: investigation diagram]
    • Kubernetes is composed of many services
    • There is no "top-level" system as in a traditional web app
      One service can be a resource for other services
    Kubernetes work metrics are a collection of each service’s work metrics
  24. Monitoring kubelet

    • Datadog has some checks for kubelet
      • check that kubelet is running (with ping)
      • check that docker is running
      • check syncloop
  25. Monitoring etcd

    • Datadog has an etcd integration
    • etcd has a /stats endpoint which has statistics
      Datadog uses this endpoint
      Useful for work metrics (throughput, success, error, latency)
    • etcd has a /metrics endpoint which has Prometheus metrics
      Includes work and resource metrics (internal metrics)
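
Work metrics such as error rate are usually derived from counters by comparing two snapshots. The sketch below assumes counters named `getsSuccess` and `getsFail`, as in the etcd v2 store statistics; treat the field names and the sample numbers as assumptions rather than a description of what the Datadog integration does.

```python
def error_rate(prev, curr, op):
    """Fraction of failed `op` requests between two counter snapshots."""
    ok = curr[f"{op}Success"] - prev[f"{op}Success"]
    failed = curr[f"{op}Fail"] - prev[f"{op}Fail"]
    total = ok + failed
    return failed / total if total else 0.0


# Hypothetical snapshots taken one collection interval apart.
before = {"getsSuccess": 75, "getsFail": 4}
after = {"getsSuccess": 165, "getsFail": 14}
rate = error_rate(before, after, "gets")
print(rate)  # 0.1
```
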
  26. Monitoring pods in "kube-system"

        # query
        sum:docker.containers.stopped{kube_namespace:kube-system} by {kubernetescluster}

    You can see how many pods are stopped, and their health
    • componentstatus API
  27. Monitoring apiserver

    Many services use apiserver as a resource
    • /healthz/ping endpoint for health checks
      You can use Datadog’s http_check for it
    • /metrics endpoint has Prometheus metrics about the API
      Currently dd-agent doesn’t use it
      It will be useful for collecting work metrics
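
What an HTTP health check boils down to can be sketched with the standard library. This is not Datadog's http_check; it assumes the endpoint answers HTTP 200 with the body `ok`, and the `opener` parameter is a hypothetical hook so the function can be exercised without a live cluster.

```python
import urllib.request


def check_health(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers HTTP 200 with body "ok" within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"ok"
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return False
```

Calling `check_health("http://<apiserver-host>:<port>/healthz/ping")` with your own host and port returns `False` for connection errors, timeouts, and non-200 responses alike.
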
  28. Recap

    • Datadog's monitoring theory is useful whatever you monitor
    • Side-cars or Service Discovery
    • Query based monitoring
    • Monitor each component for cluster monitoring
  29. Questions

    Questions from me!
    • Labeling best practices
      • What kind of labels should I add?
    • How do you create and manage k8s clusters?
      • Do you separate clusters by environment, like production and staging?