
Monitoring Kubernetes with Datadog


Kubernetes Meetup Tokyo #6

Seigo Uchida

June 20, 2016

  1. Monitoring Kubernetes with Datadog
    Kubernetes Meetup Tokyo #2, June 20, 2016


  2. @spesnova
    Ops at Increments, Inc.


  3. Qiita
    A service like a "Medium for developers in Japan"


  4. Slide mode has been released
    You can create simple slides using Markdown on Qiita


  5. Theme: How to monitor Kubernetes
    "Monitoring Kubernetes" has two meanings:
    • Monitoring containers on Kubernetes
    • Monitoring the Kubernetes cluster itself


  6. Monitoring Theory


  7. Datadog Monitoring Theory


  8. Collecting the right data


  9. Work metrics
    Work metrics indicate the top-level health of your system


  10. Resource metrics
    A Resource metric measures how much of something is consumed by your system


  11. Events
    Events capture what happened, at a point in time, with optional additional info


  12. Alert on actionable work metrics


  13. Investigating recursively
    One system's work metric can be a resource metric for another system


  14. Monitoring Containers


  15. Collecting Overview


  16. Collecting in servers / DC


  17. Collecting in servers / DC
    Scaling doesn't happen very often,
    so a static config works


  18. Collecting in VMs / Cloud


  19. Collecting in VMs / Cloud
    Scaling happens often;
    a static config is impossible to manage


  20. Collecting in VMs / Cloud
    New concepts came out


  21. Collecting in VMs / Cloud
    New concepts came out:
    • Auto registration & de-registration
    • Role-based aggregation


  22. Collecting in containers


  23. Collecting in containers
    • Dedicated monitoring agent container
    • Dynamic locating


  24. Pattern 1. Agent as a sidecar container


  25. Pattern 1. Agent as a sidecar container
    • Co-locate the agent with its target
    • Group them as a Pod


  26. Pattern 2. Agent with service discovery


  27. Pattern 2. Agent with service discovery


  28. Pattern 2. Agent with service discovery
    • Locate one agent per host
    • Get container info from Kubernetes


  29. Sidecars vs Service Discovery
    • Sidecars:
    • pros: simple
    • cons: inefficient (one agent per target)
    • Service Discovery:
    • pros: efficient (one agent per host)
    • cons: more complex


  30. Native Solution


  31. Native solution


  32. Native solution
    1. cAdvisor collects data on each host
    2. The kubelet fetches data from cAdvisor
    3. Heapster gathers and aggregates data from the kubelets
    4. Heapster pushes the aggregated data to InfluxDB
    5. InfluxDB stores the data
    6. Grafana fetches data from InfluxDB and visualizes it
    7. kubedash fetches data from Heapster and visualizes it


  33. cAdvisor
    • The kubelet binary includes cAdvisor
    • Collects basic resource metrics by default
    • CPU, memory, disk I/O, network I/O


  34. Collecting custom metrics
    NOTE: Custom metrics support is in Alpha (June 20, 2016)


  35. Collecting custom metrics
    • Add an endpoint that exposes custom metrics
    • User app: use a Prometheus client library
    • Third-party app: use a Prometheus exporter
    • Metrics format: Prometheus metrics
    • Configure cAdvisor to scrape those endpoints
    NOTE: Custom metrics support is in Alpha (June 20, 2016)
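The Prometheus text exposition format that such an endpoint serves is simple enough to sketch by hand. A minimal, hypothetical renderer (in practice you would use a Prometheus client library; the metric name and labels below are made up):

```python
def render_metric(name, labels, value, metric_type="counter"):
    """Render one metric in the Prometheus text exposition format."""
    # Labels are sorted so the output is deterministic
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (f"# TYPE {name} {metric_type}\n"
            f"{name}{{{label_str}}} {value}\n")

# Hypothetical custom metric for a user app
line = render_metric("myapp_requests_total", {"path": "/", "code": "200"}, 1027)
print(line)
# # TYPE myapp_requests_total counter
# myapp_requests_total{code="200",path="/"} 1027
```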


  36. Prometheus
    An OSS monitoring tool
    • Inspired by Google's Borgmon monitoring system
    • Kubernetes and cAdvisor support it natively
    • Kubernetes API: /metrics exposes Prometheus metrics
    • cAdvisor API: /metrics exposes Prometheus metrics
    • The second official project hosted by the CNCF


  37. Monitoring with Datadog


  38. Why Datadog?
    Some reasons relevant to Kubernetes monitoring:
    • Docker, Kubernetes, and etcd integrations
    • Long data retention
    • Events timeline
    • Query-based monitoring


  39. Long data retention
    You should care about the "roll-up" policy, not only the retention period
    "Pro and Enterprise data retention is for 13 months at
    full resolution (maximum is one point per second)"
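To see why the roll-up policy matters: at one point per second, 13 months at full resolution is tens of millions of points per time series. A rough back-of-the-envelope estimate (assuming 30-day months):

```python
SECONDS_PER_DAY = 24 * 60 * 60

# 13 months at one point per second, assuming 30-day months
points = 13 * 30 * SECONDS_PER_DAY
print(points)  # 33696000 — about 34 million points per time series
```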


  40. Events timeline
    Many things are operated automatically in Kubernetes,
    so events are key to understanding what happened
    Events make investigating issues in Kubernetes much easier


  41. Query-based monitoring
    You can't track containers with a host-centric monitoring model
    • Dynamic location
    • You will want to view pods from many angles:
    • ReplicaSet
    • namespace
    • labels
    • cluster-wide


  42. How to setup?


  43. dd-agent container
    • dd-agent uses the Service Discovery pattern
    • Deploy dd-agent on all nodes with a DaemonSet


  44. dd-agent container
    How dd-agent collects basic resource metrics
    # dd-agent.ds.yaml
    apiVersion: extensions/v1beta1
    kind: DaemonSet
    metadata:
      name: dd-agent
    spec:
      template:
        metadata:
          # (pod template section restored for a valid manifest; label assumed)
          labels:
            app: dd-agent
        spec:
          containers:
          - image: datadog/docker-dd-agent:kubernetes
            imagePullPolicy: Always
            name: dd-agent
            ports:
            - containerPort: 8125
              name: dogstatsdport
            env:
            - name: API_KEY
              value: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"


  45. dd-agent container
    How dd-agent collects basic resource metrics
    $ kubectl create -f ops.namespace.yaml
    $ kubectl create -f dd-agent.ds.yaml --namespace=ops
    $ kubectl get nodes --no-headers=true | wc -l
    3
    $ kubectl get ds --namespace=ops
    NAME       DESIRED   CURRENT   NODE-SELECTOR   AGE
    dd-agent   3         3                         6d


  46. Collecting basic resource metrics
    How dd-agent collects basic resource metrics


  47. Collecting basic resource metrics
    How dd-agent collects basic resource metrics
    • Kubernetes is API-driven; everything is pluggable
    dd-agent replaces the pipeline behind cAdvisor
    • Fetches the Pod list from the kubelet to attach metadata
    • Fetches Pod metrics from cAdvisor


  48. Collecting custom metrics
    How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)


  49. Collecting custom metrics
    1. The user sets a "config template" for an image in a KV store
    2. dd-agent fetches the "config template" from the KV store
    3. dd-agent fetches the pod list from the kubelet
    4. dd-agent creates a monitoring config for the image
    5. dd-agent monitors the containers
    • which are created from that image
    • which are on the same host as dd-agent
    How dd-agent collects custom metrics (dd-agent's Service Discovery feature)
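Step 4 above amounts to substituting template variables like %%host%% and %%port%% with values discovered from the running container. A simplified sketch of that substitution (not dd-agent's actual implementation; the function name and container dict are illustrative):

```python
def fill_template(instance_template: dict, container: dict) -> dict:
    """Replace %%var%% template variables with values discovered from a container."""
    def substitute(value):
        if isinstance(value, str):
            for var, actual in container.items():
                value = value.replace(f"%%{var}%%", str(actual))
        return value
    return {k: substitute(v) for k, v in instance_template.items()}

# Template as stored in the KV store, values as discovered at runtime
template = {"nginx_status_url": "http://%%host%%:%%port%%/nginx_status/"}
container = {"host": "10.0.0.12", "port": 80}
print(fill_template(template, container))
# {'nginx_status_url': 'http://10.0.0.12:80/nginx_status/'}
```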


  50. Collecting custom metrics
    How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)
    /datadog/
    check_configs/
    docker_image_0/
    - check_names: ["check_name_0"]
    - init_configs: [{init_config}]
    - instances: [{instance_config}]
    docker_image_1/
    - check_names: ["check_name_1"]
    - init_configs: [{init_config}]
    - instances: [{instance_config}]



  51. Collecting custom metrics
    How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)
    $ etcdctl mkdir /datadog/check_configs/nginx-example
    $ etcdctl set /datadog/check_configs/nginx-example/check_names
    '["nginx"]'
    $ etcdctl set /datadog/check_configs/nginx-example/init_configs '[{}]'
    $ etcdctl set /datadog/check_configs/nginx-example/instances
    '[{"nginx_status_url": "http://%%host%%:%%port%%/nginx_status/",
    "tags": "%%tags%%"}]'


  52. Collecting custom metrics
    How dd-agent collects custom metrics (dd-agent's Service Discovery feature)
    # dd-agent.ds.yaml
    env:
    - name: SD_BACKEND
      value: "docker"
    - name: SD_CONFIG_BACKEND
      value: "etcd"
    - name: SD_BACKEND_HOST
      value: ""
    - name: SD_BACKEND_PORT
      value: ""


  53. Collecting custom metrics
    How dd-agent collects custom metrics (dd-agent's Service Discovery feature)
    • The image name format is currently restricted (this will be updated soon)
    NG: "repo/user/image_name:tag"
    • Manage config templates in the KV store
    git2etcd or git2consul is useful
    • dd-agent watches the KV store
    new config templates are applied immediately


  54. Monitoring Cluster


  55. What are work metrics for Kubernetes?
    How do you measure Kubernetes health and performance?
    • AFAIK there is no endpoint that reports entire cluster health
    • Kubernetes' responsibility:
    • Scheduling pods
    • Running services
    At least if pods and services are healthy,
    I can say "Kubernetes is working"
    Monitor pods and services
    But this approach makes investigation harder…


  56. What are work metrics for Kubernetes?
    How do you measure Kubernetes health and performance?
    • Kubernetes is composed of many services
    • There is no "top-level" system like in a traditional web app
    one service can be a resource for other services
    Kubernetes work metrics are the collection of each service's
    work metrics


  57. Kubernetes components
    • kubelet
    • etcd
    • pods in "kube-system"


  58. Monitoring kubelet
    • Datadog has some checks for the kubelet
    • check that the kubelet is running (with a ping)
    • check that Docker is running
    • check the syncloop


  59. Monitoring etcd
    • Datadog has an etcd integration
    • etcd has a /stats endpoint with statistics
    Datadog uses this endpoint
    Useful for work metrics (throughput, success, error, latency)
    • etcd has a /metrics endpoint with Prometheus metrics
    Includes work and resource metrics (internal metrics)


  60. Monitoring pods in "kube-system"
    # query
    sum:docker.containers.stopped{
      kube_namespace:kube-system
    } by {kubernetescluster}
    You can see how many pods are stopped, and how healthy each cluster is
    • componentstatus API
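The query above is tag-based aggregation: filter metric points by one tag, then sum grouped by another. A toy model of what such a query computes (not Datadog's engine; the sample points are invented):

```python
from collections import defaultdict

def sum_by(points, filters, group_key):
    """Sum metric points matching `filters`, grouped by the `group_key` tag."""
    totals = defaultdict(float)
    for p in points:
        if all(p["tags"].get(k) == v for k, v in filters.items()):
            totals[p["tags"].get(group_key)] += p["value"]
    return dict(totals)

# Invented sample: stopped-container counts with their tags
points = [
    {"value": 2, "tags": {"kube_namespace": "kube-system", "kubernetescluster": "prod"}},
    {"value": 1, "tags": {"kube_namespace": "kube-system", "kubernetescluster": "staging"}},
    {"value": 5, "tags": {"kube_namespace": "default", "kubernetescluster": "prod"}},
]
# Equivalent of: sum:docker.containers.stopped{kube_namespace:kube-system} by {kubernetescluster}
print(sum_by(points, {"kube_namespace": "kube-system"}, "kubernetescluster"))
# {'prod': 2.0, 'staging': 1.0}
```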


  61. Monitoring apiserver
    Many services use the apiserver as a resource
    • /healthz/ping endpoint for health checks
    You can use Datadog's http_check for it
    • the /metrics endpoint has Prometheus metrics about the API
    Currently dd-agent doesn't use it
    It would be useful for collecting work and resource metrics
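An http_check-style probe boils down to "GET the URL, expect a 200". A self-contained sketch against a local stand-in for a health endpoint (the handler and `/healthz` path here are a test double, not the real apiserver or Datadog check code):

```python
import http.server
import threading
import urllib.request

class HealthzHandler(http.server.BaseHTTPRequestHandler):
    """Local stand-in that answers 200 on /healthz, 404 elsewhere."""
    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()
    def log_message(self, *args):  # keep the demo quiet
        pass

def healthy(url, timeout=2.0):
    """Return True if a GET on `url` answers 200, as an http_check-style probe would."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError/HTTPError and connection failures
        return False

# Spin up the stand-in server on an ephemeral port and probe it
server = http.server.HTTPServer(("127.0.0.1", 0), HealthzHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
ok = healthy(f"http://127.0.0.1:{port}/healthz")
bad = healthy(f"http://127.0.0.1:{port}/nope")
server.shutdown()
print(ok, bad)  # True False
```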


  62. Monitoring nodes
    • dd-agent container also monitors host
    • You can filter nodes by "kubernetes" tag


  63. Recap


  64. Recap
    • Datadog's monitoring theory is useful whatever you monitor
    • Sidecars or Service Discovery
    • Query-based monitoring
    • Monitor each component for cluster monitoring


  65. Questions


  66. Questions
    Questions from me!
    • Labeling best practices
    • What kinds of labels should I add?
    • How do you create and manage your k8s clusters?
    • Do you separate clusters by environment, like
    production and staging?


  67. Thanks
