[Image: alerting chart] Resource metrics A resource metric measures how much of something is consumed by your system
[Image: alerting chart] Events Events capture what happened at a point in time, with optional additional info
[Image: investigating diagram] Investigating recursively A work metric can be considered a resource metric for another system
Native solution 1. cAdvisor collects data on each host 2. kubelet fetches data from cAdvisor 3. Heapster gathers and aggregates data from the kubelets 4. Heapster pushes the aggregated data to InfluxDB 5. InfluxDB stores the data 6. Grafana fetches data from InfluxDB and visualizes it 7. kubedash fetches data from Heapster and visualizes it
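To illustrate step 6, here is a minimal sketch of reading Heapster data back out of InfluxDB with the influxdb Python client. The database name ("k8s") and the measurement/tag names are assumptions based on Heapster's default InfluxDB sink; verify what your Heapster version actually writes.

```python
# Sketch: query Heapster-written metrics from InfluxDB (step 6 of the native pipeline).
# Assumptions: Heapster's default InfluxDB sink, database "k8s", and a
# "cpu/usage_rate" measurement tagged with pod_name -- verify in your cluster.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="influxdb.kube-system", port=8086,
                        username="root", password="root", database="k8s")

result = client.query(
    'SELECT MEAN(value) FROM "cpu/usage_rate" '
    'WHERE time > now() - 5m GROUP BY "pod_name"'
)

# Print one averaged CPU value per pod over the last five minutes.
for (measurement, tags), points in result.items():
    for point in points:
        print(tags.get("pod_name"), point["mean"])
```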
• Add an endpoint which exposes custom metrics • User app: use the Prometheus client library • Third-party app: use a Prometheus exporter • Metrics format: Prometheus metrics • Configure cAdvisor to scrape those endpoints Collecting custom metrics NOTE: Custom metrics support is in Alpha (June 20, 2016)
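As a minimal sketch of the "user app" case, the snippet below exposes one custom counter with the Prometheus Python client. The metric name, port, and the simulated work are made up for illustration.

```python
# Sketch: expose a custom metric in Prometheus format from a user app.
# The metric name, port 9100, and the simulated work are illustrative only.
import time
from prometheus_client import Counter, start_http_server

REQUESTS_HANDLED = Counter("myapp_requests_handled_total",
                           "Number of requests handled by the app")

if __name__ == "__main__":
    # Serve metrics at http://0.0.0.0:9100/metrics so cAdvisor (or Prometheus)
    # can be pointed at this endpoint.
    start_http_server(9100)
    while True:
        REQUESTS_HANDLED.inc()   # pretend we handled one request
        time.sleep(1)
```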
Prometheus • Inspired by Google's Borgmon monitoring system • Natively supported by Kubernetes and cAdvisor • Kubernetes API: /metrics exposes Prometheus metrics • cAdvisor API: /metrics exposes Prometheus metrics • The second project hosted by the CNCF An OSS monitoring tool
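Because both endpoints speak the Prometheus text format, they can be read with the Prometheus client's parser. A minimal sketch, assuming an apiserver reachable on the 2016-era insecure local port 8080 (adjust the URL and add auth/TLS for your cluster):

```python
# Sketch: fetch and parse Prometheus-format metrics from the Kubernetes API server.
# The URL assumes the (2016-era) insecure local port 8080; adjust for your setup.
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://127.0.0.1:8080/metrics").text

for family in text_string_to_metric_families(text):
    # Each family has a name, a type (counter/gauge/...), and a list of samples.
    print(family.name, family.type, len(family.samples))
```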
Why Datadog? Picking out some reasons that matter for Kubernetes monitoring • Docker, Kubernetes and etcd integrations • Long data retention • Events timeline • Query-based monitoring
Long data retention You should care about the "roll-up" policy, not only the retention period "Pro and Enterprise data retention is for 13 months at full resolution (maximum is one point per second)"
Events timeline Events are much more helpful when investigating issues in Kubernetes Many things are operated automatically in Kubernetes, so events are key to understanding what happened
Query-based monitoring • Dynamic location: pods move between hosts • You will want to view pods from many angles • replicaSet • namespace • labels • cluster-wide You can't track containers with a host-centric monitoring model
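As a sketch of what query-based monitoring looks like in practice, the datadog Python client can slice the same container metric by namespace, replicaSet, or label, regardless of which host a pod lands on. The metric and tag names below are assumptions based on Datadog's Kubernetes integration; adjust to what your agent actually reports.

```python
# Sketch: query one metric from several "angles" (namespace, replicaSet, label)
# instead of per host. Metric/tag names are assumptions; API/app keys required.
import time
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

now = int(time.time())
queries = [
    "avg:kubernetes.cpu.usage.total{*} by {kube_namespace}",
    "avg:kubernetes.cpu.usage.total{kube_namespace:production} by {kube_replica_set}",
    "avg:kubernetes.cpu.usage.total{app:frontend} by {pod_name}",
]

for q in queries:
    result = api.Metric.query(start=now - 3600, end=now, query=q)
    print(q, "->", len(result.get("series", [])), "series")
```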
• Kubernetes is API-driven and everything is pluggable, so the stack behind cAdvisor can be replaced • Fetch the pod list from the kubelet to get pod metadata • Fetch pod metrics from cAdvisor Collecting basic resource metrics How dd-agent collects basic resource metrics
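A rough sketch of those two fetches, assuming the 2016-era defaults of a read-only kubelet port (10255) and cAdvisor exposed on port 4194; this is not dd-agent's actual code, only an illustration of the two APIs it talks to.

```python
# Sketch: the two calls dd-agent-style collection relies on.
# Ports 10255 (kubelet read-only) and 4194 (cAdvisor) are 2016-era defaults;
# adjust them (and add auth/TLS) for your cluster.
import requests

NODE = "127.0.0.1"

# 1) Pod list from the kubelet: used to map containers to pod/namespace/labels.
pods = requests.get("http://%s:10255/pods" % NODE).json()
for pod in pods.get("items", []):
    meta = pod["metadata"]
    print("pod:", meta["namespace"], meta["name"], meta.get("labels"))

# 2) Resource metrics from cAdvisor: per-container CPU/memory/network stats.
containers = requests.get("http://%s:4194/api/v1.3/subcontainers" % NODE).json()
print("containers with stats:", len(containers))
```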
Collecting custom metrics 1. The user stores a "config template" for an image in a KV store 2. dd-agent fetches the "config template" from the KV store 3. dd-agent fetches the pod list from the kubelet 4. dd-agent creates a monitoring config for the image 5. dd-agent monitors the containers • which are created from that image • which are on the same host as dd-agent How dd-agent collects custom metrics (dd-agent's Service Discovery feature)
Collecting custom metrics How dd-agent collects custom metrics (dd-agent's Service Discovery feature) • Current limitation: the full image name format "repo/user/image_name:tag" is not supported (this will be updated soon) • Managing config templates in the KV store: git2etcd or git2consul is useful • dd-agent watches the KV store, so a new config template is applied immediately
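A minimal sketch of step 1, writing a config template into etcd with the python-etcd client. The /datadog/check_configs/... key layout and the example nginx check fields follow the Service Discovery docs of the time, but treat them as assumptions and verify against your agent version.

```python
# Sketch: store a Service Discovery "config template" for an image in etcd.
# Key layout (/datadog/check_configs/<image>/...) and the nginx check fields
# are assumptions based on contemporary docs; verify for your agent version.
import json
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

image = "nginx"  # short image name (see the limitation above)
base = "/datadog/check_configs/%s" % image

client.write(base + "/check_names", json.dumps(["nginx"]))
client.write(base + "/init_configs", json.dumps([{}]))
client.write(base + "/instances", json.dumps([
    # %%host%% is filled in by the agent with the container's IP at runtime.
    {"nginx_status_url": "http://%%host%%/nginx_status/"}
]))
```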
What are work metrics for Kubernetes? How do you measure Kubernetes health and performance? • AFAIK there is no endpoint which exposes the health of the entire cluster • Kubernetes' responsibility: • Scheduling pods • Running services At least if pods and services are healthy, I can say "Kubernetes is working" Monitor pods and services But this approach makes investigation harder…
What are work metrics for Kubernetes? How do you measure Kubernetes health and performance? [Image: investigating diagram] • Kubernetes is composed of many services • There is no "top-level" system like a traditional web app has; one service can be a resource for other services Kubernetes work metrics are a collection of each service's work metrics
Monitoring etcd • Datadog has an etcd integration • etcd has a /stats endpoint which exposes statistics; Datadog uses this endpoint Useful for work metrics (throughput, success, error, latency) • etcd has a /metrics endpoint with Prometheus metrics Includes work and resource metrics (internal metrics)
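A quick sketch of reading both endpoints directly, assuming the etcd v2 HTTP API on the default client port 2379 (paths differ for other etcd versions):

```python
# Sketch: read etcd's own statistics and Prometheus metrics directly.
# Assumes the etcd v2 HTTP API on the default client port 2379.
import requests

ETCD = "http://127.0.0.1:2379"

# Work-metric-style statistics (what Datadog's etcd integration reads).
self_stats = requests.get(ETCD + "/v2/stats/self").json()
store_stats = requests.get(ETCD + "/v2/stats/store").json()
print("leader:", self_stats.get("leaderInfo", {}).get("leader"))
print("watchers:", store_stats.get("watchers"))

# Prometheus-format metrics (internal work + resource metrics).
print(requests.get(ETCD + "/metrics").text[:500])
```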
Monitoring pods in "kube-system" # query sum:docker.containers.stopped{ kube_namespace:kube-system } by {kubernetescluster} You can see how many containers are stopped and how healthy kube-system is • The componentstatus API is another option
Monitoring apiserver Many services use the apiserver as a resource • /healthz/ping endpoint for health checks You can use Datadog's http_check for it • The /metrics endpoint has Prometheus metrics about the API Currently dd-agent doesn't use it; it would be useful for collecting work and resource metrics
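A minimal sketch of such a health check, hitting /healthz directly. The insecure local port 8080 is an assumption from 2016-era setups, and in Datadog this would normally be configured as an http_check instance rather than custom code.

```python
# Sketch: poll the apiserver health endpoint, the same thing Datadog's
# http_check would be configured to do. Port 8080 (insecure, local) is an
# assumption; secured clusters need the TLS port and credentials instead.
import requests

resp = requests.get("http://127.0.0.1:8080/healthz", timeout=5)
if resp.status_code == 200 and resp.text == "ok":
    print("apiserver healthy")
else:
    print("apiserver unhealthy: %s %s" % (resp.status_code, resp.text))
```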
Recap • Datadog's monitoring theory is useful whatever you monitor • Side-cars or Service Discovery • Query-based monitoring • Monitor each component for cluster monitoring
Questions Questions from me! • Labeling best practices • What kind of labels should I add? • How do you create and manage your k8s clusters? • Do you separate clusters by environment, like production and staging?