Monitoring Kubernetes with Datadog

Slide 1

Slide 1 text

Monitoring Kubernetes with Datadog
Kubernetes Meetup Tokyo #2, June 20, 2016

Slide 2

Slide 2 text

@spesnova
Ops at Increments, Inc.

Slide 3

Slide 3 text

Qiita
A service like a "Medium for developers in Japan"

Slide 4

Slide 4 text

Slide mode has been released
You can create simple slides using Markdown on Qiita

Slide 5

Slide 5 text

Theme: How to monitor Kubernetes
"Monitoring Kubernetes" has two meanings:
• Monitoring containers on Kubernetes
• Monitoring the Kubernetes cluster itself

Slide 6

Slide 6 text

Monitoring Theory

Slide 7

Slide 7 text

Datadog Monitoring Theory

Slide 8

Slide 8 text

Collecting the right data

Slide 9

Slide 9 text

Work metrics
Work metrics indicate the top-level health of your system

Slide 10

Slide 10 text

Resource metrics
A resource metric measures how much of something is consumed by your system

Slide 11

Slide 11 text

Events
Events capture what happened, at a point in time, with optional additional info

Slide 12

Slide 12 text

Alert on actionable work metrics

Slide 13

Slide 13 text

Investigating recursively
A work metric of one system can be considered a resource metric for another system

Slide 14

Slide 14 text

Monitoring Containers

Slide 15

Slide 15 text

Collecting Overview

Slide 16

Slide 16 text

Collecting in servers / DC

Slide 17

Slide 17 text

Collecting in servers / DC
Scaling doesn't happen very often
Static config works

Slide 18

Slide 18 text

Collecting in VMs / Cloud

Slide 19

Slide 19 text

Collecting in VMs / Cloud
Scaling happens often
Managing static config becomes impossible

Slide 20

Slide 20 text

Collecting in VMs / Cloud
New concepts came out

Slide 21

Slide 21 text

Collecting in VMs / Cloud
New concepts came out:
• Auto registration & de-registration
• Role based aggregation

Slide 22

Slide 22 text

Collecting in containers

Slide 23

Slide 23 text

Collecting in containers
• Dedicated monitoring agent container
• Dynamic locating

Slide 24

Slide 24 text

Pattern 1. Agent as a sidecar container

Slide 25

Slide 25 text

Pattern 1. Agent as a sidecar container
• Co-locate the agent and its target
• Group them as a Pod

Slide 26

Slide 26 text

Pattern 2. Agent with service discovery

Slide 27

Slide 27 text

Pattern 2. Agent with service discovery

Slide 28

Slide 28 text

Pattern 2. Agent with service discovery
• Place one agent per host
• Get container info from Kubernetes

Slide 29

Slide 29 text

Sidecars vs Service Discovery
• Sidecars:
  • pros: simple
  • cons: inefficient
• Service Discovery:
  • pros: efficient
  • cons: not simple

Slide 30

Slide 30 text

Native Solution

Slide 31

Slide 31 text

Native solution

Slide 32

Slide 32 text

Native solution
1. cAdvisor collects data on the host
2. kubelet fetches data from cAdvisor
3. Heapster gathers and aggregates data from kubelet
4. Heapster pushes aggregated data to InfluxDB
5. InfluxDB stores the data
6. Grafana fetches data from InfluxDB and visualizes it
7. kubedash fetches data from Heapster and visualizes it

Slide 33

Slide 33 text

cAdvisor
• The kubelet binary includes cAdvisor
• Collects basic resource metrics by default
  • CPU, Mem, DiskIO, NetworkIO

Slide 34

Slide 34 text

Collecting custom metrics
NOTE: Custom metrics support is in Alpha (as of June 20, 2016)

Slide 35

Slide 35 text

Collecting custom metrics
NOTE: Custom metrics support is in Alpha (as of June 20, 2016)
• Add an endpoint which exposes custom metrics
  • User app: use a Prometheus client library
  • Third-party app: use a Prometheus exporter
  • Metrics format: Prometheus metrics
• Point cAdvisor at those endpoints
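As a rough illustration of the first bullet, here is a minimal sketch of a user app exposing a custom metric in the Prometheus text format. It uses only the Python standard library so it stays self-contained; a real app would more likely use a Prometheus client library, as the slide suggests. The metric name `myapp_requests_total`, the port, and the helper names are assumptions made for this example, not part of the talk.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metrics: name -> (type, help text, current value).
METRICS = {
    "myapp_requests_total": ("counter", "Total requests handled.", 42),
}


def render_prometheus(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (mtype, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on GET /metrics and 404 elsewhere."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_prometheus(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint; the port is an arbitrary choice for this sketch."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; whatever scrapes it (cAdvisor in the setup described here) only needs the URL of `/metrics`.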

Slide 36

Slide 36 text

Prometheus
An OSS monitoring tool
• Inspired by Google's Borgmon monitoring system
• Natively supported by Kubernetes and cAdvisor
  • Kubernetes API: /metrics serves Prometheus metrics
  • cAdvisor API: /metrics serves Prometheus metrics
• The second official project hosted by the CNCF

Slide 37

Slide 37 text

Monitoring with Datadog

Slide 38

Slide 38 text

Why Datadog?
Some reasons to pick it for Kubernetes monitoring:
• Docker, Kubernetes, and etcd integrations
• Long data retention
• Events timeline
• Query based monitoring

Slide 39

Slide 39 text

Long data retention
You should care about the "roll-up" policy, not only the retention period
"Pro and Enterprise data retention is for 13 months at full resolution (maximum is one point per second)"
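To see why the roll-up policy matters as much as the retention period, here is a small sketch with made-up numbers. It is not how any particular product rolls up data; the sample values and the five-point window are assumptions chosen to show how an averaged roll-up can hide a short spike that a max roll-up would keep.

```python
def rollup(points, window, agg):
    """Collapse consecutive `window`-sized chunks of points using `agg`."""
    return [agg(points[i:i + window]) for i in range(0, len(points), window)]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical CPU usage (%) sampled once per second, with a one-second spike.
raw = [10, 10, 10, 95, 10, 10, 10, 10, 10, 10]

by_mean = rollup(raw, 5, mean)  # the spike is averaged away: [27.0, 10.0]
by_max = rollup(raw, 5, max)    # the spike survives: [95, 10]
```

Data kept for a long time but rolled up by averaging would show 27% here, so the 95% spike is no longer visible when investigating later.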

Slide 40

Slide 40 text

Events timeline
Events are much more helpful for investigating issues in Kubernetes
Many things are operated automatically in Kubernetes
Events are the key to understanding what happened

Slide 41

Slide 41 text

Query based monitoring
• Dynamic location
• You will want to view pods from many angles:
  • replicaSet
  • namespace
  • labels
  • cluster wide
You can’t track containers with a host-centric monitoring model
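The idea of viewing pods from many angles can be sketched as a filter plus a group-by over tagged samples. This is only an illustration of the query-based model, not Datadog's query engine; the sample data, the `kube_replica_set` tag name, and the `query_sum` helper are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical memory usage samples (MB), one per container, each with tags.
SAMPLES = [
    {"value": 100, "tags": {"kube_namespace": "web", "kube_replica_set": "front-1"}},
    {"value": 150, "tags": {"kube_namespace": "web", "kube_replica_set": "front-1"}},
    {"value": 200, "tags": {"kube_namespace": "web", "kube_replica_set": "api-1"}},
    {"value": 300, "tags": {"kube_namespace": "ops", "kube_replica_set": "batch-1"}},
]


def query_sum(samples, scope, group_by):
    """Sum sample values that match every tag in `scope`, grouped by `group_by` tags."""
    totals = defaultdict(int)
    for sample in samples:
        tags = sample["tags"]
        if all(tags.get(key) == want for key, want in scope.items()):
            group = tuple(tags.get(name) for name in group_by)
            totals[group] += sample["value"]
    return dict(totals)


# Same data, two angles: per replica set inside one namespace, then per namespace.
per_replica_set = query_sum(SAMPLES, {"kube_namespace": "web"}, ["kube_replica_set"])
per_namespace = query_sum(SAMPLES, {}, ["kube_namespace"])
```

Nothing in the query names a host, which is the point: containers can move between nodes without the query changing.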

Slide 42

Slide 42 text

How to set up?

Slide 43

Slide 43 text

dd-agent container
• dd-agent uses the Service Discovery pattern
• Deploy dd-agent on all nodes with a DaemonSet

Slide 44

Slide 44 text

dd-agent container
How dd-agent collects basic resource metrics

    # dd-agent.ds.yaml
    apiVersion: extensions/v1beta1
    kind: DaemonSet
    metadata:
      name: dd-agent
    spec:
      …
        spec:
          containers:
          - image: datadog/docker-dd-agent:kubernetes
            imagePullPolicy: Always
            name: dd-agent
            ports:
            - containerPort: 8125
              name: dogstatsdport
            env:
            - name: API_KEY
              value: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Slide 45

Slide 45 text

dd-agent container
How dd-agent collects basic resource metrics

    $ kubectl create -f ops.namespace.yaml
    $ kubectl create -f dd-agent.ds.yaml --namespace=ops
    $ kubectl get nodes --no-headers=true | wc -l
    3
    $ kubectl get ds --namespace=ops
    NAME       DESIRED   CURRENT   NODE-SELECTOR   AGE
    dd-agent   3         3                         6d

Slide 46

Slide 46 text

Collecting basic resource metrics
How dd-agent collects basic resource metrics

Slide 47

Slide 47 text

Collecting basic resource metrics
How dd-agent collects basic resource metrics
• Kubernetes is an API promise; everything is pluggable
  Replace whatever sits behind cAdvisor
• Fetch the Pods list from kubelet to obtain metadata
• Fetch Pod metrics from cAdvisor
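The two fetches above amount to a join: metadata from the Pods list is attached to the per-container metrics. The sketch below shows that join on simplified, made-up data. The dictionary shapes are assumptions for illustration and do not mirror the real kubelet or cAdvisor responses, and the helper names are hypothetical.

```python
# Hypothetical, simplified stand-ins for the kubelet Pods list and cAdvisor metrics.
PODS = [
    {
        "name": "web-1",
        "namespace": "default",
        "labels": {"app": "web"},
        "container_ids": ["abc"],
    },
]
CONTAINER_METRICS = {
    "abc": {"cpu_usage": 0.5},
    "zzz": {"cpu_usage": 0.1},  # a container that belongs to no known Pod
}


def index_pods_by_container(pods):
    """Map each container ID to the metadata of the Pod that owns it."""
    index = {}
    for pod in pods:
        meta = {"pod_name": pod["name"], "kube_namespace": pod["namespace"]}
        meta.update(pod.get("labels", {}))
        for container_id in pod["container_ids"]:
            index[container_id] = meta
    return index


def tag_metrics(container_metrics, pods):
    """Attach Pod metadata as tags to every per-container metric."""
    index = index_pods_by_container(pods)
    tagged = []
    for container_id, metrics in container_metrics.items():
        tags = index.get(container_id, {})
        for name, value in metrics.items():
            tagged.append({"metric": name, "value": value, "tags": tags})
    return tagged
```

Once metrics carry Pod name, namespace, and labels as tags, they can be queried by those dimensions instead of by host.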

Slide 48

Slide 48 text

Collecting custom metrics
How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)

Slide 49

Slide 49 text

Collecting custom metrics
How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)
1. User stores a "config template" for an image in the KV store
2. dd-agent fetches the "config template" from the KV store
3. dd-agent fetches the pods list from kubelet
4. dd-agent creates a monitoring config for the image
5. dd-agent monitors containers
   • which are created from the image
   • which are on the same host as dd-agent

Slide 50

Slide 50 text

Collecting custom metrics
How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)

    /datadog/
      check_configs/
        docker_image_0/
          - check_names: ["check_name_0"]
          - init_configs: [{init_config}]
          - instances: [{instance_config}]
        docker_image_1/
          - check_names: ["check_name_1"]
          - init_configs: [{init_config}]
          - instances: [{instance_config}]
        …
      …

Slide 51

Slide 51 text

Collecting custom metrics
How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)

    $ etcdctl mkdir /datadog/check_configs/nginx-example
    $ etcdctl set /datadog/check_configs/nginx-example/check_names '["nginx"]'
    $ etcdctl set /datadog/check_configs/nginx-example/init_configs '[{}]'
    $ etcdctl set /datadog/check_configs/nginx-example/instances '[{"nginx_status_url": "http://%%host%%:%%port%%/nginx_status/", "tags": "%%tags%%"}]'
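The `%%host%%`, `%%port%%`, and `%%tags%%` placeholders in the instances template are filled in per container. A minimal sketch of that substitution follows. It is not dd-agent's implementation: the `render_instances` helper, the example address, and the choice to render `%%tags%%` as a list are all assumptions made for illustration.

```python
import json

# The instances template from the etcdctl example above.
TEMPLATE = (
    '[{"nginx_status_url": "http://%%host%%:%%port%%/nginx_status/", '
    '"tags": "%%tags%%"}]'
)


def render_instances(template_json, host, port, tags):
    """Fill %%host%%, %%port%%, and %%tags%% for one discovered container."""
    rendered = []
    for instance in json.loads(template_json):
        out = {}
        for key, value in instance.items():
            if value == "%%tags%%":
                # Assumption: the whole placeholder becomes a list of tags.
                out[key] = list(tags)
            elif isinstance(value, str):
                out[key] = value.replace("%%host%%", host).replace("%%port%%", str(port))
            else:
                out[key] = value
        rendered.append(out)
    return rendered


# Hypothetical container discovered on this host.
config = render_instances(TEMPLATE, "10.0.0.5", 80, ["app:nginx"])
```

One template per image is enough: each container created from that image gets its own rendered config with its own address.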

Slide 52

Slide 52 text

Collecting custom metrics
How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)

    # dd-agent.ds.yaml
    env:
    - name: SD_BACKEND
      value: "docker"
    - name: SD_CONFIG_BACKEND
      value: "etcd"
    - name: SD_BACKEND_HOST
      value: ""
    - name: SD_BACKEND_PORT
      value: ""

Slide 53

Slide 53 text

Collecting custom metrics
How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)
• Current limitation on the image name format (this will be updated soon)
  Not supported (NG): "repo/user/image_name:tag"
• Managing config templates in the KV store
  git2etcd or git2consul is useful
• dd-agent watches the KV store
  A new config template is applied immediately

Slide 54

Slide 54 text

Monitoring Cluster

Slide 55

Slide 55 text

What are work metrics for Kubernetes?
How do you measure Kubernetes health and performance?
• AFAIK there is no endpoint which reports the health of the entire cluster
• Kubernetes responsibilities:
  • Scheduling pods
  • Running services
At least if pods and services are healthy, I can say "Kubernetes is working"
Monitor pods and services
But this approach makes investigation harder…

Slide 56

Slide 56 text

What are work metrics for Kubernetes?
How do you measure Kubernetes health and performance?
• Kubernetes is composed of many services
• There is no "top-level" system like in a traditional web app
  One service can be a resource for other services
Kubernetes work metrics are the collection of each service’s work metrics

Slide 57

Slide 57 text

Kubernetes components
• kubelet
• etcd
• pods in "kube-system"

Slide 58

Slide 58 text

Monitoring kubelet
• Datadog has some checks for kubelet
  • check kubelet is running (with ping)
  • check docker is running
  • check syncloop

Slide 59

Slide 59 text

Monitoring etcd
• Datadog has an etcd integration
• etcd has a /stats endpoint which exposes statistics
  Datadog uses this endpoint
  Useful for work metrics (throughput, success, error, latency)
• etcd has a /metrics endpoint which exposes Prometheus metrics
  Includes work and resource metrics (internal metrics)

Slide 60

Slide 60 text

Monitoring pods in "kube-system"

    # query
    sum:docker.containers.stopped{ kube_namespace:kube-system } by {kubernetescluster}

You can see how many pods are stopped, and their health
• componentstatus API

Slide 61

Slide 61 text

Monitoring apiserver
Many services use apiserver as a resource
• /healthz/ping endpoint for health checks
  You can use Datadog’s http_check for it
• /metrics endpoint has Prometheus metrics about the API
  Currently dd-agent doesn’t use it
  It will be useful for collecting work and resource metrics
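Since dd-agent does not read the apiserver's /metrics endpoint, one option is to parse the Prometheus text yourself. Here is a minimal sketch of such a parser. It assumes simple `series value` lines with no trailing timestamp and keeps labels as part of the series name rather than parsing them; the sample payload is made up for illustration and is not real apiserver output.

```python
# Made-up sample in the Prometheus text format (not captured from a real cluster).
SAMPLE = """\
# HELP apiserver_request_count Counter of apiserver requests.
# TYPE apiserver_request_count counter
apiserver_request_count{verb="GET",code="200"} 1027
apiserver_request_count{verb="GET",code="500"} 3
"""


def parse_prometheus_text(text):
    """Parse `series value` lines into a dict, skipping comments and blank lines.

    Assumes there is no trailing timestamp on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token; the rest is the series.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


parsed = parse_prometheus_text(SAMPLE)
```

From counters like these, request throughput and error rate (work metrics) can be derived by comparing successive scrapes.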

Slide 62

Slide 62 text

Monitoring nodes
• The dd-agent container also monitors the host
• You can filter nodes by the "kubernetes" tag

Slide 63

Slide 63 text

Recap

Slide 64

Slide 64 text

Recap
• Datadog monitoring theory is useful whatever you monitor
• Sidecars or Service Discovery
• Query based monitoring
• Monitor each component for cluster monitoring

Slide 65

Slide 65 text

Questions

Slide 66

Slide 66 text

Questions
Questions from me!
• Labeling best practices
  • What kind of labels should I add?
• How do you create and manage k8s clusters?
  • Do you separate clusters by environment, like production and staging?

Slide 67

Slide 67 text

Thanks