[Image: alerting chart] Resource metrics A resource metric measures how much of something is consumed by your system
[Image: alerting chart] Events Events capture what happened at a point in time, with optional additional info
[Image: investigating diagram] Investigating recursively A work metric can be considered a resource metric for another system
Native solution 1. cAdvisor collects data on each host 2. kubelet fetches data from cAdvisor 3. Heapster gathers and aggregates data from the kubelets 4. Heapster pushes the aggregated data to InfluxDB 5. InfluxDB stores the data 6. Grafana fetches data from InfluxDB and visualizes it 7. kubedash fetches data from Heapster and visualizes it
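To illustrate step 6, here is a minimal sketch of reading Heapster data back out of InfluxDB with the influxdb Python client. The database name ("k8s") and the measurement/tag names are assumptions based on Heapster's default InfluxDB sink; verify what your Heapster version actually writes.

```python
# Sketch: query Heapster-written metrics from InfluxDB (step 6 of the native pipeline).
# Assumptions: Heapster's default InfluxDB sink, database "k8s", and a
# "cpu/usage_rate" measurement tagged with pod_name -- verify in your cluster.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="influxdb.kube-system", port=8086,
                        username="root", password="root", database="k8s")

result = client.query(
    'SELECT MEAN(value) FROM "cpu/usage_rate" '
    'WHERE time > now() - 5m GROUP BY "pod_name"'
)

# Print one averaged CPU value per pod over the last five minutes.
for (measurement, tags), points in result.items():
    for point in points:
        print(tags.get("pod_name"), point["mean"])
```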
• Add an endpoint which exposes custom metrics • User app: use the Prometheus client library • Third-party app: use a Prometheus exporter • Metrics format: Prometheus metrics • Configure cAdvisor to scrape those endpoints Collecting custom metrics NOTE: Custom metrics support is in Alpha (June 20, 2016)
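As a minimal sketch of the "user app" case, the snippet below exposes one custom counter with the Prometheus Python client. The metric name, port, and the simulated work are made up for illustration.

```python
# Sketch: expose a custom metric in Prometheus format from a user app.
# The metric name, port 9100, and the simulated work are illustrative only.
import time
from prometheus_client import Counter, start_http_server

REQUESTS_HANDLED = Counter("myapp_requests_handled_total",
                           "Number of requests handled by the app")

if __name__ == "__main__":
    # Serve metrics at http://0.0.0.0:9100/metrics so cAdvisor (or Prometheus)
    # can be pointed at this endpoint.
    start_http_server(9100)
    while True:
        REQUESTS_HANDLED.inc()   # pretend we handled one request
        time.sleep(1)
```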
Prometheus • Inspired by Google's Borgmon monitoring system • Natively supported by Kubernetes and cAdvisor • Kubernetes API: /metrics exposes Prometheus metrics • cAdvisor API: /metrics exposes Prometheus metrics • The second project hosted by the CNCF An OSS monitoring tool
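Because both endpoints speak the Prometheus text format, they can be read with the Prometheus client's parser. A minimal sketch, assuming an apiserver reachable on the 2016-era insecure local port 8080 (adjust the URL and add auth/TLS for your cluster):

```python
# Sketch: fetch and parse Prometheus-format metrics from the Kubernetes API server.
# The URL assumes the (2016-era) insecure local port 8080; adjust for your setup.
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://127.0.0.1:8080/metrics").text

for family in text_string_to_metric_families(text):
    # Each family has a name, a type (counter/gauge/...), and a list of samples.
    print(family.name, family.type, len(family.samples))
```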
Why Datadog? Picking out some reasons that matter for Kubernetes monitoring • Docker, Kubernetes and etcd integrations • Long data retention • Events timeline • Query-based monitoring
Long data retention You should care about the "roll-up" policy, not only the retention period "Pro and Enterprise data retention is for 13 months at full resolution (maximum is one point per second)"
Events timeline Events are much more helpful when investigating issues in Kubernetes Many things are operated automatically in Kubernetes, so events are key to understanding what happened
Query-based monitoring • Dynamic location: pods move between hosts • You will want to view pods from many angles • replicaSet • namespace • labels • cluster-wide You can't track containers with a host-centric monitoring model
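As a sketch of what query-based monitoring looks like in practice, the datadog Python client can slice the same container metric by namespace, replicaSet, or label, regardless of which host a pod lands on. The metric and tag names below are assumptions based on Datadog's Kubernetes integration; adjust to what your agent actually reports.

```python
# Sketch: query one metric from several "angles" (namespace, replicaSet, label)
# instead of per host. Metric/tag names are assumptions; API/app keys required.
import time
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

now = int(time.time())
queries = [
    "avg:kubernetes.cpu.usage.total{*} by {kube_namespace}",
    "avg:kubernetes.cpu.usage.total{kube_namespace:production} by {kube_replica_set}",
    "avg:kubernetes.cpu.usage.total{app:frontend} by {pod_name}",
]

for q in queries:
    result = api.Metric.query(start=now - 3600, end=now, query=q)
    print(q, "->", len(result.get("series", [])), "series")
```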
• Kubernetes is API-driven and everything is pluggable, so the stack behind cAdvisor can be replaced • Fetch the pod list from the kubelet to get pod metadata • Fetch pod metrics from cAdvisor Collecting basic resource metrics How dd-agent collects basic resource metrics
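A rough sketch of those two fetches, assuming the 2016-era defaults of a read-only kubelet port (10255) and cAdvisor exposed on port 4194; this is not dd-agent's actual code, only an illustration of the two APIs it talks to.

```python
# Sketch: the two calls dd-agent-style collection relies on.
# Ports 10255 (kubelet read-only) and 4194 (cAdvisor) are 2016-era defaults;
# adjust them (and add auth/TLS) for your cluster.
import requests

NODE = "127.0.0.1"

# 1) Pod list from the kubelet: used to map containers to pod/namespace/labels.
pods = requests.get("http://%s:10255/pods" % NODE).json()
for pod in pods.get("items", []):
    meta = pod["metadata"]
    print("pod:", meta["namespace"], meta["name"], meta.get("labels"))

# 2) Resource metrics from cAdvisor: per-container CPU/memory/network stats.
containers = requests.get("http://%s:4194/api/v1.3/subcontainers" % NODE).json()
print("containers with stats:", len(containers))
```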
Collecting custom metrics 1. The user stores a "config template" for an image in a KV store 2. dd-agent fetches the "config template" from the KV store 3. dd-agent fetches the pod list from the kubelet 4. dd-agent creates a monitoring config for the image 5. dd-agent monitors the containers • which are created from that image • which are on the same host as dd-agent How dd-agent collects custom metrics (dd-agent's Service Discovery feature)
Collecting custom metrics How dd-agent collects custom metrics (dd-agent's Service Discovery feature) • Current limitation: the full image name format "repo/user/image_name:tag" is not supported (this will be updated soon) • Managing config templates in the KV store: git2etcd or git2consul is useful • dd-agent watches the KV store, so a new config template is applied immediately
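A minimal sketch of step 1, writing a config template into etcd with the python-etcd client. The /datadog/check_configs/... key layout and the example nginx check fields follow the Service Discovery docs of the time, but treat them as assumptions and verify against your agent version.

```python
# Sketch: store a Service Discovery "config template" for an image in etcd.
# Key layout (/datadog/check_configs/<image>/...) and the nginx check fields
# are assumptions based on contemporary docs; verify for your agent version.
import json
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

image = "nginx"  # short image name (see the limitation above)
base = "/datadog/check_configs/%s" % image

client.write(base + "/check_names", json.dumps(["nginx"]))
client.write(base + "/init_configs", json.dumps([{}]))
client.write(base + "/instances", json.dumps([
    # %%host%% is filled in by the agent with the container's IP at runtime.
    {"nginx_status_url": "http://%%host%%/nginx_status/"}
]))
```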
What are work metrics for Kubernetes? How do you measure Kubernetes health and performance? • AFAIK there is no endpoint which exposes the health of the entire cluster • Kubernetes' responsibility: • Scheduling pods • Running services At least if pods and services are healthy, I can say "Kubernetes is working" Monitor pods and services But this approach makes investigation harder…
What are work metrics for Kubernetes? How do you measure Kubernetes health and performance? [Image: investigating diagram] • Kubernetes is composed of many services • There is no "top-level" system like a traditional web app has; one service can be a resource for other services Kubernetes work metrics are a collection of each service's work metrics
Monitoring etcd • Datadog has an etcd integration • etcd has a /stats endpoint which exposes statistics; Datadog uses this endpoint Useful for work metrics (throughput, success, error, latency) • etcd has a /metrics endpoint with Prometheus metrics Includes work and resource metrics (internal metrics)
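A quick sketch of reading both endpoints directly, assuming the etcd v2 HTTP API on the default client port 2379 (paths differ for other etcd versions):

```python
# Sketch: read etcd's own statistics and Prometheus metrics directly.
# Assumes the etcd v2 HTTP API on the default client port 2379.
import requests

ETCD = "http://127.0.0.1:2379"

# Work-metric-style statistics (what Datadog's etcd integration reads).
self_stats = requests.get(ETCD + "/v2/stats/self").json()
store_stats = requests.get(ETCD + "/v2/stats/store").json()
print("leader:", self_stats.get("leaderInfo", {}).get("leader"))
print("watchers:", store_stats.get("watchers"))

# Prometheus-format metrics (internal work + resource metrics).
print(requests.get(ETCD + "/metrics").text[:500])
```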
Monitoring pods in "kube-system" # query sum:docker.containers.stopped{ kube_namespace:kube-system } by {kubernetescluster} You can see how many containers are stopped and how healthy kube-system is • The componentstatus API is another option
Monitoring apiserver Many services use the apiserver as a resource • /healthz/ping endpoint for health checks You can use Datadog's http_check for it • The /metrics endpoint has Prometheus metrics about the API Currently dd-agent doesn't use it; it would be useful for collecting work and resource metrics
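A minimal sketch of such a health check, hitting /healthz directly. The insecure local port 8080 is an assumption from 2016-era setups, and in Datadog this would normally be configured as an http_check instance rather than custom code.

```python
# Sketch: poll the apiserver health endpoint, the same thing Datadog's
# http_check would be configured to do. Port 8080 (insecure, local) is an
# assumption; secured clusters need the TLS port and credentials instead.
import requests

resp = requests.get("http://127.0.0.1:8080/healthz", timeout=5)
if resp.status_code == 200 and resp.text == "ok":
    print("apiserver healthy")
else:
    print("apiserver unhealthy: %s %s" % (resp.status_code, resp.text))
```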
Recap • Datadog's monitoring theory is useful whatever you monitor • Side-cars or Service Discovery • Query-based monitoring • Monitor each component for cluster monitoring
Questions Questions from me! • Labeling best practices • What kind of labels should I add? • How do you create and manage your k8s clusters? • Do you separate clusters by environment, like production and staging?