fetch data from cAdvisor 3. Heapster gathers and aggregate data from kubelet 4. Heapster pushs aggregated data to InfluxDB 5. InfluxDB stores the data 6. Grafana fetches data from InfluxDB and visualize 7. kubedash fetches data from Heapster and visualize
using Prometheus client library • Third-Party app: using Prometheus exporter • Metrics format: Prometheus metrics • Configure cAdvisor to those endpoints Collecting custom metrics NOTE: Custom metrics support is in Alpha (June 20, 2016)
and cAdvisor natively support • Kubernetes API - /metrics has prometheus metrics • cAdvisor API - /metrics has prometheus metricc • The second official component by the CNCF An OSS monitoring tool
image to KV store 2. dd-agent fetches "config template" from KV store 3. dd-agent fetches pods list from kubelet 4. dd-agent creates monitoring config for the image 5. dd-agent monitors containers • which are created from the image • which are on same host with dd-agent How dd-agent collects custom metrics (dd-agent’s Service Discovery feature)
Discovery feature) • Currently image name format (this will be updated soon) NG: "repo/user/image_name:tag" • Managing config templates in KV store git2etcd or git2consul is useful • dd-agent watches the KV store new config template will be applied immediately
Kubernetes health and performance? • AFAIK there is no endpoint which has entire cluster health • Kubernetes responsibility: • Scheduling pods • Running services At lease if pods and services are healthy, I can say "Kubernetes is working" Monitor pods and services But this approach makes investigation harder…
Kubernetes health and performance? IUUQTEUZSBMMY[ZDMPVEGSPOUOFUCMPHJNBHFTIPXUPNPOJUPSJOWFTUJHBUJOH@EJBHSBN@QOH • Kubernetes is composed by many services • There is no "top-level" system like a traditional web app one service can be a resource for other services Kubernetes work metrics is a collection of each service’s work metrics
/stats endpoint which has statistics Datadog uses this endpoint Useful for work metrics (throughput, success, error, latency) • etcd has /metrics endpoint which has Prometheus metrics Includes work and resource metrics (internal metrics)
endpoint for health check You can use Datadog’s http_check for it • /metrics endpoint has Prometheus metrics about API Currently dd-agent don’t use it It will be useful for collecting work and metrics