22 Web server example Throughput - requests per second (e.g. 100 req/s) Success - % of responses that are 2xx (e.g. 99.9%) Error - % of responses that are 5xx (e.g. 0.01%) Performance - 90th percentile response time (e.g. 200 ms)
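The four work metrics above can be derived from a window of request records. The following is a minimal sketch, not taken from the talk: the `Request` record, the `work_metrics` function, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    status: int        # HTTP status code
    duration_s: float  # response time in seconds


def work_metrics(requests, window_s):
    """Summarize throughput, success %, error % and p90 latency for one window."""
    total = len(requests)
    if total == 0 or window_s <= 0:
        raise ValueError("need at least one request and a positive window")

    success = sum(1 for r in requests if 200 <= r.status < 300)
    errors = sum(1 for r in requests if 500 <= r.status < 600)

    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 90th percentile: the smallest sample such that
    # at least 90% of all samples are less than or equal to it.
    rank = math.ceil(total * 90 / 100)

    return {
        "throughput_rps": total / window_s,
        "success_pct": 100.0 * success / total,
        "error_pct": 100.0 * errors / total,
        "p90_s": durations[rank - 1],
    }
```

Note that success and error percentages need not add up to 100%, because 3xx and 4xx responses fall into neither bucket.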
27 Web server example Utilization - Disk usage (e.g. 43%) Saturation - Memory swap usage (e.g. 131 MB) Errors - 5xx errors from upstream services (e.g. 50 errors/sec) Availability - % of time the DB is reachable (e.g. 99.9%)
37 Monitoring unavailable pods However, unavailable pods can be caused not only by the cluster itself, but also by Kubernetes users’ misconfiguration, customer traffic, and GCP failures.
41 What are the work metrics for a Kubernetes cluster? As with a web server, can we use the Kubernetes API server’s throughput, success rate, error rate, and duration as work metrics? Yes, but that is not enough.
55 kubelet work metrics The kubelet’s error rate can rise because of user misconfiguration, so we don’t use a tight threshold. (We use 1% as the threshold for now.)
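A threshold check of this kind can be sketched as follows. Only the 1% figure comes from the slide; the function names, the counter inputs, and the choice of a strict "greater than" comparison are assumptions, and the actual monitor definition is not shown in the talk.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of operations that failed; 0.0 when there was no traffic."""
    if total == 0:
        return 0.0
    return errors / total


def should_alert(errors: int, total: int, threshold: float = 0.01) -> bool:
    """Alert only when the error rate is strictly above the threshold (1% by default)."""
    return error_rate(errors, total) > threshold


# Hypothetical counts for one evaluation window.
print(should_alert(errors=5, total=1000))   # 0.5% error rate, below the threshold
print(should_alert(errors=20, total=1000))  # 2% error rate, above the threshold
```

A loose threshold like this trades sensitivity for fewer false alarms, which fits the case where part of the error rate comes from users' own misconfiguration rather than from the cluster.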
57 kube-proxy work metrics We have them on the dashboard but don’t actively use them, since kube-proxy metrics are not reliable enough to alert on. The main reason is the kube-proxy metrics integration between Prometheus and Datadog.
60 kube-dns work metrics But kube-dns sometimes causes issues in the cluster, so we plan to migrate to CoreDNS or to monitor it with a custom tool.
61 cluster-level monitoring Skipping this due to time constraints, but we have a dedicated dashboard and monitors for cluster-level monitoring (Datadog Agent).
62 cluster-level logging Skipping this due to time constraints, but we have a dedicated dashboard and monitors for cluster-level logging (Stackdriver Logging Agent and Datadog Agent).