The Work and Resource metrics
or the RED and USE methods
Slide 19
The work (RED) metrics
Throughput
Success
Error
Performance
Slide 20
The Work (RED) metrics
indicates the top-level health of your system
Slide 21
The Work (RED) metrics
indicates whether your system is working or broken
Slide 22
Web server’s example
Throughput - requests per second (e.g. 100 req/s)
Success - % of responses that are 2xx (e.g. 99.9%)
Error - % of responses that are 5xx (e.g. 0.01%)
Performance - 90th percentile response time (e.g. 200ms)
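The four numbers above can be derived from plain access-log samples. A minimal sketch, assuming a simple `(status_code, latency)` tuple layout and a known observation window (both are illustrative, not from the talk):

```python
# Hypothetical sketch: computing the RED (work) metrics for a web server
# from a batch of access-log samples.

def red_metrics(entries, window_seconds):
    """entries: list of (status_code, response_time_seconds) tuples."""
    total = len(entries)
    throughput = total / window_seconds                             # requests per second
    success = sum(1 for s, _ in entries if 200 <= s < 300) / total  # 2xx ratio
    error = sum(1 for s, _ in entries if s >= 500) / total          # 5xx ratio
    latencies = sorted(t for _, t in entries)
    p90 = latencies[int(0.9 * (total - 1))]                         # 90th percentile latency
    return throughput, success, error, p90

# 10 requests observed over 1 second: 9 fast 2xx and 1 slow 5xx
entries = [(200, 0.1)] * 9 + [(500, 0.9)]
tp, ok, err, p90 = red_metrics(entries, window_seconds=1)
print(tp, ok, err, p90)  # 10.0 0.9 0.1 0.1
```

In practice these come from a metrics pipeline (e.g. Prometheus histograms) rather than raw log batches, but the definitions are the same.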
Slide 23
The Resource (USE) metrics
Utilization
Saturation
Errors
Availability
Slide 24
The Resource (USE) metrics
indicates the low-level health of your system
Slide 25
Web server’s example
Web server → CPU, Mem, Disk, Network, DB server
The web server depends on these resources
Slide 26
Web server’s example
Web server → CPU, Mem, Disk, Network, DB server
DB server → CPU, Mem, Disk, Network
The DB server’s resources
Slide 27
Web server’s example
Utilization - Disk usage (e.g. 43%)
Saturation - Memory swap usage (e.g. 131MB)
Errors - 5xx errors from upstream services (e.g. 50 errors/sec)
Availability - % time the DB is reachable (e.g. 99.9%)
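These USE checks reduce to simple threshold comparisons. A hypothetical sketch (the threshold values are made up for illustration; availability, where *lower* is worse, is omitted for brevity):

```python
# Hypothetical sketch of evaluating the USE (resource) metrics for one
# resource against example thresholds. Names and limits are illustrative.

THRESHOLDS = {
    "utilization": 0.80,  # e.g. disk usage above 80% is worth a look
    "saturation": 0,      # e.g. any swap usage means memory pressure
    "errors": 0,          # e.g. any upstream 5xx errors per second
}

def use_check(sample):
    """sample: dict of metric name -> observed value; returns breached metrics."""
    return [m for m, limit in THRESHOLDS.items() if sample.get(m, 0) > limit]

# The slide's example values: 43% disk, 131 MB swap, 50 errors/sec
sample = {"utilization": 0.43, "saturation": 131, "errors": 50}
print(use_check(sample))  # ['saturation', 'errors']
```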
Slide 28
WORK METRICS
FOR KUBERNETES CLUSTER
Slide 29
Remember today’s theme
How do you tell whether
your cluster is working or broken?
Slide 30
What are work metrics for the Cluster?
Slide 31
What is Kubernetes cluster’s job?
Slide 32
Kubernetes’ job is orchestration
Are there any metrics which indicate
a Kubernetes cluster is orchestrating properly?
Slide 33
Metrics we monitor for checking the orchestration
Slide 34
Monitoring unavailable pods
Slide 35
Metrics we monitor for checking the orchestration (Cluster level work metrics)
Slide 36
Monitoring unavailable pods
If there are no unavailable pods, we can at least say the
Kubernetes cluster is orchestrating properly
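The check itself reduces to comparing desired vs. available replicas per workload. A minimal sketch with fabricated status dicts (a real setup would read these from kube-state-metrics or the Deployment status API; the field names here are simplified assumptions):

```python
# Minimal sketch of the "unavailable pods" check: compare desired vs.
# available replicas per workload. Field names are simplified assumptions.

def unavailable(workloads):
    """Return names of workloads with fewer available replicas than desired."""
    return [w["name"] for w in workloads if w["available"] < w["desired"]]

workloads = [
    {"name": "web", "desired": 3, "available": 3},
    {"name": "worker", "desired": 5, "available": 4},  # one pod unavailable
]
print(unavailable(workloads))  # ['worker']
```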
Slide 37
Monitoring unavailable pods
However, unavailable pods are caused not only by the cluster,
but also by Kubernetes users’ misconfiguration, customer
traffic and GCP failures.
Slide 38
Metrics we monitor for checking the orchestration
Slide 39
What are work metrics for Kubernetes cluster?
Monitoring only unavailable pods is not enough.
What should we do?
Slide 40
What are work metrics for Kubernetes cluster?
Similar to a web server, can we use the Kubernetes API server’s
throughput, success, error and duration as work metrics?
Slide 41
What are work metrics for Kubernetes cluster?
Similar to a web server, can we use the Kubernetes API server’s
throughput, success, error and duration as work metrics?
Yes, but it’s not enough.
Slide 42
What are work metrics for Kubernetes cluster?
Since a Kubernetes cluster is a distributed system,
we need to monitor each component’s work metrics.
Slide 50
Kubernetes Master Components
Since the master components are managed by GKE,
we don’t need to (and can’t) monitor them ourselves
Slide 51
What are work metrics for Kubernetes cluster?
Since a Kubernetes cluster is a distributed system,
we need to monitor each component’s work metrics.
Slide 52
The master component metrics we do have
Slide 53
Kubernetes Node Components
Slide 54
kubelet work metrics
Slide 55
kubelet work metrics
kubelet’s error rate can be increased by users’
misconfiguration, so we don’t use a tight threshold.
(we use 1% as the threshold for now)
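The loose threshold described above amounts to a one-line comparison. This sketch uses the 1% figure from the slide; the sample counts are made up:

```python
# Sketch of the loose alert threshold described above: kubelet's error rate
# may rise from user misconfiguration alone, so alert only above 1%.

ERROR_RATE_THRESHOLD = 0.01  # 1%, as mentioned on the slide

def should_alert(errors, total):
    """Alert only when the error rate exceeds the loose 1% threshold."""
    return total > 0 and errors / total > ERROR_RATE_THRESHOLD

print(should_alert(errors=2, total=1000))   # 0.2% -> False
print(should_alert(errors=50, total=1000))  # 5%   -> True
```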
Slide 56
kube-proxy work metrics
Slide 57
kube-proxy work metrics
We have them in the dashboard, but we don’t use them
actively since kube-proxy metrics are not reliable enough
to set alerting on. The main reason is the
kube-proxy metrics integration between Prometheus and
Datadog.
Slide 58
Kubernetes Addons
Slide 59
kube-dns work metrics
As with kube-proxy, we don’t use them actively since
they are not reliable.
Slide 60
kube-dns work metrics
But kube-dns sometimes causes issues in the cluster, so
we plan to migrate it to CoreDNS or monitor it
somehow with a tool of our own
Slide 61
cluster-level monitoring
Skipping this due to the time limitation, but we have a
dedicated dashboard and monitors for cluster-level
monitoring (Datadog Agent)
Slide 62
cluster-level logging
Skipping this due to the time limitation, but we have a
dedicated dashboard and monitors for cluster-level
logging (Stackdriver Logging Agent and Datadog Agent)
Slide 63
RESOURCE METRICS
FOR KUBERNETES CLUSTER
Slide 64
Kubernetes Node Components
Slide 65
Cluster level resource metrics
Slide 66
Cluster level resource metrics
Slide 67
Node level resource metrics
Similarly, we look at disk and network usage
Slide 68
Cluster level resource metrics
See the Kubernetes nodes as
one big machine
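“One big machine” here just means summing per-node capacity and usage before dividing. A sketch with fabricated node data:

```python
# Sketch of "see the Kubernetes nodes as one big machine": sum each node's
# CPU usage and capacity into a single cluster-level utilization figure.
# Node data is fabricated for illustration.

def cluster_cpu_utilization(nodes):
    """nodes: list of dicts with per-node CPU cores used and total cores."""
    used = sum(n["cpu_used"] for n in nodes)
    capacity = sum(n["cpu_capacity"] for n in nodes)
    return used / capacity

nodes = [
    {"cpu_used": 2.0, "cpu_capacity": 4},
    {"cpu_used": 1.0, "cpu_capacity": 4},
]
print(cluster_cpu_utilization(nodes))  # 3 of 8 cores -> 0.375
```

The same aggregation applies to memory, disk and network; node-level views are still needed afterwards to spot a single hot node hiding inside a healthy cluster average.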
Slide 69
Node level resource metrics
Slide 70
Node level resource metrics
Slide 71
kubelet resource metrics (availability)
Slide 72
kubelet resource metrics (availability)
Slide 73
Investigation
Cluster level work metrics
Cluster level resource metrics → CPU, Mem, Disk, Network
Node level resource metrics → CPU, Mem, Disk, Network
Slide 74
Investigation
Cluster level work metrics → CPU, Mem, Disk, Network
kubelet, kube-dns → CPU, Mem, Disk, Network
Slide 75
RECAP
Slide 76
Recap
Define responsibility boundaries first
Work and Resource metrics (RED & USE)
Monitor each component where possible