Kubernetes Cluster Monitoring

Mercari Meetup for Microservices Platform #2, May 22, 2019 Kubernetes
Cluster Monitoring

2 About me @spesnova. Software Engineer, microservices platform team at
Mercari. Kubernetes Tokyo Community Organizer.

3 Today’s theme Monitoring Kubernetes pods

4 Today’s theme Monitoring Kubernetes pods

5 Today’s theme Monitoring Kubernetes cluster

6 Today’s theme How do you make sure that your cluster
is working or broken?

CONTEXT ABOUT OUR KUBERNETES CLUSTER

Current Status

9 Current Status 200+ engineers, 100+ microservices, 8 members in the
platform team

10 Current Status 100+ namespaces

11 Current Status 100+ k8s services

12 Current Status 2K+ pods

13 Current Status 2K+ containers

Responsibility Boundary

15 Responsibility boundary Diagram: pods, k8s nodes, k8s master

16 Responsibility boundary Diagram: a boundary separates the developers' responsibility (pods)
from the platform's responsibility (k8s nodes); the k8s master is also shown

17 Responsibility boundary Diagram: developers' responsibility (pods), platform's responsibility
(k8s nodes), GKE's responsibility (k8s master), with a boundary between each

Work and Resource metrics, or the RED and USE
methods

19 The Work (RED) metrics: Throughput, Success, Error, Performance

20 The Work (RED) metrics indicate the top-level health of
your system

21 The Work (RED) metrics indicate whether your system is working
or broken

22 Web server’s example
Throughput - requests per second (e.g. 100 req/s)
Success - % of responses that are 2xx (e.g. 99.9%)
Error - % of responses that are 5xx (e.g. 0.01%)
Performance - 90th percentile response time (e.g. 200 ms)
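
The four work metrics on this slide can be computed from a window of request records. The following is a minimal sketch, not something shown in the talk: the `Request` type, the `red_metrics` function and the nearest-rank percentile method are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code of the response
    duration_s: float  # response time in seconds


def red_metrics(requests, window_s):
    """Summarise one time window of requests as work (RED) metrics.

    Hypothetical helper: the field names are illustrative only.
    """
    total = len(requests)
    if total == 0:
        return {"throughput_rps": 0.0, "success_pct": None,
                "error_pct": None, "p90_s": None}
    success = sum(1 for r in requests if 200 <= r.status < 300)
    errors = sum(1 for r in requests if 500 <= r.status < 600)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 90th percentile: ceil(0.9 * total), in integer arithmetic.
    rank = (9 * total + 9) // 10
    return {
        "throughput_rps": total / window_s,
        "success_pct": 100.0 * success / total,
        "error_pct": 100.0 * errors / total,
        "p90_s": durations[rank - 1],
    }
```

Note that success and error percentages do not have to add up to 100%, because 3xx and 4xx responses fall into neither bucket.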

23 The Resource (USE) metrics: Utilization, Saturation, Errors, Availability

24 The Resource (USE) metrics indicate the low-level health of
your system

25 Web server’s example The web server depends on these resources:
CPU, Mem, Disk, Network, DB server

26 Web server’s example Web server: CPU, Mem, Disk, Network, DB server.
The DB server’s resources: CPU, Mem, Disk, Network

27 Web server’s example
Utilization - Disk usage (e.g. 43%)
Saturation - Memory swap usage (e.g. 131 MB)
Errors - 5xx errors from upstream services (e.g. 50 errors/sec)
Availability - % time the DB is reachable (e.g. 99.9%)
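
As a rough illustration of how these four resource metrics relate to raw measurements, here is a small sketch. The function name, its parameters and the idea of deriving availability from periodic reachability probes are assumptions made for this example, not part of the talk.

```python
def use_metrics(disk_used_bytes, disk_total_bytes, swap_used_bytes,
                upstream_5xx_count, window_s, db_probe_results):
    """Turn raw samples from one time window into resource (USE) metrics.

    Hypothetical helper: db_probe_results is assumed to be a list of
    booleans, True when a periodic probe could reach the DB.
    """
    return {
        # Utilization: share of disk capacity in use.
        "utilization_pct": 100.0 * disk_used_bytes / disk_total_bytes,
        # Saturation: swap in use, in MiB.
        "saturation_swap_mb": swap_used_bytes / (1024 * 1024),
        # Errors: upstream 5xx responses per second.
        "errors_per_s": upstream_5xx_count / window_s,
        # Availability: share of probes that reached the DB.
        "availability_pct": 100.0 * sum(db_probe_results) / len(db_probe_results),
    }
```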

WORK METRICS FOR KUBERNETES CLUSTER

29 Remember today’s theme How do you make sure that
your cluster is working or broken?

What are work metrics for the Cluster?

31 What is Kubernetes cluster’s job?

32 Kubernetes’ job is orchestration. Are there any metrics that
indicate whether a Kubernetes cluster is orchestrating properly?

33 Metrics we monitor for checking the orchestration

34 Monitoring unavailable pods

35 Metrics we monitor for checking the orchestration (Cluster level
work metrics)

36 Monitoring unavailable pods If there are no unavailable pods,
we can at least say the Kubernetes cluster is orchestrating properly
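
One way to picture this check is to count unavailable replicas per Deployment. The sketch below assumes its input is the `items` list from `kubectl get deployments --all-namespaces -o json`, where `status.unavailableReplicas` is omitted when it is zero. The function name is hypothetical, and the talk does not say the metric is collected this way.

```python
def unavailable_by_deployment(deployments):
    """Map "namespace/name" to its unavailable replica count.

    Assumption: each item is a dict shaped like a Deployment object.
    Deployments with no unavailable replicas are left out of the result.
    """
    result = {}
    for d in deployments:
        # The field is absent when every replica is available.
        unavailable = d.get("status", {}).get("unavailableReplicas", 0)
        if unavailable > 0:
            key = f'{d["metadata"]["namespace"]}/{d["metadata"]["name"]}'
            result[key] = unavailable
    return result
```

An empty result corresponds to the "no unavailable pods" case on this slide.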

37 Monitoring unavailable pods However, unavailable pods can be caused
not only by the cluster but also by Kubernetes users’ misconfiguration, customer traffic and GCP failures.

38 Metrics we monitor for checking the orchestration

39 What are work metrics for Kubernetes cluster? Monitoring only
unavailable pods is not enough. What should we do?

40 What are work metrics for Kubernetes cluster? Similar to
a web server, can we use Kubernetes API server throughput, success, error and duration as work metrics?

41 What are work metrics for Kubernetes cluster? Similar to
a web server, can we use Kubernetes API server throughput, success, error and duration as work metrics? Yes, but it’s not enough.

42 What are work metrics for Kubernetes cluster? Since a Kubernetes
cluster is a distributed system, we need to monitor each component’s work metrics.

Kubernetes Components

44 Kubernetes Components Master Components, Node Components, Addons

45 Kubernetes Components Master Components: kube-apiserver, etcd, kube-scheduler, kube-controller-manager

46 Kubernetes Components Node Components: kubelet, kube-proxy, Docker

47 Kubernetes Components Addons: kube-dns, cluster-level monitoring, cluster-level logging

Kubernetes Master Components

49 Remember the boundaries Diagram: developers' responsibility (pods), platform's responsibility
(k8s nodes), GKE's responsibility (k8s master), with a boundary between each

50 Kubernetes Master Components Since master components are managed by
GKE, we don’t need to (and can’t) monitor them ourselves

51 What are work metrics for Kubernetes cluster? Since a Kubernetes
cluster is a distributed system, we need to monitor each component’s work metrics.

52 Metrics we have about master components

Kubernetes Node Components

54 kubelet work metrics

55 kubelet work metrics kubelet’s error rate can be increased
by users’ misconfiguration, so we don’t use a tight threshold. (We use 1% as the threshold for now.)
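
The 1% threshold can be read as a simple ratio check. This sketch is an assumption about the shape of such a check, not the actual monitor definition from the talk: the function names are hypothetical, and the operation and error counts are assumed to come from whatever kubelet metrics your monitoring agent collects over one evaluation window.

```python
ERROR_RATE_THRESHOLD = 0.01  # the 1% threshold mentioned on the slide


def kubelet_error_rate(operations, errors):
    """Fraction of kubelet operations that failed in the window."""
    if operations == 0:
        return 0.0  # no traffic in the window: treat as no errors
    return errors / operations


def should_alert(operations, errors, threshold=ERROR_RATE_THRESHOLD):
    """True when the error rate is strictly above the threshold."""
    return kubelet_error_rate(operations, errors) > threshold
```

With a loose threshold like this, a handful of failures caused by one misconfigured workload stays below the line, while a sustained failure across many operations still triggers.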

56 kube-proxy work metrics

57 kube-proxy work metrics We have them in the dashboard,
but we don’t use them actively since kube-proxy metrics are not reliable enough to set alerts on. The main reason is the kube-proxy metrics integration between Prometheus and Datadog.

Kubernetes Addons

59 kube-dns work metrics As with kube-proxy, we don’t
use them actively since they are not reliable.

60 kube-dns work metrics But kube-dns sometimes causes issues in
the cluster, so we plan to migrate it to CoreDNS or to monitor it somehow by building our own tool

61 cluster-level monitoring Skipping this due to time constraints,
but we have a dedicated dashboard and monitors for cluster-level monitoring (Datadog Agent)

62 cluster-level logging Skipping this due to time constraints,
but we have a dedicated dashboard and monitors for cluster-level logging (Stackdriver Logging Agent and Datadog Agent)

RESOURCE METRICS FOR KUBERNETES CLUSTER

Kubernetes Node Components

65 Cluster level resource metrics

66 Cluster level resource metrics

67 Node level resource metrics Similarly, we look at disk and
network usage

68 Cluster level resource metrics Treat the Kubernetes nodes as
one big machine
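
Viewing the nodes as one big machine amounts to summing usage and capacity across nodes before computing a percentage. A minimal sketch follows. The function and its dictionary keys are hypothetical names for this example, and the talk does not show how its dashboards compute these values.

```python
def cluster_utilization(nodes):
    """Aggregate per-node usage into cluster-wide utilization percentages.

    Assumption: each node is a dict with the illustrative keys
    cpu_used_cores, cpu_capacity_cores, mem_used_bytes, mem_capacity_bytes.
    """
    cpu_used = sum(n["cpu_used_cores"] for n in nodes)
    cpu_capacity = sum(n["cpu_capacity_cores"] for n in nodes)
    mem_used = sum(n["mem_used_bytes"] for n in nodes)
    mem_capacity = sum(n["mem_capacity_bytes"] for n in nodes)
    return {
        "cpu_pct": 100.0 * cpu_used / cpu_capacity,
        "mem_pct": 100.0 * mem_used / mem_capacity,
    }
```

Summing first and dividing once weights each node by its size. Averaging the per-node percentages instead would give a small node the same weight as a large one.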

69 Node level resource metrics

70 Node level resource metrics

71 kubelet resource metrics (availability)

72 kubelet resource metrics (availability)

73 Investigation Cluster level work metrics; cluster level resource metrics (CPU, Mem, Disk, Network);
node level resource metrics (CPU, Mem, Disk, Network)

74 Investigation Cluster level work metrics; CPU, Mem, Disk, Network;
kubelet, kube-dns; CPU, Mem, Disk, Network

76 Recap Define responsibility boundaries first. Use Work and Resource metrics
(RED & USE). Monitor each component as far as possible.
