Kubernetes Cluster Monitoring

by Seigo Uchida

Slide 1

Slide 1 text

Mercari Meetup for Microservices Platform #2, May 22, 2019
Kubernetes Cluster Monitoring

Slide 2

Slide 2 text

About me: @spesnova. Software Engineer, microservices platform team at Mercari. Kubernetes Tokyo Community Organizer.

Slide 3

Slide 3 text

Today’s theme: Monitoring Kubernetes pods

Slide 4

Slide 4 text

Today’s theme: Monitoring Kubernetes pods

Slide 5

Slide 5 text

Today’s theme: Monitoring the Kubernetes cluster

Slide 6

Slide 6 text

Today’s theme: How do you tell whether your cluster is working or broken?

Slide 7

Slide 7 text

CONTEXT ABOUT OUR KUBERNETES CLUSTER

Slide 8

Slide 8 text

Current Status

Slide 9

Slide 9 text

Current Status: 200+ engineers, 100+ microservices, 8 members in the platform team

Slide 10

Slide 10 text

Current Status: 100+ namespaces

Slide 11

Slide 11 text

Current Status: 100+ k8s services

Slide 12

Slide 12 text

Current Status: 2K+ pods

Slide 13

Slide 13 text

Current Status: 2K+ containers

Slide 14

Slide 14 text

Responsibility Boundary

Slide 15

Slide 15 text

Responsibility boundary (diagram): pods running on k8s nodes, alongside the k8s master

Slide 16

Slide 16 text

Responsibility boundary (diagram): pods are the developer’s responsibility and k8s nodes are the platform’s responsibility, with a boundary between the two; the k8s master is also shown.

Slide 17

Slide 17 text

Responsibility boundary (diagram): pods are the developer’s responsibility, k8s nodes are the platform’s responsibility, and the k8s master is GKE’s responsibility, with a boundary between each pair.

Slide 18

Slide 18 text

The Work and Resource metrics or The RED and USE method

Slide 19

Slide 19 text

The Work (RED) metrics: Throughput, Success, Error, Performance

Slide 20

Slide 20 text

The Work (RED) metrics indicate the top-level health of your system

Slide 21

Slide 21 text

The Work (RED) metrics indicate whether your system is working or broken

Slide 22

Slide 22 text

Web server’s example
Throughput - requests per second (e.g. 100 req/s)
Success - % of responses that are 2xx (e.g. 99.9%)
Error - % of responses that are 5xx (e.g. 0.01%)
Performance - 90th percentile response time (e.g. 200 ms)
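
As a rough illustration of the four work metrics above, the sketch below derives them from a list of request records for one time window. The `Request` type, the `work_metrics` function and the nearest-rank percentile are assumptions made for this example; the slides do not say how the numbers are computed.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code of the response
    duration_s: float  # response time in seconds


def work_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute throughput, success %, error % and p90 latency for one window."""
    total = len(requests)
    if total == 0:
        return {"throughput": 0.0, "success_pct": 0.0, "error_pct": 0.0, "p90_s": 0.0}
    success = sum(1 for r in requests if 200 <= r.status < 300)
    errors = sum(1 for r in requests if 500 <= r.status < 600)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile: the smallest sample with at least 90% of
    # all samples at or below it.
    rank = math.ceil(total * 9 / 10)
    return {
        "throughput": total / window_s,
        "success_pct": 100.0 * success / total,
        "error_pct": 100.0 * errors / total,
        "p90_s": durations[rank - 1],
    }
```

Success and error percentages need not sum to 100, because 3xx and 4xx responses fall into neither bucket.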

Slide 23

Slide 23 text

The Resource (USE) metrics: Utilization, Saturation, Errors, Availability

Slide 24

Slide 24 text

The Resource (USE) metrics indicate the low-level health of your system

Slide 25

Slide 25 text

Web server’s example: the web server depends on these resources: CPU, Mem, Disk, Network, and the DB server

Slide 26

Slide 26 text

Web server’s example: the web server depends on CPU, Mem, Disk, Network, and the DB server; the DB server in turn has its own resources: CPU, Mem, Disk, Network

Slide 27

Slide 27 text

Web server’s example
Utilization - Disk usage (e.g. 43%)
Saturation - Memory swap usage (e.g. 131 MB)
Errors - 5xx errors from upstream services (e.g. 50 errors/sec)
Availability - % time the DB is reachable (e.g. 99.9%)
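
Two of these resource metrics are simple ratios, which the following sketch makes explicit. The function names and the idea of measuring availability from periodic reachability probes are assumptions for illustration, not something the slides specify.

```python
def utilization_pct(used: float, capacity: float) -> float:
    """Share of a resource's capacity currently in use, as a percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


def availability_pct(probe_results: list[bool]) -> float:
    """Share of reachability probes that succeeded, as a percentage."""
    if not probe_results:
        raise ValueError("at least one probe result is required")
    return 100.0 * sum(probe_results) / len(probe_results)
```

For example, 999 successful probes out of 1000 gives an availability of 99.9%.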

Slide 28

Slide 28 text

WORK METRICS FOR KUBERNETES CLUSTER

Slide 29

Slide 29 text

Remember today’s theme: How do you tell whether your cluster is working or broken?

Slide 30

Slide 30 text

What are work metrics for the Cluster?

Slide 31

Slide 31 text

What is a Kubernetes cluster’s job?

Slide 32

Slide 32 text

Kubernetes’ job is orchestration. Are there any metrics that indicate whether a Kubernetes cluster is orchestrating properly?

Slide 33

Slide 33 text

Metrics we monitor for checking the orchestration

Slide 34

Slide 34 text

Monitoring unavailable pods

Slide 35

Slide 35 text

Metrics we monitor for checking the orchestration (cluster level work metrics)

Slide 36

Slide 36 text

Monitoring unavailable pods: If there are no unavailable pods, we can at least say that the Kubernetes cluster is orchestrating properly.

Slide 37

Slide 37 text

Monitoring unavailable pods: However, unavailable pods can be caused not only by the cluster but also by Kubernetes users’ misconfiguration, customer traffic, and GCP failures.
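
The slides do not show how the unavailable-pod count is derived. One possible approximation, sketched below, compares desired and available replicas per Deployment. The input dictionaries mirror the `spec.replicas` and `status.availableReplicas` fields of the Kubernetes Deployment API; the `unavailable_pods` function and the sample data are hypothetical.

```python
def unavailable_pods(deployments: list[dict]) -> dict[str, int]:
    """Map "namespace/name" to the number of replicas that are not available."""
    result: dict[str, int] = {}
    for d in deployments:
        # spec.replicas defaults to 1 when it is not set on a Deployment.
        desired = d.get("spec", {}).get("replicas", 1)
        # availableReplicas is omitted from the status when it is zero.
        available = d.get("status", {}).get("availableReplicas", 0)
        missing = max(desired - available, 0)
        if missing:
            key = f'{d["metadata"]["namespace"]}/{d["metadata"]["name"]}'
            result[key] = missing
    return result


# Hypothetical sample data for illustration only.
sample = [
    {"metadata": {"namespace": "shop", "name": "api"},
     "spec": {"replicas": 3}, "status": {"availableReplicas": 3}},
    {"metadata": {"namespace": "shop", "name": "worker"},
     "spec": {"replicas": 5}, "status": {"availableReplicas": 2}},
    {"metadata": {"namespace": "search", "name": "indexer"},
     "spec": {"replicas": 2}, "status": {}},
]
report = unavailable_pods(sample)
```

Grouping the result by namespace is one way to separate a single team's misconfiguration from a cluster-wide problem, which is the ambiguity this slide points out.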

Slide 38

Slide 38 text

Metrics we monitor for checking the orchestration

Slide 39

Slide 39 text

What are work metrics for Kubernetes cluster? Monitoring only unavailable pods is not enough. What should we do?

Slide 40

Slide 40 text

What are work metrics for Kubernetes cluster? As with a web server, can we use the Kubernetes API server’s throughput, success, error and duration as work metrics?

Slide 41

Slide 41 text

What are work metrics for Kubernetes cluster? As with a web server, can we use the Kubernetes API server’s throughput, success, error and duration as work metrics? Yes, but it’s not enough.

Slide 42

Slide 42 text

What are work metrics for Kubernetes cluster? Since a Kubernetes cluster is a distributed system, we need to monitor each component’s work metrics.

Slide 43

Slide 43 text

Kubernetes Components

Slide 44

Slide 44 text

Kubernetes Components: Master Components, Node Components, Addons

Slide 45

Slide 45 text

Kubernetes Components, Master Components: kube-apiserver, etcd, kube-scheduler, kube-controller-manager

Slide 46

Slide 46 text

Kubernetes Components, Node Components: kubelet, kube-proxy, Docker

Slide 47

Slide 47 text

Kubernetes Components, Addons: kube-dns, cluster-level monitoring, cluster-level logging

Slide 48

Slide 48 text

Kubernetes Master Components

Slide 49

Slide 49 text

Remember the boundaries (diagram): pods are the developer’s responsibility, k8s nodes are the platform’s responsibility, and the k8s master is GKE’s responsibility, with a boundary between each pair.

Slide 50

Slide 50 text

Kubernetes Master Components: Since master components are managed by GKE, we don’t need to (and can’t) monitor them ourselves.

Slide 51

Slide 51 text

What are work metrics for Kubernetes cluster? Since a Kubernetes cluster is a distributed system, we need to monitor each component’s work metrics.

Slide 52

Slide 52 text

Metrics we have about master components

Slide 53

Slide 53 text

Kubernetes Node Components

Slide 54

Slide 54 text

kubelet work metrics

Slide 55

Slide 55 text

kubelet work metrics: kubelet’s error rate can increase because of users’ misconfiguration, so we don’t use a tight threshold (we use 1% for now).
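
A threshold check of this kind can be sketched as follows, assuming the error rate is computed from two samples of cumulative counters. The counter pair and function names are hypothetical; only the 1% threshold comes from the slide.

```python
ERROR_RATE_THRESHOLD = 0.01  # 1%, the loose threshold mentioned on the slide


def error_rate(prev: tuple[int, int], curr: tuple[int, int]) -> float:
    """Error rate between two samples of (total_operations, failed_operations).

    Both values are assumed to be cumulative counters, so the rate for the
    interval is the ratio of their increases.
    """
    delta_total = curr[0] - prev[0]
    delta_failed = curr[1] - prev[1]
    if delta_total <= 0:
        # No operations in the interval (or a counter reset): nothing to report.
        return 0.0
    return delta_failed / delta_total


def should_alert(rate: float, threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """Alert only when the error rate is strictly above the threshold."""
    return rate > threshold
```

With 1000 operations in the interval, 5 failures (0.5%) stay below the threshold, while 20 failures (2%) would trigger an alert.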

Slide 56

Slide 56 text

kube-proxy work metrics

Slide 57

Slide 57 text

kube-proxy work metrics: We have them on the dashboard, but we don’t use them actively because they are not reliable enough to alert on. The main reason is the kube-proxy metrics integration between Prometheus and Datadog.

Slide 58

Slide 58 text

Kubernetes Addons

Slide 59

Slide 59 text

kube-dns work metrics: As with kube-proxy, we don’t use them actively because they are not reliable.

Slide 60

Slide 60 text

kube-dns work metrics: However, kube-dns sometimes causes issues in the cluster, so we plan to either migrate to CoreDNS or monitor kube-dns with a tool of our own.

Slide 61

Slide 61 text

cluster-level monitoring: Skipped due to time constraints, but we have a dedicated dashboard and monitors for cluster-level monitoring (Datadog Agent).

Slide 62

Slide 62 text

cluster-level logging: Skipped due to time constraints, but we have a dedicated dashboard and monitors for cluster-level logging (Stackdriver Logging Agent and Datadog Agent).

Slide 63

Slide 63 text

RESOURCE METRICS FOR KUBERNETES CLUSTER

Slide 64

Slide 64 text

Kubernetes Node Components

Slide 65

Slide 65 text

Cluster level resource metrics

Slide 66

Slide 66 text

Cluster level resource metrics

Slide 67

Slide 67 text

Node level resource metrics: Similarly, we look at Disk and Network usage.

Slide 68

Slide 68 text

Cluster level resource metrics: See the Kubernetes nodes as one big machine
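
Treating the nodes as one big machine amounts to summing capacity and usage across nodes before taking the ratio. The sketch below shows that aggregation; the `Node` type, its field names and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity_cores: float
    cpu_used_cores: float
    mem_capacity_bytes: int
    mem_used_bytes: int


def cluster_utilization(nodes: list[Node]) -> dict[str, float]:
    """Aggregate per-node figures into cluster-wide CPU and memory utilization."""
    if not nodes:
        raise ValueError("at least one node is required")
    cpu_capacity = sum(n.cpu_capacity_cores for n in nodes)
    cpu_used = sum(n.cpu_used_cores for n in nodes)
    mem_capacity = sum(n.mem_capacity_bytes for n in nodes)
    mem_used = sum(n.mem_used_bytes for n in nodes)
    return {
        "cpu_pct": 100.0 * cpu_used / cpu_capacity,
        "mem_pct": 100.0 * mem_used / mem_capacity,
    }


GIB = 1024 ** 3
# Hypothetical two-node cluster: one lightly loaded node, one busy node.
nodes = [
    Node("node-a", 4.0, 1.0, 8 * GIB, 2 * GIB),
    Node("node-b", 4.0, 3.0, 8 * GIB, 6 * GIB),
]
summary = cluster_utilization(nodes)
```

In this sample the cluster sits at 50% for both CPU and memory even though one node is at 75%, which is why a cluster-wide view is usually paired with node level metrics.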

Slide 69

Slide 69 text

Node level resource metrics

Slide 70

Slide 70 text

Node level resource metrics

Slide 71

Slide 71 text

kubelet resource metrics (availability)

Slide 72

Slide 72 text

kubelet resource metrics (availability)

Slide 73

Slide 73 text

Investigation (diagram): cluster level work metrics; cluster level resource metrics (CPU, Mem, Disk, Network); node level resource metrics (CPU, Mem, Disk, Network)

Slide 74

Slide 74 text

Investigation (diagram): cluster level work metrics; CPU, Mem, Disk, Network; kubelet, kube-dns; CPU, Mem, Disk, Network

Slide 75

Slide 75 text

RECAP

Slide 76

Slide 76 text

Recap
Define responsibility boundaries first
Work and Resource metrics (RED & USE)
Monitor each component as far as possible
