Kubernetes Cluster Monitoring

Talk at Mercari Microservices Platform Meetup #2
https://connpass.com/event/128017/

Explained how we monitor a Kubernetes cluster itself instead of application pods running on it.

Seigo Uchida

May 22, 2019
Transcript

  1. Mercari Meetup for Microservices Platform #2, May 22, 2019 Kubernetes

    Cluster Monitoring
  2. 2 About me @spesnova Software Engineer, microservices platform team at

    Mercari Kubernetes Tokyo Community Organizer
  3. 3 Today’s theme Monitoring Kubernetes pods

  4. 4 Today’s theme Monitoring Kubernetes pods

  5. 5 Today’s theme Monitoring Kubernetes cluster

  6. 6 Today’s theme How can you tell whether your cluster

    is working or broken?
  7. CONTEXT ABOUT OUR KUBERNETES CLUSTER

  8. Current Status

  9. 9 Current Status 200+ engineers 100+ microservices 8 members in the

    platform team
  10. 10 Current Status 100+ namespaces

  11. 11 Current Status 100+ k8s services

  12. 12 Current Status 2K+ pods

  13. 13 Current Status 2K+ containers

  14. Responsibility Boundary

  15. 15 Responsibility boundary k8s nodes pods pods pods k8s master

  16. 16 Responsibility boundary Platform’s responsibility k8s nodes pods Developer’s responsibility

    pods pods boundary k8s master
  17. 17 Responsibility boundary Platform’s responsibility k8s nodes boundary pods Developer’s

    responsibility pods pods k8s master GKE’s responsibility boundary
  18. The Work and Resource metrics or The RED and USE

    method
  19. 19 The Work (RED) metrics Throughput Success Error Performance

  20. 20 The Work (RED) metrics indicate the top-level health of

    your system
  21. 21 The Work (RED) metrics indicate whether your system is working

    or broken
  22. 22 Web server’s example Throughput - requests per second (e.g.

    100 req/s) Success - % of responses that are 2xx (e.g. 99.9%) Error - % of responses that are 5xx (e.g. 0.01%) Performance - 90th percentile response time (e.g. 200ms)
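
The four work metrics in the web server example can be derived from a plain list of request records. The following is a minimal sketch, not something taken from the talk: the `Request` type, the `red_metrics` function, and the nearest-rank percentile method are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code of the response
    duration_s: float  # response time in seconds


def red_metrics(requests, window_s):
    """Compute work (RED) metrics for requests observed over window_s seconds."""
    total = len(requests)
    if total == 0 or window_s <= 0:
        raise ValueError("need at least one request and a positive window")

    success = sum(1 for r in requests if 200 <= r.status < 300)
    errors = sum(1 for r in requests if 500 <= r.status < 600)

    # Nearest-rank 90th percentile: the value at rank ceil(0.9 * total),
    # computed with integer arithmetic to avoid float rounding surprises.
    durations = sorted(r.duration_s for r in requests)
    rank = (9 * total + 9) // 10
    p90 = durations[rank - 1]

    return {
        "throughput_rps": total / window_s,
        "success_pct": 100.0 * success / total,
        "error_pct": 100.0 * errors / total,
        "p90_s": p90,
    }
```

Note that success and error percentages need not add up to 100%, because 3xx and 4xx responses fall into neither bucket.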
  23. 23 The Resource (USE) metrics Utilization Saturation Errors Availability

  24. 24 The Resource (USE) metrics indicate the low-level health of

    your system
  25. 25 Web server’s example Web server CPU Mem Disk Network

    DB server The web server depends on these resources
  26. 26 Web server’s example Web server CPU Mem Disk Network

    DB server CPU Mem Disk Network The DB server’s resources
  27. 27 Web server’s example Utilization - Disk usage (e.g. 43%)

    Saturation - Memory swap usage (e.g. 131MB) Errors - 5xx errors from upstream services (e.g. 50 errors/sec) Availability - % of time the DB is reachable (e.g. 99.9%)
  28. WORK METRICS FOR KUBERNETES CLUSTER

  29. 29 Remember today’s theme How can you tell whether

    your cluster is working or broken?
  30. What are work metrics for the Cluster?

  31. 31 What is Kubernetes cluster’s job?

  32. 32 Kubernetes’ job is orchestration Are there any metrics which

    indicate that a Kubernetes cluster is orchestrating properly?
  33. 33 Metrics we monitor for checking the orchestration

  34. 34 Monitoring unavailable pods

  35. 35 Metrics we monitor for checking the orchestration (Cluster level

    work metrics)
  36. 36 Monitoring unavailable pods If there are no unavailable pods,

    we can at least say the Kubernetes cluster is orchestrating properly
  37. 37 Monitoring unavailable pods However, unavailable pods can be caused

    not only by the cluster, but also by Kubernetes users’ misconfiguration, customer traffic, and GCP failures.
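
One way to count unavailable pods is to sum the `status.unavailableReplicas` field over Deployment objects. The sketch below assumes the input is a list of dictionaries shaped like Deployment objects as returned by the Kubernetes API in JSON form; the function names are hypothetical, and the talk does not say which metric or tool was actually used for this monitor.

```python
def unavailable_by_namespace(deployments):
    """Sum status.unavailableReplicas per namespace.

    `deployments` is assumed to be a list of dicts shaped like Kubernetes
    Deployment objects. The API omits unavailableReplicas when it is zero,
    so a missing field is treated as 0.
    """
    totals = {}
    for d in deployments:
        namespace = d["metadata"]["namespace"]
        count = d.get("status", {}).get("unavailableReplicas", 0) or 0
        if count:
            totals[namespace] = totals.get(namespace, 0) + count
    return totals


def total_unavailable(deployments):
    """Cluster-wide number of unavailable replicas."""
    return sum(unavailable_by_namespace(deployments).values())
```

Grouping by namespace is only a rough triage aid under this sketch's assumptions: the count alone cannot distinguish a cluster problem from a user misconfiguration, which is the limitation the slide points out.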
  38. 38 Metrics we monitor for checking the orchestration

  39. 39 What are work metrics for Kubernetes cluster? Monitoring only

    unavailable pods is not enough. What should we do?
  40. 40 What are work metrics for Kubernetes cluster? Similar to

    a web server, can we use Kubernetes API server throughput, success, error and duration as work metrics?
  41. 41 What are work metrics for Kubernetes cluster? Similar to

    a web server, can we use Kubernetes API server throughput, success, error and duration as work metrics? Yes, but it’s not enough.
  42. 42 What are work metrics for Kubernetes cluster? Since a Kubernetes

    cluster is a distributed system, we need to monitor each component’s work metrics.
  43. Kubernetes Components

  44. 44 Kubernetes Components Master Components Node Components Addons

  45. 45 Kubernetes Components kube-apiserver etcd kube-scheduler kube-controller-manager Master Components

  46. 46 Kubernetes Components kubelet kube-proxy Docker Node Components

  47. 47 Kubernetes Components kube-dns cluster-level monitoring cluster-level logging Addons

  48. Kubernetes Master Components

  49. 49 Remember the boundaries Platform’s responsibility k8s nodes boundary pods

    Developer’s responsibility pods pods k8s master GKE’s responsibility boundary
  50. 50 Kubernetes Master Components Since master components are managed by

    GKE, we don’t need to (and can’t) monitor them ourselves
  51. 51 What are work metrics for Kubernetes cluster? Since a Kubernetes

    cluster is a distributed system, we need to monitor each component’s work metrics.
  52. 52 Metrics about master components we have

  53. Kubernetes Node Components

  54. 54 kubelet work metrics

  55. 55 kubelet work metrics kubelet’s error rate can be increased

    by users’ misconfiguration, so we don’t use a tight threshold (we use 1% as the threshold for now).
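
The 1% threshold can be expressed as a simple ratio check. This is a sketch under assumptions: `errors` and `operations` are hypothetical counters for kubelet operations over some evaluation window, and the talk does not specify which kubelet metrics feed the real monitor.

```python
KUBELET_ERROR_THRESHOLD = 0.01  # 1%, the threshold mentioned on the slide


def error_rate(errors, operations):
    """Fraction of operations that failed; 0.0 when nothing was observed."""
    if operations <= 0:
        return 0.0
    return errors / operations


def should_alert(errors, operations, threshold=KUBELET_ERROR_THRESHOLD):
    """Alert only when the error rate is strictly above the threshold."""
    return error_rate(errors, operations) > threshold
```

A strict comparison keeps a rate sitting exactly at the threshold from firing, which fits the stated intent of not alerting too eagerly on errors that users can trigger themselves.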
  56. 56 kube-proxy work metrics

  57. 57 kube-proxy work metrics We have them in the dashboard

    but we don’t use them actively, since kube-proxy metrics are not reliable enough to alert on. The main reason is the kube-proxy metrics integration between Prometheus and Datadog.
  58. Kubernetes Addons

  59. 59 kube-dns work metrics As with kube-proxy, we don’t

    use them actively since they are not reliable.
  60. 60 kube-dns work metrics But sometimes kube-dns causes issues in

    the cluster, so we plan to migrate to CoreDNS or to monitor kube-dns with a custom tool
  61. 61 cluster-level monitoring Skipping this due to time constraints,

    but we have a dedicated dashboard and monitors for cluster-level monitoring (Datadog Agent)
  62. 62 cluster-level logging Skipping this due to time constraints,

    but we have a dedicated dashboard and monitors for cluster-level logging (Stackdriver Logging Agent and Datadog Agent)
  63. RESOURCE METRICS FOR KUBERNETES CLUSTER

  64. Kubernetes Node Components

  65. 65 Cluster level resource metrics

  66. 66 Cluster level resource metrics

  67. 67 Node level resource metrics Similarly, we look at disk and

    network usage
  68. 68 Cluster level resource metrics See the Kubernetes nodes as

    one big machine
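
Treating the nodes as one big machine amounts to summing capacity and usage across nodes before computing utilization. The sketch below is illustrative only: the `Node` type and its fields are hypothetical, and in practice the numbers would come from whatever monitoring agent collects node metrics.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity_cores: float
    cpu_used_cores: float
    mem_capacity_bytes: int
    mem_used_bytes: int


def cluster_utilization(nodes):
    """Aggregate per-node figures into cluster-level CPU and memory utilization (%)."""
    cpu_capacity = sum(n.cpu_capacity_cores for n in nodes)
    mem_capacity = sum(n.mem_capacity_bytes for n in nodes)
    if cpu_capacity <= 0 or mem_capacity <= 0:
        raise ValueError("cluster has no capacity to measure against")

    cpu_used = sum(n.cpu_used_cores for n in nodes)
    mem_used = sum(n.mem_used_bytes for n in nodes)
    return {
        "cpu_pct": 100.0 * cpu_used / cpu_capacity,
        "mem_pct": 100.0 * mem_used / mem_capacity,
    }
```

Summing before dividing weights each node by its size, so one small saturated node does not dominate the cluster figure. That node would still show up in the node-level resource metrics.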
  69. 69 Node level resource metrics

  70. 70 Node level resource metrics

  71. 71 kubelet resource metrics (availability)

  72. 72 kubelet resource metrics (availability)

  73. 73 Investigation Cluster level work metrics CPU Mem Disk Network

    Cluster level resource metrics Node level resource metrics CPU Mem Disk Network
  74. 74 Investigation Cluster level work metrics CPU Mem Disk Network

    kubelet kube-dns CPU Mem Disk Network
  75. RECAP

  76. 76 Recap Define responsibility boundaries first Work and Resource Metrics

    (RED & USE) Monitor each component as far as possible