Kubernetes Cluster Monitoring

Talk at Mercari Microservices Platform Meetup #2
https://connpass.com/event/128017/

Explained how we monitor a Kubernetes cluster itself instead of application pods running on it.

Seigo Uchida

May 22, 2019

Transcript

  1. (Slide 2) About me: @spesnova, Software Engineer on the microservices platform team at Mercari; Kubernetes Tokyo Community organizer.
  2. (Slide 17) Responsibility boundaries: developers are responsible for the application pods, the platform team for the k8s nodes, and GKE for the k8s master.
  3. (Slide 22) Web server example, work metrics: Throughput - requests per second (e.g. 100 req/s); Success - % of responses that are 2xx (e.g. 99.9%); Error - % of responses that are 5xx (e.g. 0.01%); Performance - 90th-percentile response time in seconds (e.g. 200 ms).
  4. (Slide 25) Web server example: the web server depends on its resources (CPU, memory, disk, network) and on the DB server.
  5. (Slide 26) Web server example: the DB server, in turn, has its own resources (CPU, memory, disk, network).
  6. (Slide 27) Web server example, resource metrics: Utilization - disk usage (e.g. 43%); Saturation - memory swap usage (e.g. 131 MB); Errors - 5xx errors from upstream services (e.g. 50 errors/sec); Availability - % of time the DB is reachable (e.g. 99.9%).
  7. (Slide 29) Remember today's theme: how do you tell whether your cluster is working or broken?
  8. (Slide 32) Kubernetes' job is orchestration. Are there any metrics that indicate a Kubernetes cluster is orchestrating properly?
  9. (Slide 36) Monitoring unavailable pods: if there are no unavailable pods, we can at least say the Kubernetes cluster is orchestrating properly.
  10. (Slide 37) Monitoring unavailable pods: however, unavailable pods can be caused not only by the cluster itself, but also by users' misconfiguration, customer traffic, and GCP failures.
  11. (Slide 39) What are the work metrics for a Kubernetes cluster? Monitoring unavailable pods alone is not enough. What should we do?
  12. (Slide 40) What are the work metrics for a Kubernetes cluster? As with a web server, can we use the Kubernetes API server's throughput, success, error, and duration as work metrics?
  13. (Slide 41) What are the work metrics for a Kubernetes cluster? As with a web server, can we use the Kubernetes API server's throughput, success, error, and duration as work metrics? Yes, but it's not enough.
  14. (Slide 42) What are the work metrics for a Kubernetes cluster? Since a Kubernetes cluster is a distributed system, we need to monitor each component's work metrics.
  15. (Slide 49) Remember the boundaries: developers are responsible for the application pods, the platform team for the k8s nodes, and GKE for the k8s master.
  16. (Slide 50) Kubernetes master components: since the master components are managed by GKE, we don't need to (and can't) monitor them ourselves.
  17. (Slide 51) What are the work metrics for a Kubernetes cluster? Since a Kubernetes cluster is a distributed system, we need to monitor each component's work metrics.
  18. (Slide 55) kubelet work metrics: kubelet's error rate can be increased by users' misconfiguration, so we don't use a tight threshold (we use 1% as the threshold for now).
  19. (Slide 57) kube-proxy work metrics: we have them on the dashboard, but we don't use them actively, since the kube-proxy metrics are not reliable enough to alert on. The main reason is the kube-proxy metrics integration between Prometheus and Datadog.
  20. (Slide 59) kube-dns work metrics: as with kube-proxy, we don't use them actively, since they are not reliable.
  21. (Slide 60) kube-dns work metrics: but kube-dns sometimes causes issues in the cluster, so we plan to migrate it to CoreDNS or monitor it somehow with a tool of our own.
  22. (Slide 61) Cluster-level monitoring: skipping this due to the time limit, but we have a dedicated dashboard and monitors for cluster-level monitoring (Datadog Agent).
  23. (Slide 62) Cluster-level logging: skipping this due to the time limit, but we have a dedicated dashboard and monitors for cluster-level logging (Stackdriver Logging Agent and Datadog Agent).
  24. (Slide 73) Investigation: start from cluster-level work metrics, then drill down into cluster-level resource metrics (CPU, memory, disk, network) and node-level resource metrics (CPU, memory, disk, network).
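
The four web-server work metrics from slide 22 (throughput, success, error, performance) can be made concrete as a small calculation over request samples. This is an illustrative sketch, not part of the talk; the sample data and function name are made up.

```python
# Hypothetical sketch: computing the four work metrics from slide 22
# (throughput, success, error, performance) over request samples.
import math

def work_metrics(requests, window_seconds):
    """Each request is a (status_code, response_time_seconds) tuple."""
    total = len(requests)
    throughput = total / window_seconds                                 # req/s
    success = sum(1 for s, _ in requests if 200 <= s < 300) / total * 100
    error = sum(1 for s, _ in requests if 500 <= s < 600) / total * 100
    latencies = sorted(t for _, t in requests)
    # 90th-percentile response time ("performance" in the talk)
    p90 = latencies[min(total - 1, math.ceil(0.9 * total) - 1)]
    return {"throughput": throughput, "success": success,
            "error": error, "p90": p90}

# 10 requests over a 1-second window: nine fast 2xx, one slow 5xx
samples = [(200, 0.05)] * 9 + [(500, 0.3)]
print(work_metrics(samples, window_seconds=1.0))
```

With these samples the sketch reports 10 req/s throughput, 90% success, 10% error, and a 0.05 s p90, matching the shape of the examples on the slide.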
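
The "unavailable pods" check from slides 36-37 could be sketched as a pure function over deployment statuses. The field name mirrors the Kubernetes Deployment `status.unavailableReplicas` field, but the data here is hand-made for illustration, not fetched from a cluster.

```python
# Hypothetical sketch of slide 36: if no deployment has unavailable
# replicas, we can at least say the cluster is orchestrating properly.

def unavailable_pods(deployments):
    """Return {name: count} for deployments with unavailable replicas."""
    return {
        d["name"]: d["unavailable_replicas"]
        for d in deployments
        if d.get("unavailable_replicas", 0) > 0
    }

statuses = [
    {"name": "web", "unavailable_replicas": 0},
    {"name": "api", "unavailable_replicas": 2},  # e.g. failing readiness probes
]
print(unavailable_pods(statuses))  # {'api': 2}
```

As slide 37 notes, a non-empty result does not by itself blame the cluster: misconfiguration, traffic spikes, or a GCP failure can produce the same signal.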
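
The deliberately loose kubelet threshold from slide 55 can be sketched as a trivial alert condition. The 1% figure is from the talk; the function and its signature are hypothetical.

```python
# Hypothetical sketch of slide 55's alert rule: kubelet's error rate
# can rise from users' misconfiguration, so the talk uses a loose 1%
# threshold rather than a tight one.

KUBELET_ERROR_RATE_THRESHOLD = 0.01  # 1%, the value mentioned in the talk

def should_alert(error_count, total_count,
                 threshold=KUBELET_ERROR_RATE_THRESHOLD):
    if total_count == 0:
        return False  # no requests observed, nothing to judge
    return error_count / total_count > threshold

print(should_alert(5, 1000))   # 0.5% -> False
print(should_alert(20, 1000))  # 2%   -> True
```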