
Kubernetes Cluster Monitoring

Talk at Mercari Microservices Platform Meetup #2
https://connpass.com/event/128017/

Explained how we monitor the Kubernetes cluster itself rather than the application pods running on it.

Seigo Uchida

May 22, 2019

Transcript

  1. Mercari Meetup for Microservices Platform #2, May 22, 2019
    Kubernetes Cluster Monitoring


  2. 2
    About me
    @spesnova
    Software Engineer, microservices platform team at Mercari
    Kubernetes Tokyo Community Organizer


  3. 3
    Today’s theme
    Monitoring
    Kubernetes pods


  4. 4
    Today’s theme
    Monitoring
    Kubernetes pods


  5. 5
    Today’s theme
    Monitoring
    Kubernetes cluster


  6. 6
    Today’s theme
    How do you make sure whether
    your cluster is working or broken?


  7. CONTEXT
    ABOUT OUR KUBERNETES CLUSTER


  8. Current Status


  9. 9
    Current Status
    200+
    engineers
    100+
    microservices
    8 members in
    the platform team


  10. 10
    Current Status
    100+
    namespaces


  11. 11
    Current Status
    100+
    k8s services


  12. 12
    Current Status
    2K+
    pods


  13. 13
    Current Status
    2K+
    containers


  14. Responsibility Boundary


  15. 15
    Responsibility boundary
    (diagram: pods running on k8s nodes, alongside the k8s master)


  16. 16
    Responsibility boundary
    (diagram: the pods are the developers’ responsibility; the k8s nodes and the k8s master are the platform team’s responsibility, with the boundary between them)


  17. 17
    Responsibility boundary
    (diagram: the pods are the developers’ responsibility, the k8s nodes are the platform team’s responsibility, and the k8s master is GKE’s responsibility)


  18. The Work and Resource metrics
    or The RED and USE method


  19. 19
    The work (RED) metrics
    Throughput
    Success
    Error
    Performance


  20. 20
    The Work (RED) metrics
    indicate the top-level health of your system


  21. 21
    The Work (RED) metrics
    indicate whether your system is working or broken


  22. 22
    Web server’s example
    Throughput - requests per second (e.g. 100 req/s)
    Success - % of responses that are 2xx (e.g. 99.9%)
    Error - % of responses that are 5xx (e.g. 0.01%)
    Performance - 90th percentile response time (e.g. 200ms)
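
    As a rough illustration (not from the deck), the minimal Python sketch below derives these four numbers from raw (status code, latency) samples; the sample data and the 60-second window are made-up assumptions.

```python
# Minimal sketch: derive the work (RED) metrics from raw response samples.
# The samples and the 60-second window below are illustrative assumptions.
samples = [(200, 0.12), (200, 0.08), (503, 0.95), (200, 0.20)]  # (status, seconds)
window_seconds = 60

throughput = len(samples) / window_seconds                             # req/s
success = sum(1 for s, _ in samples if 200 <= s < 300) / len(samples)  # % of 2xx
error = sum(1 for s, _ in samples if s >= 500) / len(samples)          # % of 5xx
latencies = sorted(l for _, l in samples)
p90 = latencies[int(0.9 * (len(latencies) - 1))]                       # 90th percentile

print(f"throughput={throughput:.2f} req/s  success={success:.1%}  "
      f"error={error:.1%}  p90={p90 * 1000:.0f}ms")
```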


  23. 23
    The Resource (USE) metrics
    Utilization
    Saturation
    Errors
    Availability


  24. 24
    The Resource (USE) metrics
    indicate the low-level health of your system


  25. 25
    Web server’s example
    (diagram: the web server depends on CPU, Mem, Disk, Network, and a DB server)


  26. 26
    Web server’s example
    (diagram: the DB server, in turn, has its own resources: CPU, Mem, Disk, Network)


  27. 27
    Web server’s example
    Utilization - Disk usage (e.g. 43%)
    Saturation - Memory swap usage (e.g. 131 MB)
    Errors - 5xx errors from upstream services (e.g. 50 errors/sec)
    Availability - % of time the DB is reachable (e.g. 99.9%)
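
    As a hedged sketch only (the deck does not prescribe any tooling), the utilization and saturation side of this can be sampled on a single host with Python's psutil; errors and availability are usually measured against the dependency itself.

```python
# Sketch: sample the Utilization/Saturation side of the USE metrics on one host.
# Using psutil is an assumption here, not something the deck recommends.
import psutil

utilization = psutil.disk_usage("/").percent           # Utilization: disk usage in %
saturation = psutil.swap_memory().used // (1024 ** 2)  # Saturation: swap used, in MB
# Errors and Availability are typically measured against the dependency itself,
# e.g. upstream 5xx counts and DB reachability checks.
print(f"disk={utilization}%  swap={saturation}MB")
```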


  28. WORK METRICS
    FOR KUBERNETES CLUSTER


  29. 29
    Remember today’s theme
    How do you make sure whether
    your cluster is working or broken?


  30. What are the work metrics for the cluster?


  31. 31
    What is a Kubernetes cluster’s job?


  32. 32
    Kubernetes’ job is orchestration
    Are there any metrics that indicate
    a Kubernetes cluster is orchestrating properly?


  33. 33
    Metrics we monitor for checking the orchestration


  34. 34
    Monitoring unavailable pods


  35. 35
    Metrics we monitor for checking the orchestration (Cluster level work metrics)


  36. 36
    Monitoring unavailable pods
    If there are no unavailable pods, we can at least say
    the Kubernetes cluster is orchestrating properly
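
    A minimal sketch of such a check using the official Kubernetes Python client follows; the client choice and the per-Deployment view are assumptions, since the deck does not show the exact query the team uses.

```python
# Sketch: count unavailable pods per Deployment through the Kubernetes API.
# The official Python client is an assumption; the deck does not show its tooling.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when run inside a pod
apps = client.AppsV1Api()

unavailable = {}
for dep in apps.list_deployment_for_all_namespaces().items:
    count = dep.status.unavailable_replicas or 0  # None means everything is available
    if count:
        unavailable[f"{dep.metadata.namespace}/{dep.metadata.name}"] = count

# An empty result means the cluster is, at least, orchestrating properly.
print(unavailable or "no unavailable pods")
```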


  37. 37
    Monitoring unavailable pods
    However, unavailable pods can be caused not only by the cluster itself,
    but also by users’ misconfiguration, customer traffic,
    and GCP failures.


  38. 38
    Metrics we monitor for checking the orchestration


  39. 39
    What are the work metrics for a Kubernetes cluster?
    Monitoring only unavailable pods is not enough.
    What should we do?


  40. 40
    What are the work metrics for a Kubernetes cluster?
    Similar to a web server, can we use the Kubernetes API server’s
    throughput, success, error, and duration as work metrics?


  41. 41
    What are the work metrics for a Kubernetes cluster?
    Similar to a web server, can we use the Kubernetes API server’s
    throughput, success, error, and duration as work metrics?
    Yes, but it’s not enough.
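
    For context, upstream kube-apiserver exposes Prometheus metrics that roughly map onto these dimensions. The metric names below follow recent upstream Kubernetes and vary by version, so treat the queries as an assumption rather than what the platform team actually runs.

```python
# Sketch: a rough mapping from the RED dimensions to kube-apiserver Prometheus
# metrics. Metric names differ across Kubernetes versions; these are assumptions.
apiserver_work_metrics = {
    "throughput": 'sum(rate(apiserver_request_total[5m]))',
    "error": 'sum(rate(apiserver_request_total{code=~"5.."}[5m]))',
    "duration": ('histogram_quantile(0.9, '
                 'sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))'),
}
for dimension, promql in apiserver_work_metrics.items():
    print(f"{dimension}: {promql}")
```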


  42. 42
    What are the work metrics for a Kubernetes cluster?
    Since a Kubernetes cluster is a distributed system,
    we need to monitor each component’s work metrics.


  43. Kubernetes Components


  44. 44
    Kubernetes Components
    Master Components
    Node Components
    Addons


  45. 45
    Kubernetes Components
    kube-apiserver
    etcd
    kube-scheduler
    kube-controller-manager
    Master Components


  46. 46
    Kubernetes Components
    kubelet
    kube-proxy
    Docker
    Node Components


  47. 47
    Kubernetes Components
    kube-dns
    cluster-level monitoring
    cluster-level logging
    Addons


  48. Kubernetes Master Components


  49. 49
    Remember the boundaries
    (diagram, same boundaries as before: the pods are the developers’ responsibility, the k8s nodes are the platform team’s responsibility, and the k8s master is GKE’s responsibility)


  50. 50
    Kubernetes Master Components
    Since the master components are managed by GKE,
    we don’t need to (and can’t) monitor them ourselves


  51. 51
    What are the work metrics for a Kubernetes cluster?
    Since a Kubernetes cluster is a distributed system,
    we need to monitor each component’s work metrics.


  52. 52
    Metrics we have about the master components


  53. Kubernetes Node Components


  54. 54
    kubelet work metrics


  55. 55
    kubelet work metrics
    kubelet’s error rate can be increased by users’
    misconfiguration, so we don’t use a tight threshold
    (we use 1% as the threshold for now)
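
    As a toy illustration of that loose threshold (the function and inputs are hypothetical; the real alert lives in the team's monitoring stack):

```python
# Toy sketch of the loose 1% error-rate threshold mentioned above.
# The counters would come from kubelet metrics; the numbers here are made up.
ERROR_RATE_THRESHOLD = 0.01  # deliberately loose: user misconfiguration also
                             # drives kubelet errors

def kubelet_error_alert(error_count: int, total_count: int) -> bool:
    """Return True when the kubelet error rate exceeds the loose threshold."""
    if total_count == 0:
        return False
    return error_count / total_count > ERROR_RATE_THRESHOLD

print(kubelet_error_alert(error_count=7, total_count=1200))  # False: ~0.6% < 1%
```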


  56. 56
    kube-proxy work metrics


  57. 57
    kube-proxy work metrics
    We have them on the dashboard, but we don’t use them
    actively, since kube-proxy metrics are not reliable enough
    to set alerts on. The main reason is the kube-proxy metrics
    integration between Prometheus and Datadog.


  58. Kubernetes Addons


  59. 59
    kube-dns work metrics
    As with kube-proxy, we don’t use them actively since
    they are not reliable.


  60. 60
    kube-dns work metrics
    However, kube-dns sometimes causes issues in the cluster, so
    we plan to migrate to CoreDNS or to monitor it somehow
    by building our own tool.


  61. 61
    cluster-level monitoring
    Skipping this due to time limitations, but we have a
    dedicated dashboard and monitors for cluster-level
    monitoring (the Datadog Agent)


  62. 62
    cluster-level logging
    Skipping this due to time limitations, but we have a
    dedicated dashboard and monitors for cluster-level
    logging (the Stackdriver Logging Agent and the Datadog Agent)


  63. RESOURCE METRICS
    FOR KUBERNETES CLUSTER


  64. Kubernetes Node Components


  65. 65
    Cluster level resource metrics


  66. 66
    Cluster level resource metrics


  67. 67
    Node level resource metrics
    Similarly, we look at disk and network usage


  68. 68
    Cluster level resource metrics
    See the Kubernetes nodes as
    one big machine
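
    A minimal sketch of that "one big machine" view, summing allocatable CPU and memory across nodes with the official Kubernetes Python client; the client choice and the simplified quantity parsing are assumptions.

```python
# Sketch: treat the node pool as one big machine by summing allocatable CPU and
# memory across all nodes. The quantity parsing below is deliberately simplified.
from kubernetes import client, config

def cpu_cores(quantity: str) -> float:
    # Handles "2" and "940m"; other suffixes are out of scope for this sketch.
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

def mem_gib(quantity: str) -> float:
    units = {"Ki": 2 ** 10, "Mi": 2 ** 20, "Gi": 2 ** 30}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[:-2]) * factor / 2 ** 30
    return float(quantity) / 2 ** 30  # plain bytes

config.load_kube_config()
nodes = client.CoreV1Api().list_node().items
total_cpu = sum(cpu_cores(n.status.allocatable["cpu"]) for n in nodes)
total_mem = sum(mem_gib(n.status.allocatable["memory"]) for n in nodes)
print(f"{len(nodes)} nodes as one big machine: {total_cpu:.1f} cores, {total_mem:.1f} GiB")
```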


  69. 69
    Node level resource metrics


  70. 70
    Node level resource metrics


  71. 71
    kubelet resource metrics (availability)


  72. 72
    kubelet resource metrics (availability)


  73. 73
    Investigation
    (diagram: drill down from the cluster-level work metrics into the cluster-level and node-level resource metrics: CPU, Mem, Disk, Network)


  74. 74
    Investigation
    (diagram: drill down from the cluster-level work metrics into the component work metrics, such as kubelet and kube-dns, and their resource metrics: CPU, Mem, Disk, Network)


  75. RECAP


  76. 76
    Recap
    Define responsibility boundaries first
    Work and Resource metrics (RED & USE)
    Monitor each component as much as possible

