
The Kubernetes resource management and the behind systems in Mercari

sanposhiho
February 26, 2023


Transcript

  1. 1 Confidential. The Kubernetes resource management and the behind systems in Mercari. Kensei Nakada / @sanposhiho (Platform Infra)
  2. 2 Kensei Nakada: Mercari JP Platform Infra team, Kubernetes upstream reviewer (SIG-Scheduling), Kubernetes Contributor Award 2022 winner

  3. 3 Agenda: 01 Cluster size management / 02 Workload size management (Horizontal) / 03 Workload size management (Vertical) / 04 Our next steps
  4. 5 Mercari's cluster overview - Mercari is the largest marketplace app in Japan. - Mercari is composed of more than 300 microservices. - Almost all microservices run on a single cluster. - The platform team is the cluster admin, and each application team runs its services on the cluster we provide.
  5. 7

  6. 9 We probably want to change the Pod placement for cost reduction. But… what if new Pods come after we reduce the number of Nodes?
  7. 10 But… what if new Pods come after reducing Nodes? We need to create a new Node for the new Pods.
  8. 12 Cluster Autoscaler (scaling up): it checks whether the cluster has any unschedulable Pods. ↓ If it does, it creates a new Node so that they can be scheduled on it.
  9. 13 Cluster Autoscaler (scaling down): it checks whether any Nodes are underutilized. If all Pods on an underutilized Node can be evicted, it evicts them and eventually deletes the Node.
  10. 14 Trade-off: cost 💴 vs reliability 🛡. The more Nodes the cluster has -> the more resilient the system is against Node failure 🛡 -> BUT the more money we need to pay 💸. We want to reduce cost while keeping reliability.
  11. 15 GKE's Autoscaling profiles: GKE provides options (💴 / 🛡) that decide: - how Cluster Autoscaler deletes Nodes during scale-down: aggressive 💴 vs conservative 🛡 - how the scheduler places Pods: prefer highly utilized Nodes 💴 or low-utilized Nodes 🛡
  12. 16 GKE's Autoscaling profiles: we chose the 💴 option (the optimize-utilization profile): - Cluster Autoscaler deletes Nodes aggressively 💴 during scale-down. - The scheduler prefers highly utilized Nodes 💴.
  13. 17 How to keep the reliability 🛡: introducing overprovisioning Pods - Pods with very low priority. - They get preempted when other Pods cannot otherwise be scheduled. - So they become unschedulable instead, and Cluster Autoscaler notices the demand for scaling up and adds Nodes.
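A minimal sketch of this overprovisioning pattern, assuming a hypothetical `overprovisioning` Deployment of pause containers; the priority value, replica count, and resource requests are placeholders to tune per cluster:

```yaml
# Low-priority class for placeholder Pods; any normal Pod can preempt them.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Priority class for overprovisioning (placeholder) Pods."
---
# Placeholder Pods that reserve spare capacity. When they are preempted, they
# become unschedulable, which makes Cluster Autoscaler add a Node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 3                    # amount of headroom to keep
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
```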
  14. 18 Future plan: we cannot just increase the number of overprovisioning Pods. The more overprovisioning Pods, the more cost they incur.
  15. 19 Future plan: we cannot just increase the number of overprovisioning Pods. The more overprovisioning Pods, the more cost they incur. → We need to predict demand at each point in time and adjust the number of overprovisioning Pods accordingly.
  16. 21

  17. 25 Automated way: HorizontalPodAutoscaler. HPA scales replicas up/down based on resource utilization; we only need to define the desired utilization. HPA keeps changing the replicas so that the average utilization stays at 60%.
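As an illustration, an HPA targeting 60% average CPU utilization looks roughly like this (the Deployment name and replica bounds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hoge-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # HPA keeps adjusting replicas toward this average
```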
  18. 26 HorizontalPodAutoscaler in Mercari: most of our big workloads are managed by HPAs.
  19. 27 Multi-container Pods: we use Istio to build a service mesh, so some Pods have two containers: the primary container and the sidecar container.
  20. 28 HPA for multi-container Pods: Resource-type metrics calculate resource utilization as the average across all containers in the Pod.
  21. 29 HPA for multi-container Pods: Resource-type metrics calculate resource utilization as the average across all containers in the Pod. -> … what if only one container's utilization goes up?
  22. 30 The past problem with HPA for multi-container Pods. Let's say: - container A: current utilization 10% - container B: current utilization 90% - Thus, the average is 50%. We need to scale up in this situation, BUT HPA won't, because the average is still low.
  23. 31 The container-based HPA: to deal with such situations, HPA introduced a metric type (ContainerResource) that refers to each individual container's resource utilization.
  24. 32 The container-based HPA: to deal with such situations, HPA introduced a metric type (ContainerResource) that refers to each individual container's resource utilization. But it has been stuck in alpha since v1.20…
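With the ContainerResource metric type (behind the HPAContainerMetrics feature gate while it is alpha), an HPA can target a single container's utilization instead of the Pod-wide average. A sketch, with the Deployment and container names as placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hoge-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application      # scale on this container only, ignoring the sidecar
      target:
        type: Utilization
        averageUtilization: 60
```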
  25. 33 The beta graduation will hopefully be done in v1.27. We're working on the beta graduation so that it becomes available in our cluster as soon as possible.
  26. 34 Then, what's the alternative for now? We use DatadogMetric: we define a Datadog query in the DatadogMetric CRD, and HPA refers to it as an external metric. -> This lets HPA scale up/down based on the result of a query over the metric values in Datadog.
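A sketch of that setup, assuming the Datadog Cluster Agent runs as the external metrics provider with the DatadogMetric controller enabled; the query and all names here are illustrative, not the actual configuration:

```yaml
# A Datadog query exposed to the Kubernetes external metrics API.
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: hoge-requests
  namespace: hoge
spec:
  # Hypothetical request-rate metric; any valid Datadog query works here.
  query: sum:hoge.server.request.hits{env:production}.as_rate()
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hoge-hpa
  namespace: hoge
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: External
    external:
      metric:
        # DatadogMetric objects are referenced as "datadogmetric@<namespace>:<name>".
        name: datadogmetric@hoge:hoge-requests
      target:
        type: AverageValue
        averageValue: "100"      # e.g. keep roughly 100 requests/s per Pod
```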
  27. 38 Incident case: an upstream incident scaled down downstream services. - The upstream service died accidentally. - No traffic reached the downstream services during the incident. - The downstream services' resource utilization went down, so their HPAs scaled the workloads down.
  28. 39 Incident case: an upstream incident scaled down downstream services. When the upstream incident was resolved, the huge traffic came back and the downstream services were completely overwhelmed. Such unusual scale-down could happen in any service with HPA.
  29. 40 Setting a higher minReplicas? No. It would certainly solve this problem, but it results in wasted resources. 💸
  30. 41 Define a dynamic minimum replica count: we defined a DatadogMetric that refers to the replica count one week ago. ↓ This keeps HPA from reducing the replica count below ½ × last week's replica count.
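A sketch of this dynamic floor, with an illustrative query shape and placeholder names: the DatadogMetric returns last week's replica count, and an External metric with an AverageValue target of 2 makes the HPA compute desiredReplicas = ceil(lastWeekReplicas / 2) for it. Since the HPA always takes the highest result across its metrics, replicas never drop below roughly half of last week's count.

```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: hoge-replicas-last-week
  namespace: hoge
spec:
  # Replica count of the Deployment one week ago (query shape is illustrative).
  query: week_before(max:kubernetes_state.deployment.replicas_desired{kube_deployment:hoge,kube_namespace:hoge})
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hoge-hpa
  namespace: hoge
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  minReplicas: 3
  maxReplicas: 100
  metrics:
  # Normal utilization-based scaling.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  # Dynamic floor: this metric alone yields ceil(lastWeekReplicas / 2) replicas,
  # and the HPA uses the max across all metrics.
  - type: External
    external:
      metric:
        name: datadogmetric@hoge:hoge-replicas-last-week
      target:
        type: AverageValue
        averageValue: "2"
```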
  31. 43

  32. 47 Automated way: VerticalPodAutoscaler. It watches the Pods' utilization and stores it as historical data. ↓ It calculates the recommended resource amounts and changes the Pods' resource requests when the currently assigned resources and the recommendation differ significantly.
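A minimal VerticalPodAutoscaler sketch (names and bounds are placeholders); in "Auto" mode the VPA updater evicts Pods whose requests drift far from the recommendation, and the admission webhook applies the recommended requests when the Pods are re-created:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hoge-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  updatePolicy:
    updateMode: "Auto"            # "Off" only produces recommendations without applying them
  resourcePolicy:
    containerPolicies:
    - containerName: application
      controlledResources: ["cpu", "memory"]
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 4Gi
```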
  33. 48 Multidimensional Pod autoscaling: another option is to use VPA indirectly via Multidimensional Pod autoscaling (MPA) from GKE. MPA is roughly equivalent to using: - HPA for CPU - VPA for memory
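A rough sketch of a GKE MultidimPodAutoscaler that scales horizontally on CPU and vertically on memory. This is reconstructed from memory of GKE's documented example, so the API version and field names are assumptions to verify against the current GKE docs:

```yaml
apiVersion: autoscaling.gke.io/v1beta1
kind: MultidimPodAutoscaler
metadata:
  name: hoge-mpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  goals:
    metrics:
    - type: Resource
      resource:
        name: cpu                 # horizontal scaling on CPU utilization
        target:
          type: Utilization
          averageUtilization: 60
  constraints:
    global:
      minReplicas: 3
      maxReplicas: 100
    containerControlledResources: [ memory ]   # vertical scaling on memory
    container:
    - name: '*'
      requests:
        minAllowed:
          memory: 128Mi
        maxAllowed:
          memory: 4Gi
  policy:
    updateMode: Auto
```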
  34. 49 MPA and VPA in Mercari: we actually haven't used them yet. We're now running experiments to evaluate HPA, VPA, and MPA.
  35. 50 The resource recommender: we also provide a resource recommender Slack bot to developers (e.g., a recommendation message for the hoge deployment's app container).
  36. 51 How the resource recommender works: it fetches the VPA recommendation values from GKE's system metrics -> and uses the maximum VPA recommendation over the past 3 months.
  38. 54 Next step 1: trying autoscalers other than HPA. As shared, we're trying other autoscalers such as MPA and VPA. Many of our services are Go servers, and we'd like to provide guidance on a good default autoscaler for them.
  39. 55 Next step 2: HPA performance. Currently, our cluster has so many HPA objects that it easily exceeds GKE's recommendation. If the HPA controller takes longer to notice high resource utilization and scale up the workload, it badly affects reliability.
  40. 56 Next step 2: improve HPA performance. We're proposing a metrics feature for the upstream HPA controller.
  41. 57 Next step 2: improve HPA performance. The worst-case scenario for scaling up is when no Node is available. As shared, we'd like a way to adjust the number of overprovisioning Pods based on past resource demand.
  42. 58 Next step 3: HPA target utilization optimization. Currently, each application team is responsible for setting the target resource utilization in its HPAs, but we feel this should be automated. We're considering providing a way to automatically set (or at least recommend) the best target resource utilization, calculated from each service's past behavior.