
The Kubernetes resource management and the behind systems in Mercari

sanposhiho
February 26, 2023


Transcript

  1. 1 Confidential. The Kubernetes resource management and the behind systems in Mercari. Kensei Nakada / @sanposhiho (Platform Infra)
  2. 2 Kensei Nakada: Mercari JP Platform Infra team, Kubernetes upstream reviewer (SIG-Scheduling), Kubernetes Contributor Award 2022 winner

  3. 3 Agenda: 01 Cluster size management / 02 Workload size management (Horizontal) / 03 Workload size management (Vertical) / 04 Our next steps
  4. 5 Mercari's cluster overview - Mercari is the largest marketplace app in Japan. - Mercari is composed of more than 300 microservices. - Almost all microservices run on a single cluster. - The platform team is the cluster admin, and each application team runs its services on the cluster we provide.
  5. 7

  6. 9 We probably want to change the Pod placement for cost reduction. But… what if new Pods come after we reduce the number of Nodes?
  7. 10 But… what if new Pods come after reducing Nodes? We need to create a new Node for the new Pods.
  8. 12 Cluster Autoscaler (scaling up): it checks whether the cluster has any unschedulable Pods. ↓ If it does, it creates a new Node so that they can be scheduled on it.
  9. 13 Cluster Autoscaler (scaling down): it checks whether any Nodes are underutilized. If all Pods on an underutilized Node can be evicted, it evicts them and eventually deletes the Node.
  10. 14 Trade-off: cost 💴 vs reliability 🛡. The more Nodes the cluster has -> the more resilient the system is against Node failure 🛡 -> BUT the more money we need to pay 💸. We want to reduce cost while keeping reliability.
  11. 15 GKE's Autoscaling profiles: GKE provides options (💴 / 🛡) that decide: - how Cluster Autoscaler deletes Nodes during scale-down: aggressive 💴 vs conservative 🛡 - how the scheduler places Pods: prefer highly utilized Nodes 💴 or low-utilized Nodes 🛡
  12. 16 GKE's Autoscaling profiles: we chose the 💴 option (the optimize-utilization profile): - Cluster Autoscaler deletes Nodes aggressively 💴 during scale-down. - The scheduler prefers highly utilized Nodes 💴.
  13. 17 How to keep the reliability 🛡: introducing overprovisioning Pods - Pods with very low priority. - They get preempted when other Pods cannot otherwise be scheduled. - So they become unschedulable instead, and Cluster Autoscaler notices the demand for scaling up and adds Nodes.
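A minimal sketch of this overprovisioning pattern, assuming a hypothetical `overprovisioning` Deployment of pause containers; the priority value, replica count, and resource requests are placeholders to tune per cluster:

```yaml
# Low-priority class for placeholder Pods; any normal Pod can preempt them.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Priority class for overprovisioning (placeholder) Pods."
---
# Placeholder Pods that reserve spare capacity. When they are preempted, they
# become unschedulable, which makes Cluster Autoscaler add a Node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 3                    # amount of headroom to keep
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
```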
  14. 18 Future plan: we cannot just increase the number of overprovisioning Pods. The more overprovisioning Pods, the more cost they incur.
  15. 19 Future plan: we cannot just increase the number of overprovisioning Pods. The more overprovisioning Pods, the more cost they incur. → We need to predict demand at each point in time and adjust the number of overprovisioning Pods accordingly.
  16. 21

  17. 25 Automated way: HorizontalPodAutoscaler. HPA scales replicas up/down based on resource utilization; we only need to define the desired utilization. HPA keeps changing the replicas so that the average utilization stays at 60%.
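As an illustration, an HPA targeting 60% average CPU utilization looks roughly like this (the Deployment name and replica bounds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hoge-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # HPA keeps adjusting replicas toward this average
```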
  18. 26 HorizontalPodAutoscaler in Mercari: most of our big workloads are managed by HPAs.
  19. 27 Multi-container Pods: we use Istio to build a service mesh, so some Pods have two containers: the primary container and the sidecar container.
  20. 28 HPA for multi-container Pods: Resource-type metrics calculate resource utilization as the average across all containers in the Pod.
  21. 29 HPA for multi-container Pods: Resource-type metrics calculate resource utilization as the average across all containers in the Pod. -> … what if only one container's utilization goes up?
  22. 30 The past problem with HPA for multi-container Pods. Let's say: - container A: current utilization 10% - container B: current utilization 90% - Thus, the average is 50%. We need to scale up in this situation, BUT HPA won't, because the average is still low.
  23. 31 The container-based HPA: to deal with such situations, HPA introduced a metric type (ContainerResource) that refers to each individual container's resource utilization.
  24. 32 The container-based HPA: to deal with such situations, HPA introduced a metric type (ContainerResource) that refers to each individual container's resource utilization. But it has been stuck in alpha since v1.20…
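With the ContainerResource metric type (behind the HPAContainerMetrics feature gate while it is alpha), an HPA can target a single container's utilization instead of the Pod-wide average. A sketch, with the Deployment and container names as placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hoge-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application      # scale on this container only, ignoring the sidecar
      target:
        type: Utilization
        averageUtilization: 60
```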
  25. 33 The beta graduation will hopefully be done in v1.27. We're working on the beta graduation so that it becomes available in our cluster as soon as possible.
  26. 34 Then, what's the alternative for now? We use DatadogMetric: we define a Datadog query in the DatadogMetric CRD, and HPA refers to it as an external metric. -> This lets HPA scale up/down based on the result of a query over the metric values in Datadog.
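A sketch of that setup, assuming the Datadog Cluster Agent runs as the external metrics provider with the DatadogMetric controller enabled; the query and all names here are illustrative, not the actual configuration:

```yaml
# A Datadog query exposed to the Kubernetes external metrics API.
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: hoge-requests
  namespace: hoge
spec:
  # Hypothetical request-rate metric; any valid Datadog query works here.
  query: sum:hoge.server.request.hits{env:production}.as_rate()
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hoge-hpa
  namespace: hoge
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: External
    external:
      metric:
        # DatadogMetric objects are referenced as "datadogmetric@<namespace>:<name>".
        name: datadogmetric@hoge:hoge-requests
      target:
        type: AverageValue
        averageValue: "100"      # e.g. keep roughly 100 requests/s per Pod
```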
  27. 38 Incident case: an upstream incident scaled down downstream services. - The upstream service died accidentally. - No traffic reached the downstream services during the incident. - The downstream services' resource utilization went down, so their HPAs scaled the workloads down.
  28. 39 Incident case: an upstream incident scaled down downstream services. When the upstream incident was resolved, the huge traffic came back and the downstream services were completely overwhelmed. Such unusual scale-down could happen in any service with HPA.
  29. 40 Setting a higher minReplicas? No. It would certainly solve this problem, but it results in wasted resources. 💸
  30. 41 Define a dynamic minimum replica count: we defined a DatadogMetric that refers to the replica count one week ago. ↓ This keeps HPA from reducing the replica count below ½ × last week's replica count.
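A sketch of this dynamic floor, with an illustrative query shape and placeholder names: the DatadogMetric returns last week's replica count, and an External metric with an AverageValue target of 2 makes the HPA compute desiredReplicas = ceil(lastWeekReplicas / 2) for it. Since the HPA always takes the highest result across its metrics, replicas never drop below roughly half of last week's count.

```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: hoge-replicas-last-week
  namespace: hoge
spec:
  # Replica count of the Deployment one week ago (query shape is illustrative).
  query: week_before(max:kubernetes_state.deployment.replicas_desired{kube_deployment:hoge,kube_namespace:hoge})
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hoge-hpa
  namespace: hoge
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  minReplicas: 3
  maxReplicas: 100
  metrics:
  # Normal utilization-based scaling.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  # Dynamic floor: this metric alone yields ceil(lastWeekReplicas / 2) replicas,
  # and the HPA uses the max across all metrics.
  - type: External
    external:
      metric:
        name: datadogmetric@hoge:hoge-replicas-last-week
      target:
        type: AverageValue
        averageValue: "2"
```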
  31. 43

  32. 47 Automated way: VerticalPodAutoscaler. It watches the Pods' utilization and stores it as historical data. ↓ It calculates the recommended resource amounts and changes the Pods' resource requests when the currently assigned resources and the recommendation differ significantly.
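A minimal VerticalPodAutoscaler sketch (names and bounds are placeholders); in "Auto" mode the VPA updater evicts Pods whose requests drift far from the recommendation, and the admission webhook applies the recommended requests when the Pods are re-created:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hoge-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  updatePolicy:
    updateMode: "Auto"            # "Off" only produces recommendations without applying them
  resourcePolicy:
    containerPolicies:
    - containerName: application
      controlledResources: ["cpu", "memory"]
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 4Gi
```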
  33. 48 Multidimensional Pod autoscaling: another option is to use VPA indirectly via Multidimensional Pod autoscaling (MPA) from GKE. MPA is roughly equivalent to using: - HPA for CPU - VPA for memory
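A rough sketch of a GKE MultidimPodAutoscaler that scales horizontally on CPU and vertically on memory. This is reconstructed from memory of GKE's documented example, so the API version and field names are assumptions to verify against the current GKE docs:

```yaml
apiVersion: autoscaling.gke.io/v1beta1
kind: MultidimPodAutoscaler
metadata:
  name: hoge-mpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hoge
  goals:
    metrics:
    - type: Resource
      resource:
        name: cpu                 # horizontal scaling on CPU utilization
        target:
          type: Utilization
          averageUtilization: 60
  constraints:
    global:
      minReplicas: 3
      maxReplicas: 100
    containerControlledResources: [ memory ]   # vertical scaling on memory
    container:
    - name: '*'
      requests:
        minAllowed:
          memory: 128Mi
        maxAllowed:
          memory: 4Gi
  policy:
    updateMode: Auto
```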
  34. 49 MPA and VPA in Mercari: we actually haven't used them yet. We're now running experiments to evaluate HPA, VPA, and MPA.
  35. 50 The resource recommender: we also provide a resource recommender Slack bot to developers (e.g., a recommendation message for the hoge deployment's app container).
  36. 51 How the resource recommender works: it fetches the VPA recommendation values from GKE's system metrics -> and uses the maximum VPA recommendation over the past 3 months.
  38. 54 Next step 1: trying autoscalers other than HPA. As shared, we're trying other autoscalers such as MPA and VPA. Many of our services are Go servers, and we'd like to provide guidance on a good default autoscaler for them.
  39. 55 Next step 2: HPA performance. Currently, our cluster has so many HPA objects that it easily exceeds GKE's recommendation. If the HPA controller takes longer to notice high resource utilization and scale up the workload, it badly affects reliability.
  40. 56 Next step 2: improve HPA performance. We're proposing a metrics feature for the upstream HPA controller.
  41. 57 Next step 2: improve HPA performance. The worst-case scenario for scaling up is when no Node is available. As shared, we'd like a way to adjust the number of overprovisioning Pods based on past resource demand.
  42. 58 Next step 3: HPA target utilization optimization. Currently, each application team is responsible for setting the target resource utilization in its HPAs, but we feel this should be automated. We're considering providing a way to automatically set (or at least recommend) the best target resource utilization, calculated from each service's past behavior.