Slide 1

Kubernetes resource management and the systems behind it at Mercari
Kensei Nakada / @sanposhiho (Platform Infra)

Slide 2

Kensei Nakada
- Mercari JP Platform Infra team
- Kubernetes upstream reviewer (SIG-Scheduling)
- Kubernetes Contributor Award 2022 winner


Slide 3

Agenda
01 Cluster size management
02 Workload size management (Horizontal)
03 Workload size management (Vertical)
04 Our next steps

Slide 4

Mercari’s cluster overview

Slide 5

Mercari’s cluster overview
- Mercari is the largest marketplace app in Japan.
- Mercari is composed of more than 300 microservices.
- Almost all microservices run on a single cluster.
- The platform team is the cluster admin, and each application team uses the cluster we provide.

Slide 6

Cluster size management

Slide 7


Slide 8

We probably want to change the Pod placement for cost reduction.

Slide 9

We probably want to change the Pod placement for cost reduction. But… what if new Pods come after we reduce the Nodes?

Slide 10

But… what if new Pods come after we reduce the Nodes? We need to create a new Node for them.

Slide 11

The automated way: Cluster Autoscaler
Cluster Autoscaler increases/decreases the number of Nodes based on the resource demand from Pods.

Slide 12

Cluster Autoscaler (scaling up)
It checks whether the cluster has any unschedulable Pods.
↓
If it does, it creates a new Node so that they can be scheduled.

Slide 13

Cluster Autoscaler (scaling down)
It checks whether any Nodes are under-utilized.
If all Pods on an under-utilized Node can be evicted, it evicts them and eventually deletes the Node.

Slide 14

Trade-off: Cost 💴 vs Reliability 🛡
The more Nodes the cluster has
-> the stronger the system becomes against Node failure 🛡
-> BUT, the more money we need to pay 💸
We want to reduce cost while keeping reliability.

Slide 15

GKE’s Autoscaling profiles
GKE has options (💴 / 🛡) to decide:
- How Cluster Autoscaler deletes Nodes during scale-down: aggressive 💴 vs conservative 🛡
- How the scheduler schedules Pods: prefer high-utilized Nodes 💴 or low-utilized Nodes 🛡

Slide 16

GKE’s Autoscaling profiles
We chose the 💴 option:
- Cluster Autoscaler deletes Nodes aggressively during scale-down 💴
- The scheduler prefers high-utilized Nodes 💴

Slide 17

How to keep the reliability 🛡
Introducing overprovisioning Pods:
- Pods with very low priority.
- They’ll be preempted when other Pods cannot be scheduled.
- So they become unschedulable instead, and Cluster Autoscaler notices the demand for scaling up and adds Nodes.
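A minimal sketch of this pattern, with hypothetical names, replica count, and resource requests (not our actual configuration):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                         # lower than the default priority (0) of normal Pods
globalDefault: false
description: "Placeholder Pods that reserve spare capacity for real workloads."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 3                      # hypothetical number of placeholder Pods
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"               # roughly one workload Pod's worth of capacity
            memory: 1Gi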

Slide 18

Future plan
But we cannot just increase the number of overprovisioning Pods: the more overprovisioning Pods, the more they cost.

Slide 19

Future plan
But we cannot just increase the number of overprovisioning Pods: the more overprovisioning Pods, the more they cost.
→ We need to predict the demand over time and adjust the number of overprovisioning Pods accordingly.

Slide 20

Workload size management (Horizontal)

Slide 21


Slide 22

We probably want to reduce the replicas.

Slide 23

When traffic grows, the utilization goes higher.

Slide 24

We need to increase the replicas again.

Slide 25

The automated way: HorizontalPodAutoscaler
HorizontalPodAutoscaler scales replicas up/down based on resource utilization. We only need to define the desired resource utilization.
HPA keeps changing the replicas so that the average utilization stays around 60%.
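For illustration, a minimal HPA manifest targeting 60% average CPU utilization (the Deployment name and the replica bounds here are hypothetical):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: echo-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60     # keep the average CPU utilization around 60%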

Slide 26

HorizontalPodAutoscaler in Mercari
Most of our big workloads are managed by HPAs.

Slide 27

Multi-container Pods
We’re using Istio to build a service mesh, so some Pods have two containers: the primary container and the sidecar container.

Slide 28

HPA for multi-container Pods
Resource-type metrics calculate the resource utilization by taking the average of all containers’ utilization.

Slide 29

HPA for multi-container Pods
Resource-type metrics calculate the resource utilization by taking the average of all containers’ utilization.
-> … what if only one container’s utilization goes up?

Slide 30

The past problem with HPA for multi-container Pods
Let’s say:
- container A: current utilization 10%
- container B: current utilization 90%
- Thus, the average is 50%.
We need to scale up in this situation, BUT HPA won’t scale up because the average is still low.
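Assuming a 60% target utilization and equal CPU requests for both containers, the numbers work out like this:

  average utilization = (10% + 90%) / 2 = 50%
  desiredReplicas     = ceil(currentReplicas × 50 / 60) = currentReplicas   → no scale-up, even though container B is already at 90%.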

Slide 31

The container-based HPA
To deal with such a situation, HPA introduced a metric type that refers to each container’s resource utilization.

Slide 32

The container-based HPA
To deal with such a situation, HPA introduced a metric type that refers to each container’s resource utilization.
But it’s been alpha since v1.20…
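For illustration, here is what a ContainerResource metric that looks only at the primary container could look like (names are hypothetical; while the feature is alpha, the HPAContainerMetrics feature gate has to be enabled):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: echo-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: app               # scale on this container's utilization only, ignoring the sidecar
      target:
        type: Utilization
        averageUtilization: 60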

Slide 33

The beta graduation will (hopefully) be done in v1.27
Working on the beta graduation so that it’ll be available in our cluster as soon as possible.

Slide 34

Then, what’s the alternative for now?
We’re using DatadogMetric: we define a Datadog query in a DatadogMetric CRD, and HPA refers to it as an external metric.
-> It allows HPA to scale up/down based on the result of a query against the metric values in Datadog.

Slide 35

DatadogMetric

Slide 36

HPA refers to it as an External metric
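As a rough sketch (all names and the query are illustrative, not our actual configuration), the pair of objects looks like this. The Datadog cluster agent exposes each DatadogMetric as an external metric named datadogmetric@<namespace>:<name>:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: echo-server-sidecar-cpu
  namespace: echo
spec:
  # Illustrative query: CPU utilization (%) of one container.
  # kubernetes.cpu.usage.total is reported in nanocores, hence the division.
  query: avg:kubernetes.cpu.usage.total{kube_deployment:echo-server,kube_container_name:istio-proxy}/1000000000/avg:kubernetes.cpu.requests{kube_deployment:echo-server,kube_container_name:istio-proxy}*100
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: echo-server
  namespace: echo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: External
    external:
      metric:
        name: datadogmetric@echo:echo-server-sidecar-cpu
      target:
        type: Value
        value: "60"                # keep the query result (utilization %) around 60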

Slide 37

For more detail ↓

Slide 38

Incident case: an upstream incident scaled down downstream services
- The upstream service died accidentally.
- No traffic went to the downstream services during the incident.
- The downstream services’ resource utilization went down, so their HPAs scaled down the workloads.

Slide 39

Incident case: an upstream incident scaled down downstream services
When the upstream incident was resolved, the huge traffic came back and the downstream services were completely overwhelmed.
Such unusual scale-down could happen in any service with HPA.

Slide 40

Setting minReplicas? No.
It surely solves such a problem, but results in wasted resources. 💸

Slide 41

Define a dynamic minimum replica number
We defined a DatadogMetric which refers to the replica number one week ago.
↓
It makes HPA not reduce the replica number below ½ × last week’s replica number.
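Roughly, this can be sketched as follows (names and the query are illustrative): a DatadogMetric whose value is last week’s replica count, added to the HPA as an External metric with an AverageValue target of 2. HPA always picks the largest replica count across its metrics, so this acts as a dynamic floor of ceil(last week’s replicas / 2):

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: echo-server-replicas-last-week
  namespace: echo
spec:
  # Illustrative query: the desired replica count of this Deployment one week ago.
  query: week_before(avg:kubernetes_state.deployment.replicas_desired{kube_namespace:echo,kube_deployment:echo-server})

# ...and an additional entry in the existing HPA's metrics list:
metrics:
- type: External
  external:
    metric:
      name: datadogmetric@echo:echo-server-replicas-last-week
    target:
      type: AverageValue
      averageValue: "2"            # desiredReplicas = ceil(last week's replicas / 2)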

Slide 42

Workload size management (Vertical)

Slide 43


Slide 44

We probably want to change the resource amount on each Pod.

Slide 45

When traffic grows, the utilization goes higher.

Slide 46

Then, we need to increase the resource amount.

Slide 47

The automated way: VerticalPodAutoscaler
It watches the Pods’ utilization and stores it as historical data.
↓
It calculates the recommended resource amount and changes the Pods’ resource requests when the currently given resources and the recommendation differ significantly.
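For reference, a minimal VPA sketch (the target name is hypothetical). With updateMode "Off" it only produces recommendations; with "Auto" it also applies them by recreating Pods:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: echo-server
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-server
  updatePolicy:
    updateMode: "Off"              # recommendation only; "Auto" lets VPA apply it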

Slide 48

Multidimensional Pod autoscaling
We have another option to use VPA in an indirect way: Multidimensional Pod autoscaling (MPA) from GKE.
MPA is almost equal to using:
- HPA for CPU
- VPA for memory
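As a rough sketch of that combination (not the MPA CRD itself; names are hypothetical): keep an HPA like the earlier one for CPU, and limit the VPA side to memory via controlledResources:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: echo-server-memory
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-server
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]   # CPU is left to the HPA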

Slide 49

MPA and VPA in Mercari
Actually, we haven’t been using them yet.
We’re now running some experiments to evaluate HPA, VPA and MPA.

Slide 50

The resource recommender
Also, we provide a resource recommender Slack bot to developers.

Slide 51

How the resource recommender works
It fetches the VPA recommendation values from GKE’s system metrics.
-> It uses the maximum VPA recommendation value over the past 3 months.

Slide 52

How the resource recommender works
It fetches the VPA recommendation values from GKE’s system metrics.
-> It uses the maximum VPA recommendation value over the past 3 months.

Slide 53

Our next steps

Slide 54

Next step 1: trying autoscalers other than HPA
As shared, we’re trying other autoscalers like MPA and VPA.
Many of our services are Go servers, and we’d like to provide information about a good default autoscaler for them.

Slide 55

Next step 2: the HPA’s performance
Currently, our cluster has so many HPA objects that it easily exceeds the recommendation from GKE.
If the HPA controller takes more time to notice high resource utilization and scale up the workload, it affects reliability badly.

Slide 56

Next step 2: improve the HPA’s performance
We’re proposing a metrics feature in the upstream HPA controller.

Slide 57

Next step 2: improve the HPA’s performance
The worst-case scenario for scaling up is when no Node is available for the new Pods.
As shared, we’d like to have a way to adjust the number of overprovisioning Pods by referring to the past demand for resources.

Slide 58

Next step 3: HPA utilization target value optimization
Currently, each application team is responsible for setting the target resource utilization value in HPA.
But we feel it should be done in some automatic way.
We’re considering providing a way to automatically set (or just recommend) the best target resource utilization value, calculated from the past behavior of each service.