Kubernetes resource management and the systems behind it at Mercari

sanposhiho
February 26, 2023

Transcript

  1. 1
    Confidential
    Kubernetes resource management
    and the systems behind it at Mercari
    Kensei Nakada / @sanposhiho (Platform Infra)


  2. 2
    Kensei Nakada
    Mercari JP Platform Infra team

    Kubernetes upstream reviewer (SIG-Scheduling)
    Kubernetes Contributor Award 2022 winner

  3. 3
    Agenda
    01 Cluster size management
    02 Workload size management (Horizontal)
    03 Workload size management (Vertical)
    04 Our next steps

  4. 4
    Mercari’s cluster overview


  5. 5
    Mercari’s cluster overview
    - Mercari is the largest marketplace app in Japan.
    - Mercari is composed of more than 300 microservices.
    - Almost all microservices run on a single cluster.
    - The platform team is the cluster admin, and each application
      team uses the cluster we provide.

  6. 6
    Cluster size management


  7. 7


  8. 8
    We probably want to change the placement to reduce cost.

  9. 9
    We probably want to change the placement to reduce cost.
    But… what if new Pods come after we reduce the Nodes?

  10. 10
    But… what if new Pods come after we reduce the Nodes?
    We need to create a new Node for the new Pods.

  11. 11
    Automated way: Cluster Autoscaler
    Cluster Autoscaler increases/decreases the number of Nodes
    based on the resource demand from Pods.

  12. 12
    Cluster Autoscaler (Scaling up)
    It checks whether the cluster has any unschedulable Pods.

    If it does, it creates a new Node so that they can be scheduled.

  13. 13
    Cluster Autoscaler (Scaling down)
    It checks whether some Nodes are underutilized.
    If all Pods on an underutilized Node can be evicted,
    it evicts them and eventually deletes the Node.

  14. 14
    Trade-off: Cost 💴 vs Reliability🛡
    The more Nodes the cluster has
    -> the stronger the system becomes against Node failure 🛡
    -> BUT, the more money we need to pay 💸
    We want to reduce cost while keeping reliability.

  15. 15
    GKE’s Autoscaling profiles
    GKE has options (💴 / 🛡) that decide:
    - How Cluster Autoscaler deletes Nodes during scale-down:
      Aggressive 💴 vs Conservative 🛡
    - How the scheduler schedules Pods:
      prefer high-utilized Nodes 💴 or low-utilized Nodes 🛡

  16. 16
    GKE’s Autoscaling profiles
    We chose the 💴 option:
    - How Cluster Autoscaler deletes Nodes during scale-down:
      Aggressive 💴 vs Conservative 🛡
    - How the scheduler schedules Pods:
      prefer high-utilized Nodes 💴 or low-utilized Nodes 🛡

  17. 17
    How to keep the reliability🛡
    Introducing overprovisioning Pods:
    - Pods with very low priority.
    - They’ll be killed (preempted) when other Pods cannot be scheduled.
    - So, they become unschedulable instead, and
      Cluster Autoscaler notices the demand for
      scaling up and adds Nodes.
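
    A minimal sketch of such overprovisioning Pods (names, the priority value, replica count, and resource sizes are all illustrative): a negative-priority PriorityClass plus a Deployment of pause containers that only reserve capacity.

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: overprovisioning
    value: -10                # lower than any real workload, so these Pods are preempted first
    preemptionPolicy: Never   # placeholder Pods never preempt anything themselves
    globalDefault: false
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: overprovisioning
    spec:
      replicas: 3             # how much headroom to keep
      selector:
        matchLabels:
          app: overprovisioning
      template:
        metadata:
          labels:
            app: overprovisioning
        spec:
          priorityClassName: overprovisioning
          containers:
          - name: pause
            image: registry.k8s.io/pause:3.9
            resources:
              requests:
                cpu: 500m     # the buffer each placeholder reserves
                memory: 512Mi

    When a real Pod cannot fit, the scheduler preempts these placeholders; they go Pending, and Cluster Autoscaler adds a Node for them.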

  18. 18
    Future plan
    But we cannot just increase the number of overprovisioning Pods:
    the more overprovisioning Pods, the more they cost.

  19. 19
    Future plan
    But we cannot just increase the number of overprovisioning Pods:
    the more overprovisioning Pods, the more they cost.
    → we need to predict the demand at each point in time and change
    the number of overprovisioning Pods accordingly.

  20. 20
    Workload size management
    Horizontal


  21. 21


  22. 22
    We probably want to reduce the replicas.

  23. 23
    When traffic grows,
    the utilization goes
    higher.


  24. 24
    We need to increase the
    replicas again.


  25. 25
    Automated way: HorizontalPodAutoscaler
    HorizontalPodAutoscaler scales replicas up/down
    based on resource utilization.
    We only need to define the desired resource utilization.
    HPA keeps changing the replicas so that the average
    utilization stays at 60%.
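
    For example, an HPA like the following (the target name and replica bounds are illustrative) keeps adjusting the Deployment so that average CPU utilization stays around 60%:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: example
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: example
      minReplicas: 3
      maxReplicas: 100
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 60   # target 60% average CPU utilization across Pods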

  26. 26
    HorizontalPodAutoscaler in Mercari
    Most of the big workloads are managed by HPAs.

  27. 27
    The multi-container Pods
    We’re using Istio to build a service mesh,
    so some Pods have two containers:
    the primary one and the sidecar container.

  28. 28
    HPA for multi-container Pods
    The Resource type metrics calculate the resource utilization
    by taking the average of all containers’ utilization.

  29. 29
    HPA for multi-container Pods
    The Resource type metrics calculate the resource utilization
    by taking the average of all containers’ utilization.
    -> … what if only one container’s utilization goes up?

  30. 30
    The past problem with HPA for multi-container Pods
    Let’s say:
    - container A: current utilization 10%
    - container B: current utilization 90%
    - Thus, the average is 50%
    We need to scale up in this situation,
    BUT HPA won’t scale up because the average is still low.

  31. 31
    The container-based HPA
    To deal with such a situation, HPA introduced a metric type
    which refers to each container’s resource utilization.

  32. 32
    The container-based HPA
    To deal with such a situation, HPA introduced a metric type
    which refers to each container’s resource utilization.
    But, it has been alpha since v1.20…
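
    With that feature enabled, the metrics section can target a single container instead of the whole Pod; the container name and target value below are illustrative:

      metrics:
      - type: ContainerResource
        containerResource:
          name: cpu
          container: app           # scale on this container only, ignoring the sidecar
          target:
            type: Utilization
            averageUtilization: 60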

  33. 33
    The beta graduation will be done in v1.27 (hopefully)
    We’re working on the beta graduation so that it’ll be available
    in our cluster as soon as possible.

  34. 34
    Then, what’s the alternative for now?
    We’re using DatadogMetric:
    we can define a Datadog query in the DatadogMetric CRD,
    and HPA refers to it as an external metric.
    -> It allows HPA to scale up/down based on the query result
    from the metric values in Datadog.

  35. 35
    DatadogMetric

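
    A DatadogMetric object roughly looks like this; the name and query are illustrative, and any Datadog query can be used:

    apiVersion: datadoghq.com/v1alpha1
    kind: DatadogMetric
    metadata:
      name: example-requests
    spec:
      query: sum:trace.http.request.hits{service:example}.as_rate()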

  36. 36
    HPA refers to it as External metrics

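
    On the HPA side, the DatadogMetric is referenced as an External metric; the datadogmetric@<namespace>:<name> name follows the Datadog cluster agent’s convention, and the target value is illustrative:

      metrics:
      - type: External
        external:
          metric:
            name: datadogmetric@default:example-requests
          target:
            type: AverageValue
            averageValue: "100"   # e.g. keep ~100 requests/sec per Pod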

  37. 37
    For more detail ↓


  38. 38
    Incident case: an upstream incident scaled down
    downstream services
    - The upstream service died accidentally.
    - No traffic went to the downstream services during the incident.
    - The downstream services’ resource utilization went down.
    Then, the HPAs in the downstream services scaled down the workloads.

  39. 39
    Incident case: an upstream incident scaled down
    downstream services
    When the upstream incident was resolved,
    the huge traffic came back,
    and the downstream services were completely overwhelmed.
    Such an unusual scale-down could happen in any service with HPA.

  40. 40
    Setting minReplicas?
    No.
    It surely solves such a problem, but it results in wasted
    resources. 💸

  41. 41
    Define a dynamic minimum replica number
    We defined a DatadogMetric
    which refers to the replica number one week ago.

    It makes HPA not reduce the replica number
    below ½ × last week’s replica number.
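
    One way to wire this up (metric, query, and names are illustrative): a DatadogMetric that time-shifts the replica count by one week, and an External metric in the HPA with an AverageValue target of 2, so HPA demands at least ceil(last week’s replicas / 2); since HPA takes the maximum across all of its metrics, this acts as a dynamic floor.

    apiVersion: datadoghq.com/v1alpha1
    kind: DatadogMetric
    metadata:
      name: example-replicas-last-week
    spec:
      # Replica count of the Deployment exactly one week ago.
      query: week_before(avg:kubernetes_state.deployment.replicas_available{kube_deployment:example})

    # Added to the HPA's metrics list, alongside the usual Resource metric:
      - type: External
        external:
          metric:
            name: datadogmetric@default:example-replicas-last-week
          target:
            type: AverageValue
            averageValue: "2"    # desiredReplicas >= ceil(lastWeekReplicas / 2)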

  42. 42
    Workload size management
    Vertical


  43. 43


  44. 44
    We probably want to change
    the resource amount on each Pod.

  45. 45
    When traffic grows,
    the utilization goes higher.


  46. 46
    Then, we need to increase
    the resource amount.


  47. 47
    Automated way: VerticalPodAutoscaler
    It watches the Pods’ utilization and stores it as historical data.

    It calculates the recommended resource amount
    and changes the Pods’ resource requests
    when the currently given resources and the recommendation
    differ significantly.
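
    A VerticalPodAutoscaler object roughly looks like this (names are illustrative); in Auto mode the updater evicts Pods whose requests are far from the recommendation so they come back with the new values:

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: example
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: example
      updatePolicy:
        updateMode: "Auto"       # apply recommendations by evicting and recreating Pods
      resourcePolicy:
        containerPolicies:
        - containerName: app     # illustrative container name
          controlledResources: ["cpu", "memory"]
          minAllowed:
            cpu: 100m
            memory: 128Mi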

  48. 48
    Multidimensional Pod autoscaling
    We have another option to use VPA in an indirect way:
    Multidimensional Pod autoscaling (MPA) from GKE.
    MPA is almost equal to using:
    - HPA for CPU.
    - VPA for memory.

  49. 49
    MPA and VPA in Mercari
    Actually, we haven’t been using them yet.
    Now, we’re running some experiments to evaluate HPA, VPA and
    MPA.


  50. 50
    The resource recommender
    Also, we’re providing a resource recommender Slack bot to
    developers.
    (Screenshot: a recommendation message for the hoge Deployment’s
    app container.)

  51. 51
    How the resource recommender works
    It fetches the VPA recommendation value from GKE’s system
    metrics
    -> and uses the maximum VPA recommendation value over 3 months.

  52. 52
    How the resource recommender works
    It fetches the VPA recommendation value from GKE’s system
    metrics
    -> and uses the maximum VPA recommendation value over 3 months.

  53. 53
    Our next steps


  54. 54
    Next step 1: trying autoscalers other than HPA
    As shared, we’re trying to use other autoscalers like MPA and VPA.
    Many of our services are Go servers, and we’d like to provide
    information about a good default autoscaler for us.

  55. 55
    Next step 2: the HPA’s performance
    Currently, our cluster has too many HPA objects, easily
    exceeding the recommendation from GKE.
    If the HPA controller takes more time to notice high resource
    utilization and scale up the workload, it affects reliability badly.

  56. 56
    Next step 2: Improve the HPA’s performance
    We’re proposing a metrics feature for the upstream HPA controller.

  57. 57
    Next step 2: Improve the HPA’s performance
    The worst-case scenario for scaling up is when no Node is
    available.
    As shared, we’d like to have some way to adjust the number of
    overprovisioning Pods by referring to the past demand for resources.

  58. 58
    Next step 3: HPA utilization target value optimization
    Currently, each application team is responsible for setting the
    target resource utilization value in HPA.
    But we feel it should be done in some automatic way.
    We’re considering providing a way to automatically set (or just
    recommend) the best target resource utilization value, calculated
    from the past behavior of that service.