Cost Optimization with Cluster Autoscaler

Takeshi Kondo (@chaspy)
September 30, 2019 / Lightning Talks

Transcript

  1. Cost Optimization with Cluster Autoscaler / Takeshi Kondo (@chaspy) / Lightning Talks
  2. None
  3. Target: people who already know Kubernetes

  4. tl;dr: Cluster Autoscaler is useful

  5. We deploy (almost) every application on Kubernetes

  6. Every application is deployed for each PR: https://quipper.hatenablog.com/entry/future-with-kubernetes

  7. Appendix: namespaces and the number of pods/Deployments in staging

  8. None
  9. We have to scale in/out manually

  10. Problem

  11. How to solve the problem? • Cluster Autoscaler • Horizontal Pod Autoscaler • Vertical Pod Autoscaler
  12. How to solve the problem? • Cluster Autoscaler • Horizontal Pod Autoscaler: tried and reverted because it was "jaja-uma" (unruly, hard to manage) • Vertical Pod Autoscaler: not tried yet
  13. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  14. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  15. Cluster Autoscaler *1
      • Scale-up: when any unschedulable pods exist
      • Scale-in (all of the conditions below must hold):
        • No scale-up is needed
        • The sum of the CPU and memory requests of all pods running on the node is smaller than 50% of the node's allocatable
        • All pods running on the node can be moved to other nodes (a PodDisruptionBudget, for example, can prevent this; see *2 for details)
        • The node does not have the scale-down-disabled annotation: "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true" (a sketch of setting it follows)
      *1 https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
      *2 https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
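A minimal sketch (assuming a recent client-go, an in-cluster config, and a placeholder node name) of setting the scale-down-disabled annotation from slide 15, which tells Cluster Autoscaler never to remove that node:

```go
// Sketch only, not production code: protect one node from scale-down by
// setting the annotation Cluster Autoscaler checks (slide 15).
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs inside the cluster with RBAC to update nodes.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// "my-node" is a placeholder node name.
	node, err := client.CoreV1().Nodes().Get(ctx, "my-node", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	// With this annotation, Cluster Autoscaler will not scale this node down.
	node.Annotations["cluster-autoscaler.kubernetes.io/scale-down-disabled"] = "true"
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```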
  16. Check the code: how does scale-in work? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scale_down.go#L568 (the utilization threshold; default: 0.5)

  17. Check the code: how does scale-in work? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/simulator/cluster.go#L158 Get the CPU and memory utilization and keep the larger of the two (see the sketch below).
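The logic slides 16 and 17 point at can be summarized in a small illustrative sketch (simplified, not the actual cluster-autoscaler code): utilization is computed for CPU and memory, the larger value is kept, and the node is a scale-down candidate only if it is below the threshold (default 0.5):

```go
// Illustrative sketch of the scale-in utilization check from slides 15-17.
package main

import "fmt"

// nodeUtilization returns the larger of the CPU and memory utilization,
// where utilization = sum of pod requests / node allocatable.
func nodeUtilization(cpuRequested, cpuAllocatable, memRequested, memAllocatable float64) float64 {
	cpu := cpuRequested / cpuAllocatable
	mem := memRequested / memAllocatable
	if cpu > mem {
		return cpu
	}
	return mem
}

func main() {
	const threshold = 0.5 // default value of --scale-down-utilization-threshold

	// Hypothetical node: 4 vCPU / 16 GiB allocatable, 1.2 vCPU / 9 GiB requested.
	util := nodeUtilization(1.2, 4.0, 9.0, 16.0)
	fmt.Printf("utilization=%.2f, scale-down candidate: %v\n", util, util < threshold)
	// Memory utilization (9/16 = 0.56) dominates, so this node is NOT a candidate.
}
```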
  18. Check the code: how does scale-in work? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/simulator/cluster.go#L61

  19. Check the code: how does scale-in work? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/github.com/aws/aws-sdk-go/service/autoscaling/api.go#L5100

  20. Check the code: how does scale-in work? https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_TerminateInstanceInAutoScalingGroup.html
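At the bottom of the call chain, the autoscaler's AWS cloud provider removes the node through the API the last two slides reference. A minimal sketch of that call (assuming aws-sdk-go v1, default credentials, and a placeholder instance ID):

```go
// Sketch of TerminateInstanceInAutoScalingGroup, the call Cluster Autoscaler
// ultimately issues when it removes a node from an AWS Auto Scaling group.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// "i-0123456789abcdef0" is a placeholder instance ID.
	out, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId: aws.String("i-0123456789abcdef0"),
		// Also lower the desired capacity so the ASG does not replace the instance.
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Activity)
}
```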

  21. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  22. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  23. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  24. Cluster Autoscaler works when releasing • During a rolling update, up to "maxSurge" extra pods are created, so some pods are temporarily unschedulable and trigger a scale-up (see the sketch below). (Graph: number of running pods vs. desired capacity of the ASG.)
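An illustrative sketch of the maxSurge arithmetic (assuming the standard Kubernetes rounding rule, which rounds percentages up; the replica counts are made up):

```go
// How many extra pods a rolling update may create, per slide 24.
package main

import (
	"fmt"
	"math"
)

// surgePods returns the maximum number of extra pods a rolling update may
// create for a Deployment with the given replica count and maxSurge percentage.
func surgePods(replicas int, maxSurgePercent float64) int {
	return int(math.Ceil(float64(replicas) * maxSurgePercent / 100))
}

func main() {
	// With 20 replicas and the default maxSurge of 25%, up to 5 extra pods
	// appear during a release; if the nodes are already full, those pods are
	// unschedulable and Cluster Autoscaler scales the node group up.
	fmt.Println(surgePods(20, 25)) // 5
}
```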
  25. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  26. PodDisruptionBudget prevented scale-in • Every Deployment has the "safe-draining" label • A PDB keeps maxUnavailable at 1 for every Deployment (a sketch follows)
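A minimal sketch (not the actual manifest from the talk; the object name and the label key/value layout are assumptions) of a PodDisruptionBudget like the one slide 26 describes, built with the policy/v1 Go types (policy/v1beta1 at the time of the talk):

```go
// A PDB that selects pods by the "safe-draining" label and allows at most one
// pod to be unavailable, which is what can block or pace scale-in.
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	maxUnavailable := intstr.FromInt(1)

	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "safe-draining-pdb"}, // hypothetical name
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				// "safe-draining" is the label mentioned on the slide;
				// the exact key/value layout is an assumption.
				MatchLabels: map[string]string{"safe-draining": "true"},
			},
		},
	}
	fmt.Printf("%+v\n", pdb)
}
```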
  27. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  28. Achieving zero downtime when scaling in: https://speakerdeck.com/chaspy/rolling-update-kubernetes-deployment-with-zero-downtime

  29. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  30. Removing stale pull-request namespaces (a sketch follows)
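A hypothetical sketch of such a cleanup job (assuming a recent client-go; the label selector and staleness threshold are made up for illustration), which deletes per-PR namespaces older than a cutoff so their pods stop keeping nodes alive:

```go
// Delete stale per-PR namespaces so Cluster Autoscaler can scale nodes in.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// "type=pull-request" is a hypothetical label marking per-PR namespaces.
	nss, err := client.CoreV1().Namespaces().List(ctx, metav1.ListOptions{
		LabelSelector: "type=pull-request",
	})
	if err != nil {
		panic(err)
	}

	cutoff := 7 * 24 * time.Hour // hypothetical staleness threshold
	for _, ns := range nss.Items {
		if time.Since(ns.CreationTimestamp.Time) > cutoff {
			fmt.Println("deleting stale namespace:", ns.Name)
			if err := client.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{}); err != nil {
				panic(err)
			}
		}
	}
}
```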

  31. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  32. Achievement (Staging)

      Node group      Instance Class  Node Count (Aug.)  Node Count (Sep.)  Metric  Usage/Capacity (Aug.)  Usage/Capacity (Sep.)
      Japan/default   r5.2xlarge      20.56              17.65              Memory  0.53                   0.59
      Global/default  r5.2xlarge      12                 12                 Memory  0.51                   0.56

      Savings: $443.84 (monthly) * (20.56 - 17.65) = $1,291.57 per month
  33. Achievement (Production)

      Node group      Instance Class  Node Count (Aug.)  Node Count (Sep.)  Metric  Usage/Capacity (Aug.)  Usage/Capacity (Sep.)
      Japan/default   m5.xlarge       6.38               8.26               Memory  0.68                   0.58
      Japan/api       m5.2xlarge      17.1               13.42              CPU     0.17                   0.14
      Global/default  m5.xlarge       6                  6                  Memory  0.51                   0.51
      Global/api      m5.2xlarge      6                  6                  CPU     0.15                   0.17

      Japan/default: $181.04 (monthly) * (6.38 - 8.26) = -$340.36 per month (a cost increase)
      Japan/api: $362.08 (monthly) * (17.1 - 13.42) = $1,332.45 per month saved
  34. Achievement (Production), same table as slide 33, with one note: the Japan/default node count increased (6.38 to 8.26) due to the PDB.
  35. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  36. Conclusion • Cluster Autoscaler saves both operational and infrastructure cost • (In staging) combined with deleting stale namespaces • (In production) combined with zero-downtime scale-in • Pod-level (Horizontal / Vertical) autoscaling is not introduced yet • This means we SREs still have to increase pods/nodes manually under high load • Read the code
  37. Special Thanks • @yuya-takeyama / SRE: thanks for reviewing the PR • @rbmrclo / SRE: thanks for reviewing the PR • @hiroki-iwasaki / People & Culture: thanks for organizing the Lightning Talks
  38. Thank You! Takeshi Kondo (chaspy / chaspy_), Site Reliability Engineer at Quipper. SRE Lounge / Terraform-jp