Cost Optimization with Cluster Autoscaler

Takeshi Kondo (@chaspy)
September 30, 2019 / Lightning Talks

Transcript

  1. Cost Optimization with Cluster Autoscaler / Takeshi Kondo (@chaspy) / Lightning Talks
  2. None
  3. Target: people who already know Kubernetes

  4. tl;dr: Cluster Autoscaler is useful

  5. We deploy (almost) every application on Kubernetes

  6. Every application is deployed for each PR: https://quipper.hatenablog.com/entry/future-with-kubernetes

  7. Appendix: namespaces and the number of pods/Deployments in staging

  8. None
  9. We have to scale in/out manually

  10. Problem

  11. How to solve the problem? • Cluster Autoscaler • Horizontal Pod Autoscaler • Vertical Pod Autoscaler
  12. How to solve the problem? • Cluster Autoscaler • Horizontal Pod Autoscaler: tried and reverted because it was "jaja-uma" (unruly, hard to manage) • Vertical Pod Autoscaler: not tried yet
  13. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  14. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  15. Cluster Autoscaler *1
      • Scale-up: when any unschedulable pods exist
      • Scale-in (all of the conditions below must hold):
        • No scale-up is needed
        • The sum of the CPU and memory requests of all pods running on the node is smaller than 50% of the node's allocatable
        • All pods running on the node can be moved to other nodes (a PodDisruptionBudget, for example, can prevent this; see *2 for details)
        • The node does not have the scale-down-disabled annotation: "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true" (a sketch of setting it follows)
      *1 https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
      *2 https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
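A minimal sketch (assuming a recent client-go, an in-cluster config, and a placeholder node name) of setting the scale-down-disabled annotation from slide 15, which tells Cluster Autoscaler never to remove that node:

```go
// Sketch only, not production code: protect one node from scale-down by
// setting the annotation Cluster Autoscaler checks (slide 15).
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs inside the cluster with RBAC to update nodes.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// "my-node" is a placeholder node name.
	node, err := client.CoreV1().Nodes().Get(ctx, "my-node", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	// With this annotation, Cluster Autoscaler will not scale this node down.
	node.Annotations["cluster-autoscaler.kubernetes.io/scale-down-disabled"] = "true"
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```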
  16. Check the code: how does scale-in work? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scale_down.go#L568 (the utilization threshold; default: 0.5)

  17. Check the code: how does scale-in work? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/simulator/cluster.go#L158 Get the CPU and memory utilization and keep the larger of the two (see the sketch below).
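The logic slides 16 and 17 point at can be summarized in a small illustrative sketch (simplified, not the actual cluster-autoscaler code): utilization is computed for CPU and memory, the larger value is kept, and the node is a scale-down candidate only if it is below the threshold (default 0.5):

```go
// Illustrative sketch of the scale-in utilization check from slides 15-17.
package main

import "fmt"

// nodeUtilization returns the larger of the CPU and memory utilization,
// where utilization = sum of pod requests / node allocatable.
func nodeUtilization(cpuRequested, cpuAllocatable, memRequested, memAllocatable float64) float64 {
	cpu := cpuRequested / cpuAllocatable
	mem := memRequested / memAllocatable
	if cpu > mem {
		return cpu
	}
	return mem
}

func main() {
	const threshold = 0.5 // default value of --scale-down-utilization-threshold

	// Hypothetical node: 4 vCPU / 16 GiB allocatable, 1.2 vCPU / 9 GiB requested.
	util := nodeUtilization(1.2, 4.0, 9.0, 16.0)
	fmt.Printf("utilization=%.2f, scale-down candidate: %v\n", util, util < threshold)
	// Memory utilization (9/16 = 0.56) dominates, so this node is NOT a candidate.
}
```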
  18. Check the code: how does scale-in work? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/simulator/cluster.go#L61

  19. Check the code: how does scale-in work? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/github.com/aws/aws-sdk-go/service/autoscaling/api.go#L5100

  20. Check the code: how does scale-in work? https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_TerminateInstanceInAutoScalingGroup.html
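At the bottom of the call chain, the autoscaler's AWS cloud provider removes the node through the API the last two slides reference. A minimal sketch of that call (assuming aws-sdk-go v1, default credentials, and a placeholder instance ID):

```go
// Sketch of TerminateInstanceInAutoScalingGroup, the call Cluster Autoscaler
// ultimately issues when it removes a node from an AWS Auto Scaling group.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// "i-0123456789abcdef0" is a placeholder instance ID.
	out, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId: aws.String("i-0123456789abcdef0"),
		// Also lower the desired capacity so the ASG does not replace the instance.
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Activity)
}
```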

  21. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  22. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  23. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  24. Cluster Autoscaler works when releasing • During a rolling update, up to "maxSurge" extra pods are created, so some pods are temporarily unschedulable and trigger a scale-up (see the sketch below). (Graph: number of running pods vs. desired capacity of the ASG.)
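An illustrative sketch of the maxSurge arithmetic (assuming the standard Kubernetes rounding rule, which rounds percentages up; the replica counts are made up):

```go
// How many extra pods a rolling update may create, per slide 24.
package main

import (
	"fmt"
	"math"
)

// surgePods returns the maximum number of extra pods a rolling update may
// create for a Deployment with the given replica count and maxSurge percentage.
func surgePods(replicas int, maxSurgePercent float64) int {
	return int(math.Ceil(float64(replicas) * maxSurgePercent / 100))
}

func main() {
	// With 20 replicas and the default maxSurge of 25%, up to 5 extra pods
	// appear during a release; if the nodes are already full, those pods are
	// unschedulable and Cluster Autoscaler scales the node group up.
	fmt.Println(surgePods(20, 25)) // 5
}
```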
  25. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  26. PodDisruptionBudget prevented scale-in • Every Deployment has the "safe-draining" label • A PDB keeps maxUnavailable at 1 for every Deployment (a sketch follows)
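A minimal sketch (not the actual manifest from the talk; the object name and the label key/value layout are assumptions) of a PodDisruptionBudget like the one slide 26 describes, built with the policy/v1 Go types (policy/v1beta1 at the time of the talk):

```go
// A PDB that selects pods by the "safe-draining" label and allows at most one
// pod to be unavailable, which is what can block or pace scale-in.
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	maxUnavailable := intstr.FromInt(1)

	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "safe-draining-pdb"}, // hypothetical name
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				// "safe-draining" is the label mentioned on the slide;
				// the exact key/value layout is an assumption.
				MatchLabels: map[string]string{"safe-draining": "true"},
			},
		},
	}
	fmt.Printf("%+v\n", pdb)
}
```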
  27. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  28. Achieving zero downtime when scaling in: https://speakerdeck.com/chaspy/rolling-update-kubernetes-deployment-with-zero-downtime

  29. More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces
  30. Removing stale pull-request namespaces (a sketch follows)
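A hypothetical sketch of such a cleanup job (assuming a recent client-go; the label selector and staleness threshold are made up for illustration), which deletes per-PR namespaces older than a cutoff so their pods stop keeping nodes alive:

```go
// Delete stale per-PR namespaces so Cluster Autoscaler can scale nodes in.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// "type=pull-request" is a hypothetical label marking per-PR namespaces.
	nss, err := client.CoreV1().Namespaces().List(ctx, metav1.ListOptions{
		LabelSelector: "type=pull-request",
	})
	if err != nil {
		panic(err)
	}

	cutoff := 7 * 24 * time.Hour // hypothetical staleness threshold
	for _, ns := range nss.Items {
		if time.Since(ns.CreationTimestamp.Time) > cutoff {
			fmt.Println("deleting stale namespace:", ns.Name)
			if err := client.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{}); err != nil {
				panic(err)
			}
		}
	}
}
```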

  31. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  32. Achievement (Staging)

      Node group      Instance Class  Node Count (Aug.)  Node Count (Sep.)  Metric  Usage/Capacity (Aug.)  Usage/Capacity (Sep.)
      Japan/default   r5.2xlarge      20.56              17.65              Memory  0.53                   0.59
      Global/default  r5.2xlarge      12                 12                 Memory  0.51                   0.56

      Savings: $443.84 (monthly) * (20.56 - 17.65) = $1,291.57 per month
  33. Achievement (Production)

      Node group      Instance Class  Node Count (Aug.)  Node Count (Sep.)  Metric  Usage/Capacity (Aug.)  Usage/Capacity (Sep.)
      Japan/default   m5.xlarge       6.38               8.26               Memory  0.68                   0.58
      Japan/api       m5.2xlarge      17.1               13.42              CPU     0.17                   0.14
      Global/default  m5.xlarge       6                  6                  Memory  0.51                   0.51
      Global/api      m5.2xlarge      6                  6                  CPU     0.15                   0.17

      Japan/default: $181.04 (monthly) * (6.38 - 8.26) = -$340.36 per month (a cost increase)
      Japan/api: $362.08 (monthly) * (17.1 - 13.42) = $1,332.45 per month saved
  34. Achievement (Production), same table as slide 33, with one note: the Japan/default node count increased (6.38 to 8.26) due to the PDB.
  35. Agenda • Introduction / Background • Cluster Autoscaler • How to scale in/out • Check the code • More topics • (Production) Cluster Autoscaler works when releasing • (Production) PodDisruptionBudget can prevent scale-in • (Production) Zero-downtime scale-in • (Staging) Removing stale pull-request namespaces • Achievement • Conclusion
  36. Conclusion • Cluster Autoscaler saves both operational and infrastructure cost • (In staging) combined with deleting stale namespaces • (In production) combined with zero-downtime scale-in • Pod-level (Horizontal / Vertical) autoscaling is not introduced yet • This means we SREs still have to increase pods/nodes manually under high load • Read the code
  37. Special Thanks • @yuya-takeyama / SRE: thanks for reviewing the PR • @rbmrclo / SRE: thanks for reviewing the PR • @hiroki-iwasaki / People & Culture: thanks for organizing the Lightning Talks
  38. Thank You! Takeshi Kondo (chaspy / chaspy_), Site Reliability Engineer at Quipper. SRE Lounge / Terraform-jp