How to make pod assignment to thousands of nodes every day easier

New Relic operates tens of thousands of nodes across hundreds of Kubernetes clusters. Pods are assigned to these nodes every day as applications get deployed. I'll share our experience abstracting the Kubernetes scheduling primitives away from users, discuss their limitations and describe the solution. I'll cover:
* the complexity end users face in specifying scheduling rules to Kubernetes at scale,
* how we built a scheduling engine by extending Kubernetes via mutating admission webhooks to translate declarative user requirements into native Kubernetes scheduling constraints,
* the tradeoffs made in the system.
After this talk, attendees will be better prepared to deal with the complexity of extending Kubernetes to abstract pod assignment to nodes for end users, especially at scale.

Tasdik Rahman

September 18, 2024
Transcript

  1. How to make pod assignments to thousands of nodes every day easier
     Tasdik Rahman @tasdikrahman || www.tasdikrahman.com
     ContainerDays, Hamburg, 2024
  2. About me
     • Release lead team member for v1.9 of kubernetes-sigs/cluster-api.
     • Contributor to cluster-api and its providers:
       ◦ cluster-api-provider-aws and
       ◦ cluster-api-provider-gcp.
     • Past contributor to oVirt (open source virtualization).
     • Senior Software Engineer, New Relic.
  3. Outline
     • Pod scheduling primitives provided by Kubernetes.
     • Background.
       ◦ Kubernetes at New Relic.
         ▪ Cluster API MachinePools
         ▪ Karpenter Provisioners
         ▪ Running both of them together
       ◦ Limitations while using the primitives in our fleet.
     • Scheduling Classes.
       ◦ Design.
       ◦ How does it work?
       ◦ Assumptions made.
       ◦ Achieving better instance diversity in the compute pool.
       ◦ Improving developer experience.
       ◦ Increasing karpenter adoption.
       ◦ What's next in our roadmap.
  4. Kubernetes at New Relic
     • 270+ k8s clusters on 3 different cloud providers, spread across different geographies, regions, zones and environments.
     • 317,000+ pods across all cloud providers, with a maximum of 19,500 pods/cluster.
     • 17,000+ nodes across all cloud providers, with a maximum of 375 nodes/cluster.
  5. Cluster API MachinePools
     • Provide a way to manage a set of machines through common configuration.
     • Similar to MachineDeployments, but each infrastructure provider has its own implementation for managing the Machines.
     • The infrastructure provider creates the autoscaling primitive in the cloud provider. For example:
       ▪ in AWS, cluster-api-provider-aws creates AutoScaling groups (ASGs) and
       ▪ in Azure, cluster-api-provider-azure creates Virtual Machine Scale Sets (VMSS).
  6. Karpenter Provisioners
     • Allow specifying different architectures, instance types and instance families in a single configuration.
     • Optimize the cluster for cost by choosing which node(s) to provision and which node(s) to de-provision.
       ◦ Enhanced bin-packing.
     • Groupless autoscaling, i.e. a newly created node could be of any shape and size chosen from the provisioner configuration.
  7. Cluster Autoscaler
     • When used together with MachinePools, can modify the replica count of the autoscaling primitive provided by the cloud provider.
       ◦ E.g. scales the VMSS or ASG up/down.
     • The assumption made is that all instances in the cloud provider's autoscaling primitive are of the same shape (CPU/memory/disk).
  8. Running both of them together
     • MachinePool nodes
       ◦ Created with node labels and taints, which are then used to differentiate different sets of machines in the compute pool (illustrated in the sketch below).
       ◦ Scale from 0 enabled: the cluster autoscaler scales the pool down to 0 replicas based on usage.
     • Karpenter nodes
       ◦ Don't have any node labels or taints present in the provisioner.
       ◦ Nodes come up when applications target this pool.
       ◦ Scale from 0.
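The deck doesn't include manifests for this split; as a minimal Go sketch using the core/v1 types (the label key example.com/compute-pool and the pool name capi-general are invented for illustration, not New Relic's actual values), the label/taint pair that fences off a MachinePool and the toleration a workload needs to land on it could look like this:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical label and taint used to fence off a CAPI MachinePool
	// from workloads that do not explicitly tolerate it.
	poolLabels := map[string]string{
		"example.com/compute-pool": "capi-general",
	}
	poolTaint := corev1.Taint{
		Key:    "example.com/compute-pool",
		Value:  "capi-general",
		Effect: corev1.TaintEffectNoSchedule,
	}

	// The toleration an application would need in order to land on that pool.
	toleration := corev1.Toleration{
		Key:      "example.com/compute-pool",
		Operator: corev1.TolerationOpEqual,
		Value:    "capi-general",
		Effect:   corev1.TaintEffectNoSchedule,
	}

	fmt.Println(poolLabels, poolTaint, toleration)
}
```

Karpenter nodes in this setup carry no such labels or taints, so workloads without a matching toleration naturally end up on the karpenter pool.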
  9. Limitations while using the primitives in our fleet
     • nodeName
       ◦ Nodes are not constant in the environment.
       ◦ Config would need to be changed constantly if used.
     • nodeSelector
       ◦ Improvement over nodeName, filtering on labels rather than on node names.
       ◦ Becomes challenging for more complex scheduling tasks.
         ▪ E.g. difficult to specify that pods should be spread across different availability zones.
     • Affinity/anti-affinity along with tolerations
       ◦ Become more complex for users to pass and configure (see the sketch below).
       ◦ Values need to be changed over time with the growing needs of the application.
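To make that complexity concrete, here is a hedged Go sketch of the affinity and tolerations an application owner would otherwise hand-maintain: required node affinity to a (made-up) pool label plus preferred pod anti-affinity to spread replicas across zones. The label keys, pool name and app label are illustrative only:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// What an application owner ends up hand-writing without an abstraction:
	// node affinity to a hypothetical pool label, plus pod anti-affinity to
	// spread replicas across availability zones.
	affinity := corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "example.com/compute-pool",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"capi-general"},
					}},
				}},
			},
		},
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
				Weight: 100,
				PodAffinityTerm: corev1.PodAffinityTerm{
					TopologyKey: "topology.kubernetes.io/zone",
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "my-app"},
					},
				},
			}},
		},
	}
	tolerations := []corev1.Toleration{{
		Key:      "example.com/compute-pool",
		Operator: corev1.TolerationOpEqual,
		Value:    "capi-general",
		Effect:   corev1.TaintEffectNoSchedule,
	}}
	fmt.Println(affinity, tolerations)
}
```

This is the boilerplate that scheduling classes are meant to generate on the user's behalf.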
  10. Scheduling Classes
     • Solve the variety of scheduling requirements among different teams/applications.
     • A declarative way to express the scheduling requirements of applications without the need to dive into the finer details of how they are implemented.
     • At the heart of it is an admission controller, specifically a mutating webhook running on all resources of kind Rollout, Deployment and StatefulSet (sketched below).
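The deck doesn't show the webhook code; the following is a minimal controller-runtime sketch of such a mutating webhook for Deployments (Rollouts and StatefulSets would be handled analogously). The type name SchedulingClassMutator and the helper applySchedulingClasses are hypothetical, not the actual implementation:

```go
package webhook

import (
	"context"
	"encoding/json"
	"net/http"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// SchedulingClassMutator intercepts workloads and rewrites only the affinity
// and tolerations of the pod template based on the attached scheduling classes.
type SchedulingClassMutator struct{}

func (m *SchedulingClassMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
	var deploy appsv1.Deployment
	if err := json.Unmarshal(req.Object.Raw, &deploy); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}

	// applySchedulingClasses is hypothetical: it would read the scheduling
	// class names from labels/annotations, sort them by priority and mutate
	// deploy.Spec.Template.Spec.Affinity / .Tolerations accordingly.
	applySchedulingClasses(&deploy)

	mutated, err := json.Marshal(&deploy)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	return admission.PatchResponseFromRaw(req.Object.Raw, mutated)
}

func applySchedulingClasses(d *appsv1.Deployment) {
	// Placeholder for the logic described on the "How does it work?" slide.
}
```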
  11. Design Goals
     • Cloud agnostic.
     • Built on top of the scheduling primitives provided by k8s.
     • Sane defaults.
     • Scheduling constraints can be chained together to form new rules.
     • Attached scheduling classes can add to or negate already present rules.
     • A specific scheduling class can have only one priority.
     • Deterministic.
  12. How does it work?
     • When applicable for mutation, the webhook reads the scheduling classes passed:
       ◦ sorts them by priority,
       ◦ applies them in order, until the last scheduling class is attached,
       ◦ the generated scheduling constraint is used to mutate the application.
         ▪ Mutations are contained to the affinity and tolerations fields of the pod spec.
     • Scheduling constraints are stored as config in the webhook deployment (a sketch of the data model and apply loop follows this slide).
       ◦ Each has a group of tolerations and affinities to add/remove.
       ◦ Scheduling classes are grouped into different types compatible with each other.
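A possible shape for that config and apply loop, again as an assumption-laden Go sketch rather than the actual implementation (the SchedulingClass struct and its field names are invented; a real affinity merge would be field-by-field rather than wholesale):

```go
package webhook

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// SchedulingClass is a hypothetical shape for the config shipped with the
// webhook deployment: each class carries a priority plus the tolerations and
// affinity rules it adds or removes.
type SchedulingClass struct {
	Name              string
	Priority          int
	AddTolerations    []corev1.Toleration
	RemoveTolerations []corev1.Toleration
	Affinity          *corev1.Affinity // nil means "no change"
}

// Apply mirrors the flow on this slide: sort the attached classes by
// priority, then fold them into the pod spec one by one. Only the affinity
// and tolerations fields are ever touched.
func Apply(spec *corev1.PodSpec, classes []SchedulingClass) {
	sort.Slice(classes, func(i, j int) bool { return classes[i].Priority < classes[j].Priority })
	for _, c := range classes {
		for _, t := range c.RemoveTolerations {
			spec.Tolerations = removeToleration(spec.Tolerations, t)
		}
		spec.Tolerations = append(spec.Tolerations, c.AddTolerations...)
		if c.Affinity != nil {
			spec.Affinity = c.Affinity // a real merge would combine fields, not replace
		}
	}
}

func removeToleration(list []corev1.Toleration, target corev1.Toleration) []corev1.Toleration {
	out := make([]corev1.Toleration, 0, len(list))
	for _, t := range list {
		if t.Key != target.Key || t.Value != target.Value || t.Effect != target.Effect {
			out = append(out, t)
		}
	}
	return out
}
```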
  13. Assumptions made
     • Apps having custom affinity/nodeSelector and tolerations are not processed (see the check sketched below).
       ◦ Any scheduling class rules are removed from them.
     • Deployment/Rollout/StatefulSet is the core building block of an application.
       ◦ Pods are not directly mutated, but rather their controlling object.
     • Apps auto-scaling via HPA get an autoscaled scheduling class attached,
       ◦ which adds constraints to schedule them to instances of similar shape and size.
     • StatefulSet applications get the CAPI scheduling class attached,
       ◦ targeting node pools which don't consolidate aggressively.
     • A default scheduling class is attached,
       ◦ and can be configured per cluster.
     • An architecture scheduling class is attached if not present.
       ◦ If present, the original architecture is passed over to affinity.
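The first assumption might translate into a guard like the following Go sketch (the function name is made up); workloads that already carry their own scheduling config are left untouched:

```go
package webhook

import corev1 "k8s.io/api/core/v1"

// hasCustomScheduling reflects the first assumption on this slide: if an
// application already sets its own affinity, nodeSelector or tolerations,
// the webhook skips it (after stripping any scheduling class rules).
func hasCustomScheduling(spec *corev1.PodSpec) bool {
	return spec.Affinity != nil || len(spec.NodeSelector) > 0 || len(spec.Tolerations) > 0
}
```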
  14. How does it affect applications?
     • Scheduling classes are enabled by default for all applications*
       ◦ *that meet the requirements.
     • The default scheduling class is automatically attached,
       ◦ unless the application explicitly opts out.
     • Applications managed by scheduling classes get
       ◦ labels and annotations added (example keys below)
         ▪ which denote the active list of scheduling classes and
         ▪ whether the application is managed by scheduling classes.
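A sketch of that marking step; the label and annotation keys under scheduling.example.com are placeholders, since the deck does not name the real ones:

```go
package webhook

import (
	"strings"

	appsv1 "k8s.io/api/apps/v1"
)

// Hypothetical keys; the talk only says that labels and annotations record
// the active scheduling classes and whether the workload is managed.
const (
	managedLabel      = "scheduling.example.com/managed"
	activeClassesAnno = "scheduling.example.com/active-classes"
)

func markManaged(d *appsv1.Deployment, classes []string) {
	if d.Labels == nil {
		d.Labels = map[string]string{}
	}
	if d.Annotations == nil {
		d.Annotations = map[string]string{}
	}
	d.Labels[managedLabel] = "true"
	d.Annotations[activeClassesAnno] = strings.Join(classes, ",")
}
```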
  15. Achieving better instance diversity in the compute pool
     • Built on top of two compute pool providers:
       ◦ Cluster API MachinePools
         ▪ For applications in the process of optimising for the karpenter consolidation rate.
       ◦ Karpenter provisioners
         ▪ Groupless autoscaling, supporting a wide range of instance shapes/sizes in one provisioner configuration.
         ▪ Instance families and types can be added without adding the corresponding cloud provider autoscaling construct (e.g. no need to separately create an ASG for new nodes).
     • Applications can be scheduled on
       ◦ the CAPI pool,
       ◦ the karpenter pool,
       ◦ or both.
  16. Improving Developer Experience
     • Declarative approach.
     • The app only has to pass the scheduling class names (parsed as sketched below).
       ◦ The need to pass affinity, tolerations and nodeSelectors goes away, removing the need for the application to know these beforehand.
     • Additive/negation nature of different scheduling classes
       ◦ allows mixing and matching node pools of different features as required by the application.
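On the webhook side, the only user input is a list of class names, e.g. from an annotation. A sketch of that parsing, with a hypothetical annotation key and example class names:

```go
package webhook

import "strings"

// classAnnotation is hypothetical; the deck does not name the exact key an
// application owner uses to list scheduling classes.
const classAnnotation = "scheduling.example.com/classes"

// requestedClasses returns the scheduling class names an application owner
// declared, e.g. "karpenter, arm64" -> ["karpenter", "arm64"]. This is all
// the developer supplies; the webhook expands it into affinity/tolerations.
func requestedClasses(annotations map[string]string) []string {
	raw, ok := annotations[classAnnotation]
	if !ok || strings.TrimSpace(raw) == "" {
		return nil
	}
	var out []string
	for _, name := range strings.Split(raw, ",") {
		if n := strings.TrimSpace(name); n != "" {
			out = append(out, n)
		}
	}
	return out
}
```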
  17. Increasing karpenter adoption
     • Karpenter scheduling class
       ◦ Default scheduling class for clusters.
       ◦ Applications by default are then deployed to the karpenter pool, unless the application
         ▪ specifies another scheduling class,
         ▪ has custom affinities/tolerations, or
         ▪ explicitly opts out of karpenter.
     • Keeping provisioner instance families narrower to start with,
       ◦ to allow for less frequent consolidation, and then expanding them.
  18. Opting out of karpenter
     • An application may not want to schedule in the karpenter pool due to
       ◦ the consolidation rate of nodes, or
       ◦ the karpenter pool not having the node features the application needs as of now.
     • CAPI scheduling class
       ◦ Makes applications default to the CAPI pool instead of karpenter.
     • When the CAPI scheduling class is added, it
       ◦ negates the rules added by the default (karpenter) scheduling class and
       ◦ adds scheduling constraints specific to the CAPI pool.
  19. What's next in the roadmap?
     • Scheduling classes as a custom CRD, to allow application owners to interact with the config and add constraints themselves in their cluster.
     • Support for additional scheduling classes for more fine-grained expression of scheduling constraints, and more specialised instance types.
     • Additional scheduling classes to onboard teams using self-managed node pools.
     • Enhancing the karpenter provisioners further to allow for slow adoption of further node features in new/existing applications.
     • Showing the scheduling results on the deployment platform for an application owner.
     • Moving over to the karpenter v1 API objects.
  20. References
     • Kubernetes Scheduling
     • Node Selectors and Affinity
     • Taints and Tolerations
     • Karpenter
     • Admission Controllers
     • Dynamic Admission Control
     • Mutating Admission Webhook
     • Cluster API MachinePools
     • Karpenter provisioners