How to make pod assignment to thousands of nodes every day easier

New Relic operates tens of thousands of nodes across hundreds of Kubernetes clusters. Pods are assigned to these nodes every day as applications get deployed. I'll share our experience abstracting the Kubernetes scheduling primitives away from users, discuss their limitations and describe the solution. I'll cover:
* the complexity end users face in specifying scheduling rules to Kubernetes at scale,
* how we built a scheduling engine by extending Kubernetes via mutating admission webhooks to translate declarative user requirements into native Kubernetes scheduling constraints,
* the tradeoffs made in the system.
After this talk, attendees will be better prepared to deal with the complexity of extending Kubernetes to abstract pod assignment to nodes for end users, especially at scale.

Tasdik Rahman

September 18, 2024
Transcript

  1. How to make pod assignments to thousands of nodes every day easier
     Tasdik Rahman @tasdikrahman || www.tasdikrahman.com
     ContainerDays, Hamburg, 2024
  2. About me
     • Release lead team member for v1.9 of kubernetes-sigs/cluster-api.
     • Contributor to cluster-api and its providers:
       ◦ cluster-api-provider-aws and
       ◦ cluster-api-provider-gcp.
     • Past contributor to oVirt (open source virtualization).
     • Senior Software Engineer, New Relic.
  3. Outline
     • Pod scheduling primitives provided by Kubernetes.
     • Background.
       ◦ Kubernetes at New Relic.
         ▪ Cluster API MachinePools
         ▪ Karpenter Provisioners
         ▪ Running both of them together
       ◦ Limitations while using the primitives in our fleet.
     • Scheduling Classes.
       ◦ Design.
       ◦ How does it work?
       ◦ Assumptions made.
       ◦ Achieving better instance diversity in the compute pool.
       ◦ Improving developer experience.
       ◦ Increasing karpenter adoption.
       ◦ What's next in our roadmap.
  4. Kubernetes at New Relic
     • 270+ k8s clusters on 3 different cloud providers, spread across different geographies, regions, zones and environments.
     • 317,000+ pods across all cloud providers, with a maximum of 19,500 pods/cluster.
     • 17,000+ nodes across all cloud providers, with a maximum of 375 nodes/cluster.
  5. Cluster API MachinePools
     • Provide a way to manage a set of machines through common configuration.
     • Similar to MachineDeployments, but each infrastructure provider has its own implementation for managing the Machines.
     • The infrastructure provider creates the autoscaling primitive in the cloud provider. For example:
       ▪ in AWS, cluster-api-provider-aws creates AutoScaling groups (ASGs) and
       ▪ in Azure, cluster-api-provider-azure creates Virtual Machine Scale Sets (VMSS).
  6. Karpenter Provisioners
     • Allow specifying different architectures, instance types and instance families in a single configuration.
     • Optimize the cluster for cost by choosing which node(s) to provision and which node(s) to de-provision.
       ◦ Enhanced bin-packing.
     • Groupless autoscaling, i.e. a newly created node could be of any shape and size chosen from the provisioner configuration.
  7. Cluster Autoscaler
     • When used together with MachinePools, can modify the replica count of the autoscaling primitive provided by the cloud provider.
       ◦ E.g. scales the VMSS or ASG up/down.
     • The assumption made is that all instances in the cloud provider's autoscaling primitive are of the same shape (CPU/memory/disk).
  8. Running both of them together
     • MachinePool nodes
       ◦ Created with node labels and taints, which are then used to differentiate different sets of machines in the compute pool (illustrated in the sketch below).
       ◦ Scale from 0 enabled: the cluster autoscaler scales the pool down to 0 replicas based on usage.
     • Karpenter nodes
       ◦ Don't have any node labels or taints present in the provisioner.
       ◦ Nodes come up when applications target this pool.
       ◦ Scale from 0.
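The deck doesn't include manifests for this split; as a minimal Go sketch using the core/v1 types (the label key example.com/compute-pool and the pool name capi-general are invented for illustration, not New Relic's actual values), the label/taint pair that fences off a MachinePool and the toleration a workload needs to land on it could look like this:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical label and taint used to fence off a CAPI MachinePool
	// from workloads that do not explicitly tolerate it.
	poolLabels := map[string]string{
		"example.com/compute-pool": "capi-general",
	}
	poolTaint := corev1.Taint{
		Key:    "example.com/compute-pool",
		Value:  "capi-general",
		Effect: corev1.TaintEffectNoSchedule,
	}

	// The toleration an application would need in order to land on that pool.
	toleration := corev1.Toleration{
		Key:      "example.com/compute-pool",
		Operator: corev1.TolerationOpEqual,
		Value:    "capi-general",
		Effect:   corev1.TaintEffectNoSchedule,
	}

	fmt.Println(poolLabels, poolTaint, toleration)
}
```

Karpenter nodes in this setup carry no such labels or taints, so workloads without a matching toleration naturally end up on the karpenter pool.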
  9. Limitations while using the primitives in our fleet
     • nodeName
       ◦ Nodes are not constant in the environment.
       ◦ Config would need to be changed constantly if used.
     • nodeSelector
       ◦ Improvement over nodeName, filtering on labels rather than on node names.
       ◦ Becomes challenging for more complex scheduling tasks.
         ▪ E.g. difficult to specify that pods should be spread across different availability zones.
     • Affinity/anti-affinity along with tolerations
       ◦ Become more complex for users to pass and configure (see the sketch below).
       ◦ Values need to be changed over time with the growing needs of the application.
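To make that complexity concrete, here is a hedged Go sketch of the affinity and tolerations an application owner would otherwise hand-maintain: required node affinity to a (made-up) pool label plus preferred pod anti-affinity to spread replicas across zones. The label keys, pool name and app label are illustrative only:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// What an application owner ends up hand-writing without an abstraction:
	// node affinity to a hypothetical pool label, plus pod anti-affinity to
	// spread replicas across availability zones.
	affinity := corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "example.com/compute-pool",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"capi-general"},
					}},
				}},
			},
		},
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
				Weight: 100,
				PodAffinityTerm: corev1.PodAffinityTerm{
					TopologyKey: "topology.kubernetes.io/zone",
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "my-app"},
					},
				},
			}},
		},
	}
	tolerations := []corev1.Toleration{{
		Key:      "example.com/compute-pool",
		Operator: corev1.TolerationOpEqual,
		Value:    "capi-general",
		Effect:   corev1.TaintEffectNoSchedule,
	}}
	fmt.Println(affinity, tolerations)
}
```

This is the boilerplate that scheduling classes are meant to generate on the user's behalf.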
  10. Scheduling Classes
     • Solve the variety of scheduling requirements among different teams/applications.
     • A declarative way to express the scheduling requirements of applications without the need to dive into the finer details of how they are implemented.
     • At the heart of it is an admission controller, specifically a mutating webhook running on all resources of kind Rollout, Deployment and StatefulSet (sketched below).
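The deck doesn't show the webhook code; the following is a minimal controller-runtime sketch of such a mutating webhook for Deployments (Rollouts and StatefulSets would be handled analogously). The type name SchedulingClassMutator and the helper applySchedulingClasses are hypothetical, not the actual implementation:

```go
package webhook

import (
	"context"
	"encoding/json"
	"net/http"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// SchedulingClassMutator intercepts workloads and rewrites only the affinity
// and tolerations of the pod template based on the attached scheduling classes.
type SchedulingClassMutator struct{}

func (m *SchedulingClassMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
	var deploy appsv1.Deployment
	if err := json.Unmarshal(req.Object.Raw, &deploy); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}

	// applySchedulingClasses is hypothetical: it would read the scheduling
	// class names from labels/annotations, sort them by priority and mutate
	// deploy.Spec.Template.Spec.Affinity / .Tolerations accordingly.
	applySchedulingClasses(&deploy)

	mutated, err := json.Marshal(&deploy)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	return admission.PatchResponseFromRaw(req.Object.Raw, mutated)
}

func applySchedulingClasses(d *appsv1.Deployment) {
	// Placeholder for the logic described on the "How does it work?" slide.
}
```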
  11. Design Goals
     • Cloud agnostic.
     • Built on top of the scheduling primitives provided by k8s.
     • Sane defaults.
     • Scheduling constraints can be chained together to form new rules.
     • Attached scheduling classes can add to or negate already present rules.
     • A specific scheduling class can have only one priority.
     • Deterministic.
  12. How does it work?
     • When applicable for mutation, the webhook reads the scheduling classes passed:
       ◦ sorts them by priority,
       ◦ applies them in order, until the last scheduling class is attached,
       ◦ the generated scheduling constraint is used to mutate the application.
         ▪ Mutations are contained to the affinity and tolerations fields of the pod spec.
     • Scheduling constraints are stored as config in the webhook deployment (a sketch of the data model and apply loop follows this slide).
       ◦ Each has a group of tolerations and affinities to add/remove.
       ◦ Scheduling classes are grouped into different types compatible with each other.
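A possible shape for that config and apply loop, again as an assumption-laden Go sketch rather than the actual implementation (the SchedulingClass struct and its field names are invented; a real affinity merge would be field-by-field rather than wholesale):

```go
package webhook

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// SchedulingClass is a hypothetical shape for the config shipped with the
// webhook deployment: each class carries a priority plus the tolerations and
// affinity rules it adds or removes.
type SchedulingClass struct {
	Name              string
	Priority          int
	AddTolerations    []corev1.Toleration
	RemoveTolerations []corev1.Toleration
	Affinity          *corev1.Affinity // nil means "no change"
}

// Apply mirrors the flow on this slide: sort the attached classes by
// priority, then fold them into the pod spec one by one. Only the affinity
// and tolerations fields are ever touched.
func Apply(spec *corev1.PodSpec, classes []SchedulingClass) {
	sort.Slice(classes, func(i, j int) bool { return classes[i].Priority < classes[j].Priority })
	for _, c := range classes {
		for _, t := range c.RemoveTolerations {
			spec.Tolerations = removeToleration(spec.Tolerations, t)
		}
		spec.Tolerations = append(spec.Tolerations, c.AddTolerations...)
		if c.Affinity != nil {
			spec.Affinity = c.Affinity // a real merge would combine fields, not replace
		}
	}
}

func removeToleration(list []corev1.Toleration, target corev1.Toleration) []corev1.Toleration {
	out := make([]corev1.Toleration, 0, len(list))
	for _, t := range list {
		if t.Key != target.Key || t.Value != target.Value || t.Effect != target.Effect {
			out = append(out, t)
		}
	}
	return out
}
```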
  13. Assumptions made
     • Apps having custom affinity/nodeSelector and tolerations are not processed (see the check sketched below).
       ◦ Any scheduling class rules are removed from them.
     • Deployment/Rollout/StatefulSet is the core building block of an application.
       ◦ Pods are not directly mutated, but rather their controlling object.
     • Apps auto-scaling via HPA get an autoscaled scheduling class attached,
       ◦ which adds constraints to schedule them to instances of similar shape and size.
     • StatefulSet applications get the CAPI scheduling class attached,
       ◦ targeting node pools which don't consolidate aggressively.
     • A default scheduling class is attached,
       ◦ and can be configured per cluster.
     • An architecture scheduling class is attached if not present.
       ◦ If present, the original architecture is passed over to affinity.
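The first assumption might translate into a guard like the following Go sketch (the function name is made up); workloads that already carry their own scheduling config are left untouched:

```go
package webhook

import corev1 "k8s.io/api/core/v1"

// hasCustomScheduling reflects the first assumption on this slide: if an
// application already sets its own affinity, nodeSelector or tolerations,
// the webhook skips it (after stripping any scheduling class rules).
func hasCustomScheduling(spec *corev1.PodSpec) bool {
	return spec.Affinity != nil || len(spec.NodeSelector) > 0 || len(spec.Tolerations) > 0
}
```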
  14. How does it affect applications?
     • Scheduling classes are enabled by default for all applications*
       ◦ *that meet the requirements.
     • The default scheduling class is automatically attached,
       ◦ unless the application explicitly opts out.
     • Applications managed by scheduling classes get
       ◦ labels and annotations added (example keys below)
         ▪ which denote the active list of scheduling classes and
         ▪ whether the application is managed by scheduling classes.
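A sketch of that marking step; the label and annotation keys under scheduling.example.com are placeholders, since the deck does not name the real ones:

```go
package webhook

import (
	"strings"

	appsv1 "k8s.io/api/apps/v1"
)

// Hypothetical keys; the talk only says that labels and annotations record
// the active scheduling classes and whether the workload is managed.
const (
	managedLabel      = "scheduling.example.com/managed"
	activeClassesAnno = "scheduling.example.com/active-classes"
)

func markManaged(d *appsv1.Deployment, classes []string) {
	if d.Labels == nil {
		d.Labels = map[string]string{}
	}
	if d.Annotations == nil {
		d.Annotations = map[string]string{}
	}
	d.Labels[managedLabel] = "true"
	d.Annotations[activeClassesAnno] = strings.Join(classes, ",")
}
```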
  15. Achieving better instance diversity in the compute pool
     • Built on top of two compute pool providers:
       ◦ Cluster API MachinePools
         ▪ For applications in the process of optimising for the karpenter consolidation rate.
       ◦ Karpenter provisioners
         ▪ Groupless autoscaling, supporting a wide range of instance shapes/sizes in one provisioner configuration.
         ▪ Instance families and types can be added without adding the corresponding cloud provider autoscaling construct (e.g. no need to separately create an ASG for new nodes).
     • Applications can be scheduled on
       ◦ the CAPI pool,
       ◦ the karpenter pool,
       ◦ or both.
  16. Improving Developer Experience
     • Declarative approach.
     • The app only has to pass the scheduling class names (parsed as sketched below).
       ◦ The need to pass affinity, tolerations and nodeSelectors goes away, removing the need for the application to know these beforehand.
     • Additive/negation nature of different scheduling classes
       ◦ allows mixing and matching node pools of different features as required by the application.
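On the webhook side, the only user input is a list of class names, e.g. from an annotation. A sketch of that parsing, with a hypothetical annotation key and example class names:

```go
package webhook

import "strings"

// classAnnotation is hypothetical; the deck does not name the exact key an
// application owner uses to list scheduling classes.
const classAnnotation = "scheduling.example.com/classes"

// requestedClasses returns the scheduling class names an application owner
// declared, e.g. "karpenter, arm64" -> ["karpenter", "arm64"]. This is all
// the developer supplies; the webhook expands it into affinity/tolerations.
func requestedClasses(annotations map[string]string) []string {
	raw, ok := annotations[classAnnotation]
	if !ok || strings.TrimSpace(raw) == "" {
		return nil
	}
	var out []string
	for _, name := range strings.Split(raw, ",") {
		if n := strings.TrimSpace(name); n != "" {
			out = append(out, n)
		}
	}
	return out
}
```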
  17. Increasing karpenter adoption
     • Karpenter scheduling class
       ◦ Default scheduling class for clusters.
       ◦ Applications by default are then deployed to the karpenter pool, unless the application
         ▪ specifies another scheduling class,
         ▪ has custom affinities/tolerations, or
         ▪ explicitly opts out of karpenter.
     • Keeping provisioner instance families narrower to start with,
       ◦ to allow for less frequent consolidation, and then expanding them.
  18. Opting out of karpenter
     • An application may not want to schedule in the karpenter pool due to
       ◦ the consolidation rate of nodes, or
       ◦ the karpenter pool not having the node features the application needs as of now.
     • CAPI scheduling class
       ◦ Makes applications default to the CAPI pool instead of karpenter.
     • When the CAPI scheduling class is added, it
       ◦ negates the rules added by the default (karpenter) scheduling class and
       ◦ adds scheduling constraints specific to the CAPI pool.
  19. What's next in the roadmap?
     • Scheduling classes as a custom CRD, to allow application owners to interact with the config and add constraints themselves in their cluster.
     • Support for additional scheduling classes for more fine-grained expression of scheduling constraints, and more specialised instance types.
     • Additional scheduling classes to onboard teams using self-managed node pools.
     • Enhancing the karpenter provisioners further to allow for slow adoption of further node features in new/existing applications.
     • Showing the scheduling results on the deployment platform for an application owner.
     • Moving over to the karpenter v1 API objects.
  20. References
     • Kubernetes Scheduling
     • Node Selectors and Affinity
     • Taints and Tolerations
     • Karpenter
     • Admission Controllers
     • Dynamic Admission Control
     • Mutating Admission Webhook
     • Cluster API MachinePools
     • Karpenter provisioners