Slide 1

Slide 1 text

How to make pod assignments to thousands of nodes every day easier Tasdik Rahman @tasdikrahman || www.tasdikrahman.com ContainerDays, Hamburg, 2024

Slide 2

Slide 2 text

About me
● Release lead team member for v1.9 of kubernetes-sigs/cluster-api
● Contributor to cluster-api and its providers:
  ○ cluster-api-provider-aws and
  ○ cluster-api-provider-gcp
● Past contributor to oVirt (open source virtualization)
● Senior Software Engineer, New Relic

Slide 3

Slide 3 text

Outline
● Pod scheduling primitives provided by kubernetes
● Background
  ○ Kubernetes at New Relic
    ■ Cluster API MachinePools
    ■ Karpenter Provisioners
    ■ Running both of them together
  ○ Limitations while using the primitives in our fleet
● Scheduling Classes
  ○ Design
  ○ How does it work?
  ○ Assumptions made
  ○ Achieving better instance diversity in the compute pool
  ○ Improving developer experience
  ○ Increasing karpenter adoption
  ○ What's next in our roadmap

Slide 4

Slide 4 text

Pod Scheduling primitives provided by kubernetes

Slide 5

Slide 5 text

Test Kind Cluster Ref: https://kind.sigs.k8s.io/

Slide 6

Slide 6 text

nodeName: specifying the node name. Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
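The manifest shown on the slide isn't captured in the text export; a minimal sketch of pinning a pod via nodeName, assuming a local kind cluster with a worker node named kind-worker (pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: nginx-nodename
spec:
  nodeName: kind-worker        # bypasses the scheduler entirely; the kubelet on this exact node runs the pod
  containers:
  - name: nginx
    image: nginx:1.27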

Slide 7

Slide 7 text

nodeSelector: targeting the node-group label. Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
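Again the manifest isn't in the export; a sketch assuming the nodes carry a hypothetical node-group label:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-nodeselector
spec:
  nodeSelector:
    node-group: general-purpose   # hypothetical label key/value; only nodes carrying it are considered
  containers:
  - name: nginx
    image: nginx:1.27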

Slide 8

Slide 8 text

Affinity: constraints over the architecture as a requirement for scheduling.
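A sketch of expressing the architecture constraint as a hard requirement via node affinity, using the well-known kubernetes.io/arch node label (pod name, image and the arm64 value are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: nginx-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement, evaluated at scheduling time
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - arm64
  containers:
  - name: nginx
    image: nginx:1.27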

Slide 9

Slide 9 text

Taints and Tolerations: the node has a taint; the pod tolerates the taint.
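A sketch of the taint/toleration pair, assuming a kind worker node and a hypothetical dedicated=ingress taint:

# Taint the node so that pods without a matching toleration are repelled:
#   kubectl taint nodes kind-worker dedicated=ingress:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: nginx-toleration
spec:
  tolerations:
  - key: dedicated
    operator: Equal
    value: ingress
    effect: NoSchedule      # the pod is allowed (not forced) onto nodes carrying this taint
  containers:
  - name: nginx
    image: nginx:1.27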

Slide 10

Slide 10 text

Background

Slide 11

Slide 11 text

Kubernetes at New Relic
● 270+ k8s clusters on 3 different cloud providers, spread across different geographies, regions, zones and environments.
● 317,000+ pods across all cloud providers, with a maximum of 19,500 pods/cluster.
● 17,000+ nodes across all cloud providers, with a maximum of 375 nodes/cluster.

Slide 12

Slide 12 text

Cluster API MachinePools
● Provide a way to manage a set of machines with a common configuration.
● Similar to MachineDeployments, but each infrastructure provider has its own implementation for managing the Machines.
● The infrastructure provider creates the autoscaling primitive in the cloud provider.
  ○ For example:
    ■ In AWS, cluster-api-provider-aws creates Auto Scaling Groups (ASGs) and
    ■ In Azure, cluster-api-provider-azure creates Virtual Machine Scale Sets (VMSS).
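A minimal MachinePool sketch against the cluster-api v1beta1 API, using cluster-api-provider-aws as the infrastructure provider; all names, versions and the replica count are illustrative:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: general-purpose-pool
spec:
  clusterName: my-cluster
  replicas: 3
  template:
    spec:
      clusterName: my-cluster
      version: v1.29.4
      bootstrap:
        configRef:                     # bootstrap config for the machines in the pool
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: general-purpose-pool
      infrastructureRef:               # provider-specific; an AWSMachinePool maps to an ASG
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachinePool
        name: general-purpose-pool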

Slide 13

Slide 13 text

Karpenter Provisioners
● Allow specifying different architectures, instance types and instance families in a single configuration.
● Optimize the cluster for cost by choosing which node(s) to provision and which node(s) to de-provision.
  ○ Enhanced bin-packing.
● Groupless autoscaling, i.e. a newly created node could be of any shape and size chosen from the provisioner configuration.
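A sketch of a v1alpha5 Provisioner (the roadmap slide later mentions moving to the karpenter v1 API); the instance categories, limits and providerRef are illustrative:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true                       # karpenter bin-packs and de-provisions underused nodes
  requirements:
  - key: kubernetes.io/arch
    operator: In
    values: ["amd64", "arm64"]
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"]             # several instance families in one configuration
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand"]
  limits:
    resources:
      cpu: "1000"
  providerRef:
    name: default                       # AWSNodeTemplate with subnets/AMIs, not shown here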

Slide 14

Slide 14 text

Cluster Autoscaler
● When used together with MachinePools, it can modify the replica count of the autoscaling primitive provided by the cloud provider.
  ○ E.g. scales the VMSS or ASG up/down.
● The assumption made is that all instances in the cloud provider's autoscaling primitive are of the same shape (CPU/memory/disk).

Slide 15

Slide 15 text

Running both of them together
● MachinePool nodes
  ○ Created with node labels and taints, which are then used to differentiate sets of machines in the compute pool.
  ○ Scale from 0 enabled: the cluster autoscaler scales the pool down to 0 replicas based on usage (see the sketch after this list).
● Karpenter nodes
  ○ Don't have any node labels or taints present in the provisioner.
  ○ Nodes come up when applications target this pool.
  ○ Scale from 0.
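A sketch of the scale-from-0 piece, as a metadata fragment to merge into a MachinePool like the one sketched earlier; the annotation keys are the ones read by the cluster-autoscaler clusterapi provider, the values are illustrative:

metadata:
  name: capi-pool
  annotations:
    # min/max node-group size for the cluster-autoscaler clusterapi provider;
    # a minimum of "0" lets the pool scale down to zero replicas
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "20"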

Slide 16

Slide 16 text

Limitations while using the primitives in our fleet
● nodeName
  ○ Nodes are not constant in the environment.
  ○ The config would need to be changed constantly if used.
● nodeSelector
  ○ An improvement over nodeName, filtering on labels rather than on node names.
  ○ Becomes challenging for more complex scheduling tasks.
    ■ E.g. difficult to specify pods to be scheduled across different availability zones.
● Affinity/anti-affinity along with tolerations
  ○ More complex for users to pass and configure.
  ○ Values need to change over time with the growing needs of the application.

Slide 17

Slide 17 text

Scheduling Classes

Slide 18

Slide 18 text

Scheduling Classes
● Solve the variety of scheduling requirements across different teams/applications.
● A declarative way to express the scheduling requirements of applications without diving into the finer details of how they are implemented.
● The heart of it is an admission controller, specifically a mutating webhook running on all resources of kind Rollout, Deployment and StatefulSet.
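A sketch of what such a webhook registration could look like; the webhook name, service, path and the assumption that Rollout means Argo Rollouts (argoproj.io) are illustrative, not the actual New Relic configuration:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: scheduling-classes
webhooks:
- name: scheduling-classes.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore                    # don't block workload creation if the webhook is down
  clientConfig:
    service:
      name: scheduling-classes-webhook
      namespace: platform
      path: /mutate
  rules:
  - apiGroups: ["apps", "argoproj.io"]
    apiVersions: ["v1", "v1alpha1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments", "statefulsets", "rollouts"]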

Slide 19

Slide 19 text

Design Goals
● Cloud agnostic.
● Built on top of the scheduling primitives provided by k8s.
● Sane defaults.
● Scheduling constraints can be chained together to form new rules.
● Attached scheduling classes can add/negate already present rules.
● A specific scheduling class can have only one priority.
● Deterministic.

Slide 20

Slide 20 text

How does it work?

Slide 21

Slide 21 text

How does it work?

Slide 22

Slide 22 text

How does it work?
● When applicable for mutation, the webhook reads the scheduling classes passed:
  ○ Sorts them by priority.
  ○ Applies them in order, until the last scheduling class is attached.
  ○ The generated scheduling constraint is used to mutate the application.
    ■ Mutations are contained to the affinity and tolerations fields of the pod spec.
● Scheduling constraints are stored as config in the webhook deployment (a hypothetical sketch follows this list).
  ○ Each class has a group of tolerations and affinities to add/remove.
  ○ Scheduling classes are grouped into different types compatible with each other.
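A hypothetical shape for that config, only to illustrate the add/negate and priority mechanics; none of the class names, label keys or fields are from the actual implementation:

# Illustrative webhook config: each class carries tolerations/affinity terms to
# add or remove, plus a priority used to order how classes are applied.
schedulingClasses:
- name: karpenter                     # default class
  priority: 10
  add:
    tolerations:
    - key: pool.example.com/karpenter
      operator: Exists
      effect: NoSchedule
    nodeAffinityTerms:
    - key: pool.example.com/type
      operator: In
      values: ["karpenter"]
- name: capi
  priority: 20
  negates: ["karpenter"]              # removes the rules added by the karpenter class
  add:
    nodeAffinityTerms:
    - key: pool.example.com/type
      operator: In
      values: ["capi"]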

Slide 23

Slide 23 text

Assumptions made

Slide 24

Slide 24 text

Assumptions made
● Apps having custom affinity/nodeSelector and tolerations are not processed.
  ○ Any scheduling class rules are removed.
● The Deployment/Rollout/StatefulSet is the core building block of an application.
  ○ Pods are not directly mutated, but rather their controlling object.
● Apps using auto-scaling via HPA get an autoscaled scheduling class attached,
  ○ which adds constraints to schedule them onto instances of similar shape and size.
● StatefulSet applications get the CAPI scheduling class attached,
  ○ targeting node pools which don't consolidate aggressively.
● A default scheduling class is attached,
  ○ which can be configured per cluster.
● An architecture scheduling class is attached, if one is not already present.
  ○ If present, the original architecture is carried over into the affinity.

Slide 25

Slide 25 text

How does it affect applications?

Slide 26

Slide 26 text

How does it affect applications?
● Scheduling classes are enabled by default for all applications
  ○ *that meet the requirements.
● The default scheduling class is attached automatically,
  ○ unless the application explicitly chooses to opt out.
● Applications managed by scheduling classes get labels and annotations added (see the sketch after this list)
  ○ which denote the active list of scheduling classes and
  ○ whether the application is managed by scheduling classes.
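For illustration only, metadata the webhook could stamp onto a managed Deployment; the label and annotation keys are hypothetical:

metadata:
  labels:
    scheduling.example.com/managed: "true"                          # workload is managed by scheduling classes
  annotations:
    scheduling.example.com/active-classes: "karpenter,autoscaled"   # active classes, in applied order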

Slide 27

Slide 27 text

Achieving better instance diversity in the compute pool

Slide 28

Slide 28 text

Achieving better instance diversity in the compute pool
● Built on top of two compute pool providers:
  ○ Cluster API MachinePools
    ■ For applications in the process of optimising for karpenter's consolidation rate.
  ○ Karpenter provisioners
    ■ Groupless autoscaling, supporting a wide range of instance shapes/sizes in one provisioner configuration.
    ■ Instance families and types can be added without creating the corresponding cloud provider autoscaling construct (e.g. no need to separately create an ASG for new nodes).
● Applications can be scheduled on
  ○ the CAPI pool,
  ○ the karpenter pool,
  ○ or both.

Slide 29

Slide 29 text

Node Groups

Slide 30

Slide 30 text

Improving Developer Experience

Slide 31

Slide 31 text

Improving Developer Experience

Slide 32

Slide 32 text

Improving Developer Experience

Slide 33

Slide 33 text

Improving Developer Experience
● Declarative approach.
● An app only has to pass the scheduling class names (see the sketch after this list).
  ○ The need to pass affinity, tolerations and nodeSelectors goes away, removing the need for the application to know these beforehand.
● Additive/negating nature of different scheduling classes.
  ○ Allows mixing and matching node pools with different features, as required by the application.
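A sketch of the application side, assuming the class names are passed via a hypothetical annotation; the webhook injects the affinity/tolerations, so the pod template stays free of them:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api                                    # hypothetical app
  annotations:
    scheduling.example.com/classes: "capi,multi-az"    # only the class names are declared
spec:
  replicas: 3
  selector:
    matchLabels:
      app: billing-api
  template:
    metadata:
      labels:
        app: billing-api
    spec:
      containers:
      - name: billing-api
        image: example.com/billing-api:1.0.0
        # no affinity, tolerations or nodeSelector here; the webhook adds them at admission time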

Slide 34

Slide 34 text

Increasing karpenter adoption

Slide 35

Slide 35 text

Increasing karpenter adoption
● Karpenter scheduling class
  ○ The default scheduling class for clusters.
  ○ Applications are therefore deployed to the karpenter pool by default, unless the application
    ■ specifies another scheduling class,
    ■ has custom affinities/tolerations, or
    ■ explicitly opts out of karpenter.
● Keeping provisioner instance families narrower to start with,
  ○ to allow for less frequent consolidation, and then expanding them.

Slide 36

Slide 36 text

Opting out of karpenter

Slide 37

Slide 37 text

Opting out of karpenter
● An application may not want to schedule in the karpenter pool due to
  ○ the consolidation rate of nodes, or
  ○ the karpenter pool not yet having the node features the application needs.
● CAPI scheduling class
  ○ Makes applications default to the CAPI pool instead of karpenter.
● When the CAPI scheduling class is added, it
  ○ negates the rules added by the default (karpenter) scheduling class and
  ○ adds scheduling constraints specific to the CAPI pool (illustrative sketch after this list).
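An illustrative result on the pod template once the CAPI class has been applied; the label and taint keys are hypothetical:

spec:
  tolerations:
  - key: pool.example.com/capi
    operator: Exists
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: pool.example.com/type
            operator: In
            values: ["capi"]          # the karpenter term added by the default class has been removed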

Slide 38

Slide 38 text

What’s next in our roadmap

Slide 39

Slide 39 text

What's next in the roadmap?
● Scheduling classes as a custom CRD, to allow application owners to interact with the config and add constraints themselves in their cluster.
● Support for additional scheduling classes for more fine-grained expression of scheduling constraints and more specialised instance types.
● Additional scheduling classes to onboard teams using self-managed node pools.
● Enhancing the karpenter provisioners further, to allow gradual adoption of new node features in new/existing applications.
● Surfacing scheduling results on the deployment platform for application owners.
● Moving over to the karpenter v1 API objects.

Slide 40

Slide 40 text

References
● Kubernetes Scheduling
● Node Selectors and Affinity
● Taints and Tolerations
● Karpenter
● Admission Controllers
● Dynamic Admission Control
● Mutating Admission Webhook
● Cluster API MachinePools
● Karpenter provisioners

Slide 41

Slide 41 text

Thank you Tasdik Rahman @tasdikrahman || www.tasdikrahman.com