you an expert on kube-scheduler, so that you will be confident enough to understand the recent work and contribute to kube-scheduler. • Understand the recent work around kube-scheduler. • Understand how you can contribute to kube-scheduler, or Kubernetes in general. Feel free to ask questions while I’m presenting. You can even ask in Vietnamese, and someone will help translate (please!).
pluggable and extensible. A plugin works at one or more extension points in the scheduling framework. Filter Filter out Nodes that cannot run the Pod (insufficient resources, mismatch with NodeAffinity, etc). Score Score Nodes and determine the best one (image locality, etc).
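To make the two extension points concrete, here is a minimal, self-contained sketch in Go. The `Pod` and `Node` structs and the function names are simplified stand-ins invented for this illustration; the real plugin interfaces live in the scheduler framework package (`k8s.io/kubernetes/pkg/scheduler/framework`) and are richer than this.

```go
package main

import "fmt"

// Toy stand-ins for the scheduler's Pod and Node types.
type Pod struct {
	Name       string
	CPURequest int64 // millicores
}

type Node struct {
	Name           string
	AllocatableCPU int64 // millicores
}

// Filter: reject Nodes that cannot run the Pod, e.g. insufficient CPU.
func filterNodes(pod Pod, nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.AllocatableCPU >= pod.CPURequest {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// Score: rank the feasible Nodes; here we simply prefer the Node with
// the most CPU left after placing the Pod (a simplistic scoring rule).
func bestNode(pod Pod, nodes []Node) Node {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.AllocatableCPU-pod.CPURequest > best.AllocatableCPU-pod.CPURequest {
			best = n
		}
	}
	return best
}

func main() {
	pod := Pod{Name: "pod1", CPURequest: 500}
	nodes := []Node{
		{Name: "node1", AllocatableCPU: 250},  // filtered out: too small
		{Name: "node2", AllocatableCPU: 1000},
		{Name: "node3", AllocatableCPU: 2000}, // most free CPU: wins scoring
	}
	feasible := filterNodes(pod, nodes)
	fmt.Println(len(feasible), bestNode(pod, feasible).Name) // 2 node3
}
```

In the real framework, each plugin implements these as methods (`Filter`, `Score`) and the scheduler runs all registered plugins at each extension point.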
which is composed of three internal spaces: • ActiveQ: holds Pods that are ready to be scheduled. • BackoffQ: holds Pods that are waiting for their backoff period to complete. • Unschedulable Pod Pool: holds Pods that came back from the scheduling cycle and are waiting for some change in the cluster that makes them schedulable.
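The three spaces and the movements between them can be modeled with a small sketch. This is a toy illustration only; the struct and method names here are invented, and the real implementation (in the kube-scheduler queue package) uses heaps, timestamps, and per-Pod backoff accounting.

```go
package main

import "fmt"

// A toy model of the scheduling queue's three internal spaces.
type SchedulingQueue struct {
	activeQ       []string        // Pods ready to be scheduled
	backoffQ      []string        // Pods waiting out a backoff period
	unschedulable map[string]bool // Pods waiting for a cluster change
}

// Pop takes the next Pod to try scheduling from activeQ.
func (q *SchedulingQueue) Pop() (string, bool) {
	if len(q.activeQ) == 0 {
		return "", false
	}
	p := q.activeQ[0]
	q.activeQ = q.activeQ[1:]
	return p, true
}

// AddUnschedulable parks a Pod rejected by the scheduling cycle.
func (q *SchedulingQueue) AddUnschedulable(pod string) {
	q.unschedulable[pod] = true
}

// MoveToBackoff requeues an unschedulable Pod (e.g. after a relevant
// cluster event) into backoffQ; once its backoff expires, the real
// queue moves it on to activeQ.
func (q *SchedulingQueue) MoveToBackoff(pod string) {
	if q.unschedulable[pod] {
		delete(q.unschedulable, pod)
		q.backoffQ = append(q.backoffQ, pod)
	}
}

func main() {
	q := &SchedulingQueue{unschedulable: map[string]bool{}}
	q.activeQ = []string{"pod1"}
	p, _ := q.Pop()       // scheduling cycle picks pod1
	q.AddUnschedulable(p) // ...and rejects it
	q.MoveToBackoff(p)    // a cluster event makes it worth retrying
	fmt.Println(len(q.backoffQ)) // 1
}
```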
KEP-4832: Async preemption. ◦ KEP-4247: Queueing Hints. ◦ KEP-5142: Pop pod from backoffQ when activeQ is empty. • TOO MANY pieces of work around Dynamic Resource Allocation. ◦ (We’re not going to look into those.)
◦ If the Scheduling Throughput falls below the Pod creation speed of the cluster, unscheduled Pods will accumulate. • The major performance blockers for the throughput ◦ API calls ◦ Backoff time ◦ Unschedulable pods
Node1 if I delete two Pods there… Before your Pods become unschedulable, they go through the preemption process. When preemption happens, the scheduling cycle takes time to complete because it has to make some API calls, which impacts the overall scheduling latency.
able to go to Node1 at the next scheduling cycle. Pod deletion ❌ ❌
Starts the next scheduling only after the preemption is completed.
to delete, makes the API calls asynchronously, and starts the next scheduling cycle without waiting for those calls to complete. The preemption reserves the space that will be freed up before starting the next scheduling cycle, so that it can be taken into consideration. I’m going to delete Pod1 and Pod2, and go to Node1. Sure, I’ll reserve the place for you (unless a higher-priority Pod takes over your place).
Make API calls asynchronously. Start the next scheduling without waiting for the preemption. Pod deletion ❌ ❌
• Unschedulable Pod Pool: holds Pods that came back from the scheduling cycle and are waiting for some change in the cluster that makes them schedulable. What’s this? 🤔
cycle (= unschedulable), they go back to the Unschedulable Pod Pool in the Scheduling Queue. 🤔 When should we retry those Pods? → We shouldn’t just keep retrying scheduling, which would waste scheduling cycles. We should retry scheduling ONLY WHEN the next attempt is likely to succeed!
cycle, it comes back to the queue with an annotation of which plugin(s) rejected it. NodeResourceFit’s failure could be solved by: • A new Node is created. • A Node is updated to have more allocatable capacity. • A scheduled Pod is deleted. Previously
could be solved with. NodeAdd Observes a NodeAdd event, which the NodeResourceFit plugin’s failure could be solved with. Previously
could be solved with. NodeAdd Maybe this new Node is too small to run the Pod… But the queue couldn’t see such details… Previously
unschedulable Pods schedulable. But this old mechanism was too coarse and not extensible for out-of-tree plugins. • Plugins can only declare which types of changes can make Pods schedulable. ◦ Node created, Node updated, Pod deleted, etc. • They cannot see the details of each change to make more precise decisions. ◦ A Node is created, but what kind of Node? etc.
Queueing Hints allow plugins to see those details, and make requeueing decisions based on them!
the Pod should be retried with. QueueingHint Checks whether a new Node is big enough for the Pod’s requests. If yes, requeue the Pod. NodeAdd With QHint
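The NodeAdd hint described above can be sketched like this. The types and the function name `isSchedulableAfterNodeAdd` are simplified stand-ins for this illustration; the real `QueueingHintFn` in the scheduler framework receives the logged pod, the old/new object of the event, and returns a hint plus an error.

```go
package main

import "fmt"

// Toy version of a QueueingHint result.
type Hint int

const (
	QueueSkip Hint = iota // the event can't help; keep the Pod parked
	Queue                 // the event may make the Pod schedulable
)

type Pod struct{ CPURequest int64 }  // millicores
type Node struct{ AllocatableCPU int64 } // millicores

// isSchedulableAfterNodeAdd mirrors the NodeResourceFit idea: only
// requeue if the newly added Node is actually big enough for the Pod.
func isSchedulableAfterNodeAdd(pod Pod, newNode Node) Hint {
	if newNode.AllocatableCPU >= pod.CPURequest {
		return Queue
	}
	return QueueSkip // too small: retrying would just waste a cycle
}

func main() {
	pod := Pod{CPURequest: 4000}
	// A tiny Node arrives: without QHint the queue would blindly retry;
	// with QHint the Pod stays parked.
	fmt.Println(isSchedulableAfterNodeAdd(pod, Node{AllocatableCPU: 1000}) == QueueSkip) // true
	// A big Node arrives: now the retry is likely to succeed.
	fmt.Println(isSchedulableAfterNodeAdd(pod, Node{AllocatableCPU: 8000}) == Queue) // true
}
```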
treating them individually. However, as Kubernetes use cases grow, we’ve got more use cases that need to schedule groups of Pods. • Gang Scheduling (KEP-4671) ◦ Ensures several Pods are scheduled at the same time. • Topology Aware Scheduling (Kueue’s feature)
the async preemption feature made the preemption API calls asynchronous, kube-scheduler makes several other API calls during the scheduling process. We’re trying to implement a general way to make async API calls for better performance.
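One generic shape such a mechanism could take is a background dispatcher: scheduling cycles enqueue API calls and continue immediately, while a worker drains the queue. This is an invented sketch of the idea, not the actual design being proposed upstream.

```go
package main

import (
	"fmt"
	"sync"
)

// A toy generic async API-call dispatcher.
type APICallQueue struct {
	calls chan func()
	wg    sync.WaitGroup
}

func NewAPICallQueue() *APICallQueue {
	q := &APICallQueue{calls: make(chan func(), 128)}
	q.wg.Add(1)
	go func() { // single background worker drains the queue
		defer q.wg.Done()
		for call := range q.calls {
			call()
		}
	}()
	return q
}

// Enqueue returns immediately; the scheduling cycle never blocks on
// the API round-trip.
func (q *APICallQueue) Enqueue(call func()) { q.calls <- call }

// Close stops accepting calls and waits for in-flight ones to finish.
func (q *APICallQueue) Close() { close(q.calls); q.wg.Wait() }

func main() {
	q := NewAPICallQueue()
	q.Enqueue(func() { fmt.Println("PATCH pod status") })
	q.Enqueue(func() { fmt.Println("POST binding") })
	fmt.Println("scheduling cycle continues") // no wait for the calls above
	q.Close()
}
```

A real version would also need retry/error handling (what happens when an async call fails mid-scheduling?), which is exactly the kind of detail the upstream discussion has to settle.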
Kueue and Cluster Autoscaler simulate the scheduling, so they know “this Pod might be scheduled onto that Node”. We’re trying to use the NominatedNodeName field so that those components can pass such results as hints, letting kube-scheduler skip some scheduling calculations. Maybe pod1 can be scheduled on node1. You may want to try node1 for pod1 first.
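The intended effect can be sketched as "try the hinted Node first, fall back to full evaluation". Everything below (the `pickNode` helper, the toy types) is invented for illustration; the real integration between NominatedNodeName and the scheduling cycle is more involved.

```go
package main

import "fmt"

type Pod struct {
	Name              string
	NominatedNodeName string // hint set by an external component
	CPURequest        int64  // millicores
}

type Node struct {
	Name           string
	AllocatableCPU int64 // millicores
}

func fits(pod Pod, n Node) bool { return n.AllocatableCPU >= pod.CPURequest }

// pickNode tries the nominated Node first and only falls back to
// evaluating every Node when the hint doesn't pan out.
func pickNode(pod Pod, nodes map[string]Node) string {
	if n, ok := nodes[pod.NominatedNodeName]; ok && fits(pod, n) {
		return n.Name // hint was good: skip the full calculation
	}
	for _, n := range nodes { // fallback: full evaluation
		if fits(pod, n) {
			return n.Name
		}
	}
	return "" // unschedulable
}

func main() {
	nodes := map[string]Node{
		"node1": {Name: "node1", AllocatableCPU: 2000},
		"node2": {Name: "node2", AllocatableCPU: 4000},
	}
	// An external simulator hinted node1 for pod1.
	pod := Pod{Name: "pod1", NominatedNodeName: "node1", CPURequest: 1000}
	fmt.Println(pickNode(pod, nodes)) // node1, without evaluating every Node
}
```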
are just discussed in GitHub Issues, and Pull Requests are opened based on the discussion. • Major changes are developed through the KEP process. ◦ Kubernetes Enhancement Proposal. ◦ The kubernetes/enhancements repository hosts proposals. ◦ Proposals are discussed in a standard format and have to be approved by certain stakeholders. ◦ Feature gates are implemented so that big changes don’t break thousands of users.
are all valuable! • Contribute to the implementation. ◦ Not only Kubernetes itself; there are many subprojects in the kubernetes-sigs org! • Contribute to the documentation. ◦ Maybe the easiest place to start, to get used to the development process and get into the community. • Open GitHub Issues for feature requests/bug reports. Join the discussion. • Join the release teams, ◦ which manage and help with the Kubernetes release.
◦ If you want to propose a new feature, or even just fix a bug, it’s always better to discuss with the maintainers first to make sure what you want to change is correct. • Start from easier tasks. ◦ Check GitHub Issues with the good-first-issue and help-wanted labels. • Talk to the maintainers (but don’t bother them too much). ◦ You can also reach maintainers via Slack DM, etc., for questions or suggestions. However, you should try hard before asking, and elaborate your questions. Minimize the effort maintainers have to spend answering you; otherwise, in the worst case, you may simply get ignored. (They are all busy!)
Code 2021, and continued contributing even afterwards. There are several mentoring programs: • Google Summer of Code • LFX Mentorship You will get a mentor from the team who assists you thoroughly in making changes.
can aim at while making contributions. • GitHub membership in the Kubernetes org. ◦ You need a few contributions (~10) and sponsorship from the maintainers. • Reviewer ◦ Reviewers officially review PRs for the implementation they are responsible for. ◦ You need a certain number of contributions (e.g., ~20 merged PRs). • Approver ◦ Approvers give the final approval on PRs for the implementation they are responsible for. ◦ You need a large number of contributions (e.g., ~30 merged PRs). • Contributor Award ◦ Not a role, but contributors who have made a huge impact get awarded by SIGs every year. • SIG Lead (Chair and Tech Lead) ◦ SIG Leads literally lead the SIG’s direction, discussions, and everything else.
free time in most cases. The best way to keep contributing is to stay motivated! • Contribute to what you LOVE or are interested in! • Contribute to what gives you an actual benefit. Examples: ◦ Contribute to a component your company uses, to gain more in-depth knowledge. ◦ Contribute to a component used by companies you want to move to. Contributions are visible experience/expertise! ◦ Contribute and get connected with awesome people.