Learning sched_ext: BPF extensible scheduler class

Learning sched_ext: BPF extensible scheduler class @shun159

Disclaimer
- I’m a newcomer to the Linux kernel.
- I’m also not a scheduler expert.

Introduction
• sched_ext: a new extensible scheduler class
  ◦ Allows scheduling policies to be implemented as BPF programs
  ◦ Provides a simple and intuitive API for implementing policies
    ▪ Doesn’t require knowledge of core scheduler internals
  ◦ Allows experimentation in a safe manner, without even needing to reboot the system
    ▪ Safe: cannot crash the host
      • Protection afforded by the BPF verifier
  ◦ Used in production at Meta to optimize their workloads

Implementing scheduling policies: Overview
• Userspace can implement an arbitrary CPU scheduler by loading a set of BPF programs that implement “sched_ext_ops”
• The BPF program implements a set of callbacks
  ◦ Task wakeup
  ◦ Task enqueue/dequeue
  ◦ Task state change (runnable, running, stopping, quiescent)
  ◦ …
• Like other eBPF programs, it can use eBPF maps and data structures as needed
[Diagram: a userspace program loads the BPF program with the bpf(BPF_PROG_LOAD) syscall; once it passes the verifier, sched_ext in kernel space calls its struct_ops callbacks, which can use eBPF maps, rbtrees, and linked lists.]

Implementing scheduling policies: Callbacks(1)
    /* Pick the target CPU for a task which is being woken up */
    s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);
    /* Enqueue a runnable task on the BPF scheduler or dispatch directly to CPU */
    void (*enqueue)(struct task_struct *p, u64 enq_flags);
    /*
     * Remove a task from the BPF scheduler.
     * This is usually called to isolate the task while updating its
     * scheduling properties (e.g. priority).
     */
    void (*dequeue)(struct task_struct *p, u64 deq_flags);
    ...
    /* BPF scheduler’s name, 128 chars or less */
    char name[SCX_OPS_NAME_LEN];

Implementing scheduling policies: Callbacks(2)
    /* A task is becoming runnable on its associated CPU */
    void (*runnable)(struct task_struct *p, u64 enq_flags);
    /* A task is starting to run on its associated CPU */
    void (*running)(struct task_struct *p);
    /* A task is stopping execution on its associated CPU */
    void (*stopping)(struct task_struct *p, bool runnable);
    /* A task is becoming not runnable on its associated CPU */
    void (*quiescent)(struct task_struct *p, u64 deq_flags);
The only field we must provide is the scheduler’s “name”; every callback is optional.
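
The "only the name is required" rule can be pictured with a small user-space sketch. Everything here is hypothetical (`toy_ops`, `toy_task`, `toy_enqueue`, and the fallback behaviour are invented for illustration and are not the kernel's definitions); it only shows how a table of function pointers can leave a callback unset and still work.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NAME_LEN 128 /* mirrors the 128-byte name field on the slide */

/* Hypothetical task record that just counts who queued it. */
struct toy_task {
    int pid;
    int queued_by_default;
    int queued_by_policy;
};

/* Hypothetical ops table: only `name` has to be filled in. */
struct toy_ops {
    void (*enqueue)(struct toy_task *p); /* optional, may be NULL */
    char name[TOY_NAME_LEN];
};

/* Built-in behaviour used when the policy leaves `enqueue` unset. */
static void default_enqueue(struct toy_task *p)
{
    p->queued_by_default++;
}

/* An example policy callback. */
static void policy_enqueue(struct toy_task *p)
{
    p->queued_by_policy++;
}

/* Call the policy's callback if present, otherwise the default. */
static void toy_enqueue(const struct toy_ops *ops, struct toy_task *p)
{
    assert(ops->name[0] != '\0'); /* the name is the one mandatory field */
    if (ops->enqueue)
        ops->enqueue(p);
    else
        default_enqueue(p);
}
```

With `struct toy_ops minimal = { .name = "minimal" };` the call `toy_enqueue(&minimal, &task)` takes the default path; setting `.enqueue = policy_enqueue` switches it to the policy's callback.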

Implementing scheduling policies: BPF program
[Slide shows an example BPF scheduler; the notes below annotate it.]
• When the scheduler opts to switch all tasks, all existing and future CFS tasks (SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE and SCHED_EXT) are switched to SCX. Otherwise, only tasks that have SCHED_EXT explicitly set are placed on sched_ext.
• When SCX_ENQ_LOCAL is set in enq_flags, it indicates that running the task directly on the selected CPU should not affect fairness. In this case, just queue it on the local FIFO.
• Otherwise, in this example code, enqueue the task in the global DSQ. It will be consumed later by sched_ext.
• Specify the callbacks that sched_ext invokes.
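
The two decisions annotated above can be sketched in plain user-space C. This is not the BPF program from the slide: the `TOY_*` constants, the enums, and both function names are hypothetical stand-ins, and the real flag values live in the sched_ext headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the SCX_ENQ_LOCAL bit. */
#define TOY_ENQ_LOCAL (1ULL << 0)

enum toy_dsq { TOY_DSQ_LOCAL, TOY_DSQ_GLOBAL };

enum toy_policy {
    TOY_POLICY_NORMAL,
    TOY_POLICY_BATCH,
    TOY_POLICY_IDLE,
    TOY_POLICY_EXT,
    TOY_POLICY_FIFO, /* a real-time policy, never handled by the toy SCX */
};

/* Is a task with this policy handled by the (toy) sched_ext class? */
static bool toy_on_scx(enum toy_policy pol, bool switch_all)
{
    if (pol == TOY_POLICY_EXT)
        return true; /* explicitly opted in */
    if (!switch_all)
        return false;
    return pol == TOY_POLICY_NORMAL || pol == TOY_POLICY_BATCH ||
           pol == TOY_POLICY_IDLE;
}

/* Enqueue decision: local FIFO if the flag is set, else the global DSQ. */
static enum toy_dsq toy_pick_dsq(uint64_t enq_flags)
{
    if (enq_flags & TOY_ENQ_LOCAL)
        return TOY_DSQ_LOCAL;
    return TOY_DSQ_GLOBAL;
}
```

Keeping the flag test as the only branch mirrors the annotation: the policy itself holds no queue, it only chooses which DSQ receives the task.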

DSQs (Dispatch Queues): Overview
• As an abstraction layer between the BPF scheduler and the kernel for managing queues of tasks, sched_ext uses FIFO queues called DSQs (Dispatch Queues).
  ◦ By default, one global DSQ and one local DSQ per CPU are created.
  ◦ Global DSQ (SCX_DSQ_GLOBAL)
    ▪ By default, consumed when the local DSQs are empty.
    ▪ Can be utilized by a scheduler if necessary
  ◦ Per-CPU local DSQs (SCX_DSQ_LOCAL)
    ▪ A per-CPU FIFO (or RB tree, if the task is flagged to use a priority queue) that SCX pulls from when putting the next task on the CPU.
    ▪ A CPU always executes a task from its local DSQ
[Diagram: two CPUs, their local DSQs, and the global DSQ.]
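
As a rough illustration of the queue layout above, the following user-space sketch models a DSQ as a fixed-size FIFO of task IDs. All names (`toy_dsq`, `toy_dsq_push`, `toy_dsq_pop`, `toy_next_task`) are hypothetical. One simplification is assumed: the toy runs a task straight from the global queue, whereas the slide says a CPU always executes from its local DSQ.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DSQ_CAP 16
#define TOY_NO_TASK (-1)

/* A toy FIFO dispatch queue holding task IDs (circular buffer). */
struct toy_dsq {
    int tasks[TOY_DSQ_CAP];
    size_t head; /* index of the oldest entry */
    size_t len;  /* number of queued entries */
};

/* Append a task at the tail. Returns 0, or -1 if the queue is full. */
static int toy_dsq_push(struct toy_dsq *q, int task)
{
    assert(task != TOY_NO_TASK);
    if (q->len == TOY_DSQ_CAP)
        return -1;
    q->tasks[(q->head + q->len) % TOY_DSQ_CAP] = task;
    q->len++;
    return 0;
}

/* Remove and return the oldest task, or TOY_NO_TASK if empty. */
static int toy_dsq_pop(struct toy_dsq *q)
{
    int task;

    if (q->len == 0)
        return TOY_NO_TASK;
    task = q->tasks[q->head];
    q->head = (q->head + 1) % TOY_DSQ_CAP;
    q->len--;
    return task;
}

/* Prefer the CPU's local DSQ; fall back to the global DSQ when it is empty. */
static int toy_next_task(struct toy_dsq *local, struct toy_dsq *global)
{
    int task = toy_dsq_pop(local);

    return task != TOY_NO_TASK ? task : toy_dsq_pop(global);
}
```

Zero-initialised queues (`struct toy_dsq q = {0};`) are valid empty queues, so no separate init function is needed in this sketch.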

DSQs (Dispatch Queues): Operations
• Each DSQ provides two operations: "dispatch" and "consume"
  ◦ “Consume”:
    ▪ Like pick_next_task(), consumes the next task from a DSQ to run on the calling CPU.
    ▪ Done in ops.dispatch(), which is invoked when a core would otherwise go idle because no task was found
  ◦ “Dispatch”: placing a task into a DSQ, from which a CPU runs it. Can be done in the following callbacks
    ▪ ops.enqueue: invoked when a task is being enqueued in the BPF scheduler.
    ▪ ops.dispatch: invoked when a CPU would go idle if no task is found.
    ▪ This operation should either dispatch one or more tasks to local DSQs or transfer a task from a DSQ to the current CPU's local DSQ

Scheduling Cycle: Task enqueue/wakeup flow
[Flowchart]
• Entering scheduler (with explicit blocking: mutex, semaphore, waitqueue, etc.; TIF_NEED_RESCHED is set in p->thread_info->flags)
• Task waking up?
  ◦ Y: s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags); the task is migrated to the selected CPU
  ◦ N: continue to the next step
• Task becomes runnable: void (*enqueue)(struct task_struct *p, u64 enq_flags);
• Task dispatched directly to CPU?
  ◦ Y: task enqueued in the CPU’s local DSQ
  ◦ N: task enqueued in the BPF scheduler (or the global DSQ)
When the task is waking up, ops.select_cpu() is the first operation invoked. The function provides:
• a CPU selection optimization hint
• a wakeup of the selected CPU if it is idle.
After the target CPU is selected, ops.enqueue() is invoked. It can make one of the following decisions:
• Immediately dispatch the task to a DSQ
• Queue the task on the BPF scheduler

Scheduling Cycle: Runqueue balance/dispatch
[Flowchart]
• Start balance
• Does the CPU have locally dispatched tasks?
  ◦ Y: task successfully balanced to core
  ◦ N: void (*dispatch)(s32 cpu, struct task_struct *prev); dispatch tasks to the local DSQ or transfer tasks to the local DSQ
• Were any tasks dispatched in the callback?
  ◦ Y: task successfully balanced to core
  ◦ N: CPU will go idle
When a CPU is ready to schedule, it first looks at its local DSQ. If that is empty, it then looks at the global DSQ.
If there still isn't a task to run, ops.dispatch() is invoked. We can use the following functions to populate the local DSQ:
• scx_bpf_dispatch(): dispatch a task to a DSQ
• scx_bpf_consume(): transfer a task from a DSQ to the dispatching CPU's local DSQ
After ops.dispatch() returns, the following steps are taken:
• Try to consume from the DSQs; if successful, run the task
• If ops.dispatch() dispatched any tasks, retry the first step
• If the previous task is still runnable, keep executing it
• Otherwise, go idle
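
The decision sequence on this slide can be approximated by a short loop. The sketch below is a simplified user-space model under stated assumptions: queues are reduced to counters, `toy_ops_dispatch` is a made-up callback that moves one held task into the local DSQ per call, and none of the names are kernel API.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_outcome { TOY_RUN_LOCAL, TOY_RUN_GLOBAL, TOY_KEEP_PREV, TOY_GO_IDLE };

struct toy_cpu {
    int local_len;      /* tasks in this CPU's local DSQ */
    int global_len;     /* tasks in the global DSQ */
    int bpf_queued;     /* tasks still held by the toy BPF scheduler */
    bool prev_runnable; /* is the previously running task still runnable? */
};

/* Toy ops.dispatch(): move one held task to the local DSQ, if any. */
static bool toy_ops_dispatch(struct toy_cpu *c)
{
    if (c->bpf_queued == 0)
        return false;
    c->bpf_queued--;
    c->local_len++;
    return true;
}

/* Decide what the CPU does next, following the order on the slide. */
static enum toy_outcome toy_balance(struct toy_cpu *c)
{
    assert(c->local_len >= 0 && c->global_len >= 0 && c->bpf_queued >= 0);
    for (;;) {
        if (c->local_len > 0) {      /* 1. local DSQ first */
            c->local_len--;
            return TOY_RUN_LOCAL;
        }
        if (c->global_len > 0) {     /* 2. then the global DSQ */
            c->global_len--;
            return TOY_RUN_GLOBAL;
        }
        if (toy_ops_dispatch(c))     /* 3. ask the scheduler, then retry */
            continue;
        if (c->prev_runnable)        /* 4. keep running the previous task */
            return TOY_KEEP_PREV;
        return TOY_GO_IDLE;          /* 5. nothing to do */
    }
}
```

The loop terminates because each retry consumes one held task; a real scheduler has more cases (for example, tasks arriving concurrently) that this model ignores.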

sched_class: .select_task_rq
    int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)
• This callback is invoked by the core scheduler to determine which CPU to assign the task to.
• When the callback is invoked, the BPF scheduler’s “select_cpu” is called.

sched_class: .enqueue_task
    void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
• This callback is invoked when "*p" transitions to the "runnable" state.
• At this point, the BPF scheduler's "runnable" and “enqueue” callbacks are called.

sched_class: .pick_next_task
    struct task_struct *pick_next_task_scx(struct rq *rq)
• Called by the core scheduler to determine which task from the DSQ should run.
• This function returns the task that will run next.
• When the callback is invoked, the BPF scheduler's "running" callback can be called.

sched_class: .balance
    int balance_scx(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
• If there are no tasks to run, the callback is invoked.
• At this point, the BPF scheduler's "cpu_acquire" (to handle the case where the CPU is handed back to SCX from another sched_class) and "dispatch" callbacks are invoked.

Build sched-ext kernel (1)
1. Check out the sched_ext repo from GitHub:
    $ git clone https://github.com/sched-ext/sched_ext
2. Check out and build the latest clang:
    $ yay -S cmake ninja
    $ mkdir ~/llvm
    $ git clone https://github.com/llvm/llvm-project.git llvm-project
    $ mkdir -p llvm-project/build; cd llvm-project/build
    $ cmake -G Ninja \
        -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
        -DCMAKE_INSTALL_PREFIX="$HOME/llvm/$(date +%Y%m%d)" \
        -DBUILD_SHARED_LIBS=OFF \
        -DLIBCLANG_BUILD_STATIC=ON \
        -DCMAKE_BUILD_TYPE=Release \
        -DLLVM_ENABLE_TERMINFO=OFF \
        -DLLVM_ENABLE_PROJECTS="clang;lld" \
        ../llvm
    $ ninja install -j$(nproc)
    $ ln -sf $HOME/llvm/$(date +%Y%m%d) $HOME/llvm/latest

Build sched-ext kernel (2)
3. Download and build the latest pahole:
    $ cd /data/users/$USER
    $ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git
    $ mkdir -p pahole/build; cd pahole/build
    $ cmake -G Ninja ../
    $ ninja
*** After building pahole and clang, make sure they are in your $PATH ***
4. Build sched_ext kernel:
    CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
    CONFIG_DEBUG_INFO_BTF=y
    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
    CONFIG_PAHOLE_HAS_BTF_TAG=y
    CONFIG_SCHED_CLASS_EXT=y
    CONFIG_SCHED_DEBUG=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_BPF_JIT=y
    ### 9P_FS is used by osandov-linux to mount the custom build directory from the host machine
    CONFIG_9P_FS=y
    CONFIG_NET_9P=y
    CONFIG_NET_9P_FD=y
    CONFIG_NET_9P_VIRTIO=y

Build sched-ext kernel (3)
4. Build sched_ext kernel:
    $ make CC=clang-17 LD=ld.lld LLVM=1 menuconfig
    $ make CC=clang-17 LD=ld.lld LLVM=1 olddefconfig
    $ make CC=clang-17 LD=ld.lld LLVM=1 -j$(nproc)
5. Build scx samples:
    $ cd tools/sched_ext
    $ make CC=clang-17 LD=ld.lld LLVM=1 -j$(nproc)

Build sched-ext kernel (4)
6. Set up a VM for the sched_ext kernel
I recommend using osandov-linux[0], as it is a very handy tool for running a custom-built kernel.
    $ vm.py create -c 4 -m 8192 -s 50G <vm name>
    $ vm.py archinstall <vm name>
    $ kconfig.py <path to osandov-linux>/configs/vmpy.fragment
    $ vm.py run -k $PWD -- <vm name>
[0]: https://github.com/osandov/osandov-linux

Write own CPU scheduler
• A "simple-minded vruntime" scheduler that performs scheduling decisions in userspace.
• It may not be a practical approach; the intention is to see how to write a “kernel bypassed” scheduler.
  ◦ Move the complexity of scheduling tasks from the kernel to userspace.

Write own CPU scheduler: diagram
[Diagram: in kernel space, sched_ext (scheduler entry, DSQ, local DSQs) is connected through two ring buffers to a userspace scheduler daemon that keeps a task queue: the RX ring carries the “RX” path and the TX ring carries the “TX” path.]

Write own CPU scheduler: Select CPU
When a task wakes up, the select_cpu callback is invoked. This callback provides a CPU hint for the task based on the following decision-making process:
• If the previous CPU was in an idle state and the idle state was successfully cleared, the task uses that CPU.
• Otherwise, if a CPU is successfully chosen, the task uses that CPU.
• Otherwise, the callback simply returns the previous CPU number.
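
The three-step decision above can be sketched in user-space C. The idle tracking here is a plain boolean array and the helper names (`toy_test_and_clear_idle`, `toy_pick_idle_cpu`, `toy_select_cpu`) are hypothetical; in particular, "pick the lowest-numbered idle CPU" is an assumption, since the slide does not say how the second CPU is chosen.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Claim `cpu` if it is idle: returns the old idle state and clears it. */
static bool toy_test_and_clear_idle(bool idle[TOY_NR_CPUS], int cpu)
{
    bool was_idle = idle[cpu];

    idle[cpu] = false;
    return was_idle;
}

/* Claim the lowest-numbered idle CPU, or return -1 if none is idle. */
static int toy_pick_idle_cpu(bool idle[TOY_NR_CPUS])
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        if (toy_test_and_clear_idle(idle, cpu))
            return cpu;
    return -1;
}

/* CPU hint for a waking task, following the three steps on the slide. */
static int toy_select_cpu(bool idle[TOY_NR_CPUS], int prev_cpu)
{
    int cpu;

    assert(prev_cpu >= 0 && prev_cpu < TOY_NR_CPUS);
    if (toy_test_and_clear_idle(idle, prev_cpu))
        return prev_cpu;      /* previous CPU was idle and is now claimed */
    cpu = toy_pick_idle_cpu(idle);
    if (cpu >= 0)
        return cpu;           /* some other idle CPU was claimed */
    return prev_cpu;          /* nothing idle: fall back to the previous CPU */
}
```

Clearing the idle bit at the moment of selection is what keeps two waking tasks from being steered to the same idle CPU in this model.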

Write own CPU scheduler: Enqueue a task
When a task becomes "runnable", the enqueue callback is invoked. The BPF scheduler places the task onto the "RX" ring buffer.
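
To make the ring-buffer hand-off concrete, here is a minimal single-threaded sketch of a fixed-size ring carrying task IDs. It is not the BPF ring buffer used by the scheduler in these slides: `toy_ring`, `toy_ring_push`, and `toy_ring_pop` are hypothetical, and a real kernel/userspace ring also needs memory ordering guarantees that this model omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 8 /* must be a power of two for the index mask */

struct toy_ring {
    int32_t slots[TOY_RING_SIZE];
    uint32_t head; /* next slot to read (consumer side) */
    uint32_t tail; /* next slot to write (producer side) */
};

/* Producer: queue a task ID. Returns 0, or -1 if the ring is full. */
static int toy_ring_push(struct toy_ring *r, int32_t pid)
{
    if (r->tail - r->head == TOY_RING_SIZE)
        return -1;
    r->slots[r->tail & (TOY_RING_SIZE - 1)] = pid;
    r->tail++;
    return 0;
}

/* Consumer: take the oldest task ID. Returns 0, or -1 if the ring is empty. */
static int toy_ring_pop(struct toy_ring *r, int32_t *pid)
{
    assert(pid != NULL);
    if (r->tail == r->head)
        return -1;
    *pid = r->slots[r->head & (TOY_RING_SIZE - 1)];
    r->head++;
    return 0;
}
```

In the "RX" direction the enqueue side would play the producer and the userspace daemon the consumer; the "TX" direction swaps the roles.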

Write own CPU scheduler: Enqueue a task
When the userspace application receives a task from the RX ring, the task is enqueued in a list. The application maintains the task list with the following restrictions:
- N = 8192
- T = {task1, task2, task3, ..., taskN}
- T[i].vruntime ≤ T[j].vruntime (1 ≤ i < j ≤ N)
After sorting the list, the application places the tasks onto the "TX" ring buffer.
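
One way to keep the list ordered by vruntime is to insert each incoming task at its sorted position. The sketch below assumes an array-backed list with insertion sort; the slide does not say which data structure or sorting method the daemon uses, and `toy_task`, `toy_list`, `toy_insert`, and `toy_pop_min` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_TASKS 8192 /* N from the slide */

struct toy_task {
    int pid;
    uint64_t vruntime;
};

struct toy_list {
    struct toy_task t[TOY_MAX_TASKS];
    size_t len;
};

/*
 * Insert while keeping vruntime in ascending order; tasks with equal
 * vruntime keep their arrival order. Returns 0, or -1 if the list is full.
 */
static int toy_insert(struct toy_list *l, struct toy_task task)
{
    size_t i;

    if (l->len == TOY_MAX_TASKS)
        return -1;
    i = l->len;
    while (i > 0 && l->t[i - 1].vruntime > task.vruntime)
        i--;
    /* Shift the tail one slot to the right to open position i. */
    memmove(&l->t[i + 1], &l->t[i], (l->len - i) * sizeof(l->t[0]));
    l->t[i] = task;
    l->len++;
    return 0;
}

/* Remove the task with the smallest vruntime. Returns 0, or -1 if empty. */
static int toy_pop_min(struct toy_list *l, struct toy_task *out)
{
    assert(out != NULL);
    if (l->len == 0)
        return -1;
    *out = l->t[0];
    l->len--;
    memmove(&l->t[0], &l->t[1], l->len * sizeof(l->t[0]));
    return 0;
}
```

Popping from the front yields tasks in the order they would be placed on the "TX" ring. An array makes the invariant easy to see, at the cost of O(N) shifts per operation; a heap or tree would be the usual choice if that cost mattered.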

Write own CPU scheduler: Dispatch tasks
The dispatch callback is called because there is still no task to run at this point. This callback consumes tasks from the "TX" ring. The consumed tasks are then placed on the DSQs and run on the selected CPU.

Thank you

Eishun Kondoh
