Slide 1

Learning sched_ext: BPF extensible scheduler class @shun159

Slide 2

Disclaimer ● I’m a newbie at Linux kernel development ● I’m also not a scheduler expert

Slide 3

Introduction ● sched_ext: a new extensible scheduler class ○ Allows scheduling policies to be implemented as BPF programs ○ Provides a simple and intuitive API for implementing policies ■ Doesn’t require knowledge of core scheduler internals ○ Allows experimentation in a safe manner, without even needing to reboot the system ■ Safe, cannot crash the host ● Protection afforded by the BPF verifier ○ Used in Meta production to optimize their workloads

Slide 4

Implementing scheduling policies: Overview ● Userspace can implement an arbitrary CPU scheduler by loading a BPF program that implements “sched_ext_ops” ● The BPF program must implement a set of callbacks ○ Task wakeup ○ Task enqueue/dequeue ○ Task state change (runnable, running, stopping, quiescent) ○ … ● Like other eBPF programs, we can use eBPF maps/data structures as needed [Diagram: a userspace program loads the struct_ops callbacks with the bpf(BPF_PROG_LOAD) syscall; the verifier checks them; in kernel space, sched_ext calls into the callbacks, which can use eBPF maps, rbtrees, and linked lists]

Slide 5

Implementing scheduling policies: Callbacks (1)

/* Pick the target CPU for a task which is being woken up */
s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);

/* Enqueue a runnable task on the BPF scheduler or dispatch directly to a CPU */
void (*enqueue)(struct task_struct *p, u64 enq_flags);

/* Remove a task from the BPF scheduler.
 * This is usually called to isolate the task while updating its
 * scheduling properties (e.g. priority). */
void (*dequeue)(struct task_struct *p, u64 deq_flags);

….

/* BPF scheduler’s name, 128 chars or less */
char name[SCX_OPS_NAME_LEN];

Slide 6

Implementing scheduling policies: Callbacks (2)

/* A task is becoming runnable on its associated CPU */
void (*runnable)(struct task_struct *p, u64 enq_flags);

/* A task is starting to run on its associated CPU */
void (*running)(struct task_struct *p);

/* A task is stopping execution on its associated CPU */
void (*stopping)(struct task_struct *p, bool runnable);

/* A task is becoming not runnable on its associated CPU */
void (*quiescent)(struct task_struct *p, u64 deq_flags);

The only thing we need to implement is the “name” of the scheduler; everything else is optional.

Slide 7

Implementing scheduling policies: BPF program If the scheduler switches all tasks (e.g. via scx_bpf_switch_all()), all existing and future CFS tasks (SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE, and SCHED_EXT) are switched to SCX. Otherwise, only tasks that have SCHED_EXT explicitly set will be placed on sched_ext. When SCX_ENQ_LOCAL is set in enq_flags, it indicates that running the task directly on the selected CPU should not affect fairness. In this case, just queue it on the local FIFO. Otherwise, in this example code, enqueue the task directly in the global DSQ; it will be consumed later by sched_ext. Specify the callbacks that are invoked by sched_ext.

Slide 8

DSQs (Dispatch Queues): Overview ● As an abstraction layer between the BPF scheduler and the kernel for managing queues of tasks, sched_ext uses FIFO queues called DSQs (Dispatch Queues). ○ By default, one global DSQ and a per-CPU local DSQ are created. ○ Global DSQ (SCX_DSQ_GLOBAL) ■ By default, consumed when the local DSQs are empty. ■ Can be utilized by a scheduler if necessary. ○ Per-CPU local DSQs (SCX_DSQ_LOCAL) ■ A per-CPU FIFO (or rbtree, if the task is flagged to use the priority queue) that SCX pulls from when putting the next task on the CPU. ■ A CPU always executes a task from its local DSQ. [Diagram: each CPU has its own local DSQ; the global DSQ feeds the CPUs]

Slide 9

DSQs (Dispatch Queues): Operations ● Each DSQ provides two operations: “dispatch” and “consume” ○ “Consume”: ■ Like pick_next_task(), consumes the next task from a DSQ to run on the calling CPU. ■ Done in ops.dispatch() when a CPU would otherwise go idle because no task was found. ○ “Dispatch”: places a task into a DSQ. Can be done in the following callbacks: ■ ops.enqueue: invoked when a task is being enqueued in the BPF scheduler. ■ ops.dispatch: invoked when a CPU would otherwise go idle because no task was found. ■ This operation should either dispatch one or more tasks to local DSQs or transfer a task from a DSQ to the current CPU’s local DSQ.

Slide 10

Scheduling Cycle: Task enqueue/wakeup flow

[Flowchart: the scheduler is entered either through explicit blocking (mutex, semaphore, waitqueue, etc.) or because TIF_NEED_RESCHED is set in p->thread_info->flags. If the task is waking up, s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags) is invoked and the task is migrated to the selected CPU. When the task becomes runnable, void (*enqueue)(struct task_struct *p, u64 enq_flags) is invoked; if the task is dispatched directly to a CPU, it is enqueued in that CPU’s local DSQ, otherwise it is enqueued in the BPF scheduler (or the global DSQ).]

When the task is waking up, ops.select_cpu() is the first operation invoked. The function provides: ● a CPU selection optimization hint ● waking up the CPU if it is idle. After the target CPU is selected, ops.enqueue() is invoked. It can make one of the following decisions: ● Immediately dispatch the task to a DSQ ● Queue the task on the BPF scheduler

Slide 11

Scheduling Cycle: Runqueue balance/dispatch

[Flowchart: at the start of balance, if the CPU already has locally dispatched tasks, a task is successfully balanced to the core. Otherwise void (*dispatch)(s32 cpu, struct task_struct *prev) is invoked to dispatch tasks to the local DSQ or transfer tasks to it. If any tasks were dispatched in the callback, a task is balanced to the core; otherwise the CPU goes idle.]

If there still isn’t a task to run, ops.dispatch() is invoked. We can use the following functions to populate the local DSQ: ● scx_bpf_dispatch(): dispatch a task to a DSQ ● scx_bpf_consume(): transfer a task to the dispatching DSQ. After ops.dispatch() returns, the following steps are taken: ● Try to consume a DSQ; if successful, run the task ● If tasks were dispatched, retry from the first step ● If the previous task is still “runnable”, keep executing it ● Otherwise, go idle. When a CPU is ready to schedule, it first looks at its local DSQ; if that is empty, it then looks at the global DSQ.

Slide 12

sched_class: .select_task_rq

int select_task_rq_scx(struct rq *rq, struct task_struct *p, int enq_flags)

● This callback is invoked by the core scheduler to determine which CPU to assign the task to.
● When the callback is invoked, the BPF scheduler’s “select_cpu” is called.

Slide 13

sched_class: .enqueue_task

void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)

● This callback is invoked when the task “*p” transitions to the “runnable” state.
● At this point, the BPF scheduler’s “runnable” and “enqueue” callbacks will be called.

Slide 14

sched_class: .pick_next_task

struct task_struct *pick_next_task_scx(struct rq *rq)

● Called by the core scheduler to determine which task from the DSQs should run next.
● This function returns the task that will run next (which then becomes the currently running task).
● When the callback is invoked, the BPF scheduler’s “running” callback can be called.

Slide 15

sched_class: .balance

int balance_scx(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

● This callback is invoked when there are no tasks to run.
● At this point, the BPF scheduler’s “cpu_acquire” (to handle cases where a CPU is handed back from another sched_class) and “dispatch” callbacks are invoked.

Slide 16

Build sched-ext kernel (1)

1. Checkout the sched_ext repo from github:

$ git clone https://github.com/sched-ext/sched_ext

2. Checkout and build the latest clang:

$ yay -S cmake ninja
$ mkdir ~/llvm
$ git clone https://github.com/llvm/llvm-project.git llvm-project
$ mkdir -p llvm-project/build; cd llvm-project/build
$ cmake -G Ninja \
    -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
    -DCMAKE_INSTALL_PREFIX="/$HOME/llvm/$(date +%Y%m%d)" \
    -DBUILD_SHARED_LIBS=OFF \
    -DLIBCLANG_BUILD_STATIC=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_TERMINFO=OFF \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    ../llvm
$ ninja install -j$(nproc)
$ ln -sf /$HOME/llvm/$(date +%Y%m%d) /$HOME/llvm/latest

Slide 17

Build sched-ext kernel (2)

3. Download and build the latest pahole:

$ cd /data/users/$USER
$ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git
$ mkdir -p pahole/build; cd pahole/build
$ cmake -G Ninja ../
$ ninja

*** After building pahole and clang, make sure they are in your $PATH ***

4. Build the sched_ext kernel with the following config options:

CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_BTF_TAG=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_SCHED_DEBUG=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
### 9P_FS is used by osandov-linux to mount the custom build directory from the host machine
CONFIG_9P_FS=y
CONFIG_NET_9P=y
CONFIG_NET_9P_FD=y
CONFIG_NET_9P_VIRTIO=y

Slide 18

Build sched-ext kernel (3)

4. Build sched_ext kernel:

$ make CC=clang-17 LD=ld.lld LLVM=1 menuconfig
$ make CC=clang-17 LD=ld.lld LLVM=1 olddefconfig
$ make CC=clang-17 LD=ld.lld LLVM=1 -j$(nproc)

5. Build scx samples:

$ cd tools/sched_ext
$ make CC=clang-17 LD=ld.lld LLVM=1 -j$(nproc)

Slide 19

Build sched-ext kernel (4)

6. Setup a VM for the sched_ext kernel

I recommend using osandov-linux[0], as it is a very handy tool for running a custom-built kernel.

$ vm.py create -c 4 -m 8192 -s 50G
$ vm.py archinstall
$ kconfig.py /configs/vmpy.fragment
$ vm.py run -k $PWD --

[0]: https://github.com/osandov/osandov-linux

Slide 20

Write own CPU scheduler ● A “simple-minded vruntime” scheduler that performs scheduling decisions in userspace. ● It may not be a practical approach; the intention is to see how to write a “kernel-bypassed” scheduler. ○ Move the complexity of scheduling tasks from the kernel to userspace.

Slide 21

Write own CPU scheduler: diagram

[Diagram: in kernel space, sched_ext, a task queue, and the local DSQs sit next to a ring buffer with an “RX” path (RX Ring) and a “TX” path (TX Ring); in userspace, the scheduler daemon reads tasks from the RX ring and writes scheduling decisions back on the TX ring. “Scheduler enter” marks where the kernel enters the scheduler.]

Slide 22

Write own CPU scheduler: Select CPU

When a task wakes up, the select_cpu callback is invoked. This callback provides a CPU hint for the task based on the following decision-making process: ● If the previous CPU was in an idle state and the idle state was successfully cleared, the task will use that CPU. ● Otherwise, if an idle CPU is successfully chosen, the task will use that CPU. ● Otherwise, the callback simply returns the previous CPU number.

Slide 23

Write own CPU scheduler: Enqueue a task

When a task enters the “runnable” state, the enqueue callback is invoked. The BPF scheduler places the task onto the “RX” ring buffer.

Slide 24

Write own CPU scheduler: Enqueue a task

When the userspace application receives a task from the RX ring, the task is enqueued in a list. The application maintains the task list with the following restrictions:
- N = 8192
- T = {task1, task2, task3, ..., taskN}
- T[i].vruntime ≤ T[j].vruntime (1 ≤ i < j ≤ N)
After sorting the list, the application places the tasks onto the “TX” ring buffer.

Slide 25

Write own CPU scheduler: Dispatch tasks

The dispatch callback is called because there still isn’t a task to run at this point. This callback consumes tasks from the “TX” ring. The consumed tasks are then placed on the DSQs and run on the selected CPUs.

Slide 26

Thank you