Learning sched_ext: BPF extensible scheduler class

My learning note about sched_ext

Eishun Kondoh
June 06, 2023

Transcript

  1. Disclaimer
     - I’m a newbie at the Linux kernel
     - I’m not a scheduler expert, either
  2. Introduction
     • sched_ext: a new extensible scheduler class
       ◦ Allows scheduling policies to be implemented as BPF programs
       ◦ Provides a simple and intuitive API for implementing policies
         ▪ Doesn’t require knowledge of core scheduler internals
       ◦ Allows experimentation in a safe manner, without even needing to reboot the system
         ▪ Safe: a buggy policy cannot crash the host
           • Protection afforded by the BPF verifier
       ◦ Used in production at Meta to optimize their workloads
  3. Implementing scheduling policies: Overview
     • Userspace can implement an arbitrary CPU scheduler by loading a BPF program that implements “sched_ext_ops” (a minimal skeleton is sketched below)
     • The BPF program must implement a set of callbacks
       ◦ Task wakeup
       ◦ Task enqueue/dequeue
       ◦ Task state change (runnable, running, stopping, quiescent)
       ◦ …
     • Like other eBPF programs, we can use eBPF maps/data structures as needed

     [Diagram: a userspace program loads the struct_ops callbacks via the bpf(BPF_PROG_LOAD) syscall; the verifier checks them; in kernel space, sched_ext calls the callbacks, which can use eBPF maps, rbtrees and linked lists]
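     As a rough illustration (my own sketch, not code from the deck), a sched_ext scheduler is a struct_ops BPF program. The macro and helper names below (BPF_STRUCT_OPS, scx_bpf_dispatch(), SCX_DSQ_GLOBAL, SCX_SLICE_DFL, the scx_common.bpf.h header) follow the conventions of the example schedulers in the sched_ext tree and may differ between versions:

        /* Minimal sched_ext skeleton: every runnable task goes to the
         * global DSQ and runs with the default time slice. */
        #include "scx_common.bpf.h"

        char _license[] SEC("license") = "GPL";

        void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
        {
                scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
        }

        SEC(".struct_ops")
        struct sched_ext_ops minimal_ops = {
                .enqueue = (void *)minimal_enqueue,
                .name    = "minimal",   /* the only mandatory field */
        };

     Loading this struct_ops map (e.g. via a libbpf skeleton) registers the scheduler; unloading it falls back to the regular scheduler classes.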
  4. Implementing scheduling policies: Callbacks (1)

     /* Pick the target CPU for a task which is being woken up */
     s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);

     /* Enqueue a runnable task on the BPF scheduler or dispatch directly to a CPU */
     void (*enqueue)(struct task_struct *p, u64 enq_flags);

     /* Remove a task from the BPF scheduler.
      * This is usually called to isolate the task while updating its
      * scheduling properties (e.g. priority). */
     void (*dequeue)(struct task_struct *p, u64 deq_flags);

     ….

     /* BPF scheduler’s name, 128 chars or less */
     char name[SCX_OPS_NAME_LEN];
  5. Implementing scheduling policies: Callbacks (2)

     /* A task is becoming runnable on its associated CPU */
     void (*runnable)(struct task_struct *p, u64 enq_flags);

     /* A task is starting to run on its associated CPU */
     void (*running)(struct task_struct *p);

     /* A task is stopping execution on its associated CPU */
     void (*stopping)(struct task_struct *p, bool runnable);

     /* A task is becoming not runnable on its associated CPU */
     void (*quiescent)(struct task_struct *p, u64 deq_flags);

     The only thing we are required to provide is the “name” of the scheduler; every other callback is optional (a usage sketch of running/stopping follows below).
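     As a hedged usage sketch (not from the deck), the running/stopping pair can be used to measure how long each task actually stays on a CPU, using standard BPF task-local storage; integrating this into a real policy would of course need more care:

        struct task_ctx {
                u64 started_at;      /* timestamp of the last on-CPU transition */
                u64 total_runtime;   /* accumulated on-CPU time in ns */
        };

        struct {
                __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
                __uint(map_flags, BPF_F_NO_PREALLOC);
                __type(key, int);
                __type(value, struct task_ctx);
        } task_ctxs SEC(".maps");

        void BPF_STRUCT_OPS(sketch_running, struct task_struct *p)
        {
                struct task_ctx *tctx;

                tctx = bpf_task_storage_get(&task_ctxs, p, NULL,
                                            BPF_LOCAL_STORAGE_GET_F_CREATE);
                if (tctx)
                        tctx->started_at = bpf_ktime_get_ns();
        }

        void BPF_STRUCT_OPS(sketch_stopping, struct task_struct *p, bool runnable)
        {
                struct task_ctx *tctx;

                tctx = bpf_task_storage_get(&task_ctxs, p, NULL, 0);
                if (tctx && tctx->started_at)
                        tctx->total_runtime += bpf_ktime_get_ns() - tctx->started_at;
        }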
  6. Implementing scheduling policies: BPF program
     • If the scheduler switches all tasks over, all existing and future SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE and SCHED_EXT tasks are scheduled by SCX; otherwise, only tasks that have SCHED_EXT explicitly set are placed on sched_ext.
     • When SCX_ENQ_LOCAL is set in enq_flags, it indicates that running the task directly on the selected CPU will not affect fairness; in that case the task is simply queued on the local FIFO (the CPU’s local DSQ). Otherwise, in this example code, the task is enqueued on the global DSQ, from which it will be consumed later by sched_ext (see the sketch below).
     • Specify the callbacks that are invoked by sched_ext.
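     A sketch of the enqueue logic described above, modeled on the sched_ext example schedulers (SCX_ENQ_LOCAL, SCX_DSQ_LOCAL, SCX_DSQ_GLOBAL and scx_bpf_dispatch() are taken from that tree; treat the exact names and signatures as assumptions):

        void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
        {
                if (enq_flags & SCX_ENQ_LOCAL) {
                        /* Running directly on the selected CPU won't hurt
                         * fairness: queue on that CPU's local FIFO. */
                        scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
                } else {
                        /* Otherwise park the task on the global DSQ; sched_ext
                         * will consume it later. */
                        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
                }
        }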
  7. DSQs (Dispatch Queues): Overview
     • As an abstraction layer between the BPF scheduler and the kernel for managing queues of tasks, sched_ext uses FIFO queues called DSQs (Dispatch Queues) (see the sketch below).
       ◦ By default, one global DSQ and a per-CPU local DSQ are created.
       ◦ Global DSQ (SCX_DSQ_GLOBAL)
         ▪ By default, consumed when the local DSQs are empty.
         ▪ Can be utilized by a scheduler if necessary.
       ◦ Per-CPU local DSQs (SCX_DSQ_LOCAL)
         ▪ Per-CPU FIFOs (or RB trees, if the task is flagged to use the priority queue) that SCX pulls from when putting the next task on a CPU.
         ▪ A CPU always executes a task from its local DSQ.

     [Diagram: each CPU has its own local DSQ and pulls from the shared global DSQ]
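     Beyond the built-in global and local DSQs, a scheduler can create DSQs of its own. The sketch below is my own illustration (scx_bpf_create_dsq(), scx_bpf_dispatch() and scx_bpf_consume() exist in the sched_ext tree, but their exact signatures and the init callback conventions may differ between versions):

        #define MY_DSQ_ID 0   /* arbitrary scheduler-chosen DSQ id (assumption) */

        s32 BPF_STRUCT_OPS(sketch_init)
        {
                /* Create one custom FIFO DSQ on any NUMA node (-1). */
                return scx_bpf_create_dsq(MY_DSQ_ID, -1);
        }

        void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
        {
                /* Queue runnable tasks on the custom DSQ instead of the global one. */
                scx_bpf_dispatch(p, MY_DSQ_ID, SCX_SLICE_DFL, enq_flags);
        }

        void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
        {
                /* Move the next queued task onto this CPU's local DSQ. */
                scx_bpf_consume(MY_DSQ_ID);
        }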
  8. DSQs (Dispatch Queues): Operations
     • Each DSQ provides two operations: "dispatch" and "consume"
       ◦ “Consume”:
         ▪ Like pick_next_task(), this takes the next task from a DSQ so it can run on the calling CPU.
         ▪ Done in ops.dispatch(), which is invoked when a CPU is about to go idle because no task was found.
       ◦ “Dispatch”: placing a task onto a DSQ so it can run on a CPU. Can be done from the following callbacks:
         ▪ ops.enqueue: invoked when a task is being enqueued in the BPF scheduler.
         ▪ ops.dispatch: invoked when a CPU is about to go idle because no task was found.
           ▪ This callback should either dispatch one or more tasks to local DSQs or transfer a task from a DSQ to the current CPU's local DSQ (see the sketch below).
  9. Scheduling Cycle: Task enqueue/wakeup flow

     [Flowchart: the scheduler is entered either via explicit blocking (mutex, semaphore, waitqueue, etc.) or because TIF_NEED_RESCHED is set in p->thread_info->flags. If the task is waking up, ops.select_cpu() is invoked and the task is migrated to the selected CPU. Once the task becomes runnable, ops.enqueue() is invoked; if the task is dispatched directly to a CPU it is enqueued in that CPU's local DSQ, otherwise it is enqueued in the BPF scheduler (or the global DSQ).]

     • When a task is waking up, ops.select_cpu() is the first operation invoked. It serves two purposes:
       ◦ A CPU-selection optimization hint
       ◦ Waking up the selected CPU if it is idle
     • After the target CPU is selected, ops.enqueue() is invoked. It can make one of the following decisions:
       ◦ Immediately dispatch the task to a DSQ
       ◦ Queue the task on the BPF scheduler
  10. Scheduling Cycle: Runqueue balance/dispatch

      [Flowchart: balancing starts; if the CPU already has locally dispatched tasks, a task has been successfully balanced to the core. Otherwise ops.dispatch() is invoked to dispatch tasks to, or transfer tasks into, the local DSQ; if tasks were dispatched in the callback the balance succeeds, otherwise the CPU will go idle.]

      • If there still isn't a task to run, ops.dispatch() is invoked. The following functions can be used to populate the local DSQ:
        ◦ scx_bpf_dispatch(): dispatch a task to a DSQ
        ◦ scx_bpf_consume(): transfer a task from a DSQ to the dispatching CPU's local DSQ
      • After ops.dispatch() returns, the following steps are taken:
        ◦ Try to consume a DSQ; if successful, run the task
        ◦ If tasks were dispatched, retry the first branch
        ◦ If the previous task is still runnable, keep executing it
        ◦ Otherwise, go idle
      • When a CPU is ready to schedule, it first looks at its local DSQ; if that is empty, it then looks at the global DSQ.
  11. sched_class: .select_task_rq

      int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)

      • This callback is invoked by the core scheduler to determine which CPU to assign the task to.
      • When it is invoked, the BPF scheduler's "select_cpu" callback is called.
  12. sched_class: .enqueue_task

      void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)

      • This callback is invoked when "*p" transitions to the "runnable" state.
      • At this point, the BPF scheduler's "runnable" and "enqueue" callbacks are called.
  13. sched_class: .pick_next_task

      struct task_struct *pick_next_task_scx(struct rq *rq)

      • Called by the core scheduler to determine which task from the DSQs should run next.
      • This function returns the task that will run next on the CPU.
      • When the callback is invoked, the BPF scheduler's "running" callback can be called.
  14. sched_class: .balance

      int balance_scx(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

      • This callback is invoked when there are no tasks to run on the runqueue.
      • At this point, the BPF scheduler's "cpu_acquire" callback (to handle the case where the CPU is regained from a higher-priority sched_class) and its "dispatch" callback are invoked.
  15. Build sched-ext kernel (1)

      1. Check out the sched_ext repo from GitHub:

         $ git clone https://github.com/sched-ext/sched_ext

      2. Check out and build the latest clang:

         $ yay -S cmake ninja
         $ mkdir ~/llvm
         $ git clone https://github.com/llvm/llvm-project.git llvm-project
         $ mkdir -p llvm-project/build; cd llvm-project/build
         $ cmake -G Ninja \
             -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
             -DCMAKE_INSTALL_PREFIX="/$HOME/llvm/$(date +%Y%m%d)" \
             -DBUILD_SHARED_LIBS=OFF \
             -DLIBCLANG_BUILD_STATIC=ON \
             -DCMAKE_BUILD_TYPE=Release \
             -DLLVM_ENABLE_TERMINFO=OFF \
             -DLLVM_ENABLE_PROJECTS="clang;lld" \
             ../llvm
         $ ninja install -j$(nproc)
         $ ln -sf /$HOME/llvm/$(date +%Y%m%d) /$HOME/llvm/latest
  16. Build sched-ext kernel (2)

      3. Download and build the latest pahole:

         $ cd /data/users/$USER
         $ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git
         $ mkdir -p pahole/build; cd pahole/build
         $ cmake -G Ninja ../
         $ ninja

         *** After building pahole and clang, make sure they are in your $PATH ***

      4. Build the sched_ext kernel with the following config options:

         CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
         CONFIG_DEBUG_INFO_BTF=y
         CONFIG_PAHOLE_HAS_SPLIT_BTF=y
         CONFIG_PAHOLE_HAS_BTF_TAG=y
         CONFIG_SCHED_CLASS_EXT=y
         CONFIG_SCHED_DEBUG=y
         CONFIG_BPF_SYSCALL=y
         CONFIG_BPF_JIT=y
         ### 9P_FS is used by osandov-linux to mount the custom build directory from the host machine
         CONFIG_9P_FS=y
         CONFIG_NET_9P=y
         CONFIG_NET_9P_FD=y
         CONFIG_NET_9P_VIRTIO=y
  17. Build sched-ext kernel (3)

      4. Build the sched_ext kernel:

         $ make CC=clang-17 LD=ld.lld LLVM=1 menuconfig
         $ make CC=clang-17 LD=ld.lld LLVM=1 olddefconfig
         $ make CC=clang-17 LD=ld.lld LLVM=1 -j$(nproc)

      5. Build the scx samples:

         $ cd tools/sched_ext
         $ make CC=clang-17 LD=ld.lld LLVM=1 -j$(nproc)
  18. Build sched-ext kernel (4)

      6. Set up a VM for the sched_ext kernel.

         I recommend using osandov-linux[0], as it is a very handy tool for running a custom-built kernel.

         $ vm.py create -c 4 -m 8192 -s 50G <vm name>
         $ vm.py archinstall <vm name>
         $ kconfig.py <path to osandov-linux>/configs/vmpy.fragment
         $ vm.py run -k $PWD -- <vm name>

         [0]: https://github.com/osandov/osandov-linux
  19. Write your own CPU scheduler
      • A "simple-minded vruntime" scheduler that makes scheduling decisions in userspace.
      • It may not be a practical approach; the intention is to see how to write a “kernel bypassed” scheduler.
        ◦ Move the complexity of scheduling tasks from the kernel to userspace.
  20. Write your own CPU scheduler: diagram

      [Diagram: the kernel-space sched_ext part and the userspace scheduler daemon communicate through two ring buffers, an "RX" path (kernel to userspace) and a "TX" path (userspace to kernel); the daemon keeps its own task queue, and tasks end up on the DSQs / local DSQs to run.]
  21. Write your own CPU scheduler: Select CPU

      When a task wakes up, the select_cpu callback is invoked. This callback provides a CPU hint for the task based on the following decision-making process (see the sketch below):
      • If the previous CPU was in an idle state and the idle state was successfully cleared, the task will use that CPU.
      • Otherwise, if a CPU is successfully picked, the task will use that CPU.
      • Otherwise, the callback simply returns the previous CPU number.
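      A sketch of that decision process (scx_bpf_test_and_clear_cpu_idle() and scx_bpf_pick_idle_cpu() are my assumptions about the idle-tracking helpers available in the tree; the flags argument may not exist in every version):

         s32 BPF_STRUCT_OPS(userland_select_cpu, struct task_struct *p,
                            s32 prev_cpu, u64 wake_flags)
         {
                 s32 cpu;

                 /* Prefer the previous CPU if it is idle and we can claim it. */
                 if (scx_bpf_test_and_clear_cpu_idle(prev_cpu))
                         return prev_cpu;

                 /* Otherwise try to pick an idle CPU the task may run on. */
                 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
                 if (cpu >= 0)
                         return cpu;

                 /* Fall back to the previous CPU number. */
                 return prev_cpu;
         }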
  22. Write your own CPU scheduler: Enqueue a task

      When a task enters the "runnable" state, the enqueue callback is invoked. The BPF scheduler places the task onto the "RX" ring buffer (see the sketch below).
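      One way to implement the "RX" ring is a BPF ring buffer, as sketched below (the map name and message layout are my own assumptions; bpf_ringbuf_reserve()/bpf_ringbuf_submit() are standard BPF helpers):

         /* "RX" path: kernel -> userspace scheduler daemon. */
         struct rx_msg {
                 s32 pid;
                 u64 enq_flags;
         };

         struct {
                 __uint(type, BPF_MAP_TYPE_RINGBUF);
                 __uint(max_entries, 4096 * 4096);
         } rx_ring SEC(".maps");

         void BPF_STRUCT_OPS(userland_enqueue, struct task_struct *p, u64 enq_flags)
         {
                 struct rx_msg *msg;

                 msg = bpf_ringbuf_reserve(&rx_ring, sizeof(*msg), 0);
                 if (!msg)
                         return;   /* RX ring full: a real scheduler needs a fallback */

                 msg->pid = p->pid;
                 msg->enq_flags = enq_flags;
                 bpf_ringbuf_submit(msg, 0);
         }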
  23. Write your own CPU scheduler: Enqueue a task

      When the userspace application receives a task from the "RX" ring, the task is added to a task list. The application maintains the list under the following constraints (i.e. sorted by vruntime):

      - N = 8192
      - T = {task1, task2, task3, ..., taskN}
      - T[i].vruntime ≤ T[j].vruntime (0 ≤ i < j < N)

      After sorting the list, the application places the tasks onto the "TX" ring buffer (a userspace sketch follows below).
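      A userspace-side sketch of that bookkeeping in plain C (the struct layout, the comparator and the use of libbpf's ring_buffer__poll() are assumptions about how such a daemon could be put together; the TX side is only indicated in a comment):

         #include <stdlib.h>
         #include <bpf/libbpf.h>

         #define MAX_TASKS 8192

         struct utask {
                 int pid;
                 unsigned long long vruntime;
         };

         static struct utask tasks[MAX_TASKS];
         static int nr_tasks;

         /* Keep T[i].vruntime <= T[j].vruntime for i < j. */
         static int cmp_vruntime(const void *a, const void *b)
         {
                 const struct utask *ta = a, *tb = b;

                 if (ta->vruntime < tb->vruntime)
                         return -1;
                 return ta->vruntime > tb->vruntime;
         }

         /* Called by libbpf for each message the BPF side put on the RX ring;
          * registered via ring_buffer__new(rx_map_fd, handle_rx, NULL, NULL). */
         static int handle_rx(void *ctx, void *data, size_t len)
         {
                 struct rx_msg { int pid; unsigned long long enq_flags; } *msg = data;

                 if (nr_tasks < MAX_TASKS)
                         tasks[nr_tasks++] = (struct utask){ .pid = msg->pid };
                 return 0;
         }

         /* One scheduling round: drain the RX ring, sort by vruntime, then
          * push the sorted tasks onto the TX ring (TX code omitted here). */
         static void schedule_once(struct ring_buffer *rx_rb)
         {
                 ring_buffer__poll(rx_rb, 100 /* ms */);
                 qsort(tasks, nr_tasks, sizeof(tasks[0]), cmp_vruntime);
                 /* ... place tasks[0..nr_tasks) onto the "TX" ring in order ... */
         }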
  24. Write your own CPU scheduler: Dispatch tasks

      The dispatch callback is called because there still is no task to run at this point. The callback consumes tasks from the "TX" ring; the consumed tasks are then placed on the DSQs and run on the selected CPUs (see the sketch below).
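      One natural way to implement the userspace-to-kernel "TX" ring is a BPF user ring buffer drained with the bpf_user_ringbuf_drain() helper. The sketch below is a rough approximation of that idea (the message layout is mine, and bpf_task_from_pid()/bpf_task_release() are assumptions about which task-lookup kfuncs are usable here):

         /* "TX" path: userspace scheduler daemon -> kernel. */
         struct {
                 __uint(type, BPF_MAP_TYPE_USER_RINGBUF);
                 __uint(max_entries, 4096 * 4096);
         } tx_ring SEC(".maps");

         struct tx_msg {
                 s32 pid;
         };

         /* Invoked once per message drained from the TX ring. */
         static long dispatch_one(struct bpf_dynptr *dynptr, void *ctx)
         {
                 struct tx_msg msg;
                 struct task_struct *p;

                 if (bpf_dynptr_read(&msg, sizeof(msg), dynptr, 0, 0))
                         return 1;   /* malformed message: stop draining */

                 p = bpf_task_from_pid(msg.pid);
                 if (!p)
                         return 0;   /* the task already exited: skip it */

                 /* Put the task on the global DSQ so a CPU can pull it. */
                 scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
                 bpf_task_release(p);
                 return 0;
         }

         void BPF_STRUCT_OPS(userland_dispatch, s32 cpu, struct task_struct *prev)
         {
                 bpf_user_ringbuf_drain(&tx_ring, dispatch_one, NULL, 0);
         }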