Needs for Live Scheduler Update
▪ The CFS scheduler itself consumes 7.6% of CPU cycles when thousands of containers co-run on a single node1
▪ Many optimized scheduling policies have been proposed
  ▪ Tableau2 boosted throughput by 1.6x for web applications
  ▪ Caladan3 improved network request tail latency by 11,000x
▪ Deploying a new scheduler requires long downtime
  ▪ recompiling the kernel
  ▪ rebooting the operating system
  ▪ re-initializing the hardware
1: [Zijun+, USENIX ATC ’22]  2: [Humphries+, EuroSys ’21]  3: [Manohar+, EuroSys ’18]
State-of-the-art Works of Live Update
▪ Function-level: Ksplice1
  ▪ ❌ only handles small code changes (<100 LOC)
▪ Component-level: ghOSt2
  ▪ implements a user-space scheduler
  ▪ ❌ extra overhead from context switches
▪ OS-level: VM-PHU3
  ▪ ❌ only applicable to the guest kernel in a VM
1: [Arnold+, EuroSys ’09]  2: [Humphries+, SOSP ’21]  3: [Russinovich+, EuroSys ’21]
Constraints & Goals
▪ Constraints
  ▪ targets commercial Linux servers
  ▪ not restricted to virtualization environments
▪ Goals
  ▪ adequate expressiveness (component-level)
  ▪ high generality (without extra constraints)
  ▪ both short downtime and safety
  ▪ easy to use
PLUGSCHED: Live Update by Scheduler Modularization
▪ Idea: package the scheduler as a kernel module
▪ Development
  ▪ develop the new scheduler
  ▪ generate an RPM
▪ Deployment
  ▪ inspect kernel stacks to check safety
  ▪ redirect the functions
  ▪ migrate the data to keep it up to date
Preprocess: Information Gathering
▪ Gather the information needed by boundary analysis
  ▪ properties of functions and global variables
    ◆ name, location, signature, attribute, scope, source file name
  ▪ kernel call graph
Preprocess: Boundary Analysis
▪ F_interface: the boundary between the scheduler and the kernel
▪ F_internal: inside the scheduler, called by F_internal or F_interface
▪ F_external: outside the scheduler module
▪ Relationships
  ▪ F_scheduler = F_internal ∪ F_interface = F − F_external
  ▪ call direction: F_external → F_interface → F_internal
[Figure: the scheduler module (F_scheduler), with call edges from F_external into F_interface and from F_interface into F_internal]
Preprocess: Boundary Analysis (cont.)
▪ Input: F_interface (user-defined); F_mod and G (generated by the compiler)
  ▪ F_mod: all functions in the scheduler source files
  ▪ G: the kernel call graph
▪ Output: F_internal, F_external (see the sketch below)
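The classification is essentially a closure computation over the call graph: starting from the user-defined F_interface set, walk G without leaving F_mod, and everything reached below the interface becomes F_internal. Below is a minimal C sketch of that traversal; `struct func`, `classify()`, the fixed fan-out, and the `MAX_FUNCS` bound are illustrative stand-ins, not PLUGSCHED's actual GCC-plugin code.

```c
#include <stdbool.h>

#define MAX_FUNCS 4096

enum fclass { F_EXTERNAL, F_INTERFACE, F_INTERNAL };

struct func {
    enum fclass cls;   /* pre-set to F_INTERFACE for user-chosen functions */
    bool in_fmod;      /* defined in one of the scheduler source files?    */
    int ncallees;
    int callees[8];    /* indices of called functions (toy fixed fan-out)  */
};

static void classify(struct func *f, int nfuncs)
{
    int worklist[MAX_FUNCS], top = 0;

    /* Seed the traversal with the interface functions. */
    for (int i = 0; i < nfuncs; i++)
        if (f[i].cls == F_INTERFACE)
            worklist[top++] = i;

    while (top > 0) {
        struct func *cur = &f[worklist[--top]];
        for (int j = 0; j < cur->ncallees; j++) {
            int c = cur->callees[j];
            /* A not-yet-classified callee inside F_mod joins F_internal. */
            if (f[c].in_fmod && f[c].cls == F_EXTERNAL) {
                f[c].cls = F_INTERNAL;
                worklist[top++] = c;
            }
        }
    }
    /* Whatever is still F_EXTERNAL remains in the kernel image. */
}
```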
Preprocess: Code Extraction
▪ Implemented as a GCC plugin
▪ Provides a work directory for developers: /kernel/sched/plugsched
▪ Three ad-hoc rules (illustrated below)
  ▪ convert each F_external definition into a function declaration
  ▪ convert each D_external definition into a data declaration by referencing VarDecl::str_decl
  ▪ copy the code of F_interface, F_internal, and D_internal verbatim into new files
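The three rules are easiest to see on concrete symbols. The snippet below shows what the extracted sources might look like, assuming `wake_up_process` falls into F_external and `root_task_group` into D_external; that boundary assignment is made up for illustration.

```c
#include <linux/sched.h>

/* Rule 1: an F_external definition becomes a bare declaration, so the new
 * module links against the function already in the running kernel. */
extern int wake_up_process(struct task_struct *p);

/* Rule 2: a D_external definition becomes a data declaration, emitted
 * from the recorded VarDecl::str_decl string. */
extern struct task_group root_task_group;

/* Rule 3: F_interface, F_internal, and D_internal keep their full
 * definitions, copied verbatim into the module's new source files. */
```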
Deployment: Stack Inspection
▪ Wait until all CPUs enter the safe point (state quiescence)
▪ A user-defined handler on a specified CPU examines all kernel stacks
  ▪ ensures no task has an old function in its call stack
  ▪ if an old function is found on any stack, PLUGSCHED refuses to update
▪ Two optimizations (sketched below)
  ▪ binary search
  ▪ parallel inspection
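The binary-search optimization amounts to keeping the replaced functions' address ranges sorted and testing every address found on a stack in O(log n); the other optimization inspects different tasks' stacks in parallel. A minimal C sketch of the per-address check, with `struct range` and `addr_in_old_func()` as illustrative names:

```c
#include <stdbool.h>

struct range {
    unsigned long start, end;   /* [start, end) of an old function's text */
};

/* old_funcs[] is assumed sorted by start address. */
static bool addr_in_old_func(unsigned long addr,
                             const struct range *old_funcs, int n)
{
    int lo = 0, hi = n - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;

        if (addr < old_funcs[mid].start)
            hi = mid - 1;
        else if (addr >= old_funcs[mid].end)
            lo = mid + 1;
        else
            return true;        /* old code still live: refuse to update */
    }
    return false;
}
```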
Deployment: Function Redirection
▪ Clear the WP flag of the CR0 register
  ▪ permits writes into the code segment during function redirection
▪ Replace the prologue of each original function with a JMP instruction
  ▪ the operand is the entry of the new function
  ▪ only interface functions need to be updated (see the sketch below)
▪ Exception: __schedule()
  ▪ ❌ a sleeping task would resume in the original __schedule()
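Concretely, the redirection writes a 5-byte `JMP rel32` over the old prologue. The sketch below assumes x86-64 with the new function within ±2 GiB of the old one; `disable_write_protect()` / `enable_write_protect()` are hypothetical stand-ins for the CR0.WP toggling, and `redirect()` itself is illustrative rather than PLUGSCHED's production code.

```c
#include <string.h>

void disable_write_protect(void);   /* assumed: clears CR0.WP   */
void enable_write_protect(void);    /* assumed: restores CR0.WP */

static void redirect(void *old_fn, void *new_fn)
{
    unsigned char jmp[5] = { 0xE9 };             /* JMP rel32 opcode */
    /* rel32 is measured from the end of the 5-byte JMP instruction. */
    int rel32 = (int)((long)new_fn - ((long)old_fn + 5));

    memcpy(&jmp[1], &rel32, sizeof(rel32));

    disable_write_protect();
    memcpy(old_fn, jmp, sizeof(jmp));            /* patch the prologue */
    enable_write_protect();
}
```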
Deployment: Function Redirection (cont.)
▪ Keep the read-only lower half at the point of context switch
  ▪ reuse the lower half of the original __schedule()
▪ Stack pivot
  ◆ adjust the stack frame size of __schedule()
▪ ROP (sketched below)
  ◆ modify the return address saved at context switch
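A conceptual sketch of the ROP step: a task sleeping in the old `__schedule()` keeps a return address into that function on its kernel stack, and the update overwrites that saved address so the task resumes in the new scheduler after its next context switch. `RESUME_RET_OFF` and `new_schedule_resume` are hypothetical; the real slot would follow from the stack-pivot frame adjustment.

```c
#define RESUME_RET_OFF 5            /* assumed slot index of the saved RA */

extern char new_schedule_resume[];  /* resume label in the new __schedule */

/* saved_sp: the task's saved kernel stack pointer at context switch. */
static void retarget_sleeper(unsigned long *saved_sp)
{
    /* Rewrite the return address pushed at context switch so the task
     * "returns" into the new __schedule() when it is switched back in. */
    saved_sp[RESUME_RET_OFF] = (unsigned long)new_schedule_resume;
}
```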
(Supplement) Stack Pivot
▪ Stack pivot: move the stack pointer to memory outside the stack (e.g., the heap)
▪ used when we lack space on the stack
[Figure: the original stack, a ROP chain without a stack pivot, and a ROP chain with a stack pivot, where a gadget such as `xchg %eax, %esp; ret` redirects execution to a fake stack on the heap]
Deployment: Data Update
▪ Prior work [Arnold+, EuroSys ’09] uses shadow data structures
▪ PLUGSCHED takes a different approach (see the sketch below)
  ▪ D_external is used by both the scheduler and the kernel
    ◆ not modifiable
  ▪ D_internal is used only within the scheduler
    ◆ modifiable
    ◆ Private: memory allocated within the new scheduler
    ◆ Shared: memory shared between the original and new schedulers
    ◆ Critical: rq, rt_rq, sched_class, etc.
    ◆ Non-critical: sched_domain_topology, task_group_cache, etc.
[Figure: the classification of data types]
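One plausible way to realize the Private/Shared split is sketched below: private D_internal gets fresh storage inside the module, while shared D_internal is resolved to the original kernel symbol's memory so both schedulers observe one copy across the hand-over. Resolving via `kallsyms_lookup_name()` is an assumption (PLUGSCHED's actual mechanism may differ), and `my_sched_stat`, `shared_task_groups`, and `resolve_shared_data()` are illustrative names.

```c
#include <linux/kallsyms.h>
#include <linux/percpu.h>
#include <linux/list.h>
#include <linux/errno.h>

/* Private D_internal: a brand-new per-CPU counter owned by the module. */
static DEFINE_PER_CPU(unsigned long, my_sched_stat);

/* Shared D_internal: points at the memory of the original symbol. */
static struct list_head *shared_task_groups;

static int resolve_shared_data(void)
{
    shared_task_groups =
        (struct list_head *)kallsyms_lookup_name("task_groups");
    return shared_task_groups ? 0 : -ENOENT;
}
```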
Deployment: Data Update (cont.)
▪ Data rebuild process
  ▪ dequeue all running tasks from the original scheduler
  ▪ enqueue them into the new scheduler
▪ Constraints
  ◆ ❌ changing data structure sizes
  ◆ ❌ modifying the semantics of original fields
  ◆ ❌ cascading live updates
▪ Use pre-reserved padding space to add new fields inside a struct body (illustrated below)
[Figure: the data rebuild process]
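The padding rule can be made mechanical with a compile-time size check. The structs and field names below are made up for illustration; only the pattern — claiming reserved padding without moving existing fields — reflects the slide's constraint.

```c
#include <linux/types.h>

struct rq_example {                      /* layout shipped by base kernel */
    u64 nr_running;
    u64 clock;
    unsigned long plugsched_reserved[4]; /* pre-reserved padding space */
};

struct rq_example_v2 {                   /* layout used by the new module */
    u64 nr_running;                      /* unchanged offsets */
    u64 clock;
    u64 bvt_warp_ns;                     /* hypothetical new BVT field */
    unsigned long plugsched_reserved[3]; /* remaining padding */
};

/* The hard rule from the slide: struct size must not change. */
_Static_assert(sizeof(struct rq_example_v2) == sizeof(struct rq_example),
               "live update must not change data structure sizes");
```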
Evaluation: Downtime
▪ Downtime ranges from 2.1/1.8 ms to 2.6/2.5 ms (install/uninstall)
▪ The variation comes from Data Rebuild, which depends on the number of data structures
▪ Comparison with kpatch
  ▪ reduces total downtime by 24%/26%
  ▪ 12.0x/13.2x more efficient than kpatch in stack inspection
[Figures: downtime in different cases; downtime of PLUGSCHED vs. kpatch]
Evaluation: Scalability
▪ Cores increasing
  ▪ grows to ~282/217 μs (install/uninstall) as the core count increases from 1 to 6
▪ Threads increasing
  ▪ stack inspection (SI) time increases by about 9.17x
[Figures: scalability of PLUGSCHED; SI time with different numbers of active threads]
Evaluation: Latency-sensitive Workloads
▪ Nginx (web server)
  ▪ throughput drops by up to 36% for 4 ms
▪ Redis (key-value store)
  ▪ the negative impact on throughput is negligible
[Figures: processing throughput of Nginx and of Redis, with the start and end of the update marked]
Evaluation: In-Production Case Studies
▪ Co-locating latency-sensitive and data analysis applications can cause long tail latency
▪ Integrating the BVT (Borrowed-Virtual-Time) feature via PLUGSCHED reduced the 99%-ile latency by 2.5x
▪ Function-level live patching tools are not applicable (>500 LOC)
[Figure: the latency distribution of BAR]
Related Works
▪ ghOSt [Humphries+, SOSP ’21]
  ▪ implements a user-space scheduler to support scheduler updates
  ▪ incurs context-switch overhead between user space and kernel space
▪ K42 [Baumann+, USENIX ATC ’05]
  ▪ supports live update based on a micro-kernel architecture
  ▪ the micro-kernel still includes the scheduler
▪ VM-PHU [Russinovich+, EuroSys ’21]
  ▪ uses the virtual machine manager to live-update the guest kernel
  ▪ cannot live-update the scheduler, which is a built-in kernel subsystem
Conclusion
▪ PLUGSCHED dynamically updates the kernel scheduler without extra constraints
  ▪ enables efficient updates of functions and data through modularization
  ▪ guarantees the safety of function redirection via stack inspection
▪ PLUGSCHED has been deployed on thousands of servers in the cloud and supports scheduler updates with extremely low overhead