Needs for Live Scheduler Update
▪ The CFS scheduler itself consumes 7.6% of CPU cycles when thousands of containers co-run on a single node1
▪ Many optimized scheduling policies have been proposed
  ▪ Tableau2 boosted throughput by 1.6x for web applications
  ▪ Caladan3 improved network request tail latency by 11,000x
▪ Deploying a new scheduler requires long downtime
  ▪ recompiling the kernel
  ▪ rebooting the operating system
  ▪ re-initializing the hardware
1: [Zijun+, USENIX ATC ’22]  2: [Humphries+, EuroSys ’21]  3: [Manohar+, EuroSys ’18]
State-of-the-art Works of Live Update
▪ Function-level: Ksplice1
  ▪ ❌ only handles small code changes (<100 LOC)
▪ Component-level: ghOSt2
  ▪ implements a user-space scheduler
  ▪ ❌ extra overhead from context switches
▪ OS-level: VM-PHU3
  ▪ ❌ only applicable to the guest kernel in a VM
1: [Arnold+, EuroSys ’09]  2: [Humphries+, SOSP ’21]  3: [Russinovich+, EuroSys ’21]
Constraints & Goals
▪ Constraints
  ▪ targets commercial Linux servers
  ▪ not restricted to virtualization environments
▪ Goals
  ▪ adequate expressiveness (component-level)
  ▪ high generality (without extra constraints)
  ▪ both short downtime and safety
  ▪ easy to use
PLUGSCHED: Live Update by Scheduler Modularization
▪ Idea: package the scheduler as a kernel module
▪ Development
  ▪ develop the new scheduler
  ▪ generate an RPM
▪ Deployment
  ▪ inspect kernel stacks to check safety
  ▪ redirect the functions
  ▪ migrate the data to keep it up to date
Preprocess: Information Gathering
▪ Gather the information needed by boundary analysis
  ▪ properties of functions and global variables
    ◆ name, location, signature, attribute, scope, source file name
  ▪ kernel call graph
Preprocess: Boundary Analysis
▪ F_interface: the boundary between the scheduler and the kernel
▪ F_internal: inside the scheduler, called by F_internal or F_interface
▪ F_external: outside the scheduler module
▪ Relationships
  ▪ F_scheduler = F_internal ∪ F_interface = F − F_external
  ▪ call direction: F_external → F_interface → F_internal
[Figure: the scheduler module (F_scheduler), with call edges from F_external into F_interface and from F_interface into F_internal]
Preprocess: Boundary Analysis (cont.)
▪ Input: F_interface (user-defined); F_mod and G (generated by the compiler)
  ▪ F_mod: all functions in the scheduler source files
  ▪ G: the kernel call graph
▪ Output: F_internal, F_external (see the sketch below)
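The classification is essentially a closure computation over the call graph: starting from the user-defined F_interface set, walk G without leaving F_mod, and everything reached below the interface becomes F_internal. Below is a minimal C sketch of that traversal; `struct func`, `classify()`, the fixed fan-out, and the `MAX_FUNCS` bound are illustrative stand-ins, not PLUGSCHED's actual GCC-plugin code.

```c
#include <stdbool.h>

#define MAX_FUNCS 4096

enum fclass { F_EXTERNAL, F_INTERFACE, F_INTERNAL };

struct func {
    enum fclass cls;   /* pre-set to F_INTERFACE for user-chosen functions */
    bool in_fmod;      /* defined in one of the scheduler source files?    */
    int ncallees;
    int callees[8];    /* indices of called functions (toy fixed fan-out)  */
};

static void classify(struct func *f, int nfuncs)
{
    int worklist[MAX_FUNCS], top = 0;

    /* Seed the traversal with the interface functions. */
    for (int i = 0; i < nfuncs; i++)
        if (f[i].cls == F_INTERFACE)
            worklist[top++] = i;

    while (top > 0) {
        struct func *cur = &f[worklist[--top]];
        for (int j = 0; j < cur->ncallees; j++) {
            int c = cur->callees[j];
            /* A not-yet-classified callee inside F_mod joins F_internal. */
            if (f[c].in_fmod && f[c].cls == F_EXTERNAL) {
                f[c].cls = F_INTERNAL;
                worklist[top++] = c;
            }
        }
    }
    /* Whatever is still F_EXTERNAL remains in the kernel image. */
}
```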
Preprocess: Code Extraction
▪ Implemented as a GCC plugin
▪ Provides a work directory for developers: /kernel/sched/plugsched
▪ Three ad-hoc rules (illustrated below)
  ▪ convert each F_external definition into a function declaration
  ▪ convert each D_external definition into a data declaration by referencing VarDecl::str_decl
  ▪ copy the code of F_interface, F_internal, and D_internal verbatim into new files
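The three rules are easiest to see on concrete symbols. The snippet below shows what the extracted sources might look like, assuming `wake_up_process` falls into F_external and `root_task_group` into D_external; that boundary assignment is made up for illustration.

```c
#include <linux/sched.h>

/* Rule 1: an F_external definition becomes a bare declaration, so the new
 * module links against the function already in the running kernel. */
extern int wake_up_process(struct task_struct *p);

/* Rule 2: a D_external definition becomes a data declaration, emitted
 * from the recorded VarDecl::str_decl string. */
extern struct task_group root_task_group;

/* Rule 3: F_interface, F_internal, and D_internal keep their full
 * definitions, copied verbatim into the module's new source files. */
```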
Deployment: Stack Inspection
▪ Wait until all CPUs enter the safe point (state quiescence)
▪ A user-defined handler on a specified CPU examines all kernel stacks
  ▪ ensures no task has an old function in its call stack
  ▪ if an old function is found on any stack, PLUGSCHED refuses to update
▪ Two optimizations (sketched below)
  ▪ binary search
  ▪ parallel inspection
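The binary-search optimization amounts to keeping the replaced functions' address ranges sorted and testing every address found on a stack in O(log n); the other optimization inspects different tasks' stacks in parallel. A minimal C sketch of the per-address check, with `struct range` and `addr_in_old_func()` as illustrative names:

```c
#include <stdbool.h>

struct range {
    unsigned long start, end;   /* [start, end) of an old function's text */
};

/* old_funcs[] is assumed sorted by start address. */
static bool addr_in_old_func(unsigned long addr,
                             const struct range *old_funcs, int n)
{
    int lo = 0, hi = n - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;

        if (addr < old_funcs[mid].start)
            hi = mid - 1;
        else if (addr >= old_funcs[mid].end)
            lo = mid + 1;
        else
            return true;        /* old code still live: refuse to update */
    }
    return false;
}
```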
Deployment: Function Redirection
▪ Clear the WP flag of the CR0 register
  ▪ permits writes into the code segment during function redirection
▪ Replace the prologue of each original function with a JMP instruction
  ▪ the operand is the entry of the new function
  ▪ only interface functions need to be updated (see the sketch below)
▪ Exception: __schedule()
  ▪ ❌ a sleeping task would resume in the original __schedule()
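Concretely, the redirection writes a 5-byte `JMP rel32` over the old prologue. The sketch below assumes x86-64 with the new function within ±2 GiB of the old one; `disable_write_protect()` / `enable_write_protect()` are hypothetical stand-ins for the CR0.WP toggling, and `redirect()` itself is illustrative rather than PLUGSCHED's production code.

```c
#include <string.h>

void disable_write_protect(void);   /* assumed: clears CR0.WP   */
void enable_write_protect(void);    /* assumed: restores CR0.WP */

static void redirect(void *old_fn, void *new_fn)
{
    unsigned char jmp[5] = { 0xE9 };             /* JMP rel32 opcode */
    /* rel32 is measured from the end of the 5-byte JMP instruction. */
    int rel32 = (int)((long)new_fn - ((long)old_fn + 5));

    memcpy(&jmp[1], &rel32, sizeof(rel32));

    disable_write_protect();
    memcpy(old_fn, jmp, sizeof(jmp));            /* patch the prologue */
    enable_write_protect();
}
```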
Deployment: Function Redirection (cont.)
▪ Keep the read-only lower half at the point of context switch
  ▪ reuse the lower half of the original __schedule()
▪ Stack pivot
  ◆ adjust the stack frame size of __schedule()
▪ ROP (sketched below)
  ◆ modify the return address saved at context switch
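A conceptual sketch of the ROP step: a task sleeping in the old `__schedule()` keeps a return address into that function on its kernel stack, and the update overwrites that saved address so the task resumes in the new scheduler after its next context switch. `RESUME_RET_OFF` and `new_schedule_resume` are hypothetical; the real slot would follow from the stack-pivot frame adjustment.

```c
#define RESUME_RET_OFF 5            /* assumed slot index of the saved RA */

extern char new_schedule_resume[];  /* resume label in the new __schedule */

/* saved_sp: the task's saved kernel stack pointer at context switch. */
static void retarget_sleeper(unsigned long *saved_sp)
{
    /* Rewrite the return address pushed at context switch so the task
     * "returns" into the new __schedule() when it is switched back in. */
    saved_sp[RESUME_RET_OFF] = (unsigned long)new_schedule_resume;
}
```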
(Supplement) Stack Pivot
▪ Stack pivot: move the stack pointer to memory outside the stack (e.g., the heap)
▪ used when we lack space on the stack
[Figure: the original stack, a ROP chain without a stack pivot, and a ROP chain with a stack pivot, where a gadget such as `xchg %eax, %esp; ret` redirects execution to a fake stack on the heap]
Deployment: Data Update
▪ Prior work [Arnold+, EuroSys ’09] uses shadow data structures
▪ PLUGSCHED takes a different approach (see the sketch below)
  ▪ D_external is used by both the scheduler and the kernel
    ◆ not modifiable
  ▪ D_internal is used only within the scheduler
    ◆ modifiable
    ◆ Private: memory allocated within the new scheduler
    ◆ Shared: memory shared between the original and new schedulers
    ◆ Critical: rq, rt_rq, sched_class, etc.
    ◆ Non-critical: sched_domain_topology, task_group_cache, etc.
[Figure: the classification of data types]
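One plausible way to realize the Private/Shared split is sketched below: private D_internal gets fresh storage inside the module, while shared D_internal is resolved to the original kernel symbol's memory so both schedulers observe one copy across the hand-over. Resolving via `kallsyms_lookup_name()` is an assumption (PLUGSCHED's actual mechanism may differ), and `my_sched_stat`, `shared_task_groups`, and `resolve_shared_data()` are illustrative names.

```c
#include <linux/kallsyms.h>
#include <linux/percpu.h>
#include <linux/list.h>
#include <linux/errno.h>

/* Private D_internal: a brand-new per-CPU counter owned by the module. */
static DEFINE_PER_CPU(unsigned long, my_sched_stat);

/* Shared D_internal: points at the memory of the original symbol. */
static struct list_head *shared_task_groups;

static int resolve_shared_data(void)
{
    shared_task_groups =
        (struct list_head *)kallsyms_lookup_name("task_groups");
    return shared_task_groups ? 0 : -ENOENT;
}
```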
Deployment: Data Update (cont.)
▪ Data rebuild process
  ▪ dequeue all running tasks from the original scheduler
  ▪ enqueue them into the new scheduler
▪ Constraints
  ◆ ❌ changing data structure sizes
  ◆ ❌ modifying the semantics of original fields
  ◆ ❌ cascading live updates
▪ Use pre-reserved padding space to add new fields inside a struct body (illustrated below)
[Figure: the data rebuild process]
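The padding rule can be made mechanical with a compile-time size check. The structs and field names below are made up for illustration; only the pattern — claiming reserved padding without moving existing fields — reflects the slide's constraint.

```c
#include <linux/types.h>

struct rq_example {                      /* layout shipped by base kernel */
    u64 nr_running;
    u64 clock;
    unsigned long plugsched_reserved[4]; /* pre-reserved padding space */
};

struct rq_example_v2 {                   /* layout used by the new module */
    u64 nr_running;                      /* unchanged offsets */
    u64 clock;
    u64 bvt_warp_ns;                     /* hypothetical new BVT field */
    unsigned long plugsched_reserved[3]; /* remaining padding */
};

/* The hard rule from the slide: struct size must not change. */
_Static_assert(sizeof(struct rq_example_v2) == sizeof(struct rq_example),
               "live update must not change data structure sizes");
```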
Evaluation: Downtime
▪ Downtime ranges from 2.1/1.8 ms to 2.6/2.5 ms (install/uninstall)
▪ The variation comes from Data Rebuild, which depends on the number of data structures
▪ Comparison with kpatch
  ▪ reduces total downtime by 24%/26%
  ▪ 12.0x/13.2x more efficient than kpatch in stack inspection
[Figures: downtime in different cases; downtime of PLUGSCHED vs. kpatch]
Evaluation: Scalability
▪ Cores increasing
  ▪ grows to ~282/217 μs (install/uninstall) as the core count increases from 1 to 6
▪ Threads increasing
  ▪ stack inspection (SI) time increases by about 9.17x
[Figures: scalability of PLUGSCHED; SI time with different numbers of active threads]
Evaluation: Latency-sensitive Workloads
▪ Nginx (web server)
  ▪ throughput drops by up to 36% for 4 ms
▪ Redis (key-value store)
  ▪ the negative impact on throughput is negligible
[Figures: processing throughput of Nginx and of Redis, with the start and end of the update marked]
Evaluation: In-Production Case Studies
▪ Co-locating latency-sensitive and data analysis applications can cause long tail latency
▪ Integrating the BVT (Borrowed-Virtual-Time) feature via PLUGSCHED reduced the 99%-ile latency by 2.5x
▪ Function-level live patching tools are not applicable (>500 LOC)
[Figure: the latency distribution of BAR]
Related Works
▪ ghOSt [Humphries+, SOSP ’21]
  ▪ implements a user-space scheduler to support scheduler updates
  ▪ incurs context-switch overhead between user space and kernel space
▪ K42 [Baumann+, USENIX ATC ’05]
  ▪ supports live update based on a micro-kernel architecture
  ▪ the micro-kernel still includes the scheduler
▪ VM-PHU [Russinovich+, EuroSys ’21]
  ▪ uses the virtual machine manager to live-update the guest kernel
  ▪ cannot live-update the scheduler, which is a built-in kernel subsystem
Conclusion
▪ PLUGSCHED dynamically updates the kernel scheduler without extra constraints
  ▪ enables efficient updates of functions and data through modularization
  ▪ guarantees the safety of function redirection via stack inspection
▪ PLUGSCHED has been deployed on thousands of servers in the cloud and supports scheduler updates with extremely low overhead