
Reading: PLUGSCHED

wkb8s
June 04, 2024

Transcript

  1. Efficient Scheduler Live Update for Linux Kernel with Modularization
     Teng Ma¹, Shanpei Chen¹, Yihao Wu¹, Erwei Deng¹, Zhuo Song¹, Quan Chen², Minyi Guo² (¹Alibaba Group, ²Shanghai Jiao Tong University)
     ASPLOS ’23
     Presenter: Daiki Wakabayashi, Kono Laboratory, Keio University
     Plugsched: https://github.com/aliyun/plugsched
  2. Needs for Live Scheduler Update
     ▪ The Linux CFS scheduler has heavy overhead in many scenarios
       ▪ the CFS scheduler itself takes 7.6% of CPU cycles when thousands of containers co-run on a single node¹
     ▪ Many optimized scheduling policies have been proposed
       ▪ Tableau² boosted throughput by 1.6x for web applications
       ▪ Caladan³ improved network-request tail latency by 11,000x
     ▪ Deploying a new scheduler requires long downtime
       ▪ recompiling the kernel
       ▪ rebooting the operating system
       ▪ re-initializing the hardware
     1: [Zijun+, USENIX ATC ’22]  2: [Humphries+, Eurosys ’21]  3: [Manohar+, Eurosys ’18]
  3. State-of-the-art Works of Live Update
     ▪ Function-level: Ksplice¹ (live patch tools)
       ▪ ❌ only handles small code changes (<100 LOC)
     ▪ Component-level: ghOSt²
       ▪ implements a user-space scheduler
       ▪ ❌ extra context-switch overhead
     ▪ OS-level: VM-PHU³
       ▪ ❌ only applicable to guest kernels in VMs
     1: [Arnold+, Eurosys ’09]  2: [Humphries+, SOSP ’21]  3: [Russinovich+, Eurosys ’21]
  4. Constraints & Goals
     ▪ Constraints
       ▪ no kernel code changes
       ▪ deployable on commercial Linux servers
       ▪ not restricted to virtualized environments
     ▪ Goals
       ▪ adequate expressiveness (component-level)
       ▪ high generality (no extra constraints)
       ▪ both short downtime and safety
       ▪ easy to use
  5. PLUGSCHED: live update by scheduler modularization
     ▪ Preprocessing
       ▪ transform the built-in scheduler into a kernel module
     ▪ Development
       ▪ develop the new scheduler
       ▪ generate an RPM
     ▪ Deployment
       ▪ inspect kernel stacks to check safety
       ▪ redirect the functions
       ▪ migrate the data so it stays up to date
  6. Preprocess: Information Gathering
     ▪ Compile the Linux kernel with gcc-python-plugin
     ▪ Collect the information needed for boundary analysis (sketched below)
       ▪ properties of functions and global variables
         ◆ name, location, signature, attribute, scope, source file name
       ▪ kernel call graph
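To make the gathered properties concrete, here is a minimal C sketch of the records such a plugin could emit; the struct layout and field names are illustrative assumptions, not plugsched's actual output format.

```c
/* Hypothetical per-symbol record collected during compilation
 * (illustrative names; not plugsched's actual format). */
struct sym_info {
    const char *name;       /* symbol name, e.g. "pick_next_task_fair" */
    const char *file;       /* defining source file, e.g. "kernel/sched/fair.c" */
    const char *signature;  /* full type signature */
    const char *attrs;      /* attributes, e.g. "static", "__init" */
    int         is_global;  /* global variable (1) vs. function (0) */
};

/* One edge of the kernel call graph G: caller invokes callee. */
struct call_edge {
    const char *caller;
    const char *callee;
};
```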
  7. Preprocess: Boundary Analysis
     ▪ Divide functions (F) into three types
       ▪ F_interface: the boundary between the scheduler and the kernel
       ▪ F_internal: in the scheduler, called only by F_internal or F_interface
       ▪ F_external: outside the scheduler module
     ▪ Relationship
       ▪ F_scheduler = F_internal ∪ F_interface = F − F_external
       ▪ calls flow F_external → F_interface → F_internal
     (Figure: call graph of the scheduler module, with F_interface, F_internal, and F_external nodes)
  8. Preprocess: Boundary Analysis
     ▪ Determine the boundary between the scheduler module and the kernel
       ▪ input: F_interface (user-defined), F_mod and G (generated by the compiler)
         ◆ F_mod: all functions in the scheduler files
         ◆ G: kernel call graph
       ▪ output: F_internal, F_external
  9. Preprocess: Boundary Analysis
     ▪ Initialize the non-interface functions in /kernel/sched as F′
       ▪ F_mod: all functions in the scheduler files
     (Figure: /kernel/sched call graph, with F_interface, F′, and F′_external nodes)
  10. Preprocess: Boundary Analysis
      ▪ Repeat the following until no more edges satisfy ☆
        ▪ ☆: a function outside the module calls a function in F′ directly; move that callee from F′ to F′_external
      (Figure: /kernel/sched call graph during the iteration)
  11. Preprocess: Boundary Analysis
      ▪ Repeat the following until no more edges satisfy ☆ (continued)
      (Figure: /kernel/sched call graph, one more iteration step)
  12. Preprocess: Boundary Analysis
      ▪ F_internal and F_external are generated (a C sketch of the fixpoint appears below)
        ▪ F_internal ← F′
        ▪ F_external ← F′_external
      (Figure: /kernel/sched call graph with the final F_interface, F_internal, and F_external classification)
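Putting slides 9–12 together, the analysis is a simple fixpoint over the call graph. A minimal C sketch, assuming the graph is an edge list over function IDs and taking ☆ to mean "an external function calls a candidate in F′ directly" (a plausible reading of the slides, not plugsched's code):

```c
#include <stdbool.h>
#include <stddef.h>

enum fclass { F_INTERFACE, F_CANDIDATE /* F' */, F_EXTERNAL /* F'_external */ };

struct edge { int caller, callee; };   /* one call-graph edge of G */

/* Demote candidates (F') that are called from outside the module until a
 * fixpoint is reached; whatever remains in F_CANDIDATE is F_internal. */
static void boundary_analysis(enum fclass *cls,
                              const struct edge *g, size_t nedge)
{
    bool changed = true;

    while (changed) {
        changed = false;
        for (size_t i = 0; i < nedge; i++) {
            /* star condition: an external caller reaches F' without going
             * through an interface function */
            if (cls[g[i].caller] == F_EXTERNAL &&
                cls[g[i].callee] == F_CANDIDATE) {
                cls[g[i].callee] = F_EXTERNAL;   /* F' -> F'_external */
                changed = true;
            }
        }
    }
}
```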
  13. Preprocess: Boundary Analysis
      ▪ The scheduler module is the combination of F_interface and F_internal
      (Figure: F_interface and F_internal extracted from /kernel/sched as the scheduler module)
  14. Preprocess: Code Extraction
      ▪ Generate the new scheduler module into a new directory with the GCC plugin
        ▪ provides a work directory for developers: /kernel/sched/plugsched
      ▪ Three ad-hoc rules (example below)
        ▪ convert F_external definitions into function declarations
        ▪ convert D_external definitions into data declarations by referencing VarDecl::str_decl
        ▪ for F_interface, F_internal, and D_internal, simply copy their code into the new files
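As an illustration of the first two rules (an assumed example, not generated plugsched output), an external kernel function and global shrink to declarations in the extracted sources, so the module links against the running kernel's copies:

```c
/* Before extraction (in the kernel proper): */
struct wake_q_head;                        /* forward declaration for this sketch */
int sysctl_sched_rt_runtime = 950000;      /* D_external: defined in the kernel */
void wake_up_q(struct wake_q_head *head)   /* F_external: defined in the kernel */
{
    /* ... original body ... */
}

/* After extraction (in the generated module sources): */
extern int sysctl_sched_rt_runtime;        /* rule 2: data definition -> declaration */
void wake_up_q(struct wake_q_head *head);  /* rule 1: function definition -> declaration */
```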
  15. Deployment: Stack Inspection
      ▪ Stop the machine
        ▪ use Linux’s stop_machine() to bring all CPUs to a safe point (state quiescence)
      ▪ A user-defined handler on a designated CPU examines all kernel stacks (see the sketch below)
        ▪ ensures no task has an old function in its call stack
        ▪ if an old function is found on a stack, PLUGSCHED refuses to update
      ▪ Two optimizations
        ▪ binary searching
        ▪ parallel inspection
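A hedged C sketch of this check, assuming a helper that scans a quiesced task's kernel stack for return addresses in a given text range; `stack_contains()` and the `old_funcs` table are hypothetical stand-ins for the real implementation:

```c
#include <linux/sched/signal.h>   /* for_each_process_thread() */
#include <linux/stop_machine.h>

#define NR_OLD_FUNCS 16           /* illustrative table size */

struct text_range { unsigned long start, end; };
static struct text_range old_funcs[NR_OLD_FUNCS];   /* [start, end) of each replaced function */

/* Hypothetical helper: scan p's saved kernel stack for an address in range. */
extern bool stack_contains(struct task_struct *p,
                           unsigned long start, unsigned long end);

/* Runs via stop_machine(), so every CPU is quiesced and stacks are stable.
 * Returns -EBUSY to refuse the update if any old function is still live. */
static int inspect_stacks(void *unused)
{
    struct task_struct *g, *p;
    int i;

    for_each_process_thread(g, p)
        for (i = 0; i < NR_OLD_FUNCS; i++)
            if (stack_contains(p, old_funcs[i].start, old_funcs[i].end))
                return -EBUSY;    /* unsafe: an old function is on some stack */

    return 0;                     /* safe: proceed with redirection */
}

/* Usage: stop_machine(inspect_stacks, NULL, NULL); */
```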
  16. Deployment: Function Redirection
      ▪ Disable page protection
        ▪ clear the WP bit of the CR0 register
        ▪ allows writes into the code segment during function redirection
      ▪ Replace the prologue instruction of each original function with a JMP instruction (see the sketch below)
        ▪ the operand is the start address of the new function
        ▪ only the interface functions need to be updated
      ▪ Exception: __schedule()
        ▪ ❌ a sleeping task would resume in the original __schedule()
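A minimal x86-64 sketch of this patching step; it mirrors the slide's description rather than production code (newer kernels pin CR0.WP and provide text-poking helpers instead):

```c
#include <linux/string.h>
#include <linux/types.h>
#include <asm/processor-flags.h>   /* X86_CR0_WP */
#include <asm/special_insns.h>     /* native_read_cr0(), native_write_cr0() */

/* Overwrite old_fn's prologue with a 5-byte "jmp new_fn" (E9 rel32). */
static void redirect_function(void *old_fn, void *new_fn)
{
    unsigned char jmp[5];
    /* rel32 is relative to the byte after the JMP instruction */
    s32 rel = (s32)((long)new_fn - ((long)old_fn + sizeof(jmp)));
    unsigned long cr0 = native_read_cr0();

    jmp[0] = 0xe9;                       /* JMP rel32 opcode */
    memcpy(&jmp[1], &rel, sizeof(rel));

    native_write_cr0(cr0 & ~X86_CR0_WP); /* disable write protection */
    memcpy(old_fn, jmp, sizeof(jmp));    /* patch the prologue */
    native_write_cr0(cr0);               /* restore WP */
}
```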
  17. Deployment: Function Redirection
      ▪ Handling __schedule()
        ▪ split it at the point of context switch into a modifiable upper half and a read-only lower half
        ▪ reuse the lower half of the original __schedule()
        ▪ stack pivot
          ◆ adjusts the stack frame size of __schedule()
        ▪ ROP
          ◆ modifies the return address of the context switch
  18. (Supplement) Stack Pivot
      ▪ A ROP gadget that fakes the location of the stack
        ▪ used when there is not enough space on the stack
        ▪ a typical pivot gadget: xchg %eax, %esp; ret — swaps the stack pointer to memory outside the stack (e.g., the heap)
      (Figure: three stack layouts comparing the original stack, ROP without a stack pivot, and ROP with a stack pivot, chaining gadgets such as xor %eax, %eax; ret and pop %eax; pop %edx; ret)
  19. Deployment: Data Update
      ▪ Previous work: Ksplice [Arnold+, Eurosys ’09]
        ▪ uses shadow data structures
      ▪ PLUGSCHED takes a different approach
        ▪ D_external is used by both the scheduler and the kernel
          ◆ not modifiable
        ▪ D_internal is used only within the scheduler
          ◆ modifiable
          ◆ Private: allocated in the new scheduler’s own memory
          ◆ Shared: memory shared between the original and new schedulers
          ◆ Critical: rq, rt_rq, sched_class, etc.
          ◆ Non-critical: sched_domain_topology, task_group_cache, etc.
      (Figure: the classification of data types)
  20. Deployment: Data Update
      ▪ Data rebuild (see the sketch below)
        ▪ uses the stable APIs en/dequeue_task()
        ▪ dequeue all running tasks from the original scheduler
        ▪ enqueue them into the new scheduler
      ▪ Constraints
        ◆ ❌ changing data structure sizes
        ◆ ❌ modifying the semantics of original fields
        ◆ ❌ cascading live updates
        ▪ workaround: use pre-reserved padding space to add new fields inside a struct body
      (Figure: the data rebuild process)
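A hedged sketch of the rebuild loop; `old_dequeue_task()`, `new_enqueue_task()`, and `task_rq_of()` are hypothetical handles to the two implementations and the runqueue lookup (the kernel's real `task_rq()` macro is internal to kernel/sched):

```c
#include <linux/sched.h>
#include <linux/sched/signal.h>   /* for_each_process_thread() */

struct rq;   /* runqueue type, internal to kernel/sched */

/* Hypothetical handles to the original and replacement scheduler paths. */
extern void old_dequeue_task(struct rq *rq, struct task_struct *p, int flags);
extern void new_enqueue_task(struct rq *rq, struct task_struct *p, int flags);
extern struct rq *task_rq_of(struct task_struct *p);   /* stand-in for task_rq() */

/* Runs while the machine is stopped: move every queued task from the old
 * scheduler's runqueues into the new scheduler's. */
static void rebuild_sched_data(void)
{
    struct task_struct *g, *p;

    for_each_process_thread(g, p) {
        struct rq *rq = task_rq_of(p);

        if (!p->on_rq)
            continue;                        /* sleeping tasks stay untouched */
        old_dequeue_task(rq, p, 0 /* e.g. DEQUEUE_SAVE */);
        new_enqueue_task(rq, p, 0 /* e.g. ENQUEUE_RESTORE */);
    }
}
```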
  21. Evaluation
      ▪ Test cases
        ▪ randomly selected Linux upstream patches and three schedulers (p: patch, f: feature, n: new scheduler)
      ▪ Experimental setup
        ▪ Linux kernel 4.19.91
        ▪ dual-socket 48-core CPUs
          ◆ Intel Xeon(R) Platinum 8163 CPU @ 2.50 GHz
        ▪ 192 GB of memory
        ▪ Kpatch v0.7.1
  22. Evaluation: Downtime
      ▪ Efficiency across seven test cases
        ▪ downtime ranges from 2.1/1.8 ms to 2.6/2.5 ms (install/uninstall)
        ▪ the differences come from data rebuild, which depends on the number of data structures
      ▪ Comparison with Kpatch
        ▪ reduces total downtime by 24/26%
        ▪ 12.0/13.2x more efficient than Kpatch in stack inspection
      (Figures: downtime across the different cases; downtime of PLUGSCHED vs. Kpatch)
  23. Evaluation: Scalability
      ▪ Increasing cores
        ▪ stack inspection time drops from 1287/1070 μs to 282/217 μs (install/uninstall) as the core count grows from 1 to 6
      ▪ Increasing threads
        ▪ stack inspection (SI) time grows by about 9.17x
      (Figures: scalability of PLUGSCHED; SI time with different numbers of active threads)
  24. Evaluation: Latency-sensitive Workloads
      ▪ Nginx (web services)
        ▪ performance degradation can reach up to 36% during the roughly 4 ms update window
      ▪ Redis (key-value stores)
        ▪ the negative impact on throughput is negligible
      (Figures: processing throughput of Nginx and Redis, with the start and end of the update marked)
  25. Evaluation: In-production Case Studies
      ▪ CPU interference between a key-value store engine (named BAR) and a data-analysis application can cause long latency
      ▪ Integrating the BVT (Borrowed-Virtual-Time) feature with PLUGSCHED reduced the 99th-percentile latency by 2.5x
      ▪ function-level live patch tools are not applicable here (>500 LOC)
      (Figure: the latency distribution of BAR)
  26. Related Works
      ▪ ghOSt [Humphries+, SOSP ’21]
        ▪ implements a user-space scheduler class to support scheduler updates
        ▪ incurs context-switch overhead between user space and kernel space
      ▪ K42 [Baumann+, USENIX ATC ’05]
        ▪ supports live update based on a micro-kernel architecture
        ▪ the micro-kernel still includes the scheduler
      ▪ VM-PHU [Russinovich+, Eurosys ’21]
        ▪ a virtual machine manager that live-updates the guest kernel
        ▪ cannot live-update the scheduler, which is a built-in subsystem
  27. Conclusion
      ▪ The authors presented PLUGSCHED, a safe and easy-to-use system for dynamically updating the kernel scheduler without extra constraints
        ▪ enables efficient update of functions and data through modularization
        ▪ guarantees the safety of function redirection through stack inspection
      ▪ PLUGSCHED has been deployed on thousands of servers in the cloud and has supported scheduler updates with extremely low overhead