Dual level of task scheduling for VM workloads

Dual Level of Task Scheduling for VM Workloads: Pain Points & Solutions

24 September 2024
Himadri CHHAYA-SHAILESH, PhD Student @ Whisper, Inria Paris
Supervised by: Jean-Pierre LOZI & Julia LAWALL

Outline

1 Context
2 Pain points
3 Existing partial solutions
4 Towards a complete solution
5 A step further → user space
6 Conclusion

Context

Dual level of task scheduling for VM workloads

[Diagram: the guest scheduler places the application's threads on vCPUs, and the host scheduler places the vCPUs on pCPUs.]

The semantic gap

The guest scheduler knows about the thread’s activities inside the VM, but does not know about the corresponding vCPU’s status on the host.

The host scheduler knows about the vCPU’s status, but does not know about the corresponding thread’s activities inside the VM.

The schedulers do not share the missing information with each other, and make task placement decisions based on the partial information available to them.

Pain points

Root cause: ill-timed vCPU preemptions

A vCPU is a regular process on the host. The host scheduler can preempt a vCPU at any time.

If a vCPU is preempted while the corresponding thread is performing a critical operation inside the VM, performance can be impacted severely.

Such ill-timed vCPU preemptions are more likely to occur when the host is oversubscribed.

Consequences of ill-timed vCPU preemptions [1]

- Lock holder preemption problem
- Lock waiter preemption problem
- Blocked-waiter wake up problem
- Readers preemption problem
- RCU reader preemption problem
- Interrupt context preemption problem

[1] Nicely described, with existing solutions, in "Scaling Guest OS Critical Sections with eCS" by S. Kashyap et al.

Exposing hardware information to the guest scheduler

We want to achieve bare-metal-like performance for VM workloads.

The scheduler is continuously optimized for bare-metal performance by utilizing the hardware information.

VM workloads cannot take full advantage of such optimizations unless the hardware information is also exposed to the guest scheduler.

Existing partial solutions

Dedicated resource partitioning + vCPU pinning

Ensures that a vCPU is able to run on a dedicated pCPU whenever the guest scheduler schedules a thread on the vCPU.

Expensive for the customers.

Unappealing for the cloud provider because of the lack of resource consolidation.

The issue of exposing hardware information to the guest scheduler still remains unsolved.

Steal time

The hypervisor passes the information about how much time was spent running processes other than the vCPUs of a VM to the guest scheduler.

arch/x86/include/uapi/asm/kvm_para.h
---
    struct kvm_steal_time {
        __u64 steal;
        __u32 version;
        __u32 flags;
        __u8  preempted;
        __u8  u8_pad[3];
        __u32 pad[11];
    };
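To make the role of the `version` and `preempted` fields concrete, here is a minimal user-space sketch of how a guest could read this record consistently. It assumes the convention that an odd `version` means the host is in the middle of an update, that `steal` is expressed in nanoseconds, and that bit 0 of `preempted` marks a preempted vCPU. The names `read_steal_ns`, `vcpu_preempted` and `VCPU_PREEMPTED` are illustrative, not kernel identifiers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the structure on the slide, with fixed-width C types. */
struct kvm_steal_time {
    uint64_t steal;      /* assumed: ns the vCPU was runnable but not running */
    uint32_t version;    /* assumed: odd while the host is updating the record */
    uint32_t flags;
    uint8_t  preempted;  /* assumed: bit 0 set while the vCPU is preempted */
    uint8_t  u8_pad[3];
    uint32_t pad[11];
};

#define VCPU_PREEMPTED (1u << 0) /* illustrative name for the bit-0 flag */

/* Retry until the same even version is seen before and after reading steal. */
static uint64_t read_steal_ns(const volatile struct kvm_steal_time *st)
{
    uint32_t version;
    uint64_t steal;

    assert(st != NULL);
    do {
        version = st->version;
        atomic_thread_fence(memory_order_acquire);
        steal = st->steal;
        atomic_thread_fence(memory_order_acquire);
    } while ((version & 1u) || version != st->version);
    return steal;
}

/* True if the host has flagged this vCPU as currently preempted. */
static bool vcpu_preempted(const volatile struct kvm_steal_time *st)
{
    assert(st != NULL);
    return (st->preempted & VCPU_PREEMPTED) != 0;
}
```

A guest-side lock implementation could, for example, call `vcpu_preempted()` on the lock holder's record to decide whether spinning is worthwhile.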


Forced VM-exit upon exceeding PLE limit

Spin loops can be implemented using the PAUSE instruction.

Hardware can detect if a vCPU has been spinning excessively.

Tunable parameter PLE_gap is used to determine if the interval between two consecutive PAUSE instructions is too short.

Tunable parameter PLE_window determines for how long a vCPU can spin before the hypervisor intervenes.

If the vCPU spinning exceeds the limit, then a VM exit is forced for the spinning vCPU.

The hypervisor then schedules one of the candidate vCPUs that can potentially free the resource for the spinning vCPU.
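The interplay of the two parameters can be illustrated with a small simulation. This is a sketch of the detection rule as described above, not the hardware's actual implementation: two PAUSEs closer together than `ple_gap` cycles are treated as part of the same spin loop, and a loop that lasts longer than `ple_window` cycles triggers the exit. The function name and the example values 128 and 4096 are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the index of the PAUSE that would trigger a forced VM exit,
 * or -1 if none does. pause_at[] holds the increasing cycle count at
 * which each PAUSE executes.
 */
static long ple_first_exit(const uint64_t *pause_at, size_t n,
                           uint64_t ple_gap, uint64_t ple_window)
{
    uint64_t loop_start, prev;

    assert(pause_at != NULL || n == 0);
    if (n == 0)
        return -1;

    loop_start = prev = pause_at[0];
    for (size_t i = 1; i < n; i++) {
        uint64_t now = pause_at[i];

        if (now - prev > ple_gap)
            loop_start = now;   /* long gap: treat as a new spin loop */
        else if (now - loop_start > ple_window)
            return (long)i;     /* spun for too long: force a VM exit */
        prev = now;
    }
    return -1;
}
```

With PAUSEs issued every 100 cycles, a gap of 128 and a window of 4096, the exit is triggered once the loop has lasted more than 4096 cycles. With PAUSEs issued every 200 cycles, each one starts a new loop and no exit is forced.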

Towards a complete solution

CPS from ASPLOS’23

Title: A Cooperative Para-virtualized Scheduling Framework for Manycore Machines
Authors: Yuxuan Liu, Tianqiang Xu, Zeyu Mi, Zhichao Hua, Binyu Zang, and Haibo Chen

Dynamic vcpu priority management from LKML’24

[RFC PATCH 0/8] Dynamic vcpu priority management in kvm
From: Vineeth Pillai and Joel Fernandes

[Diagrams: three versions of the paravirt scheduling design. Each spans Host Userland (VMM), Host Kernel, Guest Userland and Guest Kernel.]

Paravirt Scheduling: v1. Components shown: KVM, vcpu1 to vcpu4 threads, Scheduler, Handshake, Protocol Negotiation, Hypercall/MSR.

Paravirt Scheduling: v2. Components shown: KVM, Kernel module / BPF program, vcpu1 to vcpu4 threads, Scheduler, Handshake, Protocol Negotiation, Hypercall/MSR.

Paravirt Scheduling: v3. Components shown: pvsched driver/bpf program, KVM, pvsched-device process, VMM main thread, Kernel module / BPF program, vcpu1 to vcpu4 threads, Scheduler, other device processes.

A step further → user space

Guest parallel applications on oversubscribed hosts

Applications achieve parallelization with the help of parallel application runtime libraries, e.g. OpenMP, Open MPI, etc.

Parallel application runtime libraries determine the Degree of Parallelization (DoP) by referring to the number of cores on the machine, i.e. the number of vCPUs for a VM.

But vCPUs get preempted on the host, and we end up using an incorrect DoP for parallel applications running inside VMs.

Para-virtualized guest parallel application runtimes

Problem: Parallel application runtime libraries are oblivious to the phenomenon of vCPU preemption.

Impact: Suboptimal performance for parallel applications running inside VMs.

Solution:
1 Aggregate information about vCPU preemptions on the host.
2 Use this information inside the guest parallel application runtime libraries and dynamically adjust the DoP for the guest parallel applications.

How does it work?

[Diagram, before: the application runs four threads (T1 to T4) on four vCPUs over two pCPUs; vCPU2 and vCPU3 are marked :) while vCPU1 and vCPU4 are marked :(. After: the application runs two threads (T1 and T2) on vCPU2 and vCPU3, both marked :), and vCPU1 and vCPU4 carry no thread.]

Why does it work?

Let’s understand with an example...

VM workload: UA (input class B) from NPB3.4-OMP
- Unstructured computation
- Three major loops
- Implemented with a total of 38,768 internal barriers

Parallel application runtime library: libgomp from GCC-12
Host: 36 pCPUs, linux-kernel v6.11-rc4, Debian-testing
Guest: 36 vCPUs, linux-kernel v6.6.16, Debian-12

Spinning vs Blocking

OMP_WAIT_POLICY=active: parallel worker threads spin upon reaching a barrier.
OMP_WAIT_POLICY=passive: parallel worker threads block upon reaching a barrier.

In an ideal world, spinning is faster than blocking.
UA (Class B) spinning: 9.68 ± 0.04 seconds (1.80x)
UA (Class B) blocking: 17.43 ± 0.09 seconds

But in the real world with vCPU preemptions, spinning slows down more than blocking.
UA (Class B) spinning: 19.95 ± 0.5 seconds (0.48x)
UA (Class B) blocking: 22.43 ± 0.09 seconds (0.78x)

The degradation in spinning performance increases with the number of preempted vCPUs.

Spin with minimized number of preempted vCPUs

[Figure: trace of the DoP in the guest for UA (Class B). Axis: threads (0 to 30). Series: all threads, running threads.]

OMP_DYNAMIC

If the environment variable is set to true, the OpenMP implementation may adjust the number of threads to use for executing parallel regions in order to optimize the use of system resources.

How well does it work?

UA (Class B) spinning: 19.95 ± 0.5 seconds
UA (Class B) blocking: 22.43 ± 0.09 seconds
UA (Class B) spinning with minimized number of preempted vCPUs: 13.02 ± 0.23 seconds (1.53x)
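The factor next to the last measurement can be reproduced from the mean run times, assuming it is the ratio of the plain spinning time to the time with minimized preempted vCPUs. The helper below is only a worked-arithmetic sketch; `speedup` is an illustrative name.

```c
#include <assert.h>

/* How many times faster the candidate run is than the baseline run. */
static double speedup(double baseline_s, double candidate_s)
{
    assert(baseline_s > 0.0 && candidate_s > 0.0);
    return baseline_s / candidate_s;
}
```

Under that reading, `speedup(19.95, 13.02)` is about 1.53. The same computation on the ideal-world numbers, `speedup(17.43, 9.68)`, gives about 1.80, which matches the factor reported for spinning over blocking.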

Changes in libgomp

libgomp-gcc-12/config/linux/proc.c
---
    unsigned
    gomp_dynamic_max_threads (void)
    {
      ...
      pv_sched_info = SYSCALL (get_pv_sched_info);
      if (is_valid (pv_sched_info))
        curr_dop = prev_dop - pv_sched_info;
      ...
      return curr_dop;
    }
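The adjustment step in the excerpt above can be isolated into a small, testable helper. This sketch assumes that the value obtained from the host is a count of preempted vCPUs and that a negative value means no valid sample. It also adds a clamp so that at least one thread always remains, which the excerpt does not show. `adjust_dop` and its parameters are hypothetical names.

```c
#include <assert.h>

/*
 * Hypothetical helper: preempted_vcpus stands in for the value the
 * slide calls pv_sched_info. A negative value means "no valid sample".
 */
static unsigned adjust_dop(unsigned prev_dop, int preempted_vcpus)
{
    assert(prev_dop >= 1);
    if (preempted_vcpus < 0)
        return prev_dop;        /* invalid sample: keep the previous DoP */
    if ((unsigned)preempted_vcpus >= prev_dop)
        return 1;               /* always keep at least one thread */
    return prev_dop - (unsigned)preempted_vcpus;
}
```

For instance, with 36 vCPUs and 10 of them reported as preempted, the next parallel region would run with 26 threads.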

Changes in the host kernel

kernel/sched/core.c
---
    static inline void ttwu_do_wakeup(...)
    {
        if (is_vcpu(p) && !preempted_vcpu(p))
            record_pvsched_sample(...);
    }

    static void __sched notrace __schedule(...)
    {
        if (is_idle_task(next) || is_idle_task(prev))
            record_idle_sample(...);

        if (is_vcpu(prev) && preempted_vcpu(prev))
            record_pvsched_sample(...);
        else if (is_vcpu(next))
            record_pvsched_sample(...);
    }

    static void compute_pv_sched_info(...) {...}
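The slides do not show how the recorded samples are aggregated. As one possible illustration, the sketch below keeps a per-vCPU flag that is updated on each scheduling event and counts how many vCPUs are currently preempted. All names here (`pv_sched_state`, `record_sample`, `count_preempted`, `MAX_VCPUS`) are hypothetical and are not taken from the prototype.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VCPUS 64

/* Hypothetical per-VM bookkeeping; none of these names come from the slides. */
struct pv_sched_state {
    bool preempted[MAX_VCPUS];
    size_t nr_vcpus;
};

/*
 * Called with preempted = true when a vCPU is switched out while still
 * runnable, and with preempted = false when it is woken up or switched in.
 */
static void record_sample(struct pv_sched_state *s, size_t vcpu, bool preempted)
{
    assert(s != NULL && s->nr_vcpus <= MAX_VCPUS && vcpu < s->nr_vcpus);
    s->preempted[vcpu] = preempted;
}

/* Number of vCPUs currently marked as preempted. */
static int count_preempted(const struct pv_sched_state *s)
{
    int n = 0;

    assert(s != NULL && s->nr_vcpus <= MAX_VCPUS);
    for (size_t i = 0; i < s->nr_vcpus; i++)
        n += s->preempted[i] ? 1 : 0;
    return n;
}
```

A count like this is the kind of value a guest runtime could subtract from its DoP. A real implementation would likely smooth it over time rather than use an instantaneous snapshot.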

Prototype

[Diagram: the prototype. Components shown: Host Kernel Space (Host Scheduler), Host User Space (QEMU), Memory Backend, IVSHMEM, PCI Device, Guest Kernel Space (Guest Scheduler), Guest User Space (Application, OpenMP).]
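From the guest's point of view, consuming host-provided information through a shared-memory device amounts to mapping a region and reading a value from it. The sketch below maps the first page of a file and returns the 32-bit value at offset 0. The layout (one counter at offset 0) and the function name are assumptions. In a guest, the path would presumably be the PCI resource file of the IVSHMEM device, but any regular file works for experimentation.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first page of path read-only and return the 32-bit value
 * stored at offset 0, or -1 if the file cannot be opened or mapped.
 * The file must be at least 4 bytes long.
 */
static int64_t read_shared_u32(const char *path)
{
    int64_t result = -1;
    void *page;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    page = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (page != MAP_FAILED) {
        /* volatile: the other side may update the value at any time */
        result = (int64_t)*(const volatile uint32_t *)page;
        munmap(page, 4096);
    }
    close(fd);
    return result;
}
```

A runtime library that polls frequently would keep the mapping open instead of remapping on every read. The one-shot form is used here only to keep the sketch short.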

Conclusion

Para-virtualized solutions implementing co-operative scheduling are promising in order to address the semantic gap.

A common need for implementing these solutions is the shared memory between the host and the guest schedulers.

Custom implementation of the shared memory for every solution is unproductive and redundant work.

It is about time to standardize the interface.

Questions & Feedback: [email protected]
