Slide 1

Dual Level of Task Scheduling for VM Workloads: Pain Points & Solutions
24 September 2024
Himadri CHHAYA-SHAILESH, PhD Student @ Whisper, Inria Paris
Supervised by: Jean-Pierre LOZI & Julia LAWALL

Slide 2

Outline
1 Context
2 Pain points
3 Existing partial solutions
4 Towards a complete solution
5 A step further → user space
6 Conclusion

Slide 3

Context

Slide 4

Dual level of task scheduling for VM workloads
[Diagram: the guest scheduler places the application's threads on vCPUs; the host scheduler in turn places the vCPUs on pCPUs.]

Slide 5

The semantic gap
The guest scheduler knows about the thread's activities inside the VM, but does not know about the corresponding vCPU's status on the host.
The host scheduler knows about the vCPU's status, but does not know about the corresponding thread's activities inside the VM.
The schedulers do not share the missing information with each other, and make task placement decisions based on the partial information available to them.

Slide 6

Pain points

Slide 7

Root cause: ill-timed vCPU preemptions
A vCPU is a regular process on the host.
The host scheduler can preempt a vCPU at any time.
If a vCPU is preempted while the corresponding thread is performing a critical operation inside the VM, performance can be severely impacted.
Such ill-timed vCPU preemptions are more likely to occur when the host is oversubscribed.

Slide 8

Consequences of ill-timed vCPU preemptions [1]
Lock holder preemption problem
Lock waiter preemption problem
Blocked-waiter wake up problem
Readers preemption problem
RCU reader preemption problem
Interrupt context preemption problem
[1] Nicely described, along with existing solutions, in "Scaling Guest OS Critical Sections with eCS" by S. Kashyap et al.
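To make the first of these concrete, here is a minimal sketch of a test-and-set spinlock — not the guest kernel's actual lock code, and using C11 atomics rather than kernel primitives. If the host preempts the vCPU of the thread holding the lock, every waiter keeps spinning and burns pCPU time without making progress.

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
        /* Lock waiters spin here; the guest has no idea whether the lock
           holder's vCPU is even running on a pCPU right now. */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
                ;   /* keep spinning, even if the holder's vCPU was preempted */
}

static void spin_unlock(void)
{
        atomic_flag_clear_explicit(&lock, memory_order_release);
}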

Slide 9

Exposing hardware information to the guest scheduler
We want to achieve bare-metal-like performance for VM workloads.
The scheduler is continuously optimized for bare-metal performance by utilizing hardware information.
VM workloads cannot take full advantage of such optimizations unless the hardware information is also exposed to the guest scheduler.

Slide 10

Existing partial solutions

Slide 11

Dedicated resource partitioning + vCPU pinning
Ensures that a vCPU is able to run on a dedicated pCPU whenever the guest scheduler schedules a thread on that vCPU.
Expensive for the customers.
Unappealing for the cloud provider because of the lack of resource consolidation.
The issue of exposing hardware information to the guest scheduler remains unsolved.
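For illustration, a minimal sketch of vCPU pinning with sched_setaffinity(2), assuming the host thread ID of the vCPU is known; in practice this is typically configured through libvirt's vcpupin or taskset rather than custom code.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Pin the host thread backing a vCPU to one dedicated pCPU. */
static int pin_vcpu_thread(pid_t vcpu_tid, int pcpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(pcpu, &set);
        if (sched_setaffinity(vcpu_tid, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return -1;
        }
        return 0;
}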

Slide 12

Steal time
The hypervisor passes to the guest scheduler information about how much time was spent running processes other than the VM's vCPUs.

arch/x86/include/uapi/asm/kvm_para.h:

struct kvm_steal_time {
        __u64 steal;
        __u32 version;
        __u32 flags;
        __u8  preempted;
        __u8  u8_pad[3];
        __u32 pad[11];
};
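As a hedged sketch of how the guest side can consume this structure, modeled on the helpers in Linux's arch/x86/kernel/kvm.c: the guest registers one kvm_steal_time area per vCPU with the hypervisor and can test the KVM_VCPU_PREEMPTED bit before spinning on work owned by another vCPU. The array below is a simplified stand-in for the kernel's per-CPU area.

#include <asm/kvm_para.h>   /* struct kvm_steal_time, KVM_VCPU_PREEMPTED */
#include <stdbool.h>

#define NR_GUEST_CPUS 64

/* Simplified stand-in for the per-CPU area registered via MSR_KVM_STEAL_TIME. */
static struct kvm_steal_time steal_time[NR_GUEST_CPUS];

static bool vcpu_is_preempted(int cpu)
{
        /* The real kernel uses READ_ONCE(); a volatile read stands in here. */
        return (*(volatile __u8 *)&steal_time[cpu].preempted)
               & KVM_VCPU_PREEMPTED;
}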

Slide 13

Forced VM-exit upon exceeding PLE limit
Spin loops can be implemented using the PAUSE instruction.
Hardware can detect if a vCPU has been spinning excessively.
The tunable parameter PLE_gap is used to determine whether the interval between two consecutive PAUSE instructions is too short.
The tunable parameter PLE_window determines for how long a vCPU can spin before the hypervisor intervenes.
If the vCPU's spinning exceeds the limit, a VM exit is forced for the spinning vCPU.
The hypervisor then schedules one of the candidate vCPUs that can potentially free the resource for the spinning vCPU.
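A minimal sketch of the kind of spin-wait loop PLE watches for — user-space C with the compiler's PAUSE intrinsic; guest kernels use their own spin primitives. Each iteration issues PAUSE, and the hardware compares the gap between consecutive PAUSEs against PLE_gap and the accumulated spinning against PLE_window.

#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause() emits the PAUSE instruction */

static void spin_until_set(atomic_int *flag)
{
        while (atomic_load_explicit(flag, memory_order_acquire) == 0)
                _mm_pause();   /* each PAUSE is what the PLE hardware counts */
}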

Slide 18

Towards a complete solution

Slide 19

CPS from ASPLOS’23
Title: A Cooperative Para-virtualized Scheduling Framework for Manycore Machines
Authors: Yuxuan Liu, Tianqiang Xu, Zeyu Mi, Zhichao Hua, Binyu Zang, and Haibo Chen

Slide 20

Dynamic vcpu priority management from LKML’24
[RFC PATCH 0/8] Dynamic vcpu priority management in kvm
From: Vineeth Pillai and Joel Fernandes
Paravirt Scheduling, v1:
[Diagram: the VMM's vcpu1–vcpu4 threads in host user space sit on top of KVM and the scheduler in the host kernel; the guest kernel and KVM negotiate a handshake protocol via hypercalls/MSRs.]

Slide 21

Dynamic vcpu priority management from LKML’24
Paravirt Scheduling, v2:
[Diagram: same structure as v1, with a kernel module / BPF program added in the host kernel between KVM and the scheduler; handshake protocol negotiation still via hypercalls/MSRs.]

Slide 22

Dynamic vcpu priority management from LKML’24
Paravirt Scheduling, v3:
[Diagram: in host user space, the VMM main thread, the vcpu1–vcpu4 threads, a pvsched-device process, and other device processes; in the host kernel, KVM, a kernel module / BPF program, and the scheduler; in the guest kernel, a pvsched driver/BPF program.]
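Purely as an illustration of the kind of state such a handshake could share — the field layout below is invented, not the one from the RFC patches: a per-vCPU record that the guest fills in and the host-side kernel module or BPF program consults when deciding whether to boost or preempt a vCPU.

#include <linux/types.h>

/* Hypothetical per-vCPU record shared between guest and host. */
struct pvsched_vcpu_state {
        __u32 version;       /* protocol version agreed during the handshake */
        __u32 flags;
        __u8  in_critical;   /* guest: currently inside a critical section   */
        __u8  needs_boost;   /* guest: requests a temporary priority boost   */
        __u8  preempted;     /* host: vCPU is currently descheduled          */
        __u8  pad[5];
};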

Slide 23

A step further → user space

Slide 24

Guest parallel applications on oversubscribed hosts
Applications achieve parallelization with the help of parallel application runtime libraries, e.g., OpenMP, Open MPI, etc.
Parallel application runtime libraries determine the Degree of Parallelization (DoP) by referring to the number of cores on the machine, i.e., the number of vCPUs for a VM.
But vCPUs get preempted on the host, and we end up using an incorrect DoP for parallel applications running inside VMs.

Slide 25

Para-virtualized guest parallel application runtimes
Problem: parallel application runtime libraries are oblivious to the phenomenon of vCPU preemption.
Impact: suboptimal performance for parallel applications running inside VMs.
Solution:
1 Aggregate information about vCPU preemptions on the host.
2 Use this information inside the guest parallel application runtime libraries and dynamically adjust the DoP for the guest parallel applications (see the sketch below).
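A hedged sketch of step 2 at the runtime boundary: shrink the DoP by the number of currently preempted vCPUs before the next parallel region. read_preempted_vcpus() is a hypothetical placeholder for whatever channel exposes the aggregated host information to the guest; the prototype described later plugs this logic into libgomp instead.

#include <omp.h>

/* Hypothetical stand-in: in the prototype this count comes from the host
   through shared memory; here it is hard-wired to 0 so the sketch compiles. */
static int read_preempted_vcpus(void)
{
        return 0;
}

static void adjust_dop(void)
{
        int ncores    = omp_get_num_procs();   /* vCPUs visible to the guest */
        int preempted = read_preempted_vcpus();
        int dop       = ncores - preempted;

        if (dop < 1)
                dop = 1;
        omp_set_num_threads(dop);   /* applies to subsequent parallel regions */
}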

Slide 26

How does it work?
[Diagram, before: the application's threads T1–T4 are placed on vCPU1–vCPU4 by the guest scheduler, but the host scheduler runs only vCPU2 and vCPU3 on pCPU1 and pCPU2, while vCPU1 and vCPU4 are preempted. After: with the DoP reduced, only T1 and T2 run, placed on the running vCPU2 and vCPU3.]

Slide 27

Why does it work? Let's understand with an example...
VM workload: UA (input class B) from NPB3.4-OMP
  Unstructured computation
  Three major loops
  Implemented with a total of 38,768 internal barriers
Parallel application runtime library: libgomp from GCC-12
Host: 36 pCPUs, linux-kernel v6.11-rc4, Debian-testing
Guest: 36 vCPUs, linux-kernel v6.6.16, Debian-12

Slide 28

Spinning vs blocking
OMP_WAIT_POLICY=active: parallel worker threads spin upon reaching a barrier.
OMP_WAIT_POLICY=passive: parallel worker threads block upon reaching a barrier.
In an ideal world, spinning is faster than blocking.
  UA (Class B) spinning: 9.68 ± 0.04 seconds (1.80x)
  UA (Class B) blocking: 17.43 ± 0.09 seconds
But in the real world with vCPU preemptions, spinning slows down more than blocking.
  UA (Class B) spinning: 19.95 ± 0.5 seconds (0.48x)
  UA (Class B) blocking: 22.43 ± 0.09 seconds (0.78x)
The degradation in spinning performance increases with the number of preempted vCPUs.
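To reproduce the spinning-vs-blocking contrast on a smaller scale, a toy barrier-heavy OpenMP loop (not UA itself; the iteration counts are arbitrary) can be built with gcc -fopenmp and run once with OMP_WAIT_POLICY=active and once with OMP_WAIT_POLICY=passive.

#include <omp.h>
#include <stdio.h>

int main(void)
{
        double sum = 0.0;

        for (int iter = 0; iter < 10000; iter++) {
                /* Each parallel region ends with an implicit barrier, which is
                   where OMP_WAIT_POLICY decides between spinning and blocking. */
                #pragma omp parallel for reduction(+:sum)
                for (int i = 0; i < 1000; i++)
                        sum += i * 1e-6;
        }
        printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
        return 0;
}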

Slide 29

Spin with minimized number of preempted vCPUs
[Plot: trace_uaB_dop_in_the_guest_rw_from_ua — thread count (0–30) over the run, with two curves: all threads vs. running threads.]

Slide 30

OMP_DYNAMIC
If the environment variable is set to true, the OpenMP implementation may adjust the number of threads to use for executing parallel regions in order to optimize the use of system resources.
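A small check of this knob (an illustration, not part of the modified libgomp): with dynamic adjustment enabled, the requested thread count is only an upper bound, and the runtime may start fewer threads.

#include <omp.h>
#include <stdio.h>

int main(void)
{
        omp_set_dynamic(1);        /* same effect as OMP_DYNAMIC=true */
        omp_set_num_threads(36);   /* a request, not a guarantee */

        #pragma omp parallel
        {
                #pragma omp single
                printf("dynamic=%d, threads in this region=%d\n",
                       omp_get_dynamic(), omp_get_num_threads());
        }
        return 0;
}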

Slide 31

How well does it work?
UA (Class B) spinning: 19.95 ± 0.5 seconds
UA (Class B) blocking: 22.43 ± 0.09 seconds
UA (Class B) spinning with minimized number of preempted vCPUs: 13.02 ± 0.23 seconds (1.53x)

Slide 32

Changes in libgomp

libgomp-gcc-12/config/linux/proc.c:

unsigned
gomp_dynamic_max_threads (void)
{
  ...
  /* Ask the host, via the para-virtualized channel, how many vCPUs are
     currently preempted, and shrink the DoP accordingly.  */
  pv_sched_info = SYSCALL (get_pv_sched_info);
  if (is_valid (pv_sched_info))
    curr_dop = prev_dop - pv_sched_info;
  ...
  return curr_dop;
}

Slide 33

Changes in the host kernel

kernel/sched/core.c:

static inline void ttwu_do_wakeup (...)
{
  if (is_vcpu (p) && !preempted_vcpu (p))
    record_pvsched_sample (...);
}

static void __sched notrace __schedule (...)
{
  if (is_idle_task (next) || is_idle_task (prev))
    record_idle_sample (...);
  if (is_vcpu (prev) && preempted_vcpu (prev))
    record_pvsched_sample (...);
  else if (is_vcpu (next))
    record_pvsched_sample (...);
}

static void compute_pv_sched_info (...) { ... }

Slide 34

Prototype
[Diagram: in the guest, the application and OpenMP run in guest user space on top of the guest scheduler, with a PCI device exposed to guest kernel space; in host user space, QEMU provides that device via IVSHMEM backed by a memory backend; the host scheduler sits in host kernel space.]
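As a hedged sketch of the guest side of such a prototype (the PCI address and region size below are examples, and the actual prototype plugs into libgomp rather than a standalone tool): with an ivshmem-plain device, BAR 2 is the shared memory region, and a guest process can map it through sysfs to read the scheduling information published by the host.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Example PCI address of the ivshmem device inside the guest. */
#define IVSHMEM_BAR2 "/sys/bus/pci/devices/0000:00:04.0/resource2"
#define SHM_SIZE     (1UL << 20)   /* must match the host memory backend size */

int main(void)
{
        int fd = open(IVSHMEM_BAR2, O_RDWR | O_SYNC);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (shm == MAP_FAILED) {
                perror("mmap");
                close(fd);
                return 1;
        }

        /* The pv scheduling information written on the host side would be
           read from *shm here, in whatever layout the host writer uses. */

        munmap(shm, SHM_SIZE);
        close(fd);
        return 0;
}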

Slide 35

Conclusion

Slide 36

Conclusion
Para-virtualized solutions implementing cooperative scheduling are promising for addressing the semantic gap.
A common need when implementing these solutions is shared memory between the host and the guest schedulers.
A custom implementation of the shared memory for every solution is unproductive and redundant work.
It is about time to standardize the interface.

Questions & Feedback: [email protected]