Dual level of task scheduling for VM workloads

When a multi-threaded application runs inside a virtual machine, it experiences a dual level of task scheduling. The guest OS’s task scheduler decides how to place application threads on the vCPUs, and the host OS’s task scheduler decides how to place these vCPU threads on the pCPUs. While the guest is aware of the application threads’ activities inside the VM, it is often oblivious to the status of its vCPUs on the host. Conversely, the host is aware of the status of the vCPUs, but it is oblivious to the application threads’ activities inside the VM. Thus, neither the guest nor the host has the complete information needed to make optimal task placement decisions across both levels. This leads to the well-known semantic gap between the host and the guest task schedulers. Many existing academic and in-kernel solutions help partially by targeting specific issues stemming from the semantic gap. More recently, the effort to standardize a paravirt scheduling interface [1] may be bringing us closer to a generic and complete solution.

In this talk, we will take a deep dive into the issues stemming from the semantic gap. We will review the paravirt scheduling proposal as well as some noteworthy partial solutions. Finally, we will learn about the semantic-gap-related research at Whisper that builds upon the idea of paravirt scheduling and proposes a new use case for it.

[1] [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management) – https://lore.kernel.org/kvm/[email protected]/T/

Himadri CHHAYA-SHAILESH

Kernel Recipes

September 26, 2024

Transcript

  1. Dual Level of Task Scheduling for VM Workloads: Pain Points & Solutions

    24 September 2024. Himadri CHHAYA-SHAILESH, PhD Student @ Whisper, Inria Paris. Supervised by: Jean-Pierre LOZI & Julia LAWALL.
  2. Outline

    1 Context
    2 Pain points
    3 Existing partial solutions
    4 Towards a complete solution
    5 A step further → user space
    6 Conclusion
  3. Dual level of task scheduling for VM workloads

    [Diagram: application Threads are placed on vCPUs by the Guest Scheduler; vCPU threads are placed on pCPUs by the Host Scheduler.]
  4. The semantic gap

    The guest scheduler knows about the thread’s activities inside the VM, but does not know about the corresponding vCPU’s status on the host. The host scheduler knows about the vCPU’s status, but does not know about the corresponding thread’s activities inside the VM. The schedulers do not share the missing information with each other, and make task placement decisions based on the partial information available to them.
  5. Root cause: ill-timed vCPU preemptions

    A vCPU is a regular process on the host, so the host scheduler can preempt a vCPU at any time. If a vCPU is preempted while the corresponding thread is performing a critical operation inside the VM, performance can be impacted severely. Such ill-timed vCPU preemptions are more likely to occur when the host is oversubscribed.
  6. Consequences of ill-timed vCPU preemptions¹

    Lock holder preemption problem
    Lock waiter preemption problem
    Blocked-waiter wake up problem
    Readers preemption problem
    RCU reader preemption problem
    Interrupt context preemption problem
    (A sketch of the lock holder preemption problem follows below.)

    ¹ Nicely described, with existing solutions, in "Scaling Guest OS Critical Sections with eCS" by S. Kashyap et al.
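
    To make the first pain point above concrete, here is a minimal user-space sketch of the lock holder preemption problem (not from the talk; it assumes a simple test-and-set spinlock): if the host preempts the vCPU on which the lock holder runs, every waiter keeps burning its own vCPU’s time slice without making progress.

      /* lock_holder_preemption.c: illustration only, not from the talk.
       * Build: gcc -O2 -pthread lock_holder_preemption.c */
      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
      static long counter;

      static void spin_lock(void)
      {
          /* Waiters spin here. On bare metal the holder releases quickly; in a
           * VM, if the host preempts the vCPU running the holder, the waiters
           * spin for a whole host scheduling quantum without making progress. */
          while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
              ;
      }

      static void spin_unlock(void)
      {
          atomic_flag_clear_explicit(&lock_flag, memory_order_release);
      }

      static void *worker(void *arg)
      {
          (void)arg;
          for (int i = 0; i < 1000000; i++) {
              spin_lock();
              counter++;              /* short critical section */
              spin_unlock();
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t t[4];
          for (int i = 0; i < 4; i++)
              pthread_create(&t[i], NULL, worker, NULL);
          for (int i = 0; i < 4; i++)
              pthread_join(t[i], NULL);
          printf("counter = %ld\n", counter);
          return 0;
      }
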
  7. Exposing hardware information to the guest scheduler

    We want to achieve bare-metal-like performance for VM workloads. The scheduler is continuously optimized for bare-metal performance by utilizing hardware information. VM workloads cannot take full advantage of such optimizations unless the hardware information is also exposed to the guest scheduler.
  8. Dedicated resource partitioning + vCPU pinning

    Ensures that a vCPU is able to run on a dedicated pCPU whenever the guest scheduler schedules a thread on that vCPU.
    Expensive for the customers.
    Unappealing for the cloud provider because of the lack of resource consolidation.
    The issue of exposing hardware information to the guest scheduler still remains unsolved.
    (A minimal pinning sketch follows below.)
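
    In practice vCPU pinning is configured through the VMM or libvirt (e.g. a <vcpupin> element in the domain XML); purely to illustrate the mechanism, the hypothetical snippet below pins the calling thread, standing in for one vCPU thread, onto a dedicated pCPU with pthread_setaffinity_np().

      /* pin_vcpu_thread.c: illustration only; real deployments pin vCPUs via
       * the VMM or libvirt rather than hand-written code. */
      #define _GNU_SOURCE
      #include <pthread.h>
      #include <sched.h>
      #include <stdio.h>
      #include <string.h>

      /* Pin the calling thread (imagine it is a vCPU thread) to one pCPU. */
      static int pin_self_to_pcpu(int pcpu)
      {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(pcpu, &set);
          return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
      }

      int main(void)
      {
          int err = pin_self_to_pcpu(3);   /* pCPU 3 chosen arbitrarily */
          if (err != 0) {
              fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
              return 1;
          }
          printf("pinned to pCPU 3\n");
          return 0;
      }
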
  9. Steal time

    The hypervisor passes to the guest scheduler information about how much time was spent running processes other than the vCPUs of a VM. (A sketch of reading steal time from the guest follows below.)

    arch/x86/include/uapi/asm/kvm_para.h
    ---
      struct kvm_steal_time {
          __u64 steal;          /* time the vCPU could not run, in nanoseconds */
          __u32 version;
          __u32 flags;
          __u8  preempted;      /* non-zero while the vCPU is preempted on the host */
          __u8  u8_pad[3];
          __u32 pad[11];
      };
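
    The guest kernel also folds this value into /proc/stat, where steal is the eighth field of the cpu lines (in USER_HZ ticks); a minimal way to watch it from guest user space, independent of the in-kernel kvm_steal_time plumbing, is:

      /* read_steal.c: print the aggregate steal ticks reported by the guest. */
      #include <stdio.h>

      int main(void)
      {
          FILE *f = fopen("/proc/stat", "r");
          if (!f)
              return 1;

          unsigned long long usr, nice, sys, idle, iowait, irq, softirq, steal;
          /* First line aggregates all CPUs:
           * "cpu user nice system idle iowait irq softirq steal guest guest_nice" */
          if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                     &usr, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal) == 8)
              printf("steal = %llu ticks\n", steal);

          fclose(f);
          return 0;
      }
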
  10. Forced VM-exit upon exceeding PLE limit

    Spin loops can be implemented using the PAUSE instruction.
    Hardware can detect if a vCPU has been spinning excessively.
  11. Forced VM-exit upon exceeding PLE limit

    Spin loops can be implemented using the PAUSE instruction.
    Hardware can detect if a vCPU has been spinning excessively.
    Tunable parameter PLE_gap is used to determine if the interval between two consecutive PAUSE instructions is too short.
  12. Forced VM-exit upon exceeding PLE limit

    Spin loops can be implemented using the PAUSE instruction.
    Hardware can detect if a vCPU has been spinning excessively.
    Tunable parameter PLE_gap is used to determine if the interval between two consecutive PAUSE instructions is too short.
    Tunable parameter PLE_window determines for how long a vCPU can spin before the hypervisor intervenes.
  13. Forced VM-exit upon exceeding PLE limit

    Spin loops can be implemented using the PAUSE instruction.
    Hardware can detect if a vCPU has been spinning excessively.
    Tunable parameter PLE_gap is used to determine if the interval between two consecutive PAUSE instructions is too short.
    Tunable parameter PLE_window determines for how long a vCPU can spin before the hypervisor intervenes.
    If the vCPU spinning exceeds the limit, then a VM exit is forced for the spinning vCPU.
  14. Forced VM-exit upon exceeding PLE limit

    Spin loops can be implemented using the PAUSE instruction.
    Hardware can detect if a vCPU has been spinning excessively.
    Tunable parameter PLE_gap is used to determine if the interval between two consecutive PAUSE instructions is too short.
    Tunable parameter PLE_window determines for how long a vCPU can spin before the hypervisor intervenes.
    If the vCPU spinning exceeds the limit, then a VM exit is forced for the spinning vCPU.
    The hypervisor then schedules one of the candidate vCPUs that can potentially free the resource for the spinning vCPU.
    (A sketch of a PAUSE-based spin loop follows below.)
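
    For reference, this is the kind of guest-side spin loop that Pause-Loop Exiting is built to detect; the sketch (not from the talk) uses the x86 PAUSE intrinsic.

      /* pause_spin.c: a PAUSE-based spin-wait loop (x86); illustration only. */
      #include <stdatomic.h>
      #include <immintrin.h>   /* _mm_pause() emits the PAUSE instruction */

      /* Spin until *flag becomes non-zero. If the vCPU keeps issuing PAUSEs
       * with gaps shorter than PLE_gap for longer than PLE_window, the
       * hardware forces a VM exit so the hypervisor can intervene. */
      static void spin_wait(atomic_int *flag)
      {
          while (atomic_load_explicit(flag, memory_order_acquire) == 0)
              _mm_pause();
      }

      static atomic_int ready = 1;   /* pre-set so this demo returns immediately */

      int main(void)
      {
          spin_wait(&ready);
          return 0;
      }
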
  15. CPS from ASPLOS’23

    Title: A Cooperative Para-virtualized Scheduling Framework for Manycore Machines
    Authors: Yuxuan Liu, Tianqiang Xu, Zeyu Mi, Zhichao Hua, Binyu Zang, and Haibo Chen
  16. Dynamic vcpu priority management from LKML’24

    [RFC PATCH 0/8] Dynamic vcpu priority management in kvm
    From: Vineeth Pillai and Joel Fernandes
    Paravirt Scheduling, v1: [Diagram: Host Userland (VMM, vcpu1–vcpu4 threads), Host Kernel (KVM, Scheduler), Guest Userland, Guest Kernel; handshake protocol negotiation via hypercall/MSR.]
  17. Dynamic vcpu priority management from LKML’24

    [RFC PATCH 0/8] Dynamic vcpu priority management in kvm
    From: Vineeth Pillai and Joel Fernandes
    Paravirt Scheduling, v2: [Diagram: as in v1, with a kernel module / BPF program added on the host kernel side next to KVM and the scheduler; handshake protocol negotiation via hypercall/MSR.]
  18. Dynamic vcpu priority management from LKML’24

    [RFC PATCH 0/8] Dynamic vcpu priority management in kvm
    From: Vineeth Pillai and Joel Fernandes
    Paravirt Scheduling, v3: [Diagram: the guest kernel gains a pvsched driver / BPF program; host userland shows the VMM main thread, a pvsched-device process, other device processes and the vcpu1–vcpu4 threads; the host kernel shows KVM, a kernel module / BPF program and the scheduler.]
  19. Guest parallel applications on oversubscribed hosts

    Applications achieve parallelization with the help of parallel application runtime libraries, e.g. OpenMP, Open MPI, etc.
    Parallel application runtime libraries determine the Degree of Parallelization (DoP) by referring to the number of cores on the machine, i.e. the number of vCPUs for a VM.
    But vCPUs get preempted on the host, and we end up using an incorrect DoP for parallel applications running inside VMs.
    (An OpenMP example follows below.)
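
    As a concrete illustration (not from the talk), an OpenMP program that does not set a thread count explicitly gets a default team size derived from the number of CPUs the runtime sees, i.e. the number of vCPUs inside a VM, regardless of how many of them currently run on a pCPU.

      /* default_dop.c: build with gcc -fopenmp default_dop.c */
      #include <omp.h>
      #include <stdio.h>

      int main(void)
      {
          /* Inside a VM, omp_get_num_procs() reports the number of vCPUs. */
          printf("procs visible to the runtime: %d\n", omp_get_num_procs());
          printf("default max team size:        %d\n", omp_get_max_threads());

          #pragma omp parallel
          {
              #pragma omp single
              printf("threads in this parallel region: %d\n", omp_get_num_threads());
          }
          return 0;
      }
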
  20. Para-virtualized guest parallel application runtimes

    Problem: parallel application runtime libraries are oblivious to the phenomenon of vCPU preemption.
    Impact: suboptimal performance for parallel applications running inside VMs.
    Solution:
    1 Aggregate information about vCPU preemptions on the host.
    2 Use this information inside the guest parallel application runtime libraries and dynamically adjust the DoP for the guest parallel applications.
    (A sketch of the adjustment idea follows below.)
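
    A minimal sketch of the adjustment idea, with the interface deliberately left abstract: read_preempted_vcpus() is a hypothetical stand-in for whatever channel (hypercall, paravirt device, shared memory) delivers the host’s preemption information to the guest, and is stubbed out here.

      /* adjust_dop.c: cap the DoP by the number of currently preempted vCPUs.
       * read_preempted_vcpus() is hypothetical; a real prototype would obtain
       * this value from host/guest shared memory. */
      #include <omp.h>
      #include <stdio.h>

      static int read_preempted_vcpus(void)
      {
          return 0;   /* stub: replace with the real pv_sched_info source */
      }

      static void work(int tid)
      {
          printf("worker %d\n", tid);
      }

      int main(void)
      {
          int nvcpus    = omp_get_num_procs();
          int preempted = read_preempted_vcpus();
          int dop       = nvcpus - preempted;
          if (dop < 1)
              dop = 1;

          /* Spawn only as many workers as there are vCPUs likely to run. */
          #pragma omp parallel num_threads(dop)
          work(omp_get_thread_num());

          return 0;
      }
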
  21. How does it work?

    [Before/after diagram: the guest has four vCPUs backed by two pCPUs; initially threads T1–T4 are spread over all four vCPUs even though vCPU1 and vCPU4 are preempted on the host; after adjusting the DoP, only T1 and T2 run, placed on the running vCPUs vCPU2 and vCPU3.]
  22. Why does it work? Let’s understand with an example...

    VM workload: UA (input class B) from NPB3.4-OMP. Unstructured computation, three major loops, implemented with a total of 38,768 internal barriers.
    Parallel application runtime library: libgomp from GCC-12.
    Host: 36 pCPUs, linux-kernel v6.11-rc4, Debian-testing.
    Guest: 36 vCPUs, linux-kernel v6.6.16, Debian-12.
  23. Spinning vs blocking

    OMP_WAIT_POLICY=active: parallel worker threads spin upon reaching a barrier.
    OMP_WAIT_POLICY=passive: parallel worker threads block upon reaching a barrier.
    In an ideal world, spinning is faster than blocking.
      UA (Class B) spinning: 9.68 ± 0.04 seconds (1.80x)
      UA (Class B) blocking: 17.43 ± 0.09 seconds
    But in the real world with vCPU preemptions, spinning slows down more than blocking.
      UA (Class B) spinning: 19.95 ± 0.5 seconds (0.48x)
      UA (Class B) blocking: 22.43 ± 0.09 seconds (0.78x)
    Degradation in spinning performance increases with the number of preempted vCPUs.
  24. Spin with minimized number of preempted vCPUs

    [Plot: trace_uaB_dop_in_the_guest_rw_from_ua; number of threads (0–30) over time (0–10); series: all threads vs. running threads.]
  25. OMP_DYNAMIC

    If the environment variable is set to true, the OpenMP implementation may adjust the number of threads to use for executing parallel regions in order to optimize the use of system resources. (A usage sketch follows below.)
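
    With dynamic teams enabled (OMP_DYNAMIC=true or omp_set_dynamic()), libgomp consults gomp_dynamic_max_threads() when sizing a parallel region, which is the function the modification shown on slide 27 targets; a small example showing that the delivered team size may then be smaller than the requested one:

      /* omp_dynamic_demo.c: build with gcc -fopenmp; or run with OMP_DYNAMIC=true */
      #include <omp.h>
      #include <stdio.h>

      int main(void)
      {
          omp_set_dynamic(1);   /* same effect as OMP_DYNAMIC=true */

          /* Request a large team; with dynamic adjustment enabled the runtime
           * is free to deliver fewer threads than requested. */
          #pragma omp parallel num_threads(64)
          {
              #pragma omp single
              printf("requested 64, got %d threads\n", omp_get_num_threads());
          }
          return 0;
      }
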
  26. How well does it work?

    UA (Class B) spinning: 19.95 ± 0.5 seconds
    UA (Class B) blocking: 22.43 ± 0.09 seconds
    UA (Class B) spinning with minimized number of preempted vCPUs: 13.02 ± 0.23 seconds (1.53x)
  27. Changes in libgomp

    libgomp-gcc-12/config/linux/proc.c
    ---
      /* Sketch from the slide: gomp_dynamic_max_threads() is extended to
       * subtract the host-provided pv_sched_info value (e.g. the number of
       * preempted vCPUs) from the previously used DoP. */
      unsigned
      gomp_dynamic_max_threads (void)
      {
        ...
        pv_sched_info = SYSCALL (get_pv_sched_info);
        if (is_valid (pv_sched_info))
          curr_dop = prev_dop - pv_sched_info;
        ...
        return curr_dop;
      }
  28. Changes in the host kernel

    kernel/sched/core.c
    ---
      /* Sketch from the slide: hook the wakeup and context-switch paths to
       * record samples about vCPU preemption and idleness, from which
       * pv_sched_info is computed. */
      static inline void ttwu_do_wakeup (...)
      {
          if (is_vcpu(p) && !preempted_vcpu(p))
              record_pvsched_sample(...);
      }

      static void __sched notrace __schedule (...)
      {
          if (is_idle_task(next) || is_idle_task(prev))
              record_idle_sample(...);
          if (is_vcpu(prev) && preempted_vcpu(prev))
              record_pvsched_sample(...);
          else if (is_vcpu(next))
              record_pvsched_sample(...);
      }

      static void compute_pv_sched_info (...) { ... }
  29. Prototype

    [Architecture diagram: in guest user space, the application and OpenMP run above the guest scheduler (guest kernel space); the guest sees a PCI device backed by QEMU’s IVSHMEM shared-memory backend in host user space; the host scheduler sits in host kernel space.]
    (A sketch of mapping the shared memory from the guest follows below.)
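
    For concreteness, a common way to reach an ivshmem-plain device from guest user space is to mmap BAR2 of the PCI device through sysfs; the sketch below is a generic ivshmem illustration under that assumption, not the prototype’s actual code, and the device path must be adapted to the guest’s PCI topology.

      /* ivshmem_map.c: map the shared-memory BAR of an ivshmem PCI device
       * from guest user space. Illustration only; adjust the sysfs path. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void)
      {
          /* BAR2 of ivshmem is the shared memory region exported by the host. */
          const char *bar2 = "/sys/bus/pci/devices/0000:00:04.0/resource2";
          size_t size = 1 << 20;          /* must match the host-side backend size */

          int fd = open(bar2, O_RDWR | O_SYNC);
          if (fd < 0) {
              perror("open");
              return 1;
          }

          void *shm = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (shm == MAP_FAILED) {
              perror("mmap");
              close(fd);
              return 1;
          }

          /* The host side can publish pv_sched_info-style samples here and the
           * guest runtime can read them. */
          printf("first byte of shared memory: %u\n", *(volatile unsigned char *)shm);

          munmap(shm, size);
          close(fd);
          return 0;
      }
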
  30. Conclusion

    Para-virtualized solutions implementing co-operative scheduling are promising for addressing the semantic gap.
    A common need when implementing these solutions is shared memory between the host and the guest schedulers.
    A custom implementation of the shared memory for every solution is unproductive and redundant work.
    It is about time to standardize the interface.

    Questions & feedback: [email protected]