Reading: IRS

Scheduler Activations for Interference-Resilient SMP Virtual Machine Scheduling
Yong Zhao1, Luwei Cheng1, Jia Rao1 (1The University of Texas at Arlington, 2Facebook)
Middleware ’17
Keio University Kono Laboratory, Daiki Wakabayashi

Double Scheduling in Virtualized Environments
▪ Cloud providers consolidate multiple VMs onto a single physical machine to improve hardware utilization
▪ Virtualized environments have two levels of scheduling
  ▪ Thread scheduling by the guest OS
  ▪ vCPU scheduling by the hypervisor
[Figure: threads, the vCPUs of VM1 and VM2, and pCPU1/pCPU2 of one physical machine; the thread-to-vCPU and vCPU-to-pCPU mappings are marked "?"]

Double Scheduling Problems
▪ The semantic gap between the guest OS and the hypervisor may lead to performance degradation
  ▪ Lock-holder preemption (LHP)
  ▪ Lock-waiter preemption (LWP)
[Figure, LHP: vCPU1 acquires a spinlock and is preempted by the hypervisor inside its critical section; vCPU2 fails to acquire the spinlock and keeps spinning]
[Figure, LWP: vCPU1 to vCPU4 contend for a spinlock with strict FIFO ordering; a waiter is preempted while other vCPUs spin]
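The cost of lock-waiter preemption under strict FIFO ordering can be illustrated with a small model. This is a minimal sketch, assuming a ticket-style spinlock in which the lock is always handed to the oldest waiter; the function name `spinning_vcpus` and the vCPU names are hypothetical and do not come from the paper.

```python
def spinning_vcpus(ticket_queue, preempted):
    """Return the running waiters that spin without making progress.

    ticket_queue: vCPU names in ticket (FIFO) order; the head is next
                  in line for the lock.
    preempted:    set of vCPUs currently descheduled by the hypervisor.
    """
    if not ticket_queue or ticket_queue[0] not in preempted:
        # The next waiter is running, so the lock can be handed over.
        return []
    # The head holds the next ticket but cannot run. Every running waiter
    # behind it burns CPU cycles until the hypervisor reschedules the head.
    return [v for v in ticket_queue[1:] if v not in preempted]


# Hypothetical example: vCPU2 is next in line but has been preempted.
stalled = spinning_vcpus(["vCPU2", "vCPU3", "vCPU4"], {"vCPU2"})
```

In this model the guest cannot hand the lock to vCPU3 or vCPU4 out of order, which is why a single preempted waiter stalls every vCPU queued behind it.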

Prior Works
▪ Prior works focus on hypervisor-level scheduling
▪ Hypervisor-level: Co-Scheduling, Relaxed Co-Scheduling [VMware ’10]
  ◆ Co-schedule all vCPUs of the same VM
  ◆ Expensive to implement and causes CPU fragmentation
▪ Guest OS-assisted: Delay Scheduling [Uhlig+, VM ’04]
  ◆ The guest OS notifies the hypervisor before acquiring a spinlock, and the hypervisor delays preemption to avoid LHP and LWP
  ◆ The hypervisor needs to deviate from its existing scheduling algorithm
▪ Hardware-assisted: Intel Pause-Loop Exiting (PLE) [Riel ’11]
  ◆ Detects excessive spinning and prevents a VM from wasting CPU cycles

Potential of Guest OS Load Balancing
▪ If the guest OS can balance load and schedule critical threads in a timely manner, any application can be made resilient to interference
[Figure: performance slowdown of parallel applications (PARSEC and NPB benchmarks); higher is better. Annotations: "suffer from lower CPU utilizations caused by frequent LHP and LWP", "user-level work stealing"]
[Figure: setup with VM2 vCPU1 to vCPU4 on pCPU1 to pCPU4 and VM1 vCPU1 on pCPU1. Labels: single-thread interfering program, 4-thread parallel program]

Proposal: Interference-Resilient Scheduling (IRS)
▪ Idea
  ▪ Before a vCPU is preempted, the guest OS migrates the critical thread on this vCPU to another running vCPU
▪ Motivation & Objective
  ▪ Inspired by Scheduler Activation [Anderson+, TOCS ’92]
  ▪ Minimize interference-induced idling and CPU waste

IRS Design
▪ Before Xen preempts a vCPU, it sends a notification to the guest OS (only when the target vCPU is involuntarily preempted and is willing to run) and delays the preemption
▪ Upon receiving the notification, the guest OS activates load balancing
  ▪ Deschedules the thread on the to-be-preempted vCPU
  ▪ Moves the thread to a sibling vCPU with the least waiting time
▪ After the thread migration, Xen finishes vCPU switching
▪ 20-26 μs delay (❶ + ❷ + ❸) << 30 ms timeslice in Xen
[Figure labels: "async", "Notify completion of CS"]
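The target selection and the size of the added delay can be checked with a few lines. This is only a sketch under stated assumptions: the 20-26 μs delay and the 30 ms timeslice come from the slide, while the function names, the dictionary of sibling vCPUs, and the waiting-time values are hypothetical and do not reflect the actual Xen or Linux code.

```python
TIMESLICE_US = 30_000    # 30 ms Xen timeslice, from the slide
SA_DELAY_US = (20, 26)   # reported range of the notification + migration delay


def pick_target(siblings):
    """Pick the running sibling vCPU with the least waiting time.

    siblings: dict mapping vCPU name -> waiting time in microseconds
              (hypothetical values supplied by the caller).
    """
    return min(siblings, key=siblings.get)


def overhead_fraction(delay_us, timeslice_us=TIMESLICE_US):
    """Share of one timeslice spent delaying the preemption."""
    return delay_us / timeslice_us


# Hypothetical example: vCPU1 is about to be preempted.
target = pick_target({"vCPU2": 120, "vCPU3": 40, "vCPU4": 75})
worst_case = overhead_fraction(max(SA_DELAY_US))
```

Even at the upper end of the reported range, the delay stays below 0.1% of a timeslice, which is the point of the "<<" comparison on the slide.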

Migrator
▪ Challenge
  ▪ Find the least loaded vCPU, considering contention on pCPUs
  ▪ Balance load and ensure cache locality when preempted vCPUs come back
▪ Approach
  ▪ Estimate vCPU load based on rt_avg, which considers steal time
  ▪ Allow the wakeup balancer to preempt the current task if it was migrated by IRS
[Figure: two timelines titled "Simple IRS Approach" and "IRS Approach", each showing vCPU1 and vCPU2 on pCPU0 and pCPU1 next to another VM, with thread1 and thread2 (thread2 blocked, critical sections marked). In both, vCPU1 is preempted and a thread is migrated by IRS. One timeline shows thread2 migrated by the wakeup balancer when it wakes up; the other shows thread1 migrated by the periodic balancer after thread2 wakes up, marked "cache polluted!!"]
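The idea of a steal-time-aware load estimate can be sketched as follows. This is an illustration only, assuming that a vCPU's usable capacity shrinks by the fraction of time stolen by the hypervisor; it is not the kernel's rt_avg computation, and the function names and example values are hypothetical.

```python
def effective_load(runnable_load, steal_fraction):
    """Scale a vCPU's runnable load by the CPU share it actually receives.

    runnable_load:  load contributed by runnable threads on the vCPU.
    steal_fraction: share of wall-clock time the hypervisor gave to other
                    vCPUs on the same pCPU (0 <= steal_fraction < 1).
    """
    if not 0.0 <= steal_fraction < 1.0:
        raise ValueError("steal_fraction must be in [0, 1)")
    return runnable_load / (1.0 - steal_fraction)


def least_loaded(vcpus):
    """vcpus: dict mapping vCPU name -> (runnable_load, steal_fraction)."""
    return min(vcpus, key=lambda name: effective_load(*vcpus[name]))


# Hypothetical example: vCPU1 has fewer threads but loses half its time.
choice = least_loaded({"vCPU1": (1.0, 0.5), "vCPU2": (1.5, 0.0)})
```

A balancer that ignored steal time would pick vCPU1 here because it has the smaller runnable load; accounting for contention on the pCPU reverses the decision.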

Implementation
▪ Use existing load balancing primitives and make minimal changes
  ▪ Xen: 30 LOC
  ▪ Linux kernel: 130 LOC
▪ Hypervisor-guest communication uses Xen’s event channel
  ▪ SA sender: sends the notification as a virtual interrupt (vIRQ)
  ▪ SA receiver: implemented as an interrupt handler

Evaluation
▪ Experimental Setup
  ▪ Dell PowerEdge T420 server
    ◆ Two six-core Intel Xeon E5-2410 1.9 GHz processors
    ◆ 32 GB memory
    ◆ One Gigabit network card
    ◆ 1 TB 7200 RPM SATA hard disk
    ◆ Linux kernel 3.18.4 (guest OS, dom0 OS)
    ◆ Hypervisor: Xen 4.5.0
  ▪ VM
    ◆ Interfering VM: runs a CPU-intensive micro-benchmark, PARSEC and NPB benchmarks
    ◆ Measurement VM: runs parallel and multi-threaded workloads
    ◆ 4 vCPUs
    ◆ 4 GB memory

Evaluation
▪ Workloads
  ▪ PARSEC (blocking sync)
  ▪ NAS Parallel Benchmarks (NPB) (spinning sync)
  ▪ SPECjbb2005, Apache HTTP server benchmark
▪ Scheduling strategies
  ▪ IRS
  ▪ Vanilla Xen 4.5.0
  ▪ VMware relaxed co-scheduling (Relaxed-Co)
  ▪ Intel Pause-Loop Exiting (PLE)
▪ Experiments
  ▪ Controlled experiments: vCPUs pinned to pCPUs, increasing interference
  ▪ Realistic experiments: vCPUs free to run on any pCPU
[Figure: pinned setup in which each of pCPU1 to pCPU4 is shared by one vCPU of VM1 and one vCPU of VM2]
[Annotation: more general multi-threaded programs with little or no synchronization]

Parallel Performance (blocking)
▪ IRS outperformed vanilla Xen, co-scheduling and PLE
▪ Performance improvement decreased as the level of interference increased
[Figure: improvement on PARSEC performance; higher is better. Annotations:]
  ▪ A vCPU onto which a thread was migrated can itself be preempted soon
  ▪ When only a few vCPUs were under interference, IRS was able to migrate threads onto vCPUs without interference
  ▪ Pipeline parallelism leaves little room for performance improvement
  ▪ PLE does not work because of the short spinning period

Parallel Performance (spinning)
▪ IRS attained a higher performance improvement over the baseline
▪ PLE and Relaxed-Co were more effective for spinning workloads than for blocking workloads
▪ IRS was unable to find any idle vCPUs to migrate threads to. However, IRS makes the scheduling happen much sooner
[Figure: improvement on NPB performance; higher is better]

Multi-threaded Performance
▪ IRS improves both throughput and request latency
▪ As ab had a large number of threads, improvement on a few threads did not contribute to the overall performance
[Figure: improvement on server throughput and latency (IRS vs vanilla Xen/Linux); higher is better]

System Fairness and Efficiency
▪ IRS improves the system-wide speedup across all workloads by 22% on average
▪ The gain in system weighted speedup is mainly due to the performance improvement in foreground applications
[Figure: weighted speedup of two PARSEC applications (blocking); higher is better]
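The slide does not define weighted speedup. A common definition sums, over the co-running applications, the ratio of each application's stand-alone runtime to its runtime under co-location. The sketch below assumes that definition; the function name and the runtimes are hypothetical and are not measurements from the paper.

```python
def weighted_speedup(alone_times, shared_times):
    """Sum of per-application speedups relative to running alone.

    alone_times:  runtimes of each application when it runs by itself.
    shared_times: runtimes of the same applications when co-located.
    """
    if len(alone_times) != len(shared_times):
        raise ValueError("need one shared runtime per application")
    return sum(alone / shared for alone, shared in zip(alone_times, shared_times))


# Hypothetical example: two applications that each run twice as slowly
# when co-located contribute 0.5 each.
ws = weighted_speedup([10.0, 20.0], [20.0, 40.0])
```

Under this definition, two co-located applications score 2.0 if neither slows down, so a scheduler raises the metric by shortening the co-located runtime of any one of them.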

Scalability and Sensitivity Analysis
▪ Performance gain diminishes as the number of vCPUs under interference increases
▪ IRS can be useful in a highly consolidated scenario
[Figure: trend of IRS performance improvement with a varying degree of interference; x-axis: # of interfering VMs; higher is better]

CPU Stacking
▪ Occurs when parallel workloads with frequent blocking are co-located with CPU-intensive applications
▪ Since blocked threads do not consume any CPU cycles, they exhibit deceptive idleness to the vCPU scheduler
[Figure: vCPUs of VM1 and VM2 placed on pCPU1 to pCPU3, before and after the hypervisor balances vCPU execution time, which leads to severe performance degradation. Labels: (single-thread CPU-intensive program) x 2, 3-thread blocking parallel program]

Mitigating CPU Stacking
▪ IRS greatly mitigates CPU stacking
▪ Co-scheduling and PLE incurred more serious CPU stacking than vanilla Xen because of deceptive idleness
[Figure: PARSEC performance in response to CPU stacking (all vCPUs are unpinned); higher is better]

Conclusion
▪ Interference-Resilient Scheduling (IRS): a coordinated approach that bridges the guest-hypervisor semantic gap on the guest OS side
  ▪ Inspired by Scheduler Activation [Anderson+, TOCS ’92]
  ▪ Enhances guest OS load balancing to make any parallel application resilient to interference
  ▪ Mitigates the LHP and LWP problems
  ▪ Alleviates the CPU stacking problem
  ▪ Outperforms PLE and relaxed co-scheduling
