Reading: Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription

wkb8s

October 17, 2025

Transcript

  1. Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription
     Hang Huang (1), Jia Rao (2), Song Wu (1), Hai Jin (1), Hong Jiang (2), Hao Che (2), Xiaofeng Wu (2)
     (1) National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology
     (2) The University of Texas at Arlington
     HPDC '21
     Presented by Daiki Wakabayashi, Kono Laboratory, Keio University
  2. Thread oversubscription
     ▪ Technique to prepare for core scaling
     ▪ Run applications with more threads than cores in advance
     ▪ Used for applications that cannot dynamically change the # of threads
     [Figure: thread0-2 time-share CPU0; once CPU1 and CPU2 become available, the same threads spread across all three cores]
     ✔ Fully utilizes the entire CPU even after scaling up
  3. Performance degradation due to oversubscription
     ▪ Oversubscription is still inefficient for a large number of applications
     ▪ Benchmarks suffer as much as a 25x slowdown
     [Figure: slowdown per benchmark, lower is better; 8T = 8 threads (oversubscription ratio 1), 32T = 32 threads (oversubscription ratio 4)]
     Experimental setup:
        CPU: Intel Xeon 2.10 GHz processors
        Cores: 8
        Memory: 128 GB
        OS: Ubuntu 16.04 64-bit
        Kernel: 5.1.12
        Benchmarks: PARSEC 3.0, SPLASH-2, NAS parallel benchmarks
  4. Generally considered causes of degradation
     ▪ 1. Overhead due to frequent context switching (✖ switching cost)
     ▪ 2. Loss of locality (✖ cache pollution)
     [Figure: without oversubscription, thread0 runs alone on CPU0; with oversubscription, thread0-2 share CPU0]
  5. Goal
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection
  6. Goal
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection
  7. Overhead of context switch
     ▪ Setup
        ▪ Same as the previous setup
        ▪ Only 1 core is used (probably)
     ▪ Micro-benchmark (see the sketch below)
        ▪ Threads are configured to yield the CPU after they finish the minimum time slice
          ◆ Time slice: 750 us (== sched_min_granularity)
        ▪ No data access
        ▪ Two options
          ◆ (a) Pure computation
          ◆ (b) Computation with synchronization
             – Update a shared variable with __sync_fetch_and_add
             – Incurs heavy cache-coherence traffic on multiple cores
     [Figure, case (b): a thread on CPU0 updates the shared variable, which invalidates the corresponding cache line on CPU1 and CPU2]
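     A minimal user-space sketch of this micro-benchmark in C (my reconstruction, not the authors' code): the thread count, iteration count, and the amount of dummy work per slice are illustrative stand-ins, and the yield is issued once per iteration rather than exactly at the 750 us boundary.

     /* Case (b): each thread does a burst of computation, updates a shared
      * counter with __sync_fetch_and_add, then yields the CPU so the next
      * runnable thread is scheduled. Dropping the atomic update gives case (a). */
     #include <pthread.h>
     #include <sched.h>
     #include <stdio.h>

     #define NTHREADS 4
     #define ITERS    10000
     #define WORK     10000          /* dummy work per scheduling slice (illustrative) */

     static volatile long shared;    /* contended cache line in case (b) */

     static void *worker(void *arg)
     {
         volatile long sink = 0;
         for (long i = 0; i < ITERS; i++) {
             for (long j = 0; j < WORK; j++)      /* (a) pure computation        */
                 sink += j;
             __sync_fetch_and_add(&shared, 1);    /* (b) cache-coherence traffic */
             sched_yield();                       /* force a context switch      */
         }
         return NULL;
     }

     int main(void)
     {
         pthread_t t[NTHREADS];
         for (int i = 0; i < NTHREADS; i++)
             pthread_create(&t[i], NULL, worker, NULL);
         for (int i = 0; i < NTHREADS; i++)
             pthread_join(t[i], NULL);
         printf("shared = %ld\n", shared);
         return 0;
     }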
  8. Overhead of context switch
     ▪ Oversubscription does not add additional overhead
     ▪ Per-context-switch cost is relatively stable at 1.5 us
     ▪ Cost of cache-coherence traffic is negligible
     [Figure: execution time normalized to the 1-thread case, lower is better]
  9. Effect of oversubscription on cache performance
     ▪ Setup
        ▪ Same as the previous setup
        ▪ Only 1 core is used
        ▪ 1 or 2 threads
     ▪ Micro-benchmark (see the sketch below)
        ▪ Each thread traverses a sub-array repeatedly
          ◆ The total array size is varied while the size of each sub-array is fixed
        ▪ Four access patterns
          ◆ (a) sequential x read, (b) sequential x read-modify-write, (c) random x read, (d) random x read-modify-write
     [Figure: the total array divided into subarray0-subarray3; thread1 and thread2 alternate on the core (t1, t2, t1, t2, ...), each traversing its own sub-array; (a), (b) sequential, (c), (d) random]
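     A minimal sketch of the traversal micro-benchmark (my reconstruction, not the authors' code): the random and rmw flags select one of the four access patterns, and the sub-array size, pass count, and seed are illustrative values.

     #include <pthread.h>
     #include <stdlib.h>

     #define NTHREADS  2
     #define SUB_ELEMS (1 << 20)          /* elements per sub-array (fixed)     */

     static long *array;                  /* total array = NTHREADS sub-arrays  */

     struct arg { long *sub; int random; int rmw; };

     static void *traverse(void *p)
     {
         struct arg *a = p;
         unsigned seed = 1234;
         volatile long sink = 0;
         for (int pass = 0; pass < 100; pass++) {
             for (long i = 0; i < SUB_ELEMS; i++) {
                 long idx = a->random ? rand_r(&seed) % SUB_ELEMS : i;
                 if (a->rmw)
                     a->sub[idx]++;       /* read-modify-write */
                 else
                     sink += a->sub[idx]; /* read-only         */
             }
         }
         return NULL;
     }

     int main(void)
     {
         array = calloc((size_t)NTHREADS * SUB_ELEMS, sizeof(long));
         pthread_t t[NTHREADS];
         struct arg args[NTHREADS];
         for (int i = 0; i < NTHREADS; i++) {
             /* pattern (b): sequential x read-modify-write; flip the flags for (a), (c), (d) */
             args[i] = (struct arg){ array + (long)i * SUB_ELEMS, 0, 1 };
             pthread_create(&t[i], NULL, traverse, &args[i]);
         }
         for (int i = 0; i < NTHREADS; i++)
             pthread_join(t[i], NULL);
         free(array);
         return 0;
     }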
  10. Effect of oversubscription on cache performance
     ▪ Oversubscription can sometimes improve TLB efficiency
     ▪ Cache/TLB impact does not necessarily worsen
     [Figure: t_over / t_serial vs. total array size, lower is better; t_over = execution time with 2 threads, t_serial = execution time with 1 thread. Annotations: out of L2 cache; 1 ms: alternating access makes prefetching less effective (< 17.5 ms: threads traverse their own sub-arrays)]
  11. Effect of oversubscription on cache performance
     ▪ Oversubscription can sometimes improve TLB efficiency
     ▪ Cache/TLB impact does not necessarily worsen
     [Figure: t_over / t_serial vs. total array size, lower is better; t_over = execution time with 2 threads, t_serial = execution time with 1 thread]
     ▪ Out of L2 cache: with 1 thread the L2 cache can still be partially reused, while with 2 threads it is completely flushed
     ▪ TLB entries are easy to re-use within a scheduling period because each thread works on its own sub-array
  12. Goal
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection
  13. Root cause of inefficiency with oversubscription
     ▪ Inefficiencies stem from thread synchronization
     ▪ 1. Blocking synchronization
       ◆ (A) Complex wakeup process (a sketch of these steps follows below)
          – ① Select the most idle core to wake the thread up on
          – ② Insert the awakening thread into that core's runqueue
          – ③ Check whether the waking thread should preempt the currently running thread
       ◆ (B) Fluctuating load and unnecessary migrations
     ▪ 2. Busy-waiting synchronization
     [Figure: thread A is being awakened; CPU0 runs thread B with 2 threads in its runqueue, CPU1 runs thread C with 4 threads in its runqueue; steps ① and ② pick CPU0 and enqueue thread A there, step ③ compares the priorities of threads A and B]
     ▪ This overhead is significant under oversubscription
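     A simplified user-space model of the three steps above (roughly what Linux's try_to_wake_up() path does in v5.1: pick a CPU, enqueue, then check preemption). All structures and values here are illustrative stand-ins; the real kernel additionally takes runqueue locks and walks the scheduling domains, which is where the cost grows under oversubscription.

     #include <stdbool.h>
     #include <stdio.h>

     #define NCPUS 2

     struct task { int prio; bool runnable; };
     struct rq   { struct task *curr; int nr_running; };

     static struct rq rqs[NCPUS];

     /* Step ①: pick the least loaded CPU for the waking task. */
     static int select_cpu(void)
     {
         int best = 0;
         for (int c = 1; c < NCPUS; c++)
             if (rqs[c].nr_running < rqs[best].nr_running)
                 best = c;
         return best;
     }

     static void wake_up_task(struct task *t)
     {
         int cpu = select_cpu();              /* ① choose the target CPU          */
         t->runnable = true;
         rqs[cpu].nr_running++;               /* ② insert into its runqueue       */
         if (t->prio < rqs[cpu].curr->prio)   /* ③ preempt if higher priority     */
             printf("preempt CPU%d\n", cpu);  /*    (lower value = higher prio)   */
     }

     int main(void)
     {
         struct task running = { .prio = 120, .runnable = true };
         struct task waker   = { .prio = 100, .runnable = false };
         for (int c = 0; c < NCPUS; c++)
             rqs[c] = (struct rq){ .curr = &running, .nr_running = 1 };
         wake_up_task(&waker);
         return 0;
     }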
  14. Root cause of inefficiency with oversubscription
     ▪ Inefficiencies stem from thread synchronization
     ▪ 1. Blocking synchronization
       ◆ (A) Complex wakeup process
       ◆ (B) Fluctuating load and unnecessary migrations
          – Frequent switching between "runnable" and "sleep" makes runqueue lengths fluctuate and may trigger excessive, unnecessary migrations
     ▪ 2. Busy-waiting synchronization
     [Figure: thread0 and thread1 bouncing between Runqueue0 and Sleepqueue1]
  15. Root cause of inefficiency with oversubscription
     ▪ Inefficiencies stem from thread synchronization
     ▪ 1. Blocking synchronization
     ▪ 2. Busy-waiting synchronization (see the sketch below)
       ◆ Provides fast lock acquisition at the cost of wasting CPU cycles on spinning
       ◆ Oversubscription exacerbates the lock-holder preemption (LHP) problem
     [Figure: thread2 holds the lock while thread0, thread1, and thread3 are waiters sharing CPU0; the waiters cannot stop spinning until the lock-holder thread is scheduled]
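     A minimal test-and-set spinlock sketch (illustrative, using GCC builtins rather than any particular library): if the holder is preempted, every waiter burns its whole time slice in this loop, and because the loop contains no PAUSE instruction it is exactly the kind of spinning that the hardware detectors discussed on the next slide miss.

     /* Illustrative busy-waiting lock: waiters spin until the holder releases. */
     static volatile int lock;

     static void spin_lock(volatile int *l)
     {
         while (__sync_lock_test_and_set(l, 1))
             ;   /* wastes CPU cycles while the lock holder is descheduled (LHP) */
     }

     static void spin_unlock(volatile int *l)
     {
         __sync_lock_release(l);
     }

     int main(void)
     {
         spin_lock(&lock);
         /* ... critical section ... */
         spin_unlock(&lock);
         return 0;
     }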
  16. Root cause of inefficiency with oversubscription
     ▪ Inefficiencies stem from thread synchronization
     ▪ 1. Blocking synchronization
     ▪ 2. Busy-waiting synchronization
       ◆ Prior spinning-detection methods are insufficient
          – (A) Software approaches
            » spin-then-block strategy
          – (B) Hardware approaches
            » pause-loop-exiting (PLE)
            » AMD pause filter (PF)
     ✖ Applicable only to VMs  ✖ Depends on the spin implementation  ✖ Needs application modification
     Custom busy-wait loops containing no PAUSE or NOP cannot be detected by either PLE or PF
  17. Goal
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection
  18. Virtual Blocking (VB)
     ▪ Eliminates the thread-waking overhead by removing the sleep queues
     ▪ 1. Add a thread_state flag
       ◆ 1: thread is blocked
       ◆ 0: not blocked
     ▪ 2. Skip blocked threads during CPU scheduling until thread_state is cleared (see the sketch below)
       ◆ Blocked threads still remain in the runqueue
     ▪ 3. thread_state is cleared when a thread is awakened
     ▪ User-level blocking primitives in pthreads (e.g., mutex, rwlock, condition variable) use futexes internally
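     A conceptual user-space model of the skip-based pick loop (my illustration, not the paper's kernel patch): blocked tasks stay in the runqueue with a thread_state-style flag set, the scheduler simply skips them, and a wakeup only has to clear the flag instead of re-enqueuing the task and running the full wakeup path.

     #include <stdbool.h>
     #include <stddef.h>
     #include <stdio.h>

     #define NTASKS 4

     struct task {
         int  id;
         bool blocked;        /* the thread_state flag: 1 = blocked, 0 = runnable */
     };

     static struct task runqueue[NTASKS];   /* blocked tasks are NOT removed */

     /* Pick the next runnable task, skipping virtually blocked ones. */
     static struct task *pick_next(int from)
     {
         for (int i = 0; i < NTASKS; i++) {
             struct task *t = &runqueue[(from + i) % NTASKS];
             if (!t->blocked)
                 return t;
         }
         return NULL;                       /* everyone is blocked */
     }

     static void block(struct task *t)  { t->blocked = true;  }   /* e.g. futex wait     */
     static void wakeup(struct task *t) { t->blocked = false; }   /* just clear the flag */

     int main(void)
     {
         for (int i = 0; i < NTASKS; i++)
             runqueue[i] = (struct task){ .id = i, .blocked = false };

         block(&runqueue[1]);               /* task 1 blocks on a futex          */
         struct task *next = pick_next(1);
         printf("picked task %d\n", next->id);   /* task 2: task 1 was skipped   */

         wakeup(&runqueue[1]);              /* wakeup = clearing the flag        */
         return 0;
     }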
  19. Busy-waiting Detection (BWD)
     ▪ Deschedule a spinning thread and set its skip flag if (see the sketch below):
        ▪ 1. All branches recorded in the LBR are identical, and
        ▪ 2. There are no TLB misses or L1 data cache misses
     ▪ The check runs on an hrtimer interrupt at a 100 us interval; all records are cleared at each period
     ▪ Last Branch Records (LBR): the (from, to) addresses of the 16 most recently completed branches
     ▪ Misses are read from HW performance counters
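     A conceptual sketch of the per-period check (my illustration of the two conditions, not the paper's kernel code): the sample structure is assumed to hold the 16 LBR entries and the TLB/L1D miss counts collected since the last 100 us tick.

     #include <stdbool.h>
     #include <stdint.h>

     #define LBR_ENTRIES 16

     struct lbr_entry { uint64_t from, to; };

     struct sample {
         struct lbr_entry lbr[LBR_ENTRIES];   /* last 16 completed branches        */
         uint64_t dtlb_misses;                /* TLB misses during this period     */
         uint64_t l1d_misses;                 /* L1 data cache misses this period  */
     };

     static bool is_busy_waiting(const struct sample *s)
     {
         /* condition 2: real progress would normally touch new data */
         if (s->dtlb_misses != 0 || s->l1d_misses != 0)
             return false;

         /* condition 1: all recorded branches are identical, i.e. the thread is
          * stuck in one tight loop */
         for (int i = 1; i < LBR_ENTRIES; i++)
             if (s->lbr[i].from != s->lbr[0].from || s->lbr[i].to != s->lbr[0].to)
                 return false;

         return true;   /* deschedule the thread and set its skip flag */
     }

     int main(void)
     {
         struct sample s = { 0 };    /* all-identical branches, no misses -> spinning */
         return is_busy_waiting(&s) ? 0 : 1;
     }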
  20. Implementation
     ▪ Implemented VB and BWD in Linux kernel 5.1.12
        ▪ VB: added 217 LOC
        ▪ BWD: added 104 LOC
     ▪ No changes to user-space libraries or applications
  21. Evaluation
     ▪ Goal
        ▪ Evaluate the effectiveness of VB and BWD
        ▪ Test the true positive rate of BWD
     ▪ Setup
        ▪ Same as slide 3
     ▪ Benchmarks
        ▪ PARSEC 3.0, SPLASH-2, NAS parallel benchmarks
        ▪ Memcached
          ◆ mutilate is used to stress-test the memcached server
          ◆ 10:1 GET-SET ratio with 128-byte keys and 2048-byte values
  22. Effectiveness of VB with micro-benchmarks
     ▪ VB improved the performance of blocking synchronization
     ▪ Most effective for group synchronization, i.e., barriers and condition variables
     [Figure: speedup per synchronization primitive, higher is better]
     ▪ One-to-one synchronization does not benefit, since only one waiter thread is woken up at a time
     ▪ Even with a 1:1 ratio of cores to threads, oversubscription can occur
  23. Effectiveness of VB with realistic benchmarks
     ▪ Thread oversubscription introduced a 5.5% to 56.7% slowdown
     ▪ VB achieved performance close to the baseline without oversubscription
     [Figure: execution time, lower is better; 8T = 8 threads, 8c = 8 cores, ht = hyper-threads, optimized = virtual blocking enabled]
  24. Effectiveness of VB with realistic benchmarks
     ▪ The culprits were the loss of CPU utilization and excessive migrations
     ▪ VB greatly improved CPU utilization and reduced the # of migrations
     [Table: runtime statistics with 32 threads on 8 cores]
  25. Effectiveness of VB with memcached server
     ▪ Improved the average latency, the 95th- and 99th-percentile tail latencies, and the throughput
     [Figure: throughput (higher is better) and latency (lower is better)]
     ▪ The number of threads affected by wake-up delays decreased
     ▪ The impact on average performance is minimal, since only a small number of threads wake up simultaneously
  26. Effectiveness of BWD with realistic benchmarks
     ▪ Achieves performance under oversubscription close to that without oversubscription
     [Table: execution time on 8 cores]
     ▪ PLE is not applicable to containers
     ▪ BWD is effective for customized busy-waiting algorithms
  27. True and false positive rates of BWD
     ▪ Close to 100% true positive rates for 10 different spinlocks
     ▪ At most a 0.61% false positive rate across 8 benchmarks
     ▪ True positive rate: measured with a micro-benchmark running 2 threads on a single core
       - Thread0: continuously holds a spinlock
       - Thread1: repeatedly tries to acquire the spinlock
     ▪ False positive rate: measured on 8 blocking benchmarks without any user- or kernel-level spinning
  28. Related Work
     ▪ Adjusting thread-level concurrency
        ▪ Decoupling contention management from scheduling [Johnson+, ASPLOS '10]
          ◆ Determines the optimal number of threads based on system load
          ◆ Requires changes to application source code or libraries
     ▪ Contention- and locality-aware lock design
     ▪ Spin Detection Hardware for Improved Management of Multithreaded Systems [Johnson+, TPDS '06]
        ◆ Adds dedicated hardware for detecting spinning within the CPU
        ◆ Requires new hardware support
  29. Conclusion
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection