Jia Rao², Song Wu¹, Hai Jin¹, Hong Jiang², Hao Che², Xiaofeng Wu²
¹ National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology
² The University of Texas at Arlington
HPDC ’21
Keio University, Kono Laboratory, Daiki Wakabayashi
Thread oversubscription (slide 2)
with more threads than cores in advance
▪ Used for applications that cannot dynamically change the number of threads
[Figure: thread0, thread1, and thread2 on CPU0; after the available cores are increased: thread0 on CPU0, thread1 on CPU1, thread2 on CPU2]
✔ Fully utilizes all CPUs even when scaled
Generally considered causes of degradation (slide 4)
Loss of locality
[Figure: without oversubscription (thread0 on CPU0) vs. with oversubscription (thread0, thread1, and thread2 on CPU0)]
✖ Switching cost
✖ Cache pollution
Goal (slide 5)
  ◆ Overhead of context switching
  ◆ Loss of locality
▪ Identify the root causes that are responsible for the slowdowns
  ◆ Complex wakeup process
  ◆ Fluctuating load and unnecessary migrations
  ◆ Occurrence of busy-waiting
▪ Propose two OS mechanisms to support efficient oversubscription
  ◆ Virtual blocking
  ◆ Busy-waiting detection
Overhead of context switch (slide 7)
use 1 core (probably)
▪ Micro benchmark
▪ Configure threads to yield the CPU after they finish the minimum time slice
  ◆ Time slice: 750 us (== sched_min_granularity)
▪ No data access
▪ Two options
  ◆ (a) Pure computation
  ◆ (b) Computation with synchronization
    – Update a shared variable with __sync_fetch_and_add
    – Incur heavy cache coherence traffic on multiple cores
[Figure, case (b): CPU0, CPU1, and CPU2 share a variable; 1. one CPU updates the value; 2. the cache line is invalidated on the others]
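The two options of this micro benchmark can be sketched roughly as follows. This is a minimal illustration, not the benchmark used in the paper: `run_microbench`, `worker`, `worker_arg`, and `MAX_THREADS` are hypothetical names, and a fixed number of loop iterations stands in for the 750 us time slice (an assumption made to keep the sketch deterministic). Only `__sync_fetch_and_add` comes from the slide.

```c
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

enum { MAX_THREADS = 64 };               /* arbitrary cap for this sketch */

static volatile uint64_t shared_counter; /* contended variable for case (b) */

struct worker_arg {
    int rounds;          /* emulated time slices per thread */
    int work_per_round;  /* loop iterations standing in for one slice (assumption) */
    int use_sync;        /* 0 = case (a), 1 = case (b) */
    uint64_t local;      /* private result of case (a) */
};

static void *worker(void *p)
{
    struct worker_arg *a = p;
    uint64_t acc = 0;

    for (int r = 0; r < a->rounds; r++) {
        for (int i = 0; i < a->work_per_round; i++) {
            if (a->use_sync)
                __sync_fetch_and_add(&shared_counter, 1); /* atomic update */
            else
                acc += (uint64_t)i;                       /* no shared data access */
        }
        sched_yield(); /* give up the CPU at the end of the emulated slice */
    }
    a->local = acc;
    return NULL;
}

/* Returns the shared counter in case (b), or the sum of private results in case (a). */
uint64_t run_microbench(int nthreads, int rounds, int work_per_round, int use_sync)
{
    pthread_t tid[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    uint64_t total = 0;
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return 0;
    shared_counter = 0;

    for (int t = 0; t < nthreads; t++) {
        args[t] = (struct worker_arg){ rounds, work_per_round, use_sync, 0 };
        if (pthread_create(&tid[t], NULL, worker, &args[t]) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++) {
        pthread_join(tid[t], NULL);
        total += args[t].local;
    }
    return use_sync ? shared_counter : total;
}
```

To compare the two cases, one would time `run_microbench` with `use_sync` set to 0 and to 1 for a growing thread count, restricting the process to a single core (for example with `taskset -c 0`) to isolate the context switch cost.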
Overhead of context switch (slide 8)
switch cost is relatively stable at 1.5 us
[Figure: normalized to performance with 1 thread; lower is better]
▪ Cost of cache coherence traffic is negligible
Effect of oversubscription on cache performance (slide 9)
use 1 core
▪ 1 or 2 threads
▪ Micro benchmark
▪ Each thread traverses a sub-array repeatedly
  ◆ Varying the total array size while the size of each sub-array is fixed
▪ Four access patterns
  ◆ (a) sequential x read, (b) sequential x read-modify-write, (c) random x read, (d) random x read-modify-write
[Figure: the total array is divided into subarray0 to subarray3; (a), (b) are sequential and (c), (d) are random; thread1 and thread2 alternate (t1, t2, t1, t2, …)]
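One pass over a sub-array under each of the four access patterns might look like the sketch below. The names `traverse`, `next_index`, and `enum pattern` are hypothetical, and the linear congruential generator used for the random patterns is an assumption chosen only to keep the example self-contained; the slides do not say how random indices are produced.

```c
#include <stddef.h>
#include <stdint.h>

enum pattern { SEQ_READ, SEQ_RMW, RAND_READ, RAND_RMW };

/* Small LCG for pseudo-random indices (an assumption, not from the slides). */
static size_t next_index(uint32_t *state, size_t n)
{
    *state = *state * 1664525u + 1013904223u;
    return (size_t)(*state >> 8) % n;
}

/* Performs n accesses on one sub-array and returns the sum of the values seen. */
uint64_t traverse(uint32_t *sub, size_t n, enum pattern p, uint32_t *rng_state)
{
    uint64_t sum = 0;

    if (n == 0)
        return 0;
    for (size_t k = 0; k < n; k++) {
        size_t i = (p == RAND_READ || p == RAND_RMW)
                       ? next_index(rng_state, n)   /* patterns (c), (d) */
                       : k;                         /* patterns (a), (b) */
        if (p == SEQ_RMW || p == RAND_RMW)
            sub[i] += 1;                            /* read-modify-write */
        sum += sub[i];                              /* read */
    }
    return sum;
}
```

In the oversubscribed configuration, two threads pinned to the same core would each call `traverse` repeatedly on their own sub-array, so their accesses alternate whenever the scheduler switches between them.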
Effect of oversubscription on cache performance (slide 10)
does not necessarily worsen
[Figure: t_over: execution time with 2 threads; t_serial: execution time with 1 thread; lower is better; annotation: out of L2 cache]
▪ 1 ms: alternating access makes prefetching less effective (< 17.5 ms: traverse their sub-arrays)
Effect of oversubscription on cache performance (slide 11)
does not necessarily worsen
[Figure: t_over: execution time with 2 threads; t_serial: execution time with 1 thread; lower is better]
▪ Out of L2 cache
  ◆ 1 thread: can use the L2 cache partially
  ◆ 2 threads: the L2 cache is completely flushed
▪ TLB entries are easy to reuse during a scheduling period because of the sub-array division
Root cause of inefficiency with oversubscription (slide 13)
  ◆ (A) Complex wakeup process
    – ① Select the most idle core on which to wake up the thread
    – ② Insert the awakening thread into that core's runqueue
    – ③ Check whether the waking thread should preempt the currently running thread
  ◆ (B) Fluctuating load and unnecessary migrations
▪ 2. Busy-waiting synchronization
[Figure: awakening thread A; CPU0 runs thread B with 2 threads in its runqueue; CPU1 runs thread C with 4 threads in its runqueue; ①, ② place thread A on a runqueue; ③ compares the priority of threads A and B]
overhead is significant in oversubscription
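The three wakeup steps ①, ②, ③ can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the Linux scheduler code: `struct cpu`, `select_idlest_cpu`, and `wake_up_thread` are hypothetical names, "most idle" is taken to mean the shortest runqueue, and preemption is decided by a plain priority comparison in which a lower number means higher priority.

```c
#include <stddef.h>

struct cpu {
    int nr_running;  /* threads currently in this CPU's runqueue */
    int curr_prio;   /* priority of the running thread (lower = higher priority) */
};

/* Step 1: pick the CPU with the fewest runnable threads; -1 if there are no CPUs. */
int select_idlest_cpu(const struct cpu *cpus, int ncpus)
{
    int best = -1;

    for (int c = 0; c < ncpus; c++) {
        if (best < 0 || cpus[c].nr_running < cpus[best].nr_running)
            best = c;
    }
    return best;
}

/*
 * Steps 2 and 3: enqueue the waking thread on the chosen CPU, then report
 * whether it should preempt the thread that is running there.
 * Returns 1 to preempt, 0 otherwise; *target receives the chosen CPU.
 */
int wake_up_thread(struct cpu *cpus, int ncpus, int prio, int *target)
{
    int c = select_idlest_cpu(cpus, ncpus);

    *target = c;
    if (c < 0)
        return 0;
    cpus[c].nr_running++;               /* step 2: insert into the runqueue */
    return prio < cpus[c].curr_prio;    /* step 3: preemption check */
}
```

With the runqueue lengths from the figure (2 threads on CPU0, 4 on CPU1), the sketch places the awakening thread on CPU0. Every wakeup pays for the scan over all CPUs plus the enqueue and preemption check, and wakeups become more frequent when threads outnumber cores.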
Root cause of inefficiency with oversubscription (slide 15)
synchronization
▪ 2. Busy-waiting synchronization
  ◆ Provides fast lock acquisition at the cost of wasting CPU cycles on spinning
  ◆ Oversubscription exacerbates the lock-holder preemption (LHP) problem
[Figure: thread0 (waiter), thread1 (waiter), thread2 (holder), and thread3 (waiter) share CPU0; the waiters cannot stop spinning until the lock holder thread is scheduled]
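A minimal test-and-set spinlock shows why a waiter burns CPU cycles while the holder is descheduled. The type and function names (`tas_lock_t`, `tas_trylock`, `tas_lock`, `tas_unlock`) are hypothetical; the GCC builtins `__sync_lock_test_and_set` and `__sync_lock_release` are used here as one common way to build such a lock, not as the implementation discussed in the paper.

```c
typedef struct {
    volatile int locked;   /* 0 = free, 1 = held */
} tas_lock_t;

/* Returns 1 if the lock was acquired, 0 if it was already held. */
int tas_trylock(tas_lock_t *l)
{
    return __sync_lock_test_and_set(&l->locked, 1) == 0;
}

/* Spins until the lock is acquired; returns the number of failed attempts. */
unsigned long tas_lock(tas_lock_t *l)
{
    unsigned long spins = 0;

    while (!tas_trylock(l))
        spins++;           /* wasted cycles: nothing else runs on this core */
    return spins;
}

void tas_unlock(tas_lock_t *l)
{
    __sync_lock_release(&l->locked);
}
```

If the holder is preempted while other threads on the same core call `tas_lock`, each waiter spins through its whole time slice without making progress, which is the LHP situation shown in the figure.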
Root cause of inefficiency with oversubscription (slide 16)
▪ 2. Busy-waiting synchronization
  ◆ Prior spinning detection methods are insufficient
    – (A) Software approaches
      » spin-then-block strategy
    – (B) Hardware approaches
      » pause-loop-exiting (PLE)
      » AMD pause filter (PF)
✖ Applicable only to VMs
✖ Depend on the spin implementation
✖ Need to modify the application
[Figure annotation: cannot be detected by either PLE or PF (no PAUSE or NOP)]
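The spin-then-block strategy listed under the software approaches can be sketched on top of a pthread mutex. `spin_then_block_lock` and `max_spins` are hypothetical names introduced for illustration; `pthread_mutex_trylock` and `pthread_mutex_lock` are the standard POSIX calls.

```c
#include <pthread.h>

/*
 * Try to take the mutex by polling up to max_spins times; if that fails,
 * fall back to a blocking acquisition.
 * Returns 1 if the lock was taken during the spin phase, 0 if the caller
 * went through the blocking path.
 */
int spin_then_block_lock(pthread_mutex_t *m, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return 1;          /* acquired while spinning */
    }
    pthread_mutex_lock(m);     /* stop spinning and block until available */
    return 0;
}
```

The lock call itself has to be replaced in the application or its library, which corresponds to the "need to modify the application" drawback listed on the slide.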
Virtual Blocking (VB) (slide 18)
▪ 1. Add a thread_state flag
  ◆ 1: thread is blocked
  ◆ 0: not blocked
▪ 2. Skip blocked threads during CPU scheduling until thread_state is cleared
  ◆ Blocked threads still remain in the runqueue
▪ 3. thread_state is cleared when a thread is awakened
User-level libraries like pthreads (e.g., spinlock, mutex, rwlock) all use futexes internally
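The three steps can be modeled with a toy runqueue. This is a sketch under stated assumptions rather than the kernel change itself: `struct vthread`, `struct runqueue`, `vb_block`, `vb_wake`, and `pick_next` are hypothetical names, and a round-robin scan stands in for the real scheduling policy. Only the `thread_state` flag and its 1/0 meaning come from the slide.

```c
#include <stddef.h>

struct vthread {
    int id;
    int thread_state;   /* 1: blocked, 0: not blocked (as on the slide) */
};

struct runqueue {
    struct vthread *threads;  /* blocked threads are never removed from here */
    int len;
    int cursor;               /* next position to examine (round-robin assumption) */
};

/* Step 1: mark the thread as blocked instead of dequeuing it. */
void vb_block(struct vthread *t) { t->thread_state = 1; }

/* Step 3: waking a thread only clears the flag. */
void vb_wake(struct vthread *t) { t->thread_state = 0; }

/* Step 2: pick the next runnable thread, skipping virtually blocked ones. */
struct vthread *pick_next(struct runqueue *rq)
{
    for (int i = 0; i < rq->len; i++) {
        int idx = (rq->cursor + i) % rq->len;

        if (rq->threads[idx].thread_state == 0) {
            rq->cursor = (idx + 1) % rq->len;
            return &rq->threads[idx];
        }
    }
    return NULL;   /* every thread in the runqueue is blocked */
}
```

Because a wakeup in this model is a single flag write, the thread stays on its runqueue and none of the core selection, enqueue, or preemption work is repeated.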
Busy-waiting Detection (BWD) (slide 19)
if …
▪ 1. All branches recorded in the LBR are identical, and
▪ 2. There are no TLB misses or L1 data cache misses
[Figure: an hrtimer interrupt fires at a 100 us interval and inspects two sources]
  ◆ Last Branch Records: (from, to) addresses of recently completed branches x 16 entries
  ◆ HW performance counters
* All records are cleared for each period
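The two conditions can be written as a predicate over one sampling period. The sketch below assumes the LBR entries and miss counters have already been read into a plain struct; `struct lbr_entry`, `struct bwd_sample`, and `bwd_is_busy_waiting` are hypothetical names, and how the values are collected from hardware is outside the scope of this example.

```c
#include <stdint.h>

#define LBR_ENTRIES 16   /* number of (from, to) records, as on the slide */

struct lbr_entry {
    uint64_t from;       /* address of the branch instruction */
    uint64_t to;         /* branch target address */
};

struct bwd_sample {
    struct lbr_entry lbr[LBR_ENTRIES]; /* records gathered in this period */
    uint64_t tlb_misses;               /* TLB misses in this period */
    uint64_t l1d_misses;               /* L1 data cache misses in this period */
};

/* Returns 1 if the sample satisfies both conditions, 0 otherwise. */
int bwd_is_busy_waiting(const struct bwd_sample *s)
{
    /* Condition 2: no TLB misses and no L1 data cache misses. */
    if (s->tlb_misses != 0 || s->l1d_misses != 0)
        return 0;

    /* Condition 1: every recorded branch equals the first one. */
    for (int i = 1; i < LBR_ENTRIES; i++) {
        if (s->lbr[i].from != s->lbr[0].from || s->lbr[i].to != s->lbr[0].to)
            return 0;
    }
    return 1;
}
```

A timer handler running every 100 us would fill a `struct bwd_sample`, call the predicate, and then reset the records for the next period.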
Evaluation (slide 21)
Test true positive rate of BWD
▪ Setup
  ◆ Same as page 3
▪ Benchmarks
  ◆ PARSEC 3.0, SPLASH-2, NAS parallel benchmarks
  ◆ Memcached
    – Use mutilate to stress test the performance of the memcached server
    – 10:1 GET-SET ratio with 128-byte key size and 2048-byte value size
Effectiveness of VB with micro-benchmarks (slide 22)
effective for group synchronization, i.e., barrier and condition variable
[Figure: higher is better]
▪ One-to-one synchronization does not benefit since only one waiter thread is woken up
▪ Even with a 1:1 ratio of cores to threads, oversubscription can occur
Effectiveness of VB with realistic benchmarks (slide 24)
excessive migrations
▪ VB greatly improved CPU utilization and reduced the number of migrations
[Caption: The runtime statistics (8 cores), 32 threads]
Effectiveness of VB with memcached server (slide 25)
latencies, and the throughput
[Figure: two panels, one marked "higher is better" and one marked "lower is better"]
▪ The number of threads affected by wake-up delays has decreased
▪ Impact on average performance is minimal since only a small number of threads wake up simultaneously
Effectiveness of BWD with realistic benchmarks (slide 26)
oversubscription
[Figure: Execution time (8 cores)]
▪ PLE is not applicable to containers
▪ BWD is effective for customized busy-waiting algorithms
True and false positive rates of BWD (slide 27)
spinlocks
▪ At most 0.61% false positive rate across 8 benchmarks
▪ True positive rate of BWD: micro-benchmark with 2 threads on a single core
  ◆ Thread0: continuously holds a spinlock
  ◆ Thread1: repeatedly tries to acquire the spinlock
▪ False positive rate of BWD: 8 blocking benchmarks without any user or kernel-level spinning
Related Works (slide 28)
[Johnson+, ASPLOS ’10]
  ◆ Determine the optimal number of threads based on system load
  ◆ Changes to application source code or libraries are required
▪ Contention- and locality-aware lock design
▪ Spin Detection Hardware for Improved Management of Multithreaded Systems [Li+, TPDS ’06]
  ◆ Add dedicated hardware for detecting spinning within the CPU
  ◆ Requires new hardware support
Conclusion (slide 29)
  ◆ Overhead of context switching
  ◆ Loss of locality
▪ Identify the root causes that are responsible for the slowdowns
  ◆ Complex wakeup process
  ◆ Fluctuating load and unnecessary migrations
  ◆ Occurrence of busy-waiting
▪ Propose two OS mechanisms to support efficient oversubscription
  ◆ Virtual blocking
  ◆ Busy-waiting detection