Reading: Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription

wkb8s

October 17, 2025

Transcript

  1. Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription
     Hang Huang (1), Jia Rao (2), Song Wu (1), Hai Jin (1), Hong Jiang (2), Hao Che (2), Xiaofeng Wu (2)
     (1) National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology
     (2) The University of Texas at Arlington
     HPDC '21
     Presented by Daiki Wakabayashi, Kono Laboratory, Keio University
  2. Thread oversubscription
     ▪ Technique to prepare for core scaling
     ▪ Run applications with more threads than cores in advance
     ▪ Used for applications that cannot dynamically change the # of threads
     [Figure: thread0-2 time-share CPU0; once CPU1 and CPU2 become available, the same threads spread across all three cores]
     ✔ Fully utilizes the entire CPU even after scaling up
  3. Performance degradation due to oversubscription
     ▪ Oversubscription is still inefficient for a large number of applications
     ▪ Benchmarks suffer as much as a 25x slowdown
     [Figure: slowdown per benchmark, lower is better; 8T = 8 threads (oversubscription ratio 1), 32T = 32 threads (oversubscription ratio 4)]
     Experimental setup:
        CPU: Intel Xeon 2.10 GHz processors
        Cores: 8
        Memory: 128 GB
        OS: Ubuntu 16.04 64-bit
        Kernel: 5.1.12
        Benchmarks: PARSEC 3.0, SPLASH-2, NAS parallel benchmarks
  4. Generally considered causes of degradation
     ▪ 1. Overhead due to frequent context switching (✖ switching cost)
     ▪ 2. Loss of locality (✖ cache pollution)
     [Figure: without oversubscription, thread0 runs alone on CPU0; with oversubscription, thread0-2 share CPU0]
  5. Goal
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection
  6. Goal
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection
  7. Overhead of context switch
     ▪ Setup
        ▪ Same as the previous setup
        ▪ Only 1 core is used (probably)
     ▪ Micro-benchmark (see the sketch below)
        ▪ Threads are configured to yield the CPU after they finish the minimum time slice
          ◆ Time slice: 750 us (== sched_min_granularity)
        ▪ No data access
        ▪ Two options
          ◆ (a) Pure computation
          ◆ (b) Computation with synchronization
             – Update a shared variable with __sync_fetch_and_add
             – Incurs heavy cache-coherence traffic on multiple cores
     [Figure, case (b): a thread on CPU0 updates the shared variable, which invalidates the corresponding cache line on CPU1 and CPU2]
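     A minimal user-space sketch of this micro-benchmark in C (my reconstruction, not the authors' code): the thread count, iteration count, and the amount of dummy work per slice are illustrative stand-ins, and the yield is issued once per iteration rather than exactly at the 750 us boundary.

     /* Case (b): each thread does a burst of computation, updates a shared
      * counter with __sync_fetch_and_add, then yields the CPU so the next
      * runnable thread is scheduled. Dropping the atomic update gives case (a). */
     #include <pthread.h>
     #include <sched.h>
     #include <stdio.h>

     #define NTHREADS 4
     #define ITERS    10000
     #define WORK     10000          /* dummy work per scheduling slice (illustrative) */

     static volatile long shared;    /* contended cache line in case (b) */

     static void *worker(void *arg)
     {
         volatile long sink = 0;
         for (long i = 0; i < ITERS; i++) {
             for (long j = 0; j < WORK; j++)      /* (a) pure computation        */
                 sink += j;
             __sync_fetch_and_add(&shared, 1);    /* (b) cache-coherence traffic */
             sched_yield();                       /* force a context switch      */
         }
         return NULL;
     }

     int main(void)
     {
         pthread_t t[NTHREADS];
         for (int i = 0; i < NTHREADS; i++)
             pthread_create(&t[i], NULL, worker, NULL);
         for (int i = 0; i < NTHREADS; i++)
             pthread_join(t[i], NULL);
         printf("shared = %ld\n", shared);
         return 0;
     }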
  8. Overhead of context switch
     ▪ Oversubscription does not add additional overhead
     ▪ Per-context-switch cost is relatively stable at 1.5 us
     ▪ Cost of cache-coherence traffic is negligible
     [Figure: execution time normalized to the 1-thread case, lower is better]
  9. Effect of oversubscription on cache performance
     ▪ Setup
        ▪ Same as the previous setup
        ▪ Only 1 core is used
        ▪ 1 or 2 threads
     ▪ Micro-benchmark (see the sketch below)
        ▪ Each thread traverses a sub-array repeatedly
          ◆ The total array size is varied while the size of each sub-array is fixed
        ▪ Four access patterns
          ◆ (a) sequential x read, (b) sequential x read-modify-write, (c) random x read, (d) random x read-modify-write
     [Figure: the total array divided into subarray0-subarray3; thread1 and thread2 alternate on the core (t1, t2, t1, t2, ...), each traversing its own sub-array; (a), (b) sequential, (c), (d) random]
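     A minimal sketch of the traversal micro-benchmark (my reconstruction, not the authors' code): the random and rmw flags select one of the four access patterns, and the sub-array size, pass count, and seed are illustrative values.

     #include <pthread.h>
     #include <stdlib.h>

     #define NTHREADS  2
     #define SUB_ELEMS (1 << 20)          /* elements per sub-array (fixed)     */

     static long *array;                  /* total array = NTHREADS sub-arrays  */

     struct arg { long *sub; int random; int rmw; };

     static void *traverse(void *p)
     {
         struct arg *a = p;
         unsigned seed = 1234;
         volatile long sink = 0;
         for (int pass = 0; pass < 100; pass++) {
             for (long i = 0; i < SUB_ELEMS; i++) {
                 long idx = a->random ? rand_r(&seed) % SUB_ELEMS : i;
                 if (a->rmw)
                     a->sub[idx]++;       /* read-modify-write */
                 else
                     sink += a->sub[idx]; /* read-only         */
             }
         }
         return NULL;
     }

     int main(void)
     {
         array = calloc((size_t)NTHREADS * SUB_ELEMS, sizeof(long));
         pthread_t t[NTHREADS];
         struct arg args[NTHREADS];
         for (int i = 0; i < NTHREADS; i++) {
             /* pattern (b): sequential x read-modify-write; flip the flags for (a), (c), (d) */
             args[i] = (struct arg){ array + (long)i * SUB_ELEMS, 0, 1 };
             pthread_create(&t[i], NULL, traverse, &args[i]);
         }
         for (int i = 0; i < NTHREADS; i++)
             pthread_join(t[i], NULL);
         free(array);
         return 0;
     }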
  10. Effect of oversubscription on cache performance
     ▪ Oversubscription can sometimes improve TLB efficiency
     ▪ Cache/TLB impact does not necessarily worsen
     [Figure: t_over / t_serial vs. total array size, lower is better; t_over = execution time with 2 threads, t_serial = execution time with 1 thread. Annotations: out of L2 cache; 1 ms: alternating access makes prefetching less effective (< 17.5 ms: threads traverse their own sub-arrays)]
  11. Effect of oversubscription on cache performance
     ▪ Oversubscription can sometimes improve TLB efficiency
     ▪ Cache/TLB impact does not necessarily worsen
     [Figure: t_over / t_serial vs. total array size, lower is better; t_over = execution time with 2 threads, t_serial = execution time with 1 thread]
     ▪ Out of L2 cache: with 1 thread the L2 cache can still be partially reused, while with 2 threads it is completely flushed
     ▪ TLB entries are easy to re-use within a scheduling period because each thread works on its own sub-array
  12. Goal
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection
  13. Root cause of inefficiency with oversubscription
     ▪ Inefficiencies stem from thread synchronization
     ▪ 1. Blocking synchronization
       ◆ (A) Complex wakeup process (a sketch of these steps follows below)
          – ① Select the most idle core to wake the thread up on
          – ② Insert the awakening thread into that core's runqueue
          – ③ Check whether the waking thread should preempt the currently running thread
       ◆ (B) Fluctuating load and unnecessary migrations
     ▪ 2. Busy-waiting synchronization
     [Figure: thread A is being awakened; CPU0 runs thread B with 2 threads in its runqueue, CPU1 runs thread C with 4 threads in its runqueue; steps ① and ② pick CPU0 and enqueue thread A there, step ③ compares the priorities of threads A and B]
     ▪ This overhead is significant under oversubscription
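     A simplified user-space model of the three steps above (roughly what Linux's try_to_wake_up() path does in v5.1: pick a CPU, enqueue, then check preemption). All structures and values here are illustrative stand-ins; the real kernel additionally takes runqueue locks and walks the scheduling domains, which is where the cost grows under oversubscription.

     #include <stdbool.h>
     #include <stdio.h>

     #define NCPUS 2

     struct task { int prio; bool runnable; };
     struct rq   { struct task *curr; int nr_running; };

     static struct rq rqs[NCPUS];

     /* Step ①: pick the least loaded CPU for the waking task. */
     static int select_cpu(void)
     {
         int best = 0;
         for (int c = 1; c < NCPUS; c++)
             if (rqs[c].nr_running < rqs[best].nr_running)
                 best = c;
         return best;
     }

     static void wake_up_task(struct task *t)
     {
         int cpu = select_cpu();              /* ① choose the target CPU          */
         t->runnable = true;
         rqs[cpu].nr_running++;               /* ② insert into its runqueue       */
         if (t->prio < rqs[cpu].curr->prio)   /* ③ preempt if higher priority     */
             printf("preempt CPU%d\n", cpu);  /*    (lower value = higher prio)   */
     }

     int main(void)
     {
         struct task running = { .prio = 120, .runnable = true };
         struct task waker   = { .prio = 100, .runnable = false };
         for (int c = 0; c < NCPUS; c++)
             rqs[c] = (struct rq){ .curr = &running, .nr_running = 1 };
         wake_up_task(&waker);
         return 0;
     }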
  14. Root cause of inefficiency with oversubscription
     ▪ Inefficiencies stem from thread synchronization
     ▪ 1. Blocking synchronization
       ◆ (A) Complex wakeup process
       ◆ (B) Fluctuating load and unnecessary migrations
          – Frequent switching between "runnable" and "sleep" makes runqueue lengths fluctuate and may trigger excessive, unnecessary migrations
     ▪ 2. Busy-waiting synchronization
     [Figure: thread0 and thread1 bouncing between Runqueue0 and Sleepqueue1]
  15. Root cause of inefficiency with oversubscription
     ▪ Inefficiencies stem from thread synchronization
     ▪ 1. Blocking synchronization
     ▪ 2. Busy-waiting synchronization (see the sketch below)
       ◆ Provides fast lock acquisition at the cost of wasting CPU cycles on spinning
       ◆ Oversubscription exacerbates the lock-holder preemption (LHP) problem
     [Figure: thread2 holds the lock while thread0, thread1, and thread3 are waiters sharing CPU0; the waiters cannot stop spinning until the lock-holder thread is scheduled]
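     A minimal test-and-set spinlock sketch (illustrative, using GCC builtins rather than any particular library): if the holder is preempted, every waiter burns its whole time slice in this loop, and because the loop contains no PAUSE instruction it is exactly the kind of spinning that the hardware detectors discussed on the next slide miss.

     /* Illustrative busy-waiting lock: waiters spin until the holder releases. */
     static volatile int lock;

     static void spin_lock(volatile int *l)
     {
         while (__sync_lock_test_and_set(l, 1))
             ;   /* wastes CPU cycles while the lock holder is descheduled (LHP) */
     }

     static void spin_unlock(volatile int *l)
     {
         __sync_lock_release(l);
     }

     int main(void)
     {
         spin_lock(&lock);
         /* ... critical section ... */
         spin_unlock(&lock);
         return 0;
     }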
  16. Root cause of inefficiency with oversubscription
     ▪ Inefficiencies stem from thread synchronization
     ▪ 1. Blocking synchronization
     ▪ 2. Busy-waiting synchronization
       ◆ Prior spinning-detection methods are insufficient
          – (A) Software approaches
            » spin-then-block strategy
          – (B) Hardware approaches
            » pause-loop-exiting (PLE)
            » AMD pause filter (PF)
     ✖ Applicable only to VMs  ✖ Depends on the spin implementation  ✖ Needs application modification
     Custom busy-wait loops containing no PAUSE or NOP cannot be detected by either PLE or PF
  17. Goal
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection
  18. Virtual Blocking (VB)
     ▪ Eliminates the thread-waking overhead by removing the sleep queues
     ▪ 1. Add a thread_state flag
       ◆ 1: thread is blocked
       ◆ 0: not blocked
     ▪ 2. Skip blocked threads during CPU scheduling until thread_state is cleared (see the sketch below)
       ◆ Blocked threads still remain in the runqueue
     ▪ 3. thread_state is cleared when a thread is awakened
     ▪ User-level blocking primitives in pthreads (e.g., mutex, rwlock, condition variable) use futexes internally
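     A conceptual user-space model of the skip-based pick loop (my illustration, not the paper's kernel patch): blocked tasks stay in the runqueue with a thread_state-style flag set, the scheduler simply skips them, and a wakeup only has to clear the flag instead of re-enqueuing the task and running the full wakeup path.

     #include <stdbool.h>
     #include <stddef.h>
     #include <stdio.h>

     #define NTASKS 4

     struct task {
         int  id;
         bool blocked;        /* the thread_state flag: 1 = blocked, 0 = runnable */
     };

     static struct task runqueue[NTASKS];   /* blocked tasks are NOT removed */

     /* Pick the next runnable task, skipping virtually blocked ones. */
     static struct task *pick_next(int from)
     {
         for (int i = 0; i < NTASKS; i++) {
             struct task *t = &runqueue[(from + i) % NTASKS];
             if (!t->blocked)
                 return t;
         }
         return NULL;                       /* everyone is blocked */
     }

     static void block(struct task *t)  { t->blocked = true;  }   /* e.g. futex wait     */
     static void wakeup(struct task *t) { t->blocked = false; }   /* just clear the flag */

     int main(void)
     {
         for (int i = 0; i < NTASKS; i++)
             runqueue[i] = (struct task){ .id = i, .blocked = false };

         block(&runqueue[1]);               /* task 1 blocks on a futex          */
         struct task *next = pick_next(1);
         printf("picked task %d\n", next->id);   /* task 2: task 1 was skipped   */

         wakeup(&runqueue[1]);              /* wakeup = clearing the flag        */
         return 0;
     }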
  19. Busy-waiting Detection (BWD)
     ▪ Deschedule a spinning thread and set its skip flag if (see the sketch below):
        ▪ 1. All branches recorded in the LBR are identical, and
        ▪ 2. There are no TLB misses or L1 data cache misses
     ▪ The check runs on an hrtimer interrupt at a 100 us interval; all records are cleared at each period
     ▪ Last Branch Records (LBR): the (from, to) addresses of the 16 most recently completed branches
     ▪ Misses are read from HW performance counters
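     A conceptual sketch of the per-period check (my illustration of the two conditions, not the paper's kernel code): the sample structure is assumed to hold the 16 LBR entries and the TLB/L1D miss counts collected since the last 100 us tick.

     #include <stdbool.h>
     #include <stdint.h>

     #define LBR_ENTRIES 16

     struct lbr_entry { uint64_t from, to; };

     struct sample {
         struct lbr_entry lbr[LBR_ENTRIES];   /* last 16 completed branches        */
         uint64_t dtlb_misses;                /* TLB misses during this period     */
         uint64_t l1d_misses;                 /* L1 data cache misses this period  */
     };

     static bool is_busy_waiting(const struct sample *s)
     {
         /* condition 2: real progress would normally touch new data */
         if (s->dtlb_misses != 0 || s->l1d_misses != 0)
             return false;

         /* condition 1: all recorded branches are identical, i.e. the thread is
          * stuck in one tight loop */
         for (int i = 1; i < LBR_ENTRIES; i++)
             if (s->lbr[i].from != s->lbr[0].from || s->lbr[i].to != s->lbr[0].to)
                 return false;

         return true;   /* deschedule the thread and set its skip flag */
     }

     int main(void)
     {
         struct sample s = { 0 };    /* all-identical branches, no misses -> spinning */
         return is_busy_waiting(&s) ? 0 : 1;
     }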
  20. Implementation
     ▪ Implemented VB and BWD in Linux kernel 5.1.12
        ▪ VB: added 217 LOC
        ▪ BWD: added 104 LOC
     ▪ No changes to user-space libraries or applications
  21. Evaluation
     ▪ Goal
        ▪ Evaluate the effectiveness of VB and BWD
        ▪ Test the true positive rate of BWD
     ▪ Setup
        ▪ Same as slide 3
     ▪ Benchmarks
        ▪ PARSEC 3.0, SPLASH-2, NAS parallel benchmarks
        ▪ Memcached
          ◆ mutilate is used to stress-test the memcached server
          ◆ 10:1 GET-SET ratio with 128-byte keys and 2048-byte values
  22. Effectiveness of VB with micro-benchmarks
     ▪ VB improved the performance of blocking synchronization
     ▪ Most effective for group synchronization, i.e., barriers and condition variables
     [Figure: speedup per synchronization primitive, higher is better]
     ▪ One-to-one synchronization does not benefit, since only one waiter thread is woken up at a time
     ▪ Even with a 1:1 ratio of cores to threads, oversubscription can occur
  23. Effectiveness of VB with realistic benchmarks
     ▪ Thread oversubscription introduced a 5.5% to 56.7% slowdown
     ▪ VB achieved performance close to the baseline without oversubscription
     [Figure: execution time, lower is better; 8T = 8 threads, 8c = 8 cores, ht = hyper-threads, optimized = virtual blocking enabled]
  24. Effectiveness of VB with realistic benchmarks
     ▪ The culprits were the loss of CPU utilization and excessive migrations
     ▪ VB greatly improved CPU utilization and reduced the # of migrations
     [Table: runtime statistics with 32 threads on 8 cores]
  25. Effectiveness of VB with memcached server
     ▪ Improved the average latency, the 95th- and 99th-percentile tail latencies, and the throughput
     [Figure: throughput (higher is better) and latency (lower is better)]
     ▪ The number of threads affected by wake-up delays decreased
     ▪ The impact on average performance is minimal, since only a small number of threads wake up simultaneously
  26. Effectiveness of BWD with realistic benchmarks
     ▪ Achieves performance under oversubscription close to that without oversubscription
     [Table: execution time on 8 cores]
     ▪ PLE is not applicable to containers
     ▪ BWD is effective for customized busy-waiting algorithms
  27. True and false positive rates of BWD
     ▪ Close to 100% true positive rates for 10 different spinlocks
     ▪ At most a 0.61% false positive rate across 8 benchmarks
     ▪ True positive rate: measured with a micro-benchmark running 2 threads on a single core
       - Thread0: continuously holds a spinlock
       - Thread1: repeatedly tries to acquire the spinlock
     ▪ False positive rate: measured on 8 blocking benchmarks without any user- or kernel-level spinning
  28. Related Work
     ▪ Adjusting thread-level concurrency
        ▪ Decoupling contention management from scheduling [Johnson+, ASPLOS '10]
          ◆ Determines the optimal number of threads based on system load
          ◆ Requires changes to application source code or libraries
     ▪ Contention- and locality-aware lock design
     ▪ Spin Detection Hardware for Improved Management of Multithreaded Systems [Johnson+, TPDS '06]
        ◆ Adds dedicated hardware for detecting spinning within the CPU
        ◆ Requires new hardware support
  29. Conclusion
     ▪ Point out that the commonly assumed disadvantages are trivial
        ▪ Overhead of context switching
        ▪ Loss of locality
     ▪ Identify the root causes that are responsible for the slowdowns
        ▪ Complex wakeup process
        ▪ Fluctuating load and unnecessary migrations
        ▪ Occurrence of busy-waiting
     ▪ Propose two OS mechanisms to support efficient oversubscription
        ▪ Virtual blocking
        ▪ Busy-waiting detection