wkb8s
October 11, 2024
Reading: Fair Scheduling for AVX2 and AVX-512 Workloads


Transcript

  1. Fair Scheduling for AVX2 and AVX-512 Workloads
     Mathias Gottschlag, Philipp Machauer, Yussuf Khalil, Frank Bellosa (Karlsruhe Institute of Technology), USENIX ATC '21
     Keio University, Kono Laboratory, Daiki Wakabayashi
  2. Power-Limited Computing
     ▪ CPU performance is commonly limited by power consumption
     ▪ Example: Intel Turbo Boost
       ◆ The turbo level is determined by the number of active cores
     ▪ Power consumption depends on the instructions executed
     ▪ Simple instructions allow higher frequencies
  3. AVX2, AVX-512
     ▪ SIMD instructions for 256/512-bit vectors by Intel
     ▪ Energy consumption ∝ vector size
     ▪ Three frequency levels (non-AVX, AVX2, AVX-512)
  4. Unfairness in Existing Schedulers
     ▪ Frequency reduction affects other, less power-intensive tasks
       ▪ 1. After a context switch
       ▪ 2. Due to hyper-threading
     ▪ Equal CPU time does not ensure equal relative performance
  5. Proposal: Frequency Reduction Compensation
     ▪ Subtract the frequency-reduction overhead when accounting the CPU time of a victim task
     ▪ Ensure equal relative performance instead of equal CPU time
     ▪ Victim task: executes non-AVX instructions but suffers from frequency reduction
     ▪ Two challenges
       ▪ Detect victim tasks
       ▪ Estimate performance impact
     [Figure: "Equal CPU Time" vs. "Proposal: Relative CPU Time"; the overhead portion is not accounted by the scheduler]
  6. Detecting Victim Tasks
     ▪ Exploit the trap mechanism to detect AVX2/AVX-512 instructions
     ▪ Disable AVX instructions (clear the controlling bits) at each context switch
     ▪ The exception handler is called when an AVX instruction is executed
       ◆ Re-enable AVX instructions and mark the task as an AVX2/AVX-512 task
     [Figure: XCR0 register with the AVX, ZMM_Hi256, and Hi16_ZMM bits]
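
The detection steps on this slide can be sketched as a small user-space model. This is only a toy simulation, not kernel code: the `Task` and `Cpu` classes, the instruction labels, and the choice to reset a task's classification at every context switch are assumptions made for illustration. The XCR0 bit positions (AVX = bit 2, ZMM_Hi256 = bit 6, Hi16_ZMM = bit 7) follow Intel's architecture manual.

```python
# Toy model of trap-based AVX detection; illustrative only, not kernel code.
XCR0_AVX = 1 << 2        # enables 256-bit AVX state
XCR0_ZMM_HI256 = 1 << 6  # upper halves of ZMM0-ZMM15 (AVX-512)
XCR0_HI16_ZMM = 1 << 7   # ZMM16-ZMM31 (AVX-512)
AVX512_BITS = XCR0_ZMM_HI256 | XCR0_HI16_ZMM
ALL_VECTOR_BITS = XCR0_AVX | AVX512_BITS


class Task:
    def __init__(self, name):
        self.name = name
        self.avx_class = "non-AVX"  # upgraded by the trap handler


class Cpu:
    def __init__(self):
        self.xcr0 = ALL_VECTOR_BITS
        self.current = None

    def context_switch(self, task):
        # Clear the controlling bits so the first vector instruction traps.
        self.xcr0 &= ~ALL_VECTOR_BITS
        # Assumption: the classification is tracked per time slice.
        task.avx_class = "non-AVX"
        self.current = task

    def execute(self, kind):
        """kind is 'scalar', 'avx2', or 'avx512' (hypothetical labels)."""
        needed = {"scalar": 0, "avx2": XCR0_AVX, "avx512": ALL_VECTOR_BITS}[kind]
        if needed & ~self.xcr0:
            # A required bit is cleared: the hardware would raise an exception.
            self.trap_handler(needed)

    def trap_handler(self, needed):
        # Re-enable the instructions and remember what the task used.
        self.xcr0 |= needed
        self.current.avx_class = "AVX-512" if needed & AVX512_BITS else "AVX2"
```

In this model, a task that is still marked `"non-AVX"` at the end of its time slice is a candidate victim task, while only the first vector instruction after each context switch pays the cost of a trap.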
  7. Estimating Performance Impact
     ▪ Scale the victim's CPU time by the ratio between ideal and actual frequency
     ▪ Do not add the entire time slice to vruntime
     [Figure: frequency over time within a victim task's time slice; f0: non-AVX (t0), f1: AVX2 (t1), f2: AVX-512 (t2); measured average CPU frequency vs. ideal frequency]
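
The scaling rule on this slide can be illustrated with a short sketch. The function name `charged_runtime`, its parameters, and the clamp to at most the full slice are assumptions for illustration; the slides only state that the victim's CPU time is scaled by the ratio between actual and ideal frequency.

```python
def charged_runtime(slice_ns, f_measured_ghz, f_ideal_ghz):
    """CPU time (ns) to add to a victim task's vruntime for one time slice.

    The slice is scaled by f_measured / f_ideal, so a task slowed down by
    someone else's AVX code is charged less than the wall-clock slice.
    """
    if f_ideal_ghz <= 0:
        raise ValueError("ideal frequency must be positive")
    # Assumption: never charge more than the real slice length.
    scale = min(1.0, f_measured_ghz / f_ideal_ghz)
    return slice_ns * scale
```

For example, a 4 ms slice that ran at an average of 2.0 GHz when 2.5 GHz would have been possible is charged as roughly 3.2 ms, so the victim is scheduled again sooner than a task that ran at full speed.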
  8. Estimating Performance Impact
     ▪ Estimating the non-AVX and AVX frequencies is difficult
       ▪ f0, f1, f2 depend on the turbo level (number of active cores)
       ▪ The turbo level can change at any point during the time slice
     [Figure: frequency over time within a victim task's time slice; f0: non-AVX (t0), f1: AVX2 (t1), f2: AVX-512 (t2); measured average CPU frequency vs. ideal frequency]
  9. Estimating Performance Impact
     ▪ Estimate the average turbo level
     ▪ Compare the expected value of f_measured with the actual value
     ▪ c_i = t_i * f_i (c: cycles)
     ▪ r_i = c_i / c_total
     [Figure: expected f_measured vs. share of AVX-512 frequency cycles r_2 (example: r_1 = 0), one curve per number of active cores, under the approximation that the turbo level was constant; annotations: "measurable", "determined by turbo level"]
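
The formula for the expected value of f_measured did not survive in the transcript. One reading that is consistent with the two definitions on this slide, assuming the turbo level stays constant during the time slice, is a cycle-weighted harmonic mean of the three frequencies. The name f_expected is introduced here for illustration only.

```latex
\begin{align*}
c_i &= t_i f_i, \qquad r_i = \frac{c_i}{c_{\mathrm{total}}}
  \quad\Longrightarrow\quad t_i = \frac{r_i \, c_{\mathrm{total}}}{f_i} \\[4pt]
f_{\mathrm{expected}}
  &= \frac{c_{\mathrm{total}}}{t_0 + t_1 + t_2}
   = \frac{c_{\mathrm{total}}}{\sum_{i=0}^{2} \frac{r_i \, c_{\mathrm{total}}}{f_i}}
   = \frac{1}{\sum_{i=0}^{2} \frac{r_i}{f_i}}
\end{align*}
```

Under this reading, the shares r_i come from measurement and the frequencies f_i follow from the assumed turbo level, so each candidate turbo level yields one expected frequency that can be compared with the measured one.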
  10. Estimating Performance Impact
      ▪ The average turbo level is calculated by linear interpolation
      ▪ If the actual value of f_measured is 2.5 GHz and r_2 is 0.3, the turbo level is estimated as follows
        ◆ 90% of the time: 13-16 cores
        ◆ 10% of the time: 9-12 cores
      ▪ Finally, f_ideal and p are calculated
        ◆ f_ideal = 0.9 * f_non-AVX, 13-16 cores + 0.1 * f_non-AVX, 9-12 cores
      [Figure: expected f_measured vs. share of AVX-512 frequency cycles r_2 (example: r_1 = 0); the measured point divides the gap between the two curves 9:1]
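
The interpolation step can be sketched as follows. The frequency table below is hypothetical and not taken from any datasheet, so the sketch does not reproduce the 90%/10% split of the slide's example. The function names, and the reading of the scaling factor as f_measured / f_ideal, are likewise assumptions, since the slides mention p without defining it.

```python
# Hypothetical frequencies in GHz (NOT datasheet values):
# turbo level -> (f0 non-AVX, f1 AVX2, f2 AVX-512)
TABLE = {
    "9-12 cores": (3.0, 2.6, 2.2),
    "13-16 cores": (2.8, 2.4, 2.0),
}


def expected_frequency(shares, freqs):
    """Average frequency for a constant turbo level: 1 / sum(r_i / f_i)."""
    return 1.0 / sum(r / f for r, f in zip(shares, freqs))


def estimate_ideal(f_measured, shares, slow_level, fast_level, table=TABLE):
    """Linearly interpolate between two adjacent turbo levels.

    shares is (r0, r1, r2), the measured cycle shares per frequency class.
    Returns (time share of slow_level, f_ideal, scale factor).
    """
    f_slow = expected_frequency(shares, table[slow_level])
    f_fast = expected_frequency(shares, table[fast_level])
    # Position of the measurement between the two expected values.
    w_slow = (f_fast - f_measured) / (f_fast - f_slow)
    w_slow = min(1.0, max(0.0, w_slow))  # clamp outside the bracket
    # Ideal frequency: non-AVX frequency at the interpolated turbo level.
    f_ideal = w_slow * table[slow_level][0] + (1.0 - w_slow) * table[fast_level][0]
    return w_slow, f_ideal, f_measured / f_ideal
```

Calling `estimate_ideal(2.6, (0.7, 0.0, 0.3), "13-16 cores", "9-12 cores")` brackets the measurement between the expected frequencies of the two levels and returns the estimated share of time spent at the slower level together with the resulting ideal frequency and scale factor.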
  11. Implementation
      ▪ The design of the proposed system is incompatible with CFS
        ▪ Different cores can have different virtual runtime ranges
          ◆ CFS has one runqueue per logical CPU
          ◆ The advantage gained by a victim task is lost because the vruntime of Task 4 is normalized to 100
        ▪ Infrequent load balancing
      ▪ MuQSS scheduler with the CFS algorithm
        ▪ Runqueues are shared among logical CPUs
        ▪ Frequent load balancing
      [Figure: CPU1 and CPU2 on the same physical core running Tasks 1-4 (low-power and high-power tasks); vruntime: 100 vs. vruntime: 50 (overhead: 50)]
  12. Evaluation
      ▪ Experimental setup
        ▪ Intel Xeon Gold 6130 CPU
          ◆ 16 physical cores, 32 logical CPUs
        ▪ 24 GiB of 2666 MHz DDR4 RAM
        ▪ Fedora 31 operating system
        ▪ Linux 5.9 kernel
          ◆ 1. with the CFS scheduler
          ◆ 2. with the modified scheduler based on MuQSS
      ▪ Benchmarks (victim tasks)
        ▪ Parsec 3.0
        ▪ nginx
        ▪ Linux kernel build benchmark from PTS 9.0.1
  13. Evaluation
      ▪ Research questions
        ▪ Is this a sensible approach to improve fairness?
        ▪ How much additional overhead is introduced?
      ▪ Execute benchmarks alongside x265, which uses AVX2/AVX-512
        ▪ 4 instances of the x265 video encoder with 8 threads
          ◆ configured to use either AVX2 or AVX-512
  14. Evaluation
      ▪ Unfairness definition
      [Figure: scenarios ① and ②; equation residue: "= fg① / fg②"; fg is calculated from the execution time of the foreground application]
  15. Evaluation
      ▪ Unfairness definition
      [Figure: scenarios ①, ②, and ⓷; equation residue: "= fg① / fg② = 0.5 / fg②" and "= fg① / fg⓷ = 0.5 / tp_fg"; tp_fg is calculated from the execution time of the foreground application]
  16. Evaluation
      ▪ The proposed system reduces the average unfairness
        ▪ AVX2: 7.9% → 2.5%
        ▪ AVX-512: 24.9% → 5.4%
      [Figure: Unfairness for Parsec 3.0 benchmarks executed alongside x265; annotation: not able to scale to all logical CPUs]
  17. Evaluation
      ▪ For most benchmarks, the overhead is statistically insignificant
        ▪ Additional time spent in the scheduler
        ▪ Additional exception handling of the register accesses
      ▪ Baseline: frequency reduction compensation code removed with #ifdef
      [Figure: Completion time difference of benchmarks]
  18. Evaluation
      ▪ For most benchmarks, the prototype scheduler causes at most 17% overhead compared with CFS
      ▪ This range matches the performance reported for unmodified MuQSS
      [Figure: Completion time: MuQSS-based prototype vs CFS]
  19. Evaluation
      ▪ Sharing runqueues beyond sibling hyper-threads does not have any beneficial impact on the prototype
      ▪ The load balancing mechanism provides all CPUs with enough choices
      [Figure: Unfairness for Parsec 3.0 benchmarks executed alongside x265]
      [Figure: Overhead for different runqueue sharing options (without frequency reduction compensation); annotations: among two logical CPUs, improved cache efficiency]
  20. Related Works
      ▪ ECOSystem [Heng+, SIGOPS '02]
        ▪ Uses consumed energy as the basis for scheduling
        ▪ Not viable on current hardware, as the CPUs lack the interfaces required for sufficiently accurate energy models
      ▪ Core scheduling [Aubrey+, LPC '19]
        ▪ Limits co-scheduling of AVX-512 and non-AVX tasks
        ▪ May leave hyper-threads idle, which reduces CPU utilization
  21. Conclusion
      ▪ Power-intensive instructions such as AVX2 and AVX-512 reduce CPU frequency
        ▪ The frequency reduction may affect tasks that do not execute such power-intensive instructions
      ▪ This paper proposes a system to achieve fair scheduling
        ▪ Frequency reduction compensation
        ▪ Trap-based detection of affected tasks
      ▪ The prototype reduced unfairness for AVX-512 workloads by 4x