Reading: Fair Scheduling for AVX2 and AVX-512 Workloads

Fair Scheduling for AVX2 and AVX-512 Workloads
Mathias Gottschlag, Philipp Machauer, Yussuf Khalil, Frank Bellosa (Karlsruhe Institute of Technology), USENIX ATC ’21
Keio University Kono Laboratory, Daiki Wakabayashi

Power-Limited Computing (slide 2)
▪ CPU performance is commonly limited by power consumption
▪ Example: Intel Turbo Boost
  ◆ Turbo level is determined by the number of active cores
▪ Power consumption depends on the instructions executed
▪ Simple instructions allow higher frequencies

AVX2, AVX-512 (slide 3)
▪ SIMD instructions by Intel for 256-bit (AVX2) and 512-bit (AVX-512) vectors
▪ Energy consumption ∝ vector size
▪ Three frequency levels

Unfairness in Existing Schedulers (slide 4)
▪ Frequency reduction also affects other, less power-intensive tasks
  ▪ 1. After a context switch
  ▪ 2. Due to hyper-threading
▪ Equal CPU time does not ensure equal relative performance

Proposal: Frequency Reduction Compensation (slide 5)
▪ Subtract the overhead to scale the CPU time accounting of the victim task
▪ Ensure equal relative performance instead of equal CPU time
▪ Victim task: executes non-AVX instructions but suffers from the frequency reduction
▪ Two challenges
  ▪ Detect victim tasks
  ▪ Estimate the performance impact
[Figure labels: "Equal CPU Time" (overhead not accounted by scheduler) vs. "Proposal: Relative CPU Time"]

Detecting Victim Tasks (slide 6)
▪ Exploit the trap mechanism to detect AVX2/AVX-512 instructions
▪ Disable AVX instructions (clear the controlling bits) at each context switch
▪ The exception handler is called when an AVX instruction is executed
  ◆ Re-enable AVX instructions and mark the task as an AVX2/AVX-512 task
[Figure: XCR0 register with the AVX, ZMM_Hi256, and Hi16_ZMM bits]
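The trap-based detection above can be illustrated with a small user-space simulation. This is only a sketch, not kernel code: `Task`, `SimulatedCpu`, and the instruction classes `"scalar"`, `"avx2"`, and `"avx512"` are hypothetical names introduced here, and the model assumes that an instruction traps whenever one of the XCR0 bits it needs is cleared.

```python
# XCR0 state-component bits as documented by Intel:
# AVX = bit 2, ZMM_Hi256 = bit 6, Hi16_ZMM = bit 7.
XCR0_AVX = 1 << 2
XCR0_ZMM_HI256 = 1 << 6
XCR0_HI16_ZMM = 1 << 7
AVX512_BITS = XCR0_ZMM_HI256 | XCR0_HI16_ZMM
VECTOR_BITS = XCR0_AVX | AVX512_BITS

# Bits each (hypothetical) instruction class needs enabled in XCR0.
REQUIRED = {"scalar": 0, "avx2": XCR0_AVX, "avx512": VECTOR_BITS}


class Task:
    def __init__(self, name):
        self.name = name
        self.used_avx2 = False
        self.used_avx512 = False


class SimulatedCpu:
    """Toy model of one logical CPU; an assumption-laden sketch."""

    def __init__(self):
        self.xcr0 = VECTOR_BITS
        self.current = None

    def context_switch(self, task):
        # Clear the vector bits so the first AVX instruction traps.
        self.current = task
        self.xcr0 &= ~VECTOR_BITS

    def execute(self, kind):
        missing = REQUIRED[kind] & ~self.xcr0
        if missing:
            self._trap(missing)

    def _trap(self, missing):
        # Exception handler: re-enable the needed bits and mark the task.
        self.xcr0 |= missing
        if missing & AVX512_BITS:
            self.current.used_avx512 = True
        else:
            self.current.used_avx2 = True


cpu = SimulatedCpu()
encoder = Task("x265")
cpu.context_switch(encoder)
cpu.execute("scalar")   # no trap, task stays unmarked
cpu.execute("avx512")   # traps once, task is marked as AVX-512
print(encoder.used_avx2, encoder.used_avx512)
```

A task that finishes its time slice with neither flag set is a candidate victim in this model; only the first AVX instruction after a context switch pays the cost of the trap.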

Estimating Performance Impact (slide 7)
▪ Scale the victim’s CPU time by the ratio between ideal and actual frequency
▪ Do not add the entire time slice to vruntime
[Figure: frequency over time within a victim task's time slice, with levels f_0 (non-AVX), f_1 (AVX2), f_2 (AVX-512) held for t_0, t_1, t_2; measured average CPU frequency vs. ideal frequency]
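As a minimal sketch of this scaling, the function below charges a victim only the fraction of its time slice that corresponds to the work it could do at the measured frequency. The name `charged_runtime_ns` and the cap at the full slice are assumptions made for illustration; task weights, which a real scheduler also applies to vruntime, are ignored.

```python
def charged_runtime_ns(slice_ns, f_measured_hz, f_ideal_hz):
    """Return the part of a time slice to add to a victim's vruntime."""
    if slice_ns < 0 or f_measured_hz <= 0 or f_ideal_hz <= 0:
        raise ValueError("slice must be >= 0 and frequencies must be > 0")
    # Assumption: never charge more than the wall-clock slice.
    ratio = min(f_measured_hz / f_ideal_hz, 1.0)
    return slice_ns * ratio


# A 4 ms slice that ran at 2.1 GHz instead of an ideal 2.8 GHz
# is charged roughly three quarters of the slice.
print(charged_runtime_ns(4_000_000, 2.1e9, 2.8e9))
```

Because the victim is charged less virtual runtime, it is selected again sooner and receives extra CPU time to make up for the slowdown.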

Estimating Performance Impact (slide 8)
▪ Estimating the non-AVX and AVX frequencies is difficult
  ▪ f_0, f_1, f_2 depend on the turbo level (number of active cores)
  ▪ The turbo level can change at any point during the time slice
[Figure: same frequency-over-time diagram as on the previous slide]

Estimating Performance Impact (slide 9)
▪ Estimate the average turbo level
▪ Compare the expected value of f_measured with the actual value
  ▪ c_i = t_i * f_i (c: cycles)
  ▪ r_i = c_i / c_total
  ▪ Annotations on the formulas: "measurable", "determined by turbo level"
▪ Approximation: the turbo level was constant
[Figure: plot over r_2, the share of AVX-512 frequency cycles (example: r_1 = 0), with one curve per number of active cores]
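The relations c_i = t_i * f_i and r_i = c_i / c_total can be turned into an expected average frequency for one fixed turbo level. The sketch below assumes the per-level cycle counts c_i are what is measured and that the frequencies f_i come from a table for the given turbo level; both function names are hypothetical. Since t_i = c_i / f_i, the average frequency c_total / (t_0 + t_1 + t_2) equals 1 / sum(r_i / f_i).

```python
def cycle_shares(cycles):
    """r_i = c_i / c_total for the cycle counts of each frequency level."""
    total = sum(cycles)
    if total <= 0:
        raise ValueError("at least one cycle must have been counted")
    return [c / total for c in cycles]


def expected_average_frequency(shares, freqs):
    """Expected f_measured if the turbo level stayed constant.

    Time spent per cycle at level i is r_i / f_i, so the average
    frequency is the cycle-weighted harmonic mean of the f_i.
    """
    time_per_cycle = sum(r / f for r, f in zip(shares, freqs) if r > 0)
    return 1.0 / time_per_cycle


# Example with r_1 = 0, as on the slide; frequencies in GHz are made up.
shares = cycle_shares([70, 0, 30])
print(shares, expected_average_frequency(shares, [2.8, 2.4, 1.9]))
```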

Estimating Performance Impact (slide 10)
▪ The average turbo level is calculated by linear interpolation
▪ If the actual value of f_measured is 2.5 GHz and r_2 is 0.3, the turbo level is estimated as follows
  ◆ 90% of the time: 13-16 cores
  ◆ 10% of the time: 9-12 cores
▪ Finally, f_ideal and p are calculated
  ◆ f_ideal = 0.9 * f_non-AVX,13-16 cores + 0.1 * f_non-AVX,9-12 cores
[Figure: plot over r_2, the share of AVX-512 frequency cycles (example: r_1 = 0), with the measured point split 9:1 between two turbo-level curves]
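The interpolation step might look like the following sketch. The frequency table `LEVELS` holds made-up illustrative values, not numbers from the slides or a datasheet, and `estimate_f_ideal` is a hypothetical name. For each turbo level the code computes the f_measured that would be expected for the observed cycle shares, finds the two adjacent levels that bracket the actual measurement, and mixes their non-AVX frequencies with the interpolation weights.

```python
# Hypothetical per-level frequencies in GHz: (f0 non-AVX, f1 AVX2, f2 AVX-512).
# Illustrative values only.
LEVELS = [
    ("13-16 cores", (2.8, 2.4, 1.9)),
    ("9-12 cores", (3.1, 2.6, 2.2)),
]


def expected_f_measured(shares, freqs):
    """Average frequency expected for cycle shares r_i at a fixed turbo level."""
    return 1.0 / sum(r / f for r, f in zip(shares, freqs) if r > 0)


def estimate_f_ideal(f_measured, shares, levels=LEVELS):
    """Return (f_ideal, weights per turbo level) by linear interpolation."""
    expected = [expected_f_measured(shares, f) for _, f in levels]
    for i in range(len(levels) - 1):
        a, b = expected[i], expected[i + 1]
        if min(a, b) <= f_measured <= max(a, b):
            # Weight w of level i such that f_measured = w*a + (1-w)*b.
            w = 1.0 if a == b else (b - f_measured) / (b - a)
            f_ideal = w * levels[i][1][0] + (1.0 - w) * levels[i + 1][1][0]
            return f_ideal, {levels[i][0]: w, levels[i + 1][0]: 1.0 - w}
    # Outside the table: fall back to the level with the closest expectation.
    k = min(range(len(levels)), key=lambda j: abs(expected[j] - f_measured))
    return levels[k][1][0], {levels[k][0]: 1.0}


f_ideal, weights = estimate_f_ideal(2.5, (0.7, 0.0, 0.3))
print(f_ideal, weights, 2.5 / f_ideal)
```

The last printed value is the ratio f_measured / f_ideal. Treating that ratio as the slide's p is an assumption here, since the transcript does not define p.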

Implementation (slide 11)
▪ The design of the proposed system is incompatible with CFS
  ▪ Different cores can have different virtual runtime ranges
    ◆ CFS has one runqueue per logical CPU
    ◆ The advantage gained by a victim task is lost because the vruntime of Task 4 is normalized to 100
  ▪ Infrequent load balancing
▪ MuQSS scheduler with the CFS algorithm
  ▪ Runqueues are shared among logical CPUs
  ▪ Frequent load balancing
[Figure: Tasks 1 to 4 on CPU1 and CPU2 (same physical core), low-power and high-power tasks, labels "vruntime: 100" and "vruntime: 50 (overhead: 50)"]
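One reading of the Task 4 example is sketched below as a toy model. It assumes that per-CPU runqueues re-base a migrating task's vruntime relative to each queue's minimum vruntime, which is a simplification of what CFS does, and that a shared runqueue simply runs the task with the smallest vruntime. The function names and the exact scenario are assumptions, since the figure itself did not survive extraction.

```python
def migrate_per_cpu(vruntime, src_min_vruntime, dst_min_vruntime):
    """Re-base a task's vruntime when it moves between per-CPU runqueues."""
    return vruntime - src_min_vruntime + dst_min_vruntime


def pick_next_shared(runqueue):
    """Pick the task with the smallest vruntime from one shared runqueue."""
    return min(runqueue, key=runqueue.get)


# Per-CPU queues: the victim holds vruntime 50 on a queue whose minimum is 50.
# Moving it to a queue whose minimum is 100 re-bases it to 100,
# so its 50 units of compensation no longer give it any priority.
print(migrate_per_cpu(50, src_min_vruntime=50, dst_min_vruntime=100))

# Shared queue: vruntimes stay directly comparable, so the victim runs first.
print(pick_next_shared({"Task 1": 100, "Task 4": 50}))
```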

Evaluation (slide 12)
▪ Experimental setup
  ▪ Intel Xeon Gold 6130 CPU
    ◆ 16 physical cores, 32 logical CPUs
  ▪ 24 GiB of 2666 MHz DDR4 RAM
  ▪ Fedora 31 operating system
  ▪ Linux 5.9 kernel
    ◆ 1. with the CFS scheduler
    ◆ 2. with the modified scheduler based on MuQSS
▪ Benchmarks (victim tasks)
  ▪ Parsec 3.0
  ▪ nginx
  ▪ Linux kernel build benchmark from PTS 9.0.1

Evaluation (slide 13)
▪ Research questions
  ▪ Is this a sensible approach to improve fairness?
  ▪ How much additional overhead is introduced?
▪ Execute the benchmarks alongside x265, which uses AVX2/AVX-512
  ▪ 4 instances of the x265 video encoder with 8 threads
    ◆ Configured to use either AVX2 or AVX-512

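Evaluation (slide 14)
▪ Unfairness definition
  ▪ [left-hand side lost in extraction] = fg① / fg②, where ① and ② label scenarios in the slide's figure
  ▪ Calculated with the execution time of the foreground application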

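Evaluation (slide 15)
▪ Unfairness definition
  ▪ [left-hand side lost in extraction] = fg① / fg② = 0.5 / fg②
  ▪ [left-hand side lost in extraction] = fg① / fg③ = 0.5 / tp_fg
  ▪ ①, ②, ③ label scenarios in the slide's figure; tp_fg is a label from the figure
  ▪ Calculated with the execution time of the foreground application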

Evaluation (slide 16)
▪ The proposed system reduces the average unfairness
  ▪ AVX2: 7.9% → 2.5%
  ▪ AVX-512: 24.9% → 5.4%
[Figure: Unfairness for Parsec 3.0 benchmarks executed alongside x265; annotation: "not able to scale to all logical CPUs"]
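As a quick arithmetic check, the reported averages can be expressed as reduction factors. The helper name `reduction_factor` is introduced here only for illustration; the percentages are the ones reported on this slide.

```python
def reduction_factor(before_pct, after_pct):
    """How many times smaller the unfairness became."""
    return before_pct / after_pct


avx2 = reduction_factor(7.9, 2.5)
avx512 = reduction_factor(24.9, 5.4)
print(f"AVX2: {avx2:.1f}x, AVX-512: {avx512:.1f}x")
# prints "AVX2: 3.2x, AVX-512: 4.6x"
```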

Evaluation (slide 17)
▪ For most benchmarks, the overhead is statistically insignificant
  ▪ Additional time spent in the scheduler
  ▪ Additional exception handling of the register accesses
▪ Baseline: frequency reduction compensation code removed with #ifdef
[Figure: Completion time difference of benchmarks]

Evaluation (slide 18)
▪ For most benchmarks, the prototype scheduler causes at most 17% overhead compared with CFS
▪ This range matches the performance reported for unmodified MuQSS
[Figure: Completion time: MuQSS-based prototype vs. CFS]

Evaluation (slide 19)
▪ Sharing runqueues beyond sibling hyper-threads has no beneficial impact on the prototype
  ▪ The load balancing mechanism provides all CPUs with enough choices
[Figure: Unfairness for Parsec 3.0 benchmarks executed alongside x265]
[Figure: Overhead for different runqueue sharing options (without frequency reduction compensation); annotation on sharing among two logical CPUs: "improved cache efficiency"]

Related Works (slide 20)
▪ ECOSystem [Heng+, SIGOPS ’02]
  ▪ Uses consumed energy as the basis for scheduling
  ▪ Not viable on current hardware, as CPUs lack the interfaces required for sufficiently accurate energy models
▪ Core scheduling [Aubrey+, LPC ’19]
  ▪ Limits co-scheduling of AVX-512 and non-AVX tasks
  ▪ May leave hyper-threads idle, which reduces CPU utilization

Conclusion (slide 21)
▪ Power-intensive instructions such as AVX2 and AVX-512 reduce the CPU frequency
▪ The frequency reduction may affect tasks that do not execute such power-intensive instructions
▪ This paper proposes a system to achieve fair scheduling
  ▪ Frequency reduction compensation
  ▪ Trap-based detection of affected tasks
▪ The prototype reduced unfairness for AVX-512 workloads by 4x
