Reading: PALM: Progress- and Locality-Aware Adaptive Task Migration for Efficient Thread Packing

wkb8s

April 20, 2025

Transcript

  1. PALM: Progress- and Locality-Aware Adaptive Task Migration for Efficient Thread Packing
     Jinsu Park¹, Seongbeom Park¹, Myeonggyun Han¹, Woongki Baek² (¹Department of CSE, UNIST, ²Department of CSE and Graduate School of AI, UNIST)
     IPDPS '21
     Presented by Daiki Wakabayashi, Kono Laboratory, Keio University
  2. Thread Reduction (TR)
     ▪ One of the static techniques for multithreaded applications
     ▪ Launch the app with fewer threads than the available cores
     ▪ ✔ Easy to ensure fairness between threads
     ▪ ✖ System performance must be considered by the app itself
     ▪ ✖ Vulnerable to changes such as core counts and energy budgets
     ▪ ✖ Vulnerable to colocated applications
     [Figure: each app thread runs on its own core (thread0 → Core0, thread1 → Core1, thread2 → Core2)]
  3. Thread Packing (TP)
     ▪ One of the practical techniques for multithreaded applications
     ▪ Dynamically packs the threads of the target app onto fewer cores (a minimal sketch follows below)
     ▪ ✔ The app does not need to consider the rest of the system
     ▪ ✔ The concurrency level can be adjusted dynamically
     ▪ ✖ Imbalance when the core count is not a divisor of the thread count
     ▪ ✖ May cause thrashing in the private caches
     [Figure: 5 app threads packed onto 3 cores (thread0/thread3 → Core0, thread1/thread4 → Core1, thread2 → Core2)]
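Note (not in the original slide): a minimal sketch of what thread packing means in practice, assuming Linux and the GNU extension pthread_setaffinity_np. The 5-thread / 3-core setup mirrors the figure, and the round-robin placement reproduces the imbalance noted above (Core0 and Core1 each end up with 2 threads).

```c
/* Minimal sketch of thread packing (not the actual PALM/TP code):
 * pack N_T application threads onto N_C cores with round-robin CPU
 * affinity (thread i -> core i % N_C).  With 5 threads on 3 cores,
 * Core0 and Core1 each end up with 2 threads, as in the figure. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define N_T 5   /* application threads          */
#define N_C 3   /* cores the app is packed onto */

static void *worker(void *arg) {
    long id = (long)arg;
    /* ... application work would go here ... */
    printf("thread %ld is running on core %d\n", id, sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t tid[N_T];
    for (long i = 0; i < N_T; i++) {
        pthread_create(&tid[i], NULL, worker, (void *)i);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)(i % N_C), &set);   /* pin thread i to core i % N_C */
        pthread_setaffinity_np(tid[i], sizeof(set), &set);
    }
    for (int i = 0; i < N_T; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

Compile with gcc -pthread; setting the affinity right after creation is good enough for an illustration, even though the thread may briefly run unpinned.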
  4. TP Issue with the Linux Kernel
     ▪ The Linux load balancer works at coarse granularity
     ▪ ✖ Often suppresses task migration across CPU sockets
     ▪ ✖ Lacks consideration of the progress of each task
     ▪ These issues remain uninvestigated comprehensively with various benchmarks
     [Figure: threads 0-1 on Socket0 (Physical Cores 0-1), thread2 on Socket1 (Physical Cores 2-3)]
  5. Preliminary Experiment
     ▪ Purpose
       ▪ Investigate the performance inefficiencies of TP
     ▪ Setup
       ▪ 16-core NUMA system
         ◆ Two 8-core Intel E-2640 CPUs
         ◆ Supports per-core DVFS (governor: performance)
         ◆ Frequency
           – Symmetric multiprocessing (SMP): 2.6 GHz
           – Heterogeneous multiprocessing (HMP): 1.2 - 2.6 GHz
       ▪ Linux kernel 4.11.0
         ◆ CFS scheduler
  6. Benchmarks
     ▪ Employ the PARSEC, SPLASH, and NPB benchmark suites
     ▪ Synchronization-intensive benchmarks
       ◆ [Table: synchronization statistics, collected by executing each benchmark with 16 cores and 16 threads]
     ▪ Non-intensive benchmarks
       ◆ Multi-Grid (MG), blackscholes (BL), raytrace (RT), swaptions (SW), water-nsquared (WA)
     ▪ The thread count is set to the allocated core count with TR
  7. TP Issue with Synchronization-intensive Benchmarks
     ▪ TP achieves lower performance when N_T % N_C != 0 (N_C: core count, N_T: thread count)
     ▪ Some cores are packed with more threads than others
     ▪ The load balancing of Linux is too slow to handle the imbalance
     [Figure: performance comparison of thread reduction (TR) and thread packing (TP) over the core count (= thread count of TR), 16 threads for TP; TP is worse for the synchronization-intensive benchmarks]
  8. TP Issue with Larger Core Counts
     ▪ TP incurs more performance degradation with larger core counts
     ▪ The difference in per-thread computation resources becomes larger
       ◆ 15 cores: 14 cores with 1 thread, 1 core with 2 threads
       ◆ 3 cores: 2 cores with 5 threads, 1 core with 6 threads
     [Figure: performance comparison of thread reduction (TR) and thread packing (TP) (16 threads for TP); TP is worse]
  9. Detailed Analysis of TP Drawbacks
     ▪ Idle time accounts for a larger portion with TP than with TR
       ▪ Because of the imbalance in per-thread computation resources
     ▪ User time also increases because of busy-waiting synchronization
     [Figure: execution time breakdowns of thread reduction (TR) and thread packing (TP) (16 threads for TP, 15 cores); lower is better]
  10. Proposal: PALM
     ▪ Progress- and locality-aware task migration
       ▪ Handles the TP issue when N_T % N_C != 0
       ▪ Handles the TP issue of cache thrashing
     ▪ Dynamic adjustment of the scheduling period
       ▪ Starts with the shortest period (i.e., 0.125 ms)
       ▪ Handles the TP issue with synchronization-intensive benchmarks
     ▪ Support for heterogeneous multiprocessing (HMP) systems
       ▪ Packs threads considering core capacity
  11. Design
     ▪ Performance monitor
       ▪ Notifies the progress of the application to the PALM runtime system (see the sketch below)
         ◆ Modifies the multithreading library (Pthreads)
         ◆ Uses the Application Heartbeats framework [Hoffman+, ICAC '10]
     ▪ Runtime system
       ▪ Allocates threads based on the progress given by the PALM monitor, once per scheduling period (sleep(sched_period))
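Note (not in the original slide): a hedged sketch of the progress-reporting idea, assuming a shared per-thread heartbeat counter that the instrumented Pthreads primitives bump and the runtime samples each scheduling period. The names palm_heartbeat / palm_progress are illustrative, not the actual PALM or Application Heartbeats API.

```c
/* Hedged sketch of the progress monitor (names are illustrative, not
 * the actual PALM or Application Heartbeats API): each thread bumps a
 * per-thread heartbeat counter at synchronization points inside the
 * instrumented Pthreads primitives, and the runtime samples the
 * counters once per scheduling period to estimate per-thread progress. */
#include <stdatomic.h>

#define MAX_THREADS 64

static _Atomic unsigned long heartbeat[MAX_THREADS];

/* called from the instrumented Pthreads primitives (e.g., a wrapped
 * pthread_barrier_wait) on every synchronization event */
static inline void palm_heartbeat(int tid) {
    atomic_fetch_add_explicit(&heartbeat[tid], 1, memory_order_relaxed);
}

/* called by the runtime system once per scheduling period */
static inline unsigned long palm_progress(int tid) {
    return atomic_load_explicit(&heartbeat[tid], memory_order_relaxed);
}
```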
  12. Step 1: Creating thread groups
     ▪ Creates N_C thread groups (N_C: core count)
     ▪ Each thread group is assigned
       ▪ a core type
         ◆ Different computation capacities in an HMP environment
       ▪ at least 1 thread
     [Figure: Thread Group0 (capacity 360, thread_count 1, capacity_per_thread 360), Thread Group1 (capacity 220, thread_count 1, capacity_per_thread 220), Thread Group2 (capacity 100, thread_count 1, capacity_per_thread 100)]
  13. Step 1: Creating thread groups
     ▪ In this example, the following thread groups are created
       ▪ capacity 360 x 2, capacity 220 x 4, capacity 100 x 2
     ▪ Logical cores in the same physical core share the same capacity
     [Figure: Physical Core0 = Logical Cores 0-1 (capacity 220 each), Physical Core1 = Logical Cores 2-3 (capacity 360 each), Physical Core2 = Logical Cores 4-5 (capacity 100 each), Physical Core3 = Logical Cores 6-7 (capacity 220 each)]
  14. Step 2: Determining thread counts
     ▪ Allocate the capacity as uniformly as possible
     ▪ Each thread selects the thread group with the largest per-thread capacity
     ▪ When the application is executed with 6 threads, the remaining 3 threads still need to be allocated
     [Figure: Thread Group0 (capacity 360, thread_count 1, capacity_per_thread 360), Thread Group1 (capacity 220, thread_count 1, capacity_per_thread 220), Thread Group2 (capacity 100, thread_count 1, capacity_per_thread 100)]
  15. Step 2: Determining thread counts
     ▪ Allocate the capacity as uniformly as possible
     ▪ Each thread selects the thread group with the largest per-thread capacity
     ▪ When the application is executed with 6 threads, the remaining 2 threads still need to be allocated
     [Figure: Thread Group0 (capacity 360, thread_count 2, capacity_per_thread 180), Thread Group1 (capacity 220, thread_count 1, capacity_per_thread 220), Thread Group2 (capacity 100, thread_count 1, capacity_per_thread 100)]
  16. Step 2: Determining thread counts
     ▪ Allocate the capacity as uniformly as possible
     ▪ Each thread selects the thread group with the largest per-thread capacity
     ▪ When the application is executed with 6 threads, the remaining 1 thread still needs to be allocated
     [Figure: Thread Group0 (capacity 360, thread_count 2, capacity_per_thread 180), Thread Group1 (capacity 220, thread_count 2, capacity_per_thread 110), Thread Group2 (capacity 100, thread_count 1, capacity_per_thread 100)]
  17. Step 2: Determining thread counts
     ▪ Allocate the capacity as uniformly as possible
     ▪ Each thread selects the thread group with the largest per-thread capacity
     ▪ All threads have now been allocated (a sketch of this greedy rule follows below)
     ・Allocating threads one by one seems inefficient …
     ・A similar operation is implemented in Linux 6.8.0
     [Figure: Thread Group0 (capacity 360, thread_count 3, capacity_per_thread 120), Thread Group1 (capacity 220, thread_count 2, capacity_per_thread 110), Thread Group2 (capacity 100, thread_count 1, capacity_per_thread 100)]
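Note (not in the original slides): a minimal sketch of the greedy rule walked through on slides 14-17. The capacities {360, 220, 100} and the 6-thread total come from the example; the struct and function names are illustrative.

```c
/* Sketch of Step 2 (illustrative only): every group starts with one
 * thread, and each remaining thread joins the group whose current
 * per-thread capacity is the largest.  The capacities {360, 220, 100}
 * and the 6-thread total are taken from the slide example. */
#include <stdio.h>

struct group { int capacity; int threads; };

static void determine_thread_counts(struct group *g, int ngroups, int nthreads) {
    for (int i = 0; i < ngroups; i++)
        g[i].threads = 1;                        /* at least 1 thread per group */
    for (int t = ngroups; t < nthreads; t++) {   /* place the remaining threads */
        int best = 0;
        for (int i = 1; i < ngroups; i++)
            if (g[i].capacity / g[i].threads > g[best].capacity / g[best].threads)
                best = i;                        /* largest per-thread capacity */
        g[best].threads++;
    }
}

int main(void) {
    struct group g[3] = { {360, 0}, {220, 0}, {100, 0} };
    determine_thread_counts(g, 3, 6);
    for (int i = 0; i < 3; i++)
        printf("Thread Group%d: %d threads, capacity_per_thread %d\n",
               i, g[i].threads, g[i].capacity / g[i].threads);
    return 0;
}
```

Running this reproduces the final state on slide 17: thread counts {3, 2, 1} and per-thread capacities {120, 110, 100}.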
  18. Step 3: Assigning threads
     ▪ Assigns high-priority threads to the thread groups with larger capacity (a sketch follows below)
       ▪ Priority is determined by the amount of resources a thread has received: threads that have received fewer resources are assigned to thread groups with larger per-thread capacity
     ▪ Assigns consecutive threads to the same group to utilize the cache, if possible
     [Figure: threads sorted by progress (Thread3: 100, Thread4: 120, Thread5: 160, Thread1: 200, Thread2: 220, Thread0: 300) are assigned to Thread Groups 0-2 (capacity_per_thread 120, 110, 100)]
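Note (not in the original slide): a hedged sketch of the assignment rule, assuming threads are sorted by progress (ascending) and groups by per-thread capacity (descending) and then filled in order; the refinement that keeps consecutive thread IDs in the same group is not modeled. Types and names are illustrative.

```c
/* Sketch of Step 3 (illustrative only): threads sorted by progress
 * (ascending) are assigned to thread groups sorted by per-thread
 * capacity (descending), so the least-progressed threads land on the
 * largest per-thread capacity. */
#include <stdlib.h>

struct thread { int id; unsigned long progress; };
struct tgroup { int capacity; int threads; int member[64]; };

static int by_progress(const void *a, const void *b) {
    const struct thread *x = a, *y = b;
    return (x->progress > y->progress) - (x->progress < y->progress);
}

static int by_cap_per_thread_desc(const void *a, const void *b) {
    const struct tgroup *x = a, *y = b;
    return y->capacity / y->threads - x->capacity / x->threads;
}

/* assumes the group thread counts from Step 2 sum to nthreads */
static void assign_threads(struct thread *t, int nthreads,
                           struct tgroup *g, int ngroups) {
    qsort(t, nthreads, sizeof(*t), by_progress);
    qsort(g, ngroups, sizeof(*g), by_cap_per_thread_desc);
    int next = 0;
    for (int i = 0; i < ngroups; i++)            /* fill highest-capacity group first */
        for (int j = 0; j < g[i].threads && next < nthreads; j++)
            g[i].member[j] = t[next++].id;
}
```

On the example of this slide, the sketch yields Thread3-5 → Group0, Thread1-2 → Group1, Thread0 → Group2.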
  19. Step 4: Allocating physical cores
     ▪ Each thread group selects the core with the minimum migration cost, then sets the CPU affinity of its threads (a sketch follows below)
     ▪ Progress-conscious
       ◆ Selection starts with the thread group containing the highest-priority thread (high-priority threads were assigned to thread groups with lower thread counts)
     ▪ Cache locality-conscious
     [Figure: Thread Groups 3-6 (capacity 220, thr_count 2-3) select among Physical Core0 (Logical Cores 0-1) and Physical Core3 (Logical Cores 6-7) in priority order ①-④]
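Note (not in the original slide): once a thread group has picked a physical core, pinning its members comes down to a CPU-affinity call. A minimal sketch assuming Linux, 2 logical CPUs per physical core, and the numbering of slide 13 (Physical CoreN = Logical Cores 2N and 2N+1); nothing here is the actual PALM code.

```c
/* Sketch of Step 4's final action (illustrative only): once a thread
 * group has selected a physical core, pin every member thread to that
 * core's logical CPUs.  Real topologies should be read from sysfs
 * instead of assuming the 2N / 2N+1 numbering used here. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define SMT_WIDTH 2   /* logical CPUs per physical core (assumed) */

static void pin_group_to_core(pthread_t *members, int nmembers, int phys_core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int l = 0; l < SMT_WIDTH; l++)
        CPU_SET(phys_core * SMT_WIDTH + l, &set);   /* both logical siblings */
    for (int i = 0; i < nmembers; i++)
        pthread_setaffinity_np(members[i], sizeof(set), &set);
}
```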
  20. Scheduling Period Controller
     ▪ Starts by setting the scheduling period to the shortest one (i.e., 0.125 ms)
     ▪ At the end of each epoch,
       ▪ compares the performance between the current and previous epochs
       ▪ doubles (or halves) the scheduling period accordingly (a sketch follows below)
     [Figure: control flow — sleep(period); if currPerf > prevPerf then period *= 2 (up to maxPeriod), else period /= 2]
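Note (not in the original slide): a hedged sketch of the controller loop in the flowchart. The upper bound, the performance metric, and the helper functions measure_epoch_performance / run_palm_migration_steps are assumptions, not the actual PALM implementation.

```c
/* Hedged sketch of the scheduling-period controller: start at the
 * shortest period and, after every epoch, double the period if
 * performance improved and halve it otherwise, clamped to
 * [MIN_PERIOD_US, MAX_PERIOD_US]. */
#define _DEFAULT_SOURCE
#include <unistd.h>

#define MIN_PERIOD_US   125      /* 0.125 ms, the shortest period from the slide */
#define MAX_PERIOD_US 16000      /* upper bound (assumed value)                  */

extern double measure_epoch_performance(void);  /* e.g., heartbeats per second (hypothetical) */
extern void   run_palm_migration_steps(void);   /* Steps 1-4 from the previous slides (hypothetical) */

static void period_controller_loop(void) {
    long period = MIN_PERIOD_US;
    double prev_perf = 0.0;

    for (;;) {
        usleep(period);                          /* one epoch */
        double curr_perf = measure_epoch_performance();

        if (curr_perf > prev_perf) {
            period *= 2;                         /* improving: migrate less often */
            if (period > MAX_PERIOD_US)
                period = MAX_PERIOD_US;
        } else {
            period /= 2;                         /* regressed: react more quickly */
            if (period < MIN_PERIOD_US)
                period = MIN_PERIOD_US;
        }
        prev_perf = curr_perf;

        run_palm_migration_steps();
    }
}
```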
  21. Evaluation
     ▪ Concerns
       ▪ 1. Performance and energy consumption
       ▪ 2. Effectiveness for dynamic server consolidation and power capping
       ▪ 3. Performance impact of the PALM components
       ▪ 4. Performance sensitivity to the system parameters
     ▪ Setup
       ▪ Almost the same as the preliminary experiment
       ▪ Uses TP, TR, PALM, and SB (Static Best: PALM + the optimal scheduling period found by offline profiling)
       ▪ 16 threads for TP, PALM, and SB
  22. Performance and Energy Results
     ▪ PALM outperforms TP across all the concurrency levels
       ▪ Achieves performance similar to TR and the static best version
       ▪ Reduces idle time and avoids busy-waiting
     ▪ The same can be said for energy consumption (figure omitted)
     [Figure: overall execution time across seven benchmarks (High: 15 cores, Medium: 7 cores, Low: 2 cores) and CPU time in each workload; lower is better]
  23. Case Study: Dynamic Server Consolidation
     ▪ Implement a dynamic resource manager like Heracles [Lo+, ISCA '15]
       ▪ Statically allocate cores so that the latency-critical (LC) app satisfies its SLO
       ▪ Dynamically adjust the core affinity of the LC/batch apps based on the load
     ▪ PALM outperforms TP when memcached and VO are colocated
       ▪ As the load applied to the LC benchmark rises, the core count of the batch app decreases, which lowers the performance of VO
     [Figure: runtime behavior (latency-critical (LC): memcached, batch: VO); higher is better]
  24. Case Study: Power Capping
     ▪ PALM outperforms TR under power capping*
       ▪ PALM dynamically adjusts the core count allocated to the app
       ▪ TR utilizes all 16 cores even under power capping, so the frequency of all cores must be lowered
       ▪ TP suffers from thread imbalance
     *Power capping limits power consumption by managing resource usage
     [Figure: runtime behavior (CG benchmark, 16 cores) with power capping enabled and disabled; higher is better]
  25. Impact of the PALM Components
     ▪ PA is effective for fine-grain synchronization benchmarks (Really?)
       ▪ e.g., CA, CG, VO
     ▪ LA is effective for cache-sensitive benchmarks
       ▪ e.g., FT, LU
       ▪ Thread migration is significantly reduced by LA
     ▪ SC is effective for coarse-grain synchronization benchmarks
       ▪ e.g., BA, FT
     PA: progress-aware*, LA: locality-aware* (*scheduling period: 0.25 ms), SC: scheduling period control
     [Figure: execution time (16 threads) and number of migrations; lower is better]
  26. Sensitivity to the System Parameters
     ▪ System scale: PALM outperforms TP across all the system scales
       ▪ N_C: 3 (4 threads), N_C: 7 (8 threads), N_C: 15 (16 threads)
     ▪ Core heterogeneity: PALM outperforms TP in the HMP system (Really?)
       ▪ N_C: 15 (7 high-performance, 8 low-performance cores), 16 threads
       ▪ 9.2% shorter execution time than heterogeneity-aware TR
       ▪ TR cannot address the imbalance caused by the heterogeneity (high-frequency vs. low-frequency cores)
  27. Related Work
     ▪ Pack & Cap [Cochran+, MICRO-44]
       ▪ Combines DVFS and TP to optimize performance under power caps
       ▪ Requires extensive profiling, limiting adaptability to different systems
     ▪ The Linux scheduler: a decade of wasted cores [Lozi+, EuroSys '16]
       ▪ Demonstrates the work-conserving bugs in the Linux kernel
       ▪ Lacks an investigation of the impact of TP
     Weak points of PALM compared to my research
     - ✖ Needs to modify the pthread library, the scheduling period, and CPU affinity
     - ✖ Does not investigate the actual problems of the HMP environment
  28. Conclusion
     ▪ Identified the root causes of the inefficiencies of TP
       ▪ With an in-depth analysis using various synchronization-intensive benchmarks
     ▪ PALM: progress- and locality-aware adaptive task migration
       ▪ Achieves greater performance than TP at high concurrency levels
         ◆ 47% shorter execution time
         ◆ 39.3% lower energy consumption
       ▪ Improves
         ◆ the efficiency of dynamic server consolidation
         ◆ the performance under power capping