Zhiyi Huang - University of Otago, New Zealand / Shanghai Jiao Tong University, China

WATS: Workload-Aware Task Scheduling in Asymmetric Multi-core Architectures
Dr. Zhiyi Huang, Department of Computer Science, University of Otago
Joint work with Quan Chen, Yawen Chen and Minyi Guo

Outline
  Background
  Existing problem
  Workload-Aware Task Scheduling
  Evaluation
  Conclusion

Why task scheduling?
  Parallel programming with threads is hard
  Manual load balancing is harder
  Tasks, not threads
    Proposed by J. Reinders as one of the four features for parallel programming solutions
  Performance portability
    Again proposed by J. Reinders as one of the five qualities desired
    Hard for heterogeneous/asymmetric architectures
  Task scheduling plays a key role!

Background
  Asymmetric Multi-Core architecture (AMC)
    High performance, low power consumption
    Symmetric multi-core architecture -> AMC
    DVFS (dynamic voltage and frequency scaling)
  Task scheduling
    Static task scheduling
    Dynamic task scheduling (task-sharing and task-stealing)
    Existing schedulers do not take AMC into consideration

Task-sharing
  [Figure: Workers 1 to 4 take tasks from a central task pool. A worker locks the central task pool when getting a task and unlocks it afterwards.]

Task-stealing
  [Figure: Workers 1 to 4 each hold their own task pool. A worker locks the victim's task pool when stealing a task from it and unlocks it afterwards.]

Task-sharing and Task-stealing
  Task-sharing
    All the workers get tasks from a central task pool
    The central task pool is locked whenever a worker tries to get a task from it
  Task-stealing
    Each worker has its own task pool and mostly gets tasks from it
    The victim worker's task pool is locked when a worker tries to steal a task from it
  Tasks are scheduled to workers randomly
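The two locking disciplines can be contrasted with a minimal sketch. The class names `CentralPool` and `Worker` are hypothetical and do not come from the talk. The sketch also assumes that an owner locks its own pool, which the slides do not state.

```python
import random
import threading
from collections import deque


class CentralPool:
    """Task-sharing: one shared pool, locked on every access."""

    def __init__(self, tasks):
        self._tasks = deque(tasks)
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._tasks.popleft() if self._tasks else None


class Worker:
    """Task-stealing: each worker owns a pool; a thief locks the victim's pool."""

    def __init__(self, tasks=()):
        self.pool = deque(tasks)
        self.lock = threading.Lock()

    def get(self, others):
        # Own pool first (assumption: the owner also takes its own lock).
        with self.lock:
            if self.pool:
                return self.pool.pop()
        # Otherwise try the other workers in random order.
        for victim in random.sample(others, len(others)):
            with victim.lock:
                if victim.pool:
                    return victim.pool.popleft()
        return None
```

With task-sharing every request contends for the same lock, whereas with task-stealing a lock on another worker's pool is only taken when the local pool is empty.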

Scheduling problem in AMC
  The task allocation problem in AMC: how to allocate m independent tasks with different workloads to asymmetric multi-core processors with cores operating at k different speeds?

An example of the problem
  The speed of c0 is twice the speed of c1, c2 and c3
  T1, T2, T3 and T4 need 1.5t, 4t, 1t and 1.5t respectively on c0
  Therefore, they need 3t, 8t, 2t and 3t on c1, c2 and c3

Optimal solution
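As a rough check of the example, the following sketch brute-forces every assignment of T1 to T4 onto c0 to c3. It assumes the objective is to minimise the completion time of the last task (the makespan). The names `SPEED`, `WORK_ON_SLOW`, `makespan` and `best_assignment` are illustrative only.

```python
from itertools import product

# Relative core speeds: c0 is twice as fast as c1, c2 and c3.
SPEED = {"c0": 2.0, "c1": 1.0, "c2": 1.0, "c3": 1.0}
# Execution time of each task on a slow core, in units of t.
WORK_ON_SLOW = {"T1": 3.0, "T2": 8.0, "T3": 2.0, "T4": 3.0}


def makespan(assignment):
    """Finish time of the busiest core for a {task: core} assignment."""
    load = {core: 0.0 for core in SPEED}
    for task, core in assignment.items():
        load[core] += WORK_ON_SLOW[task] / SPEED[core]
    return max(load.values())


def best_assignment():
    """Exhaustive search; only feasible for tiny instances like this one."""
    tasks = list(WORK_ON_SLOW)
    best = None
    for cores in product(SPEED, repeat=len(tasks)):
        candidate = dict(zip(tasks, cores))
        if best is None or makespan(candidate) < makespan(best):
            best = candidate
    return best
```

For this instance the search places T2 alone on c0 and one of the remaining tasks on each slower core, giving a makespan of 4t. The search space grows as (number of cores)^(number of tasks), so exhaustive search does not scale beyond toy examples.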

A near-optimal solution
  The optimal solution may not exist, and it is NP-hard to find the optimal solution
  Choose Pi that satisfies [two conditions; formulas not preserved]

Limitations of the solution
  The number of tasks and their workloads are required
    However, this information is not available until the tasks are completed
  The solution cannot tolerate dynamic changes in the tasks and their workloads
    Dynamic adjustment is required to further balance workloads

Workload-aware Task Scheduling
  History-based task allocation
    Addresses the first limitation
    Allocates tasks to different c-groups (groups of cores running at the same speed) according to their workloads and the computation capacity of the c-groups
  Preference-based task-stealing
    Addresses the second limitation
    Adjusts the workloads dynamically among different c-groups

History-based task allocation
  Assumptions:
    Tasks executing the same function have similar workloads
    The percentage of tasks executing the same function among all tasks stays almost the same during the execution of a parallel application

History-based workload estimation
  Completed tasks are organized into task classes according to their function names
  A task class is represented as TC(f, n, w)
    f: function name
    n: number of tasks
    w: average workload

Workload information updating
  When a task is completed, its task class’s information is updated
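A minimal sketch of the bookkeeping for TC(f, n, w). It assumes the update is an incremental mean over completed tasks; the transcript does not give the actual update rule. The names `TaskClass` and `record_completion` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskClass:
    f: str          # function name
    n: int = 0      # number of completed tasks seen so far
    w: float = 0.0  # average workload of those tasks

    def update(self, workload):
        # Incremental mean: an assumed rule, not taken from the slides.
        self.w = (self.n * self.w + workload) / (self.n + 1)
        self.n += 1


def record_completion(classes, func_name, workload):
    """Update (or create) the task class keyed by the task's function name."""
    task_class = classes.setdefault(func_name, TaskClass(func_name))
    task_class.update(workload)
    return task_class
```

Keying the table by function name follows the assumption that tasks executing the same function have similar workloads.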

Task allocation
  There are m task classes TCi(fi, ni, wi) (1 ≤ i ≤ m), in descending order of wi
  The overall workload ni * wi is used as the workload of task class TCi(fi, ni, wi)
  The near-optimal algorithm is applied to group task classes into task clusters that are mapped to c-groups
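The conditions of the near-optimal algorithm are not preserved in this transcript, so the following is only a stand-in heuristic, not the authors' algorithm. It walks the task classes in descending order of w and gives each c-group a share of the total workload proportional to its capacity. It assumes heavier task classes go to faster c-groups. The function name `cluster_task_classes` and its arguments are hypothetical.

```python
def cluster_task_classes(task_classes, capacities):
    """Map task classes to c-groups in proportion to c-group capacity.

    task_classes: list of (f, n, w) tuples.
    capacities:   computation capacity of each c-group, fastest group first.
    Returns {f: index of the c-group the class is mapped to}.
    """
    ordered = sorted(task_classes, key=lambda tc: tc[2], reverse=True)
    total = sum(n * w for _, n, w in ordered)
    total_cap = sum(capacities)

    mapping = {}
    group = 0
    assigned = 0.0
    quota = total * capacities[0] / total_cap  # cumulative share so far
    for f, n, w in ordered:
        # Move to the next c-group once the cumulative quota is used up.
        while group < len(capacities) - 1 and assigned >= quota:
            group += 1
            quota += total * capacities[group] / total_cap
        mapping[f] = group
        assigned += n * w
    return mapping
```

For example, with class workloads of 80, 40, 20 and 10 and two c-groups of capacity 2 and 1, the first two classes go to the faster group and the last two to the slower one. The split is not perfectly balanced (120 versus 30 against quotas of 100 and 50), which is the kind of residual imbalance dynamic adjustment has to absorb.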

Preference-based Task-stealing
  For an AMC with k c-groups, each core needs k task pools
  Each core is given a preference list of task pools ("rob the weaker first" principle)

Stealing strategy
  1. Obtain tasks from its local task pool for its own c-group
  2. Steal tasks from other cores' task pools for its own c-group
  3. Obtain tasks from its local task pool for the next c-group in its preference list
  4. Steal tasks from other cores' task pools for the next c-group in its preference list
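The steps above can be sketched as a loop over a core's preference list. This is an illustration under assumptions: the preference list is passed in as data (the transcript does not preserve the actual lists), locking is omitted, and `acquire_task` and the `pools` layout are hypothetical names.

```python
import random
from collections import deque


def acquire_task(core, preference, pools, rng=random):
    """Return the next task for `core`, or None if nothing is available.

    preference:      c-group ids in the order this core should try them,
                     starting with its own c-group.
    pools[c][g]:     deque of tasks that core c holds for c-group g.
    """
    for group in preference:
        # Local pool for this c-group first.
        local = pools[core][group]
        if local:
            return local.pop()
        # Then steal from another core's pool for the same c-group.
        victims = [c for c in pools if c != core and pools[c][group]]
        if victims:
            victim = rng.choice(victims)
            return pools[victim][group].popleft()
    return None
```

A core only moves on to the next c-group in its preference list once both its local pool and every other core's pool for the current c-group are empty.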

Example architecture
  There are 3 c-groups, and hence each core has 3 task pools
  [Figure: Preference list of cores]

Evaluation - hardware
  Dell 16-core computer that has four AMD Quad-core Opteron 8380 processors
  Emulated AMC architectures (number of cores at each frequency):

  Name   2.5GHz  1.8GHz  1.3GHz  0.8GHz
  AMC1      2       2       2      10
  AMC2      4       4       4       4
  AMC3      2       0       0      14
  AMC4      4       0       0      12
  AMC5      8       0       0       8
  AMC6     12       0       0       4
  AMC7     16       0       0       0

Benchmarks
  128 tasks were launched in each batch to fill up the cores

Schedulers
  Cilk: traditional child-first task-stealing scheduler (developed by MIT)
  PFT: traditional Parent-First Task-stealing scheduler
  RTS: Random Task-Snatching scheduler (a faster core snatches tasks from a randomly chosen slower core if the faster core cannot steal any task)
  WATS: our Workload-Aware Task Scheduler

Performance of WATS (1)
  WATS can significantly improve the performance of CPU-bound applications

Performance of WATS (2)
  Performance of GA in all 7 AMCs
  WATS can adapt to different AMCs

Scalability of history-based task allocation
  128 tasks with 4 different workloads (in the proportion 8t, 4t, 2t and t) in each batch
  The numbers of tasks with workloads of 8t, 4t, 2t and t are n, n, n and 128 - 3n, so that each batch totals 128 tasks
  WATS is scalable
  RTS is not scalable

Effectiveness of preference-based task-stealing
  WATS-NP: a scheduler that only adopts history-based task allocation
  The preference-based task-stealing is very helpful when handling slightly unbalanced workloads

Task-snatching in WATS
  WATS-TS: task snatching is integrated into WATS
  It is not worthwhile to snatch tasks from slower cores, since in WATS the slower cores are also close to completion

Conclusions
  We have proposed a history-based task allocation algorithm that can allocate tasks in AMC near-optimally
  We have proposed a novel preference-based task-stealing policy that can effectively balance workloads among different groups of cores
  We have implemented a task scheduler, WATS, which achieves a performance gain of up to 82.7% compared to the random task-stealing approach commonly employed

Questions?
Acknowledgement: Thanks to David Eyers for comments. Quan Chen thanks Dept of Computer Science, Univ. of Otago for hosting and funding his study. This research was partially supported by Natural Science Foundation China.
Reference: WATS: Workload-Aware Task Scheduling in Asymmetric Multi-core Architectures, Quan Chen, Yawen Chen, Zhiyi Huang, and Minyi Guo, to appear in the Proceedings of IPDPS'12, Shanghai, May 2012.
