Zhiyi Huang - University of Otago, New Zealand / Shanghai Jiao Tong University, China

"WATS - Workload-Aware Task Scheduling in Asymmetric Multi-core Architectures"


Multicore World

July 16, 2012


  1. WATS: Workload-Aware Task Scheduling in Asymmetric Multi-core Architectures
     Dr. Zhiyi Huang, Department of Computer Science, University of Otago
     Joint work with Quan Chen, Yawen Chen and Minyi Guo
  2. Outline
     - Background
     - Existing problem
     - Workload-Aware Task Scheduling
     - Evaluation
     - Conclusion
  3. Why task scheduling?
     - Parallel programming with threads is hard
       - Manual load balancing is harder
     - Tasks, not threads
       - Proposed by J. Reinders as one of the four features for parallel programming solutions
     - Performance portability
       - Again proposed by J. Reinders as one of the five qualities desired
       - Hard for heterogeneous/asymmetric architectures
     - Task scheduling plays a key role!
  4. Background
     - Asymmetric Multi-Core architecture (AMC)
       - High performance
       - Low power consumption
       - Symmetric multi-core architecture -> AMC
         - DVFS (dynamic voltage and frequency scaling)
     - Task scheduling
       - Static task scheduling
       - Dynamic task scheduling (task-sharing & task-stealing)
       - These do not take AMC into consideration
  5. Task-sharing
     [Figure: Workers 1 to 4 take tasks from a central task pool, which is locked and unlocked around each access]
     - Lock the central task pool when getting a task
  6. Task-stealing
     [Figure: Workers 1 to 4, each with its own pool of tasks; a victim's pool is locked and unlocked around a steal]
     - Lock the victim task pool when stealing a task from it
  7. Task-sharing and Task-stealing
     - Task-sharing
       - All the workers get tasks from a central task pool
       - Lock the central task pool whenever a worker tries to get a task from it
     - Task-stealing
       - Each worker has a task pool, and mostly gets tasks from its own task pool
       - Lock the victim worker's task pool when a worker tries to steal a task from it
     - In both cases, tasks are scheduled to workers randomly
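The two locking disciplines above can be sketched as follows. This is a minimal illustration, not the code of any scheduler named in the talk: the class names `CentralPool` and `StealingWorker` and the helper `next_task` are hypothetical, and the sketch locks the local pool too for simplicity, whereas production work-stealing deques usually avoid that.

```python
import random
import threading
from collections import deque


class CentralPool:
    """Task-sharing: one pool and one lock shared by every worker."""

    def __init__(self, tasks):
        self._tasks = deque(tasks)
        self._lock = threading.Lock()

    def get(self):
        # Every worker contends for this single lock on every fetch.
        with self._lock:
            return self._tasks.popleft() if self._tasks else None


class StealingWorker:
    """Task-stealing: each worker owns a pool and mostly uses only that."""

    def __init__(self, tasks=()):
        self.pool = deque(tasks)
        self.lock = threading.Lock()

    def get_local(self):
        # Take the most recently added task from the worker's own pool.
        with self.lock:
            return self.pool.pop() if self.pool else None

    def steal_from(self, victim):
        # Only the victim's pool is locked, and only while stealing.
        with victim.lock:
            return victim.pool.popleft() if victim.pool else None


def next_task(worker, others, rng=random):
    """Try the local pool first, then steal from a randomly chosen victim."""
    task = worker.get_local()
    if task is not None or not others:
        return task
    return worker.steal_from(rng.choice(others))
```

With task-sharing every fetch goes through one lock. With task-stealing a lock on another worker's pool is taken only when the local pool is empty. Neither variant looks at core speed when choosing who runs what.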
  8. Scheduling problem in AMC
     - How to allocate m independent tasks with different workloads to an asymmetric multi-core processor whose cores operate at k different speeds?
     [Figure: The task allocation problem in AMC]
  9. An example of the problem
     - The speed of c0 is twice the speed of c1, c2 and c3
     - T1, T2, T3 and T4 need 1.5t, 4t, 1t and 1.5t respectively on c0
     - Therefore, they need 3t, 8t, 2t and 3t on c1, c2 and c3
  10. Optimal solution
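
The slide's own figure is not preserved in this transcript. As a stand-in, the four-task example is small enough to solve by exhaustive search. The sketch below is our own illustration, not the method from the talk. The names `SPEED`, `WORK_ON_C0`, `makespan` and `best_assignment` are hypothetical, and tasks are assumed to be indivisible and to run back to back on a core.

```python
from itertools import product

# Relative core speeds from the example: c0 is twice as fast as the others.
SPEED = {"c0": 2.0, "c1": 1.0, "c2": 1.0, "c3": 1.0}
# Execution time of each task on c0, in units of t.
WORK_ON_C0 = {"T1": 1.5, "T2": 4.0, "T3": 1.0, "T4": 1.5}


def run_time(task, core):
    # Scale the c0 time by the speed ratio between c0 and the chosen core.
    return WORK_ON_C0[task] * SPEED["c0"] / SPEED[core]


def makespan(assignment):
    # A core's finish time is the sum of the tasks placed on it.
    finish = {core: 0.0 for core in SPEED}
    for task, core in assignment.items():
        finish[core] += run_time(task, core)
    return max(finish.values())


def best_assignment():
    # Try all 4^4 = 256 task-to-core mappings and keep the first minimum.
    tasks = list(WORK_ON_C0)
    best = None
    for cores in product(SPEED, repeat=len(tasks)):
        candidate = dict(zip(tasks, cores))
        if best is None or makespan(candidate) < makespan(best):
            best = candidate
    return best


best = best_assignment()
print(best, makespan(best))  # makespan is 4.0, in units of t
```

The search places T2 alone on the fast core c0 and one remaining task on each slow core, finishing at 4t. No schedule can do better, because T2 needs 4t even on c0.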

  11. A near-optimal solution
      - The optimal solution may not exist, and it is NP-hard to find the optimal solution
      - Choose Pi that satisfies [two conditions; formulas not preserved]
  12. Limitations of the solution
      - The number of tasks and their workloads are required
        - However, this information is not available until the tasks are completed
      - The solution cannot tolerate dynamic changes in tasks and their workloads
        - Dynamic adjustment is required to further balance workloads
  13. Workload-aware Task Scheduling
      - History-based task allocation
        - To address the first limitation
        - Allocate tasks to different c-groups according to their workloads and the computation capacity of the c-groups
      - Preference-based task-stealing
        - To address the second limitation
        - Adjust the workloads dynamically among different c-groups
  14. History-based task allocation
      - Assumptions:
        - Tasks executing the same function have similar workloads
        - The percentage of tasks executing the same function among all tasks is almost the same during the execution of a parallel application
  15. History-based workload estimation
      - Tasks completed in history are organized as task classes according to their function names
      - TC(f, n, w) is used to represent a task class
        - f: function name
        - n: number of tasks
        - w: average workload
  16. Workload information updating
      - When a task is completed, its task class's information is updated
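The update rule itself is not preserved in this transcript. A natural reading, and only an assumption here, is an incremental mean over the completed tasks of each function. In the sketch below `TaskClass` mirrors TC(f, n, w). The helper `record_completion` and the idea that the caller supplies a measured workload are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskClass:
    f: str          # function name
    n: int = 0      # number of completed tasks seen so far
    w: float = 0.0  # average workload of those tasks


def record_completion(history, function_name, workload):
    """Fold one completed task into its task class (assumed running mean)."""
    tc = history.setdefault(function_name, TaskClass(function_name))
    tc.w = (tc.w * tc.n + workload) / (tc.n + 1)
    tc.n += 1
    return tc


history = {}
record_completion(history, "fib", 4.0)
record_completion(history, "fib", 2.0)
print(history["fib"])  # TaskClass(f='fib', n=2, w=3.0)
```

Keeping only n and w per function makes each update constant-time, so the history costs little to maintain while the application runs.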
  17. Task allocation
      - There are m task classes TCi(fi, ni, wi) (1 ≤ i ≤ m) in descending order of wi
      - The overall workload ni * wi is used as the workload of the task class TCi(fi, ni, wi)
      - The near-optimal algorithm is applied to group task classes into task clusters that are mapped to c-groups
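The exact selection conditions of the near-optimal algorithm are not preserved in this transcript, so the following is only a sketch of one plausible grouping, under loudly stated assumptions: task classes are walked in descending order of w, the heaviest classes go to the fastest c-group, and each c-group receives classes until its share of the ideal common finish time is used up. The function `allocate` and the `capacities` argument (aggregate speed of each c-group, fastest first) are hypothetical.

```python
def allocate(task_classes, capacities):
    """Split task classes into one cluster per c-group.

    task_classes: list of (f, n, w) tuples.
    capacities:   aggregate speed of each c-group, fastest first (assumed).
    """
    ordered = sorted(task_classes, key=lambda tc: tc[2], reverse=True)
    total = sum(n * w for _, n, w in ordered)
    target = total / sum(capacities)  # ideal common finish time
    clusters = [[] for _ in capacities]
    group, load = 0, 0.0
    for f, n, w in ordered:
        budget = target * capacities[group]
        is_last = group == len(capacities) - 1
        # Move on once the current c-group would exceed its budget.
        if clusters[group] and not is_last and load + n * w > budget + 1e-9:
            group, load = group + 1, 0.0
        clusters[group].append(f)
        load += n * w
    return clusters


classes = [("a", 10, 8.0), ("b", 5, 4.0), ("c", 40, 1.0)]
print(allocate(classes, [5.0, 2.0]))  # [['a', 'b'], ['c']]
```

In this example the overall workloads are 80, 20 and 40, the total capacity is 7, and the ideal finish time is 20. The fast group takes classes a and b (100 / 5 = 20) and the slow group takes class c (40 / 2 = 20), so both would finish together.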
  18. Preference-based Task-stealing
      - For AMC with k c-groups, each core needs k task pools
      - Each core is given a preference list of task pools ("rob the weaker first" principle)
  19. Stealing strategy
      (1) Obtain tasks from its local task pool for its c-group
      (2) Steal tasks from other task pools for its c-group
      (3) Obtain tasks from its local task pool for the next c-group in its preference list
      (4) Steal tasks from other task pools for the next c-group in its preference list
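The four steps of the stealing strategy can be sketched as a loop over the preference list. The ordering produced by `preference_list` below (own c-group, then slower c-groups, then faster ones) is our assumed reading of "rob the weaker first", not something this transcript spells out. The names `acquire`, `pools` and `group_of` are hypothetical, and locking is left out to keep the sketch short.

```python
import random
from collections import deque


def preference_list(own, k):
    """Assumed order: own c-group, then slower ones, then faster ones.

    c-groups are indexed 0 (fastest) to k-1 (slowest).
    """
    return [own] + list(range(own + 1, k)) + list(range(own - 1, -1, -1))


def acquire(core, group_of, pools, rng=random):
    """Find the next task for `core`.

    pools[c][g] is core c's task pool holding tasks meant for c-group g.
    """
    k = len(pools[core])
    for g in preference_list(group_of[core], k):
        # Local pool for c-group g (steps 1 and 3 of the strategy).
        if pools[core][g]:
            return pools[core][g].pop()
        # Steal from another core's pool for c-group g (steps 2 and 4).
        victims = [c for c in pools if c != core and pools[c][g]]
        if victims:
            return pools[rng.choice(victims)][g].popleft()
    return None


# Two cores: core 0 in the fastest c-group, core 1 in the slowest of three.
group_of = {0: 0, 1: 2}
pools = {c: [deque() for _ in range(3)] for c in group_of}
pools[1][2].extend(["x", "y"])
print(acquire(0, group_of, pools))  # core 0 ends up stealing "x"
```

A core drains work intended for its own c-group before touching any other group's work, so tasks tend to stay where the allocation step placed them and move only when a c-group runs dry.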
  20. Example architecture
      - There are 3 c-groups, and hence each core has 3 task pools
      [Figure: Preference list of cores]
  21. Evaluation - hardware
      - Dell 16-core computer that has four AMD Quad-core Opteron 8380 processors
      - Emulated AMC architectures (number of cores at each frequency):

        Name   2.5GHz  1.8GHz  1.3GHz  0.8GHz
        AMC1      2       2       2      10
        AMC2      4       4       4       4
        AMC3      2       0       0      14
        AMC4      4       0       0      12
        AMC5      8       0       0       8
        AMC6     12       0       0       4
        AMC7     16       0       0       0
  22. Benchmarks
      - 128 tasks were launched in each batch to fill up the cores
  23. Schedulers
      - Cilk: traditional child-first task-stealing scheduler (developed by MIT)
      - PFT: traditional Parent-First Task-stealing scheduler
      - RTS: Random Task-Snatching scheduler (a faster core snatches tasks from a randomly chosen slower core if the faster core cannot steal any task)
      - WATS: our Workload-Aware Task Scheduler
  24. Performance of WATS (1)
      - WATS can significantly improve the performance of the CPU-bound applications
  25. Performance of WATS (2)
      - Performance of GA in all 7 AMCs
      - WATS can adapt to different AMCs
  26. Scalability of history-based task allocation
      - 128 tasks with 4 different workloads (in proportion of 8t, 4t, 2t and t) in each batch
      - The numbers of tasks with workloads of 8t, 4t, 2t and t are n, n, n and 128-3n
      - Results plotted against n: WATS is scalable; RTS is not scalable
  27. Effectiveness of preference-based task-stealing
      - WATS-NP: a scheduler that only adopts history-based task allocation
      - The preference-based task-stealing is very helpful when handling slightly unbalanced workloads
  28. Task-snatching in WATS
      - WATS-TS: task snatching is integrated into WATS
      - It is not worthwhile to snatch tasks from slower cores, since they are also close to completion in WATS
  29. Conclusions
      - We have proposed a history-based task allocation algorithm that can allocate tasks in AMC near-optimally
      - We have proposed a novel preference-based task-stealing policy that can effectively balance workloads among different groups of cores
      - We have implemented a task scheduler, WATS, which achieves a performance gain of up to 82.7% compared to the commonly employed random task-stealing approach
  30. Questions?
      - Acknowledgement: Thanks to David Eyers for comments. Quan Chen thanks the Dept of Computer Science, Univ. of Otago for hosting and funding his study. This research was partially supported by Natural Science Foundation China.
      - Reference: Quan Chen, Yawen Chen, Zhiyi Huang, and Minyi Guo, "WATS: Workload-Aware Task Scheduling in Asymmetric Multi-core Architectures", to appear in the Proceedings of IPDPS'12, Shanghai, May 2012.