
Co-scheduling for large-scale applications: memory and resilience


This talk explores co-scheduling problems in the context of large-scale applications, with two main focuses: the memory side, in particular the cache memory, and the resilience side. With the recent advent of many-core architectures such as chip multiprocessors (CMP), the number of processing units is increasing. In this context, the benefits of co-scheduling techniques have been demonstrated. Recall that the main idea behind co-scheduling is to execute applications concurrently rather than in sequence, in order to improve the global throughput of the platform. But sharing resources often generates interference. With the rising number of processing units accessing the same last-level cache, this interference among co-scheduled applications becomes critical. In addition, as the number of processors increases, so does the probability of failure. Resilience must therefore be taken into account, especially for co-scheduling, because failure-prone resources might be shared between applications.

SciTech

May 28, 2019



Transcript

  1. Co-scheduling for large-scale applications: memory and resilience. Loïc Pottier, ISI Seminar — May 28, 2019, USC Information Sciences Institute, [email protected]
  2. About myself. PhD student at ENS de Lyon, France, from 2015 to 2018. Advisors: Yves Robert and Anne Benoit. Started my postdoc in the Science Automation Technologies group last February. During my PhD, I mostly worked on (theoretical) scheduling problems for HPC systems. In this talk, I will briefly present what I did during my PhD.
  3. Exascale is coming. #1 in 2012 is Titan (#6 in 2018): 18,688 processors, 299,008 cores, 17 Petaflops, Opteron 6274 (16 cores), 32GB of shared memory/node.
  4. Exascale is coming. #1 in 2012 is Titan (#6 in 2018): 18,688 processors, 299,008 cores, 17 Petaflops, Opteron 6274 (16 cores), 32GB of shared memory/node. #2 in 2018 is Sunway TaihuLight (#1 in 2018 is mostly using GPUs): 40,960 processors, 10,649,600 cores, 93 Petaflops, Sunway SW26010 (260 cores), 32GB of shared memory/node.
  5. Exascale is coming. #1 in 2012 is Titan (#6 in 2018): 18,688 processors, 299,008 cores, 17 Petaflops. #2 in 2018 is Sunway TaihuLight: 40,960 processors, 10,649,600 cores, 93 Petaflops. Node concurrency explodes! How to efficiently use these resources?
  6. Exascale is coming. Node concurrency explodes! How to efficiently use these resources? Solution used in this talk: concurrent scheduling.
  7. Why concurrent scheduling? Why not use all the cores to run each application?
  8. Why concurrent scheduling? Why not use all the cores to run each application? Best solution if the applications are perfectly parallel. Amdahl's Law: an application will execute on p processors in time $s \cdot t_{seq} + (1 - s)\,\frac{t_{seq}}{p}$, where $s$ is the sequential fraction.
  9. Why concurrent scheduling? Why not use all the cores to run each application? Best solution if the applications are perfectly parallel. But ... most of them are not perfectly parallel. Amdahl's Law: an application will execute on p processors in time $s \cdot t_{seq} + (1 - s)\,\frac{t_{seq}}{p}$, where $s$ is the sequential fraction.
  10. Why concurrent scheduling? Why not use all the cores to run each application? Best solution if the applications are perfectly parallel. But ... most of them are not perfectly parallel. In this talk, all applications obey Amdahl's Law: an application will execute on p processors in time $s \cdot t_{seq} + (1 - s)\,\frac{t_{seq}}{p}$, where $s$ is the sequential fraction.
  11. Why concurrent scheduling? Why not use all the cores to run each application? Best solution if the applications are perfectly parallel, but most of them are not. In this talk, all applications obey Amdahl's Law: an application will execute on p processors in time $s \cdot t_{seq} + (1 - s)\,\frac{t_{seq}}{p}$, where $s$ is the sequential fraction. Co-scheduling [Ousterhout, 1982]: execute multiple applications at the same time on the same platform, in order to maximize platform throughput. (Gantt chart: T1, T2, T3 running concurrently on p processors over time.)
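To make the Amdahl trade-off behind co-scheduling concrete, here is a minimal Python sketch; the values of t_seq and s are hypothetical, not from the talk:

```python
# Execution time of an application under Amdahl's Law:
#   time(p) = s * t_seq + (1 - s) * t_seq / p
def amdahl_time(t_seq: float, s: float, p: int) -> float:
    """Time on p processors; s is the sequential fraction."""
    return s * t_seq + (1 - s) * t_seq / p

# Even a small sequential fraction makes "all the cores" a poor strategy:
t_seq, s = 1000.0, 0.05   # hypothetical: 1000 s sequential time, 5% sequential
for p in (1, 16, 256, 4096):
    print(f"p = {p:4d}: {amdahl_time(t_seq, s, p):7.1f} s")
# p=1: 1000.0 s, p=16: 109.4 s, p=256: 53.7 s, p=4096: 50.2 s --
# past a point, extra processors are better spent on a co-scheduled application.
```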
  12. Co-scheduling at scale is challenging. Chip multiprocessor (CMP): multiple cores on the same chip. CMP cores are not independent.
  13. Co-scheduling at scale is challenging. Chip multiprocessor (CMP): multiple cores on the same chip. CMP cores are not independent. Resources shared: cache, memory channels, prefetching units. (Diagram: three cores sharing the Last Level Cache (LLC) and the main memory (DRAM).)
  14. Co-scheduling at scale is challenging. Resources shared: cache ⇐ focus in this talk, memory channels, prefetching units. (Diagram: three cores sharing the Last Level Cache (LLC) and the main memory (DRAM).) Applications compete to access resources ⇒ co-sharing issues.
  15. Co-scheduling at scale is challenging. Applications compete to access resources ⇒ co-sharing issues. Solution proposed ⇒ cache partitioning.
  16. Why resilience? Supercomputers enroll huge numbers of processors. More components → increased probability of errors. MTBF of 1 processor → around 100 years; MTBF of p processors → 100/p years; MTBF of Titan < 1 day.
  17. Why resilience? Supercomputers enroll huge numbers of processors. More components → increased probability of errors. MTBF of 1 processor → around 100 years; MTBF of p processors → 100/p years; MTBF of Titan < 1 day. Resilience at petascale is already a problem.
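The MTBF figures above are a one-line computation; a minimal sketch (processor and core counts taken from the Titan slide, the 100-year MTBF per processor from this one):

```python
# Platform MTBF under independent, identically distributed failures:
#   MTBF(p) = MTBF(1) / p
HOURS_PER_YEAR = 365 * 24

def platform_mtbf_hours(mtbf_one_unit_years: float, units: int) -> float:
    return mtbf_one_unit_years * HOURS_PER_YEAR / units

print(platform_mtbf_hours(100, 18_688))    # per processor: ~46.9 hours
print(platform_mtbf_hours(100, 299_008))   # per core:      ~2.9 hours
# Either way, a century-long MTBF per component shrinks to hours or days
# at Titan's scale -- hence "resilience at petascale is already a problem".
```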
  18. Outline. 1. Co-scheduling applications on cache-partitioned systems: model and theory for perfectly parallel applications; experimental results using Cache Allocation Technology. 2. Resilient application co-scheduling with processor redistribution. 3. Conclusion.
  19. Outline. 1. Co-scheduling applications on cache-partitioned systems: model and theory for perfectly parallel applications; experimental results using Cache Allocation Technology. 2. Resilient application co-scheduling with processor redistribution. 3. Conclusion.
  20. Framework. Applications compete to access resources (processors, memory, network). Focus on the Last Level Cache (LLC). Challenge: model interference between applications. Lots of experimental works, few models, mostly limited to pairwise interactions.
  21. Framework. Applications compete to access resources (processors, memory, network). Focus on the Last Level Cache (LLC). Challenge: model interference between applications. Problem: if we co-schedule n applications sharing one LLC, how to minimize the makespan using cache-partitioning techniques?
  22. Model. Problem: n parallel applications $\{T_1, \dots, T_n\}$ (all applications start at the same time). Execution platform with p identical processors; $p_i$ is the processor fraction assigned to $T_i$ ($\sum_{i=1}^{n} p_i = p$). Shared cache of size $C_s$; $x_i$ is the cache fraction assigned to $T_i$ ($\sum_{i=1}^{n} x_i = 1$).
  23. Execution time. Amdahl speedup profile: $Fl_i(p_i) = s_i w_i + (1 - s_i)\,\frac{w_i}{p_i}$, where $s_i$ is the sequential fraction and $p_i$ the number of processors. $Exe_i(p_i, x_i)$: execution time for $T_i$ with $p_i$ processors and a fraction $x_i$ of cache. Sequential execution time: $Exe^{seq}_i(x_i) = Exe_i(1, x_i)$. $Exe_i(p_i, x_i)$ ⇒ computations (Amdahl) + communication costs.
  24. The power law of cache misses¹. Cache size $C_0$ → miss rate $m_0$. Miss rate $m$ for an arbitrary cache size $C$? $m = m_0 \left(\frac{C_0}{C}\right)^{\alpha}$, where $\alpha$ is the sensitivity factor ($0.3 \le \alpha \le 0.7$). ¹ Allan Hartstein et al. "On the nature of cache miss behavior: Is it √2?" In: The Journal of Instruction-Level Parallelism 10 (2008), pp. 1–22.
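A minimal sketch of this power law; the miss rate and cache sizes below are hypothetical:

```python
# Power law of cache misses (Hartstein et al., 2008):
#   m = m0 * (C0 / C) ** alpha,  with sensitivity factor 0.3 <= alpha <= 0.7
def miss_rate(m0: float, c0: float, c: float, alpha: float = 0.5) -> float:
    """Miss rate at cache size c, given miss rate m0 at size c0."""
    return m0 * (c0 / c) ** alpha

# Hypothetical: 2% miss rate with 1 MB of cache; shrink or grow the partition:
for c in (0.5, 1.0, 2.0, 8.0):
    print(f"{c:4.1f} MB -> miss rate {miss_rate(0.02, 1.0, c):.4f}")
# At alpha = 0.5, doubling the cache divides the miss rate by sqrt(2) --
# the "is it sqrt(2)" of the cited paper.
```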
  25. Problem definition. Definition (CoSchedCache): given n applications $T_1, \dots, T_n$ and a platform with p identical processors sharing a cache of size $C_s$, find a schedule $\{(p_1, x_1), \dots, (p_n, x_n)\}$ with $\sum_{i=1}^{n} p_i \le p$ and $\sum_{i=1}^{n} x_i \le 1$ that minimizes $\max_{1 \le i \le n} Exe_i(p_i, x_i)$.
  26. Complexity results for perfectly parallel applications. Lemma 1: to minimize the makespan, all applications must finish at the same time.
  27. Complexity results for perfectly parallel applications. Lemma 1: to minimize the makespan, all applications must finish at the same time. Lemma 2: given n applications $T_1, \dots, T_n$ and a partitioning of the cache $\{x_1, \dots, x_n\}$, the optimal number of processors for application $T_i$ ($i \in \{1, \dots, n\}$) is $p_i = p \cdot \frac{Exe^{seq}_i(x_i)}{\sum_{j=1}^{n} Exe^{seq}_j(x_j)}$.
  28. Complexity results for perfectly parallel applications. Lemma 1: to minimize the makespan, all applications must finish at the same time. Lemma 2: given a partitioning of the cache $\{x_1, \dots, x_n\}$, the optimal number of processors for application $T_i$ is $p_i = p \cdot \frac{Exe^{seq}_i(x_i)}{\sum_{j=1}^{n} Exe^{seq}_j(x_j)}$. Theorem 1: the problem of finding a schedule $S = \{(x_1, p_1), \dots, (x_n, p_n)\}$ that minimizes the makespan is NP-complete.
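Lemma 2 makes the processor allocation immediate once the cache partition is fixed; the hardness of Theorem 1 comes from choosing the x_i (and integer p_i). A minimal sketch with fractional processors and hypothetical sequential times:

```python
# Lemma 2: p_i = p * Exe_seq_i(x_i) / sum_j Exe_seq_j(x_j)
def allocate_processors(p: int, exe_seq: list[float]) -> list[float]:
    """Fractional processor counts that make all applications finish together."""
    total = sum(exe_seq)
    return [p * t / total for t in exe_seq]

# Hypothetical sequential times under some fixed cache partition:
exe_seq = [120.0, 60.0, 20.0]
alloc = allocate_processors(28, exe_seq)
print(alloc)                                      # [16.8, 8.4, 2.8]
# Perfectly parallel: T_i runs in exe_seq[i] / alloc[i] = sum(exe_seq) / p,
# so every application finishes at the same time (Lemma 1):
print([t / a for t, a in zip(exe_seq, alloc)])    # all equal to 200/28 ~ 7.14
```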
  29. Outline. 1. Co-scheduling applications on cache-partitioned systems: model and theory for perfectly parallel applications; experimental results using Cache Allocation Technology. 2. Resilient application co-scheduling with processor redistribution. 3. Conclusion.
  30. Platform and Cache Allocation Technology. Platform: two Intel Xeon E5-2650L v4 (Broadwell), each with 14 cores, Hyper-Threading disabled; 35MB last-level cache divided into 20 slices; vanilla 4.11.0 Linux kernel with cache partitioning enabled.
  31. Platform and Cache Allocation Technology. Platform: two Intel Xeon E5-2650L v4 (Broadwell), 14 cores each, Hyper-Threading disabled; 35MB last-level cache divided into 20 slices; vanilla 4.11.0 Linux kernel with cache partitioning enabled. Cache Allocation Technology (CAT): provided by Intel to partition the last-level cache; part of the Resource Director Technology (RDT). The OS groups applications into classes of service (COS); each COS describes the amount of cache that assigned applications can use.
  32. Cache Allocation Technology (1/2). CAT example with 2 classes of service, 3 cores, and a 4-bit capacity bitmask (CBM): CBM1 = 1110 (cores p1, p2 in COS1), CBM2 = 0001 (core p3 in COS2). The first COS has 2 cores and 75% of the LLC; the second class of service has the remaining resources.
  33. Cache Allocation Technology (2/2). Some technical restrictions: the numbers of slices and classes are architecture dependent (20 and 16 on our platform); a CBM cannot be empty (each class of applications must have at least one slice of cache); bits set in a CBM must be contiguous; slices are not distributed geographically in the LLC, so the CBMs 0x10000 and 0x00001 should behave exactly the same.
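On a recent Linux kernel (such as the 4.11 kernel above), CAT is driven through the resctrl filesystem. Below is a minimal sketch of the 2-COS example from the previous slide, assuming resctrl is mounted at /sys/fs/resctrl, a single socket (cache id 0), and root privileges; the group names and PIDs are placeholders:

```python
import os

RESCTRL = "/sys/fs/resctrl"   # mount -t resctrl resctrl /sys/fs/resctrl

def make_cos(name: str, cbm: int, pids: list[int]) -> None:
    """Create a class of service with an L3 capacity bitmask and attach tasks."""
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    # Bits must be contiguous and non-empty, e.g. 0b1110 = 3 of 4 slices (75%)
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:0={cbm:x}\n")            # cache id 0: single-socket setup
    for pid in pids:                          # one task id per write
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(f"{pid}\n")

# The slide's toy example (a real CBM on our platform would have 20 bits):
make_cos("COS1", 0b1110, pids=[1234])   # placeholder PID
make_cos("COS2", 0b0001, pids=[5678])   # placeholder PID
```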
  34. Benchmarks. NAS Parallel Benchmarks, class=A (shared-memory version). CG: uses the conjugate gradient method to solve a large sparse symmetric positive definite system of linear equations. MG: performs a multi-grid solve on a sequence of meshes. (Table: description of the NAS parallel benchmarks.)
  35. Applications and metrics. Now iterative NAS benchmarks: we modified the main loop of the NAS applications so that each of them computes for a duration T. Focus on CG and MG (the most interesting combination in terms of cache partitioning).
  36. Applications and metrics. Now iterative NAS benchmarks: we modified the main loop of the NAS applications so that each of them computes for a duration T. Focus on CG and MG. We measure the time for one iteration of $A_i$: $T_i = \frac{T}{\#iter_i}$, where $\#iter_i$ is the number of iterations of application $A_i$ during T. Goal: weighted throughput. Maximize $\min_i \frac{1}{\beta_i T_i}$, where the $\beta_i$ are the weights.
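A minimal sketch of this metric; the iteration counts and weights below are hypothetical:

```python
# Weighted throughput: maximize  min_i 1 / (beta_i * T_i),
# where T_i = T / #iter_i is the measured time per iteration of A_i.
def weighted_throughput(T: float, iters: list[int], betas: list[float]) -> float:
    times = [T / n for n in iters]                     # T_i
    return min(1.0 / (b * t) for b, t in zip(betas, times))

# Hypothetical run: T = 180 s; CG does 600 iterations, MG does 450; beta = (1, 2)
print(weighted_throughput(180.0, [600, 450], [1.0, 2.0]))   # min(3.33, 1.25) = 1.25
```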
  37. Applications and metrics. We modified the main loop of the NAS applications so that each of them computes for a duration T. We ensure that each application reaches steady state with enough iterations (T = 3 minutes). The platform has two processors: one is used to run the experiments, the other manages the experiments (cache experiments are highly sensitive).
  38. Impact of cache partitioning on a real platform. (Plot: total number of iterations, roughly 800 to 1200, as a function of the cache fraction given to CG, with and without cache partitioning.) CG and MG run on six cores each; the cache fraction of CG varies from 5% to 95%.
  39. Co-scheduling results with two applications (CG+MG). (Plot: $\min_i \frac{1}{\beta_i T_i}$ as a function of $\beta_{MG}$ for the strategies DP-CP, DP-Equal, DP-NoCP, Eq-CP, Eq-NoCP, and the model prediction.) DP-CP exhibits a gain of around 15% on average over DP-NoCP!
  40. Outline. 1. Co-scheduling applications on cache-partitioned systems: model and theory for perfectly parallel applications; experimental results using Cache Allocation Technology. 2. Resilient application co-scheduling with processor redistribution. 3. Conclusion.
  41. Checkpoint with fail-stop errors. Save the state of the application periodically. (Timeline: periods of work W, each followed by a checkpoint C.)
  42. Checkpoint with fail-stop errors. Save the state of the application periodically. In case of error, the application returns to the last checkpoint. (Timeline: an error strikes between two checkpoints.)
  43. Checkpoint with fail-stop errors. Save the state of the application periodically. In case of error, the application returns to the last checkpoint. The work done between the last checkpoint and the error ($W_{lost}$) is lost; a downtime D and a recovery R are paid before resuming execution. (Timeline: $W_{lost}$, then D + R, then the lost work W is redone.)
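To see the trade-off this model creates (checkpoint too often and the overhead C dominates; too rarely and too much work is lost), here is a minimal Monte-Carlo sketch with exponentially distributed fail-stop errors; all constants are hypothetical:

```python
import random

def expected_makespan(work: float, period: float, C: float, D: float, R: float,
                      mtbf: float, trials: int = 10_000) -> float:
    """Average time to complete `work`, checkpointing every `period` units of
    work, under exponential fail-stop errors with the given platform MTBF."""
    total = 0.0
    for _ in range(trials):
        done, clock = 0.0, 0.0
        next_fail = random.expovariate(1.0 / mtbf)
        while done < work:
            chunk = min(period, work - done)
            if clock + chunk + C <= next_fail:     # chunk + checkpoint survive
                clock += chunk + C
                done += chunk
            else:                                  # error: work since the last
                clock = next_fail + D + R          # checkpoint is lost
                next_fail = clock + random.expovariate(1.0 / mtbf)
        total += clock
    return total / trials

# Hypothetical: 100 h of work, C = 5 min, D = R = 6 min, platform MTBF = 10 h
for period in (0.5, 1.3, 10.0):                    # hours between checkpoints
    print(period, round(expected_makespan(100, period, 1/12, 0.1, 0.1, 10), 1))
# An intermediate period wins -- close to Daly's sqrt(2 * C * MTBF) ~ 1.3 h here.
```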
  44. Problem: CoSched. A pack: a set of n parallel applications executed on p processors. A new pack can start its execution only when the previous pack is finished.
  45. Problem: CoSched. A pack: a set of n parallel applications executed on p processors. A new pack can start its execution only when the previous pack is finished. (Gantt charts: T1, T2, T3 in a pack finishing at tf, without and with an error striking T1.)
  46. Problem: CoSched. (Gantt charts: T1, T2, T3 in a pack finishing at tf, without and with an error striking T1.) Unbalanced load balancing!
  47. Example. (Gantt chart: T1, T2, T3 scheduled on p processors.)
  48. Example. (Gantt charts: before and after T2 finishes.) Redistribution when T2 releases its processors.
  49. Example. (Gantt chart: an error strikes T1 at time tf.)
  50. Example. (Gantt charts: an error strikes T1 at time tf; its processors may be redistributed.) How to compute the new execution time of T3? Give the processors of T1 to T3?
  51. Model. n independent parallel applications $T_1, T_2, \dots, T_n$. Execution platform with p identical processors. Each application is malleable: its number of processors can change at any time. Each application is a divisible-load application. Problem: CoSched. Minimize the maximum of the expected completion times of the n applications executed on p processors subject to failures. Redistributions are allowed only when an application completes execution or is struck by a failure.
  52. Application model. n independent parallel applications $T_1, T_2, \dots, T_n$. Execution platform with p identical processors. Each application is malleable: its number of processors j can change at any time, and we know its fault-free execution time $t_{i,j}$ for $T_i$ on j processors. Each application is divisible (divisible load).
  53. Complexity without redistribution. Theorem 1: the CoSched problem without redistributions can be solved in polynomial time $O(p \log n)$, where p is the number of processors and n the number of applications. Each application starts with two processors; we allocate the p − 2n remaining processors two by two, in a greedy way, to the longest application.
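A minimal sketch of that greedy allocation with a max-heap (the fault-free times t_{i,j} come from the model; heapq is a min-heap, hence the negated keys):

```python
import heapq

def greedy_allocation(p: int, t: list[list[float]]) -> list[int]:
    """CoSched without redistribution: start every application on 2 processors,
    then hand the p - 2n spare processors, two by two, to the application with
    the largest current execution time. t[i][j-1] is T_i's time on j procs."""
    n = len(t)
    alloc = [2] * n
    heap = [(-t[i][1], i) for i in range(n)]   # max-heap on execution time
    heapq.heapify(heap)
    spare = p - 2 * n
    while spare >= 2:
        _, i = heapq.heappop(heap)             # currently longest application
        alloc[i] += 2
        heapq.heappush(heap, (-t[i][alloc[i] - 1], i))
        spare -= 2
    return alloc   # (p - 2n) / 2 heap operations -> O(p log n) overall

# Columns j = 1..3 from the next slide's example; the j = 4 column is made up:
t = [[10, 9, 6, 5], [6, 3, 3, 3]]
print(greedy_allocation(6, t))                 # T1 is longer -> [4, 2]
```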
  54. Greedy algorithm when redistributions are allowed. Example with two applications: $T_1$: $t_{1,1} = 10$, $w_{1,1} = 10$; $t_{1,2} = 9$, $w_{1,2} = 18$; $t_{1,3} = 6$, $w_{1,3} = 18$. $T_2$: $t_{2,1} = 6$, $w_{2,1} = 6$; $t_{2,2} = 3$, $w_{2,2} = 6$; $t_{2,3} = 3$, $w_{2,3} = 9$. (a) Greedy uses the largest execution time to allocate processors (Gantt: $T_2$ finishes at 6, $T_1$ at 9; after redistribution, $T_1$ finishes at 8). (b) Greedy-SP uses the best speedup profile to allocate processors (Gantt: $T_2$ finishes at 3, $T_1$ at 10; after redistribution, $T_1$ finishes at 7.2). There are examples where Greedy-SP is not optimal either...
  55. Complexity with redistribution. Theorem 2: with constant redistribution costs and without failures, CoSched is NP-complete (in the strong sense). Reduction from 3-Partition with distinct integers.
  56. Outline. 1. Co-scheduling applications on cache-partitioned systems: model and theory for perfectly parallel applications; experimental results using Cache Allocation Technology. 2. Resilient application co-scheduling with processor redistribution. 3. Conclusion.
  57. Conclusion. Chosen approach: build a realistic theoretical model and study the complexity; using theoretical insights, design efficient polynomial-time heuristics; evaluate using simulations with realistic inputs; challenge our solution through real experiments on a dedicated platform.
  58. List of publications.
  Book chapters: [B1] G. Aupy, A. Benoit, L. Pottier, P. Raghavan, Y. Robert and M. Shantharam. Co-scheduling high-performance computing applications. In: Big Data Management and Processing. Ed. by Kuan-Ching Li, Hai Jiang, and Albert Zomaya. Chapman and Hall/CRC Press, 2017.
  International peer-reviewed journals: [J1] A. Benoit, L. Pottier and Y. Robert. Resilient co-scheduling of malleable applications. International Journal of High Performance Computing Applications (IJHPCA), 2017. [J2] G. Aupy, A. Benoit, S. Dai, L. Pottier, P. Raghavan, Y. Robert and M. Shantharam. Co-scheduling Amdahl applications on cache-partitioned systems. International Journal of High Performance Computing Applications (IJHPCA), 2017. [J3] G. Aupy, A. Benoit, B. Goglin, L. Pottier and Y. Robert. Co-scheduling HPC workloads on cache-partitioned CMP platforms. International Journal of High Performance Computing Applications (IJHPCA), 2018.
  International peer-reviewed conferences: [C1] A. Benoit, L. Pottier and Y. Robert. Resilient application co-scheduling with processor redistribution. 45th International Conference on Parallel Processing (ICPP), 2016. [C2] A. Benoit, S. Perarnau, L. Pottier and Y. Robert. A performance model to execute workflows on high-bandwidth-memory architectures. 47th International Conference on Parallel Processing (ICPP), 2018. [C3] G. Aupy, A. Benoit, B. Goglin, L. Pottier and Y. Robert. Co-scheduling HPC workloads on cache-partitioned CMP platforms. IEEE Cluster, 2018.
  International peer-reviewed workshops: [W1] G. Aupy, A. Benoit, L. Pottier, P. Raghavan, Y. Robert and M. Shantharam. Co-scheduling algorithms for cache-partitioned systems. 19th Workshop on Advances in Parallel and Distributed Computational Models (IPDPSW), 2017.