Slide 1

Co-scheduling for large-scale applications: memory and resilience
Loïc Pottier, ISI Seminar, May 28, 2019
USC Information Sciences Institute
[email protected]

Slide 2

About myself
- PhD student at ENS de Lyon, France, from 2015 to 2018
- Advisors: Yves Robert and Anne Benoit
- Started my postdoc in the Science Automation Technologies group last February
- During my PhD, I mostly worked on (theoretical) scheduling problems for HPC systems
- In this talk, I will briefly cover what I did during my PhD

Slides 3–6

Exascale is coming
#1 in 2012 is Titan (#6 in 2018):
- 18,688 processors, 299,008 cores
- 17 Petaflops
- Opteron 6274 (16 cores per processor)
- 32 GB of shared memory per node
#2 in 2018 is Sunway TaihuLight (the 2018 #1 mostly uses GPUs):
- 40,960 processors, 10,649,600 cores
- 93 Petaflops
- Sunway SW26010 (260 cores per processor)
- 32 GB of shared memory per node
Node concurrency explodes! How can we use these resources efficiently?
Solution used in this talk: concurrent scheduling

Slides 7–11

Why concurrent scheduling?
Why not use all the cores to run each application?
- Best solution if the applications are perfectly parallel
- But... most of them are not perfectly parallel
- In this talk, all applications obey Amdahl's Law

Amdahl's Law
An application will execute on $p$ processors in time $s \cdot t_{\text{seq}} + (1 - s) \frac{t_{\text{seq}}}{p}$, where $s$ is the sequential fraction.

Co-scheduling [Ousterhout, 1982]
Execute multiple applications at the same time on the same platform, in order to maximize platform throughput.
[Figure: Gantt chart of applications T1, T2, T3 sharing p processors over time]
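As a minimal sketch of the model above (not part of the original slides), the Amdahl execution time and the makespan of a co-schedule can be computed as follows; the numeric values are illustrative assumptions.

```python
def amdahl_time(t_seq: float, s: float, p: int) -> float:
    """Execution time on p processors: s * t_seq + (1 - s) * t_seq / p."""
    return s * t_seq + (1.0 - s) * t_seq / p

def makespan(apps: list[tuple[float, float]], alloc: list[int]) -> float:
    """All applications start at time 0 on disjoint processor sets,
    so the makespan is the longest individual execution time."""
    return max(amdahl_time(t, s, p) for (t, s), p in zip(apps, alloc))

# Two hypothetical applications (t_seq, s) co-scheduled on 8 processors.
apps = [(100.0, 0.05), (60.0, 0.20)]
print(makespan(apps, alloc=[5, 3]))
```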

Slides 12–15

Co-scheduling at scale is challenging
Chip multiprocessor (CMP):
- Multiple cores on the same chip
- CMP cores are not independent
Shared resources:
- cache ⇐ focus in this talk
- memory channels
- prefetching units
[Figure: cores sharing the Last Level Cache (LLC) and main memory (DRAM), versus an LLC partitioned between cores]
Applications compete to access resources ⇒ co-sharing issues
Solution proposed ⇒ cache partitioning

Slides 16–17

Why resilience?
- Supercomputers enroll a huge number of processors
- More components → increased probability of errors
- MTBF of 1 processor → around 100 years
- MTBF of p processors → 100/p years
- MTBF of Titan < 1 day
Resilience at petascale is already a problem
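A quick arithmetic check of the MTBF claim above (a sketch; it assumes a 100-year MTBF per component and independent failures, so the platform MTBF is the individual MTBF divided by the component count):

```python
YEAR_IN_HOURS = 365 * 24  # 8760

def platform_mtbf_hours(individual_mtbf_years: float, p: int) -> float:
    """MTBF of p independent components, each with the given MTBF."""
    return individual_mtbf_years * YEAR_IN_HOURS / p

# Titan: per-socket count gives roughly two days, per-core count a few hours.
print(platform_mtbf_hours(100, 18_688))   # ~46.9 hours
print(platform_mtbf_hours(100, 299_008))  # ~2.9 hours
```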

Slides 18–19

Outline
1. Co-scheduling applications on cache-partitioned systems
   - Model and theory for perfectly parallel applications
   - Experimental results using Cache Allocation Technology
2. Resilient application co-scheduling with processor redistribution
3. Conclusion

Slides 20–21

Framework
- Applications compete to access resources (processors, memory, network)
- Focus on the Last Level Cache (LLC)
- Challenge: model interference between applications
  - lots of experimental work
  - few models, mostly limited to pairwise interactions

Problem
If we co-schedule n applications sharing one LLC, how do we minimize the makespan using cache-partitioning techniques?

Slide 22

Model
- $n$ parallel applications $\{T_1, \dots, T_n\}$ (all applications start at the same time)
- Execution platform with $p$ identical processors
- $p_i$: number of processors assigned to $T_i$, with $\sum_{i=1}^{n} p_i = p$
- Shared cache of size $C_s$
- $x_i$: cache fraction assigned to $T_i$, with $\sum_{i=1}^{n} x_i = 1$

Slide 23

Execution time
Amdahl speedup profile: $F_i(p_i) = s_i w_i + (1 - s_i) \frac{w_i}{p_i}$, where $s_i$ is the sequential fraction and $p_i$ the number of processors.
- $\mathrm{Exe}_i(p_i, x_i)$: execution time for $T_i$ with $p_i$ processors and a fraction $x_i$ of the cache
- Sequential execution time: $\mathrm{Exe}^{\mathrm{seq}}_i(x_i) = \mathrm{Exe}_i(1, x_i)$
- $\mathrm{Exe}_i(p_i, x_i)$ ⇒ computations (Amdahl) + communication costs

Slide 24

The power law of cache misses [1]
Cache size $C_0$ → miss rate $m_0$. What is the miss rate $m$ for an arbitrary cache size $C$?
$m = m_0 \left( \frac{C_0}{C} \right)^{\alpha}$, where $\alpha$ is the sensitivity factor ($0.3 \leq \alpha \leq 0.7$)

[1] Allan Hartstein et al. "On the nature of cache miss behavior: Is it √2?". The Journal of Instruction-Level Parallelism 10 (2008), pp. 1–22.
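A minimal sketch of the power law as stated above; the measured values in the example are hypothetical.

```python
def miss_rate(m0: float, c0: float, c: float, alpha: float = 0.5) -> float:
    """Predicted miss rate at cache size c, given miss rate m0 at size c0:
    m = m0 * (c0 / c) ** alpha."""
    return m0 * (c0 / c) ** alpha

# Measured 4% miss rate with 8 MB of LLC; predict the rate with only 2 MB.
print(miss_rate(0.04, 8.0, 2.0, alpha=0.5))  # 0.08, i.e., misses double
```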

Slide 25

Problem definition
Definition (CoSchedCache)
Given $n$ applications $T_1, \dots, T_n$ and a platform with $p$ identical processors sharing a cache of size $C_s$, find a schedule $\{(p_1, x_1), \dots, (p_n, x_n)\}$ with $\sum_{i=1}^{n} p_i \leq p$ and $\sum_{i=1}^{n} x_i \leq 1$ that minimizes $\max_{1 \leq i \leq n} \mathrm{Exe}_i(p_i, x_i)$.

Slides 26–28

Complexity results for perfectly parallel applications
Lemma 1
To minimize the makespan, all applications must finish at the same time.

Lemma 2
Given $n$ applications $T_1, \dots, T_n$ and a partitioning of the cache $\{x_1, \dots, x_n\}$, the optimal number of processors for application $T_i$ ($i \in \{1, \dots, n\}$) is:
$p_i = p \cdot \frac{\mathrm{Exe}^{\mathrm{seq}}_i(x_i)}{\sum_{j=1}^{n} \mathrm{Exe}^{\mathrm{seq}}_j(x_j)}$

Theorem 1
The problem of finding a schedule $S = \{(x_1, p_1), \dots, (x_n, p_n)\}$ that minimizes the makespan is NP-complete.
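A minimal sketch of the Lemma 2 allocation (the sequential times below are illustrative): processors are split proportionally to each application's sequential execution time, so that all applications finish together, as required by Lemma 1.

```python
def optimal_allocation(seq_times: list[float], p: int) -> list[float]:
    """p_i = p * Exe_seq_i / sum_j Exe_seq_j (fractional processors)."""
    total = sum(seq_times)
    return [p * t / total for t in seq_times]

# The slowest application gets the most processors.
print(optimal_allocation([120.0, 60.0, 20.0], p=10))  # [6.0, 3.0, 1.0]
```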

Slide 29

Outline
1. Co-scheduling applications on cache-partitioned systems
   - Model and theory for perfectly parallel applications
   - Experimental results using Cache Allocation Technology
2. Resilient application co-scheduling with processor redistribution
3. Conclusion

Slides 30–31

Platform and Cache Allocation Technology
Platform:
- Two Intel Xeon E5-2650L v4 (Broadwell)
- Each with 14 cores, Hyper-Threading disabled
- 35 MB last-level cache divided into 20 slices
- Vanilla 4.11.0 Linux kernel with cache partitioning enabled

Cache Allocation Technology (CAT):
- Provided by Intel to partition the last-level cache
- Part of the Resource Director Technology (RDT)
- The OS groups applications into classes of service (COS)
- Each COS describes the amount of cache that its assigned applications can use

Slide 32

Cache Allocation Technology (1/2)
CAT example with 2 classes of service, 3 cores, and a 4-bit capacity bitmask (CBM):
- CBM1 = 1110: cores p1 and p2 in COS1
- CBM2 = 0001: core p3 in COS2
The first COS has 2 cores and 75% of the LLC; the second class of service has the remaining resources.
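A hedged sketch (not from the talk) of how such classes of service can be set up through the Linux resctrl interface, which exposes CAT in vanilla kernels from 4.10 on and thus fits the 4.11.0 kernel above. It assumes resctrl is already mounted at /sys/fs/resctrl, root privileges, and illustrative PIDs.

```python
import os

RESCTRL = "/sys/fs/resctrl"  # assumption: resctrl already mounted here

def make_cos(name: str, l3_mask: int, pids: list[int]) -> None:
    """Create a class of service with a contiguous L3 capacity bitmask."""
    cos = os.path.join(RESCTRL, name)
    os.makedirs(cos, exist_ok=True)
    # Schemata line: L3 mask for cache id 0; set bits must be contiguous.
    with open(os.path.join(cos, "schemata"), "w") as f:
        f.write(f"L3:0={l3_mask:x}\n")
    for pid in pids:  # one PID per write into the tasks file
        with open(os.path.join(cos, "tasks"), "w") as f:
            f.write(f"{pid}\n")

# Mirror the slide's example: COS1 gets CBM 1110, COS2 gets CBM 0001.
make_cos("COS1", 0b1110, pids=[1234, 1235])  # hypothetical PIDs
make_cos("COS2", 0b0001, pids=[1236])
```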

Slide 33

Cache Allocation Technology (2/2)
Some technical restrictions:
- The numbers of slices and classes are architecture dependent (20 and 16 on our platform)
- A CBM cannot be empty (each class of applications must have at least one slice of cache)
- Bits set in a CBM must be contiguous
- Slices are not distributed geographically in the LLC: the CBMs 0x10000 and 0x00001 should behave exactly the same

Slide 34

Benchmarks
NAS Parallel Benchmarks, class A (shared-memory version):
- CG: uses the conjugate gradient method to solve a large sparse symmetric positive definite system of linear equations
- MG: performs a multi-grid solve on a sequence of meshes

Slides 35–36

Applications and metrics
Now iterative NAS benchmarks:
- We modified the main loop of the NAS applications so that each of them computes for a duration T
- Focus on CG and MG (the most interesting combination in terms of cache partitioning)
We measure the time for one iteration of $A_i$: $T_i = \frac{T}{\#\mathrm{iter}_i}$, where $\#\mathrm{iter}_i$ is the number of iterations of application $A_i$ during $T$.

Goal: weighted throughput
Maximize $\min_i \frac{1}{\beta_i T_i}$, where the $\beta_i$ are the weights.
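A minimal sketch of this objective (the iteration counts and weights below are illustrative):

```python
def weighted_throughput(T: float, iters: list[int], betas: list[float]) -> float:
    """min_i 1 / (beta_i * T_i), with T_i = T / #iter_i."""
    per_iter = [T / n for n in iters]  # seconds per iteration, T_i
    return min(1.0 / (b * t) for b, t in zip(betas, per_iter))

# Two applications over T = 180 s (the talk's 3-minute runs), MG weighted 2x.
print(weighted_throughput(180.0, iters=[900, 600], betas=[1.0, 2.0]))
```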

Slide 37

Applications and metrics
- We modified the main loop of the NAS applications so that each of them computes for a duration T
- We ensure that each application reaches steady state with enough iterations (T = 3 minutes)
- The platform has two processors: one runs the experiments, the other manages them (cache experiments are highly sensitive)

Slide 38

Impact of cache partitioning on a real platform
[Figure: total number of iterations (800–1200) vs. fraction of cache (5%–95%), with and without cache partitioning]
CG and MG (six cores each); the cache fraction of CG varies from 5% to 95%.

Slide 39

Co-scheduling results with two applications (CG+MG)
[Figure: $\min_i \frac{1}{\beta_i T_i}$ vs. $\beta_{MG}$ for DP-CP, DP-Equal, DP-NoCP, Eq-CP, and Eq-NoCP, against the model prediction]
DP-CP exhibits a gain of around 15% on average over DP-NoCP!

Slide 40

Outline
1. Co-scheduling applications on cache-partitioned systems
   - Model and theory for perfectly parallel applications
   - Experimental results using Cache Allocation Technology
2. Resilient application co-scheduling with processor redistribution
3. Conclusion

Slides 41–43

Checkpoint with fail-stop errors
Save the state of the application periodically:
[Timeline: work periods W separated by checkpoints C]
In case of an error, the application returns to the last checkpoint:
[Timeline: an error strikes during a work period, after the last checkpoint]
The work done between the last checkpoint and the error is lost; a downtime D and a recovery R are paid before resuming execution:
[Timeline: lost work W_lost, then D + R, then re-execution from the last checkpoint]
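A hedged Monte Carlo sketch (not from the talk) of these mechanics: exponential fail-stop errors with a given MTBF, periodic checkpoints of cost C, downtime D, and recovery R. For simplicity it assumes failures never strike during downtime or recovery, and all parameter values are illustrative.

```python
import random

def run_to_completion(work: float, period: float, C: float, D: float,
                      R: float, mtbf: float, rng: random.Random) -> float:
    """Wall-clock time to complete `work` units of computation."""
    elapsed, done = 0.0, 0.0
    while done < work:
        segment = min(period, work - done) + C    # work chunk + checkpoint
        failure_in = rng.expovariate(1.0 / mtbf)  # memoryless, redrawn per try
        if failure_in >= segment:                 # chunk saved by its checkpoint
            elapsed += segment
            done += segment - C
        else:                                     # work since checkpoint is lost
            elapsed += failure_in + D + R
    return elapsed

rng = random.Random(42)
runs = [run_to_completion(work=1000.0, period=100.0, C=5.0, D=2.0, R=5.0,
                          mtbf=500.0, rng=rng) for _ in range(1000)]
print(sum(runs) / len(runs))  # average makespan under failures
```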

Slides 44–46

Problem: CoSched
A pack:
- Set of n parallel applications executed on p processors
- A new pack can start its execution only when the previous pack has finished
[Figure: Gantt charts of a pack T1, T2, T3 finishing at tf, without an error and with an error striking one application]
Failures leave the load unbalanced!

Slides 47–48

Example
[Figure: Gantt chart of T1, T2, T3 on p processors; T2 finishes first]
Redistribution when T2 releases its processors
[Figure: the processors released by T2 are redistributed to T1 and T3]

Slides 49–50

Example (continued)
[Figure: an error strikes T2 before tf; T2 rolls back while T1 and T3 continue]
How do we compute the new execution time of T3? Should we give the processors of T1 to T3?

Slide 51

Model
- $n$ independent parallel applications $T_1, T_2, \dots, T_n$
- Execution platform with $p$ identical processors
- Each application is malleable: its number of processors can change at any time
- Each application is a divisible-load application

Problem: CoSched
Minimize the maximum of the expected completion times of n applications executed on p processors subject to failures. Redistributions are allowed only when an application completes its execution or is struck by a failure.

Slide 52

Application model
- $n$ independent parallel applications $T_1, T_2, \dots, T_n$
- Execution platform with $p$ identical processors
- Each application is malleable: its number of processors $j$ can change at any time, and we know the fault-free execution time $t_{i,j}$ of $T_i$ on $j$ processors
- Each application is divisible (divisible load)

Slide 53

Complexity without redistribution
Theorem 1
The CoSched problem without redistributions can be solved in polynomial time $O(p \log n)$, where $p$ is the number of processors and $n$ is the number of applications.
- Each application starts with two processors
- The $p - 2n$ remaining processors are allocated two by two, greedily, to the longest application (see the sketch below)
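A minimal sketch of that greedy allocation; a max-heap keyed on execution time makes each of the O(p) allocation steps cost O(log n). The Amdahl profile below stands in for the known fault-free times $t_{i,j}$, and all numbers are illustrative.

```python
import heapq

def greedy_allocation(t_seq: list[float], s: list[float], p: int) -> list[int]:
    n = len(t_seq)
    assert p >= 2 * n, "need at least two processors per application"

    def time_of(i: int, j: int) -> float:  # t_{i,j}, Amdahl profile
        return s[i] * t_seq[i] + (1 - s[i]) * t_seq[i] / j

    alloc = [2] * n
    heap = [(-time_of(i, 2), i) for i in range(n)]  # max-heap via negation
    heapq.heapify(heap)
    remaining = p - 2 * n
    while remaining >= 2:
        _, i = heapq.heappop(heap)  # currently longest application
        alloc[i] += 2
        remaining -= 2
        heapq.heappush(heap, (-time_of(i, alloc[i]), i))
    return alloc

print(greedy_allocation([100.0, 60.0, 20.0], [0.05, 0.10, 0.20], p=14))
```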

Slide 54

Greedy algorithm when redistributions are allowed
Two applications with execution times $t_{i,j}$ (and work $w_{i,j} = j \cdot t_{i,j}$) on $j$ processors:
- $T_1$: $t_{1,1} = 10$ ($w_{1,1} = 10$), $t_{1,2} = 9$ ($w_{1,2} = 18$), $t_{1,3} = 6$ ($w_{1,3} = 18$)
- $T_2$: $t_{2,1} = 6$ ($w_{2,1} = 6$), $t_{2,2} = 3$ ($w_{2,2} = 6$), $t_{2,3} = 3$ ($w_{2,3} = 9$)
[Figure (a): Greedy uses the largest execution time to allocate processors]
[Figure (b): Greedy-SP uses the best speedup profile to allocate processors]
There are examples where Greedy-SP is not optimal either...

Slide 55

Complexity with redistribution
Theorem 2
With constant redistribution costs and without failures, CoSched is NP-complete (in the strong sense).
Proof: reduction from 3-Partition with distinct integers.

Slide 56

Outline
1. Co-scheduling applications on cache-partitioned systems
   - Model and theory for perfectly parallel applications
   - Experimental results using Cache Allocation Technology
2. Resilient application co-scheduling with processor redistribution
3. Conclusion

Slide 57

Conclusion
Chosen approach:
- Build a realistic theoretical model and study the complexity
- Using theoretical insights, design efficient polynomial-time heuristics
- Evaluate using simulations with realistic inputs
- Challenge our solution through real experiments on a dedicated platform

Slide 58

List of publications

Book chapters
[B1] G. Aupy, A. Benoit, L. Pottier, P. Raghavan, Y. Robert, and M. Shantharam. Co-scheduling high-performance computing applications. In: Big Data Management and Processing, ed. Kuan-Ching Li, Hai Jiang, and Albert Zomaya. Chapman and Hall/CRC Press, 2017.

International peer-reviewed journals
[J1] A. Benoit, L. Pottier, and Y. Robert. Resilient co-scheduling of malleable applications. International Journal of High Performance Computing Applications (IJHPCA), 2017.
[J2] G. Aupy, A. Benoit, S. Dai, L. Pottier, P. Raghavan, Y. Robert, and M. Shantharam. Co-scheduling Amdahl applications on cache-partitioned systems. International Journal of High Performance Computing Applications (IJHPCA), 2017.
[J3] G. Aupy, A. Benoit, B. Goglin, L. Pottier, and Y. Robert. Co-scheduling HPC workloads on cache-partitioned CMP platforms. International Journal of High Performance Computing Applications (IJHPCA), 2018.

International peer-reviewed conferences
[C1] A. Benoit, L. Pottier, and Y. Robert. Resilient application co-scheduling with processor redistribution. 45th International Conference on Parallel Processing (ICPP), 2016.
[C2] A. Benoit, S. Perarnau, L. Pottier, and Y. Robert. A performance model to execute workflows on high-bandwidth-memory architectures. 47th International Conference on Parallel Processing (ICPP), 2018.
[C3] G. Aupy, A. Benoit, B. Goglin, L. Pottier, and Y. Robert. Co-scheduling HPC workloads on cache-partitioned CMP platforms. IEEE Cluster, 2018.

International peer-reviewed workshops
[W1] G. Aupy, A. Benoit, L. Pottier, P. Raghavan, Y. Robert, and M. Shantharam. Co-scheduling algorithms for cache-partitioned systems. 19th Workshop on Advances in Parallel and Distributed Computational Models (IPDPSW), 2017.