ExaAM: John Turner (PI, ORNL), Jim Belak (Co-PI, LLNL), Balasubramaniam Radhakrishnan (ORNL), Philip Fackler (ORNL), Younggil Song (ORNL), Stephen Nichols (ORNL), Jean-Luc Fattebert (ORNL), Chris Newman (LANL)
AEOLUS: Karen Willcox (Co-Director, UT-Austin), Omar Ghattas (Co-Director, UT-Austin), John Turner (ORNL), Balasubramaniam Radhakrishnan (ORNL), George Biros (UT-Austin), Yuanxun Bao (UT-Austin), Yigong Qin (UT-Austin), Parisa Khodabakhshi (UT-Austin), Rudy Geelen (UT-Austin), Olena Burkovska (ORNL), Max Gunzburger (FSU, UT-Austin), Lianghao Cao (UT-Austin), Joshua Chen (UT-Austin), Fengyi Li (UT-Austin), Tinsley Oden (UT-Austin), Peng Chen (UT-Austin), Dingcheng Luo (UT-Austin), Youssef Marzouk (MIT), Ricardo Baptista (MIT)
NOTE: In some cases I will be relaying work of others that I was not directly involved in, and I will include a note on those slides.
…of usable computational power. The effects can take many forms:
• 2D instead of 3D
• Binary surrogate alloy
• Simplified free-energy formulations
• Other missing physics
• Insufficient sample size
• No uncertainty quantification
I think our community believes these limits are real, not just complication for complication's sake.
…scheduled to deploy the world's first exascale computer later this year: >1.5 ExaFlops, 4 AMD GPUs/node. Much bigger than the already very big top supercomputers:
• #1 on the Top500 list: Fugaku (Kobe, Japan), 0.442 ExaFlops, 7,630,848 Arm A64FX cores
• #2 on the Top500 list: Summit (Oak Ridge, US), 0.149 ExaFlops, 191,664 IBM Power9 cores, 26,136 NVIDIA Tesla V100 GPUs
You can apply to use the DOE machines; no need to be DOE-funded or US-based.
Exascale simulation
• Perform one simulation on all/most of an exascale computer
• One (set of coupled) PDE(s)
• This is a big lift: the equivalent of 260,000 V100 GPUs
Exascale problem
• Solve one problem using phase-field on all/most of an exascale computer
• Ex.: many simulations to predict microstructure throughout an AM part, with UQ
• With job packing and workflow managers, this is a recognized use case
• Wall time per simulation needs to be reasonable, so scaling still matters
Can be thought of as a sliding scale: 1 simulation on 260,000 GPUs to 260,000 coordinated simulations on 1 GPU each.
…microsegregation from solidification cells on the level of 0.5–1 μm, with thin domains on the nm scale. Also from NIST, precipitate dimensions range from 8 nm to 900 nm. For a standard anneal (800 °C, 2 h), the diffusion length is ~0.5 μm. (Stoudt et al., IMMI, 9, 2020; Zhang et al., Acta Mater., 152, 2018.)
So what does this mean? To study nucleation, growth, and coarsening for multiple cells, we need:
• Grid spacing ~1 nm (precipitate thickness)
• Domain ~2 μm x 2 μm x 1 μm (precipitate length, multiple cells)
• Time ~2 hours (annealing time)
That is a 2048 x 2048 x 1024 grid (4.3 billion points) with 2-4 compositions and 12 order parameters (65 billion DoF).
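For reference, the degree-of-freedom count follows directly from these numbers (assuming the 2-4 compositions are counted as roughly 3 field variables alongside the 12 order parameters; that split is my assumption):

$$
N_{\mathrm{grid}} = 2048 \times 2048 \times 1024 \approx 4.3\times 10^{9}, \qquad
N_{\mathrm{DoF}} \approx N_{\mathrm{grid}}\,(n_c + n_\eta) \approx 4.3\times 10^{9} \times (3 + 12) \approx 6.5\times 10^{10}.
$$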
• Very few phase-field simulations of full melt pools with cells/dendrites exist, even in 2D, let alone 3D
• Solidification cells on the level of 0.5–1 μm
• Need grid spacing ~10 nm
• Melt-pool radius ~50 μm
2D, half melt pool: 5,000 x 5,000 grid (25 million grid points)
3D, quarter spot weld: 5,000 x 5,000 x 5,000 grid (125 billion grid points)
Stoudt et al., IMMI, 9, 2020.
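The grid sizes follow from the stated resolution and melt-pool dimensions (a rough estimate; the simple one-radius-per-direction geometry is my assumption):

$$
N_{\mathrm{1D}} \approx \frac{50\ \mu\mathrm{m}}{10\ \mathrm{nm}} = 5{,}000, \qquad
N_{\mathrm{2D}} \approx 5{,}000^2 = 2.5\times 10^{7}, \qquad
N_{\mathrm{3D}} \approx 5{,}000^3 = 1.25\times 10^{11}.
$$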
ExaAM: part of the Exascale Computing Project, for additive manufacturing
• Focus: the exascale problem of incorporating localized microstructure effects in a part-scale simulation
• Also strongly interested in pushing simulations toward the exascale
AEOLUS: a DOE applied math center
• Optimal control under uncertainty, UQ, optimal experimental design, multifidelity methods, reduced-order modeling
• Additive manufacturing is one of two application areas
• One focus is on how applied math methods can direct the efficient use of 100s, 1000s, etc. of high-fidelity phase-field simulations
• Not focused on exascale, but the methods are very relevant
1. Want to be able to effectively use GPUs, lots of them
2. Want something open source and flexible enough to use in other contexts
Does this already exist?
• CPU-only frameworks: MOOSE, FiPy, PRISMS-PF, Pace3D, FEniCS
• Single-purpose CPU codes: many codes
• Single-purpose GPU codes: Shimokawabe et al., SC'11, 2011 [4,000 GPUs]; Zhu et al., AIP Adv., 2018 [21 GPUs]
• GPU-capable frameworks: ? MOOSE? (Krol et al., Prog. Sys. Eng., 2020 [1 GPU])
• Implicit / Explicit
• Finite Difference / Finite Volume / Finite Element / Pseudospectral
• C++ / Fortran
• Library-Centric / Minimal Dependencies
• Pre-existing / New in ExaAM
• CUDA / HIP / OpenMP / Kokkos / RAJA
Disclaimer: I'm lightly involved in AMPE, not involved in Tusas or MEUMAPPS-SL

AMPE (github.com/LLNL/AMPE)
• ExaAM team: Jean-Luc Fattebert (ORNL)
• History: ~10 years old (~6 years before ExaAM); started at LLNL, Jean-Luc moved to ORNL
• Models: KKS, dilute binary, pure material, grain growth
• Solver details: FV, implicit, multigrid-preconditioned JFNK, structured mesh
• Key dependencies: SUNDIALS, hypre, SAMRAI, RAJA
• Strengths: flexible governing equations, quaternions for polycrystals, adaptive time stepping, scalability, CALPHAD integration, (dormant) adaptive meshing
• References: Dorr et al., J. Comp. Phys., 229(3), 2010; Fattebert et al., Acta Materialia, 62, 2014

Tusas (github.com/chrisknewman/tusas)
• ExaAM team: Chris Newman (LANL)
• History: ~6 years old (~2 years before ExaAM)
• Models: KKS, dilute binary, pure material, grain growth, Cahn-Hilliard, linear elasticity, …
• Solver details: FE, implicit, multigrid-preconditioned JFNK, unstructured mesh
• Key dependencies: Trilinos (Kokkos, ML, MueLu, NOX, Belos, AztecOO, Rythmos)
• Strengths: flexible governing equations, quaternions for polycrystals, adaptive time stepping, scalability, GPU utilization, body-fitted meshes
• Reference: Ghosh et al., J. Comp. Phys. (submitted)

MEUMAPPS-SL (unreleased)
• ExaAM team: Balasubramaniam Radhakrishnan (ORNL)
• History: ~4 years old (concurrent start with ExaAM and the HPC4Mfg project)
• Models: KKS
• Solver details: finite difference, explicit, structured mesh
• Strengths: neighbor search for polycrystals, CALPHAD integration, small source code aids rapid prototyping
• Reference: Radhakrishnan et al., Metals, 9, 2019
…solidification
• Laser melting of Cu-Ni thin film (AMPE): Perron et al., Mod. Sim. Mater. Sci. Eng., 26, 2018
• Additive manufacturing of Ti-Nb (AMPE): Roehling et al., JOM, 70(8), 2018
• Directional solidification of Al-Cu (Tusas): Ghosh et al., J. Comp. Phys. (submitted)
• Additive manufacturing of Ni-Fe-Nb (MEUMAPPS-SL): Radhakrishnan et al., Metals, 9, 2019
Disclaimer: I'm lightly involved in AMPE, not involved in Tusas or MEUMAPPS-SL
• AMPE: CALPHAD free energies for KKS models
  – Requires solving a nonlinear system of equations, and care not to diverge outside the physical bounds
  – The nonlinear system is pointwise, independent of the spatial discretization (see the sketch below)
• Jean-Luc is spinning off the CALPHAD part of AMPE as Thermo4PFM
  – Currently going through the ORNL software release process
  – Parses CALPHAD input, calculates homogeneous free energies and their derivatives, calculates KKS single-phase compositions
  – Can be integrated into any phase-field code (that can link with C++)
  – Can be run on GPUs (tested with OpenMP Target, planned tests with Kokkos)
Disclaimer: I'm lightly involved in AMPE
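To make "pointwise" concrete, here is a minimal sketch of the kind of per-grid-point KKS partitioning solve such a library performs. This is not the Thermo4PFM API; the parabolic free energies (standing in for CALPHAD expressions), the names, and the simple Newton iteration with clamping are illustrative assumptions.

```cpp
#include <cmath>
#include <cstdio>

// Per-point KKS partitioning: given total composition c and interpolation h,
// find phase compositions (cs, cl) such that
//   h*cs + (1-h)*cl = c          (mass conservation)
//   fs'(cs) = fl'(cl)            (equal diffusion potentials)
// Parabolic free energies stand in for CALPHAD expressions in this sketch.
struct Parabola {
  double A, c0;                              // f(c) = 0.5*A*(c - c0)^2
  double dF(double c) const { return A * (c - c0); }
  double d2F(double /*c*/) const { return A; }
};

bool solveKKS(double c, double h, const Parabola& fs, const Parabola& fl,
              double& cs, double& cl, int maxIter = 50, double tol = 1e-12) {
  for (int it = 0; it < maxIter; ++it) {
    const double r1 = h * cs + (1.0 - h) * cl - c;  // mass-balance residual
    const double r2 = fs.dF(cs) - fl.dF(cl);        // potential-equality residual
    if (std::abs(r1) < tol && std::abs(r2) < tol) return true;
    // 2x2 Newton system:  [ h     1-h  ] [dcs]   [-r1]
    //                     [ fs'' -fl'' ] [dcl] = [-r2]
    const double a11 = h, a12 = 1.0 - h;
    const double a21 = fs.d2F(cs), a22 = -fl.d2F(cl);
    const double det = a11 * a22 - a12 * a21;
    if (std::abs(det) < 1e-30) return false;
    cs += (-r1 * a22 + r2 * a12) / det;
    cl += (-a11 * r2 + a21 * r1) / det;
    // Keep iterates inside the physical composition bounds [0, 1].
    cs = std::fmin(std::fmax(cs, 0.0), 1.0);
    cl = std::fmin(std::fmax(cl, 0.0), 1.0);
  }
  return false;
}

int main() {
  Parabola fs{200.0, 0.10}, fl{150.0, 0.40};  // illustrative coefficients
  double cs = 0.2, cl = 0.3;                  // initial guesses
  if (solveKKS(/*c=*/0.25, /*h=*/0.5, fs, fl, cs, cl))
    std::printf("cs = %.6f  cl = %.6f\n", cs, cl);
  return 0;
}
```

Because this solve involves only local data, it can be wrapped in any parallel loop abstraction, which is why a library like Thermo4PFM can plug into codes with very different spatial discretizations.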
AMPE
• Full GPU offloading in progress
• Three aspects:
  – hypre preconditioner: done
  – Thermo4PFM: currently testing
  – SAMRAI loops: planned, need to be re-written with RAJA (see the sketch below)
• GPU speedup:
  – hypre: 9x speedup (in proxy app, vs. MPI)
  – Thermo4PFM: 4.5x speedup (vs. OpenMP)
  – SAMRAI loops: N/A
Tusas
• Full GPU offloading
• Combination of Kokkos (plus other parts of Trilinos) and CUDA/HIP
• GPU speedup: 6x overall (compared to MPI+OpenMP)
MEUMAPPS-SL
• N/A
Note: All GPU speedups are relative to a Summit node, comparing some multiple of 1 GPU vs 7 CPU cores
Disclaimer: I'm lightly involved in AMPE, not involved in Tusas or MEUMAPPS-SL
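As an illustration of what "re-written with RAJA" means for a structured-grid loop, here is a minimal sketch. It is not AMPE or SAMRAI code; the kernel, array names, and execution-policy selection are my assumptions.

```cpp
#include "RAJA/RAJA.hpp"

// Select a device execution policy when a GPU backend is enabled;
// fall back to sequential execution otherwise.
#if defined(RAJA_ENABLE_HIP)
using ExecPolicy = RAJA::hip_exec<256>;
#elif defined(RAJA_ENABLE_CUDA)
using ExecPolicy = RAJA::cuda_exec<256>;
#else
using ExecPolicy = RAJA::seq_exec;
#endif

// Explicit Allen-Cahn-style update of one order parameter, as an example of
// the kind of pointwise loop that would be offloaded. phi and lap_phi are
// assumed to be device-accessible arrays of length n.
void updateOrderParameter(double* phi, const double* lap_phi,
                          double mobility, double kappa, double dt,
                          RAJA::Index_type n) {
  RAJA::forall<ExecPolicy>(RAJA::RangeSegment(0, n),
      [=] RAJA_HOST_DEVICE (RAJA::Index_type i) {
        const double p = phi[i];
        const double dfdphi = p * p * p - p;   // double-well derivative
        phi[i] = p - dt * mobility * (dfdphi - kappa * lap_phi[i]);
      });
}
```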
…colonies in Ti-6Al-4V (MEUMAPPS-SS): Radhakrishnan et al., Met. Mater. Trans. A, 47A, 2016
Localized δ phase in Inconel 625 (MEUMAPPS-SS): Song et al., Phys. Rev. Mater. (in press)
γ″ phase in Inconel 625 (MEUMAPPS C++): unpublished
1. Automatic offload with OpenACC
  – Observed 8x speedup with the "-acc" flag
  – OpenACC support on Frontier unclear
2. OpenMP Target (a sketch of this offload pattern follows this slide)
  – Fully supported on Frontier; working on a Frontier test machine (AMD GPUs)
  – Substantial re-write of the code
  – 18x GPU speedup for offloaded loops
  – Still CPU-based FFT
  – Overall only 1.2x GPU speedup
3. Re-implementation as GPU-native in C++
Note: All GPU speedups are relative to a Summit node, comparing some multiple of 1 GPU vs 7 CPU cores
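For readers unfamiliar with directive-based offload, a minimal sketch of the OpenMP Target pattern is below. The actual MEUMAPPS-SS port is Fortran and uses the Fortran equivalents of these directives; the C++ kernel and variable names here are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Illustrative directive-based offload of a pointwise phase-field update.
void relaxOrderParameter(std::vector<double>& phi,
                         const std::vector<double>& lap_phi,
                         double mobility, double kappa, double dt) {
  const std::size_t n = phi.size();
  double* p = phi.data();
  const double* lap = lap_phi.data();

  // Map the arrays to the device, run one explicit relaxation step on the
  // GPU, and copy phi back to the host.
  #pragma omp target teams distribute parallel for \
      map(tofrom: p[0:n]) map(to: lap[0:n])
  for (std::size_t i = 0; i < n; ++i) {
    const double v = p[i];
    const double dfdphi = v * v * v - v;       // double-well derivative
    p[i] = v - dt * mobility * (dfdphi - kappa * lap[i]);
  }
}
```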
Profiling: Where is the time spent for MEUMAPPS-SS (C++)?
• Total time: 254.3 s (CPU cores) vs 51.3 s (CPU cores + 6 GPUs), 5x faster
• Kokkos: 111.6 s (44%) vs 6.7 s (13%), 17x faster
• FFTs: 131.7 s (52%) vs 41.3 s (81%), 3x faster
• Other: 10.9 s (4%) vs 3.3 s (6%), 3x faster
Overall GPU speedup of 5x per node (~35x for 6 GPUs vs 6 CPU cores)
Much larger GPU speedup for Kokkos loops than for FFTs (a sketch of such a loop follows this slide)
GPU calculations are dominated by FFT time; the FFTs have all the MPI communication
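For context, the "Kokkos" row covers pointwise loops of roughly this shape, which accelerate well because they involve no MPI and no FFT. This is a minimal sketch, not MEUMAPPS-SS source; the View names and the kernel are assumptions.

```cpp
#include <Kokkos_Core.hpp>

// Explicit update of one order-parameter field, the kind of local kernel
// that maps cleanly onto a GPU via Kokkos.
void updateField(Kokkos::View<double*> phi,
                 Kokkos::View<const double*> lap_phi,
                 double mobility, double kappa, double dt) {
  Kokkos::parallel_for(
      "phase_field_update", phi.extent(0),
      KOKKOS_LAMBDA(const int i) {
        const double p = phi(i);
        const double dfdphi = p * p * p - p;   // double-well derivative
        phi(i) = p - dt * mobility * (dfdphi - kappa * lap_phi(i));
      });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    Kokkos::View<double*> phi("phi", n), lap("lap_phi", n);
    updateField(phi, lap, 1.0, 0.5, 1.0e-3);
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```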
…see some decent strong scaling in the 24-192 GPU range
840³: decent strong scaling through 384 GPUs, still lower wall time at 768 GPUs
Weak scaling is pretty poor
• FFT library performance
  – Working with the heFFTe team to improve scaling
  – ECP has another FFT team (FFTX) plus a new benchmarking effort
  – The Fortran code has better CPU scaling, which points to opportunities for heFFTe
• Reduce FFTs
  – Kokkos loops get much better GPU speedups and have no MPI communication
  – Want to trade more non-FFT work for fewer FFTs
• Improved physics
  – Add new capabilities under development in MEUMAPPS-SS (Fortran)
  – Add support for full CALPHAD free energies with Thermo4PFM
• Frontier
  – MEUMAPPS-SS (C++) is up and running on AMD GPUs on an ECP test machine
• Open source release
…developing 4 phase-field codes
• Mix of methods
• Current target applications are solidification and solid-state transformations
  – The codes have the physics capabilities for real problems
  – But they are flexible enough to modify for other applications
• Encouraging results on GPUs
  – MEUMAPPS-SS (C++) and Tusas have 5-6x speedups (with a ratio of 1 GPU to 7 CPU cores)
  – MEUMAPPS-SS (C++) with strong scaling to hundreds of GPUs
  – Tusas with strong and weak scaling to 24,000 GPUs (!)
• On track for deployment to Frontier
BM3 upload: AMPE
BM1a upload: MEUMAPPS-SS (C++)
• 128x128 grid
• 1 million time steps in 35 minutes on 1 CPU core
• Highlights the importance of adaptive time stepping
Other uses
• Tusas used BM3 for verification
• Plans to use a 3D version of BM3 for a performance test between AMPE and Tusas
• MEUMAPPS-SS (C++) used a simplified version of BM2 for initial testing and benchmarking
…at NIST, microsegregation from solidification cells on the level of 0.5–1 μm, with thin domains on the nm scale. Also from NIST, precipitate dimensions range from 8 nm to 900 nm. For a standard anneal (800 °C, 2 h), the diffusion length is ~0.5 μm. (Stoudt et al., IMMI, 9, 2020; Zhang et al., Acta Mater., 152, 2018.)
So what does this mean? To study nucleation, growth, and coarsening for multiple cells, we need:
• Grid spacing ~1 nm (precipitate thickness)
• Domain ~2 μm x 2 μm x 1 μm (multiple cells)
• Time ~2 hours (annealing time)
That is a 2048 x 2048 x 1024 grid (4.3 billion points) with 2-4 compositions and 12 order parameters (65 billion DoF).
Can we do it? heFFTe strong scales to at least 6,144 GPUs for a 1024³ grid (Ayala et al., Inter. Conf. Comp. Sci., 2020). This is a 4x bigger domain: 4 x 6,144 = 24,576 GPUs, roughly all of Summit. How far can we push on Frontier?
Can we do it? Tusas simulations have reached 4.3 billion DoF, i.e. about 1 billion elements: a 1000³ domain in 3D, or 32,000² in 2D (Ghosh et al., J. Comp. Phys., submitted). Summit -> Frontier gives us 10x the flops. We need "just" another 10x…
• Very few phase-field simulations of full melt pools with cells/dendrites exist, even in 2D, let alone 3D
• Solidification cells on the level of 0.5–1 μm
• Need grid spacing ~10 nm
• Melt-pool radius ~50 μm
2D, half melt pool: 5,000 x 5,000 grid (25 million grid points)
3D, quarter spot weld: 5,000 x 5,000 x 5,000 grid (125 billion grid points)
…computers are almost here. Phase-field modeling has challenges that scale can solve. But exascale machines are big, and it isn't easy to use large fractions of them efficiently. ExaAM is meeting this challenge with AMPE, Tusas, MEUMAPPS-SL, and MEUMAPPS-SS.
AEOLUS: a DOE applied math center (MMICC)
• Optimal control under uncertainty, UQ, optimal experimental design, multifidelity methods, reduced-order modeling
• Two application areas:
  – Additive manufacturing
  – Block co-polymers
• One focus is on how applied math methods can direct the efficient use of 100s, 1000s, etc. of high-fidelity phase-field simulations
• Not focused on exascale, but the methods are very relevant
…a non-local variant of Cahn-Hilliard called the Ohta-Kawasaki model
• Emphasis is on the final steady-state solution
Disclaimer: I'm not involved in the copolymer applications
• Direct energy minimization method is 1,000x faster than gradient flow (Cao, Ghattas, Oden)
• Model inversion to infer model parameters from noisy experimental microstructure (Baptista, Cao, Chen, Ghattas, Li, Marzouk, Oden)
• Optimal control of substrate chemistry to direct fine-scale self-assembly (Cao, Chen, Chen, Ghattas, Luo, Oden)
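For readers unfamiliar with the model, one standard way to write the Ohta-Kawasaki energy (this normalization is my own choice, not necessarily the form used by the AEOLUS team) adds a long-range term to the usual Cahn-Hilliard energy:

$$
E[u] \;=\; \int_\Omega \left( \frac{\varepsilon^2}{2}\,\lvert\nabla u\rvert^2 + W(u) \right) dx
\;+\; \frac{\sigma}{2} \int_\Omega \int_\Omega G(x,y)\,\bigl(u(x)-\bar{u}\bigr)\bigl(u(y)-\bar{u}\bigr)\,dx\,dy,
$$

where $W$ is a double-well potential, $\bar{u}$ is the mean composition, $G$ is the Green's function of the Laplacian (the non-local part), and $\sigma > 0$ sets the strength of the long-range interaction; $\sigma = 0$ recovers Cahn-Hilliard.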
…solidification phenomena
• Directional solidification of alloys is the target, but pure material is used as a proving ground
• Non-local Cahn-Hilliard can have perfectly sharp interfaces. Can we create a solidification model like this? (Burkovska, DeWitt, Radhakrishnan, Gunzburger)
• Solidification reduced-order model using operator inference, leveraging equation structure for the reduced representation (Khodabakhshi, Geelen, DeWitt, Radhakrishnan, Willcox); see the formulation below
• Multiscale modeling for AM with validation from full-melt-pool simulations (Bao, Qin, DeWitt, Radhakrishnan, Biros)
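As a reminder of what operator inference does (a generic quadratic formulation; the specific structure exploited in the AEOLUS solidification ROM may differ), snapshots are projected onto a POD basis $V$ and reduced operators are fit by least squares:

$$
\min_{\hat{A},\,\hat{H},\,\hat{B}} \;\sum_{j=1}^{k} \left\lVert \dot{\hat{q}}_j - \hat{A}\hat{q}_j - \hat{H}\bigl(\hat{q}_j \otimes \hat{q}_j\bigr) - \hat{B} u_j \right\rVert_2^2,
$$

where $\hat{q}_j = V^\top q_j$ are reduced snapshots, $\dot{\hat{q}}_j$ their (estimated) time derivatives, and $u_j$ inputs. "Leveraging equation structure" means choosing which reduced operators to include based on the form of the governing phase-field equations.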
Test if targeted, thin phase-field simulations can give insight into the dendrite/cell structure in a melt pool:
1. Perform a melt-pool-scale thermal simulation
2. Extract the gradient and velocity for lines normal to the thermal gradient (see below)
3. Perform transient phase-field simulations along those lines
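A common way to define the quantities extracted in step 2 (my notation, assuming the velocity is taken from the motion of the liquidus isotherm):

$$
G = \lvert \nabla T \rvert \Big|_{T = T_{\mathrm{liq}}}, \qquad
V = -\,\frac{\partial T / \partial t}{\lvert \nabla T \rvert} \Bigg|_{T = T_{\mathrm{liq}}},
$$

i.e. the thermal gradient and the normal velocity of the liquidus isotherm, evaluated along each sampling line and passed as boundary/driving conditions to the transient phase-field runs.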
…(a full-melt-pool simulation). The only big discrepancy is the primary arm spacing. Why? It's a geometric effect from the converging dendrites: the cell size can't adjust fast enough (Gurevich et al., PRE, 2010). It can't be seen in rectangular domains. Is this a known phenomenon?
• Can we test "line models" in 3D using ExaAM codes for full-melt-pool simulations?
• If they work, can we develop efficient, accurate reduced-order models for the "line models"?
• If those work, can we solve optimal control problems for the heat source with dendrite-scale resolution?
• Stay tuned!
…of freeing phase-field simulations from current computational constraints. But we need codes that use them effectively, and we need to define high-value problems to solve. Hopefully the work we're doing in ExaAM and AEOLUS helps the community prepare to solve exascale problems, either directly through our codes and methods, or by learning from what we've done (good or bad).
Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative. This work was supported by the US Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under grant number DE-SC0019303 as part of the AEOLUS Center. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.