
Preparing for Exascale Phase-Field Simulations: Phase-Field Modeling in ExaAM and AEOLUS

Daniel Wheeler

July 21, 2022

Transcript

  1. Preparing for Exascale Phase-Field Simulations: Phase-Field Modeling in ExaAM and

    AEOLUS Stephen DeWitt Computational Sciences and Engineering Division Oak Ridge National Laboratory
  2. People contributing to the work in the

    presentation ExaAM: John Turner (PI, ORNL) Jim Belak (Co-PI, LLNL) Balasubramaniam Radhakrishnan (ORNL) Philip Fackler (ORNL) Younggil Song (ORNL) Stephen Nichols (ORNL) Jean-Luc Fattebert (ORNL) Chris Newman (LANL) AEOLUS: Karen Willcox (Co-Director, UT-Austin) Omar Ghattas (Co-Director, UT-Austin) John Turner (ORNL) Balasubramaniam Radhakrishnan (ORNL) George Biros (UT-Austin) Yuanxun Bao (UT-Austin) Yigong Qin (UT-Austin) Parisa Khodabakhshi (UT-Austin) Rudy Geelen (UT-Austin) Olena Burkovska (ORNL) Max Gunzburger (FSU, UT-Austin) NOTE: In some cases I will be relaying some work of others that I was not directly involved in and will include a note on those slides Lianghao Cao (UT-Austin) Joshua Chen (UT-Austin) Fengyi Li (UT-Austin) Tinsley Oden (UT-Austin) Peng Chen (UT-Austin) Dingcheng Luo (UT-Austin) Youssef Marzouk (MIT) Ricardo Baptista (MIT)
  3. Phase-field simulations are limited by a lack

    of usable computational power. The effects can take many forms: 2D instead of 3D, binary surrogate alloys, simplified free energy formulations, other missing physics, insufficient sample sizes, no uncertainty quantification. I think our community believes these limits are real, not just more complication for complication's sake.
  4. Phase-field simulations are limited by a lack

    of usable computational power But why? Two Possibilities: Lack of computational resources Codes can’t scale to take advantage of existing resources
  5. The coming exascale era… US DOE is

    scheduled to deploy the world’s first exascale computer later this year >1.5 ExaFlops, 4 AMD GPUs/node Much bigger than the already very big top supercomputers #1 Top500 List: Fugaku (Kobe, Japan) 0.442 ExaFlops 7,630,848 Arm A64FX cores #2 Top500 List: Summit (Oak Ridge, US) 0.149 ExaFlops 191,664 IBM Power9 cores 26,136 NVIDIA Tesla V100 GPUs You can apply to use the DOE machines, no need to be DOE-funded or US-based
  6. Is phase-field modeling ready for exascale? (And

    what does that even mean? Exascale what?)
  7. Exascale simulations vs. exascale problems Exascale Simulation

    • Perform one simulation on all/most of an exascale computer • One (set of coupled) PDE(s) • This is a big lift – the equivalent of 260,000 V100 GPUs Exascale Problem • Solve one problem using phase-field on all/most of an exascale computer • Ex. Many simulations to predict microstructure throughout an AM part with UQ • With job-packing and workflow managers, this is a recognized use case • Wall time per simulation needs to be reasonable, so scaling still matters Can be thought of as a sliding scale: 1 simulation on 260,000 GPUs to 260,000 coordinated simulations on 1 GPU each
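
A minimal sketch of the job-packing mode described above (many independent runs, one per GPU). The `./phase_field_sim` executable and the input decks are hypothetical placeholders; a production campaign would use a batch scheduler or workflow manager rather than a script like this.

```python
# Pack many independent phase-field runs onto the GPUs of a node, one process
# per GPU at a time. "./phase_field_sim" and the case files are placeholders.
import os
import queue
import subprocess
from concurrent.futures import ThreadPoolExecutor

N_GPUS = 6                                           # e.g., one Summit node
cases = [f"case_{i:04d}.toml" for i in range(240)]   # hypothetical input decks

free_gpus = queue.Queue()
for g in range(N_GPUS):
    free_gpus.put(g)

def run_case(case):
    gpu = free_gpus.get()            # block until a GPU is free
    try:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))  # pin to one GPU
        return subprocess.run(["./phase_field_sim", case], env=env).returncode
    finally:
        free_gpus.put(gpu)           # hand the GPU to the next case

with ThreadPoolExecutor(max_workers=N_GPUS) as pool:
    codes = list(pool.map(run_case, cases))
print(f"{codes.count(0)}/{len(cases)} runs exited cleanly")
```
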
  8. Precipitation in AM Inconel 625

    From NIST, microsegregation from solidification cells on the level of 0.5 – 1 μm with thin domains on the nm scale. Also from NIST, precipitate dimensions range from 8 nm – 900 nm. For a standard anneal (800 °C, 2 h), the diffusion length is ~0.5 μm. Stoudt et al., IMMI, 9, 2020. Zhang, et al., Acta Mater., 152, 2018. So what does this mean? To study nucleation, growth, and coarsening for multiple cells, we need: grid spacing ~1 nm (precipitate thickness); domain ~2 μm x 2 μm x 1 μm (precipitate length, multiple cells); time ~2 hours (annealing time). That is a 2048 x 2048 x 1024 grid (4.3 billion points) with 2-4 compositions and 12 order parameters (65 billion DoF).
  9. Full melt pool solidification simulations • Very

    few phase-field simulations of full melt pools with cells/dendrites – Even in 2D, let alone 3D • Solidification cells on the level of 0.5 – 1 μm • Need grid spacing ~10 nm • Melt-pool radius ~50 μm 2D, half melt pool: 5,000 x 5,000 grid (25 million grid points) 3D, quarter spot weld: 5,000 x 5,000 x 5,000 grid (125 billion grid points) Stoudt et al., IMMI, 9, 2020.
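
The sizing arguments on the last two slides are straightforward arithmetic; a back-of-envelope sketch reproducing them is below. The field count (3 compositions, taken from the slide's range of 2-4, plus 12 order parameters) and the one-copy memory figure are illustrative assumptions; the slides round the precipitation grid up to 2048 x 2048 x 1024.

```python
# Back-of-envelope sizing for the two target problems above.
import math

def grid_points(extents_m, spacing_m):
    """Grid points per axis and in total for a uniform grid."""
    n = [round(L / spacing_m) for L in extents_m]
    return n, math.prod(n)

# Precipitation in AM Inconel 625: 2 um x 2 um x 1 um domain, ~1 nm spacing.
n, pts = grid_points((2e-6, 2e-6, 1e-6), 1e-9)
fields = 3 + 12                  # assumed: 3 compositions + 12 order parameters
dofs = pts * fields
print(f"precipitation: {n} -> {pts:.2e} points, {dofs:.2e} DoF, "
      f"~{dofs * 8 / 1e9:.0f} GB per double-precision copy of the state")

# Full melt pool: ~50 um radius, ~10 nm spacing, halved/quartered by symmetry.
n2d, pts2d = grid_points((50e-6, 50e-6), 10e-9)          # 2D half melt pool
n3d, pts3d = grid_points((50e-6, 50e-6, 50e-6), 10e-9)   # 3D quarter spot weld
print(f"2D half melt pool: {n2d} -> {pts2d:.2e} points")
print(f"3D quarter spot weld: {n3d} -> {pts3d:.2e} points")
```
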
  10. Two teams, two approaches

    ExaAM: • Application in the Exascale Computing Project for additive manufacturing • Focus: the exascale problem of incorporating localized microstructure effects in a part-scale simulation • Also strongly interested in pushing simulations toward the exascale. AEOLUS: • A DOE applied math center • Optimal control under uncertainty, UQ, optimal experimental design, multifidelity methods, reduced-order modeling • Additive manufacturing is one of two application areas • One focus is on how applied math methods can direct the efficient use of 100s, 1000s, etc. of high-fidelity phase-field simulations • Not focused on exascale, but the methods are very relevant
  11. Phase-field modeling in ExaAM

    PI: John Turner (ORNL). Co-PI: Jim Belak (LLNL). PF Component Leads: Balasubramaniam "Rad" Radhakrishnan (ORNL), Jean-Luc Fattebert (ORNL), Chris Newman (LANL). The ExaAM workflow (modified slide from John Turner): 0: Full-part build simulation (macroscale thermo-mechanics using assumed properties); 1: As-built microstructure (thermal fluids at the melt-pool scale, microstructure at the grain-scale, microstructure at the dendrite/cell scale; phase-field solidification models); 2: Late-time microstructure (solid-solid phase transformation, during build or heat treatment; phase-field precipitation models); 3: Micromechanical properties (constitutive models from microscale properties); 4: Full-part build simulation (macroscale thermo-mechanics using improved constitutive properties).
  12. Motivation for code development in ExaAM

    1. Want to be able to effectively use GPUs, lots of them. 2. Want something open source and flexible enough to use in other contexts. Does this already exist? CPU-only frameworks: MOOSE, FiPy, PRISMS-PF, Pace3D, FEniCS. Single-purpose CPU codes: many codes. Single-purpose GPU codes: Shimokawabe, et al., SC'11, 2011 [4,000 GPUs]; Zhu, et al., AIP Adv., 2018 [21 GPUs]. GPU-capable frameworks: ? MOOSE? Krol, et al., Prog. Sys. Eng., 2020 [1 GPU].
  13. 14 14 14 ExaAM’s phase-field codes: A variety of approaches

    Implicit / Explicit Finite Difference / Finite Volume / Finite Element / Pseudospectral C++ / Fortran Library-Centric / Minimal Dependencies Pre-existing / New in ExaAM CUDA / HIP / OpenMP / Kokkos / Raja
  14. The solidification codes: AMPE, Tusas, and MEUMAPPS-SL

    Disclaimer: I'm lightly involved in AMPE, not involved in Tusas or MEUMAPPS-SL.
    AMPE (github.com/LLNL/AMPE): ExaAM team: Jean-Luc Fattebert (ORNL). History: ~10 years old (~6 years before ExaAM), started at LLNL; Jean-Luc moved to ORNL. Models: KKS, dilute binary, pure material, grain growth. Solver details: FV, implicit, multigrid-preconditioned JFNK, structured mesh. Key dependencies: Sundials, hypre, SAMRAI, Raja. Strengths: flexible governing equations, quaternions for polycrystals, adaptive time stepping, scalability, CALPHAD integration, (dormant) adaptive meshing. References: Dorr, et al., J. Comp. Phys., 229 (3), 2010; Fattebert, et al., Acta Materialia, 62, 2014.
    Tusas (github.com/chrisknewman/tusas): ExaAM team: Chris Newman (LANL). History: ~6 years old (~2 years before ExaAM). Models: KKS, dilute binary, pure material, grain growth, Cahn-Hilliard, linear elasticity, … Solver details: FE, implicit, multigrid-preconditioned JFNK, unstructured mesh. Key dependencies: Trilinos (Kokkos, ML, MueLu, NOX, Belos, AztecOO, Rythmos). Strengths: flexible governing equations, quaternions for polycrystals, adaptive time stepping, scalability, GPU utilization, body-fitted meshes. Reference: Ghosh, et al., J. Comp. Phys. (submitted).
    MEUMAPPS-SL (unreleased): ExaAM team: Balasubramaniam Radhakrishnan (ORNL). History: ~4 years old (concurrent start with ExaAM and the HPC4Mfg project). Models: KKS. Solver details: finite difference, explicit, structured mesh. Strengths: neighbor search for polycrystals, CALPHAD integration, small source code aids rapid prototyping. Reference: Radhakrishnan, et al., Metals, 9, 2019.
  15. Applications of AMPE, Tusas, and MEUMAPPS-SL for

    solidification. Disclaimer: I'm lightly involved in AMPE, not involved in Tusas or MEUMAPPS-SL. Laser melting of Cu-Ni thin film (AMPE): Perron, et al., Mod. Sim. Mater. Sci. Eng., 26, 2018. Additive manufacturing of Ti-Nb (AMPE): Roehling, et al., JOM, 70 (8), 2018. Directional solidification of Al-Cu (Tusas): Ghosh, et al., J. Comp. Phys. (submitted). Additive manufacturing of Ni-Fe-Nb (MEUMAPPS-SL): Radhakrishnan, et al., Metals, 9, 2019.
  16. AMPE spinoff: Thermo4PFM • A strength of

    AMPE: CALPHAD free energies for KKS models – Requires solving a nonlinear system of equations and care to not diverge outside the physical bounds – Nonlinear system is pointwise, independent of the spatial discretization • Jean-Luc is spinning off the CALPHAD part of AMPE as Thermo4PFM – Currently going through the ORNL software release process – Parse CALPHAD input, calculate homogeneous free energies and their derivatives, calculate KKS single-phase compositions – Can be integrated into any phase-field code (that can link with C++) – Can be run on GPUs (tested with OpenMP Target, planned tests with Kokkos). Disclaimer: I'm lightly involved in AMPE.
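
The "pointwise, independent of the spatial discretization" nonlinear system mentioned above is, for a binary KKS model, the pair of equations (1-h)c_alpha + h c_beta = c and f_alpha'(c_alpha) = f_beta'(c_beta) solved at every grid point. A minimal Newton sketch is below, with parabolic free energies standing in for the CALPHAD expressions that Thermo4PFM actually evaluates; all coefficients are made up for illustration.

```python
# Pointwise KKS solve: given total composition c and interpolation h(phi),
# find phase compositions c_a, c_b with equal diffusion potentials.
# Parabolic free energies f_i(c) = 0.5*A_i*(c - c0_i)^2 are illustrative only.
import numpy as np

A_a, c0_a = 20.0, 0.10
A_b, c0_b = 35.0, 0.70
dfa  = lambda c: A_a * (c - c0_a)   # df_alpha/dc
d2fa = lambda c: A_a
dfb  = lambda c: A_b * (c - c0_b)   # df_beta/dc
d2fb = lambda c: A_b

def kks_phase_compositions(c, h, tol=1e-12, max_iter=50):
    """Solve (1-h)*c_a + h*c_b = c and dfa(c_a) = dfb(c_b) by Newton's method."""
    c_a, c_b = c, c                 # simple initial guess
    for _ in range(max_iter):
        R = np.array([(1.0 - h) * c_a + h * c_b - c,
                      dfa(c_a) - dfb(c_b)])
        if np.linalg.norm(R) < tol:
            break
        J = np.array([[1.0 - h, h],
                      [d2fa(c_a), -d2fb(c_b)]])
        dc_a, dc_b = np.linalg.solve(J, -R)
        c_a, c_b = c_a + dc_a, c_b + dc_b
    return c_a, c_b

print(kks_phase_compositions(c=0.4, h=0.5))
```
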
  17. AMPE, Tusas, and MEUMAPPS-SL on GPUs

    AMPE • Full GPU offloading in progress • Three aspects – Hypre preconditioner: Done – Thermo4PFM: Currently testing – SAMRAI loops: Planned, need to be re-written with Raja • GPU speedup – Hypre: 9x speedup (in proxy app, vs. MPI) – Thermo4PFM: 4.5x speedup (vs. OpenMP) – SAMRAI loops: N/A. Tusas • Full GPU offloading • Combination of Kokkos (plus other parts of Trilinos) and CUDA/HIP • GPU speedup: – 6x speedup overall (comparing to MPI+OpenMP). MEUMAPPS-SL • N/A. Note: All GPU speedups are relative to a Summit node, comparing some multiple of 1 GPU vs 7 CPU cores. Disclaimer: I'm lightly involved in AMPE, not involved in Tusas or MEUMAPPS-SL.
  18. Tusas on (lots of) GPUs 4.3 billion

    DoFs on 24,576 GPUs Disclaimer: I’m not involved in Tusas
  19. The solid-state code: MEUMAPPS-SS

    MEUMAPPS-SS (Fortran; github.com/ORNL/meumapps_ss): ExaAM team: Balasubramaniam Radhakrishnan (lead), Younggil Song, Stephen Nichols, Steve DeWitt (all ORNL). History: ~5 years old (before ExaAM). Models: solid-state KKS. Solver details: pseudospectral, iterative perturbation method for nonlinear elasticity. Key dependencies: P3DFFT, OpenACC/OpenMP. Strengths: scalable pseudospectral solver, arbitrary components and phases, built-in nucleation models, limited dependencies, CPU scaling. Reference: Radhakrishnan et al., Met. Mater. Trans., 47A (2016).
    MEUMAPPS-SS (C++) (unreleased): ExaAM team: Steve DeWitt (lead), Philip Fackler, Younggil Song, Balasubramaniam Radhakrishnan (all ORNL). History: ~1 year old, new in ExaAM. Models: solid-state KKS, Cahn-Hilliard, Allen-Cahn. Solver details: pseudospectral, iterative perturbation method for nonlinear elasticity. Key dependencies: heFFTe/AccFFT, Kokkos. Strengths: scalable pseudospectral solver, arbitrary components and phases, built-in nucleation models, limited dependencies, GPU speedup, flexible interface for governing equations.
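
Both MEUMAPPS-SS variants are pseudospectral. For orientation, here is a generic semi-implicit Fourier-spectral step for a Cahn-Hilliard equation, the kind of update where nearly all of the cost and all of the MPI communication sit in the forward/inverse FFTs that P3DFFT, heFFTe, or AccFFT supply at scale. This is an illustrative sketch with numpy standing in for those libraries; it is not the MEUMAPPS-SS algorithm or its parameters.

```python
# Generic semi-implicit Fourier-spectral Cahn-Hilliard step. numpy.fft stands
# in for distributed FFT libraries; all parameters are illustrative.
import numpy as np

N, dx, dt = 64, 1.0, 0.05
M, kappa, W = 1.0, 1.0, 1.0          # mobility, gradient energy, well height

k = 2.0 * np.pi * np.fft.fftfreq(N, d=dx)
kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
k2 = kx**2 + ky**2 + kz**2           # |k|^2 on the 3D grid

rng = np.random.default_rng(0)
c = 0.5 + 0.05 * rng.standard_normal((N, N, N))   # noisy initial composition

def dfdc(c):
    """Derivative of the double-well bulk free energy W*c^2*(1-c)^2."""
    return 2.0 * W * c * (1.0 - c) * (1.0 - 2.0 * c)

for step in range(200):
    # Linear 4th-order term implicit, nonlinear term explicit:
    # c_hat_new = (c_hat - dt*M*k^2*F[df/dc]) / (1 + dt*M*kappa*k^4)
    c_hat = np.fft.fftn(c)
    c_hat = (c_hat - dt * M * k2 * np.fft.fftn(dfdc(c))) / (1.0 + dt * M * kappa * k2**2)
    c = np.fft.ifftn(c_hat).real

print("mean composition (conserved):", c.mean())
```
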
  20. Applications of MEUMAPPS-SS for solid-state transformation

    Lamellar colonies in Ti-6Al-4V (MEUMAPPS-SS): Radhakrishnan et al., Met. Mater. Trans., 47A (2016). Localized δ phase in Inconel 625 (MEUMAPPS-SS): Song et al., Phys. Rev. Mater. (in press). γ″ phase in Inconel 625 (MEUMAPPS-SS (C++)): unpublished.
  21. GPU strategy for MEUMAPPS-SS

    1. Original emphasis: automatic offload with OpenACC – Observed 8x speedup with the "-acc" flag – OpenACC support on Frontier unclear 2. OpenMP Target – Fully supported on Frontier, working on a Frontier test machine (AMD GPUs) – Substantial re-write of the code – 18x GPU speedup for offloaded loops – Still CPU-based FFT – Overall only 1.2x GPU speedup 3. Re-implementation as GPU-native in C++. Note: All GPU speedups are relative to a Summit node, comparing some multiple of 1 GPU vs 7 CPU cores.
  22. Designed for performance regardless of architecture

    MEUMAPPS-SS (C++) builds on performance-portable libraries with architecture-specific backends. Kokkos (https://github.com/kokkos/kokkos), with Serial, CUDA, HIP, and OpenMP backends, provides performance-portable execution patterns and data structures for the non-FFT code. heFFTe (https://bitbucket.org/icl/heffte), with FFTW, MKL, rocFFT, and cuFFT backends, and AccFFT (https://github.com/amirgholami/accfft), with FFTW and cuFFT backends, sit behind a flexible interface for different FFT library options.
  23. Single-node performance, 168³ grid

    57x speedup obtained: success in improving performance with GPUs. Inconel 625 surrogate, Mo-Nb-Ni, 3 γ″ variants, 12 δ variants. Test on one Summit node: 42 CPU cores / 6 GPUs.
  24. Profiling: Where is the time spent for

    MEUMAPPS-SS (C++)? CPU-only: resources 42 CPU cores; total time 254.3 s; Kokkos 111.6 s (44%), FFTs 131.7 s (52%), other 10.9 s (4%).
  25. Profiling: Where is the time spent for MEUMAPPS-SS (C++)?

    CPU-only (42 CPU cores) vs CPU+GPU (6 CPU cores + 6 GPUs): total time 254.3 s vs 51.3 s (5x faster); Kokkos 111.6 s (44%) vs 6.7 s (13%) (17x faster); FFTs 131.7 s (52%) vs 41.3 s (81%) (3x faster); other 10.9 s (4%) vs 3.3 s (6%) (3x faster). Overall GPU speedup of 5x per node (~35x for 6 GPUs vs 6 CPU cores). Much larger GPU speedup for Kokkos loops than FFTs. GPU calculations dominated by FFT time – FFTs have all the MPI.
  26. Strong/weak scaling (single-variant test)

    420³: starting to see some decent strong scaling in the 24-192 GPU range. 840³: decent strong scaling through 384 GPUs, still lower wall time at 768 GPUs. Weak scaling is pretty poor.
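
Scaling statements like these are usually summarized as parallel efficiency relative to the smallest run. A small helper is sketched below; the GPU counts and wall times passed in are made-up placeholders, not the measured MEUMAPPS-SS data.

```python
# Parallel-efficiency helpers of the kind used to summarize scaling plots.
def strong_scaling_efficiency(gpus, wall_times):
    """Efficiency vs the smallest GPU count at a fixed problem size."""
    g0, t0 = gpus[0], wall_times[0]
    return [t0 * g0 / (g * t) for g, t in zip(gpus, wall_times)]

def weak_scaling_efficiency(wall_times):
    """Efficiency when the work per GPU is held fixed (ideal: constant time)."""
    return [wall_times[0] / t for t in wall_times]

gpus = [24, 48, 96, 192]
times = [100.0, 55.0, 32.0, 21.0]          # placeholder seconds per run
for g, e in zip(gpus, strong_scaling_efficiency(gpus, times)):
    print(f"{g:4d} GPUs: {e:.0%} strong-scaling efficiency")
```
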
  27. 29 29 29 What’s next for MEUMAPPS-SS (C++)? • Improving

    FFT library performance – Working with the heFFTe team to improve scaling – ECP another FFT team (FFTX) + new benchmarking effort – Fortran code has better CPU scaling, points to opportunities for heFFTe • Reduce FFTs – Kokkos loops get much better GPU speedups and no MPI communication – Want to trade more non-FFT work for fewer FFTs • Improved physics – Add new capabilities under development in MEUMAPPS-SS (Fortran) – Add support for full CALPHAD free energies with Thermo4PFM • Frontier – MEUMAPPS-SS (C++) up and running on AMD GPUs on an ECP test machine • Open source release
  28. ExaAM phase-field code summary • ExaAM is

    developing 4 phase-field codes • Mix of methods • Current target applications are solidification and solid-state transformations – Codes have the physics capabilities for real problems – But the codes are flexible enough to modify for other applications • Encouraging results on GPUs – MEUMAPPS-SS (C++) and Tusas have 5-6x speedups (w/ ratio of 1 GPU/7 CPU cores) – MEUMAPPS-SS (C++) with strong scaling to hundreds of GPUs – Tusas with strong and weak scaling to 24,000 GPUs (!) • On track for deployment to Frontier
  29. An aside: ExaAM and the PFHub benchmarks

    BM3 Upload: AMPE BM1a Upload: MEUMAPPS-SS (C++) • 128x128 grid • 1 million time steps in 35 minutes on 1 CPU core • Highlights the importance of adaptive time stepping Other uses • Tusas used BM3 for verification • Plans to use a 3D version of BM3 for a performance test between AMPE and Tusas • MEUMAPPS-SS (C++) used a simplified version of BM2 for initial testing and benchmarking
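
On the adaptive time stepping point: a minimal step-doubling controller, sketched below, illustrates the idea of growing the step as the dynamics slow down, which is what makes a million fixed steps look expensive by comparison. This is a generic sketch, not the adaptive scheme used by AMPE or any other code mentioned here.

```python
# Step-doubling time-step control: compare one step of size dt with two steps
# of size dt/2 and grow/shrink dt to keep their difference near a tolerance.
import numpy as np

def adaptive_march(rhs, y, t_end, dt, tol=1e-5, safety=0.9):
    """March dy/dt = rhs(y) with forward Euler and step-doubling error control."""
    t = 0.0
    while t < t_end:
        dt = min(dt, t_end - t)
        y_full = y + dt * rhs(y)                 # one step of size dt
        y_half = y + 0.5 * dt * rhs(y)           # two steps of size dt/2
        y_half = y_half + 0.5 * dt * rhs(y_half)
        err = np.max(np.abs(y_full - y_half)) + 1e-30
        if err <= tol:                           # accept the more accurate result
            y, t = y_half, t + dt
        # Grow or shrink the step, limited to [0.2x, 2x] per adjustment.
        dt *= safety * min(2.0, max(0.2, (tol / err) ** 0.5))
    return y

# Toy usage: relax y' = -10*y, where the step can grow as the solution decays.
print(adaptive_march(lambda y: -10.0 * y, np.array([1.0]), t_end=2.0, dt=1e-3))
```
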
  30. Enough about the codes …what can we

    do with them? Let’s revisit the exascale simulation examples
  31. Precipitation in AM Inconel 625

    From work at NIST, microsegregation from solidification cells on the level of 0.5 – 1 μm with thin domains on the nm scale. Also from NIST, precipitate dimensions range from 8 nm – 900 nm. For a standard anneal (800 °C, 2 h), the diffusion length is ~0.5 μm. Stoudt et al., IMMI, 9, 2020. Zhang, et al., Acta Mater., 152, 2018. So what does this mean? To study nucleation, growth, and coarsening for multiple cells, we need: grid spacing ~1 nm (precipitate thickness); domain ~2 μm x 2 μm x 1 μm (multiple cells); time ~2 hours (annealing time). That is a 2048 x 2048 x 1024 grid (4.3 billion points) with 2-4 compositions and 12 order parameters (65 billion DoF).
  32. Precipitation in AM Inconel 625

    From work at NIST, microsegregation from solidification cells on the level of 0.5 – 1 μm with thin domains on the nm scale. Also from NIST, precipitate dimensions range from 8 nm – 900 nm. For a standard anneal (800 °C, 2 h), the diffusion length is ~0.5 μm. Stoudt et al., IMMI, 9, 2020. Zhang, et al., Acta Mater., 152, 2018. So what does this mean? To study nucleation, growth, and coarsening for multiple cells, we need: grid spacing ~1 nm (precipitate thickness); domain ~2 μm x 2 μm x 1 μm (multiple cells); time ~2 hours (annealing time). That is a 2048 x 2048 x 1024 grid (4.3 billion points) with 2-4 compositions and 12 order parameters (65 billion DoF). Can we do it? heFFTe strong scales to at least 6,144 GPUs for 1024³ (Ayala et al., Inter. Conf. Comp. Sci., 2020). This is a 4x bigger domain: 4x 6,144 = 24,576 GPUs ≈ Summit. How far can we push on Frontier?
  33. Full melt pool solidification simulations • Very

    few phase-field simulations of full melt pools with cells/dendrites – Even in 2D, let alone 3D • Solidification cells on the level of 0.5 – 1 μm • Need grid spacing ~10 nm • Melt-pool radius ~50 μm 2D, half melt pool: 5,000 x 5,000 grid (25 million grid points) 3D, quarter spot weld: 5,000 x 5,000 x 5,000 grid (125 billion grid points) Stoudt et al., IMMI, 9, 2020.
  34. Full melt pool solidification simulations

    Can we do it? Tusas simulations up to 4.3 billion DoF -> about 1 billion elements: a 1000³ domain in 3D, 32,000² in 2D. Summit -> Frontier gives us 10x Flops. Need "just" another 10x… Ghosh, et al., J. Comp. Phys. (submitted). • Very few phase-field simulations of full melt pools with cells/dendrites – Even in 2D, let alone 3D • Solidification cells on the level of 0.5 – 1 μm • Need grid spacing ~10 nm • Melt-pool radius ~50 μm. 2D, half melt pool: 5,000 x 5,000 grid (25 million grid points). 3D, quarter spot weld: 5,000 x 5,000 x 5,000 grid (125 billion grid points).
  35. Pushing to the exascale with ExaAM Exascale

    computers are almost here Phase-field modeling has challenges that scale can solve But exascale machines are big, and it isn’t easy to use large fractions of them efficiently ExaAM is meeting this challenge with AMPE, Tusas, MEUMAPPS-SL, and MEUMAPPS-SS
  36. AEOLUS Overview • A DOE applied math

    center (MMICC) • Optimal control under uncertainty, UQ, optimal experimental design, multifidelity methods, reduced-order modeling • Two application areas: – Additive manufacturing – Block co-polymers • One focus is on how applied math methods can direct the efficient use of 100s, 1000s, etc. of high-fidelity phase-field simulations • Not focused on exascale, but the methods are very relevant
  37. Block copolymer highlights

    • Evolution based on a non-local variant of Cahn-Hilliard called the Ohta-Kawasaki model • Emphasis is on the final steady-state solution. Disclaimer: I'm not involved in the copolymer applications. Direct energy minimization method is 1,000x faster than gradient flow (Cao, Ghattas, Oden). Model inversion to infer model parameters from noisy experimental microstructure (Baptista, Cao, Chen, Ghattas, Li, Marzouk, Oden). Optimal control of substrate chemistry to direct fine-scale self-assembly (Cao, Chen, Chen, Ghattas, Luo, Oden).
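
For reference, one common statement of the Ohta-Kawasaki (non-local Cahn-Hilliard) energy and its gradient flow is sketched below; the symbols are generic placeholders rather than the specific AEOLUS formulation.

```latex
% One common form of the Ohta-Kawasaki free energy: Cahn-Hilliard terms plus a
% long-range term through the Green's function G of -Laplacian on the domain.
F[\phi] = \int_\Omega \Big( \tfrac{\epsilon^2}{2}\,|\nabla\phi|^2 + W(\phi) \Big)\,\mathrm{d}x
        + \frac{\sigma}{2} \int_\Omega\!\int_\Omega G(x,y)\,\big(\phi(x)-\bar{\phi}\big)\big(\phi(y)-\bar{\phi}\big)\,\mathrm{d}x\,\mathrm{d}y
% Conserved (H^{-1}) gradient flow; the non-local term reduces to a linear
% reaction term because applying -\Delta to its potential returns \phi - \bar{\phi}.
\partial_t \phi = M\,\Delta\big( W'(\phi) - \epsilon^2 \Delta\phi \big) - M\,\sigma\,\big(\phi - \bar{\phi}\big)
```
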
  38. Additive manufacturing highlights

    • Specifically focused on solidification phenomena • Directional solidification of alloys is the target, but pure material is used as a proving ground. Non-local Cahn-Hilliard can have perfectly sharp interfaces; can we create a solidification model like this? (Burkovska, DeWitt, Radhakrishnan, Gunzburger). Solidification reduced-order model using operator inference, leveraging equation structure for the reduced representation (Khodabakhshi, Geelen, DeWitt, Radhakrishnan, Willcox). Multiscale modeling for AM with validation from full-melt-pool simulations (Bao, Qin, DeWitt, Radhakrishnan, Biros).
  39. Multiscale modeling of a spot weld Goal:

    Test if targeted, thin phase-field simulations can give insight into the dendrite/cell structure in a melt pool Perform melt-pool- scale thermal simulation Extract gradient and velocity for lines normal to the thermal gradient Perform transient phase-field simulations along those lines
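
A minimal sketch of the extraction step described above: given temperature fields from a melt-pool-scale thermal simulation at two nearby times, the thermal gradient G and the isotherm velocity V follow from G = |grad T| and V = -(dT/dt)/|grad T|, sampled near the liquidus. The analytic temperature field below is only a stand-in for actual thermal-simulation output, and the parameter values are made up.

```python
# Extract the thermal gradient G and solidification-front velocity V that a
# melt-pool-scale thermal simulation hands to a transient phase-field run.
import numpy as np

dx, dt, T_liq = 1e-6, 1e-5, 1600.0   # grid spacing (m), time offset (s), liquidus (K)
x = np.arange(0, 100e-6, dx)
y = np.arange(0, 100e-6, dx)
X, Y = np.meshgrid(x, y, indexing="ij")

def temperature(t):
    """Placeholder field: T falls linearly with distance from the pool center
    at (50 um, 0) and cools uniformly in time."""
    r = np.hypot(X - 50e-6, Y)
    return 2000.0 - 1.5e7 * r - 2e7 * t          # K

T0, T1 = temperature(0.0), temperature(dt)
dTdx, dTdy = np.gradient(T0, dx, dx)
G = np.hypot(dTdx, dTdy)                         # |grad T|
V = -(T1 - T0) / (dt * G)                        # isotherm velocity, -(dT/dt)/|grad T|

# Sample G and V where the liquidus isotherm sits, i.e., along the front that a
# targeted "line" phase-field simulation would follow.
front = np.abs(T0 - T_liq) < 0.5 * G * dx
print(f"G ~ {G[front].mean():.2e} K/m, V ~ {V[front].mean():.2e} m/s at the liquidus")
```
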
  40. How do we know if it works?

    (A full-melt-pool simulation.) The only big discrepancy is the primary arm spacing. Why? It's a geometric effect from the converging dendrites: the cell size can't adjust fast enough (Gurevich et al., PRE, 2010), and the effect can't be seen in rectangular domains. Is this a known phenomenon?
  41. The synthesis of ExaAM and AEOLUS •

    Can we test "line models" in 3D using ExaAM codes for full-melt-pool simulations? • If they work, can we develop efficient, accurate reduced order models for the "line models"? • If those work, can we solve optimal control problems for the heat source with dendrite-scale resolution? • Stay tuned!
  42. In conclusion… Exascale computers bring the promise

    of freeing phase-field simulations from current computational constraints But we need codes that use them effectively And we need to define high-value problems to solve Hopefully the work we’re doing in ExaAM and AEOLUS helps the community prepare to solve exascale problems …either directly through our codes and methods, or by learning from what we’ve done (good or bad)
  43. Acknowledgements This research was supported by the

    Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative. This work was supported by the US Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under grant number DE-SC0019303 as part of the AEOLUS Center. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.