
Modernizing an Operational Real-time Tsunami Simulator to Support Diverse Hardware Platforms

Keichi Takahashi, Takashi Abe, Akihiro Musa, Yoshihiko Sato, Yoichi Shimomura, Hiroyuki Takizawa, Shunichi Koshimura, “Modernizing an Operational Real-time Tsunami Simulator to Support Diverse Hardware Platforms,” International Conference on Cluster Computing (CLUSTER 2024), Sep. 2024.

Keichi Takahashi

September 27, 2024

Transcript

  1. 2024 IEEE International Conference on Cluster Computing (CLUSTER 2024)
     Modernizing an Operational Real-time Tsunami Simulator to Support Diverse
     Hardware Platforms. Keichi Takahashi¹, Takashi Abe¹, Akihiro Musa²,
     Yoshihiko Sato³, Yoichi Shimomura³, Hiroyuki Takizawa¹, and Shunichi
     Koshimura¹. ¹Tohoku University, Japan; ²NEC Corporation, Japan;
     ³NEC Solution Innovators, Japan.
  2. Tsunami inundation forecast
     • Tsunamis caused 251,770 human casualties and US$280 billion in economic
       losses between 1998 and 2017 (reported by UNDRR in 2018).
     • Various tsunami inundation forecast systems have been developed to
       tackle tsunami disasters.
     • Database-driven approach: a large number of scenarios are simulated in
       advance and stored in a database. Once an earthquake occurs, the DB is
       queried to find the closest scenario. Quick, but fundamentally limited
       in coverage.
     • Real-time forward approach: starts a tsunami propagation and inundation
       simulation once an earthquake occurs. Can incorporate the latest
       information such as topography, bathymetry, and tidal conditions.
       Accurate, but requires HPC.
  3. Real-time Tsunami inundation (RTi) model
     • The world's first operational tsunami inundation simulation [1, 2].
     • Integrated into the Japanese government's disaster information system.
     • Achieves the 10-10-10 challenge: estimates the fault model within
       10 min and completes the tsunami simulation within 10 min using a 10 m
       mesh.
     https://www.tohoku.ac.jp/japanese/2024/04/press20240408-01-cast.html
     [1] T. Inoue et al., "Development and validation of a tsunami numerical
         model with the polygonally nested grid system and its
         MPI-parallelization for real-time tsunami inundation forecast on a
         regional scale," J. Disaster Res., vol. 14, no. 3, 2019.
     [2] A. Musa et al., "Real-time tsunami inundation forecast system for
         tsunami disaster prevention and mitigation," J. Supercomput., vol. 74,
         no. 7, 2018.
  4. Numerical model
     The RTi model is based on the UNESCO-approved TUNAMI-N2 model and solves
     the 2D non-linear shallow water equations in Cartesian coordinates.

     Momentum equations:
       \frac{\partial M}{\partial t} + \frac{\partial}{\partial x}\left(\frac{M^2}{D}\right) + \frac{\partial}{\partial y}\left(\frac{MN}{D}\right) + gD\frac{\partial \eta}{\partial x} + \frac{gn^2}{D^{7/3}} M \sqrt{M^2 + N^2} = 0
       \frac{\partial N}{\partial t} + \frac{\partial}{\partial x}\left(\frac{MN}{D}\right) + \frac{\partial}{\partial y}\left(\frac{N^2}{D}\right) + gD\frac{\partial \eta}{\partial y} + \frac{gn^2}{D^{7/3}} N \sqrt{M^2 + N^2} = 0

     Continuity equation:
       \frac{\partial \eta}{\partial t} + \frac{\partial M}{\partial x} + \frac{\partial N}{\partial y} = 0

     where M and N are the discharge fluxes in the x- and y-directions, η is
     the water level, D is the total water depth, g is the gravitational
     acceleration, and n is Manning's roughness coefficient.
  5. Nested grid system
     • Tsunami is a multi-scale phenomenon, since its propagation speed
       depends on the ocean depth: v = \sqrt{gh}.
     • To satisfy the CFL condition \Delta t \le \Delta x / \sqrt{2gh} while
       saving computational cost, finer grids are used along the coast and
       coarser grids in deep water (a worked example follows).
     • A grid is composed of multiple rectangular meshes (referred to as
       blocks).
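     As a rough worked example (assuming a representative deep-ocean depth of
     h = 4,000 m, a value not taken from the slides), on the coarsest 810 m
     grid:

       v = \sqrt{gh} = \sqrt{9.8 \times 4000} \approx 198~\mathrm{m/s},
       \qquad
       \Delta t \le \frac{\Delta x}{\sqrt{2gh}}
                 = \frac{810}{\sqrt{2 \times 9.8 \times 4000}}
                 \approx 2.9~\mathrm{s}

     so the operational time step of Δt = 0.2 s (slide 20) satisfies the
     condition there with ample margin; the tighter bound typically comes from
     the finest 10 m coastal grid.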
  6. Related studies on tsunami simulation
     • Tsunami-HySEA (University of Málaga): developed in MPI+CUDA and
       specifically optimized for NVIDIA GPUs.
     • JAGURS (Tokushima University/JAMSTEC/JMA): developed in MPI+OpenMP for
       CPUs; scaled up to 82,944 nodes on the K computer.
     • EasyWave (GFZ/University of Potsdam): developed in OpenMP and ported to
       multiple programming models (CUDA, SYCL, etc.), but does not support
       inter-node parallelism.
     Each is either developed for a single platform, or simplified for
     performance analysis.
  7. Motivation and goal
     • The original RTi model was developed and optimized for NEC's SX-ACE and
       SX-Aurora TSUBASA vector supercomputers.
     • The RTi model is memory-bound and benefits from the top-class memory
       performance of these vector supercomputers.
     • However, further widespread adoption demands efficient execution on
       modern GPUs and CPUs.
     Goal: to migrate an operational, production-scale tsunami simulation code
     developed for long-vector architectures to modern CPUs and GPUs with
     minimal modifications.
  8. Overview of the original RTi code
     • Developed in Fortran 90.
     • MPI is employed for both inter- and intra-node parallelization (i.e.,
       flat MPI).
     • Vectorization is performed by NEC's auto-vectorizing compiler (with
       some compiler directives).
     • The majority of the runtime is spent in the NLMASS and NLMNT2
       subroutines, which solve the continuity and momentum equations,
       respectively.
     [Figure: per-time-step call flow, with computation and communication
      routines distinguished — NLMASS (conservation of mass), OPENBOUND, the
      JNZ* routines and PTP_Z (exchange water depth), NLMNT2 (conservation of
      momentum), the JNQ* routines and PTP_MN (exchange flux),
      ZMAX/ZIMAX/CALINUNDT/CALOBST/POINT (update output), and CHANGE (swap
      buffers).]
  9. Programming models
     • To meet the goal of minimizing code modifications, we selected
       directive-based programming models.
     • GPU: OpenACC
       ◦ C++ programming models (SYCL, Kokkos, RAJA, etc.) would require a
         complete rewrite.
       ◦ CUDA Fortran and Fortran do concurrent would also require significant
         rewrites.
       ◦ OpenACC generally outperforms OpenMP offloading (with NVIDIA's
         compiler).
     • CPU: OpenMP
       ◦ OpenMP generally outperforms OpenACC on CPUs.
  10. Structure of major loop nests in the original code
      • Major loop nests are triply nested:
        ◦ KK loop: iterates over blocks
        ◦ J/I loops: iterate over the y- and x-directions
      • The KK loop is short (< 100 iterations), so the KK loop alone does not
        provide enough parallelism.
      • The trip counts of the J and I loops depend on KK; to collapse all
        three loops, the trip counts of the inner loops would need to be
        invariant (see the illustrative sketch below).
      • A single KK loop may contain multiple J/I loop nests.

        DO KK = BLK_PROC_ST, BLK_PROC_ED
          ! ...
          DO J = JST_MN_X2(KK), JED_MN_X2(KK)
            DO I = IST1_OWN(KK), IED2_OWN(KK)
              ! Update cell
            END DO
          END DO
          ! ...
        END DO
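      To make that requirement concrete, a fully collapsed variant would need
      a rectangular iteration space; the bounds JMAX and IMAX below are
      hypothetical invariant bounds for illustration, not names from the RTi
      code, and the RTi loops do not satisfy this:

        ! collapse(3) is only legal over a tightly nested, rectangular loop
        ! nest whose inner bounds do not depend on the outer indices.
        !$acc parallel loop collapse(3)
        DO KK = 1, NBLK
          DO J = 1, JMAX      ! trip count invariant w.r.t. KK
            DO I = 1, IMAX    ! trip count invariant w.r.t. KK and J
              ! Update cell
            END DO
          END DO
        END DO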
  11. Structure of loop nests after migration (GPU)
      • Computation for each block is offloaded as a single kernel.
      • The J and I loops are collapsed to increase the degree of parallelism.
      • To increase GPU utilization, kernels are launched asynchronously and
        concurrently on multiple streams.
      • Arguments are passed to subroutines by value to allow compiler
        autoparallelization.

        DO KK = BLK_PROC_ST, BLK_PROC_ED
          ! ...
          !$acc kernels async(MOD(KK, NSTREAMS))
          !$acc loop collapse(2)
          DO J = JST_MN_X2(KK), JED_MN_X2(KK)
            DO I = IST1_OWN(KK), IED2_OWN(KK)
              ! Update cell
            END DO
          END DO
          !$acc end kernels
          ! ...
        END DO
        !$acc wait  ! Synchronization
  12. Structure of loop nests after migration (CPU)
      • The outermost KK loop is executed serially.
      • The J loop is parallelized and the I loop is vectorized.
      • Synchronization is performed only once, at the exit of the KK loop
        (note the nowait clause).
      • Subroutine calls are inlined using the "attributes forceinline"
        directive to enable vectorization (an example directive is shown
        below).

        !$omp parallel
        DO KK = BLK_PROC_ST, BLK_PROC_ED
          ! ...
          !$omp do
          DO J = JST_MN_X2(KK), JED_MN_X2(KK)
            !$omp simd
            DO I = IST1_OWN(KK), IED2_OWN(KK)
              ! Update cell
            END DO
            !$omp end simd
          END DO
          !$omp end do nowait
          ! ...
        END DO
        !$omp end parallel  ! Synchronization
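      With the Intel Fortran compiler, such inlining can be requested per
      procedure via the ATTRIBUTES FORCEINLINE directive. A minimal sketch for
      a hypothetical cell-update subroutine (UPDATE_CELL is an illustrative
      name, not from the RTi code):

        SUBROUTINE UPDATE_CELL(I, J, KK)
          ! Force the compiler to inline this routine at every call site,
          ! so the surrounding I loop can still be vectorized.
          !DIR$ ATTRIBUTES FORCEINLINE :: UPDATE_CELL
          IMPLICIT NONE
          INTEGER, INTENT(IN) :: I, J, KK
          ! ... update a single cell ...
        END SUBROUTINE UPDATE_CELL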
  13. Load balance improvement
      • Severe load imbalance was observed after migrating the bottleneck
        subroutines.
      • Since individual blocks are offloaded as separate kernels, a
        non-negligible fixed cost is incurred per block.
      • Computation on ranks that own many small blocks takes longer than on
        ranks with fewer blocks.
      [Figures: per-rank runtime breakdown across 16 MPI ranks (nlmass, jnz,
       ptp_z, nlmnt2, jnq, ptp_mn, others), and the number of blocks and
       number of grid points per rank.]
  14. Methods for improving load balance
      • Method #1: Offload the computation for all blocks as a single kernel.
        ◦ Refactor the code to exploit the parallelism of the outermost
          block-level loop.
      • Method #2: Optimize the domain decomposition.
        ◦ Design a new domain decomposition algorithm that accounts for the
          per-block fixed cost, in addition to the total number of cells per
          rank.
  15. Method #1: Offload as a single kernel
      • "Pad" the J loop to make its trip count invariant across blocks;
        padded iterations are skipped with CYCLE, at the cost of some
        redundant work.
      • Collapse the outer two loops (KK and J) and assign them to gangs
        (i.e., thread blocks).
      • The I loop does not have to be invariant; assign it to vectors (i.e.,
        threads).

        JSZ = MAXVAL(JED_MN_X2 - JST_MN_X2)
        !$acc parallel
        !$acc loop collapse(2) gang
        !$omp parallel do
        DO KK = BLK_PROC_ST, BLK_PROC_ED
          DO J = JST_MN_X2(KK), JST_MN_X2(KK) + JSZ
            IF (J > JED_MN_X2(KK)) CYCLE  ! Padding
            !$acc loop vector
            !$omp simd
            DO I = IST1_OWN(KK), IED2_OWN(KK)
              ! Update cell
            END DO
            !$omp end simd
          END DO
        END DO
        !$omp end parallel do
        !$acc end parallel
  16. Method #2: Optimize domain decomposition (1/2)
      • To quantify the per-block cost, we profiled the runtime of NLMNT2 for
        each block.
      • The runtime is well modeled by a linear function of the number of
        cells in a block, with a constant per-block cost of approximately
        46 µs. The predicted runtime of a rank owning n blocks with b_i cells
        each is (see the sketch below):

        T = \sum_{i=1}^{n} \left( 1.09 \times 10^{-4}\, b_i + 46.2 \right)
        \ \mu\mathrm{s}

      [Figures: per-block runtime of NLMNT2 vs. number of cells, with the
       linear fit 1.09e-4 * x + 46.2, and predicted vs. actual per-rank
       runtime.]
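      A minimal sketch of evaluating this cost model off-line (the function
      and argument names are illustrative, not taken from the RTi code; only
      the fitted coefficients come from the slide):

        ! Predict the runtime of one rank from the per-block linear cost model.
        FUNCTION PREDICT_RANK_RUNTIME(NBLK, NCELLS) RESULT(T)
          IMPLICIT NONE
          INTEGER, INTENT(IN) :: NBLK          ! number of blocks on this rank
          INTEGER, INTENT(IN) :: NCELLS(NBLK)  ! number of cells in each block
          REAL(8) :: T                         ! predicted runtime [us]
          INTEGER :: I
          T = 0.0D0
          DO I = 1, NBLK
            ! Per-cell cost plus the ~46 us fixed cost incurred per block
            T = T + 1.09D-4 * NCELLS(I) + 46.2D0
          END DO
        END FUNCTION PREDICT_RANK_RUNTIME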
  17. Method #2: Optimize domain decomposition (2/2)
      • The performance model is used to optimize the domain decomposition
        off-line, by moving the separators that delimit which contiguous run
        of blocks each rank owns.
      • The objective function was initially the variance of the predicted
        per-rank runtimes; it was later switched to the maximum of the
        predicted runtimes for faster convergence. A sketch of this separator
        search is shown below.
      [Figure: blocks #1-#6 distributed across ranks 0-3, with a selected
       separator moved to a new position.]
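      A minimal hill-climbing sketch of such a separator search, assuming
      blocks stay in a fixed order and each rank owns a contiguous run of
      blocks (all names are illustrative, not from the RTi code; the
      objective here is the faster-converging max-runtime variant):

        ! Rank R (1..NRANKS) owns blocks SEP(R-1)+1 .. SEP(R);
        ! SEP(0) = 0 and SEP(NRANKS) = NBLK are fixed.
        SUBROUTINE OPTIMIZE_SEPARATORS(NBLK, COST, NRANKS, SEP)
          IMPLICIT NONE
          INTEGER, INTENT(IN)    :: NBLK, NRANKS
          REAL(8), INTENT(IN)    :: COST(NBLK)     ! predicted runtime per block
          INTEGER, INTENT(INOUT) :: SEP(0:NRANKS)  ! separator positions
          INTEGER :: R, D, TRIAL(0:NRANKS)
          REAL(8) :: BEST, T
          LOGICAL :: IMPROVED
          BEST = MAXLOAD(SEP)
          IMPROVED = .TRUE.
          DO WHILE (IMPROVED)
            IMPROVED = .FALSE.
            DO R = 1, NRANKS - 1       ! try shifting each separator by +/- 1
              DO D = -1, 1, 2
                TRIAL = SEP
                TRIAL(R) = SEP(R) + D
                ! Skip moves that would leave a rank with no blocks
                IF (TRIAL(R) <= TRIAL(R-1) .OR. TRIAL(R) >= TRIAL(R+1)) CYCLE
                T = MAXLOAD(TRIAL)
                IF (T < BEST) THEN     ! keep the move if the bottleneck shrinks
                  BEST = T
                  SEP = TRIAL
                  IMPROVED = .TRUE.
                END IF
              END DO
            END DO
          END DO
        CONTAINS
          REAL(8) FUNCTION MAXLOAD(S)  ! objective: runtime of most loaded rank
            INTEGER, INTENT(IN) :: S(0:NRANKS)
            INTEGER :: R
            MAXLOAD = 0.0D0
            DO R = 1, NRANKS
              MAXLOAD = MAX(MAXLOAD, SUM(COST(S(R-1)+1:S(R))))
            END DO
          END FUNCTION MAXLOAD
        END SUBROUTINE OPTIMIZE_SEPARATORS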
  18. Communication optimization
      • To improve scalability, CUDA-aware MPI and GPUDirect RDMA are used.
      • Communication-related routines (halo packing/unpacking and grid
        interpolation/restriction) are offloaded in addition to the
        computation, to minimize CPU-GPU copies.
      • In the original packing loop, the sequential counter ICNT creates a
        loop-carried dependency; the offloaded version computes the index
        directly from (I, J), making all iterations independent.

        Original halo packing:

        DO J = SFTN(6), SFTN(7)
          DO I = SFTN(2), SFTN(3)
            ICNT = ICNT + 1
            BUF_SND1(ICNT) = VAL1(I,J,1)
            BUF_SND1(ICNT+JNUM) = VAL1(I,J,2)
            BUF_SND1(ICNT+2*JNUM) = VAL2(I,J)
          END DO
        END DO

        Offloaded halo packing:

        !$acc kernels
        !$acc loop collapse(2) independent
        DO J = SFTN(6), SFTN(7)
          DO I = SFTN(2), SFTN(3)
            ICNT = (I - SFTN(2) + 1) + (J - SFTN(6)) * (SFTN(3) - SFTN(2) + 1)
            BUF_SND1(ICNT) = VAL1(I,J,1)
            BUF_SND1(ICNT+JNUM) = VAL1(I,J,2)
            BUF_SND1(ICNT+2*JNUM) = VAL2(I,J)
          END DO
        END DO
        !$acc end kernels
  19. HPC systems used for the evaluation

      AOBA-S [1]:
        CPU: AMD EPYC 7763; Memory: DDR4 256 GB;
        Accelerator: NEC Vector Engine Type 30A x8;
        Interconnect: InfiniBand NDR200 x2; Compiler: NEC Fortran 5.2.0
      SQUID (GPU) [2]:
        CPU: Intel Xeon Platinum 8368 (ICX) x2; Memory: DDR4 512 GB;
        Accelerator: NVIDIA A100 (SXM4) x8;
        Interconnect: InfiniBand HDR100 x4; Compiler: NVIDIA HPC SDK 22.11
      SQUID (CPU) [2]:
        CPU: Intel Xeon Platinum 8368 (ICX) x2; Memory: DDR4 256 GB;
        Accelerator: N/A;
        Interconnect: InfiniBand HDR200 x1; Compiler: Intel oneAPI 2023.2.4
      Pegasus [3]:
        CPU: Intel Xeon Platinum 8468 (SPR) x1; Memory: DDR5 128 GB;
        Accelerator: NVIDIA H100 (PCIe) x1;
        Interconnect: InfiniBand NDR200 x1; Compiler: Intel oneAPI 2023.0.0

      [1] H. Takizawa et al., "AOBA: The Most Powerful Vector Supercomputer in
          the World," in Sustained Simulation Performance 2022, 2024,
          pp. 71-81.
      [2] S. Date et al., "Supercomputer for Quest to Unsolved
          Interdisciplinary Datascience (SQUID) and its Five Challenges," in
          Sustained Simulation Performance 2021, 2023, pp. 1-19.
      [3] Center for Computational Sciences, "Pegasus – Big memory
          supercomputer,"
          https://www.ccs.tsukuba.ac.jp/eng/supercomputers/#Pegasus
  20. Dataset used for the evaluation
      • Executed a 6 h tsunami inundation simulation covering the Pacific
        coast of Kochi Prefecture.
      • The model consists of five levels of nested grids, where the finest
        grid resolution (Δx) is 10 m.
      • Δt is 0.2 s across all grid levels.

      Organization of nested grids:
        Level 1: Δx = 810 m,  1 block,   2,012,940 cells
        Level 2: Δx = 270 m,  3 blocks,  1,703,484 cells
        Level 3: Δx =  90 m,  9 blocks,  2,230,056 cells
        Level 4: Δx =  30 m, 11 blocks,  9,863,424 cells
        Level 5: Δx =  10 m, 60 blocks, 31,401,540 cells
        Total:                          47,211,444 cells
  21. Concurrent kernel execution using multiple streams
      • Concurrent kernel execution provides up to a 4x speedup.
      • Ranks with many blocks benefit more from this optimization, since
        they must launch many kernels.
      [Figures: GPU utilization and memory bandwidth utilization vs. number
       of asynchronous queues, and per-rank speedup over the synchronous
       launch (ranks 0-7) vs. number of asynchronous queues.]
  22. Load balance improvement
      • Both methods provide roughly the same degree of speedup on the GPU.
      • However, Method #1 increases the runtime on the CPU, due to the
        redundant computation caused by padding.
      [Figures: per-rank runtime of NLMNT2 on the A100 (baseline, collapsed
       outer loops, optimized decomposition) and on the ICX (baseline,
       collapsed outer loops).]
  23. Communication tuning
      • Offloading halo packing/unpacking to the GPU yields up to a 2.96x
        speedup by reducing CPU-GPU memory copies and synchronizations.
      • Setting UCX_PROTO_ENABLE=y on SQUID further improves performance by
        automatically selecting a better eager/rendezvous protocol threshold.
      [Figures: runtime on SQUID (GPU) and Pegasus (GPU) at 8, 16, and 32
       ranks, comparing boundary copies via the CPU, offloaded packing, and
       (on SQUID) UCX tuning.]
  24. Performance comparison across HPC systems
      • The VE and GPU systems achieve the 10-10-10 challenge with 8 or more
        sockets.
      • With 32 H100s, the 6 h simulation completes in just 82 s.
      • The performance difference between the CPU and GPU/VE systems shrinks
        as sockets are added, due to increasing cache hit rates.
      [Figure: 6 h tsunami simulation runtime vs. number of sockets (4-32)
       for SQUID (CPU), Pegasus (CPU), AOBA-S, SQUID (GPU), and Pegasus
       (GPU), against the 10-minute deadline.]
  25. Conclusions
      • We migrated the RTi model, an operational tsunami propagation and
        inundation simulation code specifically designed for vector
        processors, to the latest CPUs and GPUs.
      • The migrated code runs efficiently on CPUs, GPUs, and vector
        processors (a 6 h tsunami simulation completes in 1.5 min using 32
        NVIDIA H100 GPUs).
      • The migrated code will enable broader access to accurate tsunami
        inundation forecasts in the future.
      Acknowledgments: This work was supported by the Council for Science,
      Technology and Innovation (CSTI), Cross-ministerial Strategic
      Innovation Promotion Program (SIP), "Development of a Resilient Smart
      Network System against Natural Disasters," Grant Number JPJ012289
      (funding agency: NIED). This work was also supported by JSPS KAKENHI
      Grant Numbers JP23K11329, JP23K16890, and JP24K02945.