Slide 20
Slide 20 text
Communication optimization
• To improve scalability, CUDA-aware MPI and GPUDirect RDMA are used.
• Communication-related routines (halo packing/unpacking and grid interpolation/restriction) are offloaded in addition to computation to minimize CPU-GPU copies.
20
! Original (CPU) halo packing: a running counter ICNT carries a
! dependence from one iteration to the next, so this nest cannot be
! parallelized as written.
! NOTE(review): ICNT is assumed to be initialized before this nest and
! JNUM to be the per-variable element count of the halo region — both
! are defined outside this snippet; confirm against the full source.
DO J = SFTN(6), SFTN(7)
DO I = SFTN(2), SFTN(3)
ICNT = ICNT + 1
! Pack two components of VAL1 and one of VAL2 into a single send
! buffer, each variable offset by JNUM slots.
BUF_SND1(ICNT) = VAL1(I,J,1)
BUF_SND1(ICNT+JNUM) = VAL1(I,J,2)
BUF_SND1(ICNT+2*JNUM) = VAL2(I,J)
END DO
END DO
! Offloaded halo packing: ICNT is now computed in closed form from
! (I,J), removing the loop-carried dependence so the collapsed nest
! can be executed in parallel on the device under OpenACC.
! NOTE(review): the ICNT statement below is split across two lines by
! the slide extraction; in real free-form Fortran it would need a '&'
! continuation — confirm against the original source file.
!$acc kernels
!$acc loop collapse(2) independent
DO J = SFTN(6), SFTN(7)
DO I = SFTN(2), SFTN(3)
ICNT = (I- SFTN(2) + 1) + (J -
SFTN(6)) * (SFTN(3) - SFTN(2) + 1)
! Same packing layout as the original version: VAL1(:,:,1),
! VAL1(:,:,2), and VAL2 each occupy a JNUM-sized region of BUF_SND1.
BUF_SND1(ICNT) = VAL1(I,J,1)
BUF_SND1(ICNT+JNUM) = VAL1(I,J,2)
BUF_SND1(ICNT+2*JNUM) = VAL2(I,J)
END DO
END DO
!$acc end kernels
Original halo packing / Offloaded halo packing