
GPU コンピューティングを活用した物性研究 (Condensed-Matter Research Using GPU Computing) / 2023-04-04 ISSP

Shinnosuke Furuya

April 04, 2023

Transcript

  1. NVIDIA Overview: headquartered in Santa Clara, with a Tokyo office. Founded in 1993; Founder & CEO: Jensen Huang. 25,000+ employees, 50+ locations, $26.9B revenue in FY22.
  2. Supercomputer Trends: accelerators on the TOP500 list are growing. https://top500.org/
    [Chart: "# of Systems" with accelerators on the TOP500, June 2011 through November 2022, split into NVIDIA and other accelerators.]
  3. "Green" Supercomputers Need Accelerators (Green500 vs. TOP500 ranks; system name, accelerator/processor, PFlops, kW, GFlops/W)
    G500 1 / T500 405: Henri (NVIDIA H100), 2.04 PFlops, 31 kW, 65.091 GFlops/W
    G500 2 / T500 32: Frontier TDS (AMD MI250X), 19.20 PFlops, 309 kW, 62.684 GFlops/W
    G500 8 / T500 159: ATOS THX.A.B (NVIDIA A100), 3.50 PFlops, 86 kW, 41.411 GFlops/W
    G500 9 / T500 359: MN-3 (PFN MN-Core), 2.18 PFlops, 53 kW, 40.901 GFlops/W
    G500 34 / T500 438: NA-J2 (PEZY PEZY-SC3), 1.95 PFlops, 89 kW, 21.646 GFlops/W
    G500 40 / T500 67: AiMOS (NVIDIA V100), 8.34 PFlops, 512 kW, 16.285 GFlops/W
    G500 43 / T500 2: Supercomputer Fugaku (Fujitsu A64FX), 442.01 PFlops, 29,899 kW, 15.418 GFlops/W
  4. ISSP Supercomputer: NVIDIA A100 ACC node of System C (kugui). https://mdcl.issp.u-tokyo.ac.jp/scc/system/newsystemc/newsystemc_hardware
  5. Tesla K40 vs NVIDIA A100
    Compute Capability: Tesla K40 3.5, Tesla M40 5.2, Tesla P100 6.0, NVIDIA V100 7.0, NVIDIA T4 7.5, NVIDIA A100 8.0, NVIDIA L40 8.9, NVIDIA H100 9.0
    Software (K40 era): PGI 2015, OpenACC 2.0, CUDA 6.0/6.5
    Software (A100 era): HPC SDK 23.3, OpenACC 2.7, CUDA 11.0/11.8/12.0
  6. Agenda: Introduction; Hardware (CPU, GPU, DPU); GPU Programming (CUDA, Directive, ISO Standard); Application Performance; Useful Information
  7. NVIDIA GPUs at a Glance
    [Chart: NVIDIA GPU generations Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020), Hopper (2022), and Ada Lovelace (2022), mapped across product lines: Data Center computing (FP64/FP32), Data Center computing (FP32), VDI, RTX/Quadro professional visualization, and GeForce gaming. Products shown include M2090, K80, M40, M10, K1, P100, T4, V100, A100, A30, A40, A2, A16, H100, L40, L4; Quadro/RTX 6000, K6000, M6000, P5000, GP100, GV100, RTX 8000, RTX A6000, RTX 6000 Ada Generation; GeForce GTX 580, GTX 780, GTX 980, GTX 1080, TITAN Xp, TITAN V, RTX 2080 Ti, TITAN RTX, RTX 3090 Ti, RTX 4090.]
  8. NVIDIA A100 Tensor Core GPU • HPC / DL Training

    / DL Inference / HPDA • Two form factors • SXM for HGX / PCIe • FP64 / FP32 • 3rd Generation Tensor Core • FP64 / TF32 / BF16 / FP16 / INT8 • 3rd Generation NVLink • 600GB/s (SXM) / 600GB/s up to 2 GPUs via NVLink Bridge (PCIe) • High-Bandwidth Memory • 80GB HBM2e • Structural Sparsity • Multi-Instance GPU (MIG) https://www.nvidia.com/en-us/data-center/a100/
  9. NVIDIA H100 Tensor Core GPU • HPC / DL Training

    / DL Inference / HPDA • Exascale HPC / LLM Inference • Two form factors • SXM for HGX / PCIe • FP64 / FP32 • 4th Generation Tensor Core • FP64 / TF32 / BF16 / FP16 / FP8 / INT8 • 4th Generation NVLink • 900GB/s (SXM) / 600GB/s up to 2 GPUs via NVLink Bridge (PCIe) • High-Bandwidth Memory • 80GB HBM3 (SXM) / 80GB HBM2e (PCIe) / 188GB HBM3 (NVL; total) • Transformer Engine • 2nd Generation Multi-Instance GPU (MIG) https://www.nvidia.com/en-us/data-center/h100/
  10. Data Center GPU Comparison
    Columns: H100, A100, A30, L4, A2, L40, A40, A10, A16
    Design: Highest Perf AI, Big NLP, HPC, DA / High Perf Compute / Mainstream Compute / Universal AI, Video, and Graphics / Entry-Level Small Footprint / Powerful Graphics + AI / High Perf Graphics / Mainstream Graphics & Video with AI / High Density Virtual Desktop
    Form Factor: SXM5 / x16 PCIe Gen5 2 Slot FHFL 3 NVLink Bridge / x16 PCIe Gen5 Dual 2 Slot FHFL using 3 NVLink Bridges / SXM4 / x16 PCIe Gen4 2 Slot FHFL 3 NVLink Bridge / x16 PCIe Gen4 2 Slot FHFL 1 NVLink Bridge / x16 PCIe Gen4 1 Slot LP / x8 PCIe Gen4 1 Slot LP / x16 PCIe Gen4 2 Slot FHFL / x16 PCIe Gen4 2 Slot FHFL 1 NVLink Bridge / x16 PCIe Gen4 1 Slot FHFL / x16 PCIe Gen4 2 Slot FHFL
    Max Power: 700W / 350W / 2x 400W / 500W / 300W / 165W / 72W / 60W / 300W / 300W / 150W / 250W
    FP64 TC | FP32 TFLOPS (2): 67 | 67 / 51 | 51 / 134 | 134 / 19.5 | 19.5 / 10 | 10 / NA | 30 / NA | 4.5 / NA | TBD (3) / NA | 37 / NA | 31 / NA | 4x4.5
    TF32 TC | FP16 TC TFLOPS (2): 989 | 1979 / 756 | 1513 / 1979 | 3958 / 312 | 624 / 165 | 330 / 120 | 242 / 18 | 36 / TBD (3) | TBD (3) / 150 | 300 / 125 | 250 / 4x18 | 4x36
    FP8 TC | INT8 TC TFLOPS/TOPS (2): 3958 | 3958 / 3026 | 3026 / 7916 | 7916 / NA | 1248 / NA | 661 / 485 | 485 / NA | 72 / TBD (3) | TBD (3) / NA | 600 / NA | 500 / NA | 4x72
    GPU Memory: 80GB HBM3 / 80GB HBM2e / 188GB HBM3 / 80GB HBM2e / 24GB HBM2 / 24GB GDDR6 / 16GB GDDR6 / 48GB GDDR6 / 48GB GDDR6 / 24GB GDDR6 / 4x 16GB GDDR6
    Multi-Instance GPU (MIG): Up to 7 / Up to 14 / Up to 7 / Up to 4 / - / - / - / - / - / -
    Media Acceleration: 7 JPEG Decoder, 7 Video Decoder, 14 JPEG Decoder, 14 Video Decoder, 1 JPEG Decoder, 5 Video Decoder, 1 JPEG Decoder, 4 Video Decoder, 2 Video Encoder (3), 4 Video Decoder (3), 4 JPEG Decoder, 1 Video Encoder, 2 Video Decoder (+AV1 decode), 3 Video Encoder, 3 Video Decoder, 4 JPEG Decoder, 1 Video Encoder, 2 Video Decoder (+AV1 decode), 4 Video Encoder, 8 Video Decoder (+AV1 decode)
    Ray Tracing: - / - / - / Yes / Yes / Yes
    Transformer Engine: Yes / Yes / - / FP8 / - / FP8 / - / - / -
    DPX Instructions: Yes / Yes / - / - / - / - / - / -
    Graphics: For in-situ visualization (no NVIDIA vPC or RTX vWS) / For in-situ visualization (no NVIDIA vPC or RTX vWS) / Better / Good / Top-of-Line / Best / Better / Good
    vGPU: Yes / Yes
    Hardware Root of Trust: Internal and External / Internal with Option for External / Internal / Internal with Option for External
    Confidential Computing: Yes (1) / - / - / - / - / - / - / -
    NVIDIA AI Enterprise: Add-on / Included / Add-on / Add-on / Add-on
    Notes: (1) Supported on Azure NVIDIA A100 with reduced performance compared to A100 without Confidential Computing, or H100 with Confidential Computing. (2) All Tensor Core numbers are with sparsity; without sparsity the value is half. (3) Includes AV1 in addition to H.265, H.264, VP9, VP8, MPEG4.
  11. NVIDIA Grace CPU
    Grace Hopper Superchip (CPU + GPU): CPU with up to 72 cores, up to 546 GB/s LPDDR5X bandwidth, up to 512 GB LPDDR5X; GPU with 3 TB/s HBM3 bandwidth, up to 96 GB HBM3; NVLink-C2C 900 GB/s.
    Grace CPU Superchip (CPU + CPU): 144 cores, up to 1 TB/s LPDDR5X bandwidth, up to 960 GB LPDDR5X; NVLink-C2C 900 GB/s.
    https://www.nvidia.com/en-us/data-center/grace-cpu/
  12. NVIDIA BlueField-3 DPU • Network Interfaces • 1 or 2

    ports with up to 400Gb/s Ethernet or InfiniBand • Arm CPU cores • Up to 16 Armv8.2+ A78 Hercules cores (64bit) • Programmable Datapath Accelerator • 16 cores, 256 threads • DOCA • DDR DIMM • 32GB DDR5 https://www.nvidia.com/en-us/networking/products/data-processing-unit/
  13. Agenda: Introduction; Hardware (CPU, GPU, DPU); GPU Programming (CUDA, Directive, ISO Standard); Application Performance; Useful Information
  14. "Overview of GPU Programming Models" by Jeff L. (slide 6): Programming the NVIDIA Platform (CPU, GPU, and Network)
    Accelerated standard languages (ISO C++, ISO Fortran), incremental portable optimization (OpenACC, OpenMP), platform specialization (CUDA), and acceleration libraries (core, communication, math, data analytics, AI, quantum).
    ISO C++:
      std::transform(par, x, x+n, y, y,
                     [=](float x, float y){ return y + a*x; });
    ISO Fortran:
      do concurrent (i = 1:n)
         y(i) = y(i) + a*x(i)
      enddo
    Python (cuNumeric):
      import cunumeric as np
      ...
      def saxpy(a, x, y):
          y[:] += a*x
    OpenACC:
      #pragma acc data copy(x,y)
      {
        ...
        std::transform(par, x, x+n, y, y,
                       [=](float x, float y){ return y + a*x; });
        ...
      }
    OpenMP:
      #pragma omp target data map(x,y)
      {
        ...
        std::transform(par, x, x+n, y, y,
                       [=](float x, float y){ return y + a*x; });
        ...
      }
    CUDA:
      __global__ void saxpy(int n, float a, float *x, float *y) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] += a*x[i];
      }
      int main(void) {
        ...
        cudaMemcpy(d_x, x, ...);
        cudaMemcpy(d_y, y, ...);
        saxpy<<<(N+255)/256,256>>>(...);
        cudaMemcpy(y, d_y, ...);
      }
  15. "Overview of GPU Programming Models" by Jeff L. (slide 7): NVIDIA Math Libraries
    Linear algebra, FFT, RNG, and basic math: Math API, cuFFT, cuSPARSE, cuSOLVER, cuBLAS, cuTENSOR, cuRAND, CUTLASS, AMGX.
  16. “Overview of GPU Programming Models” by Jeff L. 10 HPC

    PROGRAMMING IN ISO C++ C++23 And Beyond Executors / Senders-Receivers ➢ Simplify launching and managing parallel work across CPUs and accelerators ➢ Preview Implementation In Progress! Linear Algebra ➢ C++ standard algorithms API to linear algebra ➢ Maps to vendor optimized BLAS libraries ➢ Preview Implementation In Progress! ISO is the place for portable concurrency and parallelism C++17 & C++20 Parallel Algorithms ➢ In NVC++ ➢ Parallel and vector concurrency Forward Progress Guarantees ➢ Extend the C++ execution model for accelerators Memory Model Clarifications ➢ Extend the C++ memory model for accelerators Ranges ➢ Simplifies iterating over a range of values Scalable Synchronization Library ➢ Express thread synchronization that is portable and scalable across CPUs and accelerators ➢ In libcu++: ➢ std::atomic<T> ➢ std::barrier ➢ std::counting_semaphore ➢ std::atomic<T>::wait/notify_* ➢ std::atomic_ref<T> Preview support coming to NVC++ std::mdspan/mdarray ➢ HPC-oriented multi-dimensional array abstractions. ➢ Preview Implementation In Progress! Range-Based Parallel Algorithms ➢ Improved multi-dimensional loops Extended Floating Point Types ➢ First-class support for formats new and old: std::float16_t/float64_t
  17. “Overview of GPU Programming Models” by Jeff L. 11 static

    inline void CalcHydroConstraintForElems(Domain &domain, Index_t length, Index_t *regElemlist, Real_t dvovmax, Real_t& dthydro) { #if _OPENMP const Index_t threads = omp_get_max_threads(); Index_t hydro_elem_per_thread[threads]; Real_t dthydro_per_thread[threads]; #else Index_t threads = 1; Index_t hydro_elem_per_thread[1]; Real_t dthydro_per_thread[1]; #endif #pragma omp parallel firstprivate(length, dvovmax) { Real_t dthydro_tmp = dthydro ; Index_t hydro_elem = -1 ; #if _OPENMP Index_t thread_num = omp_get_thread_num(); #else Index_t thread_num = 0; #endif #pragma omp for for (Index_t i = 0 ; i < length ; ++i) { Index_t indx = regElemlist[i] ; if (domain.vdov(indx) != Real_t(0.)) { Real_t dtdvov = dvovmax / (FABS(domain.vdov(indx))+Real_t(1.e-20)) ; if ( dthydro_tmp > dtdvov ) { dthydro_tmp = dtdvov ; hydro_elem = indx ; } } } dthydro_per_thread[thread_num] = dthydro_tmp ; hydro_elem_per_thread[thread_num] = hydro_elem ; } for (Index_t i = 1; i < threads; ++i) { if(dthydro_per_thread[i] < dthydro_per_thread[0]) { dthydro_per_thread[0] = dthydro_per_thread[i]; hydro_elem_per_thread[0] = hydro_elem_per_thread[i]; } } if (hydro_elem_per_thread[0] != -1) { dthydro = dthydro_per_thread[0] ; } return ; } C++ with OpenMP Lulesh with Standard C++ About Lulesh ➢ Hydrodynamics Mini-App from LLNL ➢ ~9000 LOC, C++, OpenMP, CUDA, RAJA, … With Standard C++: ➢ Composable, compact and elegant ➢ Easy to read and maintain ➢ ISO Standard ➢ Portable – nvc++, g++, icpc, MSVC, … static inline void CalcHydroConstraintForElems(Domain &domain, Index_t length, Index_t *regElemlist, Real_t dvovmax, Real_t &dthydro) { dthydro = std::transform_reduce( std::execution::par, counting_iterator(0), counting_iterator(length), dthydro, [](Real_t a, Real_t b) { return a < b ? a : b; }, [=, &domain](Index_t i) { Index_t indx = regElemlist[i]; if (domain.vdov(indx) == Real_t(0.0)) { return std::numeric_limits<Real_t>::max(); } else { return dvovmax / (std::abs(domain.vdov(indx)) + Real_t(1.e-20)); } }); } Standard C++ codesign.llnl.gov/lulesh
  18. "Overview of GPU Programming Models" by Jeff L. (slide 12): C++ Standard Parallelism, Same ISO C++ Code
    [Chart: Lulesh speed-up for OpenMP (default) with g++, icpc, and nvc++ versus ISO C++ with g++, icpc, nvc++ (CPU), and nvc++ (GPU); reported CPU values lie between 1.00X and 2.71X, and nvc++ (GPU) reaches 14.75X.]
    AMD EPYC 7742 CPU, NVIDIA A100 GPU; g++ version 10.3.0, icpc version 2021.5.0, nvc++ version 22.3.
  19. “Overview of GPU Programming Models” by Jeff L. 13 HPC

    PROGRAMMING IN ISO FORTRAN Fortran 2018 Fortran 202x Fortran Array Intrinsics ➢ NVFORTRAN 20.5 ➢ Accelerated matmul, reshape, spread, … DO CONCURRENT ➢ NVFORTRAN 20.11 ➢ Auto-offload & multi-core Co-Arrays ➢ Not currently available ➢ Accelerated co-array images DO CONCURRENT Reductions ➢ NVFORTRAN 21.11 ➢ REDUCE subclause added ➢ Support for +, *, MIN, MAX, IAND, IOR, IEOR. ➢ Support for .AND., .OR., .EQV., .NEQV on LOGICAL values ISO is the place for portable concurrency and parallelism Preview support available now in NVFORTRAN
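    The REDUCE subclause listed above can be sketched as follows; a minimal, illustrative example (the subroutine name and the -stdpar=gpu note are assumptions, not taken from the deck):
      subroutine dotp(n, x, y, s)
        integer :: n, i
        real :: x(n), y(n), s
        s = 0.0
        ! reduce(+:s) marks s as a sum reduction, so the iterations can run
        ! in parallel; nvfortran can offload this loop to the GPU (e.g. with
        ! -stdpar=gpu).
        do concurrent (i = 1:n) reduce(+:s)
          s = s + x(i)*y(i)
        end do
      end subroutine dotp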
  20. "Overview of GPU Programming Models" by Jeff L. (slide 14): MiniWeather, Standard Language Parallelism in Climate/Weather Applications
    Mini-app written in C++ and Fortran that simulates weather-like fluid flows using finite volume and Runge-Kutta methods. Existing parallelization in MPI, OpenMP, OpenACC, ... Included in the SPEChpc benchmark suite*. Open-source and commonly used in training events. https://github.com/mrnorman/miniWeather/
    [Chart: MiniWeather performance comparison for OpenMP (CPU), DO CONCURRENT (CPU), DO CONCURRENT (GPU), and OpenACC.]
      do concurrent (ll=1:NUM_VARS, k=1:nz, i=1:nx) local(x,z,x0,z0,xrad,zrad,amp,dist,wpert)
        if (data_spec_int == DATA_SPEC_GRAVITY_WAVES) then
          x = (i_beg-1 + i-0.5_rp) * dx
          z = (k_beg-1 + k-0.5_rp) * dz
          x0 = xlen/8
          z0 = 1000
          xrad = 500
          zrad = 500
          amp = 0.01_rp
          dist = sqrt( ((x-x0)/xrad)**2 + ((z-z0)/zrad)**2 ) * pi / 2._rp
          if (dist <= pi / 2._rp) then
            wpert = amp * cos(dist)**2
          else
            wpert = 0._rp
          endif
          tend(i,k,ID_WMOM) = tend(i,k,ID_WMOM) + wpert*hy_dens_cell(k)
        endif
        state_out(i,k,ll) = state_init(i,k,ll) + dt * tend(i,k,ll)
      enddo
    Source: HPC SDK 22.1, AMD EPYC 7742, NVIDIA A100. MiniWeather: NX=2000, NZ=1000, SIM_TIME=5. The OpenACC version uses the -gpu=managed option. *SPEChpc is a trademark of The Standard Performance Evaluation Corporation.
  21. "Overview of GPU Programming Models" by Jeff L. (slide 16): What is OpenACC?
    OpenACC is a directive-based parallel programming model designed for productivity, performance, and portability.
    Platforms supported: NVIDIA GPU, x86 CPU, POWER CPU, Sunway, Arm CPU, AMD GPU, FPGA. Applications: 250+, including 3 out of the Top 5. Community: ~3000 Slack members.
  22. "Overview of GPU Programming Models" by Jeff L. (slide 17): Parallelize with OpenACC
      while ( error > tol && iter < iter_max ) {
        error = 0.0;
        #pragma acc parallel loop reduction(max:error)
        for (int j = 1; j < n - 1; j++) {
          for (int i = 1; i < m - 1; i++) {
            Anew[OFFSET(j, i, m)] = 0.25 *
              (A[OFFSET(j, i + 1, m)] + A[OFFSET(j, i - 1, m)] +
               A[OFFSET(j - 1, i, m)] + A[OFFSET(j + 1, i, m)]);
            error = fmax(error, fabs(Anew[OFFSET(j, i, m)] - A[OFFSET(j, i, m)]));
          }
        }
        #pragma acc parallel loop
        for (int j = 1; j < n - 1; j++) {
          for (int i = 1; i < m - 1; i++) {
            A[OFFSET(j, i, m)] = Anew[OFFSET(j, i, m)];
          }
        }
        if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
      }
    The first loop nest is parallelized with a max reduction on error; the second loop nest is simply parallelized. We didn't detail how to parallelize the loops, just which loops to parallelize.
  23. "Overview of GPU Programming Models" by Jeff L. (slide 18): Parallelize with OpenMP Offloading
      while ( error > tol && iter < iter_max ) {
        error = 0.0;
        #pragma omp target teams loop reduction(max:error) collapse(2)
        for (int j = 1; j < n - 1; j++) {
          for (int i = 1; i < m - 1; i++) {
            Anew[OFFSET(j, i, m)] = 0.25 *
              (A[OFFSET(j, i + 1, m)] + A[OFFSET(j, i - 1, m)] +
               A[OFFSET(j - 1, i, m)] + A[OFFSET(j + 1, i, m)]);
            error = fmax(error, fabs(Anew[OFFSET(j, i, m)] - A[OFFSET(j, i, m)]));
          }
        }
        #pragma omp target teams loop collapse(2)
        for (int j = 1; j < n - 1; j++) {
          for (int i = 1; i < m - 1; i++) {
            A[OFFSET(j, i, m)] = Anew[OFFSET(j, i, m)];
          }
        }
        if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
      }
    OpenMP target offloading looks similar to OpenACC, but requires more understanding from the developer because there is a myriad of possible directive combinations to choose from.
  24. "Overview of GPU Programming Models" by Jeff L. (slide 20): Overview of CuPy
    CuPy supports a subset of the numpy.ndarray interface, including: basic & advanced indexing and broadcasting; data types (int32, float32, uint64, complex64, ...); array manipulation routines (reshape); linear algebra functions (dot, matmul, etc.); reductions along an axis (max, sum, argmax, etc.). For more details on broadcasting see https://numpy.org/doc/stable/user/basics.broadcasting.html
      >>> import numpy as np
      >>> X = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
      # Basic indexing and slicing
      >>> X[5:]
      array([5, 6, 7, 8, 9])
      >>> X[1:7:2]
      array([1, 3, 5])
      # Advanced indexing
      >>> X = np.array([[1, 2], [3, 4], [5, 6]])
      >>> X[[0, 1, 2], [0, 1, 0]]
      array([1, 4, 5])
      # Reduction and linear algebra functions
      >>> max(X)
      9.0
      >>> B = np.array([1,2,3,4], dtype=np.float32)
      >>> C = np.array([5,6,7,8], dtype=np.float32)
      >>> np.matmul(B, C)
      70.0
      # Data types and array manipulation routines
      >>> A = 1j*np.arange(9, dtype=np.complex64).reshape(3,3)
      [[0.+0.j 0.+1.j 0.+2.j]
       [0.+3.j 0.+4.j 0.+5.j]
       [0.+6.j 0.+7.j 0.+8.j]]
  25. "Overview of GPU Programming Models" by Jeff L. (slide 21): Overview of CuPy
    The same operations with CuPy, changing only the import and array constructor: basic & advanced indexing and broadcasting; data types (int32, float32, uint64, complex64, ...); array manipulation routines (reshape); linear algebra functions (dot, matmul, etc.); reductions along an axis (max, sum, argmax, etc.). For more details on broadcasting see https://numpy.org/doc/stable/user/basics.broadcasting.html
      >>> import cupy as cp
      >>> X = cp.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
      # Basic indexing and slicing
      >>> X[5:]
      array([5, 6, 7, 8, 9])
      >>> X[1:7:2]
      array([1, 3, 5])
      # Advanced indexing
      >>> X = cp.array([[1, 2], [3, 4], [5, 6]])
      >>> X[[0, 1, 2], [0, 1, 0]]
      array([1, 4, 5])
      # Reduction and linear algebra functions
      >>> max(X)
      9.0
      >>> B = cp.array([1,2,3,4], dtype=cp.float32)
      >>> C = cp.array([5,6,7,8], dtype=cp.float32)
      >>> cp.matmul(B, C)
      70.0
      # Data types and array manipulation routines
      >>> A = 1j*cp.arange(9, dtype=cp.complex64).reshape(3,3)
      [[0.+0.j 0.+1.j 0.+2.j]
       [0.+3.j 0.+4.j 0.+5.j]
       [0.+6.j 0.+7.j 0.+8.j]]
  26. "Overview of GPU Programming Models" by Jeff L. (slide 22): cuNumeric, Automatic NumPy Acceleration and Scalability
    cuNumeric transparently accelerates and scales existing NumPy workloads. Program from the edge to the supercomputer in Python by changing as little as one import line. Pass data between Legate libraries without worrying about distribution or synchronization requirements. Alpha release available at github.com/nv-legate
    [Chart: Distributed NumPy performance (weak scaling), time in seconds versus relative dataset size and number of GPUs from 1 to 1024, comparing CuPy and Legate.]
      for _ in range(iter):
          un = u.copy()
          vn = v.copy()
          b = build_up_b(rho, dt, dx, dy, u, v)
          p = pressure_poisson_periodic(b, nit, p, dx, dy)
          ...
    Extracted from the "CFD Python" course at https://github.com/barbagroup/CFDPython. Barba, Lorena A., and Forsyth, Gilbert F. (2018). CFD Python: the 12 steps to Navier-Stokes equations. Journal of Open Source Education, 1(9), 21, https://doi.org/10.21105/jose.00021
  27. OpenACC: Migrating from OpenMP (CPU) to OpenACC (GPU)
    OpenMP:
      subroutine saxpy(n, a, X, Y)
        real :: a, X(:), Y(:)
        integer :: n, i
        !$omp parallel do
        do i=1, n
          Y(i) = a*X(i)+Y(i)
        enddo
        !$omp end parallel do
      end subroutine saxpy
      ...
      call saxpy(N, 3.0, x, y)
      ...
    OpenACC:
      subroutine saxpy(n, a, X, Y)
        real :: a, X(:), Y(:)
        integer :: n, i
        !$acc parallel copy(Y(:)) copyin(X(:))
        do i=1, n
          Y(i) = a*X(i)+Y(i)
        enddo
        !$acc end parallel
      end subroutine saxpy
      ...
      call saxpy(N, 3.0, x, y)
      ...
  28. OpenACC "parallel" and "kernels" directives
    parallel (user-driven):
      subroutine saxpy(n, a, X, Y)
        real :: a, X(:), Y(:)
        integer :: n, i
        !$acc parallel copy(Y(:)) copyin(X(:))
        do i=1, n
          Y(i) = a*X(i)+Y(i)
        enddo
        !$acc end parallel
      end subroutine saxpy
      ...
      call saxpy(N, 3.0, x, y)
      ...
    kernels (compiler-driven):
      subroutine saxpy(n, a, X, Y)
        real :: a, X(:), Y(:)
        integer :: n, i
        !$acc kernels copy(Y(:)) copyin(X(:))
        do i=1, n
          Y(i) = a*X(i)+Y(i)
        enddo
        !$acc end kernels
      end subroutine saxpy
      ...
      call saxpy(N, 3.0, x, y)
      ...
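    The practical difference between the two directives shows up once a region contains more than one loop: with kernels the compiler analyzes the whole region and typically generates one GPU kernel per loop nest, preserving the order between them, whereas with parallel the programmer marks each loop explicitly. A minimal sketch (subroutine and variable names are illustrative, not taken from the deck):
      subroutine scale_and_square(n, a, X, Y, Z)
        real :: a, X(:), Y(:), Z(:)
        integer :: n, i
        ! One kernels region, two loop nests: the compiler decides how to
        ! parallelize each loop and runs the resulting kernels in order.
        !$acc kernels copyin(X(:)) copy(Y(:)) copyout(Z(:))
        do i = 1, n
          Y(i) = a*X(i) + Y(i)
        enddo
        do i = 1, n
          Z(i) = Y(i)*Y(i)
        enddo
        !$acc end kernels
      end subroutine scale_and_square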
  29. OpenACC loop directive: specifies how a loop is parallelized
    Clauses include: independent (the loop may be parallelized, no dependences); seq (run sequentially); collapse (fuse nested loops into one iteration space); gang/vector (parallelization granularity); etc. A sketch using collapse and gang/vector follows the slide code below.
      subroutine saxpy(n, a, X, Y)
        real :: a, X(:), Y(:)
        integer :: n, i
        !$acc parallel copy(Y(:)) copyin(X(:))
        !$acc loop independent
        do i=1, n
          Y(i) = a*X(i)+Y(i)
        enddo
        !$acc end parallel
      end subroutine saxpy
      ...
      call saxpy(N, 3.0, x, y)
      ...
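    A minimal sketch of collapse and gang/vector on a nested loop (names are illustrative, not taken from the deck):
      subroutine smooth(n, m, A, B)
        integer :: n, m, i, j
        real :: A(n,m), B(n,m)
        !$acc parallel copyin(A) copy(B)
        ! collapse(2) fuses the j and i loops into a single iteration space;
        ! gang vector chooses the granularity of the mapping onto the GPU.
        !$acc loop independent collapse(2) gang vector
        do j = 2, m-1
          do i = 2, n-1
            B(i,j) = 0.25*(A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
          enddo
        enddo
        !$acc end parallel
      end subroutine smooth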
  30. OpenACC data clauses: specify how data moves between CPU and GPU
    copyin (CPU -> GPU); copyout (CPU <- GPU); copy (both directions); create (allocate device memory only, no transfer); etc. A sketch combining create and copyout follows the slide code below.
      subroutine saxpy(n, a, X, Y)
        real :: a, X(:), Y(:)
        integer :: n, i
        !$acc parallel copy(Y(:)) copyin(X(:))
        do i=1, n
          Y(i) = a*X(i)+Y(i)
        enddo
        !$acc end parallel
      end subroutine saxpy
      ...
      call saxpy(N, 3.0, x, y)
      ...
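    A minimal sketch combining create and copyout: the work array W lives only on the GPU, and only Y comes back to the host (names are illustrative, not taken from the deck):
      subroutine init_square(n, Y)
        integer :: n, i
        real :: Y(:)
        real, allocatable :: W(:)
        allocate(W(n))
        ! create: W is allocated on the GPU but never transferred.
        ! copyout: Y is written on the GPU and copied back at the end.
        !$acc kernels create(W(:)) copyout(Y(:))
        do i = 1, n
          W(i) = real(i)
        enddo
        do i = 1, n
          Y(i) = W(i)*W(i)
        enddo
        !$acc end kernels
        deallocate(W)
      end subroutine init_square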
  31. OpenACC data directive: defines a data region
    With the arrays placed in a data region at the call site, the kernel can use present(X, Y) instead of copying data on every call. A sketch of a data region spanning several calls follows the slide code below.
      subroutine saxpy(n, a, X, Y)
        real :: a, X(:), Y(:)
        integer :: n, i
        !$acc parallel present(X, Y)
        do i=1, n
          Y(i) = a*X(i)+Y(i)
        enddo
        !$acc end parallel
      end subroutine saxpy
      ...
      !$acc data copy(Y(:)) copyin(X(:))
      call saxpy(N, 3.0, x, y)
      ...
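    A minimal sketch of a data region spanning several calls, so X and Y stay resident on the GPU in between; the update directive refreshes the host copy when an intermediate result is needed there (variable names follow the slide; the update and print lines are illustrative, not taken from the deck):
      !$acc data copy(Y(:)) copyin(X(:))
      call saxpy(N, 2.0, x, y)       ! uses present(X, Y) inside
      !$acc update self(Y(:))        ! copy the current Y back to the host
      print *, 'intermediate Y(1) =', Y(1)
      call saxpy(N, 3.0, x, y)
      !$acc end data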
  32. OpenACC: example of building with the NVIDIA HPC SDK
    Compiler commands: C: nvc; C++: nvc++; Fortran: nvfortran.
    -acc=gpu enables OpenACC and targets NVIDIA GPUs; -acc=multicore targets multicore CPUs instead.
    -Minfo=accel prints compiler feedback about how the code was parallelized.
    -gpu=... controls details of GPU code generation; -gpu=managed enables CUDA Unified Memory.
      $ nvfortran -acc=gpu -Minfo=accel -gpu=managed saxpy.f90
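    A complete saxpy.f90 that the command above could build; this is a minimal sketch (the program body and the correctness check are illustrative, not taken from the deck):
      program saxpy_main
        implicit none
        integer, parameter :: np = 1024*1024
        real, allocatable :: x(:), y(:)
        allocate(x(np), y(np))
        x = 1.0
        y = 2.0
        call saxpy(np, 3.0, x, y)
        print *, 'y(1) =', y(1), ' (expected 5.0)'   ! 3.0*1.0 + 2.0
      contains
        subroutine saxpy(n, a, X, Y)
          integer :: n, i
          real :: a, X(:), Y(:)
          !$acc parallel loop copy(Y(:)) copyin(X(:))
          do i = 1, n
            Y(i) = a*X(i) + Y(i)
          enddo
        end subroutine saxpy
      end program saxpy_main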
  33. MPS (Multi-Process Service): using a GPU efficiently from multiple processes
    Starting the MPS control daemon:
      $ export CUDA_VISIBLE_DEVICES=0
      $ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
      $ export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
      $ nvidia-cuda-mps-control -d
    Starting an MPS client application:
      $ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
      $ export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
    Shutting down MPS:
      $ echo quit | nvidia-cuda-mps-control
    https://docs.nvidia.com/deploy/mps/
  34. Agenda: Introduction; Hardware (CPU, GPU, DPU); GPU Programming (CUDA, Directive, ISO Standard); Application Performance; Useful Information
  35. GPU specs used in these benchmarks (FP64, FP32, memory, memory bandwidth)
    H100 SXM: FP64 34 TFLOPS, FP32 67 TFLOPS, 80GB HBM3, 3.35 TB/s
    H100 PCIe: FP64 26 TFLOPS, FP32 51 TFLOPS, 80GB HBM2e, 2 TB/s
    A100 SXM: FP64 9.7 TFLOPS, FP32 19.5 TFLOPS, 80GB HBM2e, 2039 GB/s
    A100 PCIe: FP64 9.7 TFLOPS, FP32 19.5 TFLOPS, 80GB HBM2e, 1935 GB/s
    A30: FP64 5.2 TFLOPS, FP32 10.3 TFLOPS, 24GB HBM2, 933 GB/s
    A40: FP32 37.4 TFLOPS, 48GB GDDR6, 696 GB/s
    V100 SXM: FP64 7.8 TFLOPS, FP32 15.7 TFLOPS, 32GB HBM2, 900 GB/s
    V100S PCIe: FP64 8.2 TFLOPS, FP32 16.4 TFLOPS, 32GB HBM2, 1134 GB/s
    T4: FP32 8.1 TFLOPS, 16GB GDDR6, 300 GB/s
  36. LAMMPS. Version: stable_23Jun2022_update1 | CPU: 2x Xeon Gold 6240
    [Charts: ATOM-Time Steps/s for the ReaxFF/C and SNAP benchmarks on A100 SXM, A100 PCIe, A30, A40, V100 SXM, V100S PCIe, and T4, with 0, 1, 2, 4, and 8 GPUs.]
    https://developer.nvidia.com/hpc-application-performance
  37. Quantum Espresso. Version: V7.0 CPU, V7.1 GPU | CPU: 2x Xeon Gold 6240
    [Charts: total CPU time (s) and speedup for the AUSURF112-jR benchmark on A100 SXM, A100 PCIe, A30, A40, V100 SXM, V100S PCIe, and T4, with 0, 1, 2, 4, and 8 GPUs.]
    https://developer.nvidia.com/hpc-application-performance
  38. VASP Model: GaAsBi_512 • Cell size: 22.6x22.6x22.6 ų • Atoms:

    256 Ga, 255 As, 1 Bi (512 total) • 4 k-points, 1536 bands, 313 eV cutoff Energy, 145 484 PWs • Standard DFT (GGA: PBE) • Algo=Fast (Davidson + RMM-DIIS) • Real-space projection scheme
  39. Agenda: Introduction; Hardware (CPU, GPU, DPU); GPU Programming (CUDA, Directive, ISO Standard); Application Performance; Useful Information
  40. "HPC と物理学 — GPU コンピューティングが拓く新しい世界" (HPC and Physics: A New World Opened Up by GPU Computing), ⽇本物理学会誌 (the membership journal of the Physical Society of Japan), Vol. 76, No. 12
    Contents: 1. Introduction; 2. High-performance computing; 3. GPU computing; 4. Supercomputers; 5. Physics simulation; 6. Physics simulation using machine learning; 7. Quantum computers; 8. Closing remarks
    https://www.jps.or.jp/books/gakkaishi/2021/12/76-12.php
  41. NVIDIA Japan Social Media Directory: Find Us Online!
    NVIDIA Japan: Twitter https://twitter.com/NVIDIAJapan, Facebook https://www.facebook.com/NVIDIA.JP, YouTube https://www.youtube.com/user/NVIDIAJapan
    NVIDIA AI Japan: Twitter https://twitter.com/NVIDIAAIJP, Facebook https://www.facebook.com/NVIDIAAI.JP
    NVIDIA Networking Japan: Facebook https://www.facebook.com/NVIDIANetworkingJapan
    NVIDIA GeForce Japan: Twitter https://twitter.com/NVIDIAGeForceJP, Facebook https://www.facebook.com/NVIDIAGeForceJP, Instagram https://instagram.com/nvidiageforcejp, YouTube https://www.youtube.com/@nvidiageforcejapan44, Twitch https://www.twitch.tv/nvidiajapan
    NVIDIA Studio Japan: Twitter https://twitter.com/NVIDIAStudioJP, Facebook https://www.facebook.com/NVIDIAStudioJP, Instagram https://instagram.com/nvidiastudiojp, YouTube https://www.youtube.com/@nvidiastudiojapan1621