
Tutorial on Migrating Phase Field Codes to GPUs

Daniel Wheeler

July 21, 2022
Transcript

  1. Tutorial on Migrating Phase Field Codes to GPUs Jin Zhang

    Voorhees Group, Northwestern University May 11, 2022
  2. Outline — the GPU hardware; a brief introduction to CUDA; example: the Allen-Cahn equation; the standard way: CUDA; the ‘easy’ way: Standard C++ 2 / 22
  8. The hardware: GPU — [diagram: Host = CPU + CPU memory; Device = GPU die + GPU memory; the GPU die contains many SMs (Streaming Multiprocessors), each holding many CUDA cores] 3 / 22
  9. Introduction to CUDA — execution hierarchy: Thread → Thread block → Grid 4 / 22
  10. Introduction to CUDA — [diagram: each thread block is scheduled onto one SM; the grid of blocks is distributed across the SMs of the device] 4 / 22
  11. Introduction to CUDA — CUDA development tools: you need an NVIDIA GPU; install the NVIDIA driver; install a compiler: CUDA Toolkit / NVIDIA HPC SDK 4 / 22
  12. Phase field — Allen-Cahn equation
    $$\frac{\partial u}{\partial t} = \epsilon^2 \nabla^2 u - u(1-u)(1-2u)$$
    Discretization (2D):
    $$\left.\frac{\partial u}{\partial t}\right|_{ij} = \frac{u^{(n+1)}_{ij} - u^{(n)}_{ij}}{\Delta t} + O(\Delta t), \qquad (\nabla^2 u)_{ij} = \frac{u^{(n)}_{i+1,j} + u^{(n)}_{i-1,j} + u^{(n)}_{i,j+1} + u^{(n)}_{i,j-1} - 4u^{(n)}_{ij}}{\Delta x^2} + O(\Delta x^2)$$
    5 / 22
  13. Phase field — Allen-Cahn equation, explicit update:
    $$u^{\mathrm{new}}_{ij} = u_{ij} + \Delta t\,\Delta u_{ij}$$
    where
    $$\Delta u_{ij} = \epsilon^2\,\frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{ij}}{\Delta x^2} - g_{ij}, \qquad g_{ij} = u_{ij}(1-u_{ij})(1-2u_{ij})$$
    5 / 22
  14. Our task — implement a CPU code, then modify it to run on the GPU 6 / 22
  15. CPU implementation — [diagram: NI×NJ grid; row index i = 0 … NI-1, column index j = 0 … NJ-1] 7 / 22
  16. CPU implementation
```c
int main(int argc, char *argv[])
{
    double *u, *u1;                                 // order parameter
    u  = (double *)malloc(NI*NJ * sizeof(double));  // old value
    u1 = (double *)malloc(NI*NJ * sizeof(double));  // new value
    // ...
    ic(u);                                          // initial condition
    for (long t = 1L; t <= n; t++) {                // loop over timesteps
        sweep(u, u1, e, dt);                        // u1 = u + du
        double *tmp = u1; u1 = u; u = tmp;          // swap u <-> u1
    }
}
```
    8 / 22
  17. CPU implementation
```c
void sweep(double *u, double *u1, double e, double dt) {
    for (int i = 0; i < NI; i++) {
        for (int j = 0; j < NJ; j++) {
            // periodic boundary condition
            int im = (i <    1) ? NI-1 : i-1;   // i-1
            int ip = (i > NI-2) ?    0 : i+1;   // i+1
            int jm = (j <    1) ? NJ-1 : j-1;   // j-1
            int jp = (j > NJ-2) ?    0 : j+1;   // j+1
            // row-major flattening: u[i][j] is stored at u[i*NJ + j]
            double uc = u[i*NJ + j];
            double g  = uc*(1.0 - uc)*(1.0 - 2.0*uc);
            double du = e*(u[im*NJ + j] + u[ip*NJ + j] + u[i*NJ + jm] + u[i*NJ + jp] - 4.0*uc) - g;
            u1[i*NJ + j] = uc + du*dt;
        }
    }
}
```
    System size: NI×NJ; index range: 0, 1, 2, …, N-2, N-1 9 / 22
  18. CPU implementation — same sweep code as slide 17; [diagram: the five-point stencil around (i,j) with neighbors (im,j), (ip,j), (i,jm), (i,jp)] 10 / 22
  19. CPU implementation — same sweep code as slide 17; the loop body evaluates
    $$g = u(1-u)(1-2u), \qquad \Delta u_{ij} = \frac{\epsilon^2}{\Delta x^2}\left(u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{ij}\right) - g_{ij}$$
    11 / 22
  20. GPU implementation — mapping the grid of points onto the Thread → Thread block → Grid hierarchy 12 / 22
  21. GPU implementation — [diagram: the NI×NJ domain tiled by a 3×3 grid of thread blocks, Block (0,0) through Block (2,2)]. Block index: (blockIdx.x, blockIdx.y) 12 / 22
  22. GPU implementation — [diagram: inside Block (2,2), an 8×8 array of threads with Thread (4,3) highlighted]. Block size: (blockDim.x, blockDim.y); thread index within the block: (threadIdx.x, threadIdx.y) 12 / 22
  23. GPU implementation — [diagram] each thread's global coordinate in the NI×NJ grid: (blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y) 12 / 22
  24. GPU implementation — [diagram: Unified Memory spanning the host (CPU + memory) and the device (GPU die + GPU memory)]. With Unified Memory there is no need to explicitly copy data between host and device (since CUDA 6) 12 / 22
  25. CUDA implementation
```c
int main(int argc, char *argv[])
{
    double *u, *u1;                                 // order parameter
    cudaMallocManaged(&u,  NI*NJ * sizeof(double)); // old value (unified memory)
    cudaMallocManaged(&u1, NI*NJ * sizeof(double)); // new value (unified memory)
    dim3 dimBlock(BLOCK_X, BLOCK_Y);
    dim3 dimGrid(NI/BLOCK_X, NJ/BLOCK_Y);           // assumes NI, NJ are multiples of the block size
    // ...
    ic(u);                                          // initial condition
    for (long t = 1L; t <= n; t++) {
        sweep<<<dimGrid, dimBlock>>>(u, u1, e, dt); // one thread per grid point
        double *tmp = u1; u1 = u; u = tmp;          // swap u <-> u1
    }
    cudaDeviceSynchronize();                        // launches are asynchronous: wait before reading u
}
```
    13 / 22
  26. CUDA implementation
```c
__global__
void sweep(double *u, double *u1, double e, double dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's grid point
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // periodic boundary condition
    int im = (i <    1) ? NI-1 : i-1;   // i-1
    int ip = (i > NI-2) ?    0 : i+1;   // i+1
    int jm = (j <    1) ? NJ-1 : j-1;   // j-1
    int jp = (j > NJ-2) ?    0 : j+1;   // j+1

    // row-major flattening: u[i][j] is stored at u[i*NJ + j]
    double uc = u[i*NJ + j];
    double g  = uc*(1.0 - uc)*(1.0 - 2.0*uc);
    double du = e*(u[im*NJ + j] + u[ip*NJ + j] + u[i*NJ + jm] + u[i*NJ + jp] - 4.0*uc) - g;
    u1[i*NJ + j] = uc + du*dt;
}
```
    14 / 22
  27. Compiling and error checking
```shell
# compile
nvcc -o myapp myapp.cu
# memcheck
compute-sanitizer ./myapp
# racecheck
compute-sanitizer --tool racecheck ./myapp
# initcheck
compute-sanitizer --tool initcheck ./myapp
# synccheck
compute-sanitizer --tool synccheck ./myapp
# Nsight Systems
nsys profile -o report --stats=true ./myapp
```
    15 / 22
  28. Compare CPU and GPU code
```shell
# compile CPU code
gcc -o myapp myapp.c -lm
# compile GPU code
nvcc -o myapp myapp.cu
# compare the results
paraview --state=compare.pvsm
diff cpu.vtk gpu.vtk
# timing
time ./myapp
```
    16 / 22
  29. Debugging
```shell
# compile with debugging info
nvcc -g -G -o myapp myapp.cu
# debugging
cuda-gdb ./myapp
# Visual Studio Code
code
```
    17 / 22
  30. Profiling — Nsight Systems and Nsight Compute
```shell
# Nsight Systems
nsys profile -o report --stats=true ./myapp
# Nsight Compute
sudo ncu -k mykernel -o report ./myapp
```
    18 / 22
  31. Optimization — block size (a multiple of 32, the warp size); the memory hierarchy; contiguous (coalesced) memory access; parallel reduction; warps (warp-level primitives); shared memory (beware bank conflicts) 19 / 22
  32. Optimization — [diagram: the memory hierarchy and typical bandwidths. Host memory ↔ GPU global memory: 16 GB/s; GPU global memory: 900 GB/s; shared memory and registers: 12 TB/s. Each block has shared memory; each thread has registers] 19 / 22
  33. Best practices — write a working CPU code first; have a working GPU code before any optimization; use profiling tools to find the bottleneck; the execution order of threads is arbitrary, so never rely on it 20 / 22
  34. An ‘easy’ way: Standard C++ parallelism (C++20)
```cpp
void sweep(double *u, double *u1, double e, double dt) {
    auto v = std::ranges::views::iota(0, NI*NJ);    // one index per grid point
    std::for_each(
        std::execution::par_unseq,                  // parallel, may be offloaded to the GPU
        std::begin(v), std::end(v),
        [=](auto idx) {
            int i = idx / NJ;
            int j = idx % NJ;
            // periodic boundary condition
            int im = (i <    1) ? NI-1 : i-1;   // i-1
            int ip = (i > NI-2) ?    0 : i+1;   // i+1
            int jm = (j <    1) ? NJ-1 : j-1;   // j-1
            int jp = (j > NJ-2) ?    0 : j+1;   // j+1
            double uc = u[idx];                 // row-major: idx == i*NJ + j
            double g  = uc*(1.0 - uc)*(1.0 - 2.0*uc);
            double du = e*(u[im*NJ + j] + u[ip*NJ + j] + u[i*NJ + jm] + u[i*NJ + jp] - 4.0*uc) - g;
            u1[idx] = uc + du*dt;
        });
}
```
    Compiled with nvc++ from the NVIDIA HPC SDK (-stdpar=gpu), the parallel algorithm runs on the GPU. 21 / 22
  35. Migrating your code to GPUs — if your code uses a library, check its GPU support; if you have a Mac or an AMD GPU, check OpenCL. Further topics: high-order time stepping; adaptive time stepping; implicit time stepping (cuSPARSE, cuSOLVER); other spatial discretizations (FFT, FV, FEM) 22 / 22