Daniel Wheeler
July 21, 2022
110

# Tutorial on Migrating Phase Field Codes to GPUs

July 21, 2022

## Transcript

1. ### Tutorial on Migrating Phase Field Codes to GPUs Jin Zhang

Voorhees Group, Northwestern University May 11, 2022
2. ### Outline The GPU hardware Brief introduction to CUDA Example: Allen-Cahn

equation The standard way: CUDA The ‘easy’ way: Standard C++ 2 / 22

8. ### The hardware: GPU CPU Memory Host GPU die GPU Memory

Device ... ... SM SM ... ... CUDA cores SM: Streaming Multiprocessor 3 / 22

10. ### Introduction to CUDA Thread Thread block Grid ... ... SM

... ... Device SM SM 4 / 22
11. ### Introduction to CUDA CUDA development tools Need an NVIDIA GPU

Install NVIDIA Driver Install compiler: CUDA Toolkit/NVIDIA HPC SDK 4 / 22
12. ### Phase ﬁeld Allen-Cahn equation ∂u ∂t = 2∇2u − u(1

− u)(1 − 2u) Discretization (2D) ∂u ∂t ij = u(n+1) ij − u(n) ij ∆t + O(∆t) (∇2u)ij = u(n) i+1j + u(n) i−1j + u(n) ij+1 + u(n) ij−1 − 4u(n) ij ∆x2 + O(∆x2) 5 / 22
13. ### Phase ﬁeld Allen-Cahn equation ∂u ∂t = 2∇2u − u(1

− u)(1 − 2u) Discretization (2D) unew ij = uij + ∆t∆uij where ∆uij = 2 ui+1j + ui−1j + uij+1 + uij−1 − 4uij ∆x2 − gij gij = uij(1 − uij)(1 − 2uij) 5 / 22
14. ### Our task Implement a CPU code Modify it to run

on GPU 6 / 22

22
16. ### CPU implementation 1 int main(int argc , char *argv [])

2 { 3 double *u, *u1; // order parameter 4 u = (double *)malloc(NI*NJ * sizeof(double)); // old value 5 u1 = (double *)malloc(NI*NJ * sizeof(double)); // new value 6 // ... 7 ic(u); // initial condition 8 for (long t=1L; t<=n; t++){ // loop timesteps 9 sweep(u, u1 , e, dt); // u1 = u + du 10 double *tmp = u1; u1 = u; u = tmp; // u <-> u1 11 } 12 } 8 / 22
17. ### CPU implementation 1 void sweep(double *u, double *u1 , double

e, double dt){ 2 for (int i=0;i<NI;i++){ 3 for (int j=0;j<NJ;j++){ 4 // periodic boundary condition 5 int im = (i< 1) ? NI -1 : i-1; // i-1 6 int ip = (i>NI -2) ? 0 : i+1; // i+1 7 int jm = (j< 1) ? NJ -1 : j-1; // j-1 8 int jp = (j>NJ -2) ? 0 : j+1; // j+1 9 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]); 10 double du = e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g; 11 u1[i,j] = u[i,j] + du*dt; 12 } 13 } 14 } System size NI×NJ, Index: 0 1 2 N-1 N-2 9 / 22
18. ### CPU implementation 1 void sweep(double *u, double *u1 , double

e, double dt){ 2 for (int i=0;i<NI;i++){ 3 for (int j=0;j<NJ;j++){ 4 // periodic boundary condition 5 int im = (i< 1) ? NI -1 : i-1; // i-1 6 int ip = (i>NI -2) ? 0 : i+1; // i+1 7 int jm = (j< 1) ? NJ -1 : j-1; // j-1 8 int jp = (j>NJ -2) ? 0 : j+1; // j+1 9 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]); 10 double du = e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g; 11 u1[i,j] = u[i,j] + du*dt; 12 } 13 } 14 } Index: 0 1 2 N-1 N-2 Neighbors: i,j i m ,j i p ,j i,j m i,j p 10 / 22
19. ### CPU implementation 1 void sweep(double *u, double *u1 , double

e, double dt){ 2 for (int i=0;i<NI;i++){ 3 for (int j=0;j<NJ;j++){ 4 // periodic boundary condition 5 int im = (i< 1) ? NI -1 : i-1; // i-1 6 int ip = (i>NI -2) ? 0 : i+1; // i+1 7 int jm = (j< 1) ? NJ -1 : j-1; // j-1 8 int jp = (j>NJ -2) ? 0 : j+1; // j+1 9 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]); 10 double du = e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g; 11 u1[i,j] = u[i,j] + du*dt; 12 } 13 } 14 } g = u(1 − u)(1 − 2u), ∆uij = 2 ∆x2 ui−1j + ui+1j + uij−1 + uij+1 − 4uij − gij 11 / 22

21. ### GPU implementation NJ NI 012345678... NJ-1 012345678... NI-1 Thread Thread

block Grid Block (0,0) Block (0,1) Block (0,2) Block (1,0) Block (1,1) Block (1,2) Block (2,0) Block (2,1) Block (2,2) Block index: (blockIdx.x, blockIdx.y) 12 / 22

23. ### GPU implementation 012345678... NJ-1 012345678... NI-1 Thread Thread block Grid

Block (2,2) Thread (4,3) (blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y) 12 / 22
24. ### GPU implementation CPU Host GPU die Device Uniﬁed Memory Memory

GPU Memory No need to explicitly copy between host and device (since CUDA 6) 12 / 22
25. ### CUDA implementation 1 int main(int argc , char *argv [])

2 { 3 double *u, *u1; // order parameter 4 cudaMallocManaged ( &u, NI*NJ * sizeof(double)); // old value 5 cudaMallocManaged (&u1 , NI*NJ * sizeof(double)); // new value 6 dim3 dimBlock(BLOCK_X , BLOCK_Y); 7 dim3 dimGrid(NI/BLOCK_X , NJ/BLOCK_Y); 8 // ... 9 ic(u); // initial condition 10 for (long t=1L; t<=n; t++){ 11 sweep<<<dimGrid, dimBlock>>>(u, u1 , e, dt); 12 double *tmp = u1; u1 = u; u = tmp; // u <-> u1 13 } 14 } 13 / 22
26. ### CUDA implementation 1 __global__ 2 void sweep(double *u, double *u1

, double e, double dt){ 3 int i = blockIdx.x * blockDim.x + threadIdx.x; 4 int j = blockIdx.y * blockDim.y + threadIdx.y; 5 // periodic boundary condition 6 int im = (i< 1) ? NI -1 : i-1; // i-1 7 int ip = (i>NI -2) ? 0 : i+1; // i+1 8 int jm = (j< 1) ? NJ -1 : j-1; // j-1 9 int jp = (j>NJ -2) ? 0 : j+1; // j+1 10 11 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]); 12 double du = e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g; 13 u1[i,j] = u[i,j]+du*dt; 14 } 14 / 22
27. ### Compiling and error checking 1 # compile 2 nvcc -o

myapp myapp.cu 3 # memcheck 4 compute -sanitizer ./ myapp 5 # racecheck 6 compute -sanitizer --tool racecheck ./ myapp 7 # initcheck 8 compute -sanitizer --tool initcheck ./ myapp 9 # synccheck 10 compute -sanitizer --tool synccheck ./ myapp 11 # nsight system 12 nsys profile -o report --stats=true ./ myapp 15 / 22
28. ### Compare CPU and GPU code 1 # compile CPU code

2 gcc -o myapp myapp.c -lm 3 # compile GPU code 4 nvcc -o myapp myapp.cu 5 # compare the result 6 paraview --state=compare.pvsm 7 diff cpu.vtk gpu.vtk 8 # timing 9 time ./ myapp 16 / 22
29. ### Debugging 1 # compile with debugging info 2 nvcc -g

-G -o myapp myapp.cu 3 # debugging 4 cuda -gdb ./ myapp 5 # Visual Studio Code 6 code 17 / 22
30. ### Proﬁling Nsight Systems blank Nsight Compute 1 # Nsignt Systems

2 nsys profile -o report --stats=true ./ myapp 3 # Nsight Compute 4 sudo ncu -k mykernel -o report ./ myapp 18 / 22
31. ### Optimization Size of a block (multiple of 32) Memory hierarchy

Contiguous memory access Parallel reduction Warp (warp-level primitives) Shared memory (bank conﬂict) 19 / 22
32. ### Optimization Host Memory GPU (Grid) GPU Memory (Global) 16 GB/s

Block 0 Thread 0 register shared memory Thread 1 register Block 1 Thread 0 register shared memory Thread 1 register 900 GB/s 12TB/s 12TB/s 19 / 22
33. ### Best practices Have a CPU code at ﬁrst Have a

working GPU code before any optimization Use proﬁling tools to ﬁnd the bottleneck The execution order of threads is random 20 / 22
34. ### An ‘easy’ way: Standard C++ Parallelism (C++20) 1 void sweep(double

*u, double *u1 , double e, double dt){ 2 auto v = std:: ranges :: views :: iota(0, NI*NJ); 3 std:: for_each( 4 std:: execution :: par_unseq , 5 std:: begin(v), std::end(v), 6 [=] (auto idx){ 7 int i = idx / NJ; 8 int j = idx % NJ; 9 // periodic boundary condition 10 int im = (i< 1) ? NI -1 : i-1; // i-1 11 int ip = (i>NI -2) ? 0 : i+1; // i+1 12 int jm = (j< 1) ? NJ -1 : j-1; // j-1 13 int jp = (j>NJ -2) ? 0 : j+1; // j+1 14 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]); 15 double du= e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g; 16 u1[i,j] = u[i,j]+du*dt; 17 }); 18 } 21 / 22
35. ### Migrating your code to GPUs If your code uses a

library, check its GPU support If you have a Mac or an AMD GPU, check OpenCL High-order time stepping Adaptive time stepping Implicit time stepping: cuSPARSE, cuSOLVER Other spatial discretization: FFT, FV, FEM 22 / 22