Daniel Wheeler
July 21, 2022
96

# Tutorial on Migrating Phase Field Codes to GPUs

July 21, 2022

## Transcript

1. Tutorial on Migrating Phase Field Codes to
GPUs
Jin Zhang
Voorhees Group, Northwestern University
May 11, 2022

2. Outline
The GPU hardware
Brief introduction to CUDA
Example: Allen-Cahn equation
The standard way: CUDA
The ‘easy’ way: Standard C++
2 / 22

3. The hardware: GPU
3 / 22

4. The hardware: GPU
3 / 22

5. The hardware: GPU
3 / 22

6. The hardware: GPU
3 / 22

7. The hardware: GPU
3 / 22

8. The hardware: GPU
CPU
Memory
Host
GPU
die
GPU
Memory
Device
...
...
SM
SM
...
...
CUDA
cores
SM: Streaming Multiprocessor
3 / 22

9. Introduction to CUDA
block
Grid
4 / 22

10. Introduction to CUDA
block
Grid
...
...
SM
...
...
Device
SM
SM
4 / 22

11. Introduction to CUDA
CUDA development tools
Need an NVIDIA GPU
Install NVIDIA Driver
Install compiler: CUDA Toolkit/NVIDIA HPC SDK
4 / 22

12. Phase ﬁeld
Allen-Cahn equation
∂u
∂t
= 2∇2u − u(1 − u)(1 − 2u)
Discretization (2D)
∂u
∂t ij
=
u(n+1)
ij
− u(n)
ij
∆t
+ O(∆t)
(∇2u)ij =
u(n)
i+1j
+ u(n)
i−1j
+ u(n)
ij+1
+ u(n)
ij−1
− 4u(n)
ij
∆x2
+ O(∆x2)
5 / 22

13. Phase ﬁeld
Allen-Cahn equation
∂u
∂t
= 2∇2u − u(1 − u)(1 − 2u)
Discretization (2D)
unew
ij
= uij + ∆t∆uij
where
∆uij = 2
ui+1j + ui−1j + uij+1 + uij−1 − 4uij
∆x2
− gij
gij = uij(1 − uij)(1 − 2uij)
5 / 22

Implement a CPU code
Modify it to run on GPU
6 / 22

15. CPU implementation
NJ
NI
012345678... NJ-1
012345678... NI-1
7 / 22

16. CPU implementation
1 int main(int argc , char *argv [])
2 {
3 double *u, *u1; // order parameter
4 u = (double *)malloc(NI*NJ * sizeof(double)); // old value
5 u1 = (double *)malloc(NI*NJ * sizeof(double)); // new value
6 // ...
7 ic(u); // initial condition
8 for (long t=1L; t<=n; t++){ // loop timesteps
9 sweep(u, u1 , e, dt); // u1 = u + du
10 double *tmp = u1; u1 = u; u = tmp; // u <-> u1
11 }
12 }
8 / 22

17. CPU implementation
1 void sweep(double *u, double *u1 , double e, double dt){
2 for (int i=0;i3 for (int j=0;j4 // periodic boundary condition
5 int im = (i< 1) ? NI -1 : i-1; // i-1
6 int ip = (i>NI -2) ? 0 : i+1; // i+1
7 int jm = (j< 1) ? NJ -1 : j-1; // j-1
8 int jp = (j>NJ -2) ? 0 : j+1; // j+1
9 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]);
10 double du = e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g;
11 u1[i,j] = u[i,j] + du*dt;
12 }
13 }
14 }
System size NI×NJ, Index: 0 1 2 N-1
N-2
9 / 22

18. CPU implementation
1 void sweep(double *u, double *u1 , double e, double dt){
2 for (int i=0;i3 for (int j=0;j4 // periodic boundary condition
5 int im = (i< 1) ? NI -1 : i-1; // i-1
6 int ip = (i>NI -2) ? 0 : i+1; // i+1
7 int jm = (j< 1) ? NJ -1 : j-1; // j-1
8 int jp = (j>NJ -2) ? 0 : j+1; // j+1
9 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]);
10 double du = e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g;
11 u1[i,j] = u[i,j] + du*dt;
12 }
13 }
14 }
Index: 0 1 2 N-1
N-2
Neighbors:
i,j
i
m
,j
i
p
,j
i,j
m
i,j
p
10 / 22

19. CPU implementation
1 void sweep(double *u, double *u1 , double e, double dt){
2 for (int i=0;i3 for (int j=0;j4 // periodic boundary condition
5 int im = (i< 1) ? NI -1 : i-1; // i-1
6 int ip = (i>NI -2) ? 0 : i+1; // i+1
7 int jm = (j< 1) ? NJ -1 : j-1; // j-1
8 int jp = (j>NJ -2) ? 0 : j+1; // j+1
9 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]);
10 double du = e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g;
11 u1[i,j] = u[i,j] + du*dt;
12 }
13 }
14 }
g = u(1 − u)(1 − 2u), ∆uij = 2
∆x2
ui−1j + ui+1j + uij−1 + uij+1 − 4uij − gij
11 / 22

20. GPU implementation
block
Grid
12 / 22

21. GPU implementation
NJ
NI
012345678... NJ-1
012345678... NI-1
block
Grid
Block
(0,0)
Block
(0,1)
Block
(0,2)
Block
(1,0)
Block
(1,1)
Block
(1,2)
Block
(2,0)
Block
(2,1)
Block
(2,2)
Block index: (blockIdx.x, blockIdx.y)
12 / 22

22. GPU implementation
block
Grid
Block
(2,2)
01234567
01234567
(4,3)
12 / 22

23. GPU implementation
012345678... NJ-1
012345678... NI-1
block
Grid
Block
(2,2)
(4,3)
12 / 22

24. GPU implementation
CPU
Host
GPU
die
Device
Uniﬁed
Memory
Memory
GPU
Memory
No need to explicitly copy between host and device (since CUDA 6)
12 / 22

25. CUDA implementation
1 int main(int argc , char *argv [])
2 {
3 double *u, *u1; // order parameter
4 cudaMallocManaged ( &u, NI*NJ * sizeof(double)); // old value
5 cudaMallocManaged (&u1 , NI*NJ * sizeof(double)); // new value
6 dim3 dimBlock(BLOCK_X , BLOCK_Y);
7 dim3 dimGrid(NI/BLOCK_X , NJ/BLOCK_Y);
8 // ...
9 ic(u); // initial condition
10 for (long t=1L; t<=n; t++){
11 sweep<<>>(u, u1 , e, dt);
12 double *tmp = u1; u1 = u; u = tmp; // u <-> u1
13 }
14 }
13 / 22

26. CUDA implementation
1 __global__
2 void sweep(double *u, double *u1 , double e, double dt){
3 int i = blockIdx.x * blockDim.x + threadIdx.x;
4 int j = blockIdx.y * blockDim.y + threadIdx.y;
5 // periodic boundary condition
6 int im = (i< 1) ? NI -1 : i-1; // i-1
7 int ip = (i>NI -2) ? 0 : i+1; // i+1
8 int jm = (j< 1) ? NJ -1 : j-1; // j-1
9 int jp = (j>NJ -2) ? 0 : j+1; // j+1
10
11 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]);
12 double du = e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g;
13 u1[i,j] = u[i,j]+du*dt;
14 }
14 / 22

27. Compiling and error checking
1 # compile
2 nvcc -o myapp myapp.cu
3 # memcheck
4 compute -sanitizer ./ myapp
5 # racecheck
6 compute -sanitizer --tool racecheck ./ myapp
7 # initcheck
8 compute -sanitizer --tool initcheck ./ myapp
9 # synccheck
10 compute -sanitizer --tool synccheck ./ myapp
11 # nsight system
12 nsys profile -o report --stats=true ./ myapp
15 / 22

28. Compare CPU and GPU code
1 # compile CPU code
2 gcc -o myapp myapp.c -lm
3 # compile GPU code
4 nvcc -o myapp myapp.cu
5 # compare the result
6 paraview --state=compare.pvsm
7 diff cpu.vtk gpu.vtk
8 # timing
9 time ./ myapp
16 / 22

29. Debugging
1 # compile with debugging info
2 nvcc -g -G -o myapp myapp.cu
3 # debugging
4 cuda -gdb ./ myapp
5 # Visual Studio Code
6 code
17 / 22

30. Proﬁling
Nsight Systems blank Nsight Compute
1 # Nsignt Systems
2 nsys profile -o report --stats=true ./ myapp
3 # Nsight Compute
4 sudo ncu -k mykernel -o report ./ myapp
18 / 22

31. Optimization
Size of a block (multiple of 32)
Memory hierarchy
Contiguous memory access
Parallel reduction
Warp (warp-level primitives)
Shared memory (bank conﬂict)
19 / 22

32. Optimization
Host Memory
GPU (Grid)
GPU Memory (Global)
16 GB/s
Block 0
register
shared memory
register
Block 1
register
shared memory
register
900 GB/s
12TB/s
12TB/s
19 / 22

33. Best practices
Have a CPU code at ﬁrst
Have a working GPU code before any optimization
Use proﬁling tools to ﬁnd the bottleneck
The execution order of threads is random
20 / 22

34. An ‘easy’ way: Standard C++ Parallelism (C++20)
1 void sweep(double *u, double *u1 , double e, double dt){
2 auto v = std:: ranges :: views :: iota(0, NI*NJ);
3 std:: for_each(
4 std:: execution :: par_unseq ,
5 std:: begin(v), std::end(v),
6 [=] (auto idx){
7 int i = idx / NJ;
8 int j = idx % NJ;
9 // periodic boundary condition
10 int im = (i< 1) ? NI -1 : i-1; // i-1
11 int ip = (i>NI -2) ? 0 : i+1; // i+1
12 int jm = (j< 1) ? NJ -1 : j-1; // j-1
13 int jp = (j>NJ -2) ? 0 : j+1; // j+1
14 double g = u[i,j]*(1.0-u[i,j])*(1.0 -2.0*u[i,j]);
15 double du= e*(u[im ,j]+u[ip ,j]+u[i,jm]+u[i,jp] -4.0*u[i,j]) - g;
16 u1[i,j] = u[i,j]+du*dt;
17 });
18 }
21 / 22

35. Migrating your code to GPUs
If your code uses a library, check its GPU support
If you have a Mac or an AMD GPU, check OpenCL
High-order time stepping