Slide 1

Tutorial on Migrating Phase Field Codes to GPUs
Jin Zhang, Voorhees Group, Northwestern University
May 11, 2022

Slide 2

Outline
- The GPU hardware
- Brief introduction to CUDA
- Example: Allen-Cahn equation
- The standard way: CUDA
- The ‘easy’ way: Standard C++

Slide 3

The hardware: GPU

Slide 8

The hardware: GPU

[Diagram: the host (CPU + memory) connects to the device (GPU die + GPU memory); the GPU die holds many SMs, each containing many CUDA cores. SM: Streaming Multiprocessor.]
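
A minimal sketch (not from the slides) of how to query these hardware numbers from the CUDA runtime; it assumes a single GPU, device 0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // properties of device 0
    printf("GPU: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Global memory: %.1f GB\n", prop.totalGlobalMem / 1e9);
    return 0;
}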

Slide 9

Introduction to CUDA

Thread, thread block, grid

Slide 10

Introduction to CUDA

[Diagram: the thread blocks of a grid are distributed across the SMs of the device.]

Slide 11

Introduction to CUDA

CUDA development tools:
- Need an NVIDIA GPU
- Install the NVIDIA driver
- Install a compiler: CUDA Toolkit / NVIDIA HPC SDK

Slide 12

Phase field

Allen-Cahn equation
$$\frac{\partial u}{\partial t} = \epsilon^2 \nabla^2 u - u(1-u)(1-2u)$$

Discretization (2D)
$$\left.\frac{\partial u}{\partial t}\right|_{ij} = \frac{u^{(n+1)}_{ij} - u^{(n)}_{ij}}{\Delta t} + O(\Delta t)$$
$$(\nabla^2 u)_{ij} = \frac{u^{(n)}_{i+1,j} + u^{(n)}_{i-1,j} + u^{(n)}_{i,j+1} + u^{(n)}_{i,j-1} - 4u^{(n)}_{ij}}{\Delta x^2} + O(\Delta x^2)$$

Slide 13

Phase field

Allen-Cahn equation
$$\frac{\partial u}{\partial t} = \epsilon^2 \nabla^2 u - u(1-u)(1-2u)$$

Discretization (2D)
$$u^{\mathrm{new}}_{ij} = u_{ij} + \Delta t\,\Delta u_{ij}$$
where
$$\Delta u_{ij} = \frac{\epsilon^2}{\Delta x^2}\left(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{ij}\right) - g_{ij},\qquad g_{ij} = u_{ij}(1-u_{ij})(1-2u_{ij})$$
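
Not stated on the slide, but worth keeping in mind when choosing the time step for this explicit scheme: ignoring the reaction term, the usual forward-Euler stability restriction for the 2D Laplacian is roughly

$$\Delta t \lesssim \frac{\Delta x^2}{4\,\epsilon^2}$$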

Slide 14

Our task
- Implement a CPU code
- Modify it to run on the GPU

Slide 15

CPU implementation

[Diagram: the NI × NJ grid; i runs 0, 1, 2, ..., NI-1 and j runs 0, 1, 2, ..., NJ-1.]

Slide 16

CPU implementation

int main(int argc, char *argv[])
{
    double *u, *u1;                                 // order parameter
    u  = (double *)malloc(NI*NJ * sizeof(double));  // old value
    u1 = (double *)malloc(NI*NJ * sizeof(double));  // new value
    // ...
    ic(u);                                          // initial condition
    for (long t = 1L; t <= n; t++) {                // loop over timesteps
        sweep(u, u1, e, dt);                        // u1 = u + du
        double *tmp = u1; u1 = u; u = tmp;          // swap u <-> u1
    }
}
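
The slides do not show ic(); purely as a hypothetical illustration (not the author's code), an initial condition could be small random noise around u = 0.5:

#include <stdlib.h>

// hypothetical initial condition: random perturbation around u = 0.5
void ic(double *u) {
    for (int k = 0; k < NI*NJ; k++)
        u[k] = 0.5 + 0.01 * ((double)rand() / RAND_MAX - 0.5);
}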

Slide 17

CPU implementation

void sweep(double *u, double *u1, double e, double dt){
    for (int i = 0; i < NI; i++){
        for (int j = 0; j < NJ; j++){
            // periodic boundary condition
            int im = (i < 1)    ? NI-1 : i-1;  // i-1
            int ip = (i > NI-2) ? 0    : i+1;  // i+1
            int jm = (j < 1)    ? NJ-1 : j-1;  // j-1
            int jp = (j > NJ-2) ? 0    : j+1;  // j+1
            // row-major indexing: point (i,j) is stored at u[i*NJ + j]
            double g  = u[i*NJ+j]*(1.0 - u[i*NJ+j])*(1.0 - 2.0*u[i*NJ+j]);
            double du = e*(u[im*NJ+j] + u[ip*NJ+j] + u[i*NJ+jm] + u[i*NJ+jp] - 4.0*u[i*NJ+j]) - g;
            u1[i*NJ+j] = u[i*NJ+j] + du*dt;
        }
    }
}

System size: NI × NJ; the index runs 0, 1, 2, ..., N-2, N-1.

Slide 18

CPU implementation (same sweep code as the previous slide)

[Diagram: the five-point stencil around (i,j): neighbors (im,j), (ip,j), (i,jm), (i,jp); the index runs 0, 1, 2, ..., N-2, N-1.]

Slide 19

CPU implementation (same sweep code as the previous slides)

The inner loop implements
$$g_{ij} = u_{ij}(1-u_{ij})(1-2u_{ij}),\qquad \Delta u_{ij} = \frac{\epsilon^2}{\Delta x^2}\left(u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{ij}\right) - g_{ij}$$
(in the code, e corresponds to ε²/Δx²).

Slide 20

GPU implementation

Thread, thread block, grid

Slide 21

GPU implementation

[Diagram: the NI × NJ grid is tiled into thread blocks, Block (0,0) through Block (2,2).]
Block index: (blockIdx.x, blockIdx.y)

Slide 22

GPU implementation

[Diagram: inside Block (2,2), threads are arranged in 2D, e.g. Thread (4,3).]
Block size: (blockDim.x, blockDim.y); thread index: (threadIdx.x, threadIdx.y)

Slide 23

GPU implementation

[Diagram: each thread maps to one grid point.]
Global index: (blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y)

Slide 24

GPU implementation

[Diagram: Unified Memory spans the host memory and the GPU memory.]
No need to explicitly copy between host and device (since CUDA 6).
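
For contrast, a rough sketch (not from the slides) of the explicit-copy style that Unified Memory replaces:

// without Unified Memory: separate host and device buffers plus explicit copies
double *u_h = (double *)malloc(NI*NJ * sizeof(double));                 // host copy
double *u_d;
cudaMalloc(&u_d, NI*NJ * sizeof(double));                               // device copy
cudaMemcpy(u_d, u_h, NI*NJ * sizeof(double), cudaMemcpyHostToDevice);
// ... launch kernels that read/write u_d ...
cudaMemcpy(u_h, u_d, NI*NJ * sizeof(double), cudaMemcpyDeviceToHost);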

Slide 25

CUDA implementation

int main(int argc, char *argv[])
{
    double *u, *u1;                                   // order parameter
    cudaMallocManaged(&u,  NI*NJ * sizeof(double));   // old value
    cudaMallocManaged(&u1, NI*NJ * sizeof(double));   // new value
    dim3 dimBlock(BLOCK_X, BLOCK_Y);
    dim3 dimGrid(NI/BLOCK_X, NJ/BLOCK_Y);
    // ...
    ic(u);                                            // initial condition
    for (long t = 1L; t <= n; t++) {
        sweep<<<dimGrid, dimBlock>>>(u, u1, e, dt);   // launch one thread per grid point
        double *tmp = u1; u1 = u; u = tmp;            // swap u <-> u1
    }
}
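
Kernel launches are asynchronous, so before the host reads the managed arrays (for example, to write output) it should wait for the device; a minimal sketch (not shown on the slide) of what can follow the time loop:

cudaDeviceSynchronize();                        // block until all kernels finish
cudaError_t err = cudaGetLastError();           // catch launch/runtime errors
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));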

Slide 26

CUDA implementation

__global__
void sweep(double *u, double *u1, double e, double dt){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // periodic boundary condition
    int im = (i < 1)    ? NI-1 : i-1;  // i-1
    int ip = (i > NI-2) ? 0    : i+1;  // i+1
    int jm = (j < 1)    ? NJ-1 : j-1;  // j-1
    int jp = (j > NJ-2) ? 0    : j+1;  // j+1

    // row-major indexing: point (i,j) is stored at u[i*NJ + j]
    double g  = u[i*NJ+j]*(1.0 - u[i*NJ+j])*(1.0 - 2.0*u[i*NJ+j]);
    double du = e*(u[im*NJ+j] + u[ip*NJ+j] + u[i*NJ+jm] + u[i*NJ+jp] - 4.0*u[i*NJ+j]) - g;
    u1[i*NJ+j] = u[i*NJ+j] + du*dt;
}
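
The launch configuration on the previous slide (dimGrid = NI/BLOCK_X × NJ/BLOCK_Y) assumes NI and NJ are exact multiples of the block size. If they are not, a common pattern (not shown in the slides) is to round the grid up and guard the kernel:

// host: round the grid size up
dim3 dimGrid((NI + BLOCK_X - 1)/BLOCK_X, (NJ + BLOCK_Y - 1)/BLOCK_Y);

// kernel: right after computing i and j, skip threads outside the domain
if (i >= NI || j >= NJ) return;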

Slide 27

Compiling and error checking

# compile
nvcc -o myapp myapp.cu
# memcheck
compute-sanitizer ./myapp
# racecheck
compute-sanitizer --tool racecheck ./myapp
# initcheck
compute-sanitizer --tool initcheck ./myapp
# synccheck
compute-sanitizer --tool synccheck ./myapp
# Nsight Systems
nsys profile -o report --stats=true ./myapp
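
The sanitizer catches problems at run time; in addition, a common pattern (an editor-added sketch, not from the slides) is to check the return code of every CUDA API call with a small macro:

#include <stdio.h>
#include <stdlib.h>

// wrap CUDA runtime calls, e.g. CUDA_CHECK(cudaMallocManaged(&u, bytes));
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)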

Slide 28

Compare CPU and GPU code

# compile CPU code
gcc -o myapp myapp.c -lm
# compile GPU code
nvcc -o myapp myapp.cu
# compare the result
paraview --state=compare.pvsm
diff cpu.vtk gpu.vtk
# timing
time ./myapp

Slide 29

Debugging

# compile with debugging info
nvcc -g -G -o myapp myapp.cu
# debugging
cuda-gdb ./myapp
# Visual Studio Code
code

Slide 30

Profiling

Nsight Systems and Nsight Compute

# Nsight Systems
nsys profile -o report --stats=true ./myapp
# Nsight Compute
sudo ncu -k mykernel -o report ./myapp

Slide 31

Optimization
- Size of a block (multiple of 32)
- Memory hierarchy
- Contiguous memory access
- Parallel reduction
- Warp (warp-level primitives, see the sketch after this list)
- Shared memory (bank conflicts)
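
As one concrete example of the warp-level primitives listed above, a standard warp sum reduction (an editor sketch, not from the slides) looks like:

// sum a value across the 32 threads of a warp using shuffle intrinsics
__inline__ __device__ double warpReduceSum(double val) {
    for (int offset = warpSize/2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}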

Slide 32

Optimization

[Diagram: memory hierarchy with approximate bandwidths: host memory to GPU ~16 GB/s; GPU global memory ~900 GB/s; per-block shared memory and per-thread registers ~12 TB/s.]

Slide 33

Best practices
- Have a CPU code first
- Have a working GPU code before any optimization
- Use profiling tools to find the bottleneck
- Remember that the execution order of threads is effectively random

Slide 34

An ‘easy’ way: Standard C++ Parallelism (C++20)

void sweep(double *u, double *u1, double e, double dt){
    auto v = std::ranges::views::iota(0, NI*NJ);
    std::for_each(
        std::execution::par_unseq,
        std::begin(v), std::end(v),
        [=] (auto idx){
            int i = idx / NJ;
            int j = idx % NJ;
            // periodic boundary condition
            int im = (i < 1)    ? NI-1 : i-1;  // i-1
            int ip = (i > NI-2) ? 0    : i+1;  // i+1
            int jm = (j < 1)    ? NJ-1 : j-1;  // j-1
            int jp = (j > NJ-2) ? 0    : j+1;  // j+1
            // row-major indexing: point (i,j) is stored at u[i*NJ + j]
            double g  = u[i*NJ+j]*(1.0 - u[i*NJ+j])*(1.0 - 2.0*u[i*NJ+j]);
            double du = e*(u[im*NJ+j] + u[ip*NJ+j] + u[i*NJ+jm] + u[i*NJ+jp] - 4.0*u[i*NJ+j]) - g;
            u1[i*NJ+j] = u[i*NJ+j] + du*dt;
        });
}
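
The slides do not show the build line for this version; with the NVIDIA HPC SDK mentioned earlier, standard-parallelism code is typically compiled with something like (assumed invocation):

nvc++ -std=c++20 -stdpar=gpu -o myapp myapp.cpp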

Slide 35

Migrating your code to GPUs
- If your code uses a library, check its GPU support
- If you have a Mac or an AMD GPU, check OpenCL
- High-order time stepping
- Adaptive time stepping
- Implicit time stepping: cuSPARSE, cuSOLVER
- Other spatial discretizations: FFT, FV, FEM