
Tutorial on Migrating Phase Field Codes to GPUs

Daniel Wheeler

July 21, 2022

Transcript

  1. Tutorial on Migrating Phase Field Codes to GPUs
    Jin Zhang
    Voorhees Group, Northwestern University
    May 11, 2022


  2. Outline
    The GPU hardware
    Brief introduction to CUDA
    Example: Allen-Cahn equation
    The standard way: CUDA
    The ‘easy’ way: Standard C++


  8. The hardware: GPU
    [Figure: the host (CPU and host memory) is connected to the device (GPU die and GPU memory); the GPU die contains many streaming multiprocessors (SMs), each made up of CUDA cores.]
    SM: Streaming Multiprocessor

  9. Introduction to CUDA
    [Figure: CUDA execution hierarchy: threads are grouped into thread blocks, and thread blocks make up the grid.]

  10. Introduction to CUDA
    [Figure: the thread blocks of the grid are distributed across the streaming multiprocessors (SMs) of the device.]

  11. Introduction to CUDA
    CUDA development tools
    Need an NVIDIA GPU
    Install NVIDIA Driver
    Install compiler: CUDA Toolkit/NVIDIA HPC SDK (a quick check program follows below)
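
    Not from the deck: a minimal sketch that checks the setup by querying the GPU through the CUDA runtime (cudaGetDeviceCount, cudaGetDeviceProperties). Compile it with nvcc and run it to confirm the driver and toolkit work together.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);      // fails if the driver or toolkit is missing
        if (err != cudaSuccess || count == 0) {
            std::printf("no usable CUDA device: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
        std::printf("device 0: %s, %d SMs\n", prop.name, prop.multiProcessorCount);
        return 0;
    }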

  12. Phase field
    Allen-Cahn equation

        ∂u/∂t = ε²∇²u − u(1 − u)(1 − 2u)

    Discretization (2D)

        (∂u/∂t)_{ij} = (u_{ij}^{(n+1)} − u_{ij}^{(n)}) / Δt + O(Δt)

        (∇²u)_{ij} = (u_{i+1,j}^{(n)} + u_{i−1,j}^{(n)} + u_{i,j+1}^{(n)} + u_{i,j−1}^{(n)} − 4u_{ij}^{(n)}) / Δx² + O(Δx²)

    View full-size slide

  13. Phase field
    Allen-Cahn equation

        ∂u/∂t = ε²∇²u − u(1 − u)(1 − 2u)

    Discretization (2D)

        u_{ij}^{new} = u_{ij} + Δt Δu_{ij}

    where

        Δu_{ij} = ε² (u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1} − 4u_{ij}) / Δx² − g_{ij}

        g_{ij} = u_{ij}(1 − u_{ij})(1 − 2u_{ij})


  14. Our task
    Implement a CPU code
    Modify it to run on GPU

  15. CPU implementation
    [Figure: the computational domain is an NI × NJ array; i runs over 0 ... NI−1 and j over 0 ... NJ−1.]

  16. CPU implementation
    int main(int argc, char *argv[])
    {
        double *u, *u1;                                  // order parameter
        u  = (double *)malloc(NI*NJ * sizeof(double));   // old values
        u1 = (double *)malloc(NI*NJ * sizeof(double));   // new values
        // ...
        ic(u);                                           // initial condition
        for (long t = 1L; t <= n; t++) {                 // loop over timesteps
            sweep(u, u1, e, dt);                         // u1 = u + du
            double *tmp = u1; u1 = u; u = tmp;           // swap u <-> u1
        }
    }

  17. CPU implementation
    void sweep(double *u, double *u1, double e, double dt)
    {
        for (int i = 0; i < NI; i++) {
            for (int j = 0; j < NJ; j++) {
                // periodic boundary conditions
                int im = (i <  1  ) ? NI-1 : i-1;   // i-1
                int ip = (i > NI-2) ? 0    : i+1;   // i+1
                int jm = (j <  1  ) ? NJ-1 : j-1;   // j-1
                int jp = (j > NJ-2) ? 0    : j+1;   // j+1
                int c = i*NJ + j;                   // flat row-major index of (i,j)
                double g  = u[c]*(1.0 - u[c])*(1.0 - 2.0*u[c]);
                double du = e*(u[im*NJ+j] + u[ip*NJ+j] + u[i*NJ+jm] + u[i*NJ+jp] - 4.0*u[c]) - g;
                u1[c] = u[c] + du*dt;
            }
        }
    }
    System size NI × NJ; indices run 0 ... N−1 and wrap around periodically at the boundaries.

  18. CPU implementation
    (Same sweep code as the previous slide.)
    Neighbors in the five-point stencil around (i, j): (im, j), (ip, j), (i, jm), (i, jp); indices wrap around periodically from 0 to N−1.

  19. CPU implementation
    (Same sweep code as slide 17.)

        g = u(1 − u)(1 − 2u)

        Δu_{ij} = ε² (u_{i−1,j} + u_{i+1,j} + u_{i,j−1} + u_{i,j+1} − 4u_{ij}) / Δx² − g_{ij}

  20. GPU implementation
    [Figure: reminder of the CUDA thread hierarchy: thread, thread block, grid.]

  21. GPU implementation
    [Figure: the NI × NJ domain is tiled by thread blocks, Block (0,0) through Block (2,2).]
    Block index: (blockIdx.x, blockIdx.y)

  22. GPU implementation
    [Figure: within a single block, e.g. Block (2,2), threads are indexed in two dimensions, e.g. Thread (4,3).]
    Block size: (blockDim.x, blockDim.y), Thread index: (threadIdx.x, threadIdx.y)

  23. GPU implementation
    [Figure: Thread (4,3) of Block (2,2) is mapped to one point of the NI × NJ domain.]
    Global index: (blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y)

  24. GPU implementation
    [Figure: unified memory spans host memory and GPU memory, presenting a single address space to the CPU and the GPU die.]
    No need to explicitly copy between host and device (since CUDA 6)

  25. CUDA implementation
    int main(int argc, char *argv[])
    {
        double *u, *u1;                                     // order parameter
        cudaMallocManaged(&u,  NI*NJ * sizeof(double));     // old values (unified memory)
        cudaMallocManaged(&u1, NI*NJ * sizeof(double));     // new values (unified memory)
        dim3 dimBlock(BLOCK_X, BLOCK_Y);
        dim3 dimGrid(NI/BLOCK_X, NJ/BLOCK_Y);
        // ...
        ic(u);                                              // initial condition
        for (long t = 1L; t <= n; t++) {
            sweep<<<dimGrid, dimBlock>>>(u, u1, e, dt);     // launch one thread per grid point
            double *tmp = u1; u1 = u; u = tmp;              // swap u <-> u1
        }
    }
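
    Kernel launches are asynchronous, so before the host reads u (for example to write output) the device should be synchronized and checked for errors. A minimal sketch, not on the slide, using standard CUDA runtime calls (fprintf needs <cstdio>):

    cudaError_t err = cudaDeviceSynchronize();              // wait for all queued kernels to finish
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));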

  26. CUDA implementation
    __global__
    void sweep(double *u, double *u1, double e, double dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        // periodic boundary conditions
        int im = (i <  1  ) ? NI-1 : i-1;   // i-1
        int ip = (i > NI-2) ? 0    : i+1;   // i+1
        int jm = (j <  1  ) ? NJ-1 : j-1;   // j-1
        int jp = (j > NJ-2) ? 0    : j+1;   // j+1
        int c = i*NJ + j;                   // flat row-major index of (i,j)
        double g  = u[c]*(1.0 - u[c])*(1.0 - 2.0*u[c]);
        double du = e*(u[im*NJ+j] + u[ip*NJ+j] + u[i*NJ+jm] + u[i*NJ+jp] - 4.0*u[c]) - g;
        u1[c] = u[c] + du*dt;
    }
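
    The launch on the previous slide (NI/BLOCK_X by NJ/BLOCK_Y blocks) assumes NI and NJ are exact multiples of BLOCK_X and BLOCK_Y. If that cannot be guaranteed, a common pattern (not shown in the deck) is to round the block count up and guard the kernel body:

    // in main: round the number of blocks up
    dim3 dimGrid((NI + BLOCK_X - 1)/BLOCK_X, (NJ + BLOCK_Y - 1)/BLOCK_Y);

    // at the top of the kernel: skip threads that fall outside the domain
    if (i >= NI || j >= NJ) return;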

  27. Compiling and error checking
    # compile
    nvcc -o myapp myapp.cu
    # memcheck
    compute-sanitizer ./myapp
    # racecheck
    compute-sanitizer --tool racecheck ./myapp
    # initcheck
    compute-sanitizer --tool initcheck ./myapp
    # synccheck
    compute-sanitizer --tool synccheck ./myapp
    # nsight systems
    nsys profile -o report --stats=true ./myapp

  28. Compare CPU and GPU code
    # compile CPU code
    gcc -o myapp myapp.c -lm
    # compile GPU code
    nvcc -o myapp myapp.cu
    # compare the results
    paraview --state=compare.pvsm
    diff cpu.vtk gpu.vtk
    # timing
    time ./myapp

  29. Debugging
    # compile with debugging info
    nvcc -g -G -o myapp myapp.cu
    # debugging
    cuda-gdb ./myapp
    # Visual Studio Code
    code

  30. Profiling
    [Screenshots: Nsight Systems and Nsight Compute]
    # Nsight Systems
    nsys profile -o report --stats=true ./myapp
    # Nsight Compute
    sudo ncu -k mykernel -o report ./myapp

  31. Optimization
    Size of a block (multiple of 32)
    Memory hierarchy
    Contiguous memory access
    Parallel reduction (see the sketch after this list)
    Warp (warp-level primitives)
    Shared memory (bank conflict)
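
    A minimal sketch of the last three items (not from the deck; the kernel name sum_kernel and its use for summing u over the domain are illustrative): a block-level sum reduction that stages per-warp results in shared memory and uses the warp-level primitive __shfl_down_sync. It assumes a one-dimensional launch with blockDim.x a multiple of 32.

    __global__ void sum_kernel(const double *u, double *partial, int n)
    {
        __shared__ double warp_sums[32];                      // one slot per warp (max 1024 threads/block)
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;                          // index within the warp
        int warp = threadIdx.x / 32;                          // warp index within the block

        double val = (tid < n) ? u[tid] : 0.0;

        // reduce within each warp using register-to-register shuffles
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);

        if (lane == 0) warp_sums[warp] = val;                 // lane 0 holds the warp's sum
        __syncthreads();

        // the first warp reduces the per-warp sums
        if (warp == 0) {
            int nwarps = blockDim.x / 32;
            val = (lane < nwarps) ? warp_sums[lane] : 0.0;
            for (int offset = 16; offset > 0; offset >>= 1)
                val += __shfl_down_sync(0xffffffff, val, offset);
            if (lane == 0) partial[blockIdx.x] = val;         // one partial sum per block
        }
    }

    The host (or a second, tiny launch) then adds up the per-block entries of partial.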

  32. Optimization
    [Figure: memory hierarchy. Host memory connects to GPU global memory at about 16 GB/s (PCIe); global memory feeds the grid at about 900 GB/s; within each block, shared memory and per-thread registers reach about 12 TB/s.]
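
    Because shared memory and registers are far faster than global memory, each block can stage its tile of u, plus a one-cell halo, in shared memory before applying the stencil. A hedged sketch only (not from the deck; sweep_shared is an illustrative name), assuming blockDim matches BLOCK_X × BLOCK_Y and NI, NJ are multiples of the block size:

    __global__ void sweep_shared(const double *u, double *u1, double e, double dt)
    {
        __shared__ double tile[BLOCK_X + 2][BLOCK_Y + 2];     // block tile plus one-cell halo

        int i  = blockIdx.x * blockDim.x + threadIdx.x;
        int j  = blockIdx.y * blockDim.y + threadIdx.y;
        int ti = threadIdx.x + 1;                             // position inside the tile
        int tj = threadIdx.y + 1;

        // periodic neighbours in the global domain
        int im = (i <  1  ) ? NI-1 : i-1;
        int ip = (i > NI-2) ? 0    : i+1;
        int jm = (j <  1  ) ? NJ-1 : j-1;
        int jp = (j > NJ-2) ? 0    : j+1;

        tile[ti][tj] = u[i*NJ + j];                                            // own point
        if (threadIdx.x == 0)              tile[0][tj]         = u[im*NJ + j]; // left halo
        if (threadIdx.x == blockDim.x - 1) tile[BLOCK_X+1][tj] = u[ip*NJ + j]; // right halo
        if (threadIdx.y == 0)              tile[ti][0]         = u[i*NJ + jm]; // lower halo
        if (threadIdx.y == blockDim.y - 1) tile[ti][BLOCK_Y+1] = u[i*NJ + jp]; // upper halo
        __syncthreads();                                                       // tile is complete

        double c  = tile[ti][tj];
        double g  = c*(1.0 - c)*(1.0 - 2.0*c);
        double du = e*(tile[ti-1][tj] + tile[ti+1][tj] + tile[ti][tj-1] + tile[ti][tj+1] - 4.0*c) - g;
        u1[i*NJ + j] = c + du*dt;
    }

    Each value of u is then loaded from global memory once per block instead of up to five times by neighbouring threads.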


  33. Best practices
    Have a working CPU code first
    Have a working GPU code before any optimization
    Use profiling tools to find the bottleneck
    The execution order of threads is not deterministic

  34. An ‘easy’ way: Standard C++ Parallelism (C++20)
    void sweep(double *u, double *u1, double e, double dt)
    {
        auto v = std::ranges::views::iota(0, NI*NJ);
        std::for_each(
            std::execution::par_unseq,
            std::begin(v), std::end(v),
            [=](auto idx) {
                int i = idx / NJ;
                int j = idx % NJ;
                // periodic boundary conditions
                int im = (i <  1  ) ? NI-1 : i-1;   // i-1
                int ip = (i > NI-2) ? 0    : i+1;   // i+1
                int jm = (j <  1  ) ? NJ-1 : j-1;   // j-1
                int jp = (j > NJ-2) ? 0    : j+1;   // j+1
                double g  = u[idx]*(1.0 - u[idx])*(1.0 - 2.0*u[idx]);
                double du = e*(u[im*NJ+j] + u[ip*NJ+j] + u[i*NJ+jm] + u[i*NJ+jp] - 4.0*u[idx]) - g;
                u1[idx] = u[idx] + du*dt;
            });
    }
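
    A note not on the slide: with the NVIDIA HPC SDK (mentioned on the CUDA development tools slide), code written in this style can be offloaded to the GPU by compiling with nvc++ -stdpar=gpu; an ordinary host compiler will still run the same source in parallel on the CPU.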

  35. Migrating your code to GPUs
    If your code uses a library, check its GPU support
    If you have a Mac or an AMD GPU, check OpenCL
    High-order time stepping
    Adaptive time stepping
    Implicit time stepping: cuSPARSE, cuSOLVER
    Other spatial discretizations: FFT, FV, FEM