Tutorial on Migrating Phase Field Codes to GPUs
Daniel Wheeler
July 21, 2022
Transcript
Tutorial on Migrating Phase Field Codes to GPUs
Jin Zhang
Voorhees Group, Northwestern University
May 11, 2022
Outline
- The GPU hardware
- Brief introduction to CUDA
- Example: the Allen-Cahn equation
- The standard way: CUDA
- The ‘easy’ way: Standard C++
The hardware: GPU

[Diagram] The host side holds the CPU and its memory; the device side holds the GPU die and GPU memory. The GPU die contains many SMs (Streaming Multiprocessors), each consisting of many CUDA cores.
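To relate this picture to a concrete card, the CUDA runtime can report the SM count and memory of the installed GPU. A minimal sketch (added here, not part of the slides):

#include <cstdio>
#include <cuda_runtime.h>

int main(void){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                      // properties of device 0
    printf("GPU: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Global memory: %.1f GB\n", prop.totalGlobalMem / 1e9);
    return 0;
}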
Introduction to CUDA

[Diagram] Threads are grouped into thread blocks, and thread blocks form a grid. Each thread block is scheduled onto one of the device's SMs.
Introduction to CUDA: development tools
- You need an NVIDIA GPU
- Install the NVIDIA driver
- Install a compiler: CUDA Toolkit / NVIDIA HPC SDK (a version-check sketch follows below)
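A quick way to confirm the driver and toolkit are installed consistently is to query their versions through the runtime API. A small sketch (an addition, not from the slides):

#include <cstdio>
#include <cuda_runtime.h>

int main(void){
    int driver = 0, runtime = 0, count = 0;
    cudaDriverGetVersion(&driver);    // CUDA version supported by the installed driver
    cudaRuntimeGetVersion(&runtime);  // version of the CUDA runtime in use
    cudaGetDeviceCount(&count);       // number of CUDA-capable GPUs
    printf("driver %d, runtime %d, %d device(s)\n", driver, runtime, count);
    return 0;
}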
Phase field

Allen-Cahn equation:

\[ \frac{\partial u}{\partial t} = \epsilon^2 \nabla^2 u - u(1-u)(1-2u) \]

Discretization (2D):

\[ \left.\frac{\partial u}{\partial t}\right|_{ij} = \frac{u^{(n+1)}_{ij} - u^{(n)}_{ij}}{\Delta t} + O(\Delta t), \qquad
(\nabla^2 u)_{ij} = \frac{u^{(n)}_{i+1,j} + u^{(n)}_{i-1,j} + u^{(n)}_{i,j+1} + u^{(n)}_{i,j-1} - 4u^{(n)}_{ij}}{\Delta x^2} + O(\Delta x^2) \]

Explicit update:

\[ u^{\mathrm{new}}_{ij} = u_{ij} + \Delta t \, \Delta u_{ij}, \qquad
\Delta u_{ij} = \epsilon^2 \, \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{ij}}{\Delta x^2} - g_{ij}, \qquad
g_{ij} = u_{ij}(1 - u_{ij})(1 - 2u_{ij}) \]
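One caveat the slides leave implicit: this forward-Euler update is only conditionally stable. Ignoring the reaction term \(g\), the usual diffusive limit for the 2D five-point stencil (an added note, standard for explicit schemes) is

\[ \Delta t \le \frac{\Delta x^2}{4\,\epsilon^2}, \]

so halving \(\Delta x\) requires reducing \(\Delta t\) by a factor of four.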
Our task
- Implement a CPU code
- Modify it to run on the GPU
CPU implementation

[Diagram] The domain is an NI × NJ grid, indexed i = 0, 1, 2, …, NI-1 and j = 0, 1, 2, …, NJ-1.
CPU implementation

int main(int argc, char *argv[])
{
    double *u, *u1;                                // order parameter
    u  = (double *)malloc(NI*NJ * sizeof(double)); // old value
    u1 = (double *)malloc(NI*NJ * sizeof(double)); // new value
    // ...
    ic(u);                                         // initial condition
    for (long t=1L; t<=n; t++){                    // loop over timesteps
        sweep(u, u1, e, dt);                       // u1 = u + du
        double *tmp = u1; u1 = u; u = tmp;         // swap u <-> u1
    }
}
CPU implementation: the sweep function

void sweep(double *u, double *u1, double e, double dt){
    for (int i=0; i<NI; i++){
        for (int j=0; j<NJ; j++){
            // periodic boundary condition
            int im = (i < 1)    ? NI-1 : i-1;   // i-1
            int ip = (i > NI-2) ? 0    : i+1;   // i+1
            int jm = (j < 1)    ? NJ-1 : j-1;   // j-1
            int jp = (j > NJ-2) ? 0    : j+1;   // j+1
            double g  = u[i*NJ+j]*(1.0 - u[i*NJ+j])*(1.0 - 2.0*u[i*NJ+j]);
            double du = e*(u[im*NJ+j] + u[ip*NJ+j] + u[i*NJ+jm] + u[i*NJ+jp]
                           - 4.0*u[i*NJ+j]) - g;
            u1[i*NJ+j] = u[i*NJ+j] + du*dt;
        }
    }
}

The system size is NI × NJ and the arrays are stored flat, so cell (i,j) lives at u[i*NJ+j]; the slides' u[i,j] is shorthand and is not valid C indexing (the comma operator would reduce it to u[j]). The ternaries wrap the neighbor indices im, ip, jm, jp around periodically at 0 and N-1, so every cell has the four neighbors (im,j), (ip,j), (i,jm), (i,jp). The body computes \(g = u(1-u)(1-2u)\) and \(\Delta u_{ij} = \frac{\epsilon^2}{\Delta x^2}\left(u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{ij}\right) - g_{ij}\), with the constant \(\epsilon^2/\Delta x^2\) passed in as e.
GPU implementation

[Diagram] The NI × NJ grid is tiled by thread blocks. Each block has index (blockIdx.x, blockIdx.y) within the grid and size (blockDim.x, blockDim.y), and each thread has index (threadIdx.x, threadIdx.y) within its block. The global grid coordinate of a thread is therefore

(blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y)
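A small sketch of this mapping (added here, not from the slides). If NI or NJ is not a multiple of the block size, the grid must be rounded up and the kernel needs a bounds guard; the NI/NJ/BLOCK_X/BLOCK_Y values are assumptions for illustration:

#define NI 1024      // grid rows (assumed)
#define NJ 1024      // grid columns (assumed)
#define BLOCK_X 16   // block size (assumed, mirroring the slides' BLOCK_X/BLOCK_Y)
#define BLOCK_Y 16

__global__ void touch(double *u){
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global row
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // global column
    if (i >= NI || j >= NJ) return;                 // guard threads past the edge
    u[i*NJ + j] = 0.0;                              // this thread owns cell (i,j)
}

void launch_touch(double *u){
    dim3 dimBlock(BLOCK_X, BLOCK_Y);
    // ceiling division so partial blocks cover grids not divisible by the block size
    dim3 dimGrid((NI + BLOCK_X - 1)/BLOCK_X, (NJ + BLOCK_Y - 1)/BLOCK_Y);
    touch<<<dimGrid, dimBlock>>>(u);
}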
GPU implementation: Unified Memory

[Diagram] With Unified Memory, a single managed allocation is visible to both the host (CPU memory) and the device (GPU memory). There is no need to explicitly copy between host and device (since CUDA 6).
CUDA implementation

int main(int argc, char *argv[])
{
    double *u, *u1;                                  // order parameter
    cudaMallocManaged(&u,  NI*NJ * sizeof(double));  // old value
    cudaMallocManaged(&u1, NI*NJ * sizeof(double));  // new value
    dim3 dimBlock(BLOCK_X, BLOCK_Y);
    dim3 dimGrid(NI/BLOCK_X, NJ/BLOCK_Y);
    // ...
    ic(u);                                           // initial condition
    for (long t=1L; t<=n; t++){
        sweep<<<dimGrid, dimBlock>>>(u, u1, e, dt);
        double *tmp = u1; u1 = u; u = tmp;           // swap u <-> u1
    }
}
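Two details worth adding to this sketch (an addition, not on the slide): kernel launches are asynchronous with respect to the host, so the host must synchronize before reading the managed arrays (for example, to write output), and managed allocations are released with cudaFree:

// after the timestep loop, before the host touches u:
cudaDeviceSynchronize();   // wait for all queued kernels to finish
// ... write output from u ...
cudaFree(u);
cudaFree(u1);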
CUDA implementation: the sweep kernel

__global__
void sweep(double *u, double *u1, double e, double dt){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // periodic boundary condition
    int im = (i < 1)    ? NI-1 : i-1;   // i-1
    int ip = (i > NI-2) ? 0    : i+1;   // i+1
    int jm = (j < 1)    ? NJ-1 : j-1;   // j-1
    int jp = (j > NJ-2) ? 0    : j+1;   // j+1

    double g  = u[i*NJ+j]*(1.0 - u[i*NJ+j])*(1.0 - 2.0*u[i*NJ+j]);
    double du = e*(u[im*NJ+j] + u[ip*NJ+j] + u[i*NJ+jm] + u[i*NJ+jp]
                   - 4.0*u[i*NJ+j]) - g;
    u1[i*NJ+j] = u[i*NJ+j] + du*dt;
}
Compiling and error checking

# compile
nvcc -o myapp myapp.cu
# memcheck (the default tool)
compute-sanitizer ./myapp
# racecheck
compute-sanitizer --tool racecheck ./myapp
# initcheck
compute-sanitizer --tool initcheck ./myapp
# synccheck
compute-sanitizer --tool synccheck ./myapp
# Nsight Systems
nsys profile -o report --stats=true ./myapp
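Besides the sanitizers, runtime API calls and kernel launches can be checked inside the code itself. A common pattern (a sketch added here, not from the slides):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// abort with file and line if a CUDA runtime call fails
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err));                     \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// usage:
//   CUDA_CHECK(cudaMallocManaged(&u, NI*NJ * sizeof(double)));
//   sweep<<<dimGrid, dimBlock>>>(u, u1, e, dt);
//   CUDA_CHECK(cudaGetLastError());   // catch kernel-launch errors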
Compare CPU and GPU code

# compile the CPU code
gcc -o myapp myapp.c -lm
# compile the GPU code
nvcc -o myapp myapp.cu
# compare the results
paraview --state=compare.pvsm
diff cpu.vtk gpu.vtk
# timing
time ./myapp
Debugging

# compile with debugging info
nvcc -g -G -o myapp myapp.cu
# debugging
cuda-gdb ./myapp
# Visual Studio Code
code
Profiling: Nsight Systems and Nsight Compute

# Nsight Systems
nsys profile -o report --stats=true ./myapp
# Nsight Compute
sudo ncu -k mykernel -o report ./myapp
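Profiler timelines are easier to read when the code marks its own phases with NVTX ranges, which Nsight Systems displays alongside the kernels. A sketch (an addition, not from the slides; include nvToolsExt.h and, on older toolkits, link with -lnvToolsExt):

#include <nvToolsExt.h>

// inside the timestep loop of the CUDA main above:
nvtxRangePushA("sweep");                        // open a named range on the timeline
sweep<<<dimGrid, dimBlock>>>(u, u1, e, dt);
nvtxRangePop();                                 // close the range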
Optimization
- Size of a thread block (multiple of 32)
- Memory hierarchy
- Contiguous memory access
- Parallel reduction
- Warps (warp-level primitives; see the sketch after the diagram below)
- Shared memory (bank conflicts)
[Diagram] The memory hierarchy and approximate bandwidths: host memory connects to GPU global memory at ~16 GB/s (PCIe); global memory runs at ~900 GB/s; the shared memory and registers inside each thread block reach ~12 TB/s.
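As a taste of the warp-level primitives item above, a warp-wide sum can be computed entirely in registers with __shfl_down_sync, with no shared memory or explicit synchronization. A sketch (added for illustration, not from the slides):

// sum `val` across the 32 threads (lanes) of a warp;
// after the loop, lane 0 holds the warp's total
__device__ double warp_sum(double val){
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);  // add value from lane + offset
    return val;
}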
Best practices
- Write a CPU code first
- Have a working GPU code before any optimization
- Use profiling tools to find the bottleneck (a quick event-timing sketch follows below)
- Remember that the execution order of threads is arbitrary
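Before reaching for the full profilers, CUDA events give a quick first-order timing of GPU work. A minimal sketch (an addition; it assumes the timestep loop from the CUDA main above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... timestep loop: sweep<<<dimGrid, dimBlock>>>(u, u1, e, dt); ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);                 // wait until `stop` has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
printf("GPU time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);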
An ‘easy’ way: Standard C++ Parallelism (C++20)

#include <algorithm>
#include <execution>
#include <ranges>

void sweep(double *u, double *u1, double e, double dt){
    auto v = std::ranges::views::iota(0, NI*NJ);
    std::for_each(
        std::execution::par_unseq,
        std::begin(v), std::end(v),
        [=](auto idx){
            int i = idx / NJ;
            int j = idx % NJ;
            // periodic boundary condition
            int im = (i < 1)    ? NI-1 : i-1;   // i-1
            int ip = (i > NI-2) ? 0    : i+1;   // i+1
            int jm = (j < 1)    ? NJ-1 : j-1;   // j-1
            int jp = (j > NJ-2) ? 0    : j+1;   // j+1
            double g  = u[i*NJ+j]*(1.0 - u[i*NJ+j])*(1.0 - 2.0*u[i*NJ+j]);
            double du = e*(u[im*NJ+j] + u[ip*NJ+j] + u[i*NJ+jm] + u[i*NJ+jp]
                           - 4.0*u[i*NJ+j]) - g;
            u1[i*NJ+j] = u[i*NJ+j] + du*dt;
        });
}
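A note on compiling this version for the GPU (an addition, based on the NVIDIA HPC SDK): nvc++ -stdpar=gpu -o myapp myapp.cpp offloads the par_unseq algorithm to the device. With -stdpar, heap allocations are placed in CUDA managed memory, which is why the raw pointers captured by value in the lambda remain valid on the GPU.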
Migrating your code to GPUs
- If your code uses a library, check its GPU support
- If you have a Mac or an AMD GPU, check OpenCL
- High-order time stepping
- Adaptive time stepping
- Implicit time stepping: cuSPARSE, cuSOLVER
- Other spatial discretizations: FFT, FV, FEM