Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Optimizing for GPUs
Search
Arnaud Bergeron
April 25, 2017
Programming
0
660
Optimizing for GPUs
A bag of tricks to improve performance on the GPU and avoid the most common pitfalls.
Arnaud Bergeron
April 25, 2017
Tweet
Share
Other Decks in Programming
See All in Programming
TransformerからMCPまで(現代AIを理解するための羅針盤)
mickey_kubo
7
5.5k
20251016_Rails News ~Rails 8.1の足音を聴く~
morimorihoge
3
860
Software Architecture
hschwentner
6
2.3k
社会人になっても趣味開発を続けたい! / traPavilion
mazrean
1
110
Domain-centric? Why Hexagonal, Onion, and Clean Architecture Are Answers to the Wrong Question
olivergierke
3
980
Webサーバーサイド言語としてのRustについて
kouyuume
1
5k
CSC509 Lecture 08
javiergs
PRO
0
260
AI 駆動開発におけるコミュニティと AWS CDK の価値
konokenj
5
280
GC25 Recap: The Code You Reviewed is Not the Code You Built / #newt_gophercon_tour
mazrean
0
120
One Enishi After Another
snoozer05
PRO
0
160
Blazing Fast UI Development with Compose Hot Reload (Bangladesh KUG, October 2025)
zsmb
1
280
Towards Transactional Buffering of CDC Events @ Flink Forward 2025 Barcelona Spain
hpgrahsl
0
120
Featured
See All Featured
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4k
GraphQLとの向き合い方2022年版
quramy
49
14k
Speed Design
sergeychernyshev
32
1.2k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
132
19k
Done Done
chrislema
185
16k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Principles of Awesome APIs and How to Build Them.
keavy
127
17k
Connecting the Dots Between Site Speed, User Experience & Your Business [WebExpo 2025]
tammyeverts
10
620
Gamification - CAS2011
davidbonilla
81
5.5k
Balancing Empowerment & Direction
lara
5
700
Six Lessons from altMBA
skipperchong
29
4k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
Transcript
Optimizing for GPUs Arnaud Bergeron
Kernels __kernel void add(__global float *a, __global float *b, __global
float *c, size_t n) { size_t i = get_global_id(0); if (i < n) c[i] = a[i] + b[i]; } __global__ void add(float *a, float *b, float *c, size_t n) { size_t i = (blockIdx.x * blockDim.x) + threadIdx.x; if (i < n) c[i] = a[i] + b[i]; } OpenCL CUDA
Unified Kernel KERNEL void add(GLOBAL_MEM ga_float *a, GLOBAL_MEM ga_float *b,
GLOBAL_MEM ga_float *c, ga_size n) { ga_size i = GID_0 * LDIM_0 + LID_0; if (i < n) c[i] = a[i] + b[i]; }
Grid, Blocks, Threads Grid Block
Scheduling Time (s) 0E+00 2E-03 4E-03 6E-03 8E-03 Local Size
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 Same work (TITAN X) Same total (TITAN X) Same work (GTX 750) Same total (GTX 750)
Scheduling (2) Time (s) 1E-04 1,3E-04 1,6E-04 1,9E-04 2,2E-04 2,5E-04
Local Size 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608 640 672 704 736 768 800 832 864 896 928 960 992 1024 GTX 750 TITAN X
Scheduling (3) Time (s) 0,005 0,009 0,012 0,016 0,019 Global
size divisor 1 2 4 8 16 32 64 128 ls 32 ls 64 ls 736 ls 1024 ls 32 ls 64 ls 704 ls 1024 GTX 750 TITAN X
A CPU Core T0 T1 ALU Cache
A GPU Core T0 T2 ALU T1 T5 T6 T3
T4 T9 T8 T7 Cache
Blocking Operations CPU Sync Sync Sync GPU Add kernel Add
kernel CPU Sync Sync GPU Add kernel Add kernel
Blocking operations Time 1E-05 s 1E-04 s 1E-03 s 1E-02
s 1E-01 s 1E+00 s Number of loops 1 10 100 500 1000 5000 10000 Non-Blocking Blocking
Warp Divergence if (x < 0.0) z = x -
2.0; else z = sqrt(x); Divergent code Straight-line code @p = (x < 0.0); p: z = x - 2.0; !p: z = sqrt(x);
Divergent Kernel KERNEL void add(GLOBAL_MEM ga_float *a, GLOBAL_MEM ga_float *b,
GLOBAL_MEM ga_float *c, ga_size n) { ga_size i = GID_0 * LDIM_0 + LID_0; if (i < n) { if (i % 2) c[i] = a[i] + b[i]; else c[i] = asinhf(a[i]) + erfinvf(b[i]); } }
Warp Divergence (2) Time (s) 0,000 0,005 0,010 0,015 0,020
0,025 0,030 0,035 0,040 0,045 0,050 Fast Kernel Slow Kernel Divergent Kernel Baseline Compute Time
Last Kernel (simple) KERNEL void add(GLOBAL_MEM ga_float *a, ga_ssize lda,
GLOBAL_MEM ga_float *b, ga_ssize ldb, GLOBAL_MEM ga_float *c, ga_ssize ldc, ga_size M, ga_size N) { for (ga_size row = GID_1 * LDIM_1 + LID_1; row < M; row += GDIM_1 * LDIM_1) { for (ga_size col = GID_0 * LDIM_0 + LID_0; col < N; col += GDIM_0 * LDIM_0) { c[row * ldc + col] = rdA(row, col) * rdB(row, col); } } }
Last Kernel (local) KERNEL void add(GLOBAL_MEM ga_float *a, ga_ssize lda,
GLOBAL_MEM ga_float *b, ga_ssize ldb, GLOBAL_MEM ga_float *c, ga_ssize ldc, ga_size M, ga_size N) { LOCAL_MEM ga_float bufA[32][32]; LOCAL_MEM ga_float bufB[32][32]; for (ga_size row = GID_1; row < 32; row += GDIM_1) { for (ga_size col = GID_0; row < 32; row += GDIM_0) { // kernel code } } }
Inner Code (local) for (int i = 0; i <
32; i++) bufA[i][LID_0] = rdA(row*32 + i, col*32 + LID_0); for (int i = 0; i < 32; i++) bufB[i][LID_0] = rdB(row*32 + i, col*32 + LID_0); local_barrier(); for (int i = 0; i < 32; i++) { for (int j = 0; j < 32; j++) { c[(row*32 + i)*ldc + (col*32 + j)] = bufA[i][j] * bufB[i][j]; } }
Final example Time (s) 0 0,001 0,002 0,003 0,004 0,005
0,006 C order F order F order (with scheduling) C order (shared memory) F order (shared memory)