Optimizing for GPUs

Arnaud Bergeron
April 25, 2017

A bag of tricks to improve performance on the GPU and avoid the most common pitfalls.
Transcript
Optimizing for GPUs
Arnaud Bergeron
Kernels

OpenCL:

    /* Note: size_t is not a valid kernel argument type in OpenCL,
       so the element count is passed as a ulong. */
    __kernel void add(__global float *a, __global float *b,
                      __global float *c, ulong n) {
      size_t i = get_global_id(0);
      if (i < n)
        c[i] = a[i] + b[i];
    }

CUDA:

    __global__ void add(float *a, float *b, float *c, size_t n) {
      size_t i = (blockIdx.x * blockDim.x) + threadIdx.x;
      if (i < n)
        c[i] = a[i] + b[i];
    }
Unified Kernel

    KERNEL void add(GLOBAL_MEM ga_float *a, GLOBAL_MEM ga_float *b,
                    GLOBAL_MEM ga_float *c, ga_size n) {
      ga_size i = GID_0 * LDIM_0 + LID_0;
      if (i < n)
        c[i] = a[i] + b[i];
    }
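The unified kernel is written against portability macros from libgpuarray that expand to the matching construct on each backend. As a rough sketch only (the actual definitions live in libgpuarray; these are assumptions about their CUDA-side form):

    /* Approximate CUDA expansions; OpenCL equivalents in trailing comments. */
    #define KERNEL     extern "C" __global__  /* __kernel            */
    #define GLOBAL_MEM                        /* __global            */
    #define LOCAL_MEM  __shared__             /* __local             */
    #define GID_0      blockIdx.x             /* get_group_id(0)     */
    #define LID_0      threadIdx.x            /* get_local_id(0)     */
    #define LDIM_0     blockDim.x             /* get_local_size(0)   */
    #define GDIM_0     gridDim.x              /* get_num_groups(0)   */
    typedef size_t ga_size;                   /* unsigned index type */

So GID_0 * LDIM_0 + LID_0 is just blockIdx.x * blockDim.x + threadIdx.x, the same global index the CUDA kernel computes by hand.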
Grid, Blocks, Threads

(diagram: a grid is made of blocks; each block is made of threads)
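To make the hierarchy concrete, here is a minimal, hypothetical host-side launch for the CUDA add kernel above. The block size of 256 is an arbitrary choice, and the grid is rounded up so that grid * block covers all n elements, which is why the kernel guards with if (i < n):

    #include <cuda_runtime.h>

    /* Hypothetical launch for the CUDA add kernel shown earlier. */
    void launch_add(float *a, float *b, float *c, size_t n) {
      size_t block = 256;                     /* threads per block (local size) */
      size_t grid = (n + block - 1) / block;  /* blocks per grid, rounded up */
      add<<<grid, block>>>(a, b, c, n);
    }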
Scheduling

(chart: time in seconds vs. local size from 1 to 32, for "same work" and "same total" runs on a TITAN X and a GTX 750)
Scheduling (2)

(chart: time in seconds vs. local size from 32 to 1024 in steps of 32, on a GTX 750 and a TITAN X)
Scheduling (3)

(chart: time in seconds vs. global size divisor from 1 to 128, at local sizes 32, 64, 736 and 1024 on the GTX 750 and 32, 64, 704 and 1024 on the TITAN X)
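Measurements like these can be reproduced with an event-based timer. A minimal sketch, assuming the CUDA add kernel from before and device buffers that are already allocated and filled (the deck's actual benchmark harness is not shown):

    #include <cuda_runtime.h>

    /* Time one launch configuration with CUDA events; returns milliseconds. */
    float time_add(float *a, float *b, float *c, size_t n, size_t block) {
      size_t grid = (n + block - 1) / block;
      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);
      cudaEventRecord(start);
      add<<<grid, block>>>(a, b, c, n);
      cudaEventRecord(stop);
      cudaEventSynchronize(stop);             /* wait for the kernel to finish */
      float ms = 0.0f;
      cudaEventElapsedTime(&ms, start, stop);
      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      return ms;
    }

Sweeping block over multiples of 32 matters because threads are scheduled in warps of 32: a local size that is not a multiple of the warp size leaves lanes idle.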
A CPU Core

(diagram: two threads, T0 and T1, sharing one ALU and a cache)

A GPU Core

(diagram: many threads, T0 through T9, sharing one ALU and a cache)
Blocking Operations

(diagram: two timelines of CPU syncs interleaved with GPU add kernels; the blocking version syncs around every kernel, the non-blocking version syncs less often)
Blocking operations

(chart: time from 1E-05 s to 1E+00 s vs. number of loops from 1 to 10000, for non-blocking and blocking launches)
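In CUDA terms, the difference between the two curves is whether the host forces a round trip after every launch. A sketch of both patterns (loops, grid, block and the buffers are placeholders):

    /* Blocking pattern: synchronize after every launch. */
    for (int i = 0; i < loops; i++) {
      add<<<grid, block>>>(a, b, c, n);
      cudaDeviceSynchronize();                /* CPU stalls until the GPU is done */
    }

    /* Non-blocking pattern: queue all launches, synchronize once. */
    for (int i = 0; i < loops; i++)
      add<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();                  /* single round trip at the end */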
Warp Divergence

Divergent code:

    if (x < 0.0)
      z = x - 2.0;
    else
      z = sqrt(x);

Straight-line code:

    @p = (x < 0.0);
     p: z = x - 2.0;
    !p: z = sqrt(x);
Divergent Kernel

    KERNEL void add(GLOBAL_MEM ga_float *a, GLOBAL_MEM ga_float *b,
                    GLOBAL_MEM ga_float *c, ga_size n) {
      ga_size i = GID_0 * LDIM_0 + LID_0;
      if (i < n) {
        if (i % 2)
          c[i] = a[i] + b[i];
        else
          c[i] = asinhf(a[i]) + erfinvf(b[i]);
      }
    }
Warp Divergence (2)

(chart: compute time in seconds for the fast kernel, the slow kernel and the divergent kernel, against a baseline)
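Since i % 2 flips between adjacent threads, every warp of 32 threads takes both branches and serializes them. A standard remedy (not from the deck) is to make the condition uniform per warp. The hypothetical variant below keeps the same mix of work but uses a warp-uniform branch; note that it assigns the two formulas to different elements than the divergent kernel, so it illustrates the scheduling effect rather than computing the same result:

    KERNEL void add(GLOBAL_MEM ga_float *a, GLOBAL_MEM ga_float *b,
                    GLOBAL_MEM ga_float *c, ga_size n) {
      ga_size i = GID_0 * LDIM_0 + LID_0;
      if (i < n) {
        /* (i / 32) % 2 is constant across a 32-thread warp,
           so no warp executes both paths. */
        if ((i / 32) % 2)
          c[i] = a[i] + b[i];
        else
          c[i] = asinhf(a[i]) + erfinvf(b[i]);
      }
    }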
Last Kernel (simple)

    /* rdA/rdB: macros, presumably reading a and b with strides
       lda/ldb (defined elsewhere in the deck). */
    KERNEL void add(GLOBAL_MEM ga_float *a, ga_ssize lda,
                    GLOBAL_MEM ga_float *b, ga_ssize ldb,
                    GLOBAL_MEM ga_float *c, ga_ssize ldc,
                    ga_size M, ga_size N) {
      for (ga_size row = GID_1 * LDIM_1 + LID_1; row < M;
           row += GDIM_1 * LDIM_1) {
        for (ga_size col = GID_0 * LDIM_0 + LID_0; col < N;
             col += GDIM_0 * LDIM_0) {
          c[row * ldc + col] = rdA(row, col) * rdB(row, col);
        }
      }
    }
Last Kernel (local)

    KERNEL void add(GLOBAL_MEM ga_float *a, ga_ssize lda,
                    GLOBAL_MEM ga_float *b, ga_ssize ldb,
                    GLOBAL_MEM ga_float *c, ga_ssize ldc,
                    ga_size M, ga_size N) {
      LOCAL_MEM ga_float bufA[32][32];
      LOCAL_MEM ga_float bufB[32][32];
      for (ga_size row = GID_1; row < 32; row += GDIM_1) {
        for (ga_size col = GID_0; col < 32; col += GDIM_0) {
          /* kernel code (see next slide) */
        }
      }
    }
Inner Code (local)

    for (int i = 0; i < 32; i++)
      bufA[i][LID_0] = rdA(row*32 + i, col*32 + LID_0);
    for (int i = 0; i < 32; i++)
      bufB[i][LID_0] = rdB(row*32 + i, col*32 + LID_0);
    local_barrier();
    for (int i = 0; i < 32; i++) {
      for (int j = 0; j < 32; j++) {
        c[(row*32 + i)*ldc + (col*32 + j)] = bufA[i][j] * bufB[i][j];
      }
    }
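This staging pattern is easiest to see in isolation. The sketch below is not the deck's kernel but the classic CUDA tile transpose built on the same idea: a 32x32 tile is read from global memory in coalesced order, a barrier (the local_barrier() of the unified dialect) makes the writes visible to the whole block, and the tile is written back out in coalesced order even though the access is transposed. It assumes a 32x32 block and a square matrix whose side n is a multiple of 32:

    __global__ void transpose32(const float *in, float *out, int n) {
      __shared__ float tile[32][33];   /* 33 columns avoid bank conflicts */
      int x = blockIdx.x * 32 + threadIdx.x;
      int y = blockIdx.y * 32 + threadIdx.y;
      tile[threadIdx.y][threadIdx.x] = in[y * n + x];    /* coalesced read */
      __syncthreads();                 /* block-wide barrier */
      x = blockIdx.y * 32 + threadIdx.x;
      y = blockIdx.x * 32 + threadIdx.y;
      out[y * n + x] = tile[threadIdx.x][threadIdx.y];   /* coalesced write */
    }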
Final example

(chart: time in seconds for C order, F order, F order with scheduling, C order with shared memory, and F order with shared memory)