Scheduling (3)
[Figure: time (s, 0.005 to 0.019) against global size divisor (1, 2, 4, 8, 16, 32, 64, 128); series for local sizes ls 32, ls 64, ls 736, ls 1024 and ls 32, ls 64, ls 704, ls 1024; GPUs: GTX 750 and TITAN X]
Slide 8
A CPU Core
[Diagram: threads T0 and T1, an ALU, and a cache]
Slide 9
A GPU Core
[Diagram: threads T0 to T9, an ALU, and a cache]
Slide 10
Blocking Operations
[Diagram: two CPU/GPU timelines, each with two "Add kernel" launches on the GPU; the CPU row shows three syncs in the first timeline and two in the second]
Slide 11
Blocking operations
[Figure: time (log scale, 1E-05 s to 1E+00 s) against number of loops (1, 10, 100, 500, 1000, 5000, 10000); series: Non-Blocking, Blocking]
Slide 12
Warp Divergence
Divergent code:
    if (x < 0.0)
        z = x - 2.0;
    else
        z = sqrt(x);
Straight-line code:
    @p = (x < 0.0);
    p: z = x - 2.0;
    !p: z = sqrt(x);
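To make the predicated form concrete, here is a small host-side Python emulation (not GPU code) of one warp handling this branch. Every lane evaluates the predicate, the warp then steps through each instruction stream that at least one lane needs, and each lane keeps only the result its predicate selects. The name `warp_execute` and the 4-lane warp are illustrative assumptions, and the model is deliberately simplified.

```python
import math

def warp_execute(xs):
    """Emulate one warp on a list of per-lane inputs.

    Returns the per-lane results and the number of instruction
    streams the warp had to step through.
    """
    pred = [x < 0.0 for x in xs]       # @p = (x < 0.0), one bit per lane
    zs = [None] * len(xs)
    passes = 0

    if any(pred):                      # p: z = x - 2.0
        passes += 1
        for lane, x in enumerate(xs):
            if pred[lane]:
                zs[lane] = x - 2.0

    if not all(pred):                  # !p: z = sqrt(x)
        passes += 1
        for lane, x in enumerate(xs):
            if not pred[lane]:
                zs[lane] = math.sqrt(x)

    return zs, passes

# All lanes agree: only one stream is needed.
uniform, uniform_passes = warp_execute([1.0, 4.0, 9.0, 16.0])
# Lanes disagree: the warp pays for both streams.
mixed, mixed_passes = warp_execute([-1.0, 4.0, -3.0, 16.0])
print(uniform, uniform_passes)
print(mixed, mixed_passes)
```

In this model the mixed warp takes two passes where the uniform warp takes one, which is the cost that warp divergence refers to.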
Slide 13
Divergent Kernel
KERNEL void add(GLOBAL_MEM ga_float *a,
                GLOBAL_MEM ga_float *b,
                GLOBAL_MEM ga_float *c, ga_size n) {
  /* Global index of this thread. */
  ga_size i = GID_0 * LDIM_0 + LID_0;
  if (i < n) {
    /* Odd and even threads take different paths, so every warp diverges. */
    if (i % 2)
      c[i] = a[i] + b[i];
    else
      c[i] = asinhf(a[i]) + erfinvf(b[i]);
  }
}
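The condition `i % 2` sends adjacent threads down different paths, so every warp contains both branches. A quick host-side Python check can count how many warps see both outcomes. It assumes a warp is 32 consecutive thread indices (the usual width on NVIDIA hardware), and `divergent_warps` is a hypothetical helper. The second call uses a warp-aligned condition purely as a contrast; it does not come from the slides.

```python
WARP_SIZE = 32  # assumption: NVIDIA warp width

def divergent_warps(n, cond, warp_size=WARP_SIZE):
    """Count warps whose active threads (i < n) disagree on cond(i)."""
    count = 0
    for start in range(0, n, warp_size):
        stop = min(start + warp_size, n)
        outcomes = {bool(cond(i)) for i in range(start, stop)}
        if len(outcomes) == 2:
            count += 1
    return count

n = 1024
# Condition used by the kernel: alternates on every thread.
per_thread = divergent_warps(n, lambda i: i % 2)
# Contrast: a condition that only changes once per warp.
per_warp = divergent_warps(n, lambda i: (i // WARP_SIZE) % 2)
print(per_thread, per_warp)
```

With 1024 threads this reports that all 32 warps diverge for `i % 2`, and none diverge for the warp-aligned condition.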
Slide 14
Warp Divergence (2)
[Figure: bar chart of time (s, 0.000 to 0.050) for Fast Kernel, Slow Kernel, and Divergent Kernel; series: Baseline, Compute Time]
Slide 15
Last Kernel (simple)
KERNEL void add(GLOBAL_MEM ga_float *a, ga_ssize lda,
                GLOBAL_MEM ga_float *b, ga_ssize ldb,
                GLOBAL_MEM ga_float *c, ga_ssize ldc,
                ga_size M, ga_size N) {
  /* Grid-stride loops: each thread handles rows and columns spaced by
     the total number of threads in that dimension. */
  for (ga_size row = GID_1 * LDIM_1 + LID_1; row < M;
       row += GDIM_1 * LDIM_1) {
    for (ga_size col = GID_0 * LDIM_0 + LID_0; col < N;
         col += GDIM_0 * LDIM_0) {
      /* rdA and rdB are not defined on this slide. */
      c[row * ldc + col] = rdA(row, col) * rdB(row, col);
    }
  }
}
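The two nested loops are grid-stride loops: each thread starts at its global index and advances by the total number of threads in that dimension, so any launch size covers the whole M x N array. The Python sketch below emulates the loops on the host to check that every element is visited exactly once. `emulate_grid_stride` is a hypothetical helper, and its `gdim` and `ldim` tuples stand in for GDIM_0/GDIM_1 and LDIM_0/LDIM_1.

```python
from collections import Counter

def emulate_grid_stride(M, N, gdim, ldim):
    """Return how many times each (row, col) is visited.

    gdim and ldim are (dim 0, dim 1) tuples: number of blocks and
    threads per block in each dimension.
    """
    visits = Counter()
    for gid_1 in range(gdim[1]):
        for lid_1 in range(ldim[1]):
            for gid_0 in range(gdim[0]):
                for lid_0 in range(ldim[0]):
                    # Body of one emulated thread, mirroring the kernel loops.
                    row = gid_1 * ldim[1] + lid_1
                    while row < M:
                        col = gid_0 * ldim[0] + lid_0
                        while col < N:
                            visits[(row, col)] += 1
                            col += gdim[0] * ldim[0]
                        row += gdim[1] * ldim[1]
    return visits

# A 5 x 7 array covered by a 2 x 2 grid of 2 x 2 blocks (16 threads).
visits = emulate_grid_stride(M=5, N=7, gdim=(2, 2), ldim=(2, 2))
print(len(visits), set(visits.values()))
```

The same check passes when there are more threads than elements, because threads whose starting index is out of range never enter the loop body.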