Video of the presentation: http://youtu.be/ARqE4yT2Z0o
Workshop presentation on using Accelerate effectively: https://speakerdeck.com/tmcdonell/gpgpu-programming-in-haskell-with-accelerate-workshop
Current graphics cards are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. Peak performance of these devices far exceeds that of traditional CPUs, but it is difficult to realise, because good performance requires highly idiomatic programs whose development is work intensive and requires expert knowledge. To raise the level of abstraction we are developing a domain-specific high-level language in Haskell for programming these devices. Computations are expressed in the form of parameterised collective operations —such as maps, reductions, and permutations— over multi-dimensional arrays. These computations are compiled online and executed on the graphics processor.
In this talk, I will introduce the Accelerate project; the language and its embedding in Haskell, as well as the online code generator and runtime system targeting CUDA GPUs.
GPGPU Programming in Haskell
with Accelerate
Trevor L. McDonell
University of New South Wales
@tlmcdonell
[email protected]
https://github.com/AccelerateHS
What is GPGPU Programming?
• General Purpose Programming on Graphics Processing Units (GPUs)
• Using your graphics card for something other than playing games
• GPUs have many more cores than a CPU
- GeForce GTX Titan
- 2688 cores @ 837 MHz
- 6 GB memory @ 288 GB/s
What is GPGPU Programming?
• Main differences:
- Single program multiple data (SPMD / SIMD), or just data-parallelism
- All the cores run the same program, but on different data
• We can’t program these in the same way as a CPU
- Different instruction sets: can’t run a Haskell program directly
- More restrictive hardware designs, limited control structures
• GPUs have their own memory
- Data has to be explicitly moved back and forth
Dot product four ways
• Dot-product: pair-wise multiply two arrays and sum the result
• C (sequential):
float dotp(float *xs, float *ys, int size)
{
    int   i;
    float sum = 0;

    for (i = 0; i < size; ++i) {
        sum = sum + xs[i] * ys[i];
    }

    return sum;
}
Dot product four ways
• Haskell (sequential):
- [Float] is a list of floating point numbers
- zipWith applies the function (*) element-wise to the two input lists
- foldl sums the result of zipWith

dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )
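As a quick sanity check, the sequential definition can be run directly on some hypothetical input values (a sketch, not part of the talk itself):

```haskell
-- Sequential dot product, exactly as on the slide.
dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)

main :: IO ()
main = print (dotp [1,2,3] [4,5,6])  -- 1*4 + 2*5 + 3*6 = 32.0
```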
Dot product four ways
• CUDA (parallel):
- Step 1: element-wise multiplication
- __global__ indicates this is a function that runs on the GPU
- spawn one thread for each element in the vector: i is unique for each thread

__global__ void zipWithMult(float *xs, float *ys, float *zs, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < size) {
        zs[i] = xs[i] * ys[i];
    }
}
Dot product four ways
• CUDA (parallel):
- Step 2: vector reduction … is somewhat complex

struct SharedMemory
{
    __device__ inline operator float *()
    {
        extern __shared__ int __smem[];
        return (float *)__smem;
    }

    __device__ inline operator const float *() const
    {
        extern __shared__ int __smem[];
        return (float *)__smem;
    }
};

template <unsigned int blockSize, bool nIsPow2>
__global__ void reduce_kernel(float *g_idata, float *g_odata, unsigned int n)
{
    float *sdata = SharedMemory();

    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;

    float sum = 0;

    while (i < n) {
        sum += g_idata[i];

        if (nIsPow2 || i + blockSize < n)
            sum += g_idata[i+blockSize];

        i += gridSize;
    }

    sdata[tid] = sum;
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = sum = sum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = sum = sum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = sum = sum + sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32) {
        volatile float *smem = sdata;

        if (blockSize >= 64) { smem[tid] = sum = sum + smem[tid + 32]; }
        if (blockSize >= 32) { smem[tid] = sum = sum + smem[tid + 16]; }
        if (blockSize >= 16) { smem[tid] = sum = sum + smem[tid +  8]; }
        if (blockSize >=  8) { smem[tid] = sum = sum + smem[tid +  4]; }
        if (blockSize >=  4) { smem[tid] = sum = sum + smem[tid +  2]; }
        if (blockSize >=  2) { smem[tid] = sum = sum + smem[tid +  1]; }
    }

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

void getNumBlocksAndThreads(int n, int maxBlocks, int maxThreads, int &blocks, int &threads)
{
    cudaDeviceProp prop;
    int device;
    checkCudaErrors(cudaGetDevice(&device));
    checkCudaErrors(cudaGetDeviceProperties(&prop, device));

    threads = (n < maxThreads*2) ? nextPow2((n + 1) / 2) : maxThreads;
    blocks  = (n + (threads * 2 - 1)) / (threads * 2);

    if (blocks > prop.maxGridSize[0]) {
        blocks  /= 2;
        threads *= 2;
    }

    blocks = min(maxBlocks, blocks);
}

float reduce(int n, float *d_idata, float *d_odata)
{
    int threads    = 0;
    int blocks     = 0;
    int maxThreads = 256;
    int maxBlocks  = 64;
    int size       = n;

    while (size > 1) {
        getNumBlocksAndThreads(size, maxBlocks, maxThreads, blocks, threads);

        int  smemSize = (threads <= 32) ? 2 * threads * sizeof(float) : threads * sizeof(float);
        dim3 dimBlock(threads, 1, 1);
        dim3 dimGrid(blocks, 1, 1);

        if (isPow2(size)) {
            switch (threads) {
                case 512: reduce_kernel<512, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 256: reduce_kernel<256, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 128: reduce_kernel<128, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  64: reduce_kernel< 64, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  32: reduce_kernel< 32, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  16: reduce_kernel< 16, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   8: reduce_kernel<  8, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   4: reduce_kernel<  4, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   2: reduce_kernel<  2, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   1: reduce_kernel<  1, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            }
        }
        else {
            switch (threads) {
                case 512: reduce_kernel<512, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 256: reduce_kernel<256, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 128: reduce_kernel<128, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  64: reduce_kernel< 64, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  32: reduce_kernel< 32, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  16: reduce_kernel< 16, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   8: reduce_kernel<  8, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   4: reduce_kernel<  4, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   2: reduce_kernel<  2, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   1: reduce_kernel<  1, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            }
        }

        size = (size + (threads*2 - 1)) / (threads*2);
    }

    float sum;
    checkCudaErrors(cudaMemcpy(&sum, d_odata, sizeof(float), cudaMemcpyDeviceToHost));

    return sum;
}
o_O
Dot product four ways
• Accelerate (parallel):

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )
Dot product four ways
• Accelerate (parallel):
• Recall the sequential Haskell version:

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )

dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )

- foldl: left-to-right traversal
- fold: neither left nor right, happens in parallel (tree-like)
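The tree-like reduction can be sketched in plain Haskell: for an associative operator such as (+), pairwise reduction gives the same answer as foldl. This is only an illustration; treeFold is a hypothetical helper, not an Accelerate function:

```haskell
-- Pairwise (tree-shaped) reduction: each pass halves the list, mirroring
-- how fold combines elements in parallel on the GPU.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z []  = z
treeFold _ _ [x] = x
treeFold f z xs  = treeFold f z (pairwise xs)
  where
    pairwise (a:b:rest) = f a b : pairwise rest
    pairwise rest       = rest

main :: IO ()
main = print (treeFold (+) 0 [1..10 :: Int])  -- 55, same as foldl (+) 0 [1..10]
```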
But… how does it perform?
Dot product four ways
[Chart: dot product run time (ms, log scale 0.1–100) against number of elements (millions, 2–20); series: sequential, Accelerate, CUBLAS. Accelerate is roughly 30x faster than the sequential version, and within 1.2x of CUBLAS.]
Tesla T10 (240 cores @ 1.3 GHz) vs. Xeon E5405 (2GHz)
Mandelbrot fractal
n-body gravitational simulation
Canny edge detection
SmoothLife cellular automata
stable fluid flow
...
d6b821d937a4170b3c4f8ad93495575d: saitek1
d0e52829bf7962ee0aa90550ffdcccaa: laura1230
494a8204b800c41b2da763f9bbbcc462: lina03
d8ff07c52a95b30800809758f84ce28c: Jenny10
e81bed02faa9892f8360c705241191ae: carmen89
46f7d75718029de99dd81fd907034bc9: mellon22
0dd3c176cf34486ec00b526b6920b782: helena04
9351c4bc8c8ba17b58d5a6a1f839f356: 85548554
9c36c5599f40d08f874559ac824d091a: 585123456
4b4dce6c91b429e8360aa65f97342e90: 5678go
3aa561d4c17d9d58443fc15d10cc86ae: momo55

Recovered 150/1000 (15.00 %) digests in 59.45 s, 185.03 MHash/sec
Password “recovery” (MD5 dictionary attack)
Accelerate
• Accelerate is a Domain-Specific Language for GPU programming
Haskell/Accelerate program
→ Transform Accelerate program into CUDA program
→ Compile with NVIDIA’s compiler & load onto the GPU
→ Copy result back to Haskell
Accelerate
• Accelerate is a Domain-Specific Language for GPU programming
- This process may happen several times during program execution
- Code and data fragments get cached and reused
• An Accelerate program is a Haskell program that generates a CUDA program
- However, in many respects this still looks like a Haskell program
- Shares various concepts with Repa, a Haskell array library for CPUs
Accelerate
• To execute an Accelerate computation (on the GPU):
- run comes from whichever backend we have chosen (CUDA)
- Arrays constrains the result to be an Array, or tuple thereof

run :: Arrays a => Acc a -> a

• What is Acc?
- This is our DSL type
- A data structure (Abstract Syntax Tree) representing a computation that, once executed, will yield a result of type ‘a’
- Accelerate is a library of collective operations over arrays of type Acc a
Accelerate
• Accelerate computations take place on arrays
- Parallelism is introduced in the form of collective operations over arrays
- Arrays in → Accelerate computation → Arrays out
• Arrays have two type parameters
- The shape of the array, or dimensionality
- The element type of the array: Int, Float, etc.

data Array sh e
Arrays
• Supported array element types are members of the Elt class:
- ()
- Int, Int32, Int64, Word, Word32, Word64...
- Float, Double
- Char
- Bool
- Tuples up to 9-tuples of these, including nested tuples
• Note that Array itself is not an allowable element type. There are no nested
arrays in Accelerate, regular arrays only!
data Array sh e
Accelerate by example
Mandelbrot fractal
Mandelbrot set generator
• Basics
- Pick a window onto the complex plane & divide into pixels
- A point c is in the set if the value of |z_n| does not diverge to infinity, where z_{n+1} = c + z_n²
- Each pixel has a value given by its coordinates in the complex plane
- Colour depends on number of iterations before divergence
• Each pixel is independent: lots of data parallelism
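The per-pixel recurrence can be sketched in plain Haskell. This is a hypothetical helper for illustration; the Accelerate version applies the same logic at every pixel in parallel. The escape radius of 2 (so |z|² > 4) is the standard choice, not something stated on the slide:

```haskell
-- Count iterations of z_{n+1} = c + z_n^2, starting from z_0 = c,
-- until |z|^2 exceeds 4 or we hit the iteration limit.
iterations :: Int -> (Float, Float) -> Int
iterations limit (cr, ci) = go 0 (cr, ci)
  where
    go n (zr, zi)
      | n >= limit        = n
      | zr*zr + zi*zi > 4 = n
      | otherwise         = go (n+1) (cr + zr*zr - zi*zi, ci + 2*zr*zi)

main :: IO ()
main = print (iterations 255 (0, 0), iterations 255 (2, 2))  -- (255,0)
```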
Mandelbrot set generator
• First, some types:
- A pair of floating point numbers for the real and imaginary parts
- DIM2 is a type synonym for a two dimensional Shape

type Complex      = (Float, Float)
type ComplexPlane = Array DIM2 Complex
Shapes
• Shapes determine the dimensions of the array and the type of the index
- Z represents a rank-zero array (singleton array with one element)
- (:.) increases the rank by adding a new dimension on the right

data Z            = Z
data tail :. head = tail :. head

• Examples:
- One-dimensional array (Vector), indexed by Int: (Z :. Int)
- Two-dimensional array, indexed by Int: (Z :. Int :. Int)
• This style is used at both the type and value level:

sh :: Z :. Int
sh = Z :. 10
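A minimal stand-alone mimic of these shape types (toy definitions for illustration, not the real Data.Array.Accelerate ones) shows how (:.) nests at both the type and the value level:

```haskell
{-# LANGUAGE TypeOperators #-}

-- Toy versions of Z and (:.) from the slide.
data Z = Z deriving (Show, Eq)

infixl 3 :.
data tail :. head = tail :. head deriving (Show, Eq)

-- A two-dimensional shape: the type spells out the rank,
-- the value spells out the extent in each dimension.
sh2 :: Z :. Int :. Int
sh2 = Z :. 600 :. 800

main :: IO ()
main = print sh2
```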
Mandelbrot set generator
• The initial complex plane: z_0 = c
• generate is a collective operation that yields a value for each index in the array
- Supported shape and element types: we will use DIM2 and Complex
- Size of the result array: number of pixels in the image
- A function to apply at every index: generate the value of c at each pixel

generate :: (Shape sh, Elt a)
         => Exp sh
         -> (Exp sh -> Exp a)
         -> Acc (Array sh a)
If Acc is our DSL type,
what is Exp?
A Stratified Language
• Accelerate is split into two worlds: Acc and Exp
- Acc represents collective operations over instances of Arrays
- Exp is a scalar computation on things of type Elt
• Collective operations in Acc comprise many scalar operations in Exp,
executed in parallel over Arrays
- Scalar operations can not contain collective operations
• This stratification excludes nested data parallelism
Collective Operations
• Collective operations comprise many scalar operations applied in parallel
  - constant lifts a plain value into the Exp land of scalar expressions

example :: Acc (Vector Int)
example = generate (constant (Z:.10)) (\ix -> f ix)

constant :: Elt e => e -> Exp e

• Here generate (constant (Z:.10)) f applies the scalar function
  f :: Exp DIM1 -> Exp Int at every index (f 0, f 1, ... f 9),
  producing an Acc (Vector Int)
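The behaviour of generate can be modelled with ordinary lists. Below is a plain-Haskell sketch (the helper name generateList is ours, not part of the Accelerate API); the real generate runs the function at every index in parallel on the GPU.

```haskell
-- A list model of generate: build a result by applying a function at
-- every index of the output. (generateList is a hypothetical helper.)
generateList :: Int -> (Int -> a) -> [a]
generateList n f = map f [0 .. n - 1]

main :: IO ()
main = print (generateList 10 (* 2))   -- [0,2,4,6,8,10,12,14,16,18]
```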
Mandelbrot set generator

genPlane :: Exp (Float,Float,Float,Float) -> Acc ComplexPlane
genPlane view =
  generate (constant (Z:.600:.800))
           (\ix -> let Z :. y :. x = unlift ix
                   in  lift ( xmin + (fromIntegral x * viewx) / 800
                            , ymin + (fromIntegral y * viewy) / 600 ))
  where
    (xmin,ymin,xmax,ymax) = unlift view
    viewx                 = xmax - xmin
    viewy                 = ymax - ymin

- unlift has a funky type, but just unpacks “stuff”; lift does the reverse.
  At its two uses here it is instantiated as:

unlift :: Exp (Z :. Int :. Int) -> Z :. Exp Int :. Exp Int
unlift :: Exp (F,F,F,F)         -> (Exp F, Exp F, Exp F, Exp F)

- Even though operations are in Exp, we can still use standard Haskell
  operations like (*) and (-)
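The per-pixel arithmetic inside genPlane can be sketched in plain Haskell (the helper name pixelToPlane is hypothetical); in Accelerate the same computation lives inside Exp and runs on the GPU.

```haskell
-- Map a pixel coordinate (y,x) of a 600x800 image onto the viewed
-- rectangle (xmin,ymin,xmax,ymax) of the complex plane.
pixelToPlane :: (Float, Float, Float, Float) -> (Int, Int) -> (Float, Float)
pixelToPlane (xmin, ymin, xmax, ymax) (y, x) =
  ( xmin + (fromIntegral x * (xmax - xmin)) / 800
  , ymin + (fromIntegral y * (ymax - ymin)) / 600 )

main :: IO ()
main = do
  print (pixelToPlane (-2, -1, 1, 1) (0, 0))     -- one corner of the view
  print (pixelToPlane (-2, -1, 1, 1) (600, 800)) -- the opposite corner
```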
Mandelbrot set generator
• Let’s define the function we will iterate:  z_{n+1} = c + z_n^2
  - Use lift and unlift as before to unpack the tuples
  - Note that we had to add some type signatures to unlift

next :: Exp Complex -> Exp Complex -> Exp Complex
next c z = plus c (times z z)

plus :: Exp Complex -> Exp Complex -> Exp Complex
plus a b = lift (ax+bx, ay+by)
  where
    (ax,ay) = unlift a :: (Exp Float, Exp Float)
    (bx,by) = unlift b :: (Exp Float, Exp Float)

lift :: (Exp Float, Exp Float) -> Exp (Float, Float)
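For comparison, the same iterated function can be written against Data.Complex from base, where complex addition and multiplication come for free; in Accelerate the computation lives in Exp, which is why the slide defines plus and times by hand.

```haskell
-- z_{n+1} = c + z_n^2 over ordinary Haskell complex numbers.
import Data.Complex (Complex ((:+)))

next :: Complex Float -> Complex Float -> Complex Float
next c z = c + z * z

main :: IO ()
main = print (next (1 :+ 0) (1 :+ 1))   -- 1 + (1+i)^2 = 1 + 2i
```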
Mandelbrot set generator
• Complication: GPUs must do the same thing to lots of different data
  - Keep a pair (z,i) for each pixel, where i is the point at which the
    iteration diverged
  - fst and snd extract the individual tuple components
  - dot is the squared magnitude of a complex number
  - Conditionals can lead to SIMD divergence, so use them sparingly

(?) :: Elt t => Exp Bool -> (Exp t, Exp t) -> Exp t

step :: Exp Complex -> Exp (Complex, Int) -> Exp (Complex, Int)
step c zi = f (fst zi) (snd zi)
  where
    f z i = let z' = next c z
            in  dot z' >* 4 ? ( zi, lift (z', i+1) )
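The escape-time logic of step can be modelled in plain Haskell: advance z and bump the counter unless the iterate has already diverged. The names dot and step mirror the slide, but this is a sequential sketch, not the Exp version.

```haskell
import Data.Complex (Complex ((:+)), imagPart, realPart)

-- Squared magnitude of a complex number.
dot :: Complex Float -> Float
dot z = realPart z ^ (2 :: Int) + imagPart z ^ (2 :: Int)

-- One escape-time step: keep the old pair once |z'|^2 > 4.
step :: Complex Float -> (Complex Float, Int) -> (Complex Float, Int)
step c (z, i) =
  let z' = c + z * z
  in  if dot z' > 4 then (z, i) else (z', i + 1)

main :: IO ()
main = do
  print (step (0 :+ 0) (0 :+ 0, 0))   -- bounded: counter increments
  print (step (3 :+ 0) (3 :+ 0, 1))   -- diverged: pair left unchanged
```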
Mandelbrot set generator
• Accelerate is a meta-programming language
  - So, just use regular Haskell to unroll the loop a fixed number of times
  - zipWith applies its function step pairwise to elements of the two arrays
  - go replicates the transition step from z_n to z_{n+1}
  - foldl1 (.) applies the steps in sequence, beginning with the initial
    data zs0, since f (g x) ≣ (f . g) x

mandel :: Exp (Float,Float,Float,Float) -> Acc (Array DIM2 (Complex, Int))
mandel view = Prelude.foldl1 (.) (Prelude.replicate 255 go) zs0
  where
    cs    = genPlane view
    zs0   = zip cs (fill (shape cs) 0)
    go zs = zipWith step cs zs
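The unrolling trick stands on its own in plain Haskell: composing n copies of a function with foldl1 (.) is the same as applying it n times (for n >= 1). The helper name applyN is ours.

```haskell
-- foldl1 (.) (replicate n f) builds f . f . ... . f, n copies deep.
applyN :: Int -> (a -> a) -> a -> a
applyN n f = foldl1 (.) (replicate n f)

main :: IO ()
main = print (applyN 255 (+ 1) (0 :: Int))   -- 255
```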
In the workshop…
• More example code, computational kernels
- Common pitfalls
- Tips for good performance
- Figuring out what went wrong (or knowing who to blame)
Questions?
https://github.com/AccelerateHS/
http://xkcd.com/365/
Extra Slides…
Seriously?
Arrays
• Create an array from a list:

data Array sh e

fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

- Generates a multidimensional array by consuming elements from the list
  and adding them to the array in row-major order
• Example:

> fromList (Z:.10) [1..10] :: Vector Float
Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

• Multidimensional arrays are similar:
  - Elements are filled along the right-most dimension first

> fromList (Z:.3:.5) [1..] :: Array DIM2 Int
Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
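Row-major order can be checked with simple index arithmetic: the element at (Z :. y :. x) of an array with cols columns sits at linear position y * cols + x. The helper linearIndex below is ours, for illustration only.

```haskell
-- Linear position of (y,x) in a row-major array with the given column count.
linearIndex :: Int -> (Int, Int) -> Int
linearIndex cols (y, x) = y * cols + x

main :: IO ()
-- In a 3x5 array filled from [1..], the element at (2,1) is 12.
main = print (([1 ..] !! linearIndex 5 (2, 1)) :: Int)   -- 12
```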
Arrays
• Array indices start counting from zero

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> indexArray mat (Z:.2:.1)
12

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
Arrays
• Similarly, an array of (nested) tuples:
  - This is just a trick: internally it is converted into a tuple of arrays
    (here, a pair of a DIM2 Int array and a DIM2 Char array)

> fromList (Z:.2:.3) $ P.zip [1..] ['a'..] :: Array DIM2 (Int,Char)
Array (Z :. 2 :. 3) [(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e'),(6,'f')]
Data.Array.Accelerate
• Need to import both the base library as well as a backend
  - There is also an interpreter available for testing
  - Runs without using the GPU (much more slowly, of course)

import Prelude                     as P
import Data.Array.Accelerate       as A
import Data.Array.Accelerate.CUDA  as CUDA
Data.Array.Accelerate
• To get arrays into the Acc world:

use :: Arrays a => a -> Acc a

  - This may involve copying data to the GPU
• use injects arrays into our DSL
• run executes the computation to get arrays out
• Using Accelerate focuses on everything in between
Collective Operations
• Example: add one to each element of an array

> let arr = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ A.map (+1) (use arr)
Array (Z :. 3 :. 5) [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]

• What is the type of map?

map :: (Shape sh, Elt a, Elt b)   -- supported shape & element types
    => (Exp a -> Exp b)           -- function to apply at every element. But what is Exp?
    -> Acc (Array sh a)           -- DSL array
    -> Acc (Array sh b)
Scalar Expressions
• The type class overloading trick is used for standard Haskell classes

(+1) :: (Elt a, IsNum a) => Exp a -> Exp a

• Standard boolean operations are available with slightly different names
  - The standard names cannot be overloaded

(==*) :: (Elt t, IsScalar t) => Exp t -> Exp t -> Exp Bool
(/=*), (<*), (>*), min, max, (||*), (&&*)   -- and so on...

• Conditionals
  - Use sparingly: leads to SIMD divergence

(?) :: Elt t => Exp Bool -> (Exp t, Exp t) -> Exp t
Scalar Expressions
• Bring a Haskell value into Exp land

constant :: Elt e => e -> Exp e

• Lift an expression into a singleton array

unit :: Exp e -> Acc (Scalar e)

• Extract the element from a singleton array

the :: Acc (Scalar e) -> Exp e
Reductions
• Folding (+) over a vector produces a sum
  - The result is a one-element array (scalar). Why?

> let xs = fromList (Z:.10) [1..] :: Vector Int
> run $ A.fold (+) 0 (use xs)
Array (Z) [55]

• Fold has an interesting type:

fold :: (Shape sh, Elt a)
     => (Exp a -> Exp a -> Exp a)
     -> Exp a
     -> Acc (Array (sh:.Int) a)   -- input array
     -> Acc (Array sh a)          -- outer dimension removed
Reductions
• Fold occurs over the outer dimension of the array

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ A.fold (+) 0 (use mat)
Array (Z :. 3) [15,40,65]

 1  2  3  4  5  ->  15
 6  7  8  9 10  ->  40
11 12 13 14 15  ->  65
Reductions
• Is this a left-fold or a right-fold?
  - Neither! The fold happens in parallel, tree-like
  - Therefore the function must be associative: (Exp a -> Exp a -> Exp a)
  - (We pretend that floating-point operations are associative, though
    strictly speaking they are not)
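The tree-shaped reduction can be sketched in plain Haskell: combine adjacent pairs until one value remains. For an associative operator this agrees with a sequential fold; the helper treeFold is ours and assumes a non-empty input.

```haskell
-- Pairwise tree reduction: [1..10] -> [3,7,11,15,19] -> ... -> [55].
treeFold :: (a -> a -> a) -> [a] -> a
treeFold _ [x] = x
treeFold f xs  = treeFold f (pairwise xs)
  where
    pairwise (a : b : rest) = f a b : pairwise rest
    pairwise rest           = rest   -- odd element carried through

main :: IO ()
main = print (treeFold (+) [1 .. 10 :: Int])   -- 55, same as a left fold
```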
Stencils
• A stencil is a map with access to the neighbourhood around each element
  - Useful in many scientific & image processing algorithms
  - Boundary conditions specify how to handle out-of-bounds neighbours
  - The 3x3 neighbourhood around the centre c is:

      t
    l c r
      b

laplace :: Stencil3x3 Int -> Exp Int
laplace ((_,t,_)
        ,(l,c,r)
        ,(_,b,_)) = t + b + l + r - 4*c

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ stencil laplace (Constant 0) (use mat)
Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]
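The Laplace stencil with a (Constant 0) boundary can be modelled on lists: out-of-bounds reads yield 0. The helper laplaceAt is ours; on the 3x5 [1..] matrix it reproduces the interpreter output shown on the slide.

```haskell
-- 3x3 Laplace stencil at one grid point, zero outside the grid.
laplaceAt :: [[Int]] -> (Int, Int) -> Int
laplaceAt grid (y, x) =
    get (y - 1) x + get (y + 1) x + get y (x - 1) + get y (x + 1) - 4 * get y x
  where
    rows = length grid
    cols = length (head grid)
    get i j
      | i < 0 || j < 0 || i >= rows || j >= cols = 0   -- boundary: Constant 0
      | otherwise                                 = grid !! i !! j

main :: IO ()
main = print [ laplaceAt [[1 .. 5], [6 .. 10], [11 .. 15]] (y, x)
             | y <- [0 .. 2], x <- [0 .. 4] ]
-- [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]
```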
Index Transforms
• Index transforms change the order of elements, not their values
  - We usually want to push such operations into the consumer
• backpermute specifies which element to read from a source array

backpermute :: (Shape ix, Shape ix', Elt a)
            => Exp ix'               -- shape of result
            -> (Exp ix' -> Exp ix)   -- index mapping from destination array to source
            -> Acc (Array ix a)      -- source data
            -> Acc (Array ix' a)
Index Transforms
• Index transforms change the order of elements, not their values

transpose mat =
  let swap = lift1 $ \(Z:.j:.i) -> Z:.i:.j :: Z :. Exp Int :. Exp Int
  in  backpermute (swap $ shape mat) swap mat

 1  2  3  4  5        1  6 11
 6  7  8  9 10   ->   2  7 12
11 12 13 14 15        3  8 13
                      4  9 14
                      5 10 15
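The semantics of backpermute can be modelled on nested lists: for each destination index, read the source at the mapped index. With the index swap (y,x) -> (x,y) this is exactly transposition; the helper backpermuteL is ours, not the Accelerate function.

```haskell
-- List model of backpermute: destination shape, index mapping, source.
backpermuteL :: (Int, Int) -> ((Int, Int) -> (Int, Int)) -> [[a]] -> [[a]]
backpermuteL (rows, cols) p src =
  [ [ let (y', x') = p (y, x) in src !! y' !! x'
    | x <- [0 .. cols - 1] ]
  | y <- [0 .. rows - 1] ]

main :: IO ()
-- Transposing a 2x3 matrix by swapping indices.
main = print (backpermuteL (3, 2) (\(y, x) -> (x, y))
                           [[1, 2, 3], [4, 5, 6 :: Int]])
-- [[1,4],[2,5],[3,6]]
```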