GPGPU Programming in Haskell with Accelerate

GPGPU Programming in Haskell with Accelerate Trevor L. McDonell University
of New South Wales ! @tlmcdonell [email protected] ! https://github.com/AccelerateHS

What is GPGPU Programming? • General Purpose Programming on Graphics
Processing Units (GPUs) • Using your graphics card for something other than playing games • GPUs have many more cores than a CPU - GeForce GTX Titan - 2688 cores @ 837 MHz - 6 GB memory @ 288 GB/s

What is GPGPU Programming? • Main differences: - Single program
multiple data (SPMD / SIMD), or just data-parallelism - All the cores run the same program, but on different data • We can’t program these in the same way as a CPU - Different instruction sets: can’t run a Haskell program directly - More restrictive hardware designs, limited control structures • GPUs have their own memory - Data has to be explicitly moved back and forth

Dot product four ways • Dot-product: pair-wise multiply two arrays
and sum the result • C (sequential): float dotp(float *xs, float *ys, int size) { int i; float sum = 0; ! for (i = 0; i < size; ++i) { sum = sum + xs[i] * ys[i]; } ! return sum; }

Dot product four ways • Haskell (sequential): dotp :: [Float]
-‐> [Float] -‐> Float dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )

Dot product four ways • Haskell (sequential): - [Float] is
a list of ﬂoating point numbers dotp :: [Float] -‐> [Float] -‐> Float dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )

a list of ﬂoating point numbers - zipWith applies the function (*) element-wise to the two input lists dotp :: [Float] -‐> [Float] -‐> Float dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )

a list of ﬂoating point numbers - zipWith applies the function (*) element-wise to the two input lists - foldl sums the result of zipWith dotp :: [Float] -‐> [Float] -‐> Float dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )

Dot product four ways • CUDA (parallel): - Step 1:
element-wise multiplication __global__ void zipWithMult(float *xs, float *ys, float *zs, int size) { int i = blockDim.x * blockIdx.x + threadIdx.x; ! if (i < size) { zs[i] = xs[i] * ys[i]; } }

element-wise multiplication - __global__ indicates this is a function that runs on the GPU __global__ void zipWithMult(float *xs, float *ys, float *zs, int size) { int i = blockDim.x * blockIdx.x + threadIdx.x; ! if (i < size) { zs[i] = xs[i] * ys[i]; } }

element-wise multiplication - __global__ indicates this is a function that runs on the GPU - spawn one thread for each element in the vector: unique for each thread __global__ void zipWithMult(float *xs, float *ys, float *zs, int size) { int i = blockDim.x * blockIdx.x + threadIdx.x; ! if (i < size) { zs[i] = xs[i] * ys[i]; } }

vector reduction … is somewhat complex struct SharedMemory { __device__ inline operator float *() { extern __shared__ int __smem[]; return (float *)__smem; } ! __device__ inline operator const float *() const { extern __shared__ int __smem[]; return (float *)__smem; } }; ! template <unsigned int blockSize, bool nIsPow2> __global__ void reduce_kernel(float *g_idata, float *g_odata, unsigned int n) { float *sdata = SharedMemory(); ! unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x; unsigned int gridSize = blockSize*2*gridDim.x; ! float sum = 0; ! while (i < n) { sum += g_idata[i]; ! if (nIsPow2 || i + blockSize < n) sum += g_idata[i+blockSize]; ! i += gridSize; } ! sdata[tid] = sum; __syncthreads(); ! if (blockSize >= 512) { if (tid < 256) { sdata[tid] = sum = sum + sdata[tid + 256]; } ! __syncthreads(); } ! if (blockSize >= 256) { if (tid < 128) { sdata[tid] = sum = sum + sdata[tid + 128]; } ! __syncthreads(); } ! if (blockSize >= 128) { if (tid < 64) { sdata[tid] = sum = sum + sdata[tid + 64]; } ! __syncthreads(); } ! if (tid < 32) { volatile float *smem = sdata; ! if (blockSize >= 64) { smem[tid] = sum = sum + smem[tid + 32]; } ! if (blockSize >= 32) { smem[tid] = sum = sum + smem[tid + 16]; } ! if (blockSize >= 16) { smem[tid] = sum = sum + smem[tid + 8]; } ! if (blockSize >= 8) { smem[tid] = sum = sum + smem[tid + 4]; } ! if (blockSize >= 4) { smem[tid] = sum = sum + smem[tid + 2]; } ! if (blockSize >= 2) { smem[tid] = sum = sum + smem[tid + 1]; } } ! if (tid == 0) g_odata[blockIdx.x] = sdata[0]; } ! void getNumBlocksAndThreads(int n, int maxBlocks, int maxThreads, int &blocks, int &threads) { cudaDeviceProp prop; int device; checkCudaErrors(cudaGetDevice(&device)); checkCudaErrors(cudaGetDeviceProperties(&prop, device)); ! threads = (n < maxThreads*2) ? nextPow2((n + 1)/ 2) : maxThreads; blocks = (n + (threads * 2 -‐ 1)) / (threads * 2); ! if (blocks > prop.maxGridSize[0]) { blocks /= 2; threads *= 2; } ! blocks = min(maxBlocks, blocks); } ! float reduce(int n, float *d_idata, float *d_odata) { int threads = 0; int blocks = 0; int maxThreads = 256; int maxBlocks = 64; int size = n ! while (size > 1) { getNumBlocksAndThreads(size, maxBlocks, maxThreads, blocks, threads); ! int smemSize = (threads <= 32) ? 2 * threads * sizeof(float) : threads * sizeof(float); dim3 dimBlock(threads, 1, 1); dim3 dimGrid(blocks, 1, 1); ! if (isPow2(size)) { switch (threads) { case 512: reduce_kernel<512, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 256: reduce_kernel<256, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 128: reduce_kernel<128, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 64: reduce_kernel< 64, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 32: reduce_kernel< 32, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 16: reduce_kernel< 16, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 8: reduce_kernel< 8, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 4: reduce_kernel< 4, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 2: reduce_kernel< 2, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 1: reduce_kernel< 1, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; } } else { switch (threads) { case 512: reduce_kernel<512, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 256: reduce_kernel<256, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 128: reduce_kernel<128, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 64: reduce_kernel< 64, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 32: reduce_kernel< 32, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 16: reduce_kernel< 16, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 8: reduce_kernel< 8, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 4: reduce_kernel< 4, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 2: reduce_kernel< 2, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 1: reduce_kernel< 1, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; } } ! size = (size + (threads*2-‐1)) / (threads*2); } ! float sum; checkCudaErrors(cudaMemcpy(&sum, d_odata, sizeof(float), cudaMemcpyDeviceToHost)); ! } !

vector reduction … is somewhat complex struct SharedMemory { __device__ inline operator float *() { extern __shared__ int __smem[]; return (float *)__smem; } ! __device__ inline operator const float *() const { extern __shared__ int __smem[]; return (float *)__smem; } }; ! template <unsigned int blockSize, bool nIsPow2> __global__ void reduce_kernel(float *g_idata, float *g_odata, unsigned int n) { float *sdata = SharedMemory(); ! unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x; unsigned int gridSize = blockSize*2*gridDim.x; ! float sum = 0; ! while (i < n) { sum += g_idata[i]; ! if (nIsPow2 || i + blockSize < n) sum += g_idata[i+blockSize]; ! i += gridSize; } ! sdata[tid] = sum; __syncthreads(); ! if (blockSize >= 512) { if (tid < 256) { sdata[tid] = sum = sum + sdata[tid + 256]; } ! __syncthreads(); } ! if (blockSize >= 256) { if (tid < 128) { sdata[tid] = sum = sum + sdata[tid + 128]; } ! __syncthreads(); } ! if (blockSize >= 128) { if (tid < 64) { sdata[tid] = sum = sum + sdata[tid + 64]; } ! __syncthreads(); } ! if (tid < 32) { volatile float *smem = sdata; ! if (blockSize >= 64) { smem[tid] = sum = sum + smem[tid + 32]; } ! if (blockSize >= 32) { smem[tid] = sum = sum + smem[tid + 16]; } ! if (blockSize >= 16) { smem[tid] = sum = sum + smem[tid + 8]; } ! if (blockSize >= 8) { smem[tid] = sum = sum + smem[tid + 4]; } ! if (blockSize >= 4) { smem[tid] = sum = sum + smem[tid + 2]; } ! if (blockSize >= 2) { smem[tid] = sum = sum + smem[tid + 1]; } } ! if (tid == 0) g_odata[blockIdx.x] = sdata[0]; } ! void getNumBlocksAndThreads(int n, int maxBlocks, int maxThreads, int &blocks, int &threads) { cudaDeviceProp prop; int device; checkCudaErrors(cudaGetDevice(&device)); checkCudaErrors(cudaGetDeviceProperties(&prop, device)); ! threads = (n < maxThreads*2) ? nextPow2((n + 1)/ 2) : maxThreads; blocks = (n + (threads * 2 -‐ 1)) / (threads * 2); ! if (blocks > prop.maxGridSize[0]) { blocks /= 2; threads *= 2; } ! blocks = min(maxBlocks, blocks); } ! float reduce(int n, float *d_idata, float *d_odata) { int threads = 0; int blocks = 0; int maxThreads = 256; int maxBlocks = 64; int size = n ! while (size > 1) { getNumBlocksAndThreads(size, maxBlocks, maxThreads, blocks, threads); ! int smemSize = (threads <= 32) ? 2 * threads * sizeof(float) : threads * sizeof(float); dim3 dimBlock(threads, 1, 1); dim3 dimGrid(blocks, 1, 1); ! if (isPow2(size)) { switch (threads) { case 512: reduce_kernel<512, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 256: reduce_kernel<256, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 128: reduce_kernel<128, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 64: reduce_kernel< 64, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 32: reduce_kernel< 32, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 16: reduce_kernel< 16, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 8: reduce_kernel< 8, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 4: reduce_kernel< 4, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 2: reduce_kernel< 2, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 1: reduce_kernel< 1, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; } } else { switch (threads) { case 512: reduce_kernel<512, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 256: reduce_kernel<256, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 128: reduce_kernel<128, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 64: reduce_kernel< 64, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 32: reduce_kernel< 32, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 16: reduce_kernel< 16, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 8: reduce_kernel< 8, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 4: reduce_kernel< 4, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 2: reduce_kernel< 2, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; case 1: reduce_kernel< 1, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break; } } ! size = (size + (threads*2-‐1)) / (threads*2); } ! float sum; checkCudaErrors(cudaMemcpy(&sum, d_odata, sizeof(float), cudaMemcpyDeviceToHost)); ! } ! o_O

Dot product four ways • Accelerate (parallel): dotp :: Acc
(Vector Float) -‐> Acc (Vector Float) -‐> Acc (Scalar Float) dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )

Dot product four ways • Accelerate (parallel): • Recall the
sequential Haskell version: dotp :: Acc (Vector Float) -‐> Acc (Vector Float) -‐> Acc (Scalar Float) dotp xs ys = fold (+) 0 ( zipWith (*) xs ys ) dotp :: [Float] -‐> [Float] -‐> Float dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )

sequential Haskell version: dotp :: Acc (Vector Float) -‐> Acc (Vector Float) -‐> Acc (Scalar Float) dotp xs ys = fold (+) 0 ( zipWith (*) xs ys ) dotp :: [Float] -‐> [Float] -‐> Float dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys ) left-to-right traversal

sequential Haskell version: dotp :: Acc (Vector Float) -‐> Acc (Vector Float) -‐> Acc (Scalar Float) dotp xs ys = fold (+) 0 ( zipWith (*) xs ys ) dotp :: [Float] -‐> [Float] -‐> Float dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys ) left-to-right traversal neither left nor right: happens in parallel (tree-like)

sequential Haskell version: dotp :: Acc (Vector Float) -‐> Acc (Vector Float) -‐> Acc (Scalar Float) dotp xs ys = fold (+) 0 ( zipWith (*) xs ys ) dotp :: [Float] -‐> [Float] -‐> Float dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys ) left-to-right traversal neither left nor right: happens in parallel (tree-like) But… how does it perform?

Dot product four ways 0.1 1 10 100 2 4
6 8 10 12 14 16 18 20 Run Time (ms) Elements (millions) Dot product sequential Accelerate CUBLAS Tesla T10 (240 cores @ 1.3 GHz) vs. Xenon E5405 (2GHz)

6 8 10 12 14 16 18 20 Run Time (ms) Elements (millions) Dot product sequential Accelerate CUBLAS 1.2x Tesla T10 (240 cores @ 1.3 GHz) vs. Xenon E5405 (2GHz)

6 8 10 12 14 16 18 20 Run Time (ms) Elements (millions) Dot product sequential Accelerate CUBLAS 30x 1.2x Tesla T10 (240 cores @ 1.3 GHz) vs. Xenon E5405 (2GHz)

Mandelbrot fractal

Mandelbrot fractal n-body gravitational simulation

Mandelbrot fractal n-body gravitational simulation Canny edge detection

Mandelbrot fractal n-body gravitational simulation Canny edge detection SmoothLife cellular
automata

automata stable ﬂuid ﬂow

automata stable ﬂuid ﬂow ... d6b821d937a4170b3c4f8ad93495575d: saitek1 d0e52829bf7962ee0aa90550ffdcccaa: laura1230 494a8204b800c41b2da763f9bbbcc462: lina03 d8ff07c52a95b30800809758f84ce28c: Jenny10 e81bed02faa9892f8360c705241191ae: carmen89 46f7d75718029de99dd81fd907034bc9: mellon22 0dd3c176cf34486ec00b526b6920b782: helena04 9351c4bc8c8ba17b58d5a6a1f839f356: 85548554 9c36c5599f40d08f874559ac824d091a: 585123456 4b4dce6c91b429e8360aa65f97342e90: 5678go 3aa561d4c17d9d58443fc15d10cc86ae: momo55 ! Recovered 150/1000 (15.00 %) digests in 59.45 s, 185.03 MHash/sec Password “recovery” (MD5 dictionary attack)

Accelerate • Accelerate is a Domain-Speciﬁc Language for GPU programming
Haskell/Accelerate program CUDA code Compile with NVIDIA’s compiler & load onto the GPU Copy result back to Haskell Transform Accelerate program into CUDA program

Accelerate • Accelerate is a Domain-Speciﬁc Language for GPU programming
- This process may happen several times during program execution - Code and data fragments get cached and reused • An Accelerate program is a Haskell program that generates a CUDA program - However, in many respects this still looks like a Haskell program - Shares various concepts with Repa, a Haskell array library for CPUs

Accelerate • To execute an Accelerate computation (on the GPU):
- run comes from whichever backend we have chosen (CUDA) run :: Arrays a => Acc a -‐> a

- run comes from whichever backend we have chosen (CUDA) - Arrays constrains the result to be an Array, or tuple thereof run :: Arrays a => Acc a -‐> a

- run comes from whichever backend we have chosen (CUDA) - Arrays constrains the result to be an Array, or tuple thereof • What is Acc? run :: Arrays a => Acc a -‐> a

- run comes from whichever backend we have chosen (CUDA) - Arrays constrains the result to be an Array, or tuple thereof • What is Acc? - This is our DSL type - A data structure (Abstract Syntax Tree) representing a computation that once executed will yield a result of type ‘a’ run :: Arrays a => Acc a -‐> a

- run comes from whichever backend we have chosen (CUDA) - Arrays constrains the result to be an Array, or tuple thereof • What is Acc? - This is our DSL type - A data structure (Abstract Syntax Tree) representing a computation that once executed will yield a result of type ‘a’ run :: Arrays a => Acc a -‐> a Accelerate is a library of collective operations over arrays of type Acc a

Accelerate • Accelerate computations take place on arrays - Parallelism
is introduced in the form of collective operations over arrays Accelerate computation Arrays in Arrays out run :: Arrays a => Acc a -‐> a

is introduced in the form of collective operations over arrays • Arrays have two type parameters data Array sh e Accelerate computation Arrays in Arrays out run :: Arrays a => Acc a -‐> a

is introduced in the form of collective operations over arrays • Arrays have two type parameters - The shape of the array, or dimensionality data Array sh e Accelerate computation Arrays in Arrays out run :: Arrays a => Acc a -‐> a

is introduced in the form of collective operations over arrays • Arrays have two type parameters - The shape of the array, or dimensionality - The element type of the array: Int, Float, etc. data Array sh e Accelerate computation Arrays in Arrays out run :: Arrays a => Acc a -‐> a

Arrays • Supported array element types are members of the
Elt class: - () - Int, Int32, Int64, Word, Word32, Word64... - Float, Double - Char - Bool - Tuples up to 9-tuples of these, including nested tuples • Note that Array itself is not an allowable element type. There are no nested arrays in Accelerate, regular arrays only! data Array sh e

Accelerate by example Mandelbrot fractal

Mandelbrot set generator • Basics - Pick a window onto
the complex plane & divide into pixels - A point is in the set if the value of does not diverge to inﬁnity ! - Each pixel has a value given by its coordinates in the complex plane - Colour depends on number of iterations before divergence • Each pixel is independent: lots of data parallelism c n |z| zn+1 = c + z2 n

Mandelbrot set generator • First, some types: - A pair
of ﬂoating point numbers for the real and imaginary parts data Complex = (Float, Float) data Array sh e

Mandelbrot set generator • First, some types: - A pair
of ﬂoating point numbers for the real and imaginary parts - DIM2 is a type synonym for a two dimensional Shape data Complex = (Float, Float) data ComplexPlane = Array DIM2 Complex data Array sh e

Shapes • Shapes determines the dimensions of the array and
the type of the index - Z represents a rank-zero array (singleton array with one element) - (:.) increases the rank by adding a new dimension on the right data Z = Z data tail :. head = tail :. head

the type of the index - Z represents a rank-zero array (singleton array with one element) - (:.) increases the rank by adding a new dimension on the right • Examples: - One-dimensional array (Vector) indexed by Int: (Z :. Int) - Two-dimensional array, indexed by Int: (Z :. Int :. Int) data Z = Z data tail :. head = tail :. head

the type of the index - Z represents a rank-zero array (singleton array with one element) - (:.) increases the rank by adding a new dimension on the right • Examples: - One-dimensional array (Vector) indexed by Int: (Z :. Int) - Two-dimensional array, indexed by Int: (Z :. Int :. Int) • This style is used at both the type and value level: data Z = Z data tail :. head = tail :. head sh :: Z :. Int sh = Z :. 10

Mandelbrot set generator • The initial complex plane: • generate
is a collective operation that yields a value for an index in the array generate :: (Shape sh, Elt a) => Exp sh -‐> (Exp sh -‐> Exp a) -‐> Acc (Array sh a) z0 = c

is a collective operation that yields a value for an index in the array - Supported shape and element types: we will use DIM2 and Complex generate :: (Shape sh, Elt a) => Exp sh -‐> (Exp sh -‐> Exp a) -‐> Acc (Array sh a) z0 = c

is a collective operation that yields a value for an index in the array - Supported shape and element types: we will use DIM2 and Complex - Size of the result array: number of pixels in the image generate :: (Shape sh, Elt a) => Exp sh -‐> (Exp sh -‐> Exp a) -‐> Acc (Array sh a) z0 = c

is a collective operation that yields a value for an index in the array - Supported shape and element types: we will use DIM2 and Complex - Size of the result array: number of pixels in the image - A function to apply at every index: generate the values of at each pixel generate :: (Shape sh, Elt a) => Exp sh -‐> (Exp sh -‐> Exp a) -‐> Acc (Array sh a) c z0 = c

is a collective operation that yields a value for an index in the array - Supported shape and element types: we will use DIM2 and Complex - Size of the result array: number of pixels in the image - A function to apply at every index: generate the values of at each pixel generate :: (Shape sh, Elt a) => Exp sh -‐> (Exp sh -‐> Exp a) -‐> Acc (Array sh a) c z0 = c If Acc is our DSL type, what is Exp?

A Stratiﬁed Language • Accelerate is split into two worlds:
Acc and Exp - Acc represents collective operations over instances of Arrays - Exp is a scalar computation on things of type Elt

Acc and Exp - Acc represents collective operations over instances of Arrays - Exp is a scalar computation on things of type Elt • Collective operations in Acc comprise many scalar operations in Exp, executed in parallel over Arrays - Scalar operations can not contain collective operations

Acc and Exp - Acc represents collective operations over instances of Arrays - Exp is a scalar computation on things of type Elt • Collective operations in Acc comprise many scalar operations in Exp, executed in parallel over Arrays - Scalar operations can not contain collective operations • This stratiﬁcation excludes nested data parallelism

Collective Operations • Collective operations comprise many scalar operations applied
in parallel example :: Acc (Vector Int) example = generate (constant (Z:.10)) (\ix -‐> f ix)

in parallel - constant lifts a plain value into Exp land of scalar expressions example :: Acc (Vector Int) example = generate (constant (Z:.10)) (\ix -‐> f ix) constant :: Elt e => e -‐> Exp e

in parallel - constant lifts a plain value into Exp land of scalar expressions example :: Acc (Vector Int) example = generate (constant (Z:.10)) (\ix -‐> f ix) generate (constant (Z:.10)) f constant :: Elt e => e -‐> Exp e

in parallel - constant lifts a plain value into Exp land of scalar expressions example :: Acc (Vector Int) example = generate (constant (Z:.10)) (\ix -‐> f ix) f 0 f 1 f 2 f 3 f 4 f 5 f 6 f 7 f 8 f 9 generate (constant (Z:.10)) f f :: Exp DIM1 -‐> Exp Int constant :: Elt e => e -‐> Exp e

in parallel - constant lifts a plain value into Exp land of scalar expressions example :: Acc (Vector Int) example = generate (constant (Z:.10)) (\ix -‐> f ix) f 0 f 1 f 2 f 3 f 4 f 5 f 6 f 7 f 8 f 9 Acc (Vector Int) constant :: Elt e => e -‐> Exp e

Mandelbrot set generator genPlane :: Exp (Float,Float,Float,Float) -‐> Acc ComplexPlane
genPlane view = generate (constant (Z:.600:.800)) (\ix -‐> let Z :. y :. x = unlift ix in lift ( xmin + (fromIntegral x * viewx) / 800 , ymin + (fromIntegral y * viewy) / 600 )) where (xmin,ymin,xmax,ymax) = unlift view viewx = xmax -‐ xmin viewy = ymax -‐ ymin

- unlift has a funky type, but just unpacks “stuﬀ”.
lift does the reverse. Mandelbrot set generator genPlane :: Exp (Float,Float,Float,Float) -‐> Acc ComplexPlane genPlane view = generate (constant (Z:.600:.800)) (\ix -‐> let Z :. y :. x = unlift ix in lift ( xmin + (fromIntegral x * viewx) / 800 , ymin + (fromIntegral y * viewy) / 600 )) where (xmin,ymin,xmax,ymax) = unlift view viewx = xmax -‐ xmin viewy = ymax -‐ ymin unlift :: Exp (Z :. Int :. Int) -‐> Z :. Exp Int :. Exp Int

lift does the reverse. Mandelbrot set generator genPlane :: Exp (Float,Float,Float,Float) -‐> Acc ComplexPlane genPlane view = generate (constant (Z:.600:.800)) (\ix -‐> let Z :. y :. x = unlift ix in lift ( xmin + (fromIntegral x * viewx) / 800 , ymin + (fromIntegral y * viewy) / 600 )) where (xmin,ymin,xmax,ymax) = unlift view viewx = xmax -‐ xmin viewy = ymax -‐ ymin unlift :: Exp (F,F,F,F) -‐> (Exp F, Exp F, Exp F, Exp F)

lift does the reverse. Mandelbrot set generator genPlane :: Exp (Float,Float,Float,Float) -‐> Acc ComplexPlane genPlane view = generate (constant (Z:.600:.800)) (\ix -‐> let Z :. y :. x = unlift ix in lift ( xmin + (fromIntegral x * viewx) / 800 , ymin + (fromIntegral y * viewy) / 600 )) where (xmin,ymin,xmax,ymax) = unlift view viewx = xmax -‐ xmin viewy = ymax -‐ ymin

lift does the reverse. - Even though operations are in Exp, can still use standard Haskell operations like (*) and (-‐) Mandelbrot set generator genPlane :: Exp (Float,Float,Float,Float) -‐> Acc ComplexPlane genPlane view = generate (constant (Z:.600:.800)) (\ix -‐> let Z :. y :. x = unlift ix in lift ( xmin + (fromIntegral x * viewx) / 800 , ymin + (fromIntegral y * viewy) / 600 )) where (xmin,ymin,xmax,ymax) = unlift view viewx = xmax -‐ xmin viewy = ymax -‐ ymin

Mandelbrot set generator • Let’s deﬁne the function we will
iterate zn+1 = c + z2 n next :: Exp Complex -‐> Exp Complex -‐> Exp Complex next c z = plus c (times z z)

iterate - Use lift and unlift as before to unpack the tuples zn+1 = c + z2 n next :: Exp Complex -‐> Exp Complex -‐> Exp Complex next c z = plus c (times z z) plus :: Exp Complex -‐> Exp Complex -‐> Exp Complex plus a b = lift (ax+bx, ay+by) where (ax,ay) = unlift a :: (Exp Float, Exp Float) (bx,by) = unlift b :: (Exp Float, Exp Float)

iterate - Use lift and unlift as before to unpack the tuples zn+1 = c + z2 n next :: Exp Complex -‐> Exp Complex -‐> Exp Complex next c z = plus c (times z z) plus :: Exp Complex -‐> Exp Complex -‐> Exp Complex plus a b = lift (ax+bx, ay+by) where (ax,ay) = unlift a :: (Exp Float, Exp Float) (bx,by) = unlift b :: (Exp Float, Exp Float) lift :: (Exp Float, Exp Float) -‐> Exp (Float, Float)

iterate - Use lift and unlift as before to unpack the tuples - Note that we had to add some type signatures to unlift zn+1 = c + z2 n next :: Exp Complex -‐> Exp Complex -‐> Exp Complex next c z = plus c (times z z) plus :: Exp Complex -‐> Exp Complex -‐> Exp Complex plus a b = lift (ax+bx, ay+by) where (ax,ay) = unlift a :: (Exp Float, Exp Float) (bx,by) = unlift b :: (Exp Float, Exp Float) lift :: (Exp Float, Exp Float) -‐> Exp (Float, Float)

Mandelbrot set generator • Complication: GPUs must do the same
thing to lots of diﬀerent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged zn+1 = c + z2 n

thing to lots of diﬀerent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged zn+1 = c + z2 n step :: Exp Complex -‐> Exp (Complex, Int) -‐> Exp (Complex, Int) step c zi = f (fst zi) (snd zi) where f z i = let z' = next c z in dot z' >* 4 ? ( zi, lift (z', i+1) )

thing to lots of diﬀerent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged - fst and snd to extract individual tuple components zn+1 = c + z2 n step :: Exp Complex -‐> Exp (Complex, Int) -‐> Exp (Complex, Int) step c zi = f (fst zi) (snd zi) where f z i = let z' = next c z in dot z' >* 4 ? ( zi, lift (z', i+1) )

thing to lots of diﬀerent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged - fst and snd to extract individual tuple components - dot is the magnitude of a complex number squared zn+1 = c + z2 n step :: Exp Complex -‐> Exp (Complex, Int) -‐> Exp (Complex, Int) step c zi = f (fst zi) (snd zi) where f z i = let z' = next c z in dot z' >* 4 ? ( zi, lift (z', i+1) )

thing to lots of diﬀerent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged - fst and snd to extract individual tuple components - dot is the magnitude of a complex number squared - Conditionals can lead to SIMD divergence, so use sparingly zn+1 = c + z2 n (?) :: Elt t => Exp Bool -‐> (Exp t, Exp t) -‐> Exp t step :: Exp Complex -‐> Exp (Complex, Int) -‐> Exp (Complex, Int) step c zi = f (fst zi) (snd zi) where f z i = let z' = next c z in dot z' >* 4 ? ( zi, lift (z', i+1) )

Mandelbrot set generator • Accelerate is a meta programming language
- So, just use regular Haskell to unroll the loop a ﬁxed number of times zn+1 = c + z2 n mandel :: Exp (Float,Float,Float,Float) -‐> Acc (Array DIM2 (Complex, Int)) mandel view = Prelude.foldl1 (.) (Prelude.replicate 255 go) zs0 where cs = genPlane view zs0 = zip cs (fill (shape cs) 0) go zs = zipWith step cs zs

- So, just use regular Haskell to unroll the loop a ﬁxed number of times - zipWith applies its function step pairwise to elements of the two arrays zn+1 = c + z2 n mandel :: Exp (Float,Float,Float,Float) -‐> Acc (Array DIM2 (Complex, Int)) mandel view = Prelude.foldl1 (.) (Prelude.replicate 255 go) zs0 where cs = genPlane view zs0 = zip cs (fill (shape cs) 0) go zs = zipWith step cs zs

- So, just use regular Haskell to unroll the loop a ﬁxed number of times - zipWith applies its function step pairwise to elements of the two arrays - Replicate the transition step from to zn+1 = c + z2 n mandel :: Exp (Float,Float,Float,Float) -‐> Acc (Array DIM2 (Complex, Int)) mandel view = Prelude.foldl1 (.) (Prelude.replicate 255 go) zs0 where cs = genPlane view zs0 = zip cs (fill (shape cs) 0) go zs = zipWith step cs zs zn zn+1

- So, just use regular Haskell to unroll the loop a ﬁxed number of times - zipWith applies its function step pairwise to elements of the two arrays - Replicate the transition step from to - Applies the steps in sequence, beginning with the initial data zs0 zn+1 = c + z2 n mandel :: Exp (Float,Float,Float,Float) -‐> Acc (Array DIM2 (Complex, Int)) mandel view = Prelude.foldl1 (.) (Prelude.replicate 255 go) zs0 where cs = genPlane view zs0 = zip cs (fill (shape cs) 0) go zs = zipWith step cs zs zn zn+1

- So, just use regular Haskell to unroll the loop a ﬁxed number of times - zipWith applies its function step pairwise to elements of the two arrays - Replicate the transition step from to - Applies the steps in sequence, beginning with the initial data zs0 zn+1 = c + z2 n mandel :: Exp (Float,Float,Float,Float) -‐> Acc (Array DIM2 (Complex, Int)) mandel view = Prelude.foldl1 (.) (Prelude.replicate 255 go) zs0 where cs = genPlane view zs0 = zip cs (fill (shape cs) 0) go zs = zipWith step cs zs zn zn+1 f (g x) ≣ (f . g) x

In the workshop… • More example code, computational kernels -
Common pitfalls - Tips for good performance - Figuring out what went wrong (or knowing who to blame)

In the workshop… • More example code, computational kernels -
Common pitfalls - Tips for good performance - Figuring out what went wrong (or knowing who to blame) me

Questions? https://github.com/AccelerateHS/ http://xkcd.com/365/

Extra Slides…

Seriously?

Arrays • Create an array from a list: ! -
Generates a multidimensional array by consuming elements from the list and adding them to the array in row-major order • Example: data Array sh e fromList :: (Shape sh, Elt e) => sh -‐> [e] -‐> Array sh e ghci> fromList (Z:.10) [1..10]

Arrays • Create an array from a list: data Array
sh e > fromList (Z:.10) [1..10] :: Vector Float Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

Arrays • Create an array from a list: • Multidimensional
arrays are similar: - Elements are ﬁlled along the right-most dimension ﬁrst data Array sh e > fromList (Z:.10) [1..10] :: Vector Float Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0] > fromList (Z:.3:.5) [1..] :: Array DIM2 Int Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Arrays • Array indices start counting from zero data Array
sh e > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > indexArray mat (Z:.2:.1) 12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Arrays • Similarly, an array of (nested) tuples: - This
is just a trick: internally converted into a tuple of arrays data Array sh e > fromList (Z:.2:.3) $ P.zip [1..] ['a'..] :: Array DIM2 (Int,Char) Array (Z :. 2 :. 3) [(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e'),(6,'f')] 1 2 3 4 5 6 a b c d e f ( ) ,

Data.Array.Accelerate • Need to import both the base library as
well as a backend - There is also an interpreter available for testing - Runs without using the GPU (much more slowly of course) import Prelude as P import Data.Array.Accelerate as A import Data.Array.Accelerate.CUDA as CUDA

Data.Array.Accelerate

Data.Array.Accelerate • To get arrays into the Acc world: !
- This may involve copying data to the GPU use :: Arrays a => a -‐> Acc a

Data.Array.Accelerate • To get arrays into the Acc world: !
- This may involve copying data to the GPU • use injects arrays into our DSL • run executes the computation to get arrays out • Using Accelerate focuses on everything in between use :: Arrays a => a -‐> Acc a

Collective Operations • Example: add one to each element of
an array > let arr = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > run $ A.map (+1) (use arr) Array (Z :. 3 :. 5) [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]

an array • What is the type of map? > let arr = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > run $ A.map (+1) (use arr) Array (Z :. 3 :. 5) [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] map :: (Shape sh, Elt a, Elt b) => (Exp a -‐> Exp b) -‐> Acc (Array sh a) -‐> Acc (Array sh b)

an array • What is the type of map? > let arr = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > run $ A.map (+1) (use arr) Array (Z :. 3 :. 5) [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] map :: (Shape sh, Elt a, Elt b) => (Exp a -‐> Exp b) -‐> Acc (Array sh a) -‐> Acc (Array sh b) Supported shape & element types

an array • What is the type of map? > let arr = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > run $ A.map (+1) (use arr) Array (Z :. 3 :. 5) [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] map :: (Shape sh, Elt a, Elt b) => (Exp a -‐> Exp b) -‐> Acc (Array sh a) -‐> Acc (Array sh b) DSL array Supported shape & element types

an array • What is the type of map? > let arr = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > run $ A.map (+1) (use arr) Array (Z :. 3 :. 5) [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] map :: (Shape sh, Elt a, Elt b) => (Exp a -‐> Exp b) -‐> Acc (Array sh a) -‐> Acc (Array sh b) DSL array Function to apply at every element. But what is Exp? Supported shape & element types

• The type class overloading trick is used for standard
Haskell classes ! • Standard boolean operations are available with slightly diﬀerent names - The standard names can not be overloaded ! • Conditionals - Use sparingly: leads to SIMD divergence (+1) :: (Elt a, IsNum a) => Exp a -‐> Exp a Scalar Expressions (==*) :: (Elt t, IsScalar t) => Exp t -‐> Exp t -‐> Exp Bool (/=*), (<*), (>*), min, max, (||*), (&&*) -‐-‐ and so on... (?) :: Elt t => Exp Bool -‐> (Exp t, Exp t) -‐> Exp t

Scalar Expressions • Bring a Haskell value into Exp land
! • Lift an expression into a singleton array ! • Extract the element from a singleton array constant :: Elt e => e -‐> Exp e unit :: Exp e -‐> Acc (Scalar e) the :: Acc (Scalar e) -‐> Exp e

Reductions • Folding (+) over a vector produces a sum
- The result is a one-element array (scalar). Why? ! > let xs = fromList (Z:.10) [1..] :: Vector Int > run $ A.fold (+) 0 (use xs) Array (Z) [55]

- The result is a one-element array (scalar). Why? ! • Fold has an interesting type: > let xs = fromList (Z:.10) [1..] :: Vector Int > run $ A.fold (+) 0 (use xs) Array (Z) [55] fold :: (Shape sh, Elt a) => (Exp a -‐> Exp a -‐> Exp a) -‐> Exp a -‐> Acc (Array (sh:.Int) a) -‐> Acc (Array sh a)

- The result is a one-element array (scalar). Why? ! • Fold has an interesting type: > let xs = fromList (Z:.10) [1..] :: Vector Int > run $ A.fold (+) 0 (use xs) Array (Z) [55] fold :: (Shape sh, Elt a) => (Exp a -‐> Exp a -‐> Exp a) -‐> Exp a -‐> Acc (Array (sh:.Int) a) -‐> Acc (Array sh a) input array

- The result is a one-element array (scalar). Why? ! • Fold has an interesting type: > let xs = fromList (Z:.10) [1..] :: Vector Int > run $ A.fold (+) 0 (use xs) Array (Z) [55] fold :: (Shape sh, Elt a) => (Exp a -‐> Exp a -‐> Exp a) -‐> Exp a -‐> Acc (Array (sh:.Int) a) -‐> Acc (Array sh a) outer dimension removed input array

Reductions • Fold occurs over the outer dimension of the
array > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > run $ A.fold (+) 0 (use mat) Array (Z :. 3) [15,40,65] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 15 40 65

Reductions • Is this a left-fold or a right-fold? -
Neither! The fold happens in parallel, tree-like - Therefore the function must be associative: (Exp a -‐> Exp a -‐> Exp a) - (We pretend that ﬂoating point operations are associative, though strictly speaking they are not)

Stencils • A stencil is a map with access to
the neighbourhood around each element - Useful in many scientiﬁc & image processing algorithms laplace :: Stencil3x3 Int -‐> Exp Int laplace ((_,t,_) ,(l,c,r) ,(_,b,_)) = t + b + l + r -‐ 4*c t l c r b

the neighbourhood around each element - Useful in many scientiﬁc & image processing algorithms - Boundary conditions specify how to handle out-of-bounds neighbours laplace :: Stencil3x3 Int -‐> Exp Int laplace ((_,t,_) ,(l,c,r) ,(_,b,_)) = t + b + l + r -‐ 4*c > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > run $ stencil laplace (Constant 0) (use mat) Array (Z :. 3 :. 5) [4,3,2,1,-‐6,-‐5,0,0,0,-‐11,-‐26,-‐17,-‐18,-‐19,-‐36] t l c r b

the neighbourhood around each element - Useful in many scientiﬁc & image processing algorithms

Index Transforms • Index transforms change the order of elements,
not their values - We usually want to push such operations into the consumer • backpermute speciﬁes which element to read from a source array backpermute :: (Shape ix, Shape ix', Elt a) => Exp ix' -‐> (Exp ix' -‐> Exp ix) -‐> Acc (Array ix a) -‐> Acc (Array ix' a)

not their values - We usually want to push such operations into the consumer • backpermute speciﬁes which element to read from a source array backpermute :: (Shape ix, Shape ix', Elt a) => Exp ix' -‐> (Exp ix' -‐> Exp ix) -‐> Acc (Array ix a) -‐> Acc (Array ix' a) shape of result

not their values - We usually want to push such operations into the consumer • backpermute speciﬁes which element to read from a source array backpermute :: (Shape ix, Shape ix', Elt a) => Exp ix' -‐> (Exp ix' -‐> Exp ix) -‐> Acc (Array ix a) -‐> Acc (Array ix' a) shape of result source data

not their values - We usually want to push such operations into the consumer • backpermute speciﬁes which element to read from a source array backpermute :: (Shape ix, Shape ix', Elt a) => Exp ix' -‐> (Exp ix' -‐> Exp ix) -‐> Acc (Array ix a) -‐> Acc (Array ix' a) shape of result source data index mapping from destination array to source

not their values transpose mat = let swap = lift1 $ \(Z:.j:.i) -‐> Z:.i:.j :: Z :. Exp Int :. Exp Int in backpermute (swap $ shape mat) swap mat 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 6 11 2 7 12 3 8 13 4 9 14 5 10 15

GPGPU Programming in Haskell with Accelerate

GPGPU Programming in Haskell with Accelerate

More Decks by Trevor L. McDonell

Other Decks in Research

Featured

Transcript