GPGPU Programming in Haskell with Accelerate (Workshop)

Companion workshop to the main talk: https://speakerdeck.com/tmcdonell/gpgpu-programming-in-haskell-with-accelerate

Current graphics cards are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. Peak performance of these devices is far greater than that of traditional CPUs; however, it is difficult to realise, because good performance requires highly idiomatic programs whose development is work intensive and requires expert knowledge. To raise the level of abstraction we are developing a domain-specific high-level language in Haskell for programming these devices. Computations are expressed in the form of parameterised collective operations (such as maps, reductions, and permutations) over multi-dimensional arrays. These computations are compiled online and executed on the graphics processor.

In this talk, I give some more details about how to use Accelerate effectively.

Trevor L. McDonell

May 17, 2013
Transcript

  1. GPGPU Programming in Haskell
    with Accelerate
    Trevor L. McDonell

    University of New South Wales

    @tlmcdonell
    [email protected]
    https://github.com/AccelerateHS

  2. Preliminaries
    • Get it from Hackage:

      cabal install accelerate -fdebug
      cabal install accelerate-cuda -fdebug

    - Debugging support enables some extra options to see what is going on

    • Need to import both the base library as well as a specific backend

    - Import qualified to avoid name clashes with the Prelude

      import Prelude                    as P
      import Data.Array.Accelerate      as A
      import Data.Array.Accelerate.CUDA
      -- or
      import Data.Array.Accelerate.Interpreter

    http://hackage.haskell.org/package/accelerate
    https://github.com/AccelerateHS/

  3. Accelerate
    • Accelerate is a Domain-Specific Language for GPU programming

    [Diagram: transform the Haskell/Accelerate program into a CUDA program,
    compile with NVIDIA’s compiler & load onto the GPU, then copy the result
    back to Haskell]

  4. Accelerate
    • Accelerate computations take place on arrays

    - Parallelism is introduced in the form of collective operations over arrays

      data Array sh e

    • Arrays have two type parameters

    - The shape of the array, or dimensionality

    - The element type of the array: Int, Float, etc.

    [Diagram: Arrays in → Accelerate computation → Arrays out]
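
    As a reminder of the base library vocabulary used on later slides, these
    synonyms come with Accelerate itself (shown here, not new definitions):

      type DIM0     = Z
      type DIM1     = Z :. Int
      type DIM2     = Z :. Int :. Int

      type Scalar e = Array DIM0 e   -- zero-dimensional, one element
      type Vector e = Array DIM1 e   -- one-dimensional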

  5. Accelerate
    • Accelerate is split into two worlds: Acc and Exp

    - Acc represents collective operations over instances of Arrays

    - Exp is a scalar computation on things of type Elt

    • Collective operations in Acc comprise many scalar operations in Exp,
      executed in parallel over Arrays

    - Scalar operations cannot contain collective operations

    • This stratification excludes nested data parallelism
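
    A minimal sketch of the two worlds (saxpy is a name introduced here, not
    from the deck): the collective wrapper lives in Acc, while the scalar
    worker is an ordinary Exp function.

      saxpy :: Exp Float
            -> Acc (Vector Float) -> Acc (Vector Float) -> Acc (Vector Float)
      saxpy alpha xs ys = A.zipWith (\x y -> alpha * x + y) xs ys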

  6. Accelerate
    • To execute an Accelerate computation (on the GPU):

      run :: Arrays a => Acc a -> a

    - run comes from whichever backend we have chosen (CUDA)

  7. Accelerate
    • To execute an Accelerate computation (on the GPU):

      run :: Arrays a => Acc a -> a

    - run comes from whichever backend we have chosen (CUDA)

    • To get arrays into Acc land

      use :: Arrays a => a -> Acc a

    - This may involve copying data to the GPU

  8. Accelerate
    • To execute an Accelerate computation (on the GPU):

      run :: Arrays a => Acc a -> a

    - run comes from whichever backend we have chosen (CUDA)

    • To get arrays into Acc land

      use :: Arrays a => a -> Acc a

    - This may involve copying data to the GPU

    • Using Accelerate, focus on everything in between: using combinators of type
      Acc to build an AST that will be turned into CUDA code and executed by run
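
    Putting run and use together, a complete program might look like this
    minimal sketch (any backend’s run will do):

      import Data.Array.Accelerate      as A
      import Data.Array.Accelerate.CUDA (run)

      -- use embeds the input array, the combinators build the AST,
      -- and run compiles and executes it on the device
      main :: IO ()
      main = print . run . A.map (* 2) . use $ xs
        where
          xs = fromList (Z :. 10) [1 .. 10] :: Vector Int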

  9. Arrays
    • Create an array from a list:

      data Array sh e

      fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

    - Generates a multidimensional array by consuming elements from the list
      and adding them to the array in row-major order

    • Example:

      ghci> fromList (Z:.10) [1..10]

  10. Arrays
    • Create an array from a list:

      data Array sh e

      fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

    - Generates a multidimensional array by consuming elements from the list
      and adding them to the array in row-major order

    • Example:

      ghci> fromList (Z:.10) [1..10]

      :3:1:
          No instance for (Shape (Z :. head0))
            arising from a use of `fromList'
          The type variable `head0' is ambiguous
          Possible fix: add a type signature that fixes these type
          Note: there is a potential instance available:
            instance Shape sh => Shape (sh :. Int)
              -- Defined in `Data.Array.Accelerate.Array.Sugar'
          Possible fix: add an instance declaration for (Shape (Z :
          In the expression: fromList (Z :. 10) [1 .. 10]
          In an equation for `it': it = fromList (Z :. 10) [1 .. 10

      :3:14:
          No instance for (Num head0) arising from the literal `10'
          The type variable `head0' is ambiguous
          Possible fix: add a type signature that fixes these type
          Note: there are several potential instances:
            instance Num Double -- Defined in `GHC.Float'
            instance Num Float -- Defined in `GHC.Float'
            instance Integral a => Num (GHC.Real.Ratio a)
              -- Defined in `GHC.Real'
            ...plus 12 others
          In the second argument of `(:.)', namely `10'

  11. Arrays
    • Create an array from a list:

      data Array sh e

      fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

    - Generates a multidimensional array by consuming elements from the list
      and adding them to the array in row-major order

    • Example:

      ghci> fromList (Z:.10) [1..10]

    - Defaulting does not apply, because Shape is not a standard class

  12. Arrays
    • Create an array from a list:

      data Array sh e

      fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

    - Generates a multidimensional array by consuming elements from the list
      and adding them to the array in row-major order

    • Example:

      ghci> fromList (Z:.10) [1..10]

    - Defaulting does not apply, because Shape is not a standard class
    Number 1 tip: Add type signatures

  13. Arrays
    • Create an array from a list:

      data Array sh e

      > fromList (Z:.10) [1..10] :: Vector Float
      Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

  14. Arrays
    • Create an array from a list:

      > fromList (Z:.10) [1..10] :: Vector Float
      Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

    • Multidimensional arrays are similar:

    - Elements are filled along the right-most dimension first

      > fromList (Z:.3:.5) [1..] :: Array DIM2 Int
      Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

       1  2  3  4  5
       6  7  8  9 10
      11 12 13 14 15
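
    A single element can be read back on the Haskell side with indexArray
    from the base library; indices follow the same Z :. row :. column
    convention:

      > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
      > indexArray mat (Z :. 1 :. 2)
      8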

  15. Accelerate by example
    Password “recovery”


  16. MD5 Algorithm
    • Aim:

    - Implement one round of MD5: unsalted, single 512-bit block

    - Apply to an array of words

    - Compare hashes to some unknown hash

    - i.e. standard dictionary attack


  17. MD5 Algorithm
    • Algorithm operates on a 4-word state: A, B, C, and D

    • There are 4 x 16 rounds: F, G, H, and I

    - Mi is a word from the input message

    - Ki is a constant

    - <<< is left rotate, by some constant ri

    • Each round operates on the 512-bit message block, modifying the state

    http://en.wikipedia.org/wiki/Md5

  18. MD5 Algorithm in Accelerate
    • Accelerate is a meta programming language

    - Use regular Haskell to generate the expression for each step of the round

    - Produces an unrolled loop

      type ABCD = (Exp Word32, Exp Word32, ... )

      md5round :: Acc (Vector Word32) -> ABCD
      md5round msg
        = P.foldl round (a0,b0,c0,d0) [0..63]
        where
          round :: ABCD -> Int -> ABCD
          round (a,b,c,d) i = ...

  19. MD5 Algorithm in Accelerate
    • The constants ki and ri can be embedded directly

    - The simple list lookup would be death in standard Haskell

    - Generating the expression need not be performant, only executing it

      k :: Int -> Exp Word32
      k i = constant (ks P.!! i)
        where
          ks = [ 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee
               , ...

  20. MD5 Algorithm in Accelerate
    • The message M is stored as an array, so we need array indexing

    - Be wary: arbitrary array indexing can kill performance...

      (!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e

  21. MD5 Algorithm in Accelerate
    • The message M is stored as an array, so we need array indexing

    - Be wary: arbitrary array indexing can kill performance...

      (!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e

    • Get the right word of the message for the given round

      m :: Int -> Exp Word32
      m i
        | i < 16 = msg ! index1 (constant i)
        | i < 32 = msg ! index1 (constant ((5*i + 1) `rem` 16))
        | ...

  22. MD5 Algorithm in Accelerate
    • Finally, the non-linear functions F, G, H, and I

      round :: ABCD -> Int -> ABCD
      round (a,b,c,d) i
        | i < 16 = shfl (f b c d)
        | ...
        where
          shfl x  = (d, b + ((a + x + k i + m i) `rotateL` r i), b, c)

          f x y z = (x .&. y) .|. ((complement x) .&. z)
          ...

  23. MD5 Algorithm in Accelerate
    • MD5 applied to a single 16-word vector: no parallelism here

    • Lift this operation to an array of n words

    - Process many words in parallel to compare against the unknown

    - Need to use generate, the most general form of array construction.
      Equivalently, the most easily misused

      generate :: (Shape sh, Elt e)
               => Exp sh
               -> (Exp sh -> Exp e)
               -> Acc (Array sh e)
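
    To get a feel for generate in isolation, a small sketch (squares is a
    name introduced here, unrelated to the MD5 code):

      squares :: Exp Int -> Acc (Vector Int)
      squares n = generate (index1 n)
                           (\ix -> let i = unindex1 ix in i * i)

      > run $ squares 10
      Array (Z :. 10) [0,1,4,9,16,25,36,49,64,81]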

  24. Problem Solving Accelerate
    Hints for when things don’t work as you expect


  25. MD5 Algorithm in Accelerate
    • As always, data layout is important

    - Accelerate arrays are stored in row-major order

    - For a CPU, this means the input word would be read on a cache line
    c o r r e c t
    h o r s e
    b a t t e r y
    s t a p l e


  26. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element
    c o r r e c t
    h o r s e
    b a t t e r y
    s t a p l e


  27. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element

      c o r r e c t
      h o r s e
      b a t t e r y
      s t a p l e

  28. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element
    - For best performance CUDA threads need to index adjacent memory
    - This only works if the dictionary is stored column major instead

    c o r r e c t
    h o r s e
    b a t t e r y
    s t a p l e


  29. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element
    - For best performance CUDA threads need to index adjacent memory
    - This only works if the dictionary is stored column major instead
    c h b s
    o o a t
    r r t a
    r s t p
    e e e l
    c r e
    t y


  30. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element
    - For best performance CUDA threads need to index adjacent memory
    - This only works if the dictionary is stored column-major instead

      c h b s
      o o a t
      r r t a
      r s t p
      e e e l
      c r e
      t y

    Pay attention to data layout if you use indexing operators
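
    A column-major copy can also be made inside Accelerate itself, e.g. with
    backpermute (a sketch; transposeA is a name introduced here, not a
    library function):

      transposeA :: Elt e => Acc (Array DIM2 e) -> Acc (Array DIM2 e)
      transposeA mat =
        let Z :. rows :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in  backpermute (index2 cols rows)
                        (\ix -> let Z :. r :. c = unlift ix
                                                    :: Z :. Exp Int :. Exp Int
                                in  index2 c r)
                        mat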

  31. Nested Data-Parallelism
    • Accelerate is a language for flat data-parallel computations
    - The allowed array element types do not include Array: simple types only
    - Lots of types to statically exclude nested parallelism: Acc vs. Exp


  32. Nested Data-Parallelism
    • Accelerate is a language for flat data-parallel computations
    - The allowed array element types do not include Array: simple types only
    - Lots of types to statically exclude nested parallelism: Acc vs. Exp
    - But, it doesn’t always succeed, and the error is uninformative…

      *** Exception:
      *** Internal error in package accelerate ***
      *** Please submit a bug report at https://github.com/Accelerat...
      ./Data/Array/Accelerate/Smart.hs:561 (convertSharingExp):
      inconsistent valuation @ shared 'Exp' tree with stable name 54;
      env = [56]

  34. Nested Data-Parallelism
    • Accelerate is a language for flat data-parallel computations
    - The allowed array element types do not include Array: simple types only
    - Lots of types to statically exclude nested parallelism: Acc vs. Exp
    - But, it doesn’t always succeed, and the error is uninformative…

      *** Exception:
      *** Internal error in package accelerate ***
      *** Please submit a bug report at https://github.com/Accelerat...
      ./Data/Array/Accelerate/Smart.hs:561 (convertSharingExp):
      inconsistent valuation @ shared 'Exp' tree with stable name 54;
      env = [56]

      *** Exception: Cyclic definition of a value of type 'Exp' (sa = 45)

  35. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Define vector dot product

      dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
      dotp u v = fold (+) 0
               ( zipWith (*) u v )

  36. Nested Data-Parallelism
    • Matrix-vector multiplication
    - Want to apply dotp to every row of the matrix
    - Extract a row from the matrix

      takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
      takeRow n mat =
        let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in backpermute (index1 cols)
                       (\ix -> index2 n (unindex1 ix))
                       mat

  37. Nested Data-Parallelism
    • Matrix-vector multiplication
    - Want to apply dotp to every row of the matrix
    - Extract a row from the matrix
    - At each element in the output array, where in the input do I read from?

      takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
      takeRow n mat =
        let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in backpermute (index1 cols)
                       (\ix -> index2 n (unindex1 ix))
                       mat

  38. Nested Data-Parallelism
    • Matrix-vector multiplication
    - Want to apply dotp to every row of the matrix
    - Extract a row from the matrix
    - At each element in the output array, where in the input do I read from?
    - [un]index1 converts an Int into a (Z:.Int) 1D index (and vice versa)

      takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
      takeRow n mat =
        let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in backpermute (index1 cols)
                       (\ix -> index2 n (unindex1 ix))
                       mat

  39. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Apply dot product to each row of the matrix

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in generate (index1 rows)
                    (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))

  40. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Apply dot product to each row of the matrix

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in generate (index1 rows)
                    (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))

    indexing an array that …

  41. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Apply dot product to each row of the matrix

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in generate (index1 rows)
                    (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))

    indexing an array that …

    • Extract the element from a singleton array

      the :: Acc (Scalar e) -> Exp e

  42. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Apply dot product to each row of the matrix

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in generate (index1 rows)
                    (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))

    indexing an array that … depends on the index given by generate

    • Extract the element from a singleton array

      the :: Acc (Scalar e) -> Exp e

  43. Nested Data-Parallelism
    • The problem is attempting to execute many separate dot products in parallel

      dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
      dotp u v = fold (+) 0
               ( zipWith (*) u v )

    • We need a way to execute this step as a single collective operation

  44. Reductions
    • Folding (+) over a vector produces a sum

    - The result is a one-element array (scalar). Why?

      > let xs = fromList (Z:.10) [1..] :: Vector Int
      > run $ fold (+) 0 (use xs)
      Array (Z) [55]

  45. Reductions
    • Folding (+) over a vector produces a sum

    - The result is a one-element array (scalar). Why?

      > let xs = fromList (Z:.10) [1..] :: Vector Int
      > run $ fold (+) 0 (use xs)
      Array (Z) [55]

    • Fold has an interesting type:

      fold :: (Shape sh, Elt a)
           => (Exp a -> Exp a -> Exp a)
           -> Exp a
           -> Acc (Array (sh:.Int) a)
           -> Acc (Array sh        a)

  46. Reductions
    • Folding (+) over a vector produces a sum

    - The result is a one-element array (scalar). Why?

      > let xs = fromList (Z:.10) [1..] :: Vector Int
      > run $ fold (+) 0 (use xs)
      Array (Z) [55]

    • Fold has an interesting type:

      fold :: (Shape sh, Elt a)
           => (Exp a -> Exp a -> Exp a)
           -> Exp a
           -> Acc (Array (sh:.Int) a)   -- input array
           -> Acc (Array sh        a)

  47. Reductions
    • Folding (+) over a vector produces a sum

    - The result is a one-element array (scalar). Why?

      > let xs = fromList (Z:.10) [1..] :: Vector Int
      > run $ fold (+) 0 (use xs)
      Array (Z) [55]

    • Fold has an interesting type:

      fold :: (Shape sh, Elt a)
           => (Exp a -> Exp a -> Exp a)
           -> Exp a
           -> Acc (Array (sh:.Int) a)   -- input array
           -> Acc (Array sh        a)   -- outer dimension removed

  48. Reductions
    • Fold occurs over the outer dimension of the array

      > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
      > run $ fold (+) 0 (use mat)
      Array (Z :. 3) [15,40,65]

       1  2  3  4  5  →  15
       6  7  8  9 10  →  40
      11 12 13 14 15  →  65

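
    Because fold is shape-polymorphic it can be applied repeatedly; summing
    the whole matrix is two folds (continuing the session above, result
    worked out by hand):

      > run $ fold (+) 0 (fold (+) 0 (use mat))
      Array (Z) [120]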

  50. Matrix-Vector Multiplication
    • The trick is that we can use this to do all of our dot-products in parallel

       1  2  3  4  5        1
       6  7  8  9 10        2
      11 12 13 14 15        3
                            4
                            5


  52. Matrix-Vector Multiplication
    • The trick is that we can use this to do all of our dot-products in parallel

       1  2  3  4  5        1  2  3  4  5
       6  7  8  9 10        1  2  3  4  5
      11 12 13 14 15        1  2  3  4  5


  54. Replicate
    • Multidimensional replicate introduces new magic

      replicate :: (Slice slix, Elt e)
                => Exp slix
                -> Acc (Array (SliceShape slix) e)
                -> Acc (Array (FullShape  slix) e)


  56. Replicate
    • Multidimensional replicate introduces new magic
    • Type hackery

      replicate :: (Slice slix, Elt e)
                => Exp slix
                -> Acc (Array (SliceShape slix) e)
                -> Acc (Array (FullShape  slix) e)

      > let vec = fromList (Z:.5) [1..] :: Vector Float
      > run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
      Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0, ...]


  58. Replicate
    • Multidimensional replicate introduces new magic
    • Type hackery
    - All indicates the entirety of the existing dimension

      replicate :: (Slice slix, Elt e)
                => Exp slix
                -> Acc (Array (SliceShape slix) e)
                -> Acc (Array (FullShape  slix) e)

      > let vec = fromList (Z:.5) [1..] :: Vector Float
      > run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
      Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0, ...]

  59. Replicate
    • Multidimensional replicate introduces new magic
    • Type hackery
    - All indicates the entirety of the existing dimension
    - (3::Int) is the number of new rows

      replicate :: (Slice slix, Elt e)
                => Exp slix
                -> Acc (Array (SliceShape slix) e)
                -> Acc (Array (FullShape  slix) e)

      > let vec = fromList (Z:.5) [1..] :: Vector Float
      > run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
      Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0, ...]
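
    Placing All in the other position instead replicates along columns, the
    layout the N-body example below relies on (continuing with vec from
    above; expected result sketched by hand):

      > run $ replicate (constant (Z :. All :. (3::Int))) (use vec)
      Array (Z :. 5 :. 3) [1.0,1.0,1.0,2.0,2.0,2.0,3.0,3.0,3.0, ...]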

  60. Matrix-Vector Multiplication
    • Looks very much like vector dot product

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
            vec'           = A.replicate (lift (Z :. rows :. All)) vec
        in
        fold (+) 0 (A.zipWith (*) vec' mat)
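
    Checking mvm against the running example (result worked out by hand):

      > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Float
      > let vec = fromList (Z:.5)    [1..] :: Vector Float
      > run $ mvm (use mat) (use vec)
      Array (Z :. 3) [55.0,130.0,205.0]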

  61. N-Body Simulation
    • Calculate all interactions between a vector of particles

      calcAccels :: Exp R -> Acc (Vector Body) -> Acc (Vector Accel)
      calcAccels epsilon bodies
        = let n       = A.size bodies

              rows    = A.replicate (lift $ Z :. n :. All) bodies
              cols    = A.replicate (lift $ Z :. All :. n) bodies

          in
          A.fold (.+.) (vec 0) ( A.zipWith (accel epsilon) rows cols )

  62. N-Body Simulation
    • Calculate all interactions between a vector of particles

      calcAccels :: Exp R -> Acc (Vector Body) -> Acc (Vector Accel)
      calcAccels epsilon bodies
        = let n       = A.size bodies

              rows    = A.replicate (lift $ Z :. n :. All) bodies
              cols    = A.replicate (lift $ Z :. All :. n) bodies

          in
          A.fold (.+.) (vec 0) ( A.zipWith (accel epsilon) rows cols )

    Turn nested computations into segmented or higher-dimensional operations

  63. Accelerate
    • Recall that Accelerate computations take place on arrays

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  64. Accelerate
    • Recall that Accelerate computations take place on arrays
    • Accelerate evaluates the expression passed to run to generate a series of
      CUDA kernels

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  65. Accelerate
    • Recall that Accelerate computations take place on arrays
    • Accelerate evaluates the expression passed to run to generate a series of
      CUDA kernels
    - Each piece of CUDA code must be compiled and loaded

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  66. Accelerate
    • Recall that Accelerate computations take place on arrays
    • Accelerate evaluates the expression passed to run to generate a series of
      CUDA kernels
    - Each piece of CUDA code must be compiled and loaded
    - The goal is to make kernels that can be reused

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  67. Accelerate
    • Recall that Accelerate computations take place on arrays
    • Accelerate evaluates the expression passed to run to generate a series of
      CUDA kernels
    - Each piece of CUDA code must be compiled and loaded
    - The goal is to make kernels that can be reused
    - If we don’t, the overhead of compilation can ruin performance

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  68. Embedded Scalars
    • Consider drop, which yields all but the first n elements of a vector

      drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop n arr =
        let n' = the (unit n)
        in  backpermute (ilift1 (subtract n') (shape arr))
                        (ilift1 (+ n')) arr


  70. Embedded Scalars
    • Consider drop, which yields all but the first n elements of a vector
    • Lift an expression into a singleton array

      unit :: Exp e -> Acc (Scalar e)

      drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop n arr =
        let n' = the (unit n)
        in  backpermute (ilift1 (subtract n') (shape arr))
                        (ilift1 (+ n')) arr

  71. Embedded Scalars
    • Consider drop, which yields all but the first n elements of a vector
    • Lift an expression into a singleton array

      unit :: Exp e -> Acc (Scalar e)

    • Extract the element of a singleton array

      the :: Acc (Scalar e) -> Exp e

      drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop n arr =
        let n' = the (unit n)
        in  backpermute (ilift1 (subtract n') (shape arr))
                        (ilift1 (+ n')) arr

  72. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
      let a1 = unit 4
      in backpermute
           (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
           (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
           a0


  74. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run
    - Corresponds to the array we created for n

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
      let a1 = unit 4
      in backpermute
           (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
           (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
           a0

  75. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run
    - Corresponds to the array we created for n
    - Critically, this is outside the call to backpermute

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
      let a1 = unit 4
      in backpermute
           (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
           (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
           a0

  76. Embedded Scalars

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

  77. Embedded Scalars

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop' 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
      in backpermute (Z :. -4 + (indexHead (shape a0)))
                     (\x0 -> Z :. 4 + (indexHead x0)) a0


  79. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop' 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
      in backpermute (Z :. -4 + (indexHead (shape a0)))
                     (\x0 -> Z :. 4 + (indexHead x0)) a0

  80. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run
    - This will defeat Accelerate’s caching of compiled kernels

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop' 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
      in backpermute (Z :. -4 + (indexHead (shape a0)))
                     (\x0 -> Z :. 4 + (indexHead x0)) a0

  81. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run
    - This will defeat Accelerate’s caching of compiled kernels

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop' 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
      in backpermute (Z :. -4 + (indexHead (shape a0)))
                     (\x0 -> Z :. 4 + (indexHead x0)) a0

    Make sure any arguments that change are passed as Arrays

  82. Embedded Scalars
    • Inspect the expression directly, as done here

    • Alternatively: Accelerate has some debugging options

    - Run the program with the -ddump-cc command line switch

    - Make sure you have installed Accelerate with -fdebug

  83. Executing Computations
    • Recall: to actually execute a computation we use run

      run :: Arrays a => Acc a -> a

    • This entails quite a bit of work setting up the computation to run on the GPU

    - And sometimes the computation never changes…

  84. Executing Computations
    • Alternative: execute an array program of one argument

      run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b

    • What is the difference?
    - This version can be partially applied with the (Acc a -> Acc b) argument
    - Returns a new function (a -> b)

  85. Executing Computations
    • Alternative: execute an array program of one argument

      run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b

    • What is the difference?
    - This version can be partially applied with the (Acc a -> Acc b) argument
    - Returns a new function (a -> b)
    - Key new thing: behind the scenes everything other than final execution is
      already done. The AST is annotated with the object code required for execution.

  86. Executing Computations

      canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
      canny = ...

  87. Executing Computations
    • No magic:

      canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
      canny = ...

      edges :: Array DIM2 RGBA -> Array DIM2 Float
      edges img = run ( canny (use img) )

  88. Executing Computations
    • No magic:

      canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
      canny = ...

      edges :: Array DIM2 RGBA -> Array DIM2 Float
      edges img = run ( canny (use img) )

    • Magic:

      edges :: Array DIM2 RGBA -> Array DIM2 Float
      edges img = run1 canny img
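
    The payoff comes from partial application: compile once, apply many
    times. A sketch reusing the canny and RGBA names from the slide
    (processAll is a name introduced here):

      processAll :: [Array DIM2 RGBA] -> [Array DIM2 Float]
      processAll = P.map (run1 canny)   -- canny is compiled only once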

  89. Executing Computations
    • Criterion benchmarks

      canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
      canny = ...

      benchmarking canny/run
      collecting 100 samples, 1 iterations each, in estimated 6.711102 s
      mean: 37.55531 ms, lb 36.76889 ms, ub 38.30969 ms, ci 0.950
      std dev: 3.953049 ms, lb 3.655842 ms, ub 4.300031 ms, ci 0.950
      variance introduced by outliers: 81.052%
      variance is severely inflated by outliers

      benchmarking canny/run1
      mean: 5.570774 ms, lb 5.324862 ms, ub 5.854066 ms, ci 0.950
      std dev: 1.352904 ms, lb 1.210735 ms, ub 1.654683 ms, ci 0.950
      variance introduced by outliers: 95.756%
      variance is severely inflated by outliers


  91. Floyd-Warshall Algorithm
    • Find the shortest paths in a weighted graph

      shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
      shortestPath g i j 0 = weight g i j
      shortestPath g i j k =
        min (shortestPath g i j (k-1))
            (shortestPath g i k (k-1) + shortestPath g k j (k-1))

  92. Floyd-Warshall Algorithm
    • Find the shortest paths in a weighted graph
    - k=0: path between vertices i and j is the direct case only

      shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
      shortestPath g i j 0 = weight g i j
      shortestPath g i j k =
        min (shortestPath g i j (k-1))
            (shortestPath g i k (k-1) + shortestPath g k j (k-1))

  93. Floyd-Warshall Algorithm
    • Find the shortest paths in a weighted graph
    - k=0: path between vertices i and j is the direct case only
    - Otherwise the shortest path from i to j passes through k, or it does not

      shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
      shortestPath g i j 0 = weight g i j
      shortestPath g i j k =
        min (shortestPath g i j (k-1))
            (shortestPath g i k (k-1) + shortestPath g k j (k-1))

  94. Floyd-Warshall Algorithm
    • Find the shortest paths in a weighted graph
    - k=0: path between vertices i and j is the direct case only
    - Otherwise the shortest path from i to j passes through k, or it does not
    - Exponential in the number of calls: instead traverse bottom-up to make the
      results at (k-1) available

      shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
      shortestPath g i j 0 = weight g i j
      shortestPath g i j k =
        min (shortestPath g i j (k-1))
            (shortestPath g i k (k-1) + shortestPath g k j (k-1))

  95. Floyd-Warshall Algorithm
    • Instead of a sparse graph, we’ll use a dense adjacency matrix

    - 2D array where the element is the distance between the two vertices

    - Must pass the iteration number k as a scalar array

    - sp is the core of the algorithm

      type Weight = Int32
      type Graph  = Array DIM2 Weight

      step :: Acc (Scalar Int) -> Acc Graph -> Acc Graph
      step k g = generate (shape g) sp
       where
         k' = the k

         sp :: Exp DIM2 -> Exp Weight
         sp ix = let (Z :. i :. j) = unlift ix
                 in  min (g ! (index2 i j))
                         (g ! (index2 i k') + g ! (index2 k' j))


  97. Floyd-Warshall Algorithm

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

  98. Floyd-Warshall Algorithm
    - Construct a list of steps, applying k in sequence 0 .. n-1

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

  99. Floyd-Warshall Algorithm
    - Construct a list of steps, applying k in sequence 0 .. n-1
    - Compose the sequence together

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

  100. Floyd-Warshall Algorithm
    - Construct a list of steps, applying k in sequence 0 .. n-1
    - Compose the sequence together
    • Let’s try a 1000 x 1000 matrix

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

  101. Floyd-Warshall Algorithm
    - Construct a list of steps, applying k in sequence 0 .. n-1
    - Compose the sequence together
    • Let’s try a 1000 x 1000 matrix

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

      $ ./fwaccel 1000 +RTS -s
      fwaccel:
      *** Internal error in package accelerate ***
      *** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
      ./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: out of memory

         6,627,846,360 bytes allocated in the heap
         2,830,827,472 bytes copied during GC
           733,091,960 bytes maximum residency (26 sample(s))
            31,575,272 bytes maximum slop
                   812 MB total memory in use (54 MB lost due to fragmentation)
      ...
        Total   time    7.16s  ( 11.42s elapsed)


  103. Pipelining
    • To sequence computations we have a special operator

      (>->) :: (Arrays a, Arrays b, Arrays c)
            => (Acc a -> Acc b)
            -> (Acc b -> Acc c)
            -> Acc a -> Acc c

    - Operationally, the two computations will not share any subcomputations
      with each other or the environment

    - Intermediate arrays from the first can be GC’d when the second begins

  104. Pipelining
    • To sequence computations we have a special operator

      (>->) :: (Arrays a, Arrays b, Arrays c)
            => (Acc a -> Acc b)
            -> (Acc b -> Acc c)
            -> Acc a -> Acc c

    - Operationally, the two computations will not share any subcomputations
      with each other or the environment

    - Intermediate arrays from the first can be GC’d when the second begins

      $ ./fwaccel 1000 +RTS -s
      Array (Z) [1783293664]
         6,211,776,544 bytes allocated in the heap
           914,044,096 bytes copied during GC
            24,953,768 bytes maximum residency (211 sample(s))
             2,023,496 bytes maximum slop
                    63 MB total memory in use (0 MB lost due to fragmentation)
      ...
        Total   time    3.86s  (  3.92s elapsed)

  105. Pipelining
    • To sequence computations we have a special operator

      (>->) :: (Arrays a, Arrays b, Arrays c)
            => (Acc a -> Acc b)
            -> (Acc b -> Acc c)
            -> Acc a -> Acc c

    - Operationally, the two computations will not share any subcomputations
      with each other or the environment

    - Intermediate arrays from the first can be GC’d when the second begins

      $ ./fwaccel 1000 +RTS -s
      Array (Z) [1783293664]
         6,211,776,544 bytes allocated in the heap
           914,044,096 bytes copied during GC
            24,953,768 bytes maximum residency (211 sample(s))
             2,023,496 bytes maximum slop
                    63 MB total memory in use (0 MB lost due to fragmentation)
      ...
        Total   time    3.86s  (  3.92s elapsed)

    (>->) can help control memory use & startup time
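
    Presumably the fix for fwaccel is to compose the steps with (>->) instead
    of (.); a sketch along those lines, not shown in the deck:

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (>->) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]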

  106. Questions?
    https://github.com/AccelerateHS/


  107. Extra slides…


  108. Stencils
    • A stencil is a map with access to the neighbourhood around each element
    - Useful in many scientific & image processing algorithms

      laplace :: Stencil3x3 Int -> Exp Int
      laplace ((_,t,_)
              ,(l,c,r)
              ,(_,b,_)) = t + b + l + r - 4*c

        t
      l c r
        b

  109. Stencils
    • A stencil is a map with access to the neighbourhood around each element
    - Useful in many scientific & image processing algorithms
    - Boundary conditions specify how to handle out-of-bounds neighbours

      laplace :: Stencil3x3 Int -> Exp Int
      laplace ((_,t,_)
              ,(l,c,r)
              ,(_,b,_)) = t + b + l + r - 4*c

      > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
      > run $ stencil laplace (Constant 0) (use mat)
      Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]

        t
      l c r
        b


  111. Debugging options
    • -dverbose
    • -ddump-sharing
    • -ddump-simpl-stats, -ddump-simpl-iterations
    • -ddump-cc, -ddebug-cc
    • -ddump-exec
    • -ddump-gc
    • -fflush-cache
