Slide 1

GPGPU Programming in Haskell with Accelerate
Trevor L. McDonell, University of New South Wales
@tlmcdonell · [email protected]
https://github.com/AccelerateHS

Slide 2

Preliminaries
• Get it from Hackage:
    cabal install accelerate -fdebug
    cabal install accelerate-cuda -fdebug
  - Debugging support enables some extra options to see what is going on
• Need to import both the base library as well as a specific backend
  - Import qualified to avoid name clashes with the Prelude
    import Prelude                      as P
    import Data.Array.Accelerate        as A
    import Data.Array.Accelerate.CUDA
    -- or
    import Data.Array.Accelerate.Interpreter
http://hackage.haskell.org/package/accelerate
https://github.com/AccelerateHS/

Slide 3

Accelerate
• Accelerate is a Domain-Specific Language for GPU programming:
  - Transform the Accelerate program into a CUDA program
  - Compile with NVIDIA’s compiler & load onto the GPU
  - Copy the result back to Haskell
  (pipeline: Haskell/Accelerate program → CUDA code → GPU → result)

Slide 4

Accelerate
• Accelerate computations take place on arrays: arrays in, arrays out
  - Parallelism is introduced in the form of collective operations over arrays
• Arrays have two type parameters
  - The shape of the array, or dimensionality
  - The element type of the array: Int, Float, etc.
    data Array sh e
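The shape parameter is a type-level snoc list. As a plain-Haskell sketch (the `Z`, `:.`, `size1`, and `size2` here are illustrative stand-ins for the real definitions exported by Data.Array.Accelerate):

```haskell
{-# LANGUAGE TypeOperators #-}

-- Toy model of Accelerate's shape types: Z is the rank-0 shape, and
-- (:.) snocs one more dimension onto the right.
data Z = Z
data tail :. head = tail :. head
infixl 3 :.

type DIM1 = Z :. Int
type DIM2 = Z :. Int :. Int

-- Total number of elements described by a shape
size1 :: DIM1 -> Int
size1 (Z :. n) = n

size2 :: DIM2 -> Int
size2 (Z :. rows :. cols) = rows * cols
```

So `Z :. 3 :. 5` describes a 3×5 matrix holding 15 elements.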

Slide 5

Accelerate
• Accelerate is split into two worlds: Acc and Exp
  - Acc represents collective operations over instances of Arrays
  - Exp is a scalar computation on things of type Elt
• Collective operations in Acc comprise many scalar operations in Exp, executed in parallel over Arrays
  - Scalar operations cannot contain collective operations
• This stratification excludes nested data parallelism


Slide 8

Accelerate
• To execute an Accelerate computation (on the GPU):
    run :: Arrays a => Acc a -> a
  - run comes from whichever backend we have chosen (CUDA)
• To get arrays into Acc land:
    use :: Arrays a => a -> Acc a
  - This may involve copying data to the GPU
• Using Accelerate, focus on everything in between: using combinators of type Acc to build an AST that will be turned into CUDA code and executed by run

Slide 9

Arrays
• Create an array from a list:
    data Array sh e
    fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e
  - Generates a multidimensional array by consuming elements from the list and adding them to the array in row-major order
• Example:
    ghci> fromList (Z:.10) [1..10]


Slide 12

Arrays
• Create an array from a list:
    data Array sh e
    fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e
• Example:
    ghci> fromList (Z:.10) [1..10]

    :3:1:
        No instance for (Shape (Z :. head0))
          arising from a use of `fromList'
        The type variable `head0' is ambiguous
        Possible fix: add a type signature that fixes these type
        Note: there is a potential instance available:
          instance Shape sh => Shape (sh :. Int)
            -- Defined in `Data.Array.Accelerate.Array.Sugar'
        Possible fix: add an instance declaration for (Shape (Z :
        In the expression: fromList (Z :. 10) [1 .. 10]
        In an equation for `it': it = fromList (Z :. 10) [1 .. 10

    :3:14:
        No instance for (Num head0) arising from the literal `10'
        The type variable `head0' is ambiguous
        Possible fix: add a type signature that fixes these type
        Note: there are several potential instances:
          instance Num Double -- Defined in `GHC.Float'
          instance Num Float -- Defined in `GHC.Float'
          instance Integral a => Num (GHC.Real.Ratio a)
            -- Defined in `GHC.Real'
          ...plus 12 others
        In the second argument of `(:.)', namely `10'
  - Defaulting does not apply, because Shape is not a standard class
• Number 1 tip: Add type signatures

Slide 13

Arrays
• Create an array from a list:
    data Array sh e
    > fromList (Z:.10) [1..10] :: Vector Float
    Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

Slide 14

Arrays
• Create an array from a list:
    > fromList (Z:.10) [1..10] :: Vector Float
    Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]
• Multidimensional arrays are similar:
  - Elements are filled along the right-most dimension first
    > fromList (Z:.3:.5) [1..] :: Array DIM2 Int
    Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
  (as a 3×5 matrix:
     1  2  3  4  5
     6  7  8  9 10
    11 12 13 14 15)
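The row-major fill order can be checked with ordinary lists; this is a plain-Haskell sketch (`toRows` and `rowMajor` are illustrative helpers, not Accelerate functions):

```haskell
import Data.List (unfoldr)

-- Split a flat list into rows of the given width, mirroring how
-- fromList (Z:.3:.5) [1..] fills a DIM2 array in row-major order.
toRows :: Int -> [a] -> [[a]]
toRows cols = unfoldr $ \xs ->
  case splitAt cols xs of
    ([], _)    -> Nothing
    (row, xs') -> Just (row, xs')

-- Linear position of element (row, col) in a row-major matrix
rowMajor :: Int -> (Int, Int) -> Int
rowMajor cols (r, c) = r * cols + c
```

For the 3×5 example above, element (2,3) — value 14 — sits at linear position 2*5 + 3 = 13.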

Slide 15

Accelerate by example
Password “recovery”

Slide 16

MD5 Algorithm
• Aim:
  - Implement one round of MD5: unsalted, single 512-bit block
  - Apply it to an array of words
  - Compare the hashes to some unknown hash
  - i.e. a standard dictionary attack

Slide 17

MD5 Algorithm
• The algorithm operates on a 4-word state: A, B, C, and D
• There are 4 × 16 rounds: F, G, H, and I
  - Mi is a word from the input message
  - Ki is a constant
  - <<< is a left rotation

Slide 18

MD5 Algorithm in Accelerate
• Accelerate is a meta-programming language
  - Use regular Haskell to generate the expression for each step of the round
  - Produces an unrolled loop
    type ABCD = (Exp Word32, Exp Word32, ... )

    md5round :: Acc (Vector Word32) -> ABCD
    md5round msg
      = P.foldl round (a0,b0,c0,d0) [0..63]
      where
        round :: ABCD -> Int -> ABCD
        round (a,b,c,d) i = ...

Slide 19

MD5 Algorithm in Accelerate
• The constants ki and ri can be embedded directly
  - The simple list lookup would be death in standard Haskell
  - Generating the expression need not be performant, only executing it
    k :: Int -> Exp Word32
    k i = constant (ks P.!! i)
      where
        ks = [ 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee
             , ...


Slide 21

MD5 Algorithm in Accelerate
• The message M is stored as an array, so we need array indexing
    (!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e
  - Be wary: arbitrary array indexing can kill performance...
• Get the right word of the message for the given round
    m :: Int -> Exp Word32
    m i
      | i < 16 = msg ! index1 (constant i)
      | i < 32 = msg ! index1 (constant ((5*i + 1) `rem` 16))
      | ...

Slide 22

MD5 Algorithm in Accelerate
• Finally, the non-linear functions F, G, H, and I
    round :: ABCD -> Int -> ABCD
    round (a,b,c,d) i
      | i < 16 = shfl (f b c d)
      | ...
      where
        shfl x  = (d, b + ((a + x + k i + m i) `rotateL` r i), b, c)

        f x y z = (x .&. y) .|. ((complement x) .&. z)
        ...
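The same definitions work over plain Word32 with Data.Bits, which is a handy way to sanity-check each step on the CPU before lifting it to Exp Word32. A sketch following RFC 1321, with a step function mirroring the slide's shfl (the `(k_i, m_i, r_i)` triple is an assumed per-round input here):

```haskell
import Data.Bits (complement, rotateL, xor, (.&.), (.|.))
import Data.Word (Word32)

-- The four MD5 non-linear functions, over plain Word32
f, g, h, i :: Word32 -> Word32 -> Word32 -> Word32
f x y z = (x .&. y) .|. (complement x .&. z)
g x y z = (x .&. z) .|. (y .&. complement z)
h x y z = x `xor` y `xor` z
i x y z = y `xor` (x .|. complement z)

-- One F-round step: mix, rotate left by r_i, shuffle the 4-word state
step :: (Word32, Word32, Word32, Word32)
     -> (Word32, Word32, Int)     -- (k_i, m_i, r_i) for this round
     -> (Word32, Word32, Word32, Word32)
step (a, b, c, d) (ki, mi, ri) =
  (d, b + ((a + f b c d + ki + mi) `rotateL` ri), b, c)
```

Note that F acts as a multiplexer: each bit of x selects the corresponding bit from y (when set) or z (when clear).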

Slide 23

MD5 Algorithm in Accelerate
• MD5 applied to a single 16-word vector: no parallelism here
• Lift this operation to an array of n words
  - Process many words in parallel to compare against the unknown
  - Need to use generate, the most general form of array construction. Equivalently, the most easily misused
    generate :: (Shape sh, Elt e)
             => Exp sh
             -> (Exp sh -> Exp e)
             -> Acc (Array sh e)

Slide 24

Problem Solving Accelerate
Hints for when things don’t work as you expect

Slide 25

MD5 Algorithm in Accelerate
• As always, data layout is important
  - Accelerate arrays are stored in row-major order
  - For a CPU, this means the input word would be read on a cache line
  (row-major layout: c o r r e c t  h o r s e  b a t t e r y  s t a p l e …)


Slide 30

MD5 Algorithm in Accelerate
• However, in a parallel context many threads must work together
  - generate uses one thread per element
  - For best performance CUDA threads need to index adjacent memory
  - This only works if the dictionary is stored column-major instead
  (column-major layout, one word per column:
    c h b s
    o o a t
    r r t a
    r s t p
    e e e l
    c   r
    t   y
    …)
• Pay attention to data layout if you use indexing operators
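For equal-length (padded) words, the column-major shuffle is just a transpose. A plain-list sketch of the layout and its index arithmetic (`columnMajor` and `charAt` are illustrative names, assuming all words have been padded to the same length):

```haskell
import Data.List (transpose)

-- Lay out a padded dictionary column-major: character c of word t lands
-- at linear position c * numWords + t, so threads t, t+1, ... read
-- adjacent elements (coalesced access on CUDA hardware).
columnMajor :: [String] -> String
columnMajor = concat . transpose

-- Read character c of word t back out of the flat buffer
charAt :: Int -> String -> Int -> Int -> Char
charAt numWords buf c t = buf !! (c * numWords + t)
```

At each step c of the hash, all threads read the same character position of their own word, and those reads are contiguous in memory.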

Slide 31

Nested Data-Parallelism
• Accelerate is a language for flat data-parallel computations
  - The allowed array element types do not include Array: simple types only
  - Lots of types to statically exclude nested parallelism: Acc vs. Exp


Slide 34

Nested Data-Parallelism
• Accelerate is a language for flat data-parallel computations
  - The allowed array element types do not include Array: simple types only
  - Lots of types to statically exclude nested parallelism: Acc vs. Exp
  - But it doesn’t always succeed, and the error is uninformative…
    *** Exception:
    *** Internal error in package accelerate ***
    *** Please submit a bug report at https://github.com/Accelerat...
    ./Data/Array/Accelerate/Smart.hs:561 (convertSharingExp):
    inconsistent valuation @ shared 'Exp' tree with stable name 54;
    env = [56]

    *** Exception: Cyclic definition of a value of type 'Exp' (sa = 45)

Slide 35

Nested Data-Parallelism
• Matrix-vector multiplication
  - Define vector dot product
    dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
    dotp u v = fold (+) 0
             ( zipWith (*) u v )
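Accelerate's fold and zipWith behave like their list namesakes, so the definition can be read (and tested) as ordinary list code; a plain sketch (`dotpL` is an illustrative name):

```haskell
-- List version of the slide's dotp: elementwise product, then reduce.
-- In Accelerate the same shape of code runs as parallel GPU operations.
dotpL :: Num a => [a] -> [a] -> a
dotpL u v = foldl (+) 0 (zipWith (*) u v)
```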


Slide 38

Nested Data-Parallelism
• Matrix-vector multiplication
  - Want to apply dotp to every row of the matrix
  - Extract a row from the matrix
  - At each element in the output array, where in the input do I read from?
  - [un]index1 converts an Int into a (Z:.Int) 1D index (and vice versa)
    takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
    takeRow n mat =
      let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
      in backpermute (index1 cols)
                     (\ix -> index2 n (unindex1 ix))
                     mat


Slide 42

Nested Data-Parallelism
• Matrix-vector multiplication
  - Apply dot product to each row of the matrix
    mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
    mvm mat vec =
      let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
      in generate (index1 rows)
                  (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))
  - Problem: this is indexing an array that depends on the index given by generate
• Extract the element from a singleton array
    the :: Acc (Scalar e) -> Exp e

Slide 43

Nested Data-Parallelism
• The problem is attempting to execute many separate dot products in parallel
    dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
    dotp u v = fold (+) 0
             ( zipWith (*) u v )
• We need a way to execute this step as a single collective operation

Slide 44

Reductions
• Folding (+) over a vector produces a sum
  - The result is a one-element array (scalar). Why?
    > let xs = fromList (Z:.10) [1..] :: Vector Int
    > run $ fold (+) 0 (use xs)
    Array (Z) [55]


Slide 47

Reductions
• Folding (+) over a vector produces a sum
  - The result is a one-element array (scalar). Why?
    > let xs = fromList (Z:.10) [1..] :: Vector Int
    > run $ fold (+) 0 (use xs)
    Array (Z) [55]
• Fold has an interesting type:
    fold :: (Shape sh, Elt a)
         => (Exp a -> Exp a -> Exp a)
         -> Exp a
         -> Acc (Array (sh:.Int) a)   -- input array
         -> Acc (Array sh        a)   -- outer dimension removed

Slide 48

Reductions
• Fold occurs over the outer dimension of the array
    > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
    > run $ fold (+) 0 (use mat)
    Array (Z :. 3) [15,40,65]
  (each row of the 3×5 matrix reduces to one element: 1..5 → 15, 6..10 → 40, 11..15 → 65)
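The rank-lowering behaviour can be modelled on a matrix-as-list-of-rows (`foldRows` is an illustrative helper, not part of Accelerate):

```haskell
-- Each row is reduced independently, so one dimension disappears --
-- exactly as fold takes Array (sh:.Int) a to Array sh a.
foldRows :: (a -> a -> a) -> a -> [[a]] -> [a]
foldRows f z = map (foldl f z)
```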


Slide 50

Matrix-Vector Multiplication
• The trick is that we can use this to do all of our dot-products in parallel
  (figure: the 3×5 matrix [1..15] alongside the vector [1..5], which is replicated across every row of the matrix)



Slide 59

Replicate
• Multidimensional replicate introduces new magic
    replicate :: (Slice slix, Elt e)
              => Exp slix
              -> Acc (Array (SliceShape slix) e)
              -> Acc (Array (FullShape  slix) e)
• Type hackery
    > let vec = fromList (Z:.5) [1..] :: Vector Float
    > run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
    Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0, ...]
  - All indicates the entirety of the existing dimension
  - (3::Int) is the number of new rows

Slide 60

Matrix-Vector Multiplication
• Looks very much like vector dot product
    mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
    mvm mat vec =
      let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
          vec'           = A.replicate (lift (Z :. rows :. All)) vec
      in
      fold (+) 0 (A.zipWith (*) vec' mat)
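The replicate-then-fold structure collapses to a one-liner over lists, which makes it easy to check the result against the earlier generate-based version (a sketch; `mvmL` is an illustrative name):

```haskell
-- Pair the vector against every row, multiply pointwise, reduce each
-- row: the list shadow of replicate + zipWith (*) + fold (+) 0.
mvmL :: Num a => [[a]] -> [a] -> [a]
mvmL mat vec = map (sum . zipWith (*) vec) mat
```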


Slide 62

N-Body Simulation
• Calculate all interactions between a vector of particles
    calcAccels :: Exp R -> Acc (Vector Body) -> Acc (Vector Accel)
    calcAccels epsilon bodies
      = let n       = A.size bodies

            rows    = A.replicate (lift $ Z :. n :. All) bodies
            cols    = A.replicate (lift $ Z :. All :. n) bodies

        in
        A.fold (.+.) (vec 0) ( A.zipWith (accel epsilon) rows cols )
• Turn nested computations into segmented or higher-dimensional operations


Slide 67

Accelerate
• Recall that Accelerate computations take place on arrays
• Accelerate evaluates the expression passed to run to generate a series of CUDA kernels
  - Each piece of CUDA code must be compiled and loaded
  - The goal is to make kernels that can be reused
  - If we don’t, the overhead of compilation can ruin performance


Slide 71

Embedded Scalars
• Consider drop, which yields all but the first n elements of a vector
    drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
    drop n arr =
      let n' = the (unit n)
      in  backpermute (ilift1 (subtract n') (shape arr))
                      (ilift1 (+ n')) arr
• Lift an expression into a singleton array
    unit :: Exp e -> Acc (Scalar e)
• Extract the element of a singleton array
    the :: Acc (Scalar e) -> Exp e
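backpermute is easiest to read as "for each output index, say where in the input to read". A list sketch of drop in that style (`backpermuteL` and `dropL` are illustrative names):

```haskell
-- backpermute over lists: the output has outLen elements, and output
-- index ix is read from input index (perm ix).
backpermuteL :: Int -> (Int -> Int) -> [a] -> [a]
backpermuteL outLen perm xs = [ xs !! perm ix | ix <- [0 .. outLen - 1] ]

-- drop n: the output is n shorter, and output index ix reads input ix + n
dropL :: Int -> [a] -> [a]
dropL n xs = backpermuteL (length xs - n) (+ n) xs
```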


Slide 75

Embedded Scalars
• Check the expression Accelerate sees when it evaluates run
    > let vec = fromList (Z:.10) [1..] :: Vector Int
    > drop 4 (use vec)
    let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
    let a1 = unit 4
    in backpermute
         (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
         (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
         a0
  - a1 corresponds to the array we created for n
  - Critically, this is outside the call to backpermute


Slide 81

Embedded Scalars
• Check the expression Accelerate sees when it evaluates run
    drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
    drop' n arr =
      backpermute (ilift1 (subtract n) (shape arr))
                  (ilift1 (+ n)) arr

    > let vec = fromList (Z:.10) [1..] :: Vector Int
    > drop' 4 (use vec)
    let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
    in backpermute (Z :. -4 + (indexHead (shape a0)))
                   (\x0 -> Z :. 4 + (indexHead x0)) a0
  - This will defeat Accelerate’s caching of compiled kernels
• Make sure any arguments that change are passed as Arrays

Slide 82

Embedded Scalars
• Inspect the expression directly, as done here
• Alternatively, Accelerate has some debugging options
  - Run the program with the -ddump-cc command line switch
  - Make sure you have installed Accelerate with -fdebug

Slide 83

Slide 83 text

Executing Computations

• Recall: to actually execute a computation we use run

run :: Arrays a => Acc a -> a

• This entails quite a bit of work setting up the computation to run on the GPU
  - And sometimes the computation never changes…

Slide 84

Slide 84 text

Executing Computations

• Alternative: execute an array program of one argument

run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b

• What is the difference?
  - This version can be partially applied with the (Acc a -> Acc b) argument
  - Returns a new function (a -> b)

Slide 85

Slide 85 text

Executing Computations

• Alternative: execute an array program of one argument

run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b

• What is the difference?
  - This version can be partially applied with the (Acc a -> Acc b) argument
  - Returns a new function (a -> b)
  - Key new thing: behind the scenes everything other than final execution is already done. AST is annotated with object code required for execution.

Slide 86

Slide 86 text

Executing Computations

canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
canny = ...

Slide 87

Slide 87 text

Executing Computations

• No magic:

canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
canny = ...

edges :: Array DIM2 RGBA -> Array DIM2 Float
edges img = run ( canny (use img) )

Slide 88

Slide 88 text

Executing Computations

• No magic:

canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
canny = ...

edges :: Array DIM2 RGBA -> Array DIM2 Float
edges img = run ( canny (use img) )

• Magic:

edges :: Array DIM2 RGBA -> Array DIM2 Float
edges img = run1 canny img

Slide 89

Slide 89 text

Executing Computations

• Criterion benchmarks

canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
canny = ...

benchmarking canny/run
collecting 100 samples, 1 iterations each, in estimated 6.711102 s
mean: 37.55531 ms, lb 36.76889 ms, ub 38.30969 ms, ci 0.950
std dev: 3.953049 ms, lb 3.655842 ms, ub 4.300031 ms, ci 0.950
variance introduced by outliers: 81.052%
variance is severely inflated by outliers

benchmarking canny/run1
mean: 5.570774 ms, lb 5.324862 ms, ub 5.854066 ms, ci 0.950
std dev: 1.352904 ms, lb 1.210735 ms, ub 1.654683 ms, ci 0.950
variance introduced by outliers: 95.756%
variance is severely inflated by outliers

Slide 90

Slide 90 text

Executing Computations

• Criterion benchmarks

canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
canny = ...

benchmarking canny/run
collecting 100 samples, 1 iterations each, in estimated 6.711102 s
mean: 37.55531 ms, lb 36.76889 ms, ub 38.30969 ms, ci 0.950
std dev: 3.953049 ms, lb 3.655842 ms, ub 4.300031 ms, ci 0.950
variance introduced by outliers: 81.052%
variance is severely inflated by outliers

benchmarking canny/run1
mean: 5.570774 ms, lb 5.324862 ms, ub 5.854066 ms, ci 0.950
std dev: 1.352904 ms, lb 1.210735 ms, ub 1.654683 ms, ci 0.950
variance introduced by outliers: 95.756%
variance is severely inflated by outliers

Slide 91

Slide 91 text

Floyd-Warshall Algorithm

• Find the shortest paths in a weighted graph

shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k =
  min (shortestPath g i j (k-1))
      (shortestPath g i k (k-1) + shortestPath g k j (k-1))

Slide 92

Slide 92 text

Floyd-Warshall Algorithm

• Find the shortest paths in a weighted graph
  - k=0: path between vertices i and j is the direct case only

shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k =
  min (shortestPath g i j (k-1))
      (shortestPath g i k (k-1) + shortestPath g k j (k-1))

Slide 93

Slide 93 text

Floyd-Warshall Algorithm

• Find the shortest paths in a weighted graph
  - k=0: path between vertices i and j is the direct case only
  - Otherwise the shortest path from i to j passes through k, or it does not

shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k =
  min (shortestPath g i j (k-1))
      (shortestPath g i k (k-1) + shortestPath g k j (k-1))

Slide 94

Slide 94 text

Floyd-Warshall Algorithm

• Find the shortest paths in a weighted graph
  - k=0: path between vertices i and j is the direct case only
  - Otherwise the shortest path from i to j passes through k, or it does not
  - Exponential in the number of calls: instead traverse bottom-up to make the results at (k-1) available

shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k =
  min (shortestPath g i j (k-1))
      (shortestPath g i k (k-1) + shortestPath g k j (k-1))
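The bottom-up traversal can be sketched in plain Haskell before moving to Accelerate — a list-based reference implementation (the names relax and shortestPathsList are ours, and a large constant stands in for "no edge"):

```haskell
import Data.List (foldl')

type Weight = Int
type Graph  = [[Weight]]          -- dense adjacency matrix

-- One bottom-up step: relax every path through intermediate vertex k,
-- reusing the results already computed for smaller k.
relax :: Graph -> Int -> Graph
relax g k =
  [ [ min (g !! i !! j) (g !! i !! k + g !! k !! j)
    | j <- [0 .. n-1] ]
  | i <- [0 .. n-1] ]
  where n = length g

-- Apply k = 0 .. n-1 in sequence: O(n^3) instead of exponential.
shortestPathsList :: Graph -> Graph
shortestPathsList g0 = foldl' relax g0 [0 .. length g0 - 1]
```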

Slide 95

Slide 95 text

Floyd-Warshall Algorithm

• Instead of a sparse graph, we’ll use a dense adjacency matrix
  - 2D array where the element is the distance between the two vertices

type Weight = Int32
type Graph  = Array DIM2 Weight

step :: Acc (Scalar Int) -> Acc Graph -> Acc Graph
step k g = generate (shape g) sp
 where
   k' = the k

   sp :: Exp DIM2 -> Exp Weight
   sp ix = let (Z :. i :. j) = unlift ix
           in  min (g ! (index2 i j))
                   (g ! (index2 i k') + g ! (index2 k' j))

  - Must pass the iteration number k as a scalar array
  - sp is the core of the algorithm

Slide 96

Slide 96 text

Floyd-Warshall Algorithm

• Instead of a sparse graph, we’ll use a dense adjacency matrix
  - 2D array where the element is the distance between the two vertices

type Weight = Int32
type Graph  = Array DIM2 Weight

step :: Acc (Scalar Int) -> Acc Graph -> Acc Graph
step k g = generate (shape g) sp
 where
   k' = the k

   sp :: Exp DIM2 -> Exp Weight
   sp ix = let (Z :. i :. j) = unlift ix
           in  min (g ! (index2 i j))
                   (g ! (index2 i k') + g ! (index2 k' j))

  - Must pass the iteration number k as a scalar array
  - sp is the core of the algorithm

Slide 97

Slide 97 text

Floyd-Warshall Algorithm

shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (.) steps g0
 where
   steps :: [ Acc Graph -> Acc Graph ]
   steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

Slide 98

Slide 98 text

Floyd-Warshall Algorithm

- Construct a list of steps, applying k in sequence 0 .. n-1

shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (.) steps g0
 where
   steps :: [ Acc Graph -> Acc Graph ]
   steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

Slide 99

Slide 99 text

Floyd-Warshall Algorithm

- Construct a list of steps, applying k in sequence 0 .. n-1
- Compose the sequence together

shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (.) steps g0
 where
   steps :: [ Acc Graph -> Acc Graph ]
   steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

Slide 100

Slide 100 text

Floyd-Warshall Algorithm

- Construct a list of steps, applying k in sequence 0 .. n-1
- Compose the sequence together

shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (.) steps g0
 where
   steps :: [ Acc Graph -> Acc Graph ]
   steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

• Let’s try a 1000 x 1000 matrix

Slide 101

Slide 101 text

Floyd-Warshall Algorithm

- Construct a list of steps, applying k in sequence 0 .. n-1
- Compose the sequence together

shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (.) steps g0
 where
   steps :: [ Acc Graph -> Acc Graph ]
   steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

• Let’s try a 1000 x 1000 matrix

$ ./fwaccel 1000 +RTS -s
fwaccel:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: out of memory

   6,627,846,360 bytes allocated in the heap
   2,830,827,472 bytes copied during GC
     733,091,960 bytes maximum residency (26 sample(s))
      31,575,272 bytes maximum slop
             812 MB total memory in use (54 MB lost due to fragmentation)
...
  Total   time    7.16s  ( 11.42s elapsed)

Slide 102

Slide 102 text

Floyd-Warshall Algorithm

- Construct a list of steps, applying k in sequence 0 .. n-1
- Compose the sequence together

shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (.) steps g0
 where
   steps :: [ Acc Graph -> Acc Graph ]
   steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

• Let’s try a 1000 x 1000 matrix

$ ./fwaccel 1000 +RTS -s
fwaccel:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: out of memory

   6,627,846,360 bytes allocated in the heap
   2,830,827,472 bytes copied during GC
     733,091,960 bytes maximum residency (26 sample(s))
      31,575,272 bytes maximum slop
             812 MB total memory in use (54 MB lost due to fragmentation)
...
  Total   time    7.16s  ( 11.42s elapsed)

Slide 103

Slide 103 text

Pipelining

• To sequence computations we have a special operator

(>->) :: (Arrays a, Arrays b, Arrays c)
      => (Acc a -> Acc b)
      -> (Acc b -> Acc c)
      -> Acc a -> Acc c

  - Operationally, the two computations will not share any subcomputations with each other or the environment
  - Intermediate arrays from the first can be GC’d when the second begins

Slide 104

Slide 104 text

Pipelining

• To sequence computations we have a special operator

(>->) :: (Arrays a, Arrays b, Arrays c)
      => (Acc a -> Acc b)
      -> (Acc b -> Acc c)
      -> Acc a -> Acc c

  - Operationally, the two computations will not share any subcomputations with each other or the environment
  - Intermediate arrays from the first can be GC’d when the second begins

$ ./fwaccel 1000 +RTS -s
Array (Z) [1783293664]
   6,211,776,544 bytes allocated in the heap
     914,044,096 bytes copied during GC
      24,953,768 bytes maximum residency (211 sample(s))
       2,023,496 bytes maximum slop
              63 MB total memory in use (0 MB lost due to fragmentation)
...
  Total   time    3.86s  (  3.92s elapsed)

Slide 105

Slide 105 text

Pipelining

• To sequence computations we have a special operator

(>->) :: (Arrays a, Arrays b, Arrays c)
      => (Acc a -> Acc b)
      -> (Acc b -> Acc c)
      -> Acc a -> Acc c

  - Operationally, the two computations will not share any subcomputations with each other or the environment
  - Intermediate arrays from the first can be GC’d when the second begins

$ ./fwaccel 1000 +RTS -s
Array (Z) [1783293664]
   6,211,776,544 bytes allocated in the heap
     914,044,096 bytes copied during GC
      24,953,768 bytes maximum residency (211 sample(s))
       2,023,496 bytes maximum slop
              63 MB total memory in use (0 MB lost due to fragmentation)
...
  Total   time    3.86s  (  3.92s elapsed)

(>->) can help control memory use & startup time
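Applied to the Floyd-Warshall program from the earlier slides, the fix is a one-line change: compose the per-k steps with (>->) instead of (.). A sketch, assuming the step function and Graph type defined above:

```haskell
-- As before, but sequenced with (>->): each intermediate graph can be
-- reclaimed as soon as the next step begins, instead of the whole
-- composed program holding every intermediate alive at once.
shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (>->) steps g0
 where
   steps :: [ Acc Graph -> Acc Graph ]
   steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]
```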

Slide 106

Slide 106 text

Questions? https://github.com/AccelerateHS/

Slide 107

Slide 107 text

Extra slides…

Slide 108

Slide 108 text

Stencils

• A stencil is a map with access to the neighbourhood around each element
  - Useful in many scientific & image processing algorithms

laplace :: Stencil3x3 Int -> Exp Int
laplace ((_,t,_)
        ,(l,c,r)
        ,(_,b,_)) = t + b + l + r - 4*c

  t
l c r
  b

Slide 109

Slide 109 text

Stencils

• A stencil is a map with access to the neighbourhood around each element
  - Useful in many scientific & image processing algorithms
  - Boundary conditions specify how to handle out-of-bounds neighbours

laplace :: Stencil3x3 Int -> Exp Int
laplace ((_,t,_)
        ,(l,c,r)
        ,(_,b,_)) = t + b + l + r - 4*c

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ stencil laplace (Constant 0) (use mat)
Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]

  t
l c r
  b
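The boundary behaviour can be checked against a plain-Haskell model of the same stencil — a list-based sketch with Constant 0 handling of out-of-bounds neighbours (the helper names at and laplaceModel are ours):

```haskell
type Matrix = [[Int]]

-- Index (i,j), treating out-of-bounds neighbours as 0 (Constant 0).
at :: Matrix -> Int -> Int -> Int
at m i j
  | i < 0 || j < 0 || i >= length m || j >= length (head m) = 0
  | otherwise = m !! i !! j

-- Plain model of the laplace stencil: sum of the four neighbours
-- minus four times the centre element.
laplaceModel :: Matrix -> Matrix
laplaceModel m =
  [ [ at m (i-1) j + at m (i+1) j + at m i (j-1) + at m i (j+1)
      - 4 * at m i j
    | j <- [0 .. cols-1] ]
  | i <- [0 .. rows-1] ]
  where rows = length m
        cols = length (head m)
```

Running it over the same 3x5 matrix reproduces the element values shown in the slide's output.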

Slide 110

Slide 110 text

Stencils

• A stencil is a map with access to the neighbourhood around each element
  - Useful in many scientific & image processing algorithms

Slide 111

Slide 111 text

Debugging options

• -dverbose
• -ddump-sharing
• -ddump-simpl-stats, -ddump-simpl-iterations
• -ddump-cc, -ddebug-cc
• -ddump-exec
• -ddump-gc
• -fflush-cache

Slide 112

Slide 112 text

No content