
GPGPU Programming in Haskell with Accelerate (Workshop)

Companion workshop to the main talk: https://speakerdeck.com/tmcdonell/gpgpu-programming-in-haskell-with-accelerate

Current graphics cards are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. The peak performance of these devices is far greater than that of traditional CPUs, but it is difficult to realise, because good performance requires highly idiomatic programs whose development is work intensive and requires expert knowledge. To raise the level of abstraction we are developing a domain-specific high-level language in Haskell for programming these devices. Computations are expressed in the form of parameterised collective operations (such as maps, reductions, and permutations) over multi-dimensional arrays. These computations are compiled online and executed on the graphics processor.

In this talk, I give some more details about how to use Accelerate effectively.

Trevor L. McDonell

May 17, 2013

Transcript

  1. GPGPU Programming in Haskell with Accelerate
     Trevor L. McDonell, University of New South Wales
     @tlmcdonell, [email protected]
     https://github.com/AccelerateHS
  2. Preliminaries
     • Get it from Hackage:

         cabal install accelerate -fdebug
         cabal install accelerate-cuda -fdebug

       - Debugging support enables some extra options to see what is going on
     • Need to import both the base library as well as a specific backend
       - Import qualified to avoid name clashes with the Prelude

         import Prelude                          as P
         import Data.Array.Accelerate            as A
         import Data.Array.Accelerate.CUDA
         -- or
         import Data.Array.Accelerate.Interpreter

     http://hackage.haskell.org/package/accelerate
     https://github.com/AccelerateHS/
  3. Accelerate
     • Accelerate is a Domain-Specific Language for GPU programming
       [diagram: a Haskell/Accelerate program is transformed into a CUDA
        program, compiled with NVIDIA's compiler and loaded onto the GPU,
        and the result is copied back to Haskell]
  4. Accelerate
     • Accelerate computations take place on arrays
       - Parallelism is introduced in the form of collective operations over arrays
       [diagram: Arrays in -> Accelerate computation -> Arrays out]
     • Arrays have two type parameters

         data Array sh e

       - The shape of the array, or dimensionality
       - The element type of the array: Int, Float, etc.
  5. Accelerate
     • Accelerate is split into two worlds: Acc and Exp
       - Acc represents collective operations over instances of Arrays
       - Exp is a scalar computation on things of type Elt
     • Collective operations in Acc comprise many scalar operations in Exp,
       executed in parallel over Arrays
       - Scalar operations cannot contain collective operations
     • This stratification excludes nested data parallelism
  6. Accelerate
     • To execute an Accelerate computation (on the GPU):

         run :: Arrays a => Acc a -> a

       - run comes from whichever backend we have chosen (CUDA)
  7. Accelerate
     • To execute an Accelerate computation (on the GPU):

         run :: Arrays a => Acc a -> a

       - run comes from whichever backend we have chosen (CUDA)
     • To get arrays into Acc land:

         use :: Arrays a => a -> Acc a

       - This may involve copying data to the GPU
  8. Accelerate
     • To execute an Accelerate computation (on the GPU):

         run :: Arrays a => Acc a -> a

     • To get arrays into Acc land:

         use :: Arrays a => a -> Acc a

     • Using Accelerate, focus on everything in between: using combinators
       of type Acc to build an AST that will be turned into CUDA code and
       executed by run
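     Putting run and use together, a minimal end-to-end sketch (not from the
     deck; it assumes the interpreter backend, but the CUDA backend's run has
     the same type):

         import Prelude                            as P
         import Data.Array.Accelerate              as A
         import Data.Array.Accelerate.Interpreter  (run)

         -- double every element of a ten-element vector
         main :: IO ()
         main = print . run . A.map (*2) . use $ vec
           where
             vec = fromList (Z :. 10) [1..10] :: Vector Int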
  9. Arrays

         data Array sh e

     • Create an array from a list:

         fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

       - Generates a multidimensional array by consuming elements from the
         list and adding them to the array in row-major order
     • Example:

         ghci> fromList (Z:.10) [1..10]
  10. Arrays
     • Example:

         ghci> fromList (Z:.10) [1..10]

         <interactive>:3:1:
             No instance for (Shape (Z :. head0))
               arising from a use of `fromList'
             The type variable `head0' is ambiguous
             Possible fix: add a type signature that fixes these type
             Note: there is a potential instance available:
               instance Shape sh => Shape (sh :. Int)
                 -- Defined in `Data.Array.Accelerate.Array.Sugar'
             Possible fix: add an instance declaration for (Shape (Z :
             In the expression: fromList (Z :. 10) [1 .. 10]
             In an equation for `it': it = fromList (Z :. 10) [1 .. 10

         <interactive>:3:14:
             No instance for (Num head0) arising from the literal `10'
             The type variable `head0' is ambiguous
             Possible fix: add a type signature that fixes these type
             Note: there are several potential instances:
               instance Num Double -- Defined in `GHC.Float'
               instance Num Float -- Defined in `GHC.Float'
               instance Integral a => Num (GHC.Real.Ratio a)
                 -- Defined in `GHC.Real'
               ...plus 12 others
             In the second argument of `(:.)', namely `10'
  11. Arrays
     • Example (the same errors as on the previous slide):
       - Defaulting does not apply, because Shape is not a standard class
  12. Arrays
     • Example (the same errors as above)
     Number 1 tip: Add type signatures
  13. Arrays

         data Array sh e

     • Create an array from a list:

         > fromList (Z:.10) [1..10] :: Vector Float
         Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

     • Multidimensional arrays are similar:
       - Elements are filled along the right-most dimension first

         > fromList (Z:.3:.5) [1..] :: Array DIM2 Int
         Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
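     As a quick check of the row-major layout, individual elements can be read
     back on the Haskell side with indexArray (a small sketch, not from the
     deck; indexArray is the plain-Haskell indexing function exported by
     Data.Array.Accelerate):

         > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
         > indexArray mat (Z:.2:.1)    -- row 2, column 1
         12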
  14. MD5 Algorithm
     • Aim:
       - Implement one round of MD5: unsalted, single 512-bit block
       - Apply to an array of words
       - Compare hashes to some unknown hash
       - i.e. a standard dictionary attack
  15. MD5 Algorithm
     • The algorithm operates on a 4-word state: A, B, C, and D
     • There are 4 x 16 rounds, built from four non-linear functions F, G, H, and I
       - Mi is a word from the input message
       - Ki is a constant
       - <<<s is left rotate, by some constant ri
     • Each round operates on the 512-bit message block, modifying the state
     http://en.wikipedia.org/wiki/Md5
  16. MD5 Algorithm in Accelerate
     • Accelerate is a meta programming language
       - Use regular Haskell to generate the expression for each step of the round
       - Produces an unrolled loop

         type ABCD = (Exp Word32, Exp Word32, ... )

         md5round :: Acc (Vector Word32) -> ABCD
         md5round msg
           = P.foldl round (a0,b0,c0,d0) [0..63]
           where
             round :: ABCD -> Int -> ABCD
             round (a,b,c,d) i = ...
  17. MD5 Algorithm in Accelerate
     • The constants ki and ri can be embedded directly
       - The simple list lookup would be death in standard Haskell
       - Generating the expression need not be performant, only executing it

         k :: Int -> Exp Word32
         k i = constant (ks P.!! i)
           where
             ks = [ 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee
                  , ...
  18. MD5 Algorithm in Accelerate
     • The message M is stored as an array, so we need array indexing

         (!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e

       - Be wary: arbitrary array indexing can kill performance...
  19. MD5 Algorithm in Accelerate
     • Get the right word of the message for the given round

         m :: Int -> Exp Word32
         m i
           | i < 16 = msg ! index1 (constant i)
           | i < 32 = msg ! index1 (constant ((5*i + 1) `rem` 16))
           | ...
  20. MD5 Algorithm in Accelerate
     • Finally, the non-linear functions F, G, H, and I (the remaining three
       are sketched below)

         round :: ABCD -> Int -> ABCD
         round (a,b,c,d) i
           | i < 16 = shfl (f b c d)
           | ...
           where
             shfl x  = (d, b + ((a + x + k i + m i) `rotateL` r i), b, c)

             f x y z = (x .&. y) .|. ((complement x) .&. z)
             ...
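     The deck elides g, h and i; for reference, the standard MD5 non-linear
     functions (from the MD5 specification, not from the deck) translate
     directly, since Accelerate overloads the Data.Bits operators for
     Exp Word32:

         -- the remaining MD5 round functions; x, y, z :: Exp Word32
         g x y z = (x .&. z) .|. (y .&. complement z)
         h x y z = x `xor` y `xor` z
         i x y z = y `xor` (x .|. complement z)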
  21. MD5 Algorithm in Accelerate
     • MD5 applied to a single 16-word vector: no parallelism here
     • Lift this operation to an array of n words
       - Process many words in parallel to compare against the unknown
       - Need to use generate, the most general form of array construction.
         Equivalently, the most easily misused

         generate :: (Shape sh, Elt e)
                  => Exp sh
                  -> (Exp sh -> Exp e)
                  -> Acc (Array sh e)
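     To make the shape of generate concrete, a small sketch (not from the
     deck) that builds a 3x3 identity matrix; (==*) and (?) are the scalar
     equality and conditional operators of the Accelerate of this era:

         identity :: Acc (Array DIM2 Int)
         identity = generate (constant (Z :. 3 :. 3 :: DIM2)) $ \ix ->
           let Z :. r :. c = unlift ix :: Z :. Exp Int :. Exp Int
           in  r ==* c ? (1, 0)    -- one thread per output element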
  22. MD5 Algorithm in Accelerate
     • As always, data layout is important
       - Accelerate arrays are stored in row-major order
       - For a CPU, this means the input word would be read on a cache line
       [diagram: the dictionary "correct horse battery staple ..." stored
        contiguously, one word after another]
  23. MD5 Algorithm in Accelerate
     • However, in a parallel context many threads must work together
       - generate uses one thread per element
       [diagram: the row-major dictionary again]
  24. MD5 Algorithm in Accelerate
       [diagram: adjacent threads each start at a different word of the
        row-major dictionary]
  25. MD5 Algorithm in Accelerate
     • However, in a parallel context many threads must work together
       - generate uses one thread per element
       - For best performance, CUDA threads need to index adjacent memory
       - This only works if the dictionary is stored column-major instead
  26. MD5 Algorithm in Accelerate
       [diagram: the same dictionary transposed to column-major order, so
        that adjacent threads read adjacent characters]
  27. MD5 Algorithm in Accelerate
     Pay attention to data layout if you use indexing operators
  28. Nested Data-Parallelism
     • Accelerate is a language for flat data-parallel computations
       - The allowed array element types do not include Array: simple types only
       - Lots of types to statically exclude nested parallelism: Acc vs. Exp
  29. Nested Data-Parallelism
     • Accelerate is a language for flat data-parallel computations
       - The allowed array element types do not include Array: simple types only
       - Lots of types to statically exclude nested parallelism: Acc vs. Exp
       - But it doesn't always succeed, and the error is uninformative...

         *** Exception:
         *** Internal error in package accelerate ***
         *** Please submit a bug report at https://github.com/Accelerat...
         ./Data/Array/Accelerate/Smart.hs:561 (convertSharingExp):
         inconsistent valuation @ shared 'Exp' tree with stable name 54;
         env = [56]

  30. Nested Data-Parallelism
     (the same error as above)
  31. Nested Data-Parallelism
     (the same error as above, and also:)

         *** Exception: Cyclic definition of a value of type 'Exp' (sa = 45)
  32. Nested Data-Parallelism
     • Matrix-vector multiplication
       - Define vector dot product

         dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
         dotp u v = fold (+) 0
                  ( zipWith (*) u v )
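     A ghci sketch of dotp in action (not from the deck; the slide elides the
     numeric class constraints on e, so assume something like (Elt e, IsNum e));
     the result is 1 + 4 + 9 + 16 + 25 = 55:

         > let xs = fromList (Z:.5) [1..] :: Vector Float
         > run $ dotp (use xs) (use xs)
         Array (Z) [55.0]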
  33. Nested Data-Parallelism
     • Matrix-vector multiplication
       - Want to apply dotp to every row of the matrix
       - Extract a row from the matrix

         takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
         takeRow n mat =
           let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
           in backpermute (index1 cols)
                          (\ix -> index2 n (unindex1 ix))
                          mat
  34. Nested Data-Parallelism
     (takeRow as above)
       - At each element in the output array, where in the input do I read from?
  35. Nested Data-Parallelism
       - [un]index1 converts an Int into a (Z:.Int) 1D index (and vice versa)
  36. Nested Data-Parallelism
     • Matrix-vector multiplication
       - Apply dot product to each row of the matrix

         mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
         mvm mat vec =
           let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
           in generate (index1 rows)
                       (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))
  37. Nested Data-Parallelism
     (mvm as above, annotated: "indexing an array that ...")
  38. Nested Data-Parallelism
     • Extract the element from a singleton array

         the :: Acc (Scalar e) -> Exp e

  39. Nested Data-Parallelism
     • The problem: we are indexing an array that depends on the index given
       by generate
  40. Nested Data-Parallelism
     • The problem is attempting to execute many separate dot products in
       parallel (dotp, as defined earlier, is itself a collective operation)
     • We need a way to execute this step as a single collective operation
  41. Reductions
     • Folding (+) over a vector produces a sum
       - The result is a one-element array (scalar). Why?

         > let xs = fromList (Z:.10) [1..] :: Vector Int
         > run $ fold (+) 0 (use xs)
         Array (Z) [55]
  42. Reductions
     • Fold has an interesting type:

         fold :: (Shape sh, Elt a)
              => (Exp a -> Exp a -> Exp a)
              -> Exp a
              -> Acc (Array (sh:.Int) a)
              -> Acc (Array sh        a)
  43. Reductions
     (fold's type, annotated: Acc (Array (sh:.Int) a) is the input array)
  44. Reductions
     (... and in the result, Acc (Array sh a), the outer dimension is removed)
  45. Reductions
     • Fold occurs over the outer dimension of the array

         > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
         > run $ fold (+) 0 (use mat)
         Array (Z :. 3) [15,40,65]

       [diagram: each 5-element row of the 3x5 matrix reduces to one of the
        sums 15, 40, 65]
  46. Reductions
     (as above)
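     Since each application of fold strips one dimension, reducing a matrix
     all the way to a scalar takes two folds; alternatively, a sketch (not in
     the deck) using the standard foldAll combinator, which sums all 15
     elements at once:

         > run $ foldAll (+) 0 (use mat)
         Array (Z) [120]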
  47. Matrix-Vector Multiplication
     • The trick is that we can use this to do all of our dot products in parallel
       [diagram: the 3x5 matrix next to the vector 1 2 3 4 5]
  48. Matrix-Vector Multiplication
       [diagram: the vector replicated to match the shape of the matrix]
  49. Matrix-Vector Multiplication
       [diagram: the replicated vector and the matrix combined elementwise]
  50. Matrix-Vector Multiplication
       [diagram: each row of the elementwise product reduced with fold,
        yielding every dot product at once]
  51. Replicate
     • Multidimensional replicate introduces new magic
     • Type hackery:

         replicate :: (Slice slix, Elt e)
                   => Exp slix
                   -> Acc (Array (SliceShape slix) e)
                   -> Acc (Array (FullShape  slix) e)

         > let vec = fromList (Z:.5) [1..] :: Vector Float
         > run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
         Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0, ...]

  52. Replicate
     (as above)
  53. Replicate
       - All indicates the entirety of the existing dimension
  54. Replicate
       - All indicates the entirety of the existing dimension
       - The (3::Int) is the number of new rows
  55. Matrix-Vector Multiplication
     • Looks very much like vector dot product

         mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
         mvm mat vec =
           let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
               vec'           = A.replicate (lift (Z :. rows :. All)) vec
           in
           fold (+) 0 (A.zipWith (*) vec' mat)
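     Continuing the running example, a ghci sketch (not from the deck, and
     again assuming the elided numeric constraints on e); the rows give the
     three dot products 55, 130 and 205:

         > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
         > let vec = fromList (Z:.5)   [1..] :: Vector Int
         > run $ mvm (use mat) (use vec)
         Array (Z :. 3) [55,130,205]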
  56. N-Body Simulation
     • Calculate all interactions between a vector of particles

         calcAccels :: Exp R -> Acc (Vector Body) -> Acc (Vector Accel)
         calcAccels epsilon bodies
           = let n    = A.size bodies

                 rows = A.replicate (lift $ Z :. n :. All) bodies
                 cols = A.replicate (lift $ Z :. All :. n) bodies

             in
             A.fold (.+.) (vec 0) ( A.zipWith (accel epsilon) rows cols )

  57. N-Body Simulation
     Turn nested computations into segmented or higher-dimensional operations
  58. Accelerate
     • Recall that Accelerate computations take place on arrays
     • Accelerate evaluates the expression passed to run to generate a series
       of CUDA kernels
  59. Accelerate
       - Each piece of CUDA code must be compiled and loaded
  60. Accelerate
       - The goal is to make kernels that can be reused
  61. Accelerate
       - If we don't, the overhead of compilation can ruin performance
  62. Embedded Scalars
     • Consider drop, which yields all but the first n elements of a vector

         drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
         drop n arr =
           let n' = the (unit n)
           in  backpermute (ilift1 (subtract n') (shape arr))
                           (ilift1 (+ n')) arr

  63. Embedded Scalars
     (as above)
  64. Embedded Scalars
     • Lift an expression into a singleton array

         unit :: Exp e -> Acc (Scalar e)

  65. Embedded Scalars
     • Extract the element of a singleton array

         the :: Acc (Scalar e) -> Exp e
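     unit and the go in opposite directions between Exp and Acc; a tiny
     sketch (not from the deck):

         > run $ unit (constant (42 :: Int))
         Array (Z) [42]

     the goes the other way, as used inside drop above.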
  66. Embedded Scalars
     • Check the expression Accelerate sees when it evaluates run

         > let vec = fromList (Z:.10) [1..] :: Vector Int
         > drop 4 (use vec)
         let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
         let a1 = unit 4
         in backpermute
              (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
              (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
              a0

  67. Embedded Scalars
     (as above)
  68. Embedded Scalars
       - let a1 = unit 4 corresponds to the array we created for n
  69. Embedded Scalars
       - Critically, this is outside the call to backpermute
  70. Embedded Scalars
     • Check the expression Accelerate sees when it evaluates run

         drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
         drop' n arr =
           backpermute (ilift1 (subtract n) (shape arr))
                       (ilift1 (+ n)) arr

         > let vec = fromList (Z:.10) [1..] :: Vector Int
         > drop' 4 (use vec)
         let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
         in backpermute (Z :. -4 + (indexHead (shape a0)))
                        (\x0 -> Z :. 4 + (indexHead x0)) a0

  71. Embedded Scalars
       - The constant 4 is baked into the AST: this will defeat Accelerate's
         caching of compiled kernels
  72. Embedded Scalars
     Make sure any arguments that change are passed as Arrays
  73. Embedded Scalars
     • Inspect the expression directly, as done here
     • Alternatively, Accelerate has some debugging options
       - Run the program with the -ddump-cc command line switch
       - Make sure you have installed Accelerate with -fdebug
  74. Executing Computations
     • Recall: to actually execute a computation we use run

         run :: Arrays a => Acc a -> a

     • This entails quite a bit of work setting up the computation to run on
       the GPU
       - And sometimes the computation never changes...
  75. Executing Computations
     • Alternative: execute an array program of one argument

         run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b

     • What is the difference?
       - This version can be partially applied with the (Acc a -> Acc b) argument
       - Returns a new function (a -> b)
  76. Executing Computations
       - Key new thing: behind the scenes, everything other than final
         execution is already done. The AST is annotated with the object code
         required for execution.
  77. Executing Computations
     • No magic:

         canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
         canny = ...

         edges :: Array DIM2 RGBA -> Array DIM2 Float
         edges img = run ( canny (use img) )

     • Magic:

         edges :: Array DIM2 RGBA -> Array DIM2 Float
         edges img = run1 canny img
  78. Executing Computations
     • Criterion benchmarks:

         benchmarking canny/run
         collecting 100 samples, 1 iterations each, in estimated 6.711102 s
         mean: 37.55531 ms, lb 36.76889 ms, ub 38.30969 ms, ci 0.950
         std dev: 3.953049 ms, lb 3.655842 ms, ub 4.300031 ms, ci 0.950
         variance introduced by outliers: 81.052%
         variance is severely inflated by outliers

         benchmarking canny/run1
         mean: 5.570774 ms, lb 5.324862 ms, ub 5.854066 ms, ci 0.950
         std dev: 1.352904 ms, lb 1.210735 ms, ub 1.654683 ms, ci 0.950
         variance introduced by outliers: 95.756%
         variance is severely inflated by outliers

  79. Executing Computations
     (as above)
  80. Floyd-Warshall Algorithm
     • Find the shortest paths in a weighted graph

         shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
         shortestPath g i j 0 = weight g i j
         shortestPath g i j k =
           min (shortestPath g i j (k-1))
               (shortestPath g i k (k-1) + shortestPath g k j (k-1))

  81. Floyd-Warshall Algorithm
       - k=0: the path between vertices i and j is the direct case only
  82. Floyd-Warshall Algorithm
       - Otherwise the shortest path from i to j passes through k, or it does not
  83. Floyd-Warshall Algorithm
       - Exponential in the number of calls: instead, traverse bottom-up to
         make the results at (k-1) available
  84. Floyd-Warshall Algorithm
     • Instead of a sparse graph, we'll use a dense adjacency matrix
       - A 2D array where each element is the distance between two vertices
       - We must pass the iteration number k as a scalar array
       - sp is the core of the algorithm

         type Weight = Int32
         type Graph  = Array DIM2 Weight

         step :: Acc (Scalar Int) -> Acc Graph -> Acc Graph
         step k g = generate (shape g) sp
          where
            k' = the k

            sp :: Exp DIM2 -> Exp Weight
            sp ix = let (Z :. i :. j) = unlift ix
                    in  min (g ! (index2 i j))
                            (g ! (index2 i k') + g ! (index2 k' j))

  85. Floyd-Warshall Algorithm
     (as above)
  86. Floyd-Warshall Algorithm
       - Construct a list of steps, applying k in sequence 0 .. n-1

         shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
         shortestPathsAcc n g0 = foldl1 (.) steps g0
          where
            steps :: [ Acc Graph -> Acc Graph ]
            steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

  87. Floyd-Warshall Algorithm
       - Compose the sequence together
  88. Floyd-Warshall Algorithm
     • Let's try a 1000 x 1000 matrix
  89. Floyd-Warshall Algorithm

         $ ./fwaccel 1000 +RTS -s
         fwaccel:
         *** Internal error in package accelerate ***
         *** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
         ./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: out of memory

            6,627,846,360 bytes allocated in the heap
            2,830,827,472 bytes copied during GC
              733,091,960 bytes maximum residency (26 sample(s))
               31,575,272 bytes maximum slop
                      812 MB total memory in use (54 MB lost due to fragmentation)
         ...
           Total   time    7.16s  ( 11.42s elapsed)

  90. Floyd-Warshall Algorithm
     (as above)
  91. Pipelining
     • To sequence computations we have a special operator

         (>->) :: (Arrays a, Arrays b, Arrays c)
               => (Acc a -> Acc b)
               -> (Acc b -> Acc c)
               -> Acc a -> Acc c

       - Operationally, the two computations will not share any
         subcomputations with each other or the environment
       - Intermediate arrays from the first can be GC'd when the second begins
  92. Pipelining
     • The same benchmark, with the steps sequenced via (>->):

         $ ./fwaccel 1000 +RTS -s
         Array (Z) [1783293664]
            6,211,776,544 bytes allocated in the heap
              914,044,096 bytes copied during GC
               24,953,768 bytes maximum residency (211 sample(s))
                2,023,496 bytes maximum slop
                       63 MB total memory in use (0 MB lost due to fragmentation)
         ...
           Total   time    3.86s  (  3.92s elapsed)

  93. Pipelining
     (>->) can help control memory use & startup time
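     The deck does not show the fixed code, but the change amounts to
     swapping the composition operator; a sketch under that assumption:

         -- sequence the Floyd-Warshall steps with (>->) instead of (.)
         shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
         shortestPathsAcc n g0 = foldl1 (>->) steps g0
          where
            steps :: [ Acc Graph -> Acc Graph ]
            steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]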
  94. Stencils
     • A stencil is a map with access to the neighbourhood around each element
       - Useful in many scientific & image processing algorithms

         laplace :: Stencil3x3 Int -> Exp Int
         laplace ((_,t,_)
                 ,(l,c,r)
                 ,(_,b,_)) = t + b + l + r - 4*c

       [diagram: 3x3 stencil with centre c and neighbours t, l, r, b]
  95. Stencils
       - Boundary conditions specify how to handle out-of-bounds neighbours

         > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
         > run $ stencil laplace (Constant 0) (use mat)
         Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]

  96. Stencils
     (as above)
  97. Debugging options
     • -dverbose
     • -ddump-sharing
     • -ddump-simpl-stats, -ddump-simpl-iterations
     • -ddump-cc, -ddebug-cc
     • -ddump-exec
     • -ddump-gc
     • -fflush-cache
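     As a closing sketch (assuming a binary built against Accelerate
     installed with -fdebug, as on slide 2), these are runtime switches
     passed to your own executable, e.g. the fwaccel program from earlier:

         $ ./fwaccel 1000 -ddump-cc -dverbose +RTS -s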