GPGPU Programming in Haskell with Accelerate (Workshop)

Companion workshop to the main talk: https://speakerdeck.com/tmcdonell/gpgpu-programming-in-haskell-with-accelerate

Current graphics cards are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. Peak performance of these devices is far greater than that of traditional CPUs; however, it is difficult to realise, because good performance requires highly idiomatic programs whose development is work intensive and requires expert knowledge. To raise the level of abstraction we are developing a domain-specific high-level language in Haskell for programming these devices. Computations are expressed in the form of parameterised collective operations (such as maps, reductions, and permutations) over multi-dimensional arrays. These computations are compiled online and executed on the graphics processor.

In this talk, I give some more details about how to use Accelerate effectively.

Trevor L. McDonell

May 17, 2013
Transcript

  1. GPGPU Programming in Haskell
    with Accelerate
    Trevor L. McDonell

    University of New South Wales

    @tlmcdonell
    [email protected]
    https://github.com/AccelerateHS

  2. Preliminaries
    • Get it from Hackage:

      cabal install accelerate -fdebug
      cabal install accelerate-cuda -fdebug

    - Debugging support enables some extra options to see what is going on

    • Need to import both the base library as well as a specific backend

    - Import qualified to avoid name clashes with the Prelude

      import Prelude                    as P
      import Data.Array.Accelerate      as A
      import Data.Array.Accelerate.CUDA
      -- or
      import Data.Array.Accelerate.Interpreter

    http://hackage.haskell.org/package/accelerate
    https://github.com/AccelerateHS/

  3. Accelerate
    • Accelerate is a Domain-Specific Language for GPU programming

    [Diagram: transform the Haskell/Accelerate program into a CUDA program,
    compile with NVIDIA’s compiler & load onto the GPU, then copy the result
    back to Haskell]

  4. Accelerate
    • Accelerate computations take place on arrays

    - Parallelism is introduced in the form of collective operations over arrays

      data Array sh e

    • Arrays have two type parameters

    - The shape of the array, or dimensionality

    - The element type of the array: Int, Float, etc.

    [Diagram: Arrays in → Accelerate computation → Arrays out]
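
    As a reminder of the base library vocabulary used on later slides, these
    synonyms come with Accelerate itself (shown here, not new definitions):

      type DIM0     = Z
      type DIM1     = Z :. Int
      type DIM2     = Z :. Int :. Int

      type Scalar e = Array DIM0 e   -- zero-dimensional, one element
      type Vector e = Array DIM1 e   -- one-dimensional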

  5. Accelerate
    • Accelerate is split into two worlds: Acc and Exp

    - Acc represents collective operations over instances of Arrays

    - Exp is a scalar computation on things of type Elt

    • Collective operations in Acc comprise many scalar operations in Exp,
      executed in parallel over Arrays

    - Scalar operations cannot contain collective operations

    • This stratification excludes nested data parallelism
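
    A minimal sketch of the two worlds (saxpy is a name introduced here, not
    from the deck): the collective wrapper lives in Acc, while the scalar
    worker is an ordinary Exp function.

      saxpy :: Exp Float
            -> Acc (Vector Float) -> Acc (Vector Float) -> Acc (Vector Float)
      saxpy alpha xs ys = A.zipWith (\x y -> alpha * x + y) xs ys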

  6. Accelerate
    • To execute an Accelerate computation (on the GPU):

      run :: Arrays a => Acc a -> a

    - run comes from whichever backend we have chosen (CUDA)

  7. Accelerate
    • To execute an Accelerate computation (on the GPU):

      run :: Arrays a => Acc a -> a

    - run comes from whichever backend we have chosen (CUDA)

    • To get arrays into Acc land

      use :: Arrays a => a -> Acc a

    - This may involve copying data to the GPU

  8. Accelerate
    • To execute an Accelerate computation (on the GPU):

      run :: Arrays a => Acc a -> a

    - run comes from whichever backend we have chosen (CUDA)

    • To get arrays into Acc land

      use :: Arrays a => a -> Acc a

    - This may involve copying data to the GPU

    • Using Accelerate, focus on everything in between: using combinators of type
      Acc to build an AST that will be turned into CUDA code and executed by run
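
    Putting run and use together, a complete program might look like this
    minimal sketch (any backend’s run will do):

      import Data.Array.Accelerate      as A
      import Data.Array.Accelerate.CUDA (run)

      -- use embeds the input array, the combinators build the AST,
      -- and run compiles and executes it on the device
      main :: IO ()
      main = print . run . A.map (* 2) . use $ xs
        where
          xs = fromList (Z :. 10) [1 .. 10] :: Vector Int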

  9. Arrays
    • Create an array from a list:

      data Array sh e

      fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

    - Generates a multidimensional array by consuming elements from the list
      and adding them to the array in row-major order

    • Example:

      ghci> fromList (Z:.10) [1..10]

  10. Arrays
    • Create an array from a list:

      data Array sh e

      fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

    - Generates a multidimensional array by consuming elements from the list
      and adding them to the array in row-major order

    • Example:

      ghci> fromList (Z:.10) [1..10]

      :3:1:
          No instance for (Shape (Z :. head0))
            arising from a use of `fromList'
          The type variable `head0' is ambiguous
          Possible fix: add a type signature that fixes these type
          Note: there is a potential instance available:
            instance Shape sh => Shape (sh :. Int)
              -- Defined in `Data.Array.Accelerate.Array.Sugar'
          Possible fix: add an instance declaration for (Shape (Z :
          In the expression: fromList (Z :. 10) [1 .. 10]
          In an equation for `it': it = fromList (Z :. 10) [1 .. 10

      :3:14:
          No instance for (Num head0) arising from the literal `10'
          The type variable `head0' is ambiguous
          Possible fix: add a type signature that fixes these type
          Note: there are several potential instances:
            instance Num Double -- Defined in `GHC.Float'
            instance Num Float -- Defined in `GHC.Float'
            instance Integral a => Num (GHC.Real.Ratio a)
              -- Defined in `GHC.Real'
            ...plus 12 others
          In the second argument of `(:.)', namely `10'

  11. Arrays
    • Create an array from a list:

      data Array sh e

      fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

    - Generates a multidimensional array by consuming elements from the list
      and adding them to the array in row-major order

    • Example:

      ghci> fromList (Z:.10) [1..10]

    - Defaulting does not apply, because Shape is not a standard class

  12. Arrays
    • Create an array from a list:

      data Array sh e

      fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

    - Generates a multidimensional array by consuming elements from the list
      and adding them to the array in row-major order

    • Example:

      ghci> fromList (Z:.10) [1..10]

    - Defaulting does not apply, because Shape is not a standard class
    Number 1 tip: Add type signatures

  13. Arrays
    • Create an array from a list:

      data Array sh e

      > fromList (Z:.10) [1..10] :: Vector Float
      Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

  14. Arrays
    • Create an array from a list:

      > fromList (Z:.10) [1..10] :: Vector Float
      Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

    • Multidimensional arrays are similar:

    - Elements are filled along the right-most dimension first

      > fromList (Z:.3:.5) [1..] :: Array DIM2 Int
      Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

       1  2  3  4  5
       6  7  8  9 10
      11 12 13 14 15
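
    A single element can be read back on the Haskell side with indexArray
    from the base library; indices follow the same Z :. row :. column
    convention:

      > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
      > indexArray mat (Z :. 1 :. 2)
      8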

  15. Accelerate by example
    Password “recovery”


  16. MD5 Algorithm
    • Aim:

    - Implement one round of MD5: unsalted, single 512-bit block

    - Apply to an array of words

    - Compare hashes to some unknown hash

    - i.e. standard dictionary attack


  17. MD5 Algorithm
    • Algorithm operates on a 4-word state: A, B, C, and D

    • There are 4 x 16 rounds: F, G, H, and I

    - Mi is a word from the input message

    - Ki is a constant

    - <<< is left rotate, by some constant ri

    • Each round operates on the 512-bit message block, modifying the state

    http://en.wikipedia.org/wiki/Md5

  18. MD5 Algorithm in Accelerate
    • Accelerate is a meta programming language

    - Use regular Haskell to generate the expression for each step of the round

    - Produces an unrolled loop

      type ABCD = (Exp Word32, Exp Word32, ... )

      md5round :: Acc (Vector Word32) -> ABCD
      md5round msg
        = P.foldl round (a0,b0,c0,d0) [0..63]
        where
          round :: ABCD -> Int -> ABCD
          round (a,b,c,d) i = ...

  19. MD5 Algorithm in Accelerate
    • The constants ki and ri can be embedded directly

    - The simple list lookup would be death in standard Haskell

    - Generating the expression need not be performant, only executing it

      k :: Int -> Exp Word32
      k i = constant (ks P.!! i)
        where
          ks = [ 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee
               , ...

  20. MD5 Algorithm in Accelerate
    • The message M is stored as an array, so we need array indexing

    - Be wary: arbitrary array indexing can kill performance...

      (!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e

  21. MD5 Algorithm in Accelerate
    • The message M is stored as an array, so we need array indexing

    - Be wary: arbitrary array indexing can kill performance...

      (!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e

    • Get the right word of the message for the given round

      m :: Int -> Exp Word32
      m i
        | i < 16 = msg ! index1 (constant i)
        | i < 32 = msg ! index1 (constant ((5*i + 1) `rem` 16))
        | ...

  22. MD5 Algorithm in Accelerate
    • Finally, the non-linear functions F, G, H, and I

      round :: ABCD -> Int -> ABCD
      round (a,b,c,d) i
        | i < 16 = shfl (f b c d)
        | ...
        where
          shfl x  = (d, b + ((a + x + k i + m i) `rotateL` r i), b, c)

          f x y z = (x .&. y) .|. ((complement x) .&. z)
          ...

  23. MD5 Algorithm in Accelerate
    • MD5 applied to a single 16-word vector: no parallelism here

    • Lift this operation to an array of n words

    - Process many words in parallel to compare against the unknown

    - Need to use generate, the most general form of array construction.
      Equivalently, the most easily misused

      generate :: (Shape sh, Elt e)
               => Exp sh
               -> (Exp sh -> Exp e)
               -> Acc (Array sh e)
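
    To get a feel for generate in isolation, a small sketch (squares is a
    name introduced here, unrelated to the MD5 code):

      squares :: Exp Int -> Acc (Vector Int)
      squares n = generate (index1 n)
                           (\ix -> let i = unindex1 ix in i * i)

      > run $ squares 10
      Array (Z :. 10) [0,1,4,9,16,25,36,49,64,81]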

  24. Problem Solving Accelerate
    Hints for when things don’t work as you expect


  25. MD5 Algorithm in Accelerate
    • As always, data layout is important

    - Accelerate arrays are stored in row-major order

    - For a CPU, this means the input word would be read on a cache line
    c o r r e c t
    h o r s e
    b a t t e r y
    s t a p l e


  26. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element
    c o r r e c t
    h o r s e
    b a t t e r y
    s t a p l e


  27. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element

      c o r r e c t
      h o r s e
      b a t t e r y
      s t a p l e

  28. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element
    - For best performance CUDA threads need to index adjacent memory
    - This only works if the dictionary is stored column major instead

    c o r r e c t
    h o r s e
    b a t t e r y
    s t a p l e


  29. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element
    - For best performance CUDA threads need to index adjacent memory
    - This only works if the dictionary is stored column major instead
    c h b s
    o o a t
    r r t a
    r s t p
    e e e l
    c r e
    t y


  30. MD5 Algorithm in Accelerate
    • However in a parallel context many threads must work together
    - generate uses one thread per element
    - For best performance CUDA threads need to index adjacent memory
    - This only works if the dictionary is stored column-major instead

      c h b s
      o o a t
      r r t a
      r s t p
      e e e l
      c r e
      t y

    Pay attention to data layout if you use indexing operators
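
    A column-major copy can also be made inside Accelerate itself, e.g. with
    backpermute (a sketch; transposeA is a name introduced here, not a
    library function):

      transposeA :: Elt e => Acc (Array DIM2 e) -> Acc (Array DIM2 e)
      transposeA mat =
        let Z :. rows :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in  backpermute (index2 cols rows)
                        (\ix -> let Z :. r :. c = unlift ix
                                                    :: Z :. Exp Int :. Exp Int
                                in  index2 c r)
                        mat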

  31. Nested Data-Parallelism
    • Accelerate is a language for flat data-parallel computations
    - The allowed array element types do not include Array: simple types only
    - Lots of types to statically exclude nested parallelism: Acc vs. Exp


  32. Nested Data-Parallelism
    • Accelerate is a language for flat data-parallel computations
    - The allowed array element types do not include Array: simple types only
    - Lots of types to statically exclude nested parallelism: Acc vs. Exp
    - But, it doesn’t always succeed, and the error is uninformative…

      *** Exception:
      *** Internal error in package accelerate ***
      *** Please submit a bug report at https://github.com/Accelerat...
      ./Data/Array/Accelerate/Smart.hs:561 (convertSharingExp):
      inconsistent valuation @ shared 'Exp' tree with stable name 54;
      env = [56]

  34. Nested Data-Parallelism
    • Accelerate is a language for flat data-parallel computations
    - The allowed array element types do not include Array: simple types only
    - Lots of types to statically exclude nested parallelism: Acc vs. Exp
    - But, it doesn’t always succeed, and the error is uninformative…

      *** Exception:
      *** Internal error in package accelerate ***
      *** Please submit a bug report at https://github.com/Accelerat...
      ./Data/Array/Accelerate/Smart.hs:561 (convertSharingExp):
      inconsistent valuation @ shared 'Exp' tree with stable name 54;
      env = [56]

      *** Exception: Cyclic definition of a value of type 'Exp' (sa = 45)

  35. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Define vector dot product

      dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
      dotp u v = fold (+) 0
               ( zipWith (*) u v )

  36. Nested Data-Parallelism
    • Matrix-vector multiplication
    - Want to apply dotp to every row of the matrix
    - Extract a row from the matrix

      takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
      takeRow n mat =
        let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in backpermute (index1 cols)
                       (\ix -> index2 n (unindex1 ix))
                       mat

  37. Nested Data-Parallelism
    • Matrix-vector multiplication
    - Want to apply dotp to every row of the matrix
    - Extract a row from the matrix
    - At each element in the output array, where in the input do I read from?

      takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
      takeRow n mat =
        let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in backpermute (index1 cols)
                       (\ix -> index2 n (unindex1 ix))
                       mat

  38. Nested Data-Parallelism
    • Matrix-vector multiplication
    - Want to apply dotp to every row of the matrix
    - Extract a row from the matrix
    - At each element in the output array, where in the input do I read from?
    - [un]index1 converts an Int into a (Z:.Int) 1D index (and vice versa)

      takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
      takeRow n mat =
        let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in backpermute (index1 cols)
                       (\ix -> index2 n (unindex1 ix))
                       mat

  39. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Apply dot product to each row of the matrix

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in generate (index1 rows)
                    (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))

  40. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Apply dot product to each row of the matrix

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in generate (index1 rows)
                    (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))

    indexing an array that …

  41. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Apply dot product to each row of the matrix

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in generate (index1 rows)
                    (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))

    indexing an array that …

    • Extract the element from a singleton array

      the :: Acc (Scalar e) -> Exp e

  42. Nested Data-Parallelism
    • Matrix-vector multiplication

    - Apply dot product to each row of the matrix

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
        in generate (index1 rows)
                    (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))

    indexing an array that … depends on the index given by generate

    • Extract the element from a singleton array

      the :: Acc (Scalar e) -> Exp e

  43. Nested Data-Parallelism
    • The problem is attempting to execute many separate dot products in parallel

      dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
      dotp u v = fold (+) 0
               ( zipWith (*) u v )

    • We need a way to execute this step as a single collective operation

  44. Reductions
    • Folding (+) over a vector produces a sum

    - The result is a one-element array (scalar). Why?

      > let xs = fromList (Z:.10) [1..] :: Vector Int
      > run $ fold (+) 0 (use xs)
      Array (Z) [55]

  45. Reductions
    • Folding (+) over a vector produces a sum

    - The result is a one-element array (scalar). Why?

      > let xs = fromList (Z:.10) [1..] :: Vector Int
      > run $ fold (+) 0 (use xs)
      Array (Z) [55]

    • Fold has an interesting type:

      fold :: (Shape sh, Elt a)
           => (Exp a -> Exp a -> Exp a)
           -> Exp a
           -> Acc (Array (sh:.Int) a)
           -> Acc (Array sh        a)

  46. Reductions
    • Folding (+) over a vector produces a sum

    - The result is a one-element array (scalar). Why?

      > let xs = fromList (Z:.10) [1..] :: Vector Int
      > run $ fold (+) 0 (use xs)
      Array (Z) [55]

    • Fold has an interesting type:

      fold :: (Shape sh, Elt a)
           => (Exp a -> Exp a -> Exp a)
           -> Exp a
           -> Acc (Array (sh:.Int) a)   -- input array
           -> Acc (Array sh        a)

  47. Reductions
    • Folding (+) over a vector produces a sum

    - The result is a one-element array (scalar). Why?

      > let xs = fromList (Z:.10) [1..] :: Vector Int
      > run $ fold (+) 0 (use xs)
      Array (Z) [55]

    • Fold has an interesting type:

      fold :: (Shape sh, Elt a)
           => (Exp a -> Exp a -> Exp a)
           -> Exp a
           -> Acc (Array (sh:.Int) a)   -- input array
           -> Acc (Array sh        a)   -- outer dimension removed

  48. Reductions
    • Fold occurs over the outer dimension of the array

      > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
      > run $ fold (+) 0 (use mat)
      Array (Z :. 3) [15,40,65]

       1  2  3  4  5  →  15
       6  7  8  9 10  →  40
      11 12 13 14 15  →  65

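
    Because fold is shape-polymorphic it can be applied repeatedly; summing
    the whole matrix is two folds (continuing the session above, result
    worked out by hand):

      > run $ fold (+) 0 (fold (+) 0 (use mat))
      Array (Z) [120]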

  50. Matrix-Vector Multiplication
    • The trick is that we can use this to do all of our dot-products in parallel

       1  2  3  4  5        1
       6  7  8  9 10        2
      11 12 13 14 15        3
                            4
                            5


  52. Matrix-Vector Multiplication
    • The trick is that we can use this to do all of our dot-products in parallel

       1  2  3  4  5        1  2  3  4  5
       6  7  8  9 10        1  2  3  4  5
      11 12 13 14 15        1  2  3  4  5


  54. Replicate
    • Multidimensional replicate introduces new magic

      replicate :: (Slice slix, Elt e)
                => Exp slix
                -> Acc (Array (SliceShape slix) e)
                -> Acc (Array (FullShape  slix) e)


  56. Replicate
    • Multidimensional replicate introduces new magic
    • Type hackery

      replicate :: (Slice slix, Elt e)
                => Exp slix
                -> Acc (Array (SliceShape slix) e)
                -> Acc (Array (FullShape  slix) e)

      > let vec = fromList (Z:.5) [1..] :: Vector Float
      > run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
      Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0, ...]


  58. Replicate
    • Multidimensional replicate introduces new magic
    • Type hackery
    - All indicates the entirety of the existing dimension

      replicate :: (Slice slix, Elt e)
                => Exp slix
                -> Acc (Array (SliceShape slix) e)
                -> Acc (Array (FullShape  slix) e)

      > let vec = fromList (Z:.5) [1..] :: Vector Float
      > run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
      Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0, ...]

  59. Replicate
    • Multidimensional replicate introduces new magic
    • Type hackery
    - All indicates the entirety of the existing dimension
    - (3::Int) is the number of new rows

      replicate :: (Slice slix, Elt e)
                => Exp slix
                -> Acc (Array (SliceShape slix) e)
                -> Acc (Array (FullShape  slix) e)

      > let vec = fromList (Z:.5) [1..] :: Vector Float
      > run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
      Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0, ...]
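
    Placing All in the other position instead replicates along columns, the
    layout the N-body example below relies on (continuing with vec from
    above; expected result sketched by hand):

      > run $ replicate (constant (Z :. All :. (3::Int))) (use vec)
      Array (Z :. 5 :. 3) [1.0,1.0,1.0,2.0,2.0,2.0,3.0,3.0,3.0, ...]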

  60. Matrix-Vector Multiplication
    • Looks very much like vector dot product

      mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
      mvm mat vec =
        let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
            vec'           = A.replicate (lift (Z :. rows :. All)) vec
        in
        fold (+) 0 (A.zipWith (*) vec' mat)
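
    Checking mvm against the running example (result worked out by hand):

      > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Float
      > let vec = fromList (Z:.5)    [1..] :: Vector Float
      > run $ mvm (use mat) (use vec)
      Array (Z :. 3) [55.0,130.0,205.0]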

  61. N-Body Simulation
    • Calculate all interactions between a vector of particles

      calcAccels :: Exp R -> Acc (Vector Body) -> Acc (Vector Accel)
      calcAccels epsilon bodies
        = let n       = A.size bodies

              rows    = A.replicate (lift $ Z :. n :. All) bodies
              cols    = A.replicate (lift $ Z :. All :. n) bodies

          in
          A.fold (.+.) (vec 0) ( A.zipWith (accel epsilon) rows cols )

  62. N-Body Simulation
    • Calculate all interactions between a vector of particles

      calcAccels :: Exp R -> Acc (Vector Body) -> Acc (Vector Accel)
      calcAccels epsilon bodies
        = let n       = A.size bodies

              rows    = A.replicate (lift $ Z :. n :. All) bodies
              cols    = A.replicate (lift $ Z :. All :. n) bodies

          in
          A.fold (.+.) (vec 0) ( A.zipWith (accel epsilon) rows cols )

    Turn nested computations into segmented or higher-dimensional operations

  63. Accelerate
    • Recall that Accelerate computations take place on arrays

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  64. Accelerate
    • Recall that Accelerate computations take place on arrays
    • Accelerate evaluates the expression passed to run to generate a series of
      CUDA kernels

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  65. Accelerate
    • Recall that Accelerate computations take place on arrays
    • Accelerate evaluates the expression passed to run to generate a series of
      CUDA kernels
    - Each piece of CUDA code must be compiled and loaded

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  66. Accelerate
    • Recall that Accelerate computations take place on arrays
    • Accelerate evaluates the expression passed to run to generate a series of
      CUDA kernels
    - Each piece of CUDA code must be compiled and loaded
    - The goal is to make kernels that can be reused

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  67. Accelerate
    • Recall that Accelerate computations take place on arrays
    • Accelerate evaluates the expression passed to run to generate a series of
      CUDA kernels
    - Each piece of CUDA code must be compiled and loaded
    - The goal is to make kernels that can be reused
    - If we don’t, the overhead of compilation can ruin performance

    [Diagram: Arrays in → Accelerate computation → Arrays out]

  68. Embedded Scalars
    • Consider drop, which yields all but the first n elements of a vector

      drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop n arr =
        let n' = the (unit n)
        in  backpermute (ilift1 (subtract n') (shape arr))
                        (ilift1 (+ n')) arr


  70. Embedded Scalars
    • Consider drop, which yields all but the first n elements of a vector
    • Lift an expression into a singleton array

      unit :: Exp e -> Acc (Scalar e)

      drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop n arr =
        let n' = the (unit n)
        in  backpermute (ilift1 (subtract n') (shape arr))
                        (ilift1 (+ n')) arr

  71. Embedded Scalars
    • Consider drop, which yields all but the first n elements of a vector
    • Lift an expression into a singleton array

      unit :: Exp e -> Acc (Scalar e)

    • Extract the element of a singleton array

      the :: Acc (Scalar e) -> Exp e

      drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop n arr =
        let n' = the (unit n)
        in  backpermute (ilift1 (subtract n') (shape arr))
                        (ilift1 (+ n')) arr

  72. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
      let a1 = unit 4
      in backpermute
           (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
           (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
           a0


  74. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run
    - Corresponds to the array we created for n

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
      let a1 = unit 4
      in backpermute
           (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
           (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
           a0

  75. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run
    - Corresponds to the array we created for n
    - Critically, this is outside the call to backpermute

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
      let a1 = unit 4
      in backpermute
           (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
           (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
           a0

  76. Embedded Scalars

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

  77. Embedded Scalars

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop' 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
      in backpermute (Z :. -4 + (indexHead (shape a0)))
                     (\x0 -> Z :. 4 + (indexHead x0)) a0


  79. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop' 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
      in backpermute (Z :. -4 + (indexHead (shape a0)))
                     (\x0 -> Z :. 4 + (indexHead x0)) a0

  80. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run
    - This will defeat Accelerate’s caching of compiled kernels

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop' 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
      in backpermute (Z :. -4 + (indexHead (shape a0)))
                     (\x0 -> Z :. 4 + (indexHead x0)) a0

  81. Embedded Scalars
    • Check the expression Accelerate sees when it evaluates run
    - This will defeat Accelerate’s caching of compiled kernels

      drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
      drop' n arr =
        backpermute (ilift1 (subtract n) (shape arr))
                    (ilift1 (+ n)) arr

      > let vec = fromList (Z:.10) [1..] :: Vector Int
      > drop' 4 (use vec)
      let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
      in backpermute (Z :. -4 + (indexHead (shape a0)))
                     (\x0 -> Z :. 4 + (indexHead x0)) a0

    Make sure any arguments that change are passed as Arrays

  82. Embedded Scalars
    • Inspect the expression directly, as done here

    • Alternatively: Accelerate has some debugging options

    - Run the program with the -ddump-cc command line switch

    - Make sure you have installed Accelerate with -fdebug

  83. Executing Computations
    • Recall: to actually execute a computation we use run

      run :: Arrays a => Acc a -> a

    • This entails quite a bit of work setting up the computation to run on the GPU

    - And sometimes the computation never changes…

  84. Executing Computations
    • Alternative: execute an array program of one argument

      run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b

    • What is the difference?
    - This version can be partially applied with the (Acc a -> Acc b) argument
    - Returns a new function (a -> b)

  85. Executing Computations
    • Alternative: execute an array program of one argument

      run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b

    • What is the difference?
    - This version can be partially applied with the (Acc a -> Acc b) argument
    - Returns a new function (a -> b)
    - Key new thing: behind the scenes everything other than final execution is
      already done. The AST is annotated with the object code required for execution.

  86. Executing Computations

      canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
      canny = ...

  87. Executing Computations
    • No magic:

      canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
      canny = ...

      edges :: Array DIM2 RGBA -> Array DIM2 Float
      edges img = run ( canny (use img) )

  88. Executing Computations
    • No magic:

      canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
      canny = ...

      edges :: Array DIM2 RGBA -> Array DIM2 Float
      edges img = run ( canny (use img) )

    • Magic:

      edges :: Array DIM2 RGBA -> Array DIM2 Float
      edges img = run1 canny img
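
    The payoff comes from partial application: compile once, apply many
    times. A sketch reusing the canny and RGBA names from the slide
    (processAll is a name introduced here):

      processAll :: [Array DIM2 RGBA] -> [Array DIM2 Float]
      processAll = P.map (run1 canny)   -- canny is compiled only once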

  89. Executing Computations
    • Criterion benchmarks

      canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
      canny = ...

      benchmarking canny/run
      collecting 100 samples, 1 iterations each, in estimated 6.711102 s
      mean: 37.55531 ms, lb 36.76889 ms, ub 38.30969 ms, ci 0.950
      std dev: 3.953049 ms, lb 3.655842 ms, ub 4.300031 ms, ci 0.950
      variance introduced by outliers: 81.052%
      variance is severely inflated by outliers

      benchmarking canny/run1
      mean: 5.570774 ms, lb 5.324862 ms, ub 5.854066 ms, ci 0.950
      std dev: 1.352904 ms, lb 1.210735 ms, ub 1.654683 ms, ci 0.950
      variance introduced by outliers: 95.756%
      variance is severely inflated by outliers


  91. Floyd-Warshall Algorithm
    • Find the shortest paths in a weighted graph

      shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
      shortestPath g i j 0 = weight g i j
      shortestPath g i j k =
        min (shortestPath g i j (k-1))
            (shortestPath g i k (k-1) + shortestPath g k j (k-1))

  92. Floyd-Warshall Algorithm
    • Find the shortest paths in a weighted graph
    - k=0: path between vertices i and j is the direct case only

      shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
      shortestPath g i j 0 = weight g i j
      shortestPath g i j k =
        min (shortestPath g i j (k-1))
            (shortestPath g i k (k-1) + shortestPath g k j (k-1))

  93. Floyd-Warshall Algorithm
    • Find the shortest paths in a weighted graph
    - k=0: path between vertices i and j is the direct case only
    - Otherwise the shortest path from i to j passes through k, or it does not

      shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
      shortestPath g i j 0 = weight g i j
      shortestPath g i j k =
        min (shortestPath g i j (k-1))
            (shortestPath g i k (k-1) + shortestPath g k j (k-1))

  94. Floyd-Warshall Algorithm
    • Find the shortest paths in a weighted graph
    - k=0: path between vertices i and j is the direct case only
    - Otherwise the shortest path from i to j passes through k, or it does not
    - Exponential in the number of calls: instead traverse bottom-up to make the
      results at (k-1) available

      shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
      shortestPath g i j 0 = weight g i j
      shortestPath g i j k =
        min (shortestPath g i j (k-1))
            (shortestPath g i k (k-1) + shortestPath g k j (k-1))

  95. Floyd-Warshall Algorithm
    • Instead of a sparse graph, we’ll use a dense adjacency matrix

    - 2D array where the element is the distance between the two vertices

    - Must pass the iteration number k as a scalar array

    - sp is the core of the algorithm

      type Weight = Int32
      type Graph  = Array DIM2 Weight

      step :: Acc (Scalar Int) -> Acc Graph -> Acc Graph
      step k g = generate (shape g) sp
       where
         k' = the k

         sp :: Exp DIM2 -> Exp Weight
         sp ix = let (Z :. i :. j) = unlift ix
                 in  min (g ! (index2 i j))
                         (g ! (index2 i k') + g ! (index2 k' j))


  97. Floyd-Warshall Algorithm

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

  98. Floyd-Warshall Algorithm
    - Construct a list of steps, applying k in sequence 0 .. n-1

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

  99. Floyd-Warshall Algorithm
    - Construct a list of steps, applying k in sequence 0 .. n-1
    - Compose the sequence together

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

  100. Floyd-Warshall Algorithm
    - Construct a list of steps, applying k in sequence 0 .. n-1
    - Compose the sequence together
    • Let’s try a 1000 x 1000 matrix

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

  101. Floyd-Warshall Algorithm
    - Construct a list of steps, applying k in sequence 0 .. n-1
    - Compose the sequence together
    • Let’s try a 1000 x 1000 matrix

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (.) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]

      $ ./fwaccel 1000 +RTS -s
      fwaccel:
      *** Internal error in package accelerate ***
      *** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
      ./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: out of memory

         6,627,846,360 bytes allocated in the heap
         2,830,827,472 bytes copied during GC
           733,091,960 bytes maximum residency (26 sample(s))
            31,575,272 bytes maximum slop
                   812 MB total memory in use (54 MB lost due to fragmentation)
      ...
        Total   time    7.16s  ( 11.42s elapsed)


  103. Pipelining
    • To sequence computations we have a special operator

      (>->) :: (Arrays a, Arrays b, Arrays c)
            => (Acc a -> Acc b)
            -> (Acc b -> Acc c)
            -> Acc a -> Acc c

    - Operationally, the two computations will not share any subcomputations
      with each other or the environment

    - Intermediate arrays from the first can be GC’d when the second begins

  104. Pipelining
    • To sequence computations we have a special operator

      (>->) :: (Arrays a, Arrays b, Arrays c)
            => (Acc a -> Acc b)
            -> (Acc b -> Acc c)
            -> Acc a -> Acc c

    - Operationally, the two computations will not share any subcomputations
      with each other or the environment

    - Intermediate arrays from the first can be GC’d when the second begins

      $ ./fwaccel 1000 +RTS -s
      Array (Z) [1783293664]
         6,211,776,544 bytes allocated in the heap
           914,044,096 bytes copied during GC
            24,953,768 bytes maximum residency (211 sample(s))
             2,023,496 bytes maximum slop
                    63 MB total memory in use (0 MB lost due to fragmentation)
      ...
        Total   time    3.86s  (  3.92s elapsed)

  105. Pipelining
    • To sequence computations we have a special operator

      (>->) :: (Arrays a, Arrays b, Arrays c)
            => (Acc a -> Acc b)
            -> (Acc b -> Acc c)
            -> Acc a -> Acc c

    - Operationally, the two computations will not share any subcomputations
      with each other or the environment

    - Intermediate arrays from the first can be GC’d when the second begins

      $ ./fwaccel 1000 +RTS -s
      Array (Z) [1783293664]
         6,211,776,544 bytes allocated in the heap
           914,044,096 bytes copied during GC
            24,953,768 bytes maximum residency (211 sample(s))
             2,023,496 bytes maximum slop
                    63 MB total memory in use (0 MB lost due to fragmentation)
      ...
        Total   time    3.86s  (  3.92s elapsed)

    (>->) can help control memory use & startup time
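
    Presumably the fix for fwaccel is to compose the steps with (>->) instead
    of (.); a sketch along those lines, not shown in the deck:

      shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
      shortestPathsAcc n g0 = foldl1 (>->) steps g0
       where
         steps :: [ Acc Graph -> Acc Graph ]
         steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]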

  106. Questions?
    https://github.com/AccelerateHS/


  107. Extra slides…


  108. Stencils
    • A stencil is a map with access to the neighbourhood around each element
    - Useful in many scientific & image processing algorithms

      laplace :: Stencil3x3 Int -> Exp Int
      laplace ((_,t,_)
              ,(l,c,r)
              ,(_,b,_)) = t + b + l + r - 4*c

        t
      l c r
        b

  109. Stencils
    • A stencil is a map with access to the neighbourhood around each element
    - Useful in many scientific & image processing algorithms
    - Boundary conditions specify how to handle out-of-bounds neighbours

      laplace :: Stencil3x3 Int -> Exp Int
      laplace ((_,t,_)
              ,(l,c,r)
              ,(_,b,_)) = t + b + l + r - 4*c

      > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
      > run $ stencil laplace (Constant 0) (use mat)
      Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]

        t
      l c r
        b


  111. Debugging options
    • -dverbose
    • -ddump-sharing
    • -ddump-simpl-stats, -ddump-simpl-iterations
    • -ddump-cc, -ddebug-cc
    • -ddump-exec
    • -ddump-gc
    • -fflush-cache
