
A Functional Programming language for GPUs

Presented at the 5th NIRICT Workshop on GPU Computing Research in the Netherlands
http://fmttools.ewi.utwente.nl/NIRICT_GPGPU/events.html

Graphics processing units (GPUs), while primarily designed to support the efficient rendering of computer graphics, are increasingly finding their highly parallel architectures being used to tackle demanding computational problems in many non-graphics domains. However, GPU applications typically need to be programmed at a very low level, and the specialised hardware requires expert knowledge to use effectively. These barriers make it difficult for domain scientists to leverage GPUs in their applications without first becoming GPU programming experts.

This talk discusses our work on the programming language _Accelerate_, in which computations are expressed in a high-level functional style, yet compile down to efficient low-level GPU code. While high-level programming abstractions are typically viewed as a barrier to high-performance code, used correctly we can instead leverage these abstractions to guide the user towards efficient parallel implementations of their programs.

Trevor L. McDonell

December 04, 2018

Transcript

  1. GPUs: software-programmable caches, data distribution, thread synchronisation, weak memory model, memory access patterns, control flow divergence, shared-state concurrency
  2. λ (what functional languages offer): polymorphism & generics, strictly isolating side-effects, higher-order functions & closures, expressive type system & inference, strong static typing, garbage collection, boxed values. ? (what GPUs demand): memory access patterns, software-programmable caches, thread coordination, data distribution
  3. Accelerate: an embedded language for data-parallel arrays. [Pipeline: Haskell/Accelerate program → reify and optimise the Accelerate program → target code → compile and run on the CPU/GPU → copy the result back to Haskell] (a runnable sketch of this pipeline follows)
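    To make the pipeline concrete, a minimal sketch, assuming the accelerate package together with its LLVM-based CPU backend (accelerate-llvm-native): `use` embeds an ordinary Haskell array into the embedded language, and `run` reifies, optimises, compiles and executes the program before copying the result back to Haskell:

    ```haskell
    import Data.Array.Accelerate             as A
    import Data.Array.Accelerate.LLVM.Native ( run )

    main :: IO ()
    main =
      let xs = fromList (Z :. 10) [0..] :: Vector Float
      in  print (run (A.map (+ 1) (use xs)))  -- compiled and executed on the CPU
    ```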
  4. Example: dot product

     dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

     [Diagram: two arrays (1 2 3 4 ⋮ and 5 6 7 8 ⋮) multiplied elementwise: * * * *]
  5. Example: dot product

     dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

     [Diagram: an array of values (6 8 10 12 …) is combined in a tree of + operations together with the initial value 0] (a worked example follows)
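    As a plain-Haskell sanity check of the two steps these diagrams depict (input values chosen here for illustration; GHCi session):

    ```haskell
    -- zipWith multiplies elementwise; the fold then sums the products:
    -- >>> zipWith (*) [1,2,3,4] [5,6,7,8]
    -- [5,12,21,32]
    -- >>> foldl (+) 0 (zipWith (*) [1,2,3,4] [5,6,7,8])
    -- 70
    ```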
  6. import Prelude

     dotp :: Num a => [a] -> [a] -> a
     dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)
  7. import Data.Vector.Unboxed

     dotp :: (Num a, Unbox a) => Vector a -> Vector a -> a
     dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)
  8. import Data.Array.Accelerate

     dotp :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
     dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
  9. Accelerate

     dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

     xs, ys :: Acc (Vector Float) are embedded-language arrays from the Accelerate library; fold and zipWith are collective operations which compile to parallel code. (A complete runnable version follows.)
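    Putting the pieces together, a minimal complete sketch, assuming the reference interpreter backend that ships with the accelerate package (swap in a real backend such as accelerate-llvm-ptx to target the GPU):

    ```haskell
    import Data.Array.Accelerate             as A
    import Data.Array.Accelerate.Interpreter ( run )

    dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
    dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

    main :: IO ()
    main =
      let xs = fromList (Z :. 4) [1,2,3,4] :: Vector Float
          ys = fromList (Z :. 4) [5,6,7,8] :: Vector Float
      in  print (run (dotp (use xs) (use ys)))  -- the dot product is 70.0
    ```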
  10. Accelerate

      dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

      fold :: (Shape sh, Elt e)
           => (Exp e -> Exp e -> Exp e)  -- Exp: language of sequential, scalar expressions
           -> Exp e
           -> Acc (Array (sh:.Int) e)    -- Acc: language of collective, parallel operations
           -> Acc (Array sh e)           -- rank-polymorphic

      To enforce hardware restrictions, nested parallel computations can't be expressed (almost). (A rank-polymorphism sketch follows.)
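    A small illustration of the rank-polymorphism (sumRows is a hypothetical name, not from the talk): the same fold that reduced a vector to a scalar also reduces a matrix to its vector of row sums, because fold always reduces along the innermost dimension:

    ```haskell
    import Data.Array.Accelerate as A

    -- The same fold at one rank higher: Array DIM2 -> Array DIM1
    sumRows :: Acc (Array DIM2 Float) -> Acc (Vector Float)
    sumRows = A.fold (+) 0
    ```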
  11. dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

      Array fusion combines successive element-wise operations (a.k.a. loop fusion): rather than Skeleton #1 (zipWith) writing an intermediate array that Skeleton #2 (fold) then reads back, the two skeletons are fused into a combined operation. (An illustrative sketch follows.)

      benchmarking (single thread): vector 14.5 ms, accelerate 4.8 ms
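    An illustrative plain-Haskell sketch of the effect of fusion on this program (names hypothetical; Accelerate performs the transformation on its internal array representation, not on lists):

    ```haskell
    {-# LANGUAGE BangPatterns #-}

    -- Unfused: zipWith materialises an intermediate list of products,
    -- which the fold then consumes.
    dotpUnfused :: Num a => [a] -> [a] -> a
    dotpUnfused xs ys = foldl (+) 0 (zipWith (*) xs ys)

    -- Fused: a single pass multiplies and accumulates as it goes,
    -- with no intermediate structure.
    dotpFused :: Num a => [a] -> [a] -> a
    dotpFused = go 0
      where
        go !acc (x:xs) (y:ys) = go (acc + x * y) xs ys
        go !acc _      _      = acc
    ```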
  12. LULESH

      [Figure: element & node variables on a staggered mesh. Thermodynamic variables are represented at elements; kinematic variables are represented at nodes. The figure shows a two-dimensional analogue.]

      In a parallel world, imperative is the wrong default: concurrent writes!
  13. LULESH

      Immutable arrays guide us to a more natural parallel solution: node-centric computation. (A sketch of the idea follows.)

      [Figure: element & node variables on the staggered mesh, as before.]
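    A hedged sketch of the inversion being described, on a hypothetical one-dimensional mesh (all names illustrative, not from LULESH): instead of each element scattering contributions into its endpoint nodes, which produces concurrent writes, each node gathers from the elements on either side of it, so every write is independent:

    ```haskell
    import Data.Array.Accelerate as A

    -- Node-centric gather: node i reads elements i-1 and i (where they
    -- exist); no two parallel threads ever write to the same node.
    nodeValues :: Acc (Vector Float) -> Acc (Vector Float)
    nodeValues elems =
      let n = A.length elems
      in  A.generate (index1 (n + 1)) $ \ix ->
            let i     = unindex1 ix
                left  = (i A.> 0) ? (elems ! index1 (i - 1), 0)
                right = (i A.< n) ? (elems ! index1 i,       0)
            in  left + right
    ```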
  14. LULESH. Accelerate: high-level language, low-level performance.

      [Plot: speedup vs. reference @ 1 thread (0 to 10) against # threads (1 to 12), comparing Accelerate and OpenMP. i7-6700K @ 3.4GHz / GTX 1080Ti]
  15. LULESH. Accelerate: high-level language, low-level performance.

                         Lines of Code    Runtime @ 64³ (s)
      C (OpenMP)         2400             64
      CUDA               3000             5.2
      Accelerate (CPU)   1200             38
      Accelerate (GPU)   +0               4.1

      i7-6700K @ 3.4GHz / GTX 1080Ti
  16. Summary: abstraction also means that the compiler has more information, so we can leverage these abstractions to help guide program design and generate efficient parallel code.