A Functional Programming Language for GPUs

Presented at the 5th NIRICT Workshop on GPU Computing Research in the Netherlands
http://fmttools.ewi.utwente.nl/NIRICT_GPGPU/events.html

Graphics processing units (GPUs) are designed primarily for the efficient rendering of computer graphics, but their highly parallel architectures are increasingly used to tackle demanding computational problems in many non-graphics domains. However, GPU applications typically need to be programmed at a very low level, and the specialised hardware requires expert knowledge to be used effectively. These barriers make it difficult for domain scientists to leverage GPUs in their applications without first becoming GPU programming experts.

This talk discusses our work on the programming language _Accelerate_, in which computations are expressed in a high-level functional style yet compile down to efficient low-level GPU code. While high-level programming abstractions are typically seen as a barrier to high-performance code, used correctly they can instead guide the user towards efficient parallel implementations of their programs.
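To make this concrete, here is a minimal, self-contained sketch of the dot-product example developed in the talk. It assumes the accelerate package and, for simplicity, its reference interpreter backend; the main function and test values are ours, and in practice one would swap run for a CPU or GPU backend such as accelerate-llvm-native or accelerate-llvm-ptx.

    import Data.Array.Accelerate             as A
    import Data.Array.Accelerate.Interpreter (run)   -- reference backend; swap for a CPU/GPU backend

    -- Dot product written with collective array operations; the same
    -- source compiles unchanged across Accelerate's backends.
    dotp :: A.Vector Float -> A.Vector Float -> A.Scalar Float
    dotp xs ys = run $ A.fold (+) 0 (A.zipWith (*) (A.use xs) (A.use ys))

    main :: IO ()
    main = do
      let xs = A.fromList (Z :. 4) [1, 2, 3, 4]
          ys = A.fromList (Z :. 4) [5, 6, 7, 8]
      print (dotp xs ys)   -- 1*5 + 2*6 + 3*7 + 4*8 = 70

Here use embeds a host array into the embedded language, and the fold and zipWith are Accelerate's collective array operations, not the list versions from the Prelude.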


Trevor L. McDonell

December 04, 2018

Transcript

  1. A Functional Programming Language for GPUs. Trevor L. McDonell, Utrecht University. AccelerateHS: acceleratehs.org
  2. https://xkcd.com/378/

  3. (image-only slide)
  4. (image-only slide)
  5. GPUs: software-programmable caches, data distribution, thread synchronisation, weak memory model, memory access patterns, control-flow divergence, shared-state concurrency
  6. (diagram: axes Concrete/Abstract and Compositional/Entangled, with λ placed on them)

  7. (diagram: the same axes, with λ repositioned)

  8. λ: Polymorphism & generics, strictly isolating side-effects, higher-order functions & closures, expressive type system & inference, strong static typing, garbage collection, boxed values ? Memory access patterns, software-programmable caches, thread coordination, data distribution
  9. Can we have efficient parallel code from a high-level language?

  10. (plot: Performance vs. Effort)

  11. (plot: Performance vs. Effort; curve: expected)

  12. (plot: Performance vs. Effort; curves: expected, actual)

  13. (plot: Performance vs. Effort; curves: expected, actual, desired)

  14. (plot: Performance vs. Effort; curves: expected, actual, desired)

  15. How about embedded languages with specialised code generation?

  16. Accelerate: an embedded language for data-parallel arrays. Pipeline: Haskell/Accelerate program → reify and optimise the Accelerate program → target code → compile and run on the CPU/GPU → copy result back to Haskell
  17. Example: dot product. dotp xs ys =

  18. Example: dot product. dotp xs ys = fold (+) 0 (zipWith (*) xs ys) (diagram: zipWith (*) pairs the elements 1 2 3 4 … with 5 6 7 8 …)
  19. Example: dot product. dotp xs ys = fold (+) 0 (zipWith (*) xs ys) (diagram: fold combines the intermediate values 6 8 10 12 … with (+) and initial value 0)
  20. import Prelude
      dotp :: Num a => [a] -> [a] -> a
      dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
  21. import Data.Vector.Unboxed
      dotp :: (Num a, Unbox a) => Vector a -> Vector a -> a
      dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
  22. import Data.Array.Accelerate
      dotp :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
      dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
  23. Accelerate
      dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
      xs, ys :: Acc (Vector Float) (embedded-language arrays from the Accelerate library)
      fold, zipWith: collective operations which compile to parallel code
  24. Accelerate
      dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
      Collective operations which compile to parallel code:
      fold :: (Shape sh, Elt e)
           => (Exp e -> Exp e -> Exp e)   -- Exp: language of sequential, scalar expressions
           -> Exp e
           -> Acc (Array (sh:.Int) e)     -- Acc: language of collective, parallel operations
           -> Acc (Array sh e)            -- rank-polymorphic (see the fold sketch after the transcript)
      To enforce hardware restrictions, nested parallel computations can't be expressed (almost).
  25. Array fusion: combines successive element-wise operations (a.k.a. loop fusion), so
      dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
      becomes a single combined operation (skeleton #1 + skeleton #2, no intermediate array).
      Benchmark (single thread): vector 14.5 ms, accelerate 4.8 ms
  26. LULESH: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics

  27. LULESH. (figure: element and node variables on a staggered mesh; thermodynamic variables are represented at element centers, kinematic variables at nodes; the figure shows a two-dimensional mesh.) In a parallel world, imperative is the wrong default: concurrent writes!
  28. LULESH. Immutable arrays guide us to a more natural parallel solution: node-centric computation. (figure: the same staggered mesh as before)
  29. LULESH. Accelerate: high-level language, low-level performance. (plot: speedup vs. reference @ 1 thread, for 1 to 12 threads, comparing Accelerate and OpenMP; i7-6700K @ 3.4GHz / GTX 1080Ti)
  30. LULESH. Accelerate: high-level language, low-level performance (i7-6700K @ 3.4GHz / GTX 1080Ti):

                        Lines of code   Runtime @ 64³ (s)
      C (OpenMP)        2400            64
      CUDA              3000            5.2
      Accelerate (CPU)  1200            38
      Accelerate (GPU)  +0              4.1
  31. Summary: abstraction also means that the compiler has more information, so we can leverage these abstractions to help guide program design and generate efficient parallel code.
  32. acceleratehs.org, https://github.com/AccelerateHS/. Trevor L. McDonell, Robert Clifton-Everest, Manuel M. T. Chakravarty, Josh Meredith, Gabriele Keller, Ben Lippmeier
  33. Image attribution https://flic.kr/p/XcAjn3 https://xkcd.com/378 https://commons.wikimedia.org/wiki/File:Motorola_6800_Assembly_Language.png https://commons.wikimedia.org/wiki/File:FortranCardPROJ039.agr.jpg https://commons.wikimedia.org/wiki/File:Set_square_Geodreieck.svg
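Supplementing slide 24, here is a minimal sketch of fold's rank-polymorphism, under the same assumptions as the earlier dot-product sketch (interpreter backend; the rowSums name and test matrix are ours). Because fold reduces along the innermost dimension of its argument, the same operation that collapses a vector to a scalar collapses a matrix to a vector of per-row results.

    import Data.Array.Accelerate             as A
    import Data.Array.Accelerate.Interpreter (run)

    -- fold removes the innermost dimension: reducing a rank-2 array
    -- yields a rank-1 array, one parallel reduction per row.
    rowSums :: A.Matrix Float -> A.Vector Float
    rowSums m = run $ A.fold (+) 0 (A.use m)

    main :: IO ()
    main = do
      let m = A.fromList (Z :. 2 :. 3) [1 .. 6] :: A.Matrix Float
      print (rowSums m)   -- rows [1,2,3] and [4,5,6] give sums 6.0 and 15.0

This is the sense in which nested parallelism is (almost) unnecessary here: a single rank-polymorphic fold already expresses many independent reductions running in parallel.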