
A Functional Programming language for GPUs


Presented at the 5th NIRICT Workshop on GPU Computing Research in the Netherlands
http://fmttools.ewi.utwente.nl/NIRICT_GPGPU/events.html

Graphics processing units (GPUs), while primarily designed for the efficient rendering of computer graphics, are increasingly seeing their highly parallel architectures used to tackle demanding computational problems in many non-graphics domains. However, GPU applications typically need to be programmed at a very low level, and the specialised hardware requires expert knowledge to be used effectively. These barriers make it difficult for domain scientists to leverage GPUs in their applications without first becoming GPU programming experts.

This talk discusses our work on the programming language _Accelerate_, in which computations are expressed in a high-level functional style yet compile down to efficient low-level GPU code. While high-level programming abstractions are typically viewed as a barrier to high-performance code, used correctly they can instead be leveraged to guide the user towards efficient parallel implementations of their programs.

Trevor L. McDonell

December 04, 2018



Transcript

  1. A Functional Programming Language for GPUs
    Trevor L. McDonell
    Utrecht University
    AccelerateHS
    acceleratehs.org


  2. https://xkcd.com/378/


  3. (image slide)

  4. (image slide)

  5. GPUs
    software programmable caches
    data distribution
    thread synchronisation
    weak memory model
    memory access patterns
    control flow divergence
    shared-state concurrency

  6. Concrete
    λ
    Abstract
    Compositional Entangled


  7. λ Concrete
    Abstract
    Compositional Entangled


  8. λ
    Polymorphism & generics
    Strictly isolating side-effects
    Higher-order functions & closures
    Expressive type system & inference
    Strong static typing
    Garbage collection
    Boxed values
    ?
    Memory access patterns
    Software programmable caches
    Thread coordination
    Data distribution

  9. Can we have efficient parallel code from a high-level language?

  10. (graph: Performance vs Effort)

  11. (graph: Performance vs Effort, showing the expected curve)

  12. (graph: Performance vs Effort, showing the expected and actual curves)

  13. (graph: Performance vs Effort, showing the expected, actual, and desired curves)

  14. (graph: Performance vs Effort, showing the expected, actual, and desired curves)

  15. How about embedded languages with specialised code generation?

  16. Accelerate
    An embedded language for data-parallel arrays
    Pipeline: Haskell/Accelerate program -> Reify and optimise Accelerate program -> Target code -> Compile and run on the CPU/GPU -> Copy result back to Haskell
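For concreteness, here is a minimal sketch of this pipeline from the Haskell side. It assumes the accelerate package plus the accelerate-llvm-native backend (the PTX backend targets the GPU in the same way); the example itself is illustrative and not taken from the talk.

    import Data.Array.Accelerate                       as A
    import qualified Data.Array.Accelerate.LLVM.Native as CPU

    -- 'use' embeds a plain Haskell array into the embedded language; 'run'
    -- reifies and optimises the Accelerate program, compiles it, executes
    -- it, and copies the result back to Haskell.
    doubled :: Vector Float
    doubled = CPU.run $ A.map (* 2) (use xs)
      where
        xs = fromList (Z :. 10) [0 .. 9] :: Vector Float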

  17. Example: dot product
    dotp xs ys =

  18. Example: dot product
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    (diagram: zipWith (*) pairs the elements of the two input vectors, shown as 1 2 3 4 and 5 6 7 8, and multiplies them pointwise)

  19. Example: dot product
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    (diagram: fold (+) 0 sums the intermediate values 6 8 10 12 … together with the initial value 0)

  20. import Prelude
    dotp :: Num a
    => [a] -> [a] -> a
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)


  21. import Data.Vector.Unboxed
    dotp :: (Num a, Unbox a)
    => Vector a
    -> Vector a
    -> a
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

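For reference, runnable forms of these two host-language versions. This is an editorial sketch: neither Prelude nor Data.Vector.Unboxed exports a fold of the shape shown on the slides, so foldr and a strict foldl' stand in for it.

    import qualified Data.Vector.Unboxed as U

    -- Plain list version: foldr plays the role of 'fold'.
    dotpList :: Num a => [a] -> [a] -> a
    dotpList xs ys = foldr (+) 0 (zipWith (*) xs ys)

    -- Unboxed vector version: a strict left fold over the zipped products.
    dotpVector :: (Num a, U.Unbox a) => U.Vector a -> U.Vector a -> a
    dotpVector xs ys = U.foldl' (+) 0 (U.zipWith (*) xs ys)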

  22. import Data.Array.Accelerate
    dotp :: (Num a, Elt a)
    => Acc (Vector a)
    -> Acc (Vector a)
    -> Acc (Scalar a)
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)


  23. Accelerate
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    xs, ys :: Acc (Vector Float)    (embedded language arrays)
    fold, zipWith: collective operations from the Accelerate library which compile to parallel code
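As a usage sketch, assuming the dotp of the previous slides and again the accelerate-llvm-native backend: the input vectors are embedded with use, and the whole collective computation is executed in one step.

    import Data.Array.Accelerate                       as A
    import qualified Data.Array.Accelerate.LLVM.Native as CPU

    -- Embed two host vectors and run the dot product; the result is a
    -- one-element (Scalar) array copied back to Haskell.
    example :: Scalar Float
    example = CPU.run $ dotp (use xs) (use ys)
      where
        xs = fromList (Z :. 4) [1, 2, 3, 4] :: Vector Float
        ys = fromList (Z :. 4) [5, 6, 7, 8] :: Vector Float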

  24. Accelerate
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    Collective operations which compile to parallel code:
    fold :: (Shape sh, Elt e)
         => (Exp e -> Exp e -> Exp e)
         -> Exp e
         -> Acc (Array (sh:.Int) e)
         -> Acc (Array sh e)
    Exp: language of sequential, scalar expressions
    Acc: language of collective, parallel operations
    rank-polymorphic (the shape sh)
    To enforce hardware restrictions, nested parallel computation can't be expressed (almost)
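To illustrate the rank-polymorphic shape variable sh: fold always reduces along the innermost (rightmost) dimension, so the same combining function works at every rank. A small sketch with invented names:

    import Data.Array.Accelerate as A

    -- Folding a vector yields a scalar; folding a matrix along its
    -- innermost dimension yields a vector of row sums; and so on.
    total :: Acc (Vector Float) -> Acc (Scalar Float)
    total = A.fold (+) 0

    rowSums :: Acc (Matrix Float) -> Acc (Vector Float)
    rowSums = A.fold (+) 0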

  25. Array fusion
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    Combines successive element-wise operations (a.k.a. loop fusion):
    Skeleton #1 and Skeleton #2, with an intermediate array between them, become a single combined operation
    benchmarking (single thread):
      vector      14.5 ms
      accelerate   4.8 ms
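A rough sequential sketch of what fusion buys here (illustrative only; the real backends generate parallel LLVM and PTX code): the intermediate array produced by zipWith is never materialised, and each pair of elements is multiplied and accumulated in a single pass.

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.Vector.Unboxed as U

    -- Hand-fused form of fold (+) 0 (zipWith (*) xs ys): one loop, no
    -- intermediate vector of products.
    dotpFused :: U.Vector Float -> U.Vector Float -> Float
    dotpFused xs ys = go 0 0
      where
        n = min (U.length xs) (U.length ys)
        go !i !acc
          | i >= n    = acc
          | otherwise = go (i + 1) (acc + U.unsafeIndex xs i * U.unsafeIndex ys i)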

  26. LULESH
    Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics


  27. LULESH
    (figure: variables on a staggered mesh; thermodynamic variables are represented at element centres, kinematic variables at nodes)
    In a parallel world, imperative is the wrong default: concurrent writes!

  28. LULESH
    Immutable arrays guide us to a more natural parallel solution: node-centric computation
    (figure: the same staggered mesh; thermodynamic variables at element centres, kinematic variables at nodes)
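A toy sketch of the node-centric, gather-style formulation this suggests, in one dimension rather than on the real LULESH mesh (the function and its neighbour rule are invented for illustration): with n elements and n+1 nodes, each node reads the values of its adjacent elements, so no two parallel threads ever write to the same location.

    import Data.Array.Accelerate as A

    -- Each node gathers from its (at most two) neighbouring elements,
    -- instead of each element scattering into, and racing on, the nodes
    -- it touches.
    nodeSums :: Acc (Vector Float) -> Acc (Vector Float)
    nodeSums elems =
      let n = A.size elems
      in  generate (index1 (n + 1)) $ \ix ->
            let i     = unindex1 ix
                left  = i A.> 0 A.? (elems A.! index1 (i - 1), 0)
                right = i A.< n A.? (elems A.! index1 i,       0)
            in  left + right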

  29. LULESH
    Accelerate: high-level language, low-level performance
    (plot: speedup vs. reference @ 1 thread, over 1 to 12 threads, for Accelerate and OpenMP)
    i7-6700K @ 3.4GHz / GTX 1080Ti

  30. LULESH
    Accelerate: high-level language, low-level performance
                          Lines of Code    Runtime @ 64³ (s)
    C (OpenMP)                     2400                   64
    CUDA                           3000                  5.2
    Accelerate (CPU)               1200                   38
    Accelerate (GPU)                 +0                  4.1
    i7-6700K @ 3.4GHz / GTX 1080Ti

  31. Summary
    Abstraction also means that the compiler has more information,
    so we can leverage these abstractions
    to help guide program design
    and generate efficient parallel code

  32. acceleratehs.org
    https://github.com/AccelerateHS/
    Trevor L. McDonell
    Robert Clifton-Everest
    Manuel M. T. Chakravarty
    Josh Meredith
    Gabriele Keller
    Ben Lippmeier


  33. Image attribution
    https://flic.kr/p/XcAjn3
    https://xkcd.com/378
    https://commons.wikimedia.org/wiki/File:Motorola_6800_Assembly_Language.png
    https://commons.wikimedia.org/wiki/File:FortranCardPROJ039.agr.jpg
    https://commons.wikimedia.org/wiki/File:Set_square_Geodreieck.svg
