Slide 1

A Functional Programming Language for GPUs
Trevor L. McDonell, Utrecht University
AccelerateHS, acceleratehs.org

Slide 2

https://xkcd.com/378/

Slide 3

No content

Slide 4

No content

Slide 5

GPUs: software-programmable caches, data distribution, thread synchronisation, weak memory model, memory access patterns, control-flow divergence, shared-state concurrency.

Slide 6

Diagram: axes Concrete ↔ Abstract and Entangled ↔ Compositional.

Slide 7

Diagram: λ placed on the axes Concrete ↔ Abstract and Entangled ↔ Compositional.

Slide 8

λ: polymorphism & generics, strictly isolating side-effects, higher-order functions & closures, expressive type system & inference, strong static typing, garbage collection, boxed values.
?: memory access patterns, software-programmable caches, thread coordination, data distribution.

Slide 9

Can we have efficient parallel code from a high-level language?

Slide 10

Plot: Performance vs. Effort (axes only).

Slide 11

Plot: Performance vs. Effort, with the "expected" curve.

Slide 12

Plot: Performance vs. Effort, with "expected" and "actual" curves.

Slide 13

Plot: Performance vs. Effort, with "expected", "actual", and "desired" curves.

Slide 14

Plot: Performance vs. Effort, with "expected", "actual", and "desired" curves (as before).

Slide 15

How about embedded languages with specialised code generation?

Slide 16

Accelerate: an embedded language for data-parallel arrays.

Pipeline: Haskell/Accelerate program → reify and optimise the Accelerate program → target code → compile and run on the CPU/GPU → copy the result back to Haskell.
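
This pipeline surfaces directly in user code. A minimal sketch, assuming the reference interpreter backend from the accelerate package (the CPU and GPU backends expose an equivalent run; addOne is an illustrative name, not from the talk):

  import Data.Array.Accelerate             as A
  import Data.Array.Accelerate.Interpreter (run)  -- reference backend

  addOne :: Vector Float -> Vector Float
  addOne xs = run           -- reify, optimise, compile and run the program
            ( A.map (+1)    -- a collective operation over the embedded array
            $ use xs )      -- embed the input array into the program
                            -- the result is copied back as a Haskell-side array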

Slide 17

Example: dot product

dotp xs ys =

Slide 18

Example: dot product

dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

Diagram: zipWith (*) multiplies elements pairwise: 1 2 3 4 ⋮ with 5 6 7 8 ⋮.

Slide 19

Example: dot product

dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

Diagram: fold (+) 0 reduces the resulting array 6 8 10 12 … with + to a single value, starting from 0.

Slide 20

import Prelude

dotp :: Num a => [a] -> [a] -> a
dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)

Slide 21

import Data.Vector.Unboxed

dotp :: (Num a, Unbox a) => Vector a -> Vector a -> a
dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)

Slide 22

import Data.Array.Accelerate

dotp :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
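
For concreteness, a self-contained way to run dotp with the reference interpreter; the backend choice, the monomorphic Float type, and the example values are my own additions:

  import Data.Array.Accelerate             as A
  import Data.Array.Accelerate.Interpreter (run)

  dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
  dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

  main :: IO ()
  main = do
    let xs = fromList (Z :. 4) [1,2,3,4] :: Vector Float
        ys = fromList (Z :. 4) [5,6,7,8] :: Vector Float
    -- prints a one-element (scalar) array containing 70.0
    --   = 1*5 + 2*6 + 3*7 + 4*8
    print (run (dotp (use xs) (use ys)))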

Slide 23

Accelerate: dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

xs, ys :: Acc (Vector Float), arrays from the embedded language (the Accelerate library).
fold and zipWith are collective operations which compile to parallel code.

Slide 24

Accelerate: dotp xs ys = fold (+) 0 (zipWith (*) xs ys), built from collective operations which compile to parallel code.

fold :: (Shape sh, Elt e)
     => (Exp e -> Exp e -> Exp e)   -- Exp: language of sequential, scalar expressions
     -> Exp e
     -> Acc (Array (sh:.Int) e)     -- Acc: language of collective, parallel operations
     -> Acc (Array sh e)            -- rank-polymorphic

To enforce hardware restrictions, nested parallel computation can't be expressed (almost).
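
Rank-polymorphism in action: the very same fold reduces the innermost dimension of an array of any rank. A small sketch (the example and names are mine; Matrix is the library's synonym for a rank-2 array):

  import Data.Array.Accelerate             as A
  import Data.Array.Accelerate.Interpreter (run)

  -- folding a matrix of shape (Z :. rows :. cols) reduces the innermost
  -- dimension, yielding a vector of length rows
  sumRows :: Acc (Matrix Float) -> Acc (Vector Float)
  sumRows = A.fold (+) 0

  main :: IO ()
  main = do
    let m = fromList (Z :. 2 :. 3) [1..6] :: Matrix Float
    print (run (sumRows (use m)))  -- row sums: [6.0, 15.0]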

Slide 25

Array fusion combines successive element-wise operations (a.k.a. loop fusion).

dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

Skeleton #1 (zipWith) and skeleton #2 (fold) are combined into a single operation, eliminating the intermediate array.

Benchmarking (single thread): vector 14.5 ms, accelerate 4.8 ms.
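
To illustrate what fusion buys (a plain-Haskell sketch of the idea, not Accelerate's actual generated code): the unfused program builds the intermediate array of products, while the fused version multiplies and accumulates in a single pass:

  {-# LANGUAGE BangPatterns #-}

  -- unfused: zipWith materialises the intermediate products
  dotpUnfused :: [Float] -> [Float] -> Float
  dotpUnfused xs ys = sum (zipWith (*) xs ys)

  -- fused: one traversal, no intermediate structure
  dotpFused :: [Float] -> [Float] -> Float
  dotpFused = go 0
    where
      go !acc (x:xs) (y:ys) = go (acc + x*y) xs ys
      go !acc _      _      = acc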

Slide 26

LULESH: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics

Slide 27

LULESH. Figure: variables on a staggered mesh; thermodynamic variables are represented at element centres, kinematic variables are represented at nodes (the figure shows a two-dimensional mesh).

In a parallel world, imperative is the wrong default: concurrent writes!

Slide 28

LULESH. Immutable arrays guide us to a more natural parallel solution: node-centric computation. (Figure: the same staggered element/node mesh as before.)
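
A sketch of what node-centric means in Accelerate terms, on a hypothetical 1-D analogue of the staggered mesh (my own illustration, not LULESH itself): instead of each element scattering writes into its neighbouring nodes (racing on shared locations), each node gathers and combines the adjacent element values, so concurrent writes are impossible by construction:

  import Data.Array.Accelerate as A

  -- node i combines the element values on its left and right,
  -- clamping the indices at the mesh boundaries
  nodeValues :: Acc (Vector Float) -> Acc (Vector Float)
  nodeValues elems =
    A.generate (A.shape elems) $ \ix ->
      let i     = unindex1 ix
          n     = unindex1 (A.shape elems)
          left  = elems A.! index1 (A.max 0     (i-1))
          right = elems A.! index1 (A.min (n-1) i)
      in  (left + right) / 2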

Slide 29

LULESH. Accelerate: high-level language, low-level performance.

Plot: speedup vs. reference @ 1 thread (0 to 10) against # threads (1 to 12), comparing Accelerate and OpenMP. Measured on an i7-6700K @ 3.4GHz / GTX 1080Ti.

Slide 30

LULESH. Accelerate: high-level language, low-level performance.

                      Lines of Code   Runtime @ 64³ (s)
    C (OpenMP)        2400            64
    CUDA              3000            5.2
    Accelerate (CPU)  1200            38
    Accelerate (GPU)  +0              4.1

Measured on an i7-6700K @ 3.4GHz / GTX 1080Ti.

Slide 31

Summary: abstraction also means that the compiler has more information, so we can leverage these abstractions to help guide program design and to generate efficient parallel code.

Slide 32

acceleratehs.org
https://github.com/AccelerateHS/

Trevor L. McDonell, Robert Clifton-Everest, Manuel M. T. Chakravarty, Josh Meredith, Gabriele Keller, Ben Lippmeier

Slide 33

Image attribution:
https://flic.kr/p/XcAjn3
https://xkcd.com/378
https://commons.wikimedia.org/wiki/File:Motorola_6800_Assembly_Language.png
https://commons.wikimedia.org/wiki/File:FortranCardPROJ039.agr.jpg
https://commons.wikimedia.org/wiki/File:Set_square_Geodreieck.svg