Functional High-Performance Computing

Trevor L. McDonell, Utrecht University. AccelerateHS: acceleratehs.org

All modern processors have multiple cores: Ryzen 7 1800X (8 cores), 4.8B transistors; A12 Bionic (2+4 cores), 6.9B transistors; GTX 1080 (2560 cores*), 7.2B transistors.

GPU (graphics processing unit), used for: medical imaging, data science, weather & climate, bioinformatics, computational chemistry, machine learning.
GTX 1080 (2560 cores*): software programmable caches, data distribution, thread synchronisation, memory access patterns, control flow divergence.

[Figure: performance against effort, with curves labelled expected, actual, and desired]
"After expressing available parallelism, I often find that the code has slowed down." (Jeff Larkin, NVIDIA Developer Technology, https://devblogs.nvidia.com/getting-started-openacc/)

Can we have parallel programming with less effort?

Part 1: Thinking in parallel

    for (int i = 0; i < length; ++i) {
        // do something (in parallel)
    }

[Image: multithreaded programming, theory and practice]

Why is this difficult? Concurrency: multiple interleaved threads of control; all threads have effects on the world; non-determinism and concurrency control.

Data parallelism. Instead of unrestricted concurrency, let’s simplify: the same operation is applied to different data. This abstracts over concurrency control, abstracts over indeterminism, and is great for developers and hardware.

Energy efficiency

Flat data-parallelism

Nested data-parallelism

Amorphous data-parallelism

[Figure: expressiveness of parallelism against expressiveness of computation. Labels: embedded, native; flat, nested, amorphous. Languages shown: Repa, Futhark, Lift, Data-Parallel Haskell, Nessie, Accelerate]

Accelerate: an embedded language for data-parallel arrays. Pipeline: Haskell/Accelerate program; reify and optimise Accelerate program; target code; compile and run on the CPU/GPU; copy result back to Haskell.

Example: vector dot product
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
zipWith (*) multiplies the two vectors element by element: [1, 2, 3, 4, …] and [5, 6, 7, 8, …] give [5, 12, 21, 32, …]. fold (+) 0 then adds up those products, starting from 0.

Example: vector dot product (lists)
    import Prelude

    dotp :: Num a => [a] -> [a] -> a
    dotp xs ys = foldr (+) 0 (zipWith (*) xs ys)  -- Prelude has no fold; foldr plays that role

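Example: vector dot product (unboxed vectors)
    import Prelude hiding (foldr, zipWith)  -- avoid clashes with the vector versions
    import Data.Vector.Unboxed

    dotp :: (Num a, Unbox a) => Vector a -> Vector a -> a
    dotp xs ys = foldr (+) 0 (zipWith (*) xs ys)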

Example: vector dot product (Accelerate arrays)
    import Prelude ()  -- Accelerate exports its own Num, zipWith, and so on
    import Data.Array.Accelerate

    dotp :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
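Why a fold can run in parallel: Accelerate's documentation asks that the operator given to fold be associative, which lets the reduction proceed pairwise instead of strictly left to right. The following is a minimal sketch of that idea on ordinary lists, not Accelerate's implementation; treeFold and dotpTree are hypothetical names introduced here for illustration.

```haskell
-- Pairwise (tree-shaped) reduction over a list.
-- Assumes f is associative; the initial value z is combined once at the end.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z [] = z
treeFold f z xs = f z (go xs)
  where
    -- Keep halving the list until a single value remains.
    go [y] = y
    go ys  = go (pairUp ys)

    -- One round: combine neighbouring elements. Each application of f
    -- within a round is independent of the others.
    pairUp (a : b : rest) = f a b : pairUp rest
    pairUp rest           = rest

-- Dot product on lists, using the tree-shaped reduction.
dotpTree :: Num a => [a] -> [a] -> a
dotpTree xs ys = treeFold (+) 0 (zipWith (*) xs ys)
```

For an associative operator such as (+) on integers, dotpTree agrees with the sequential list version; the difference is only the order in which elements are combined, and the combinations inside one round of pairUp could run at the same time.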

Thinking in parallel: computers are good at operating on bulk data, not on single elements. Restrictions can guide the programmer into writing an efficient parallel program. Parallel programming and functional programming are a natural fit.

Part 2: Make it work

@jasper.samoyed

Show, don’t tell https://github.com/AccelerateHS/accelerate-examples

LULESH: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. https://github.com/tmcdonell/lulesh-accelerate

LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics)
Implementation      Lines of code   Runtime @ 64^3 (s)
C (OpenMP)          2400            64
CUDA                3000            5.2
Accelerate (CPU)    1200            38
Accelerate (GPU)    ±1              4.1
Hardware: i7-6700K @ 3.4GHz / GTX 1080 Ti

Salt marsh creek formation https://github.com/tmcdonell/spatial-ecology-accelerate

Salt marsh creek formation. [Figure: elapsed time (s), log scale from 1 to 1000, against grid size from 512 to 8192, for Python+OpenCL and Accelerate on a GTX 1080 Ti] 7x faster, 2x fewer lines of code.

Use real examples as a working laboratory

Motivating examples: validate and test your ideas; improve performance; what is good about this, what is bad about it?; as a basis for future work. Can we take what’s good as a seed and make it into something that is better?

Part 3: What’s next?

The tower of abstraction

Sequence of abstraction: machine language; assembly language; Fortran / C / C++; C# / Haskell / JavaScript. “I am working at a higher level; being smart, saving effort!”

    mean :: [Double] -> Double
    mean xs = sum xs / fromIntegral (length xs)
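One cost this definition hides (a reading of the slide, not something it states): sum and length each traverse xs, and because both share the list, it has to stay in memory between the two passes. Below is a hedged sketch of a single-pass alternative using a strict left fold; meanOnePass is a hypothetical name introduced here.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Mean in one traversal: carry the running total and the element count
-- together, forcing both at every step so no chain of thunks builds up.
meanOnePass :: [Double] -> Double
meanOnePass xs = total / fromIntegral count
  where
    (total, count) = foldl' step (0, 0 :: Int) xs
    step (!s, !n) x = (s + x, n + 1)
```

Like the original, this returns NaN for an empty list. The point of the comparison is that the two definitions look equally high-level, yet behave very differently in memory.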

    data T = MkT Int Int      -- in memory: MkT holds two pointers, each to a boxed I# Int#
    data T = MkT !Int !Int    -- in memory: MkT Int# Int#
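The bang in !Int marks a field as strict: it is evaluated when the constructor is built, and that is what allows GHC (with optimisation, or with an UNPACK pragma) to store the raw Int# directly in the constructor. A small sketch of the observable difference follows; the types Lazy and Strict and the helper throws are hypothetical names introduced here.

```haskell
import Control.Exception (SomeException, evaluate, try)

data Lazy   = MkLazy   Int  Int   -- fields may be unevaluated thunks
data Strict = MkStrict !Int !Int  -- fields are evaluated on construction

sndLazy :: Lazy -> Int
sndLazy (MkLazy _ y) = y

sndStrict :: Strict -> Int
sndStrict (MkStrict _ y) = y

-- True if evaluating the argument to weak head normal form throws.
throws :: a -> IO Bool
throws x = do
  r <- try (evaluate x)
  case r of
    Left e  -> const (pure True) (e :: SomeException)
    Right _ -> pure False
```

With these definitions, sndLazy (MkLazy undefined 2) returns 2 because the first field is never looked at, while sndStrict (MkStrict undefined 2) throws because building the value forces both fields.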

    data Word8x4 = Word8x4 !Word8 !Word8 !Word8 !Word8
In memory: Word8x4 Word8# Word8# Word8# Word8#
Size in memory = ? = header + 4 * 8 bytes
∴ in Haskell: 4 * 8 = 256 . . . (four 8-bit fields occupy 4 * 8 bytes = 256 bits)

    data T a = MkT a                     -- in memory: MkT pointing to a
What if I need to know whether ‘a’ has been computed?
    data T a = MkT (IORef (Maybe a))     -- in memory: MkT pointing to an IORef holding Nothing or Just a
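A minimal sketch of how the IORef (Maybe a) representation could be used to make "has this been computed?" an explicit question. The helpers newT, isComputed, and getOrCompute are hypothetical names introduced here for illustration; they are not taken from the talk.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Maybe (isJust)

data T a = MkT (IORef (Maybe a))

-- Start with no value computed yet.
newT :: IO (T a)
newT = MkT <$> newIORef Nothing

-- Ask whether a value has been stored, without computing anything.
isComputed :: T a -> IO Bool
isComputed (MkT ref) = isJust <$> readIORef ref

-- Return the stored value, or run the action once, store and return its result.
getOrCompute :: T a -> IO a -> IO a
getOrCompute (MkT ref) compute = do
  cached <- readIORef ref
  case cached of
    Just v  -> pure v
    Nothing -> do
      v <- compute
      writeIORef ref (Just v)
      pure v
```

The trade-off is the one the slide points at: the evaluation state becomes visible, but only by leaving pure code for IO and managing by hand what laziness otherwise does implicitly.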

Loss of capability: can no longer program in assembly; don’t know how values are stored in memory; don’t know what the CPU is doing. The rhetoric is “I shouldn’t have to”, but the flip side is the loss of ability to

Functional high-performance computing. [Figure: performance against effort, with curves labelled expected, actual, and desired, and a further point marked "?"]

Functional high-performance computing. Real-world examples as a laboratory to develop: new features, test performance, … Working at a high level is good, but this also entails a loss of capability. Reality exists at the low level.

Functional high-performance computing. Functional programming languages provide the right set of abstractions. Instead of: climbing the tower of abstraction. Better idea?: feet on the ground; reach for the heavens.

acceleratehs.org, https://github.com/AccelerateHS/
Trevor L. McDonell, Robert Clifton-Everest, Manuel M. T. Chakravarty, Josh Meredith, Gabriele Keller, Ben Lippmeier

Image attribution
Logo designed by Tina Lam, http://instagram.com/tinabarbarina
https://en.wikipedia.org/wiki/Waterman_butterfly_projection
https://www.instagram.com/p/BbTjiebnaw1
http://book.realworldhaskell.org/read/profiling-and-optimization.html
https://researchinprogress.tumblr.com/post/34088637501/fast-vs-exact-solutions
https://researchinprogress.tumblr.com/post/34627563943/when-somebody-mixes-up-causality-and-correlation
https://researchinprogress.tumblr.com/post/32886698944/how-is-your-research-useful
https://en.wikipedia.org/wiki/Tower_of_Babel
https://www.art.com/products/p46922818644-sa-i10543606/paul-souders-polar-bear-swimming-past-melting-iceberg-near-harbor-islands-canada.htm
http://unseasonably.blogspot.com/2014/03/a-tangled-ball-of-yarn.html
https://www.reddit.com/r/aww/comments/2oagj8/multithreaded_programming_theory_and_practice/
