
Optimising Purely Functional GPU Programs

Trevor L. McDonell
September 25, 2013


Presented at ICFP 2013: http://www.icfpconference.org/icfp2013/
Paper: https://github.com/tmcdonell/tmcdonell.github.io/raw/master/papers/acc-optim-icfp2013.pdf

Purely functional, embedded array programs are a good match for SIMD hardware, such as GPUs. However, the naive compilation of such programs quickly leads to both code explosion and an excessive use of intermediate data structures. The resulting slowdown is not acceptable on target hardware that is usually chosen to achieve high performance.

In this paper, we present two optimisation techniques, sharing recovery and array fusion, that tackle code explosion and eliminate superfluous intermediate structures. Both techniques are well known from other contexts, but they present unique challenges for an embedded language compiled for execution on a GPU. We present novel methods for implementing sharing recovery and array fusion, and demonstrate their effectiveness on a set of benchmarks.



Transcript

  1. Optimising Purely Functional GPU Programs. Trevor L. McDonell, University of New South Wales. Jointly with Manuel M. T. Chakravarty, Gabriele Keller, Ben Lippmeier. tmcdonell / tlmcdonell
  2. High Level Languages: boxed values, polymorphism & generics, composite data structures, immutable structures, higher-order functions & closures
  4. High Level Languages vs. multicore CPU / GPUs / Cluster: function pointers, control flow, memory access patterns, data distribution, decomposition
  5. High Level Languages vs. multicore CPU / GPUs / Cluster: function pointers, control flow, memory access patterns, data distribution, decomposition. Efficient code?
  6. Fast Scan Algorithms on Graphics Processors. Yuri Dotsenko, Naga K. Govindaraju, Peter-Pike Sloan, Charles Boyd, John Manferdelli, Microsoft Corporation. International Conference on Supercomputing, 2008. (Slide shows the paper's first page.)
  7. Fast Scan Algorithms on Graphics Processors (Dotsenko et al., ICS 2008), and: Optimising Parallel Prefix Operations for the Fermi Architecture. Mark Harris, Michael Garland, NVIDIA Corporation. Published October 2012.
  10. dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )

      An embedded-language program over arrays, built from combinators of the Accelerate library, and the CUDA code generated for the zipWith skeleton:

          #include <accelerate_cuda.h>
          typedef DIM1 DimOut;
          extern "C" __global__ void zipWith
          (
              const DIM1 shIn0,
              const Int64* __restrict__ arrIn0_a0,
              const DIM1 shIn1,
              const Int64* __restrict__ arrIn1_a0,
              const DIM1 shOut,
              Int64* __restrict__ arrOut_a0
          )
          {
              const int shapeSize = size(shOut);
              const int gridSize  = blockDim.x * gridDim.x;
              int ix;

              for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) {
                  const DimOut sh = fromIndex(shOut, ix);
                  const int v0 = toIndex(shIn0, shape(sh));
                  const int v1 = toIndex(shIn1, shape(sh));

                  arrOut_a0[ix] = arrIn0_a0[v0] * arrIn1_a0[v1];
              }
          }

  11. The same program also produces a fold skeleton, whose shared-memory tree reduction unrolls one step per power-of-two offset:

          sdata0[threadIdx.x] = y0;
          __syncthreads();
          ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x);
          if (threadIdx.x + 512 < ix) {
              x0 = sdata0[threadIdx.x + 512];
              y0 = y0 + x0;
              sdata0[threadIdx.x] = y0;
          }
          __syncthreads();
          if (threadIdx.x + 256 < ix) { /* as above, offset 256 */ }
          __syncthreads();
          if (threadIdx.x + 128 < ix) { /* as above, offset 128 */ }
          __syncthreads();
          if (threadIdx.x + 64 < ix)  { /* as above, offset 64 */ }
          __syncthreads();
          if (threadIdx.x < 32) {
              if (threadIdx.x + 32 < ix) { /* offset 32 */ }
              if (threadIdx.x + 16 < ix) { /* offset 16 */ }
              if (threadIdx.x +  8 < ix) { /* offset 8 */ }
              if (threadIdx.x +  4 < ix) { /* offset 4, and so on */ }
          }
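The dot product on these slides is a one-line composition of Accelerate combinators. As a runnable stand-in, here is the same composition over plain Haskell lists using only Prelude functions; real Accelerate code works over `Acc (Vector a)` and compiles to CUDA kernels like the ones shown above.

```haskell
-- The slide's program:  dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )
-- List-based stand-in, so the pipeline (and the intermediate list
-- produced by zipWith, which fusion must eliminate) is explicit.
dotp :: Num a => [a] -> [a] -> a
dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)

main :: IO ()
main = print (dotp [1,2,3,4] [5,6,7,8])   -- 5 + 12 + 21 + 32 = 70
```

Compiled naively, the `zipWith` here would materialise a full intermediate array before `fold` consumes it; eliminating that intermediate is exactly what the paper's array fusion does.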
  12. Producers: map, zipWith, backpermute, generate. Consumers: fold, scanl, permute. Internal representation: fusion friendly; embed the producer into the consumer skeleton.
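Slide 12 says producers are kept in a fusion-friendly internal representation that can be embedded into a consumer skeleton. A common way to realise this (a minimal sketch of the idea, not Accelerate's actual internals) is a "delayed" array: a shape plus an index function, so chains of producers compose as plain function composition and only a consumer such as fold materialises anything.

```haskell
-- A delayed (one-dimensional) array: an extent and an index function.
-- Producers build new index functions; no intermediate storage exists.
data Delayed e = Delayed { extent :: Int, index :: Int -> e }

mapD :: (a -> b) -> Delayed a -> Delayed b
mapD f (Delayed n ix) = Delayed n (f . ix)

zipWithD :: (a -> b -> c) -> Delayed a -> Delayed b -> Delayed c
zipWithD f (Delayed n ixa) (Delayed m ixb) =
  Delayed (min n m) (\i -> f (ixa i) (ixb i))   -- intersect extents

backpermuteD :: Int -> (Int -> Int) -> Delayed a -> Delayed a
backpermuteD n perm (Delayed _ ix) = Delayed n (ix . perm)

-- The consumer: the only place elements are actually demanded.
foldD :: (a -> a -> a) -> a -> Delayed a -> a
foldD f z (Delayed n ix) = foldl f z [ ix i | i <- [0 .. n - 1] ]

fromList :: [a] -> Delayed a
fromList xs = Delayed (length xs) (xs !!)

main :: IO ()
main = print (foldD (+) 0
               (zipWithD (*) (fromList [1,2,3]) (fromList [4,5,6])))
```

With this representation, producer/producer fusion is just function composition; embedding the producer's index function into the consumer's loop is the consumer/producer fusion of the following slides.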
  14. Fusing networks of skeletons (diagram: producers p1–p7 feeding consumers c1 and c2). Phase 1: producer/producer fusion.
  15. Fusing networks of skeletons. Phase 2: consumer/producer fusion.
  16. Fusing networks of skeletons. Phase 2: consumer/producer fusion, yielding a single fused skeleton.
  17. Fusing networks of skeletons. Phase 2: consumer/producer fusion, yielding a single fused skeleton (see paper for details).
  18. Example term and its abstract syntax DAG (diagram):

      let inc = (+) 1
      in  let nine = let three = inc 2
                     in  (*) three three
          in  (-) (inc nine) nine
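The term on slide 18 shares `three` and `nine` via host-language lets, but a deep embedding that simply unfolds the host bindings obtains a tree, not a DAG. The sketch below (a hypothetical mini-AST of ours, not Accelerate's) counts the nodes of the unfolded tree to show the blow-up that sharing recovery avoids.

```haskell
-- A toy first-order AST standing in for an embedded-language term.
data Exp = Con Int | Add Exp Exp | Mul Exp Exp | Sub Exp Exp

size :: Exp -> Int
size (Con _)   = 1
size (Add a b) = 1 + size a + size b
size (Mul a b) = 1 + size a + size b
size (Sub a b) = 1 + size a + size b

-- The slide's term: sharing exists only in the host language.
example :: Exp
example =
  let inc x = Add (Con 1) x          -- inc   = (+) 1
      three = inc (Con 2)            -- three = inc 2
      nine  = Mul three three        -- nine  = three * three
  in  Sub (inc nine) nine            -- (inc nine) - nine

main :: IO ()
main = print (size example)          -- 17: `three` and `nine` duplicated
```

Viewed as a tree the term has 17 nodes, while the DAG the programmer wrote is far smaller; with deeper nesting the duplication grows exponentially, which is the code explosion the paper's sharing recovery eliminates.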
  22. Sharing Recovery. Phase 1: prune shared subtrees (diagram: each node labelled with its stable name; dotted edges mark shared subterms replaced by placeholders).
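Phase 1 relies on GHC stable names to detect that two subterms are one and the same heap object. A minimal, self-contained demonstration of that mechanism (using a toy `Exp` of ours, not Accelerate's AST):

```haskell
import Control.Exception (evaluate)
import System.Mem.StableName (makeStableName)

data Exp = Lit Int | Mul Exp Exp

three :: Exp
three = Mul (Lit 1) (Lit 3)

nine :: Exp
nine = Mul three three               -- both children are the same object

-- True iff the two children of `nine` are one shared heap node.
sharedChildren :: IO Bool
sharedChildren = do
  Mul l r <- evaluate nine           -- force to WHNF before naming
  snL <- makeStableName l
  snR <- makeStableName r
  return (snL == snR)                -- equal stable names => same object

main :: IO ()
main = sharedChildren >>= print
```

During its depth-first traversal, Accelerate performs this check at every node while building the occurrence map, replacing repeated occurrences by placeholders carrying the stable name.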
  25. Sharing Recovery. Phase 2: float shared terms (diagram: each shared subterm floats up to just above the lowest node that dominates all references to it).
  28. Sharing Recovery. Phase 3: introduce binders (diagram: each floated subterm is let-bound above the node it floated to).
  30. Sharing Recovery (excerpt of Section 3 of the paper):

    Before we formalise our sharing recovery algorithm in the following subsections, we shall illustrate the main idea. Consider the following source term:

        let inc = (+) 1
        in  let nine = let three = inc 2
                       in  (*) three three
            in  (-) (inc nine) nine

    This term's abstract syntax DAG is the leftmost diagram in Figure 2. It uses @ nodes to represent applications, as in this grammar:

        untyped:  T → C | x | λx. T | T1 @ T2
                  C → ⟨constants⟩

        typed:    C^τ                  :: T^τ
                  x^τ                  :: T^τ
                  λx^τ1. T^τ2          :: T^(τ1 → τ2)
                  T1^(τ1 → τ2) @ T2^τ1 :: T^τ2

    The left definition does not track types, whereas the right one does. We implement typed ASTs in Haskell with GADTs and work with typed representations henceforth. Typed HOAS conversion with sharing recovery proceeds in three stages:

    1. Prune shared subterms: A depth-first traversal over the AST annotates each node with its unique stable name, where we build an occurrence map of how many times we have already visited each node. If we encounter a previously visited node, it represents a shared subterm, and we replace it by a placeholder containing its stable name. The second diagram in Figure 2 shows the outcome of this stage. Each node is labelled by a number that represents its stable name, and the dotted edges indicate where we encountered a previously visited, shared node. The placeholders are indicated by underlined stable names.

    2. Float shared terms: All shared subterms float upwards in the tree to just above the lowest node that dominates all edges to the original position of that shared subterm (see the third diagram in Figure 2). Floated subterms are referenced by circled stable names located above the node that they floated to. If a node collects more than one shared subterm, the subterm whose origin is deeper in the original term goes on top: here, 9 on top of 5. Nested sharing leads to subterms floating up inside other floated subterms: here, 8 stays inside the subterm rooted in 5.

    3. Binder introduction: Each floated subterm gets let-bound right above the node it floated to (rightmost diagram in Figure 2). While we use explicit, bound names in the figure, we introduce de Bruijn indices at the same time as introducing the lets.

    3.2 Prune shared subterms

    First, we identify and prune shared subtrees, producing a pruned tree T̂ of the following form (second diagram in Figure 2):

        ℓ                      :: T̂^τ        -- binder conversion level
        ν^τ                    :: T̂^τ        -- pruned subtree (name)
        C^τ                    :: T̂^τ
        λℓ. T̂^τ2               :: T̂^(τ1 → τ2)
        T̂1^(τ1 → τ2) @ T̂2^τ1   :: T̂^τ2

    A stable name (here, of type Name) associates a unique name with each unique term node, so that two terms with the same stable name are identical and are represented by the same data structure in memory. We denote the stable name of a term as a superscript during pattern matching; e.g., 1^ν is a constant with stable name ν, just as in the second and third diagram in Figure 2.

    An occurrence map, Ω :: Name ↦ Int, is a finite map that determines the number of occurrences of a Name that we encountered during a traversal. The expression Ω ν yields the number of occurrences of the name ν, and we have ν ∈ Ω ≡ (Ω ν > 0). To add an occurrence to Ω, we write ν ▷ Ω. We will see in the next subsection that we cannot simplify Ω to be merely a set of occurring names: we need the actual occurrence count to determine where shared subterms should be let-bound.

    The identification and pruning of shared subtrees is formalised by the following function operating on closed terms from T^τ:

        prune :: Level → (Name ↦ Int) → T^τ → ((Name ↦ Int), T̂^τ)
        prune ℓ Ω e^ν | ν ∈ Ω     = (ν ▷ Ω, ν)
        prune ℓ Ω e^ν | otherwise = enter (ν ▷ Ω) e
          where
            enter Ω c         = (Ω, c)
            enter Ω (λx. e)   = let (Ω′, e′) = prune (ℓ + 1) Ω ([ℓ/x] e)
                                in  (Ω′, λℓ. e′)
            enter Ω (e1 @ e2) = let (Ω1, e1′) = prune ℓ Ω  e1
                                    (Ω2, e2′) = prune ℓ Ω1 e2
                                in  (Ω2, e1′ @ e2′)

    The first equation of prune covers the case of a term's repeated occurrence. In that case, we prune sharing by replacing the term e^ν by a tag ν containing its stable name; these are the dotted lines in the second diagram in Figure 2. To interleave sharing recovery with the conversion from HOAS to typed de Bruijn indices, prune tracks the nesting Level of lambdas. Moreover, the lambda case of enter replaces the HOAS binder x by the level ℓ at the binding and usage sites.

    Why don't we separate computing occurrences from tree pruning? When computing occurrences, we must not traverse shared subtrees multiple times, so we can just as well prune at the same time. Moreover, in the first line of prune, we cannot simply return e instead of ν: e is of the wrong form, as it has type T and not T̂!

    As far as type preservation is concerned, we do lose information due to replacing variables by levels ℓ. This is the inevitable loss described by Atkey et al. [1], which we make up for by a dynamic check in an environment lookup, as already discussed.

    3.3 Float shared subterms

    Second, we float all shared subtrees out to where they should be let-bound, represented by (see the third diagram in Figure 2)

        ↑T^τ → ν̄ : ↑T^τ′  ↓T^τ       -- floated subterms, then the body

        ν^τ                     :: ↓T^τ
        C^τ                     :: ↓T^τ
        λν. ↑T^τ2               :: ↓T^(τ1 → τ2)
        ↑T1^(τ1 → τ2) @ ↑T2^τ1  :: ↓T^τ2

    A term in ↑T comprises a sequence of floated-out subterms, labelled by their stable names, as well as a body term from ↓T from which the floated subterms were extracted. Moreover, the levels ℓ that replaced lambda binders in T̂ get replaced by the stable name of their term node. This simplifies a uniform introduction of de Bruijn indices for let-bound and lambda-bound variables.

    We write ν̄ : ↑T̄ for a possibly empty sequence of items ν1 : ↑T1, …, νn : ↑Tn, where • denotes an empty sequence. The floating function float maintains an auxiliary structure Δ of floating terms and levels, defined as follows:

        Δ → ν^i : ↑T^τ | ν^i : · | ν^i : ℓ

    These are floated subtrees named ν, of which we have collected i occurrences. The occurrence count indicates where a shared subterm gets let-bound: namely, at the node where it matches Ω ν. This is why prune needed to collect the number of occurrences in Ω. When the occurrence count matches Ω ν, we call the floated term saturated. The following function determines saturated floated terms, which ought to be let-bound:

        bind :: (Name ↦ Int) → Δ → ∃τ. ν̄ : ↑T^τ
        bind Ω •                        = •
        bind Ω (ν^i : e, Δ) | Ω ν == i = ν : e, bind Ω Δ
        bind Ω (ν^i : _, Δ)            = bind Ω Δ

    Note that Δ does not keep track of the type τ of a floated term ↑T^τ; hence, floated terms from bind come in an existential package. This does not introduce an additional loss of type safety, as we already lost the type of lambda-bound variables in ν^i : ℓ. It merely means that let-bound variables, just like lambda-bound ones, require the dynamically checked environment lookup we already discussed.

    When floating the first occurrence of a shared tree (not pruned by prune), we use ν^i : ↑T^τ. When floating subsequent occurrences (which were pruned), we use ν^i : ·. Finally, when floating a level, to replace it by a stable name, we use ν^i : ℓ.

    We define a partial ordering on floated terms: ν1^i : x < ν2^j : y iff the direct path from ν1 to the root of the AST is shorter than that of ν2. We keep sequences of floated terms in descending order, so that the deepest subterm comes first. We write Δ1 ⊎ Δ2 to merge two sequences of floated terms. Merging respects the partial order, and it combines floated trees with the same stable name by adding their occurrence counts. To combine the first occurrence and a subsequent occurrence of a shared tree, we preserve the term of the first occurrence. We write Δ \ ν̄ to delete the elements of Δ named in ν̄.
When floating the first occurrence of a shared tree (not pruned by prune ), we use ⌫ i : " T ⌧ . When floating subsequent occurrences (which were pruned), we use ⌫ i : ·. Finally, when floating a level, to replace it by a stable name, we use ⌫ i : ` . We define a partial ordering on floated terms: ⌫1 i : x < ⌫2 j : y iff the direct path from ⌫1 to the root of the AST is shorter than that of ⌫2 . We keep sequences of floated terms in descending order — so that the deepest subterm comes first. We write 1 ] 2 to merge two sequences of floated terms. Merging respects the partial order, and it combines floated trees with the same stable name by adding their occurrence counts. To combine the first occurrence and a subsequent occurrence of a shared tree, we preserve the term of the first occurrence. We write \ ⌫ to delete elements of that are tagged with a name that appears in the sequence ⌫ . We can now formalise the floating process as follows: float :: ( Name 7! Int ) ! T ⌧ ! ( , " T ⌧ ) float ⌦ ` ⌫ = ( ⌫ 1 : `, ⌫ ) float ⌦ ⌫ = ( ⌫ 1 : · , ⌫ ) float ⌦ e ⌫ = let ( , e 0) = descend e ⌫b : eb = bind ⌦ d = ⌫b : eb e 0 in if ⌦⌫ == 1 then ( \ ⌫b, d ) else ( \ ⌫b ] { ⌫ : d } , ⌫ ) where descend :: T ⌧ ! ( , # T ⌧ ) descend c = (• , c ) descend ( `.e ) = let ( , e 0) = float ⌦ e in if 9 ⌫ 0 i. ( ⌫ 0 i : ` ) 2 then ( \ { ⌫ 0} , ⌫ 0 .e 0) else ( , .e 0) descend ( e1 @ e2) = let ( 1, e 0 1 ) = float ⌦ e1 ( 2, e 0 2 ) = float ⌦ e2 in ( 1 ] 2, e 0 1 @ e 0 2 ) Regardless of whether a term gets floated, all saturated float terms, ⌫b : eb , must prefix the result, e 0, and be removed from When descend ing into a term, the only interesting case is lambdas. For a lambda at level ` , we look for a floated level of t form ⌫ 0 : ` . If that is available, ⌫ 0 replaces ` as a binder and remove ⌫ 0 : ` from . However, if ⌫ 0 : ` is not in , the bind introduced by the lambda doesn’t get used in e . 
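The occurrence map Ω rests on observing sharing of heap nodes via GHC's stable names. The following is a minimal sketch of that idea only, on an invented two-constructor AST; the names Exp and occurrences are illustrative and not taken from Accelerate:

```haskell
{-# LANGUAGE GADTs #-}
-- Minimal sketch (not Accelerate's code) of counting occurrences of
-- shared AST nodes with GHC stable names, as prune's Ω requires.
import Control.Exception (evaluate)
import Data.IORef
import qualified Data.IntMap.Strict as IntMap
import System.Mem.StableName (hashStableName, makeStableName)

-- A tiny expression type standing in for the embedded AST.
data Exp where
  Const :: Int -> Exp
  Add   :: Exp -> Exp -> Exp

-- Count how often each heap node occurs, without traversing a shared
-- subtree twice (the role of the occurrence map Ω).
occurrences :: Exp -> IO (IntMap.IntMap Int)
occurrences root = do
  ref <- newIORef IntMap.empty
  let go e0 = do
        e  <- evaluate e0            -- force before taking the stable name
        sn <- makeStableName e
        let k = hashStableName sn
        seen <- readIORef ref
        case IntMap.lookup k seen of
          Just _  ->                 -- already seen: record, do not descend
            writeIORef ref (IntMap.adjust (+ 1) k seen)
          Nothing -> do
            writeIORef ref (IntMap.insert k 1 seen)
            case e of
              Const _   -> return ()
              Add e1 e2 -> go e1 >> go e2
  go root
  readIORef ref

main :: IO ()
main = do
  let shared = Add (Const 1) (Const 2)
  occ <- occurrences (Add shared shared)  -- 'shared' occurs twice
  print (maximum (IntMap.elems occ))
```

A traversal like this underlies prune: on meeting a node it has already seen, it records another occurrence and stops instead of descending again.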
3.4 Binder introduction

Thirdly, we introduce typed de Bruijn indices to represent lambda and let binding structure (rightmost diagram in Figure 2):

    env ⊢ Tτ where
      Cτ                                :: env ⊢ Tτ
      env ⊢ ιτ                          :: env ⊢ Tτ
      λ (τ₁, env) ⊢ Tτ₂                 :: env ⊢ Tτ₁→τ₂
      env ⊢ Tτ₁→τ₂ @ env ⊢ Tτ₁          :: env ⊢ Tτ₂
      let env ⊢ Tτ₁ in (τ₁, env) ⊢ Tτ₂  :: env ⊢ Tτ₂

With this type of terms, e :: env ⊢ Tτ means that e is a term representing a computation producing a value of type τ under the type environment env. Type environments are nested pair types, possibly terminated by a unit type (). For example, (((), τ₁), τ₀) is a type environment where de Bruijn index 0 represents a variable of type τ₀ and de Bruijn index 1 represents a variable of type τ₁. We abbreviate let e₁ in ⋯ let eₙ in e_b as let ē in e_b. Both λ and let use de Bruijn indices ι instead of introducing explicit binders.

To replace the names of pruned subtrees and of lambda bound variables by de Bruijn indices, we need to construct a suitable type environment as well as an association between environment entries, their de Bruijn indices, and the stable names that they replace. We maintain the type environment with associated de Bruijn indices in the following environment layout structure:

    Λ env env′ where
      ∅                      :: Λ env ()
      Λ env env′ ; env ⊢ ιτ  :: Λ env (env′, τ)

Together with a layout, we use a sequence of names ρ of the same size as the layout, where corresponding entries represent the same variable. As this association between typed layout and untyped sequence of names is not validated by types, the lookup function lyt # i, getting the iᵗʰ index of layout lyt, makes use of a dynamic type check. Its signature is (#) :: ℕ → Λ env env′ → env ⊢ ιτ.

Now we can introduce de Bruijn indices into body expressions:

    body :: Λ env env → ρ → ↓Tτ → env ⊢ Tτ
    body lyt (νρ,₀, …, νρ,ₙ) ν̄ | ν == νρ,ᵢ = lyt # i
    body lyt νρ c         = c
    body lyt νρ (λν̄.e)    = λ (binders lyt⁺ (ν, νρ) e)
    body lyt νρ (e₁ @ e₂) = (binders lyt νρ e₁) @ (binders lyt νρ e₂)

The first equation performs a lookup in the environment layout at the same index at which the stable name ν occurs in the name environment νρ. The lookup is the same for lambda and let bound variables. It is the only place where we need a dynamic type check, and that check is already needed for lambda bound variables alone.

In the case of a lambda, we add a new binder by extending the layout, denoted lyt⁺, with a new zeroth de Bruijn index, shifting all others one up. Keeping the name environment in sync, we add the stable name ν, which ↓T used as a binder.
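A dynamically checked lookup of this kind can be sketched with GADT-typed de Bruijn indices and a Typeable-based cast. This is a simplified illustration in the spirit of (#); the names Idx, Layout, prjIdx, and Val are invented for the example:

```haskell
{-# LANGUAGE GADTs #-}
-- Sketch of a typed de Bruijn index and an environment layout with a
-- dynamically checked lookup. Illustrative only, not the paper's code.
import Data.Typeable (Typeable, gcast)

-- Typed de Bruijn index: evidence that type t occurs in environment env.
data Idx env t where
  ZeroIdx :: Idx (env, t) t
  SuccIdx :: Idx env t -> Idx (env, s) t

-- Layout: associates positions with de Bruijn indices, as in Λ env env'.
data Layout env env' where
  EmptyLayout :: Layout env ()
  PushLayout  :: Typeable t
              => Layout env env' -> Idx env t -> Layout env (env', t)

-- The dynamic type check: 'gcast' fails at run time if the expected
-- type does not match the one stored in the layout.
prjIdx :: Typeable t => Int -> Layout env env' -> Idx env t
prjIdx 0 (PushLayout _ ix) =
  case gcast ix of
    Just ix' -> ix'
    Nothing  -> error "prjIdx: inconsistent environment"
prjIdx n (PushLayout lyt _) = prjIdx (n - 1) lyt
prjIdx _ EmptyLayout        = error "prjIdx: index out of bounds"

-- Valuations matching the nested-pair environments, to run a lookup.
data Val env where
  Empty :: Val ()
  Push  :: Val env -> t -> Val (env, t)

prj :: Idx env t -> Val env -> t
prj ZeroIdx      (Push _   v) = v
prj (SuccIdx ix) (Push val _) = prj ix val

main :: IO ()
main = do
  -- environment (((), Bool), Int): index 0 is the Int, index 1 the Bool
  let lyt :: Layout (((), Bool), Int) (((), Bool), Int)
      lyt = PushLayout (PushLayout EmptyLayout (SuccIdx ZeroIdx)) ZeroIdx
      val = Push (Push Empty True) (42 :: Int)
  print (prj (prjIdx 0 lyt :: Idx (((), Bool), Int) Int) val)
```

Requesting the index at a type other than the stored one makes gcast return Nothing, which is exactly the dynamically checked failure mode the text describes.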
In the same vein, we bind the n floated terms ν̄ : ē with let bindings in the body expression e_b, by extending the type environment n times (map applies a function to each element of a sequence):

    binders :: Λ env env → ρ → ↑Tτ → env ⊢ Tτ
    binders lyt νρ ((ν̄ : ē) e_b) = let map (binders lyt νρ) ē
                                   in body lyt⁺ⁿ (ν̄, νρ) e_b
      where n = length (ν̄ : ē)

We tie the three stages together to convert from HOAS, with sharing recovered, to terms with let bindings and typed de Bruijn indices:

    hoasSharing :: Tτ → () ⊢ Tτ
    hoasSharing e = let (Ω, e′) = prune 0 • e
                        (•, e″) = float Ω e′
                    in binders ∅ • e″

4. Array fusion

Fusion in a massively data-parallel, embedded language for GPUs, such as Accelerate, requires a few uncommon considerations.

Parallelism. While fusing parallel collective operations, we must be careful not to lose information essential to parallel execution. For example, foldr/build fusion [15] is not applicable, because it produces sequential tail-recursive loops rather than massively parallel GPU kernels. Similarly, the split/join approach used in Data Parallel Haskell (DPH) [16] is not helpful: although fused operations are split into sequential and parallel subcomputations, they assume an explicit parallel scheduler, which in DPH is written directly in Haskell. Accelerate compiles massively parallel array combinators to CUDA code via template skeleton instantiation, so any fusion system must preserve the combinator representation of the intermediate code.

Sharing. Existing fusion transforms rely on inlining to move producer and consumer expressions next to each other, which allows producer/consumer pairs to be detected. However, when let-bound variables are used multiple times in the body of an expression, unrestrained inlining can lead to duplication of work. Compilers such as GHC handle this situation by only inlining the definitions of let-bound variables that have a single use site, or by relying on some heuristic about the size of the resulting code to decide what to inline [26]. However, in typical Accelerate programs, each array is used at least twice: once to access the shape information and once to access the array data; so we must handle at least this case separately.

Filtering. General array fusion transforms must deal with filter-like operations, for which the size of the result structure depends on the value of the input structure, as well as its size. Accelerate does not encode filtering as a primitive operation, so we do not need to consider it further.¹

Fusion at run-time. As the Accelerate language is embedded in Haskell, compilation of an Accelerate program happens at Haskell runtime rather than when compiling the Haskell program. For this reason, optimisations applied to an Accelerate program contribute to its overall runtime, so we must be mindful of the cost of analysis and code transformation. On the flip side, runtime optimisations can make use of information that is only available at runtime.

Fusion on typed de Bruijn terms. We fuse Accelerate programs by rewriting typed de Bruijn terms in a type-preserving manner. However, maintaining type information adds complexity to the definitions and rules, which amounts to a partial proof of correctness checked by the type checker, but is not particularly exciting for the present exposition. Hence, in this section, we elide the steps necessary to maintain type information during fusion.

Figure 3. Producer/producer and consumer/producer fusion [diagram: an operation graph of producers p1–p7 and consumers c1, c2, shown before fusion, after producer/producer fusion, and after consumer/producer fusion]
4.1 The Main Idea

All collective operations in Accelerate are array-to-array transformations. Reductions, such as fold, which reduce an array to a single element, yield a singleton array rather than a scalar expression. Hence, we can partition array operations into two categories:

1. Operations where each element of the result array depends on at most one element of each input array. Multiple elements of the output array may depend on a single input array element, but all output elements can be computed independently. We refer to these operations as producers.

2. Operations where each element of the result array depends on multiple elements of the input array. We call these operations consumers, in spite of the fact that they also produce an array.

Table 1 summarises the collective array operations that we support. In a parallel context, producers are easier to deal with, because independent element-wise operations have an obvious mapping to the GPU. Consumers are a different story: we need to know exactly how the computations depend on each other to implement them efficiently. For example, a fold (with an associative operator) can be implemented efficiently as a parallel tree reduction, but a parallel scan requires two separate phases. Unfortunately, this sort of information is obscured by most fusion techniques. To support the different properties of producers and consumers, our fusion transform proceeds in two distinct phases:

• Producer/producer: fuse sequences of producers into a single producer. This is implemented as a source-to-source transformation on the AST.

• Consumer/producer: fuse producer code into the consumer. This happens during code generation, where we specialise the consumer skeleton with the producer code.

¹ filter is easily implemented in Accelerate and is provided as part of the library.
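To illustrate why producer/producer fusion can be a source-level transformation, producers can be kept in a delayed, index-function form, so that composing producers composes functions instead of materialising intermediate arrays. The sketch below uses invented names (Delayed, mapD, zipWithD, foldD) and a sequential fold standing in for the parallel consumer skeleton; it is the underlying idea only, not Accelerate's implementation:

```haskell
-- Minimal sketch of producer/producer fusion with a delayed array
-- representation. Illustrative names; not Accelerate's actual types.
data Delayed a = Delayed
  { sizeD  :: Int        -- extent of the array
  , indexD :: Int -> a   -- compute one element independently of the others
  }

-- Producers compose by composing index functions: no intermediate array.
mapD :: (a -> b) -> Delayed a -> Delayed b
mapD f (Delayed n ix) = Delayed n (f . ix)

zipWithD :: (a -> b -> c) -> Delayed a -> Delayed b -> Delayed c
zipWithD f (Delayed n ix) (Delayed m iy) =
  Delayed (min n m) (\i -> f (ix i) (iy i))

-- A consumer (a sequential stand-in for a parallel tree reduction)
-- forces the fused producer in a single traversal.
foldD :: (b -> a -> b) -> b -> Delayed a -> b
foldD f z (Delayed n ix) = go z 0
  where
    go acc i | i == n    = acc
             | otherwise = go (f acc (ix i)) (i + 1)

fromListD :: [a] -> Delayed a
fromListD xs = Delayed (length xs) (xs !!)

main :: IO ()
main =
  -- map and zipWith fuse into the fold; no intermediate array is built
  print (foldD (+) 0
           (mapD (* 2) (zipWithD (+) (fromListD [1 .. 10])
                                     (fromListD [1 .. 10]))) :: Int)
```

The consumer/producer phase corresponds to inlining the fused index function into the consumer's code template, which Accelerate performs during skeleton instantiation at code-generation time.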
  31. N-Body
      [benchmark chart: run time (ms) against number of bodies, comparing Accelerate -fusion -sharing, ... -fusion +sharing, ... +fusion +sharing, and a hand-optimised GPU version]
  32. N-Body
      [the same chart, annotated "11x": the gap between Accelerate without optimisations and the hand-optimised GPU version]
  33. N-Body
      [the same chart, annotated "Missing "optimisation"" and "11x"]
  34. N-Body
      [the same chart, annotated "Current version", "2x", and "11x"]
  35. Summary
      Embedded languages are restricted languages
      Skeletons encapsulate efficient code idioms
      Fusion and Sharing reduce the abstraction penalty
      https://github.com/AccelerateHS/