
Optimising Purely Functional GPU Programs


Presented at ICFP 2013: http://www.icfpconference.org/icfp2013/
Paper: https://github.com/tmcdonell/tmcdonell.github.io/raw/master/papers/acc-optim-icfp2013.pdf

Purely functional, embedded array programs are a good match for SIMD hardware, such as GPUs. However, the naive compilation of such programs quickly leads to both code explosion and an excessive use of intermediate data structures. The resulting slowdown is not acceptable on target hardware that is usually chosen to achieve high performance.

In this paper, we present two optimisation techniques, sharing recovery and array fusion, that tackle code explosion and eliminate superfluous intermediate structures. Both techniques are well known from other contexts, but they present unique challenges for an embedded language compiled for execution on a GPU. We present novel methods for implementing sharing recovery and array fusion, and demonstrate their effectiveness on a set of benchmarks.

Trevor L. McDonell

September 25, 2013

Transcript

  1. Optimising Purely Functional GPU Programs
    Trevor L. McDonell

    University of New South Wales
    Jointly with
    Manuel M. T. Chakravarty

    Gabriele Keller

    Ben Lippmeier
    tmcdonell
    tlmcdonell


  3. High Level Languages
    Boxed values
    Polymorphism & generics
    Composite data structures
    Immutable structures
    Higher-order functions & closures


  7. High Level Languages
    multicore CPU
    GPUs
    Cluster
    Function pointers
    Control flow
    Memory access patterns
    Data distribution
    Decomposition
    Efficient code?


  8. scanl :: (a -> b -> a) -> a -> [b] -> [a]
    scanl f q ls = q : (case ls of
                          []   -> []
                          x:xs -> scanl f (f q x) xs)
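The scanl above is inherently sequential: each element depends on the accumulator from the previous one. The GPU papers cited on the following slides implement the same operation with the work-efficient up-sweep/down-sweep (Blelloch) strategy. As a rough, hardware-free sketch of that idea (mine, not from the deck, written in Python purely for illustration):

```python
# Sketch of the work-efficient exclusive scan underlying GPU
# implementations: an up-sweep (reduce) phase builds partial sums in a
# binary-tree pattern, then a down-sweep phase pushes prefixes back
# down. Both inner loops are data-parallel across a block.
def blelloch_scan(xs, op=lambda a, b: a + b, identity=0):
    """Exclusive scan; assumes the input length is a power of two."""
    n = len(xs)
    a = list(xs)
    # Up-sweep: combine pairs at stride d into tree-internal sums.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):          # parallel on a GPU
            a[i + 2 * d - 1] = op(a[i + d - 1], a[i + 2 * d - 1])
        d *= 2
    # Down-sweep: clear the root, then swap-and-combine back down.
    a[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):          # parallel on a GPU
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] = op(t, a[i + 2 * d - 1])
        d //= 2
    return a

# blelloch_scan([1, 2, 3, 4]) -> [0, 1, 3, 6]
```

The point of the next slides is precisely that efficient GPU versions of this need careful shared-memory layouts, which is why Accelerate generates code from hand-tuned skeletons rather than compiling the Haskell recursion directly.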


  9. Fast Scan Algorithms on Graphics Processors
    Yuri Dotsenko Naga K. Govindaraju Peter-Pike Sloan Charles Boyd John Manferdelli
    Microsoft Corporation
    One Microsoft Way
    Redmond, WA 98052, USA
    {yurido, nagag, ppsloan, chasb, jmanfer}@microsoft.com
    ABSTRACT
    Scan and segmented scan are important data-parallel primitives
    for a wide range of applications. We present fast, work-efficient
    algorithms for these primitives on graphics processing units
    (GPUs). We use novel data representations that map well to the
    GPU architecture. Our algorithms exploit shared memory to
    improve memory performance. We further improve the
    performance of our algorithms by eliminating shared-memory
    bank conflicts and reducing the overheads in prior shared-memory
    GPU algorithms. Furthermore, our algorithms are designed to
    work well on general data sets, including segmented arrays with
    arbitrary segment lengths. We also present optimizations to
    improve the performance of segmented scans based on the
    segment lengths. We implemented our algorithms on a PC with an
    NVIDIA GeForce 8800 GPU and compared our results with prior
    GPU-based algorithms. Our results indicate up to 10x higher
    performance over prior algorithms on input sequences with
    millions of elements.
    Categories and Subject Descriptors
    D.1.3 [Concurrent Programming]: Parallel programming.
    General Terms
    Algorithms, performance.
    Keywords
    Scan, all-prefix-sum, segmented scan, GPGPU, GPU, parallel
    algorithm, HPC, many-core.
    1. INTRODUCTION
    bandwidth and massive parallelism on GPUs. The current state-
    of-the-art GPU-based algorithms also exploit shared memory to
    improve the performance of scans. In this paper, we analyze the
    issues in mapping scan algorithms to the GPU architecture. We
    highlight that the prior algorithms deliver suboptimal performance
    due to high overhead of shared-memory bank conflicts,
    synchronization, and index arithmetic.
    We present fast scan algorithms that map better to GPUs and
    achieve higher performance than prior GPU-based algorithms.
    Our main contribution is a novel data representation in shared and
    global memory that maps better to the GPU memory hierarchy
    and the scan algorithms. Accesses to the data representation
    involve no bank conflicts in the shared memory while exploiting
    the high parallelism on GPUs. Our algorithm involves low
    overhead compared to prior approaches and the performance of
    the kernel scales better with shared memory sizes.
    We implemented our algorithms on a PC with a modern NVIDIA
    GPU. We benchmark our algorithms against prior state-of-the-art
    GPU-based algorithms on several GPUs. Our results on
    unsegmented scans indicate up to 60% higher performance than
    prior optimized algorithms. On segmented scans, we observed up
    to an order of magnitude higher performance over optimized
    GPU-based segmented scan algorithms.
    Organization of the paper: The rest of the paper is organized as
    follows. In Section 2, we present the related work. In Section 3,
    we give an overview of scan algorithms and the issues in mapping
    them to GPUs. We present our scan algorithms and provide
    analysis in Section 4. In Section 5, we describe our experimental
    International conference on
    Supercomputing, 2008



  10. Optimising Parallel Prefix operations
    for the Fermi architecture
    Mark Harris Michael Garland
    NVIDIA Corporation
    Published October 2012


  13. http://jacohaasbroek.com


  14. How about
    embedded languages
    with
    specialised code generation?


  15. Accelerate

    An embedded language for GPU programming


  18. dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )



  22. dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )
    Embedded language arrays
    From Accelerate library


  24. #include
    typedef DIM1 DimOut;
    extern "C" __global__ void zipWith
    (
        const DIM1 shIn0,
        const Int64* __restrict__ arrIn0_a0,
        const DIM1 shIn1,
        const Int64* __restrict__ arrIn1_a0,
        const DIM1 shOut,
        Int64* __restrict__ arrOut_a0
    )
    {
        const int shapeSize = size(shOut);
        const int gridSize  = blockDim.x * gridDim.x;
        int ix;

        for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) {
            const DimOut sh = fromIndex(shOut, ix);
            const int v0 = toIndex(shIn0, shape(sh));
            const int v1 = toIndex(shIn1, shape(sh));

            arrOut_a0[ix] = arrIn0_a0[v0] * arrIn1_a0[v1];
        }
    }

    sdata0[threadIdx.x] = y0;
    __syncthreads();
    ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x);
    if (threadIdx.x + 512 < ix) {
        x0 = sdata0[threadIdx.x + 512];
        y0 = y0 + x0;
        sdata0[threadIdx.x] = y0;
    }
    __syncthreads();
    if (threadIdx.x + 256 < ix) {
        x0 = sdata0[threadIdx.x + 256];
        y0 = y0 + x0;
        sdata0[threadIdx.x] = y0;
    }
    __syncthreads();
    if (threadIdx.x + 128 < ix) {
        x0 = sdata0[threadIdx.x + 128];
        y0 = y0 + x0;
        sdata0[threadIdx.x] = y0;
    }
    __syncthreads();
    if (threadIdx.x + 64 < ix) {
        x0 = sdata0[threadIdx.x + 64];
        y0 = y0 + x0;
        sdata0[threadIdx.x] = y0;
    }
    __syncthreads();
    if (threadIdx.x < 32) {
        if (threadIdx.x + 32 < ix) {
            x0 = sdata0[threadIdx.x + 32];
            y0 = y0 + x0;
            sdata0[threadIdx.x] = y0;
        }
        if (threadIdx.x + 16 < ix) {
            x0 = sdata0[threadIdx.x + 16];
            y0 = y0 + x0;
            sdata0[threadIdx.x] = y0;
        }
        if (threadIdx.x + 8 < ix) {
            x0 = sdata0[threadIdx.x + 8];
            y0 = y0 + x0;
            sdata0[threadIdx.x] = y0;
        }
        if (threadIdx.x + 4 < ix) {
            x0 = sdata0[threadIdx.x + 4];
            y0 = y0 + x0;

    dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )
    Embedded language arrays
    From Accelerate library


  25. Problem #1: Fusion



  28. dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )
    Skeleton #1
    Skeleton #2
    Intermediate array
    Extra traversal

  29. dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )
    Combined skeleton
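What the combined skeleton buys can be seen without any GPU at all. The sketch below (mine, in Python for illustration) contrasts the unfused evaluation of dotp, which materialises the zipWith result and traverses memory twice, with the fused version, which makes a single pass and allocates nothing:

```python
# Unfused: two skeletons, an intermediate array, and an extra traversal.
def dotp_unfused(xs, ys):
    tmp = [x * y for x, y in zip(xs, ys)]  # zipWith (*): intermediate array
    acc = 0
    for v in tmp:                          # fold (+) 0: extra traversal
        acc += v
    return acc

# Fused: one combined skeleton, one traversal, no intermediate.
def dotp_fused(xs, ys):
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc
```

On a GPU the difference is amplified: the intermediate array lives in global memory, so the unfused version pays an extra kernel launch and a full round trip through DRAM.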


  30. Stream fusion?
    Data.Vector
    ( D. Coutts et al., ICFP ‘07 )


  32. Retain the skeleton-based structure of the program


  37. Producers: map, zipWith, backpermute, generate
    Consumers: fold, scanl, permute
    Internal representation: fusion friendly
    Embed producer into consumer skeleton


  39. Fusing networks of skeletons
    [diagram: a network of producers p1–p7 feeding consumers c1 and c2]

  40. Phase 1: producer/producer fusion

  41. Phase 2: consumer/producer fusion

  42. Single fused skeleton

  43. (see paper for details)
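The producer/producer phase can be sketched with the classic delayed-array trick: a producer is represented as a size plus an index function, so composing producers composes functions instead of allocating arrays, and only a consumer forces a traversal. The sketch below is mine, in Python, with names loosely echoing Accelerate's API; it is an illustration of the idea, not Accelerate's implementation:

```python
# A producer is "delayed": just an extent and a function from index to
# element. Chaining producers nests closures; no intermediate arrays.
class Delayed:
    def __init__(self, size, index):
        self.size, self.index = size, index

def use(xs):
    # Lift a manifest (in-memory) array into a delayed producer.
    return Delayed(len(xs), lambda i: xs[i])

def map_d(f, arr):
    # Producer/producer fusion: build a new index function.
    return Delayed(arr.size, lambda i: f(arr.index(i)))

def zipWith_d(f, a, b):
    return Delayed(min(a.size, b.size),
                   lambda i: f(a.index(i), b.index(i)))

def fold_d(f, z, arr):
    # Consumer/producer fusion: the consumer is the only traversal,
    # pulling fused elements straight out of the producer network.
    acc = z
    for i in range(arr.size):
        acc = f(acc, arr.index(i))
    return acc

def dotp(xs, ys):
    return fold_d(lambda a, b: a + b, 0,
                  zipWith_d(lambda x, y: x * y, use(xs), use(ys)))
```

In Accelerate the same structure is realised at code-generation time: the producer's index function is inlined into the consumer's skeleton, yielding the single fused kernel of the previous slides.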

  44. let inc = (+) 1
    in  let nine = let three = inc 2
                   in
                   (*) three three
        in
        (-) (inc nine) nine
    [diagram: abstract syntax DAG of the term]


  50. Problem #2: Sharing



  52. let inc = (+) 1
    in  let nine = let three = inc 2
                   in
                   (*) three three
        in
        (-) (inc nine) nine
    Without sharing
    this is evaluated 4 times
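Why four times: in an embedded language the host-language `let`s vanish when the term is built, leaving a DAG in which the shared node is reachable along several paths; walking that DAG as a tree re-evaluates the node once per path. A small sketch of the effect (mine, in Python for illustration, with a counter standing in for the work of evaluating the shared subterm):

```python
# Count how often the shared subterm `three` is (re-)entered when the
# example term is evaluated with no sharing: nine = three * three uses
# it twice, and nine itself is used twice, giving 2 * 2 = 4 entries.
calls = {"three": 0}

def inc(x):
    return x + 1

def three():
    calls["three"] += 1      # stand-in for evaluating `inc 2`
    return inc(2)

def nine():
    return three() * three() # sharing of `three` is lost

result = inc(nine()) - nine()  # sharing of `nine` is lost too
```

With sharing recovery the subterm is evaluated once and its result reused, which is exactly what the recovered `let` bindings express.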


  53. Syntactic?
    ( E. Axelsson, ICFP ‘12 )


  55. Type-preserving sharing recovery
    Preserve the tree structure of the program



  56. Sharing Recovery
    let inc = (+) 1
    in  let nine = let three = inc 2
                   in
                   (*) three three
        in
        (-) (inc nine) nine
    [diagram: abstract syntax DAG of the term]

  58. Phase 1: prune shared subtrees
    [diagram: shared subtrees replaced by placeholders labelled with their stable names]

  61. Phase 2: float shared terms
    [diagram: shared subterms floated up to the lowest node that dominates all references to them]

  64. Phase 3: introduce binders
    [diagram: each floated subterm let-bound just above the node it floated to]

  66. Sharing Recovery
Before we formalise our sharing recovery algorithm in the following
subsections, we shall illustrate the main idea. Consider the
following source term:
    let inc = (+) 1
    in let nine = let three = inc 2
    in
    (*) three three
    in
    (-) (inc nine) nine
This term’s abstract syntax DAG is the leftmost diagram in Figure 2.
It uses @ nodes to represent applications; as in this grammar:

    -- untyped                      -- typed
    T ::= C                         T^τ where
        | x                             C                  :: T^τ
        | λx. T                         x^τ                :: T^τ
        | T1 @ T2                       λx^τ1. T^τ2        :: T^(τ1→τ2)
    C ::= ⟨constants⟩                   T1^(τ1→τ2) @ T2^τ1 :: T^τ2

    The left definition does not track types, whereas the right one does.
    We implement typed ASTs in Haskell with GADTs and work with
    typed representations henceforth. Typed HOAS conversion with
    sharing recovery proceeds in three stages:
    1. Prune shared subterms: A depth-first traversal over the AST
    annotates each node with its unique stable name, where we build
    an occurrence map of how many times we’ve already visited
    each node. If we encounter a previously visited node, it
    represents a shared subterm, and we replace it by a placeholder
    containing its stable name. The second diagram in Figure 2 shows
    the outcome of this stage. Each node is labeled by a number
    that represents its stable name, and the dotted edges indicate
    where we encountered a previously visited, shared node. The
    placeholders are indicated by underlined stable names.
    2. Float shared terms: All shared subterms float upwards in the
    tree to just above the lowest node that dominates all edges to
    the original position of that shared subterm — see the third
    diagram in Figure 2. Floated subterms are referenced by circled
    stable names located above the node that they floated to. If a
    node collects more than one shared subterm, the subterm whose
    origin is deeper in the original term goes on top — here, 9 on top
    of 5. Nested sharing leads to subterms floating up inside other
    floated subterms — here, 8 stays inside the subterm rooted in 5.
    Figure 2. Recovering sharing in an example term
    3. Binder introduction: Each floated subterm gets let-bound right
    above the node it floated to (rightmost diagram in Figure 2).
    While we use explicit, bound names in the figure, we introduce
    de Bruijn indices at the same time as introducing the lets.
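The typed representation and the running example can be sketched in plain Haskell. This is a minimal sketch, not the actual Accelerate AST: the constructor names and the prim2 helper are our own, and the sharing here is ordinary Haskell let-sharing on the heap, which is exactly what the recovery algorithm must detect.

```haskell
{-# LANGUAGE GADTs #-}

-- Typed HOAS terms, mirroring the typed grammar above:
-- constants, lambdas as Haskell functions, and application.
data Term t where
  Con :: t -> Term t
  Lam :: (Term a -> Term b) -> Term (a -> b)
  App :: Term (a -> b) -> Term a -> Term b

-- A reference evaluator giving the meaning of a term.
eval :: Term t -> t
eval (Con c)   = c
eval (Lam f)   = \x -> eval (f (Con x))
eval (App f a) = eval f (eval a)

-- A binary primitive is a curried application of a constant.
prim2 :: (Int -> Int -> Int) -> Term Int -> Term Int -> Term Int
prim2 f a b = App (App (Con f) a) b

-- The example term: 'three' and 'nine' are shared on the heap,
-- so the term is the DAG of Figure 2, not a tree.
example :: Term Int
example =
  let inc   = \x -> prim2 (+) (Con 1) x
      three = inc (Con 2)
      nine  = prim2 (*) three three
  in  prim2 (-) (inc nine) nine

-- eval example == (9 + 1) - 9 == 1
```

Naive conversion of this HOAS term to a first-order AST would duplicate `nine` (and, inside it, `three`), which is the code explosion that sharing recovery prevents.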
3.2 Prune shared subterms
First, we identify and prune shared subtrees, producing a pruned tree T̂ of the following form (second diagram in Figure 2):

  T̂^τ where
    ℓ^τ :: T̂^τ                       -- binder conversion level
    ν^τ :: T̂^τ                       -- pruned subtree (name)
    C^τ :: T̂^τ
    λℓ. T̂^τ2 :: T̂^(τ1→τ2)
    T̂^(τ1→τ2) @ T̂^τ1 :: T̂^τ2

A stable name (here, of type Name) associates a unique name ν with each unique term node, so that two terms with the same stable name are identical, and are represented by the same data structure in memory. Here, we denote the stable name of a term as a superscript during pattern matching — e.g., 1^ν is a constant with stable name ν, just as in the second and third diagram in Figure 2.
An occurrence map, Ω :: Name ↦ Int, is a finite map that determines the number of occurrences of a Name that we encountered during a traversal. The expression Ω ν yields the number of occurrences of the name ν, and we have ν ∈ Ω ≡ (Ω ν > 0). To add an occurrence to Ω, we write ν ▷ Ω. We will see in the next subsection that we cannot simplify Ω to be merely a set of occurring names. We need the actual occurrence count to determine where shared subterms should be let-bound.
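In GHC these stable names are available from System.Mem.StableName. The following sketch (our own, not the paper’s code) shows the property the algorithm relies on: two references to the same heap node yield equal stable names, so a second visit to a shared subterm is detected in constant time. The printed results assume the program is run without optimisations that might common up the two syntactically equal expressions; note also that stable-name equality may under-report sharing, which is safe, as it only means some sharing goes unrecovered.

```haskell
import Control.Exception (evaluate)
import System.Mem.StableName (makeStableName)

data Exp = Lit Int | Add Exp Exp

main :: IO ()
main = do
  -- Force the node so later stable names refer to the same
  -- evaluated heap object.
  shared <- evaluate (Add (Lit 1) (Lit 2))
  -- Two visits to the same node yield equal stable names ...
  n1 <- makeStableName shared
  n2 <- makeStableName shared
  print (n1 == n2)                          -- True
  -- ... while a structurally equal but distinct node does not.
  other <- evaluate (Add (Lit 1) (Lit 2))
  n3 <- makeStableName other
  print (n1 == n3)                          -- False
```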
The identification and pruning of shared subtrees is formalised by the following function operating on closed terms from T^τ:

  prune :: Level → (Name ↦ Int) → T^τ → ((Name ↦ Int), T̂^τ)
  prune ℓ Ω e^ν | ν ∈ Ω     = (ν ▷ Ω, ν)
  prune ℓ Ω e^ν | otherwise = enter (ν ▷ Ω) e
    where
      enter Ω c         = (Ω, c)
      enter Ω (λx.e)    = let (Ω′, e′) = prune (ℓ + 1) Ω ([ℓ/x] e)
                          in  (Ω′, λℓ.e′)
      enter Ω (e1 @ e2) = let (Ω1, e1′) = prune ℓ Ω e1
                              (Ω2, e2′) = prune ℓ Ω1 e2
                          in  (Ω2, e1′ @ e2′)
The first equation of prune covers the case of a term’s repeated occurrence. In that case, we prune sharing by replacing the term e by a tag ν containing its stable name — these are the dotted lines in the second diagram in Figure 2.
To interleave sharing recovery with the conversion from HOAS to typed de Bruijn indices, prune tracks the nesting Level of lambdas. Moreover, the lambda case of enter replaces the HOAS binder x by the level ℓ at the binding and usage sites.
Why don’t we separate computing occurrences from tree pruning? When computing occurrences, we must not traverse shared subtrees multiple times, so we can as well prune at the same time. Moreover, in the first line of prune, we cannot simply return e instead of ν — e is of the wrong form as it has type T^τ and not T̂^τ!
As far as type-preservation is concerned, we do lose information due to replacing variables by levels ℓ. This is the inevitable loss described by Atkey et al. [1], which we make up for by a dynamic check in an environment lookup, as already discussed.
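The interleaving of counting and pruning can be illustrated on a small first-order sketch. This is our own simplification: nodes carry explicit Int labels standing in for stable names, and the AST is untyped.

```haskell
import qualified Data.Map.Strict as Map

-- A first-order DAG whose nodes carry explicit labels playing the
-- role of stable names.
data Exp = Lit Int Int              -- label, value
         | Add Int Exp Exp          -- label, left, right

-- Pruned trees: a repeated occurrence is replaced by its name,
-- like the ν tags of the pruned grammar.
data PExp = PTag Int
          | PLit Int Int
          | PAdd Int PExp PExp
          deriving (Eq, Show)

labelOf :: Exp -> Int
labelOf (Lit n _)   = n
labelOf (Add n _ _) = n

-- One traversal both counts occurrences and prunes: a previously
-- visited node becomes a tag and is not traversed again.
prune :: Map.Map Int Int -> Exp -> (Map.Map Int Int, PExp)
prune om e
  | Map.member n om = (Map.adjust (+1) n om, PTag n)
  | otherwise       = enter (Map.insert n 1 om) e
  where
    n = labelOf e
    enter om' (Lit m v)     = (om', PLit m v)
    enter om' (Add m e1 e2) = let (om1, p1) = prune om' e1
                                  (om2, p2) = prune om1 e2
                              in  (om2, PAdd m p1 p2)

-- Node 1 is reached twice; the second edge becomes a tag.
demo :: (Map.Map Int Int, PExp)
demo = prune Map.empty (Add 0 shared shared)
  where shared = Add 1 (Lit 2 10) (Lit 3 20)
-- fst demo == fromList [(0,1),(1,2),(2,1),(3,1)]
-- snd demo == PAdd 0 (PAdd 1 (PLit 2 10) (PLit 3 20)) (PTag 1)
```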
3.3 Float shared subterms
Second, we float all shared subtrees out to where they should be let-bound, represented by (see third diagram in Figure 2)

  ↑T^τ → ν̄ : ↑T^τ̄′ ⟨↓T^τ⟩          -- floated subterms, then a body
  where
    ν^τ :: ↓T^τ
    C^τ :: ↓T^τ
    λν. ↑T^τ2 :: ↓T^(τ1→τ2)
    ↑T^(τ1→τ2) @ ↑T^τ1 :: ↓T^τ2

A term in ↑T comprises a sequence of floated-out subterms labelled by their stable name as well as a body term from ↓T from which the floated subterms were extracted. Moreover, the levels ℓ that replaced lambda binders in T̂ get replaced by the stable name of their term node. This simplifies a uniform introduction of de Bruijn indices for let and lambda bound variables.
We write ν̄ : ↑T̄ for a possibly empty sequence of items ν1 : ↑T1, …, νn : ↑Tn, where • denotes an empty sequence.
The floating function float maintains an auxiliary structure Δ of floating terms and levels, defined as follows:

  Δ → ν^i : ↑T^τ  |  ν^i : ·  |  ν^i : ℓ

These are floated subtrees named ν of which we have collected i occurrences. The occurrence count indicates where a shared subterm gets let bound: namely at the node where it matches Ω ν. This is why prune needed to collect the number of occurrences in Ω. When the occurrence count matches Ω ν, we call the floated term saturated.
The following function determines saturated floated terms, which ought to be let bound:

  bind :: (Name ↦ Int) → Δ → ∃τ. ν̄ : ↑T^τ
  bind Ω •                       = •
  bind Ω (ν^i : e, Δ) | Ω ν == i = ν : e, bind Ω Δ
  bind Ω (ν^i : _, Δ)            = bind Ω Δ

Note that Δ does not keep track of the type τ of a floated term ↑T^τ; hence, floated terms from bind come in an existential package. This does not introduce additional loss of type safety, as we already lost the type of lambda bound variables in ν^i : ℓ. It merely means that let bound, just like lambda bound, variables require the dynamically checked environment lookup we already discussed.
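The role of bind can be sketched over an untyped Δ. This is our own simplified encoding: names are Ints, terms are left abstract, and the depth ordering of Δ is ignored.

```haskell
import qualified Data.Map.Strict as Map

-- An entry of Δ: name, occurrences collected so far, and the floated
-- term, present only for the first (unpruned) occurrence.
data Floated e = Floated { fName :: Int, fCount :: Int, fTerm :: Maybe e }
  deriving (Eq, Show)

-- 'bindHere' plays the role of 'bind': it selects the saturated
-- entries, those whose collected count equals the total occurrence
-- count in Ω, so their terms must be let-bound at this node.
bindHere :: Map.Map Int Int -> [Floated e] -> [Floated e]
bindHere omega =
  filter (\f -> Map.lookup (fName f) omega == Just (fCount f))

demo :: [Floated String]
demo = bindHere (Map.fromList [(1, 2), (2, 3)])
                [ Floated 1 2 (Just "inc x")  -- saturated: 2 of 2
                , Floated 2 1 Nothing ]       -- still missing 2 of 3
-- demo == [Floated 1 2 (Just "inc x")]
```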
When floating the first occurrence of a shared tree (not pruned by prune), we use ν^i : ↑T^τ. When floating subsequent occurrences (which were pruned), we use ν^i : ·. Finally, when floating a level, to replace it by a stable name, we use ν^i : ℓ.
We define a partial ordering on floated terms: ν1^i : x < ν2^j : y iff the direct path from ν1 to the root of the AST is shorter than that of ν2. We keep sequences of floated terms in descending order — so that the deepest subterm comes first. We write Δ1 ⊎ Δ2 to merge two sequences of floated terms. Merging respects the partial order, and it combines floated trees with the same stable name by adding their occurrence counts. To combine the first occurrence and a subsequent occurrence of a shared tree, we preserve the term of the first occurrence. We write Δ \ ν̄ to delete elements of Δ that are tagged with a name that appears in the sequence ν̄.
We can now formalise the floating process as follows:

  float :: (Name ↦ Int) → T̂^τ → (Δ, ↑T^τ)
  float Ω ℓ^ν = (ν^1 : ℓ, ν)
  float Ω ν   = (ν^1 : ·, ν)
  float Ω e^ν = let (Δ, e′)  = descend e
                    ν̄b : ēb = bind Ω Δ
                    d        = ν̄b : ēb ⟨e′⟩
                in  if Ω ν == 1 then (Δ \ ν̄b, d)
                                else ((Δ \ ν̄b) ⊎ {ν^1 : d}, ν)
    where
      descend :: T̂^τ → (Δ, ↓T^τ)
      descend c         = (•, c)
      descend (λℓ.e)    = let (Δ, e′) = float Ω e
                          in  if ∃ν′ i. (ν′^i : ℓ) ∈ Δ
                                then (Δ \ {ν′}, λν′.e′)
                                else (Δ, λ_.e′)
      descend (e1 @ e2) = let (Δ1, e1′) = float Ω e1
                              (Δ2, e2′) = float Ω e2
                          in  (Δ1 ⊎ Δ2, e1′ @ e2′)

The first two cases of float ensure that the levels of lambda bound variables and the names of pruned shared subterms are floated regardless of how often they occur. In contrast, the third equation floats a term with name ν only if it is shared; i.e., Ω ν is not 1. If it is shared, it is also pruned; i.e., replaced by its name ν — just as in the third diagram of Figure 2.
Regardless of whether a term gets floated, all saturated floated terms, ν̄b : ēb, must prefix the result, e′, and be removed from Δ. When descending into a term, the only interesting case is that of lambdas. For a lambda at level ℓ, we look for a floated level of the form ν′ : ℓ. If that is available, ν′ replaces ℓ as a binder, and we remove ν′ : ℓ from Δ. However, if ν′ : ℓ is not in Δ, the binder introduced by the lambda doesn’t get used in e. In this case, we pick an arbitrary new name, here symbolised by an underscore “_”.
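The ⊎ merge used by descend at applications can be sketched on the same simplified, untyped Δ encoding (our own; the depth-based ordering of Δ is again ignored for brevity):

```haskell
-- Simplified Δ entries: name, occurrence count, and the floated term,
-- present only for the first occurrence.
data Floated e = Floated { fName :: Int, fCount :: Int, fTerm :: Maybe e }
  deriving (Eq, Show)

-- Merge two sequences of floated terms: entries with the same name
-- combine by adding their occurrence counts, and the term of the
-- first occurrence (a Just) wins over a pruned occurrence (Nothing).
merge :: [Floated e] -> [Floated e] -> [Floated e]
merge xs []     = xs
merge xs (y:ys) = merge (insertF y xs) ys
  where
    insertF f []     = [f]
    insertF f (g:gs)
      | fName f == fName g =
          g { fCount = fCount f + fCount g
            , fTerm  = fTerm g `orElse` fTerm f } : gs
      | otherwise          = g : insertF f gs
    orElse (Just t) _ = Just t
    orElse Nothing  t = t

demo :: [Floated String]
demo = merge [Floated 1 1 (Just "inc 2")]
             [Floated 1 1 Nothing, Floated 2 1 (Just "inc nine")]
-- demo == [Floated 1 2 (Just "inc 2"), Floated 2 1 (Just "inc nine")]
```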
3.4 Binder introduction
Thirdly, we introduce typed de Bruijn indices to represent lambda and let binding structure (rightmost diagram in Figure 2):

  env⊢T^τ where
    C^τ :: env⊢T^τ
    ι^τ_env :: env⊢T^τ
    λ (τ1, env)⊢T^τ2 :: env⊢T^(τ1→τ2)
    env⊢T^(τ1→τ2) @ env⊢T^τ1 :: env⊢T^τ2
    let env⊢T^τ1 in (τ1, env)⊢T^τ2 :: env⊢T^τ2

With this type of terms, e :: env⊢T^τ means that e is a term representing a computation producing a value of type τ under the type environment env. Type environments are nested pair types, possibly terminated by a unit type (). For example, (((), τ1), τ0) is a type environment, where de Bruijn index 0 represents a variable of type τ0 and de Bruijn index 1 represents a variable of type τ1. We abbreviate let e1 in · · · let en in eb as let ē in eb. Both λ and let use de Bruijn indices ι instead of introducing explicit binders.
To replace the names of pruned subtrees and of lambda bound variables by de Bruijn indices, we need to construct a suitable type environment as well as an association of environment entries, their de Bruijn indices, and the stable names that they replace. We maintain the type environment with associated de Bruijn indices in the following environment layout structure:

  Layout env env′ where
    ∅ :: Layout env ()
    Layout env env′ ; ι^τ_env :: Layout env (env′, τ)

Together with a layout, we use a sequence of names ρ of the same size as the layout, where corresponding entries represent the same variable. As this association between typed layout and untyped sequence of names is not validated by types, the lookup function lyt # i getting the ith index of layout lyt makes use of a dynamic type check. Its signature is (#) :: ℕ → Layout env env′ → ι^τ_env.
Now we can introduce de Bruijn indices to body expressions:

  body :: Layout env env → ρ → ↓T^τ → env⊢T^τ
  body lyt (ν_ρ,0, …, ν_ρ,n) ν | ν == ν_ρ,i = lyt # i
  body lyt νρ c         = c
  body lyt νρ (λν.e)    = λ (binders lyt⁺ (ν, νρ) e)
  body lyt νρ (e1 @ e2) = (binders lyt νρ e1) @ (binders lyt νρ e2)

The first equation performs a lookup in the environment layout at the same index i where the stable name ν occurs in the name environment ρ. The lookup is the same for lambda and let bound variables. It is the only place where we need a dynamic type check, and that is already needed for lambda bound variables alone. In the case of a lambda, we add a new binder by extending the layout, denoted lyt⁺, with a new zeroth de Bruijn index.
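The typed de Bruijn indices and nested-pair environments of this final stage can be sketched with GADTs. This is a sketch in the style of, but not necessarily identical to, the actual Accelerate implementation:

```haskell
{-# LANGUAGE GADTs #-}

-- A typed de Bruijn index: 'Idx env t' proves that the environment
-- 'env' (a nested pair type) contains an entry of type 't'.
data Idx env t where
  ZeroIdx ::              Idx (env, t) t
  SuccIdx :: Idx env t -> Idx (env, s) t

-- A runtime environment shaped like its type-level counterpart.
data Val env where
  Empty :: Val ()
  Push  :: Val env -> t -> Val (env, t)

-- Projection cannot fail: the index's type guarantees the entry exists.
prj :: Idx env t -> Val env -> t
prj ZeroIdx      (Push _ v)   = v
prj (SuccIdx ix) (Push env _) = prj ix env

-- The environment (((), τ1), τ0) from the text, with τ0 = Char and
-- τ1 = Int: index 0 has type Char, index 1 has type Int.
env10 :: Val (((), Int), Char)
env10 = Push (Push Empty 42) 'x'

demo :: (Char, Int)
demo = (prj ZeroIdx env10, prj (SuccIdx ZeroIdx) env10)
-- demo == ('x', 42)
```

The dynamic type check mentioned in the text is only needed when converting the untyped name environment ρ into such an index; once an Idx exists, lookups are statically type-safe.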
    i
    body lyt ⌫⇢ c
    =
    c
    body lyt ⌫⇢ (
    ⌫.e
    ) = (
    binders lyt
    + (
    ⌫, ⌫⇢)
    e
    )
    body lyt ⌫⇢ (
    e1 @
    e2) = (
    binders lyt ⌫⇢ e1) @ (
    binders lyt ⌫⇢ e2)
    The first equation performs a lookup in the environment layout
    at the same index where the stable name

    occurs in the name
    environment

    . The lookup is the same for lambda and let bound
    variables. It is the only place where we need a dynamic type check
    and that is already needed for lambda bound variables alone.
    In the case of a lambda, we add a new binder by extending
    the layout, denoted
    lyt
    +, with a new zeroth de Bruijn index and
    shifting all others one up. Keeping the name environment in sync,
    we add the stable name

    , which #
    T
    used as a binder.
    In the same vein, we bind
    n
    floated terms

    :
    e
    with let bind-
    ings in body expression
    eb
    , by extending the type environment
    n
    times (
    map
    applies a function to each element of a sequence):
    terms, which ought to be let bound:
    bind
    :: (
    Name
    7!
    Int
    ) ! ! 9
    ⌧.⌫
    : "
    T

    bind ⌦
    • = •
    bind ⌦
    (

    i
    :
    e,
    ) |
    ⌦⌫
    ==
    i
    =

    :
    e, bind ⌦
    bind ⌦
    (

    i
    :
    ,
    ) =
    bind ⌦
    Note that does not keep track of the type

    of a floated term "
    T

    ;
    hence, floated terms from
    bind
    come in an existential package. This
    does not introduce additional loss of type safety as we already lost
    the type of lambda bound variables in

    i
    :
    `
    . It merely means that let
    bound, just like lambda bound, variables require the dynamically
    checked environment look up we already discussed.
    When floating the first occurrence of a shared tree (not pruned
    by
    prune
    ), we use

    i
    : "
    T

    . When floating subsequent occurrences
    (which were pruned), we use

    i
    : ·. Finally, when floating a level, to
    replace it by a stable name, we use

    i
    :
    `
    .
    We define a partial ordering on floated terms:
    ⌫1
    i
    :
    x < ⌫2
    j
    :
    y
    iff the direct path from
    ⌫1
    to the root of the AST is shorter than
    that of
    ⌫2
    . We keep sequences of floated terms in descending order
    — so that the deepest subterm comes first. We write 1
    ] 2
    to
    merge two sequences of floated terms. Merging respects the partial
    order, and it combines floated trees with the same stable name by
    adding their occurrence counts. To combine the first occurrence and
    a subsequent occurrence of a shared tree, we preserve the term of
    the first occurrence. We write \

    to delete elements of that
    are tagged with a name that appears in the sequence

    .
    We can now formalise the floating process as follows:
    float
    :: (
    Name
    7!
    Int
    ) !
    T
    ⌧ ! (
    ,
    "
    T

    )
    float ⌦ `
    ⌫ = (

    1
    :
    `, ⌫
    )
    float ⌦ ⌫
    = (

    1
    : ·
    , ⌫
    )
    float ⌦ e
    ⌫ = let
    (
    , e
    0) =
    descend e
    ⌫b
    :
    eb
    =
    bind ⌦
    d
    =
    ⌫b
    :
    eb e
    0
    in
    if
    ⌦⌫
    == 1 then
    ( \
    ⌫b, d
    )
    else
    ( \
    ⌫b
    ] {

    :
    d
    }
    , ⌫
    )
    where
    descend
    ::
    T
    ⌧ ! (
    ,
    #
    T

    )
    descend c
    = (•
    , c
    )
    descend
    (
    `.e
    ) = let
    (
    , e
    0) =
    float ⌦ e
    in
    if 9

    0
    i.
    (

    0 i
    :
    `
    ) 2 then
    ( \ {

    0}
    , ⌫
    0
    .e
    0)
    else
    (
    , .e
    0)
    descend
    (
    e1 @
    e2) = let
    ( 1, e
    0
    1
    ) =
    float ⌦ e1
    ( 2, e
    0
    2
    ) =
    float ⌦ e2
    in
    ( 1
    ] 2, e
    0
    1
    @
    e
    0
    2
    )
    The first two cases of
    float
    ensure that the levels of lambda bound
    variables and the names of pruned shared subterms are floated
    terms,
    ⌫b
    :
    eb
    , must prefix the result,
    e
    0, and be removed from .
    When
    descend
    ing into a term, the only interesting case is for
    lambdas. For a lambda at level
    `
    , we look for a floated level of the
    form

    0 :
    `
    . If that is available,

    0 replaces
    `
    as a binder and we
    remove

    0 :
    `
    from . However, if

    0 :
    `
    is not in , the binder
    introduced by the lambda doesn’t get used in
    e
    . In this case, we
    pick an arbitrary new name; here symbolised by an underscore ” ”.
    3.4 Binder introduction
    Thirdly, we introduce typed de Bruijn indices to represent lambda
    and let binding structure (rightmost diagram in Figure 2):
    env
    T
    ⌧ where
    C
    ⌧ :: env
    T

    env

    ⌧ :: env
    T

    (⌧1, env)
    T
    ⌧2 :: env
    T
    ⌧1!⌧2
    env
    T
    ⌧1!⌧2
    1
    @ env
    T
    ⌧1
    2
    :: env
    T
    ⌧2
    let env
    T
    ⌧1
    1
    in (⌧1, env)
    T
    ⌧2
    2
    :: env
    T
    ⌧2
    With this type of terms,
    e
    :: env
    T
    ⌧ means that
    e
    is a term repre-
    senting a computation producing a value of type

    under the type
    environment
    env
    . Type environments are nested pair types, possi-
    bly terminated by a unit type (). For example, ((()
    , ⌧1)
    , ⌧0) is a
    type environment, where de Bruijn index 0 represents a variable of
    type
    ⌧0
    and de Bruijn index 1 represents a variable of type
    ⌧1
    .
    We abbreviate let
    e1 in · · · let
    en in
    eb
    as let
    e
    in
    eb
    .
    Both and let use de Bruijn indices

    instead of introducing
    explicit binders.
    To replace the names of pruned subtrees and of lambda bound
    variables by de Bruijn indices, we need to construct a suitable
    type environment as well as an association of environment entries,
    their de Bruijn indices, and the stable names that they replace. We
    maintain the type environment with associated de Bruijn indices in
    the following environment layout structure:
    env env0
    where
    :: env ()
    env env0
    ; env

    ⌧ :: env (env0, t)
    Together with a layout, we use a sequence of names

    of the same
    size as the layout, where corresponding entries represent the same
    variable. As this association between typed layout and untyped
    sequence of names is not validated by types, the lookup function
    lyt
    #
    i
    getting the
    i
    th index of layout
    lyt
    makes use of a dynamic
    type check. It’s signature is (#) :: N ! env env0
    ! env

    ⌧ .
    Now we can introduces de Bruijn indices to body expressions:
    body
    :: env env !

    ! #
    T

    ! env
    T

    body lyt
    (
    ⌫⇢,0, . . . , ⌫⇢,n )

    |

    ==
    ⌫⇢,i =
    lyt
    #
    i
    body lyt ⌫⇢ c
    =
    c
    body lyt ⌫⇢ (
    ⌫.e
    ) = (
    binders lyt
    + (
    ⌫, ⌫⇢)
    e
    )
    body lyt ⌫⇢ (
    e1 @
    e2) = (
    binders lyt ⌫⇢ e1) @ (
    binders lyt ⌫⇢ e2)
    The first equation performs a lookup in the environment layout
    at the same index where the stable name

    occurs in the name
    environment

    . The lookup is the same for lambda and let bound
    variables. It is the only place where we need a dynamic type check
    and that is already needed for lambda bound variables alone.
    In the case of a lambda, we add a new binder by extending
    the layout, denoted
    lyt
    +, with a new zeroth de Bruijn index and
    shifting all others one up. Keeping the name environment in sync,
    #
    (Before fusion)
    p1
    p1
    p2 p3
    p4
    p5 p6 p7
    c1
    c2
    (After producer/producer fusion)
    c2
    p5
    p1
    c1
    p6 p7
    p3
    p2
    p4
    (After consumer/producer fusion)
    c2
    p5
    p1
    c1
    p6 p7
    p3
    p2
    p4
    Figure 3. Produce/producer and consumer/producer fusion
    binders
    :: env env !

    ! "
    T

    ! env
    T

    binders lyt ⌫⇢ (

    :
    e eb
    ) =
    let
    map
    (
    binders lyt ⌫⇢)
    e
    in
    body lyt
    +n (
    ⌫, ⌫⇢)
    eb
    where
    n
    =
    length
    (

    :
    e
    )
    We tie the three stages together to convert from HOAS with sharing
    recovery producing let bindings and typed de Bruijn indices:
    variables are used multiple times in the body of an expression, un-
    restrained inlining can lead to duplication of work. Compilers such
    as GHC, handle this situation by only inlining the definitions of let-
    bound variables that have a single use site, or by relying on some
    heuristic about the size of the resulting code to decide what to inline
    [26]. However, in typical Accelerate programs, each array is used at
    least twice: once to access the shape information and once to access
    the array data; so, we must handle at least this case separately.
    Filtering.
    General array fusion transforms must deal with filter-
    like operations, for which the size of the result structure depends on
    the value of the input structure, as well as its size. Accelerate does
    not encode filtering as a primitive operation, so we do not need to
    consider it further.1
    Fusion at run-time.
    As the Accelerate language is embedded in
    Haskell, compilation of the Accelerate program happens at Haskell
    runtime rather than when compiling the Haskell program. For this
    reason, optimisations applied to an Accelerate program contribute
    to its overall runtime, so we must be mindful of the cost of analysis
    and code transformation. On the flip-side, runtime optimisations
    can make use of information that is only available at runtime.
    Fusion on typed de Brujin terms.
    We fuse Accelerate programs
    by rewriting typed de Bruijn terms in a type preserving manner.
    However, maintaining type information adds complexity to the def-
    initions and rules, which amounts to a partial proof of correctness
    checked by the type checker, but is not particularly exciting for the
    present exposition. Hence, in this section, we elide the steps neces-
    sary to maintain type information during fusion.
    4.1 The Main Idea
    All collective operations in Accelerate are array-to-array transfor-
    mations. Reductions, such as fold, which reduce an array to a sin-
    gle element, yield a singleton array rather than a scalar expression.
    Hence, we can partition array operations into two categories:
    1. Operations where each element of the result array depends on at
    most one element of each input array. Multiple elements of the
    c2
    p5
    p1
    c1
    p6 p7
    p3
    p2
    p4
    (After consumer/producer fusion)
    c2
    p5
    p1
    c1
    p6 p7
    p3
    p2
    p4
    Figure 3. Produce/producer and consumer/producer fusion
    binders
    :: env env !

    ! "
    T

    ! env
    T

    binders lyt ⌫⇢ (

    :
    e eb
    ) =
    let
    map
    (
    binders lyt ⌫⇢)
    e
    in
    body lyt
    +n (
    ⌫, ⌫⇢)
    eb
    where
    n
    =
    length
    (

    :
    e
    )
    We tie the three stages together to convert from HOAS with sharing
    recovery producing let bindings and typed de Bruijn indices:
    hoasSharing
    ::
    T
    ⌧ ! ()
    T

    hoasSharing e
    = let
    (
    ⌦, e
    0) =
    prune
    0 •
    e
    (•
    , e
    00) =
    float ⌦ e
    0
    in
    binders

    e
    00
    4. Array fusion
    Fusion in a massively data-parallel, embedded language for GPUs,
    such as Accelerate, requires a few uncommon considerations.
    Parallelism.
    While fusing parallel collective operations, we must
    be careful not to lose information essential to parallel execution.
    For example, foldr/build fusion [15] is not applicable, because
    it produces sequential tail-recursive loops rather than massively
    parallel GPU kernels. Similarly, the split/join approach used
    in Data Parallel Haskell (DPH) [16] is not helpful, although fused
    operations are split into sequential and parallel subcomputations, as
    they assume an explicit parallel scheduler, which in DPH is written
    directly in Haskell. Accelerate compiles massively parallel array
    combinators to CUDA code via template skeleton instantiation, so
    any fusion system must preserve the combinator representation of
    the intermediate code.
    Sharing.
    Existing fusion transforms rely on inlining to move pro-
    ducer and consumer expressions next to each other, which allows
    producer/consumer pairs to be detected. However, when let-bound
    Fusion at run-time.
    As th
    Haskell, compilation of the A
    runtime rather than when co
    reason, optimisations applie
    to its overall runtime, so we
    and code transformation. O
    can make use of information
    Fusion on typed de Brujin
    by rewriting typed de Bruij
    However, maintaining type i
    initions and rules, which am
    checked by the type checker
    present exposition. Hence, in
    sary to maintain type inform
    4.1 The Main Idea
    All collective operations in
    mations. Reductions, such a
    gle element, yield a singleto
    Hence, we can partition arra
    1. Operations where each e
    most one element of eac
    output array may depen
    all output elements can b
    these operations as produ
    2. Operations where each e
    multiple elements of the
    consumers, in spite of th
    Table 1 summarises the colle
    In a parallel context, produc
    cause independent element-
    ping to the GPU. Consume
    know exactly how the comp
    plement them efficiently. For
    ciative operator) can be impl
    but a parallel scan requires
    nately, this sort of informati
    niques. To support the diffe
    sumers, our fusion transform
    • Producer/producer: fuse
    producer. This is implem
    mation on the AST.
    • Consumer/producer: fus
    into the consumer. This h
    we specialise the consum
    1 filter is easily implemented
    is provided as part of the library
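The three-stage pipeline above (prune, float, binders) recovers sharing that the host program expresses with ordinary Haskell let bindings. As a rough, deterministic stand-in for the stable-name machinery, the sketch below recovers sharing by structural equality (common-subexpression elimination) on a tiny expression type: count subterm occurrences, then let-bind the largest compound term that occurs more than once. All names here (Expr, occ, recover) are invented for this sketch; the real implementation keys its occurrence map on stable names, observes heap sharing rather than comparing subtrees, and binds every shared term, not just one.

```haskell
import qualified Data.Map.Strict as M
import Data.List (maximumBy)
import Data.Ord (comparing)

-- A miniature term language; 'Let' and 'Var' are the *output* of recovery.
data Expr = Lit Int | Var String | Add Expr Expr | Let String Expr Expr
  deriving (Eq, Ord, Show)

-- Occurrence counting (the analogue of the 'prune' pass):
-- how often does each compound subterm occur in the tree?
occ :: Expr -> M.Map Expr Int
occ e@(Add a b) = M.insertWith (+) e 1 (M.unionWith (+) (occ a) (occ b))
occ _           = M.empty

size :: Expr -> Int
size (Add a b) = 1 + size a + size b
size _         = 1

-- Recovery (the analogue of 'float' plus 'binders'): let-bind the largest
-- subterm occurring at least twice, replacing its occurrences by a variable.
recover :: Expr -> Expr
recover e =
  case [s | (s, n) <- M.toList (occ e), n > 1] of
    []     -> e
    shared -> let s = maximumBy (comparing size) shared
              in  Let "x" s (replace s (Var "x") e)
  where
    replace s v t | t == s = v
    replace s v (Add a b)  = Add (replace s v a) (replace s v b)
    replace _ _ t          = t
```

Observed sharing (two pointers to one heap node) is invisible to this structural version; the paper's use of stable names is exactly what lets the pass see the heap graph without comparing subtrees.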

    View Slide

  67. How fast are we going?

    View Slide

  68. Dot Product
    Run Time (ms)
    Elements (millions)
    Data.Vector
    Accelerate -fusion
    ... +fusion
    Hand optimised GPU

    View Slide

  69. Dot Product
    Run Time (ms)
    Elements (millions)
    Data.Vector
    Accelerate -fusion
    ... +fusion
    Hand optimised GPU
    2x

    View Slide

  70. Dot Product
    Run Time (ms)
    Elements (millions)
    Data.Vector
    Accelerate -fusion
    ... +fusion
    Hand optimised GPU
    2x
    1.2x

    View Slide
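The -fusion/+fusion gap in the dot-product numbers above comes from eliminating the intermediate array between zipWith (*) and fold (+) 0 (the Accelerate source is written along those lines). The producer/producer idea can be modelled sequentially with a delayed-array representation: a producer is just an extent plus an index function, so composing producers composes functions instead of materialising arrays, and a consumer such as fold forces everything in one pass. This is a simplified CPU-side sketch with invented names (Delayed, mapD, and so on); the real implementation emits a single CUDA kernel.

```haskell
{-# LANGUAGE BangPatterns #-}

-- A producer as a "delayed array": an extent plus an index function.
data Delayed a = Delayed Int (Int -> a)

-- (!!) is fine for a sketch; a real backend indexes a flat buffer.
fromList :: [a] -> Delayed a
fromList xs = Delayed (length xs) (xs !!)

-- Producer/producer fusion: composing producers composes index functions,
-- so no intermediate array is ever materialised.
mapD :: (a -> b) -> Delayed a -> Delayed b
mapD f (Delayed n ix) = Delayed n (f . ix)

zipWithD :: (a -> b -> c) -> Delayed a -> Delayed b -> Delayed c
zipWithD f (Delayed n ix) (Delayed m iy) = Delayed (min n m) (\i -> f (ix i) (iy i))

-- Consumer/producer fusion: the consumer embeds the fused index
-- function directly in its loop.
foldD :: (b -> a -> b) -> b -> Delayed a -> b
foldD f z (Delayed n ix) = go z 0
  where
    go !acc i | i == n    = acc
              | otherwise = go (f acc (ix i)) (i + 1)

-- Dot product in this model: one traversal, no intermediate structure.
dotp :: Num a => [a] -> [a] -> a
dotp xs ys = foldD (+) 0 (zipWithD (*) (fromList xs) (fromList ys))
```

Keeping producers in this functional form is what preserves the combinator structure that Accelerate's skeleton-based code generator needs.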

  71. Black-Scholes options pricing
    Run Time (ms)
    Options (millions)
    Accelerate -sharing
    ... +sharing
    Hand optimised GPU

    View Slide

  72. Black-Scholes options pricing
    Run Time (ms)
    Options (millions)
    Accelerate -sharing
    ... +sharing
    Hand optimised GPU
    17x

    View Slide

  73. Black-Scholes options pricing
    Run Time (ms)
    Options (millions)
    Accelerate -sharing
    ... +sharing
    Hand optimised GPU
    17x 0.9x

    View Slide
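The 17x gap closed by sharing recovery above is easy to provoke: Black-Scholes let-binds intermediate values in the host program and reuses them several times, and without sharing recovery each reuse re-embeds the whole subtree, so the reified AST (and hence the generated GPU code) grows exponentially with the nesting depth. A minimal demonstration of the blowup, with all names invented for this sketch:

```haskell
-- A tiny embedded-expression type; 'A' stands for some binary operator.
data E = L Int | A E E

-- Reusing a host-language binding twice per level, as Black-Scholes does
-- with its intermediate terms:
dup :: E -> E
dup x = A x x        -- one shared node in the host heap ...

nodes :: E -> Integer
nodes (L _)   = 1
nodes (A a b) = 1 + nodes a + nodes b   -- ... but a naive reify walks it twice

-- After k levels of reuse the unshared tree has 2^(k+1) - 1 nodes,
-- where sharing recovery would emit just k let bindings.
blowup :: Int -> Integer
blowup k = nodes (iterate dup (L 1) !! k)
```

The host heap only ever holds k + 1 nodes here; it is solely the reified AST that explodes, which is precisely what the sharing-recovery pass prevents.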

  74. N-Body
    Run Time (ms)
    Bodies
    Accelerate -fusion -sharing
    ... -fusion +sharing
    ... +fusion +sharing
    Hand optimised GPU

    View Slide

  75. N-Body
    Run Time (ms)
    Bodies
    Accelerate -fusion -sharing
    ... -fusion +sharing
    ... +fusion +sharing
    Hand optimised GPU
    11x

    View Slide

  76. N-Body
    Run Time (ms)
    Bodies
    Accelerate -fusion -sharing
    ... -fusion +sharing
    ... +fusion +sharing
    Hand optimised GPU
    Missing
    “optimisation”
    11x

    View Slide

  77. N-Body
    Run Time (ms)
    Bodies
    Accelerate -fusion -sharing
    ... -fusion +sharing
    ... +fusion +sharing
    Hand optimised GPU
    Run Time (ms)
    Bodies
    Accelerate -fusion -sharing
    ... -fusion +sharing
    ... +fusion +sharing
    Hand optimised GPU
    Current version
    Missing
    “optimisation”
    2x
    11x

    View Slide

Summary
    Embedded languages are restricted languages
    Skeletons encapsulate efficient code idioms
    Fusion and Sharing reduce the abstraction penalty
    https://github.com/AccelerateHS/

    View Slide