Retrofitting Parallelism onto OCaml

Slides from the ICFP 2020 Talk

KC Sivaramakrishnan

August 12, 2020

Transcript

  1. Retrofitting Parallelism onto OCaml
    KC Sivaramakrishnan, Stephen Dolan, Leo White,
    Sadiq Jaffer, Tom Kelly, Anmol Sahoo, Sudha Parimala, Atul
    Dhiman, Anil Madhavapeddy
    OCaml Labs

  2. The Astrée Static Analyzer
    Industry Projects

  3. The Astrée Static Analyzer
    Industry Projects
    No multicore support!

  4. Multicore OCaml
    • Adds native support for concurrency and shared-memory
    parallelism to OCaml

  5. Multicore OCaml
    • Adds native support for concurrency and shared-memory
    parallelism to OCaml
    • Focus of this work is parallelism
    ✦ Building a multicore GC for OCaml

  6. Multicore OCaml
    • Adds native support for concurrency and shared-memory
    parallelism to OCaml
    • Focus of this work is parallelism
    ✦ Building a multicore GC for OCaml
    • Key parallel GC design principle
    ✦ Backwards compatibility before parallel scalability

  7. Challenges
    • Millions of lines of legacy code
    ✦ Weak references, ephemerons, lazy values, finalisers
    ✦ Low-level C API that bakes in GC invariants
    ✦ Cost of refactoring sequential code itself is prohibitive

  8. Challenges
    • Millions of lines of legacy code
    ✦ Weak references, ephemerons, lazy values, finalisers
    ✦ Low-level C API that bakes in GC invariants
    ✦ Cost of refactoring sequential code itself is prohibitive
    • Type safety
    ✦ Dolan et al., “Bounding Data Races in Space and Time”, PLDI’18
    ✦ Strong guarantees (including type safety) under data races

  9. Challenges
    • Millions of lines of legacy code
    ✦ Weak references, ephemerons, lazy values, finalisers
    ✦ Low-level C API that bakes in GC invariants
    ✦ Cost of refactoring sequential code itself is prohibitive
    • Type safety
    ✦ Dolan et al., “Bounding Data Races in Space and Time”, PLDI’18
    ✦ Strong guarantees (including type safety) under data races
    • Low-latency and predictable performance
    ✦ Thanks to the GC design

  10. Stock OCaml GC
    • A generational, non-moving, incremental, mark-and-sweep GC
    [Diagram: minor heap and major heap; the major heap is annotated "Incremental and non-moving"]
    • Minor heap
    ✦ Small (2 MB default)
    ✦ Bump pointer allocation
    ✦ Survivors copied to major heap
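
To make "bump pointer allocation" concrete, here is a minimal OCaml sketch of a toy minor heap. It is an illustration only, not the runtime's code: the record `minor_heap`, the functions `create`, `alloc` and `minor_collect`, and the use of plain word offsets are all assumptions made for this example, and the major heap is reduced to a word counter.

```ocaml
(* Toy model of a minor heap. Sizes are in words and addresses are plain
   offsets. All names are hypothetical; the real runtime differs. *)
type minor_heap = {
  size : int;           (* capacity in words *)
  mutable next : int;   (* next free offset; allocating just bumps this *)
}

let create size = { size; next = 0 }

(* Allocation is a bounds check plus an addition. [None] means the heap
   is full and a minor collection is needed before retrying. *)
let alloc heap words =
  if heap.next + words > heap.size then None
  else begin
    let addr = heap.next in
    heap.next <- heap.next + words;
    Some addr
  end

(* A minor collection copies the survivors to the major heap (modelled
   here as a running count of promoted words) and resets the pointer. *)
let minor_collect heap ~live_words ~major_words =
  heap.next <- 0;
  major_words + live_words
```

For example, with `create 8`, two 3-word allocations succeed, a third returns `None`, and after `minor_collect` the heap can be reused from offset 0.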

  11. Stock OCaml GC
    • A generational, non-moving, incremental, mark-and-sweep GC
    [Diagram: minor heap and major heap; the major heap is annotated "Incremental and non-moving"]
    • Minor heap
    ✦ Small (2 MB default)
    ✦ Bump pointer allocation
    ✦ Survivors copied to major heap
    [Timeline: mutator; start of major cycle; Idle]

  12. Stock OCaml GC
    • A generational, non-moving, incremental, mark-and-sweep GC
    [Diagram: minor heap and major heap; the major heap is annotated "Incremental and non-moving"]
    • Minor heap
    ✦ Small (2 MB default)
    ✦ Bump pointer allocation
    ✦ Survivors copied to major heap
    [Timeline: mutator; start of major cycle; Idle; Mark Roots (mark roots)]

  13. Stock OCaml GC
    • A generational, non-moving, incremental, mark-and-sweep GC
    [Diagram: minor heap and major heap; the major heap is annotated "Incremental and non-moving"]
    • Minor heap
    ✦ Small (2 MB default)
    ✦ Bump pointer allocation
    ✦ Survivors copied to major heap
    [Timeline: mutator; start of major cycle; Idle; Mark Roots (mark roots); Mark (mark main)]

  14. Stock OCaml GC
    • A generational, non-moving, incremental, mark-and-sweep GC
    [Diagram: minor heap and major heap; the major heap is annotated "Incremental and non-moving"]
    • Minor heap
    ✦ Small (2 MB default)
    ✦ Bump pointer allocation
    ✦ Survivors copied to major heap
    [Timeline: mutator; start of major cycle; Idle; Mark Roots (mark roots); Mark (mark main); Sweep (sweep)]

  15. Stock OCaml GC
    • A generational, non-moving, incremental, mark-and-sweep GC
    [Diagram: minor heap and major heap; the major heap is annotated "Incremental and non-moving"]
    • Minor heap
    ✦ Small (2 MB default)
    ✦ Bump pointer allocation
    ✦ Survivors copied to major heap
    [Timeline: mutator; start of major cycle; Idle; Mark Roots (mark roots); Mark (mark main); Sweep (sweep); end of major cycle]
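
The major cycle on this slide (mark roots, mark, sweep, back to idle) can be sketched as a small state machine that does a bounded amount of work per slice. This is a toy model under stated assumptions, not the stock runtime: the heap is an array of objects indexed by integers, the names `heap`, `start_cycle` and `slice` are hypothetical, and the mutator does not modify the graph during the cycle, so no write barrier is modelled.

```ocaml
type colour = White | Black
type phase = Idle | Mark | Sweep

(* Toy heap: object i points to the objects listed in fields.(i). *)
type heap = {
  fields : int list array;
  colour : colour array;
  live : bool array;             (* set to false once swept *)
  mutable phase : phase;
  mutable mark_stack : int list;
  mutable sweep_pos : int;
}

let make fields =
  let n = Array.length fields in
  { fields; colour = Array.make n White; live = Array.make n true;
    phase = Idle; mark_stack = []; sweep_pos = 0 }

(* Start of major cycle: mark the roots and queue them for scanning. *)
let start_cycle h roots =
  List.iter (fun r -> h.colour.(r) <- Black) roots;
  h.mark_stack <- roots;
  h.phase <- Mark

(* Do at most [budget] units of GC work, then return to the mutator. *)
let rec slice h budget =
  if budget > 0 then begin
    match h.phase, h.mark_stack with
    | Idle, _ -> ()
    | Mark, [] ->
      (* Nothing left to mark: move on to sweeping. *)
      h.phase <- Sweep;
      h.sweep_pos <- 0;
      slice h budget
    | Mark, o :: rest ->
      h.mark_stack <- rest;
      List.iter
        (fun child ->
          if h.colour.(child) = White then begin
            h.colour.(child) <- Black;
            h.mark_stack <- child :: h.mark_stack
          end)
        h.fields.(o);
      slice h (budget - 1)
    | Sweep, _ ->
      if h.sweep_pos = Array.length h.fields then h.phase <- Idle
      else begin
        let i = h.sweep_pos in
        if h.colour.(i) = White then h.live.(i) <- false
        else h.colour.(i) <- White;   (* reset for the next cycle *)
        h.sweep_pos <- i + 1;
        slice h (budget - 1)
      end
  end
```

Calling `slice h 2` repeatedly between bursts of mutator work finishes the cycle in small steps, which is the property that keeps individual pauses short. Objects are never moved: sweeping only flips `live`.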

  16. Stock OCaml GC
    • A generational, non-moving, incremental, mark-and-sweep GC
    [Diagram: minor heap and major heap; the major heap is annotated "Incremental and non-moving"]
    • Minor heap
    ✦ Small (2 MB default)
    ✦ Bump pointer allocation
    ✦ Survivors copied to major heap
    [Timeline: mutator; start of major cycle; Idle; Mark Roots (mark roots); Mark (mark main); Sweep (sweep); end of major cycle]
    • Fast allocations, no read barriers

  17. Stock OCaml GC
    • A generational, non-moving, incremental, mark-and-sweep GC
    [Diagram: minor heap and major heap; the major heap is annotated "Incremental and non-moving"]
    • Minor heap
    ✦ Small (2 MB default)
    ✦ Bump pointer allocation
    ✦ Survivors copied to major heap
    [Timeline: mutator; start of major cycle; Idle; Mark Roots (mark roots); Mark (mark main); Sweep (sweep); end of major cycle]
    • Fast allocations, no read barriers
    • Max GC latency < 10 ms, 99th percentile latency < 1 ms

  18. Requirements
    1. Feature backwards compatibility
    • Serial programs do not break on parallel runtime
    • No separate serial and parallel modes

  19. Requirements
    1. Feature backwards compatibility
    • Serial programs do not break on parallel runtime
    • No separate serial and parallel modes
    2. Performance backwards compatibility
    • Serial programs behave similarly on parallel runtime in terms of
    running time, GC pause time and memory usage.

  20. Requirements
    1. Feature backwards compatibility
    • Serial programs do not break on parallel runtime
    • No separate serial and parallel modes
    2. Performance backwards compatibility
    • Serial programs behave similarly on parallel runtime in terms of
    running time, GC pause time and memory usage.
    3. Parallel responsiveness and scalability
    • Parallel programs remain responsive
    • Parallel programs scale with additional cores

  21. Multicore OCaml: Major GC
    • Multicore-aware allocator
    ✦ Based on Streamflow [Schneider et al. 2006]
    ✦ Thread-local, size-segmented free lists for small objects + malloc for large
    allocations
    ✦ Sequential performance on par with OCaml’s allocators
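
The "thread-local, size-segmented free lists" idea can be sketched as follows. This is a simplified model, not Streamflow or the Multicore OCaml allocator: the size classes, the `max_small` threshold, and the names `domain_heap`, `alloc` and `free` are assumptions chosen for the example, and blocks are represented by slot identifiers rather than memory.

```ocaml
(* Hypothetical size classes (in words) and small-object threshold. *)
let size_classes = [| 1; 2; 4; 8; 16 |]
let max_small = 16

type block =
  | Small of int * int   (* size-class index, slot id *)
  | Large of int         (* requested words; malloc in the real design *)

(* One of these per domain, so the fast path needs no locking. *)
type domain_heap = {
  free : int list array;   (* free.(c): free slot ids of class c *)
  mutable fresh : int;     (* next never-used slot id *)
}

let create () =
  { free = Array.make (Array.length size_classes) []; fresh = 0 }

(* Smallest class that fits [words]; only called for words <= max_small. *)
let class_of words =
  let rec go c = if size_classes.(c) >= words then c else go (c + 1) in
  go 0

let alloc h words =
  if words > max_small then Large words
  else begin
    let c = class_of words in
    match h.free.(c) with
    | slot :: rest ->
      (* Reuse a freed slot of the same size class. *)
      h.free.(c) <- rest;
      Small (c, slot)
    | [] ->
      let slot = h.fresh in
      h.fresh <- slot + 1;
      Small (c, slot)
  end

let free h = function
  | Small (c, slot) -> h.free.(c) <- slot :: h.free.(c)
  | Large _ -> ()
```

Segregating by size class means a freed slot can be reused by any later request of the same class without searching or splitting, which suits a non-moving collector.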

  22. Multicore OCaml: Major GC
    • Multicore-aware allocator
    ✦ Based on Streamflow [Schneider et al. 2006]
    ✦ Thread-local, size-segmented free lists for small objects + malloc for large
    allocations
    ✦ Sequential performance on par with OCaml’s allocators
    • A mostly-concurrent, non-moving, mark-and-sweep collector
    ✦ Based on VCGC [Huelsbergen and Winterbottom 1998]

  23. Multicore OCaml: Major GC
    • Multicore-aware allocator
    ✦ Based on Streamflow [Schneider et al. 2006]
    ✦ Thread-local, size-segmented free lists for small objects + malloc for large
    allocations
    ✦ Sequential performance on par with OCaml’s allocators
    • A mostly-concurrent, non-moving, mark-and-sweep collector
    ✦ Based on VCGC [Huelsbergen and Winterbottom 1998]
    [Timeline from start to end of a major cycle: Domain 0 and Domain 1 each run Mark Roots, Mark and Sweep phases]

  24. Multicore OCaml: Major GC
    • Multicore-aware allocator
    ✦ Based on Streamflow [Schneider et al. 2006]
    ✦ Thread-local, size-segmented free lists for small objects + malloc for large
    allocations
    ✦ Sequential performance on par with OCaml’s allocators
    • A mostly-concurrent, non-moving, mark-and-sweep collector
    ✦ Based on VCGC [Huelsbergen and Winterbottom 1998]
    [Timeline from start to end of a major cycle: Domain 0 and Domain 1 each run Mark Roots, Mark and Sweep phases; mark and sweep phases may overlap]

  25. Multicore OCaml: Major GC

  26. Multicore OCaml: Major GC
    • Extend support to weak references, ephemerons, (2 different kinds
    of) finalisers, fibers, lazy values
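
For readers less familiar with these features, the sketch below exercises some of them through the standard library: the two kinds of finalisers (`Gc.finalise` and `Gc.finalise_last`), a weak array (`Weak`) and a lazy value. The helper names `events`, `note` and `demo` are made up for the example. When, or whether, the finalisers run depends on the collector, so the code makes no claim about that.

```ocaml
let events = ref []
let note s = events := s :: !events

let demo () =
  let v = ref 42 in   (* a heap-allocated value *)
  (* First kind: the function receives the value and may resurrect it. *)
  Gc.finalise (fun r -> note (Printf.sprintf "finalise %d" !r)) v;
  (* Second kind: runs once the value is unreachable for good, so it
     is not given the value. *)
  Gc.finalise_last (fun () -> note "finalise_last") v;
  (* A weak pointer does not keep [v] alive by itself. *)
  let w = Weak.create 1 in
  Weak.set w 0 (Some v);
  (* A lazy value runs its body at most once; the result is memoised. *)
  let l = lazy (note "forced"; !v + 1) in
  let first = Lazy.force l in
  let second = Lazy.force l in
  let still_there = Weak.check w 0 in
  ignore (Sys.opaque_identity v);   (* keep [v] reachable up to here *)
  (first, second, still_there)
```

Each of these features hooks into the collector (clearing weak pointers, scheduling finalisers, updating forced lazy values), which is why a new GC has to support all of them to stay backwards compatible.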

  27. Multicore OCaml: Major GC
    • Extend support to weak references, ephemerons, (2 different kinds
    of) finalisers, fibers, lazy values
    • Ephemerons are tricky in a concurrent multicore GC
    ✦ A generalisation of weak references
    ✦ Introduce conjunction in the reachability property
    ✦ Requires multiple rounds of ephemeron marking
    ✦ Cycle-delimited handshaking without global barrier

  28. Multicore OCaml: Major GC
    • Extend support to weak references, ephemerons, (2 different kinds
    of) finalisers, fibers, lazy values
    • Ephemerons are tricky in a concurrent multicore GC
    ✦ A generalisation of weak references
    ✦ Introduce conjunction in the reachability property
    ✦ Requires multiple rounds of ephemeron marking
    ✦ Cycle-delimited handshaking without global barrier
    • A barrier each for the two kinds of finalisers
    ✦ 3 barriers / cycle worst case

  29. Multicore OCaml: Major GC
    • Extend support to weak references, ephemerons, (2 different kinds
    of) finalisers, fibers, lazy values
    • Ephemerons are tricky in a concurrent multicore GC
    ✦ A generalisation of weak references
    ✦ Introduce conjunction in the reachability property
    ✦ Requires multiple rounds of ephemeron marking
    ✦ Cycle-delimited handshaking without global barrier
    • A barrier each for the two kinds of finalisers
    ✦ 3 barriers / cycle worst case
    • Verified in the SPIN model checker
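
The points above about conjunction and multiple marking rounds can be illustrated with a sequential toy model. An ephemeron's data is live only if the ephemeron itself and its key are both live, and marking one ephemeron's data can make another ephemeron's key live, so marking repeats until nothing changes. The graph encoding and the names `ephemeron`, `mark` and `mark_with_ephemerons` are assumptions for this sketch; it does not model the concurrent handshaking used by the actual collector.

```ocaml
(* Objects are integers. [self] is the ephemeron object itself. *)
type ephemeron = { self : int; key : int; data : int }

(* Ordinary marking along strong edges; edges.(o) lists o's children. *)
let rec mark edges marked o =
  if not marked.(o) then begin
    marked.(o) <- true;
    List.iter (mark edges marked) edges.(o)
  end

(* Returns the mark bits and the number of ephemeron passes made,
   counting the final pass that finds nothing new. *)
let mark_with_ephemerons edges ephemerons roots =
  let marked = Array.make (Array.length edges) false in
  List.iter (mark edges marked) roots;
  let rec rounds n =
    let progress = ref false in
    List.iter
      (fun e ->
        (* Conjunction: both the ephemeron and its key must be live. *)
        if marked.(e.self) && marked.(e.key) && not marked.(e.data) then begin
          mark edges marked e.data;
          progress := true
        end)
      ephemerons;
    if !progress then rounds (n + 1) else n
  in
  let n = rounds 1 in
  (marked, n)
```

In a chain where ephemeron B's key is ephemeron A's data, B's data can only be marked in the round after A's, which is why a single pass is not enough.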

  30. Concurrent Minor GC
    • Based on [Doligez and Leroy 1993], but lazier, as in the [Marlow and
    Peyton Jones 2011] collector for GHC
    [Diagram: Domains 0 to 3 each have their own minor heap and share a single major heap]

  31. Concurrent Minor GC
    • Based on [Doligez and Leroy 1993], but lazier, as in the [Marlow and
    Peyton Jones 2011] collector for GHC
    [Diagram: Domains 0 to 3 each have their own minor heap and share a single major heap]
    • Each domain can independently collect its minor heap

  32. Concurrent Minor GC
    • Based on [Doligez and Leroy 1993], but lazier, as in the [Marlow and
    Peyton Jones 2011] collector for GHC
    [Diagram: Domains 0 to 3 each have their own minor heap and share a single major heap]
    • Each domain can independently collect its minor heap
    • Major to minor pointers allowed
    ✦ Prevents early promotion & mirrors sequential behaviour
    ✦ Read barrier required for mutable field + promotion

  33. Read Barriers
    • Stock OCaml does not have read barriers
    ✦ Read barriers need to be efficient for performance backwards
    compatibility

  34. Read Barriers
    • Stock OCaml does not have read barriers
    ✦ Read barriers need to be efficient for performance backwards
    compatibility
    • 3 instructions on x86: VMM + bit-twiddling tricks
    ✦ Proof of correctness available in the paper
    ✦ Minimal performance impact on sequential code

  35. Read Barriers
    • Stock OCaml does not have read barriers
    ✦ Read barriers need to be efficient for performance backwards
    compatibility
    • 3 instructions on x86: VMM + bit-twiddling tricks
    ✦ Proof of correctness available in the paper
    ✦ Minimal performance impact on sequential code
    • Unfortunately, read barriers break the C API (feature backwards
    compatibility)

  36. Read Barriers
    [Diagram: two minor heaps (Domain 0 and Domain 1) and the major heap, holding objects x, y, a and b; Domain 0 reads !y while Domain 1 reads !x]

  37. Read Barriers
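    [Diagram: two minor heaps (Domain 0 and Domain 1) and the major heap, holding objects x, y, a and b; Domain 0 reads !y while Domain 1 reads !x, giving the requests promote (!y) and promote (!x)]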

  38. Read Barriers
    • Service promotion requests on read faults to avoid deadlock
    ✦ Mutable reads are GC safe points!
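    [Diagram: two minor heaps (Domain 0 and Domain 1) and the major heap, holding objects x, y, a and b; Domain 0 reads !y while Domain 1 reads !x, giving the requests promote (!y) and promote (!x)]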

  39. Read Barriers
    • Service promotion requests on read faults to avoid deadlock
    ✦ Mutable reads are GC safe points!
    • C API written with explicit knowledge of when GC may happen
    ✦ Need to manually refactor tricky code
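    [Diagram: two minor heaps (Domain 0 and Domain 1) and the major heap, holding objects x, y, a and b; Domain 0 reads !y while Domain 1 reads !x, giving the requests promote (!y) and promote (!x)]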
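
The deadlock on this slide, and why servicing requests during a read fault avoids it, can be shown with a small sequential simulation. It is a model of the scenario only, with invented names (`domain`, `read_fault`, `round`, `run`): each domain has an inbox of promotion requests, and a flag says whether a domain that is blocked on its own read fault still looks at that inbox.

```ocaml
type domain = {
  inbox : int Queue.t;      (* ids of domains waiting for us to promote *)
  mutable blocked : bool;   (* currently waiting on a read fault *)
}

let make_domains n =
  Array.init n (fun _ -> { inbox = Queue.create (); blocked = false })

(* Domain [src] reads a field that points into [dst]'s minor heap:
   it asks [dst] to promote the object and waits for the answer. *)
let read_fault ds ~src ~dst =
  Queue.add src ds.(dst).inbox;
  ds.(src).blocked <- true

(* One scheduling round. A blocked domain only services its inbox
   when [service_while_blocked] is set. *)
let round ds ~service_while_blocked =
  Array.iter
    (fun d ->
      if (not d.blocked) || service_while_blocked then begin
        Queue.iter (fun requester -> ds.(requester).blocked <- false) d.inbox;
        Queue.clear d.inbox
      end)
    ds

let deadlocked ds = Array.for_all (fun d -> d.blocked) ds

(* Domain 0 reads !y while Domain 1 reads !x, as in the diagram. *)
let run ~service_while_blocked =
  let ds = make_domains 2 in
  read_fault ds ~src:0 ~dst:1;
  read_fault ds ~src:1 ~dst:0;
  for _i = 1 to 10 do round ds ~service_while_blocked done;
  deadlocked ds
```

`run ~service_while_blocked:false` leaves both domains waiting on each other, while `run ~service_while_blocked:true` lets both proceed. Servicing a request means running GC work in the middle of a read, which is the reason mutable reads become GC safe points.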

  40. Parallel Minor GC
    • Stop-the-world parallel minor collection
    ✦ Similar to GHC's minor collection

  41. Parallel Minor GC
    • Stop-the-world parallel minor collection
    ✦ Similar to GHC's minor collection
    [Timeline, ConcMinor: on Dom 0 and Dom 1, mutator, minor GC and major slice segments between "start major" and "end major"]

  42. Parallel Minor GC
    • Stop-the-world parallel minor collection
    ✦ Similar to GHC's minor collection
    [Timeline, ConcMinor: on Dom 0 and Dom 1, mutator, minor GC and major slice segments between "start major" and "end major"]
    [Timeline, ParMinor: mutator and major slice segments, with "start minor" and "end minor" markers, between "start major" and "end major"]

  43. Parallel Minor GC
    • Stop-the-world parallel minor collection
    ✦ Similar to GHC's minor collection
    [Timeline, ConcMinor: on Dom 0 and Dom 1, mutator, minor GC and major slice segments between "start major" and "end major"]
    [Timeline, ParMinor: mutator and major slice segments, with "start minor" and "end minor" markers, between "start major" and "end major"; slop space filled with major slices]

  44. Parallel Minor GC
    • Stop-the-world parallel minor collection
    ✦ Similar to GHC's minor collection
    • No need for read barriers!
    [Timeline, ConcMinor: on Dom 0 and Dom 1, mutator, minor GC and major slice segments between "start major" and "end major"]
    [Timeline, ParMinor: mutator and major slice segments, with "start minor" and "end minor" markers, between "start major" and "end major"; slop space filled with major slices]

  45. Parallel Minor GC
    • Stop-the-world parallel minor collection
    ✦ Similar to GHC's minor collection
    • No need for read barriers!
    • Quickly bring all the domains to a barrier
    ✦ Insert poll points in code for timely inter-domain interrupt handling
    [Feeley 1993]
    [Timeline, ConcMinor: on Dom 0 and Dom 1, mutator, minor GC and major slice segments between "start major" and "end major"]
    [Timeline, ParMinor: mutator and major slice segments, with "start minor" and "end minor" markers, between "start major" and "end major"; slop space filled with major slices]
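
The role of poll points in bringing all domains to a barrier can be sketched at user level with the `Domain` and `Atomic` modules of OCaml 5. This is an analogy, not the runtime mechanism: in the real system the compiler inserts the polls and the runtime handles the interrupt, whereas here `poll` is called by hand, and `stop_requested`, `parked`, `finished`, `worker` and `stop_the_world` are names invented for the example.

```ocaml
(* Requires OCaml 5 (Domain and Atomic). All names are hypothetical. *)
let stop_requested = Atomic.make false
let parked = Atomic.make 0
let finished = Atomic.make false

(* A poll point: a cheap check. If a stop was requested, report arrival
   at the barrier and wait until released. *)
let poll () =
  if Atomic.get stop_requested then begin
    Atomic.incr parked;
    while Atomic.get stop_requested do Domain.cpu_relax () done
  end

(* A mutator loop with a poll point on every iteration. *)
let worker () =
  let iterations = ref 0 in
  while not (Atomic.get finished) do
    incr iterations;
    poll ()
  done;
  !iterations

(* Stop all workers, run [work] while they are parked, release them. *)
let stop_the_world n_workers work =
  Atomic.set stop_requested true;
  while Atomic.get parked < n_workers do Domain.cpu_relax () done;
  let result = work () in
  Atomic.set finished true;          (* workers exit once released *)
  Atomic.set stop_requested false;
  result

let main () =
  let n = 2 in
  let workers = List.init n (fun _ -> Domain.spawn worker) in
  let parked_at_barrier = stop_the_world n (fun () -> Atomic.get parked) in
  let counts = List.map Domain.join workers in
  (parked_at_barrier, counts)
```

The time a stop takes is bounded by the longest stretch of code between two polls, which is why a loop without a poll point would delay every other domain.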

  46. Evaluation
    • 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
    ✦ 24 cores isolated for performance evaluation

  47. Evaluation
    • 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
    ✦ 24 cores isolated for performance evaluation
    • Sequential Throughput — compared to stock OCaml
    ✦ ConcMinor 4.9% slower and ParMinor 3.5% slower
    ✦ ConcMinor 54% lower peak memory and ParMinor 61% lower peak
    memory

  48. Evaluation
    • 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
    ✦ 24 cores isolated for performance evaluation
    • Sequential Throughput — compared to stock OCaml
    ✦ ConcMinor 4.9% slower and ParMinor 3.5% slower
    ✦ ConcMinor 54% lower peak memory and ParMinor 61% lower peak
    memory
    • Sequential GC pause times on par with stock OCaml

  49. Parallel Scalability

  50. Parallel Scalability
    ConcMinor suffers due
    to read faults

  51. Parallel Scalability
    ConcMinor suffers due
    to read faults
    Unbalanced allocation leads to
    inopportune minor GCs in ParMinor

  52. ParMinor vs ConcMinor
    • Parallel GC latency roughly similar between ParMinor and
    ConcMinor

  53. ParMinor vs ConcMinor
    • Parallel GC latency roughly similar between ParMinor and
    ConcMinor
    • ParMinor wins over ConcMinor
    ✦ Does not break the C API
    ✦ Performs similarly to ConcMinor on 24 cores

  54. ParMinor vs ConcMinor
    • Parallel GC latency roughly similar between ParMinor and
    ConcMinor
    • ParMinor wins over ConcMinor
    ✦ Does not break the C API
    ✦ Performs similarly to ConcMinor on 24 cores
    • OCaml 5.00 will have multicore support and use ParMinor
    ✦ May revisit ConcMinor later for manycore future

  55. Thanks!
    • Multicore OCaml
    ✦ https://github.com/ocaml-multicore/ocaml-multicore
    • Sandmark — benchmark suite for (Multicore) OCaml
    ✦ https://github.com/ocaml-bench/sandmark/
    • SPIN models
    ✦ https://github.com/ocaml-multicore/multicore-ocaml-verify
    • Parallel Programming with Multicore OCaml
    ✦ https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml
