Retrofitting Parallelism onto OCaml

Slides from the ICFP 2020 Talk

KC Sivaramakrishnan

August 12, 2020

Transcript

  1. Retrofitting Parallelism onto OCaml KC Sivaramakrishnan, Stephen Dolan, Leo White,

    Sadiq Jaffer, Tom Kelly, Anmol Sahoo, Sudha Parimala, Atul Dhiman, Anil Madhavapeddy OCaml Labs
  2. The Astrée Static Analyzer Industry Projects

  3. The Astrée Static Analyzer Industry Projects No multicore support!

  4. Multicore OCaml • Adds native support for concurrency and shared-memory

    parallelism to OCaml
  5. Multicore OCaml • Adds native support for concurrency and shared-memory

    parallelism to OCaml • Focus of this work is parallelism ✦ Building a multicore GC for OCaml
  6. Multicore OCaml • Adds native support for concurrency and shared-memory

    parallelism to OCaml • Focus of this work is parallelism ✦ Building a multicore GC for OCaml • Key parallel GC design principle ✦ Backwards compatibility before parallel scalability
  7. Challenges • Millions of lines of legacy code ✦ Weak

    references, ephemerons, lazy values, finalisers ✦ Low-level C API that bakes in GC invariants ✦ Cost of refactoring sequential code itself is prohibitive
  8. Challenges • Millions of lines of legacy code ✦ Weak

    references, ephemerons, lazy values, finalisers ✦ Low-level C API that bakes in GC invariants ✦ Cost of refactoring sequential code itself is prohibitive • Type safety ✦ Dolan et al, “Bounding Data Races in Space and Time”, PLDI’18 ✦ Strong guarantees (including type safety) under data races
  9. Challenges • Millions of lines of legacy code ✦ Weak

    references, ephemerons, lazy values, finalisers ✦ Low-level C API that bakes in GC invariants ✦ Cost of refactoring sequential code itself is prohibitive • Type safety ✦ Dolan et al, “Bounding Data Races in Space and Time”, PLDI’18 ✦ Strong guarantees (including type safety) under data races • Low-latency and predictable performance ✦ Thanks to the GC design
  10. Incremental and non-moving Stock OCaml GC • A generational, non-moving,

    incremental, mark-and-sweep GC Minor Heap Major Heap • Small (2 MB default) • Bump pointer allocation • Survivors copied to major heap
  11. Incremental and non-moving Stock OCaml GC • A generational, non-moving,

    incremental, mark-and-sweep GC Minor Heap Major Heap • Small (2 MB default) • Bump pointer allocation • Survivors copied to major heap Mutator Start of major cycle Idle
  12. Incremental and non-moving Stock OCaml GC • A generational, non-moving,

    incremental, mark-and-sweep GC Minor Heap Major Heap • Small (2 MB default) • Bump pointer allocation • Survivors copied to major heap Mutator Start of major cycle Idle Mark Roots mark roots
  13. Incremental and non-moving Stock OCaml GC • A generational, non-moving,

    incremental, mark-and-sweep GC Minor Heap Major Heap • Small (2 MB default) • Bump pointer allocation • Survivors copied to major heap Mutator Start of major cycle Idle Mark Roots mark roots Mark mark main
  14. Incremental and non-moving Stock OCaml GC • A generational, non-moving,

    incremental, mark-and-sweep GC Minor Heap Major Heap • Small (2 MB default) • Bump pointer allocation • Survivors copied to major heap Mutator Start of major cycle Idle Mark Roots mark roots Mark mark main Sweep sweep
  15. Incremental and non-moving Stock OCaml GC • A generational, non-moving,

    incremental, mark-and-sweep GC Minor Heap Major Heap • Small (2 MB default) • Bump pointer allocation • Survivors copied to major heap Mutator Start of major cycle Idle Mark Roots mark roots Mark mark main Sweep sweep End of major cycle
  16. Incremental and non-moving Stock OCaml GC • A generational, non-moving,

    incremental, mark-and-sweep GC Minor Heap Major Heap • Small (2 MB default) • Bump pointer allocation • Survivors copied to major heap Mutator Start of major cycle Idle Mark Roots mark roots Mark mark main Sweep sweep End of major cycle • Fast allocations, no read barriers
  17. Incremental and non-moving Stock OCaml GC • A generational, non-moving,

    incremental, mark-and-sweep GC Minor Heap Major Heap • Small (2 MB default) • Bump pointer allocation • Survivors copied to major heap Mutator Start of major cycle Idle Mark Roots mark roots Mark mark main Sweep sweep End of major cycle • Fast allocations, no read barriers • Max GC latency < 10 ms, 99th percentile latency < 1 ms
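The minor-heap parameters on these slides can be inspected from a running program through the standard `Gc` module. The following is a minimal sketch, assuming a 64-bit system where the documented default of 256k words corresponds to the 2 MB shown on the slide. The helper names `minor_heap_bytes` and `minor_collections_during` are illustrative and do not come from the talk.

```ocaml
(* Size of the minor heap in bytes; Gc reports it in words. *)
let minor_heap_bytes () =
  let control = Gc.get () in
  control.Gc.minor_heap_size * (Sys.word_size / 8)

(* Number of minor collections that happen while [f] runs. *)
let minor_collections_during f =
  let before = (Gc.quick_stat ()).Gc.minor_collections in
  f ();
  (Gc.quick_stat ()).Gc.minor_collections - before

let () =
  Printf.printf "minor heap: %d bytes\n" (minor_heap_bytes ());
  (* Allocate many short-lived pairs. Each one takes a little space in the
     minor heap; once it fills up, a minor collection runs. *)
  let n =
    minor_collections_during (fun () ->
        for i = 1 to 1_000_000 do
          ignore (Sys.opaque_identity (i, i))
        done)
  in
  Printf.printf "minor collections while allocating: %d\n" n
```

Because the pairs die immediately, almost nothing should be copied to the major heap in this loop, which is the case the generational design is meant to make cheap.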
  18. Requirements 1. Feature backwards compatibility • Serial programs do not

    break on parallel runtime • No separate serial and parallel modes
  19. Requirements 1. Feature backwards compatibility • Serial programs do not

    break on parallel runtime • No separate serial and parallel modes 2. Performance backwards compatibility • Serial programs behave similarly on parallel runtime in terms of running time, GC pause time and memory usage.
  20. Requirements 1. Feature backwards compatibility • Serial programs do not

    break on parallel runtime • No separate serial and parallel modes 2. Performance backwards compatibility • Serial programs behave similarly on parallel runtime in terms of running time, GC pause time and memory usage. 3. Parallel responsiveness and scalability • Parallel programs remain responsive • Parallel programs scale with additional cores
  21. Multicore OCaml: Major GC • Multicore-aware allocator ✦ Based on

    Streamflow [Schneider et al. 2006] ✦ Thread-local, size-segmented free lists for small objects + malloc for large allocations ✦ Sequential performance on par with OCaml’s allocators
  22. Multicore OCaml: Major GC • Multicore-aware allocator ✦ Based on

    Streamflow [Schneider et al. 2006] ✦ Thread-local, size-segmented free lists for small objects + malloc for large allocations ✦ Sequential performance on par with OCaml’s allocators • A mostly-concurrent, non-moving, mark-and-sweep collector ✦ Based on VCGC [Huelsbergen and Winterbottom 1998]
  23. Multicore OCaml: Major GC • Multicore-aware allocator ✦ Based on

    Streamflow [Schneider et al. 2006] ✦ Thread-local, size-segmented free lists for small objects + malloc for large allocations ✦ Sequential performance on par with OCaml’s allocators • A mostly-concurrent, non-moving, mark-and-sweep collector ✦ Based on VCGC [Huelsbergen and Winterbottom 1998] Sweep Mark Mark Roots Sweep Mark Mark Roots Start of major cycle End of major cycle Domain 0 Domain 1
  24. Multicore OCaml: Major GC • Multicore-aware allocator ✦ Based on

    Streamflow [Schneider et al. 2006] ✦ Thread-local, size-segmented free lists for small objects + malloc for large allocations ✦ Sequential performance on par with OCaml’s allocators • A mostly-concurrent, non-moving, mark-and-sweep collector ✦ Based on VCGC [Huelsbergen and Winterbottom 1998] Sweep Mark Mark Roots Sweep Mark Mark Roots Start of major cycle End of major cycle mark and sweep phases may overlap Domain 0 Domain 1
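The allocator bullet (size-segmented free lists for small objects, a general-purpose fallback for large ones) can be pictured with a toy model. This is a sketch only, not Streamflow and not the Multicore OCaml allocator: the size classes, the integer "addresses", and the names `pool`, `alloc`, and `free` are all assumptions made for illustration. In the design on the slide, each domain would own one such pool so that small allocations need no synchronisation.

```ocaml
(* Hypothetical size classes, in words. Requests above the largest class
   are treated as "large" and bypass the free lists. *)
let classes = [| 1; 2; 4; 8; 16; 32 |]
let max_small = 32

type block = { addr : int; words : int }

type pool = {
  free : block list array;  (* one free list per size class *)
  mutable next_addr : int;  (* stand-in for obtaining fresh memory *)
}

let make_pool () =
  { free = Array.make (Array.length classes) []; next_addr = 0 }

(* Index of the smallest class that fits [words]; assumes words <= max_small. *)
let class_index words =
  let rec go i = if classes.(i) >= words then i else go (i + 1) in
  go 0

let fresh pool words =
  let b = { addr = pool.next_addr; words } in
  pool.next_addr <- pool.next_addr + words;
  b

let alloc pool words =
  if words > max_small then fresh pool words (* stands in for malloc *)
  else
    let i = class_index words in
    match pool.free.(i) with
    | b :: rest ->
      pool.free.(i) <- rest;
      b
    | [] -> fresh pool classes.(i)

(* Small blocks go back on their class's free list; large ones are dropped. *)
let free pool b =
  if b.words <= max_small then begin
    let i = class_index b.words in
    pool.free.(i) <- b :: pool.free.(i)
  end
```

A request for 3 words is rounded up to the 4-word class, and a freed 4-word block is handed back to the next request in that class without touching any shared state.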
  25. Multicore OCaml: Major GC

  26. Multicore OCaml: Major GC • Extend support to weak references, ephemerons,

    (2 different kinds of) finalizers, fibers, lazy values
  27. Multicore OCaml: Major GC • Extend support to weak references, ephemerons,

    (2 different kinds of) finalizers, fibers, lazy values • Ephemerons are tricky in a concurrent multicore GC ✦ A generalisation of weak references ✦ Introduce conjunction in the reachability property ✦ Requires multiple rounds of ephemeron marking ✦ Cycle-delimited handshaking without global barrier
  28. Multicore OCaml: Major GC • Extend support to weak references, ephemerons,

    (2 different kinds of) finalizers, fibers, lazy values • Ephemerons are tricky in a concurrent multicore GC ✦ A generalisation of weak references ✦ Introduce conjunction in the reachability property ✦ Requires multiple rounds of ephemeron marking ✦ Cycle-delimited handshaking without global barrier • A barrier each for the two kinds of finalisers ✦ 3 barriers / cycle worst case
  29. Multicore OCaml: Major GC • Extend support to weak references, ephemerons,

    (2 different kinds of) finalizers, fibers, lazy values • Ephemerons are tricky in a concurrent multicore GC ✦ A generalisation of weak references ✦ Introduce conjunction in the reachability property ✦ Requires multiple rounds of ephemeron marking ✦ Cycle-delimited handshaking without global barrier • A barrier each for the two kinds of finalisers ✦ 3 barriers / cycle worst case • Verified in the SPIN model checker
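The "conjunction in the reachability property" and the need for "multiple rounds of ephemeron marking" can be seen in a small sequential model. This sketch is not the concurrent algorithm from the talk (there is no handshaking between domains here). It only illustrates the rule that an ephemeron's data is kept alive when both the ephemeron and its key are reachable. The heap encoding with integer node ids and the names `mark_strong` and `mark` are assumptions for the example.

```ocaml
module IntSet = Set.Make (Int)

type heap = {
  edges : (int * int) list;             (* strong pointers: src -> dst *)
  ephemerons : (int * int * int) list;  (* (ephemeron, key, data) *)
}

(* Ordinary marking: follow strong pointers from a worklist. *)
let rec mark_strong heap marked worklist =
  match worklist with
  | [] -> marked
  | n :: rest when IntSet.mem n marked -> mark_strong heap marked rest
  | n :: rest ->
    let succs =
      List.filter_map
        (fun (src, dst) -> if src = n then Some dst else None)
        heap.edges
    in
    mark_strong heap (IntSet.add n marked) (succs @ rest)

(* Ephemeron rounds: data becomes reachable only if the ephemeron AND its
   key are already marked. Marking data may make further keys reachable,
   so repeat until a round marks nothing new. Returns the marked set and
   the number of rounds that marked something. *)
let mark heap roots =
  let rec rounds marked count =
    let newly =
      List.filter_map
        (fun (eph, key, data) ->
          if IntSet.mem eph marked && IntSet.mem key marked
             && not (IntSet.mem data marked)
          then Some data
          else None)
        heap.ephemerons
    in
    match newly with
    | [] -> (marked, count)
    | _ -> rounds (mark_strong heap marked newly) (count + 1)
  in
  rounds (mark_strong heap IntSet.empty roots) 0

(* Root 0 points to ephemerons 1, 2, 8 and to key 3.
   Ephemeron 1 maps key 3 to data 4; ephemeron 2 maps key 4 to data 5;
   ephemeron 8 has an unreachable key 9, so its data 10 stays unmarked. *)
let example = {
  edges = [ (0, 1); (0, 2); (0, 3); (0, 8) ];
  ephemerons = [ (2, 4, 5); (1, 3, 4); (8, 9, 10) ];
}
```

In `example`, node 5 can only be marked after node 4, which itself is only marked through an ephemeron, so a single pass over the ephemerons is not enough.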
  30. Concurrent Minor GC • Based on [Doligez and Leroy 1993]

    but lazier as in [Marlow and Peyton Jones 2011] collector for GHC Minor Heap Minor Heap Minor Heap Minor Heap Major Heap Domain 0 Domain 1 Domain 2 Domain 3
  31. Concurrent Minor GC • Based on [Doligez and Leroy 1993]

    but lazier as in [Marlow and Peyton Jones 2011] collector for GHC Minor Heap Minor Heap Minor Heap Minor Heap Major Heap Domain 0 Domain 1 Domain 2 Domain 3 • Each domain can independently collect its minor heap
  32. Concurrent Minor GC • Based on [Doligez and Leroy 1993]

    but lazier as in [Marlow and Peyton Jones 2011] collector for GHC Minor Heap Minor Heap Minor Heap Minor Heap Major Heap Domain 0 Domain 1 Domain 2 Domain 3 • Each domain can independently collect its minor heap • Major to minor pointers allowed ✦ Prevents early promotion & mirrors sequential behaviour ✦ Read barrier required for mutable field + promotion
  33. Read Barriers • Stock OCaml does not have read barriers

    ✦ Read barriers need to be efficient for performance backwards compatibility
  34. Read Barriers • Stock OCaml does not have read barriers

    ✦ Read barriers need to be efficient for performance backwards compatibility • 3 instructions in x86 - VMM + bit-twiddling tricks ✦ Proof of correctness available in the paper ✦ Minimal performance impact on sequential code
  35. Read Barriers • Stock OCaml does not have read barriers

    ✦ Read barriers need to be efficient for performance backwards compatibility • 3 instructions in x86 - VMM + bit-twiddling tricks ✦ Proof of correctness available in the paper ✦ Minimal performance impact on sequential code • Unfortunately, read barriers break the C API (feature backwards compatibility)
  36. Read Barriers minor major heap x y a minor b

    Domain 0 Domain 1 !y !x
  37. Read Barriers minor major heap x y a minor b

    Domain 0 Domain 1 !y !x promote (!y) promote (!x)
  38. Read Barriers • Service promotion requests on read faults to

    avoid deadlock ✦ Mutable reads are GC safe points! minor major heap x y a minor b Domain 0 Domain 1 !y !x promote (!y) promote (!x)
  39. Read Barriers • Service promotion requests on read faults to

    avoid deadlock ✦ Mutable reads are GC safe points! • C API written with explicit knowledge of when GC may happen ✦ Need to manually refactor tricky code minor major heap x y a minor b Domain 0 Domain 1 !y !x promote (!y) promote (!x)
  40. Parallel Minor GC • Stop-the-world parallel minor collection ✦ Similar

    to GHC's minor collection
  41. Parallel Minor GC • Stop-the-world parallel minor collection ✦ Similar

    to GHC's minor collection Dom 0 Dom 1 Mutator Minor GC Major slice Mutator Minor GC Start major End major ConcMinor
  42. Parallel Minor GC • Stop-the-world parallel minor collection ✦ Similar

    to GHC's minor collection Dom 0 Dom 1 Mutator Minor GC Major slice Mutator Minor GC Start major End major ConcMinor Mutator Major slice Mutator Start major End major Start minor End minor ParMinor
  43. Parallel Minor GC • Stop-the-world parallel minor collection ✦ Similar

    to GHC's minor collection Dom 0 Dom 1 Mutator Minor GC Major slice Mutator Minor GC Start major End major ConcMinor Mutator Major slice Mutator Start major End major Start minor End minor ParMinor Slop space filled with major slices
  44. Parallel Minor GC • Stop-the-world parallel minor collection ✦ Similar

    to GHC's minor collection • No need for read barriers! Dom 0 Dom 1 Mutator Minor GC Major slice Mutator Minor GC Start major End major ConcMinor Mutator Major slice Mutator Start major End major Start minor End minor ParMinor Slop space filled with major slices
  45. Parallel Minor GC • Stop-the-world parallel minor collection ✦ Similar

    to GHC's minor collection • No need for read barriers! • Quickly bring all the domains to a barrier ✦ Insert poll points in code for timely inter-domain interrupt handling [Feeley 1993] Dom 0 Dom 1 Mutator Minor GC Major slice Mutator Minor GC Start major End major ConcMinor Mutator Major slice Mutator Start major End major Start minor End minor ParMinor Slop space filled with major slices
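The idea of bringing every domain to a barrier through poll points can be imitated at user level. The sketch below is an analogy, not the runtime's mechanism: in the talk the polls are inserted by the compiler, whereas here each worker calls a hand-written `poll` in its loop. It assumes OCaml 5 or later for the `Domain` module, and the names `run_demo`, `poll`, and `worker` are made up for the example. The barrier is single-use to keep the code short.

```ocaml
(* Requires OCaml 5 or later (Domain). *)
let run_demo n =
  let stw_requested = Atomic.make false in
  let arrived = Atomic.make 0 in
  let quit = Atomic.make false in
  (* Poll point: if a stop-the-world request is pending, report arrival
     and park until the requester releases the barrier. *)
  let poll () =
    if Atomic.get stw_requested then begin
      Atomic.incr arrived;
      while Atomic.get stw_requested do Domain.cpu_relax () done
    end
  in
  (* Mutator stand-in: does trivial work and polls on every iteration. *)
  let worker () =
    let steps = ref 0 in
    while not (Atomic.get quit) do
      incr steps;
      poll ()
    done;
    !steps
  in
  let domains = List.init n (fun _ -> Domain.spawn worker) in
  (* Request the stop and wait until every worker has parked. *)
  Atomic.set stw_requested true;
  while Atomic.get arrived < n do Domain.cpu_relax () done;
  (* All workers are parked here; a stop-the-world phase would run now. *)
  let parked = Atomic.get arrived in
  Atomic.set stw_requested false;
  Atomic.set quit true;
  List.iter (fun d -> ignore (Domain.join d)) domains;
  parked

let () = Printf.printf "domains parked at barrier: %d\n" (run_demo 2)
```

The time to reach the barrier is bounded by the longest stretch of code between two polls, which is why the slide stresses inserting poll points so that interrupts are handled promptly.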
  46. Evaluation • 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU

    @ 2.20GHz ✦ 24 cores isolated for performance evaluation
  47. Evaluation • 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU

    @ 2.20GHz ✦ 24 cores isolated for performance evaluation • Sequential Throughput — compared to stock OCaml ✦ ConcMinor 4.9% slower and ParMinor 3.5% slower ✦ ConcMinor 54% lower peak memory and ParMinor 61% lower peak memory
  48. Evaluation • 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU

    @ 2.20GHz ✦ 24 cores isolated for performance evaluation • Sequential Throughput — compared to stock OCaml ✦ ConcMinor 4.9% slower and ParMinor 3.5% slower ✦ ConcMinor 54% lower peak memory and ParMinor 61% lower peak memory • Sequential GC pause times on par with stock OCaml
  49. Parallel Scalability

  50. Parallel Scalability ConcMinor suffers due to read faults

  51. Parallel Scalability ConcMinor suffers due to read faults Unbalanced allocation

    leads to inopportune minor GCs in ParMinor
  52. ParMinor vs ConcMinor • Parallel GC latency roughly similar between

    ParMinor and ConcMinor
  53. ParMinor vs ConcMinor • Parallel GC latency roughly similar between

    ParMinor and ConcMinor • ParMinor wins over ConcMinor ✦ Does not break the C API ✦ Performs similarly to the ConcMinor on 24 cores
  54. ParMinor vs ConcMinor • Parallel GC latency roughly similar between

    ParMinor and ConcMinor • ParMinor wins over ConcMinor ✦ Does not break the C API ✦ Performs similarly to the ConcMinor on 24 cores • OCaml 5.00 will have multicore support and use ParMinor ✦ May revisit ConcMinor later for manycore future
  55. Thanks! • Multicore OCaml ✦ https://github.com/ocaml-multicore/ocaml-multicore • Sandmark — benchmark

    suite for (Multicore) OCaml ✦ https://github.com/ocaml-bench/sandmark/ • SPIN models ✦ https://github.com/ocaml-multicore/multicore-ocaml-verify • Parallel Programming with Multicore OCaml ✦ https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml