Retrofitting Parallelism onto OCaml
KC Sivaramakrishnan, Stephen Dolan, Leo White, Sadiq Jaffer, Tom Kelly, Anmol Sahoo, Sudha Parimala, Atul Dhiman, Anil Madhavapeddy
OCaml Labs
Slide 3
The Astrée Static Analyzer
Industry Projects
No multicore support!
Slide 6
Multicore OCaml
• Adds native support for concurrency and shared-memory
parallelism to OCaml
• Focus of this work is parallelism
✦ Building a multicore GC for OCaml
• Key parallel GC design principle
✦ Backwards compatibility before parallel scalability
Slide 9
Challenges
• Millions of lines of legacy code
✦ Weak references, ephemerons, lazy values, finalisers
✦ Low-level C API that bakes in GC invariants
✦ Cost of refactoring sequential code itself is prohibitive
• Type safety
✦ Dolan et al., “Bounding Data Races in Space and Time”, PLDI’18
✦ Strong guarantees (including type safety) under data races
• Low-latency and predictable performance
✦ Thanks to the GC design
Slide 17
Stock OCaml GC
• A generational, non-moving, incremental, mark-and-sweep GC
• Minor heap
✦ Small (2 MB default)
✦ Bump pointer allocation
✦ Survivors copied to major heap
• Major heap
✦ Incremental and non-moving
[Figure: timeline of one major cycle. Labels: Mutator; Start of major cycle; Idle; Mark Roots (mark roots); Mark (mark main); Sweep (sweep); End of major cycle.]
• Fast allocations, no read barriers
• Max GC latency < 10 ms, 99th percentile latency < 1 ms
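To illustrate why bump pointer allocation in the minor heap is fast, here is a minimal sketch in OCaml. It is a toy model, not the runtime's code: the names `minor_heap`, `create`, `alloc` and `reset` are hypothetical, the heap is modelled as an array of words, and the pointer moves upward for simplicity.

```ocaml
(* Toy model of a minor heap: a fixed block of words plus one pointer.
   All names are hypothetical; the real runtime is written in C. *)
type minor_heap = { words : int array; mutable next : int }

let create size = { words = Array.make size 0; next = 0 }

(* Allocation is a bounds check and an addition.
   [None] means the heap is full and a minor collection is needed. *)
let alloc heap n =
  if heap.next + n > Array.length heap.words then None
  else begin
    let start = heap.next in
    heap.next <- heap.next + n;
    Some start
  end

(* Once survivors have been copied to the major heap, the whole minor
   heap is reclaimed at once by resetting the pointer. *)
let reset heap = heap.next <- 0
```

For example, on a 4-word heap, `alloc h 3` succeeds at offset 0, a following `alloc h 2` fails and would trigger a minor collection, and after `reset h` the full 4 words are available again.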
Slide 20
Requirements
1. Feature backwards compatibility
• Serial programs do not break on parallel runtime
• No separate serial and parallel modes
2. Performance backwards compatibility
• Serial programs behave similarly on parallel runtime in terms of
running time, GC pause time and memory usage.
3. Parallel responsiveness and scalability
• Parallel programs remain responsive
• Parallel programs scale with additional cores
Slide 24
Multicore OCaml: Major GC
• Multicore-aware allocator
✦ Based on Streamflow [Schneider et al. 2006]
✦ Thread-local, size-segmented free lists for small objects + malloc for large
allocations
✦ Sequential performance on par with OCaml’s allocators
• A mostly-concurrent, non-moving, mark-and-sweep collector
✦ Based on VCGC [Huelsbergen and Winterbottom 1998]
[Figure: major GC timelines for Domain 0 and Domain 1, from start of major cycle to end of major cycle. Each domain's timeline shows Mark Roots, Mark and Sweep phases; mark and sweep phases may overlap.]
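The idea of size-segmented free lists for small objects, with a fallback for large allocations, can be sketched as follows. This is a simplified illustration and not Streamflow or the Multicore OCaml allocator: the size threshold `max_small`, the `allocator` record and the `Small`/`Large` result type are all assumptions made for the example. In a thread-local design, each domain would own one such `allocator` value, so the common path needs no locking.

```ocaml
(* Hypothetical threshold: sizes 1..4 count as "small". *)
let max_small = 4

type allocator = {
  free : int list array;        (* free.(s): addresses of free blocks of size s *)
  mutable large_allocs : int;   (* requests that would be forwarded to malloc *)
}

type block = Small of int | Large

let create () = { free = Array.make (max_small + 1) []; large_allocs = 0 }

(* Return a small block to the free list of its size class. *)
let release a size addr =
  if size <= max_small then a.free.(size) <- addr :: a.free.(size)

(* Small requests pop from the matching size class; large ones take the
   fallback path. [None] means the size class is empty; a real allocator
   would then obtain a fresh page for that class. *)
let alloc a size =
  if size > max_small then begin
    a.large_allocs <- a.large_allocs + 1;
    Some Large
  end else
    match a.free.(size) with
    | addr :: rest -> a.free.(size) <- rest; Some (Small addr)
    | [] -> None
```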
Slide 29
Multicore OCaml: Major GC
• Extend support to weak references, ephemerons, (2 different kinds
of) finalisers, fibers, lazy values
• Ephemerons are tricky in a concurrent multicore GC
✦ A generalisation of weak references
✦ Introduce conjunction in the reachability property
✦ Requires multiple rounds of ephemeron marking
✦ Cycle-delimited handshaking without global barrier
• A barrier each for the two kinds of finalisers
✦ 3 barriers / cycle worst case
• Verified in the SPIN model checker
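The conjunction in the reachability property, and the need for multiple rounds of ephemeron marking, can be seen in a small sequential model. This is only a sketch under stated assumptions: objects are plain integers, `edges` are ordinary strong pointers, and the `ephemeron` record and function names are hypothetical. It shows the fixpoint iteration only, not the concurrent, handshake-based protocol from the talk.

```ocaml
module IntSet = Set.Make (Int)

(* An ephemeron's value is kept alive only if BOTH the ephemeron itself
   and its key are reachable: that is the conjunction. *)
type ephemeron = { self : int; key : int; value : int }

(* Ordinary marking: follow strong edges from a worklist. *)
let rec mark edges marked = function
  | [] -> marked
  | o :: rest ->
    if IntSet.mem o marked then mark edges marked rest
    else
      let succs =
        List.filter_map (fun (a, b) -> if a = o then Some b else None) edges in
      mark edges (IntSet.add o marked) (succs @ rest)

(* Ephemeron marking: marking one value may make another ephemeron's key
   reachable, so repeat until a round discovers nothing new. *)
let rec mark_all edges ephes marked =
  let newly =
    List.filter_map
      (fun e ->
         if IntSet.mem e.self marked && IntSet.mem e.key marked
            && not (IntSet.mem e.value marked)
         then Some e.value else None)
      ephes in
  if newly = [] then marked
  else mark_all edges ephes (mark edges marked newly)

let reachable ~roots ~edges ~ephes =
  mark_all edges ephes (mark edges IntSet.empty roots)
```

In a chain where one ephemeron's value is the key of another, each round can unlock only the next link, which is why a single pass over the ephemerons is not enough.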
Slide 32
Concurrent Minor GC
• Based on [Doligez and Leroy 1993], but lazier, as in the [Marlow and
Peyton Jones 2011] collector for GHC
[Figure: Domains 0 to 3 each have their own minor heap; all four share a single major heap.]
• Each domain can independently collect its minor heap
• Major to minor pointers allowed
✦ Prevents early promotion & mirrors sequential behaviour
✦ Read barrier required for mutable field + promotion
Slide 35
Read Barriers
• Stock OCaml does not have read barriers
✦ Read barriers need to be efficient for performance backwards
compatibility
• 3 instructions on x86, using VMM and bit-twiddling tricks
✦ Proof of correctness available in the paper
✦ Minimal performance impact on sequential code
• Unfortunately, read barriers break the C API (feature backwards
compatibility)
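What a read barrier has to decide can be shown with a purely conceptual model. This is not the three-instruction x86 sequence mentioned above, and the address layout is an assumption invented for the example: each domain is given a contiguous range of `minor_size` addresses, and anything beyond the last range counts as the major heap. The names `classify` and `read_barrier` are hypothetical.

```ocaml
(* Hypothetical layout: domain d owns addresses
   [d * minor_size, (d + 1) * minor_size); higher addresses are major heap. *)
let minor_size = 1024
let num_domains = 4

type location = Own_minor | Foreign_minor of int | Major

type action = Proceed | Request_promotion of int

let classify ~domain addr =
  let owner = addr / minor_size in
  if owner >= num_domains then Major
  else if owner = domain then Own_minor
  else Foreign_minor owner

(* Only a pointer into another domain's minor heap takes the slow path:
   the reader asks the owning domain to promote the object. *)
let read_barrier ~domain addr =
  match classify ~domain addr with
  | Foreign_minor owner -> Request_promotion owner
  | Own_minor | Major -> Proceed
```

The point of the sketch is that the fast path (own minor heap or major heap) must be very cheap, since every mutable read pays for it, including reads in purely sequential programs.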
Slide 39
Read Barriers
• Service promotion requests on read faults to avoid deadlock
✦ Mutable reads are GC safe points!
• C API written with explicit knowledge of when GC may happen
✦ Need to manually refactor tricky code
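[Figure: a major heap holding x and y, and two minor heaps holding a and b, for Domain 0 and Domain 1. Domain 0 reads !y and Domain 1 reads !x, which triggers promote (!y) and promote (!x).]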
Slide 45
Parallel Minor GC
• Stop-the-world parallel minor collection
✦ Similar to GHC's minor collection
• No need for read barriers!
• Quickly bring all the domains to a barrier
✦ Insert poll points in code for timely inter-domain interrupt handling
[Feeley 1993]
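[Figure: timelines for Dom 0 and Dom 1 under ConcMinor and ParMinor, each running from start major to end major. ConcMinor: each domain has its own sequence of Mutator, Minor GC and Major slice segments. ParMinor: Mutator and Major slice segments, with start minor and end minor marked across both domains; slop space filled with major slices.]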
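The role of poll points can be sketched in user-level OCaml. This is an illustration only: in the real system the compiler inserts the polls and the handler joins the stop-the-world minor collection, whereas here `interrupt_requested`, `handled` and `poll` are hypothetical names and the handler just counts. `Atomic` is the standard library module.

```ocaml
(* Flag set by another domain that wants everyone at a barrier. *)
let interrupt_requested = Atomic.make false

(* Stand-in for "this domain joined the minor GC barrier". *)
let handled = ref 0

let poll () =
  if Atomic.get interrupt_requested then begin
    Atomic.set interrupt_requested false;
    incr handled
  end

(* A loop that never allocates would otherwise never notice the request;
   polling on each iteration bounds the time until it responds. *)
let sum_to n =
  let acc = ref 0 in
  for i = 1 to n do
    poll ();
    acc := !acc + i
  done;
  !acc
```

Without the call to `poll`, the loop in `sum_to` could delay every other domain for as long as it runs, which is why timely interrupt handling matters for a stop-the-world design.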
Slide 48
Evaluation
• 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
✦ 24 cores isolated for performance evaluation
• Sequential Throughput — compared to stock OCaml
✦ ConcMinor 4.9% slower and ParMinor 3.5% slower
✦ ConcMinor 54% lower peak memory and ParMinor 61% lower peak
memory
• Sequential GC pause times on par with stock OCaml
Slide 51
Parallel Scalability
[Figure: scalability plots, with annotations:]
• ConcMinor suffers due to read faults
• Unbalanced allocation leads to inopportune minor GCs in ParMinor
Slide 54
ParMinor vs ConcMinor
• Parallel GC latency roughly similar between ParMinor and
ConcMinor
• ParMinor wins over ConcMinor
✦ Does not break the C API
✦ Performs similarly to ConcMinor on 24 cores
• OCaml 5.00 will have multicore support and use ParMinor
✦ May revisit ConcMinor later for a manycore future
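For readers who have not seen the programming model, here is a minimal sketch of shared-memory parallelism using the `Domain` module from the OCaml 5 standard library (`Domain.spawn` and `Domain.join`). It requires OCaml 5 or later; the functions `fib` and `par_fib_pair` are made up for the example.

```ocaml
(* Requires OCaml 5: Domain.spawn runs a function on a new domain,
   Domain.join waits for its result. *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

(* Compute two independent results in parallel: one on a fresh domain,
   one on the current domain. *)
let par_fib_pair a b =
  let d = Domain.spawn (fun () -> fib a) in
  let y = fib b in
  let x = Domain.join d in
  (x, y)
```

Existing sequential code such as `fib` is used unchanged, which is the backwards-compatibility goal stated earlier in the talk.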
Slide 55
Thanks!
• Multicore OCaml
✦ https://github.com/ocaml-multicore/ocaml-multicore
• Sandmark — benchmark suite for (Multicore) OCaml
✦ https://github.com/ocaml-bench/sandmark/
• SPIN models
✦ https://github.com/ocaml-multicore/multicore-ocaml-verify
• Parallel Programming with Multicore OCaml
✦ https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml