Slide 1

Slide 1 text

Retrofitting Parallelism onto OCaml
KC Sivaramakrishnan, Stephen Dolan, Leo White, Sadiq Jaffer, Tom Kelly, Anmol Sahoo, Sudha Parimala, Atul Dhiman, Anil Madhavapeddy
OCaml Labs

Slide 2

Slide 2 text

The Astrée Static Analyzer Industry Projects

Slide 3

Slide 3 text

The Astrée Static Analyzer Industry Projects No multicore support!

Slide 4

Slide 4 text

Multicore OCaml
• Adds native support for concurrency and shared-memory parallelism to OCaml

Slide 5

Slide 5 text

Multicore OCaml
• Adds native support for concurrency and shared-memory parallelism to OCaml
• Focus of this work is parallelism
  ✦ Building a multicore GC for OCaml

Slide 6

Slide 6 text

Multicore OCaml
• Adds native support for concurrency and shared-memory parallelism to OCaml
• Focus of this work is parallelism
  ✦ Building a multicore GC for OCaml
• Key parallel GC design principle
  ✦ Backwards compatibility before parallel scalability

Slide 7

Slide 7 text

Challenges
• Millions of lines of legacy code
  ✦ Weak references, ephemerons, lazy values, finalisers
  ✦ Low-level C API that bakes in GC invariants
  ✦ Cost of refactoring sequential code itself is prohibitive

Slide 8

Slide 8 text

Challenges
• Millions of lines of legacy code
  ✦ Weak references, ephemerons, lazy values, finalisers
  ✦ Low-level C API that bakes in GC invariants
  ✦ Cost of refactoring sequential code itself is prohibitive
• Type safety
  ✦ Dolan et al., “Bounding Data Races in Space and Time”, PLDI’18
  ✦ Strong guarantees (including type safety) under data races

Slide 9

Slide 9 text

Challenges
• Millions of lines of legacy code
  ✦ Weak references, ephemerons, lazy values, finalisers
  ✦ Low-level C API that bakes in GC invariants
  ✦ Cost of refactoring sequential code itself is prohibitive
• Type safety
  ✦ Dolan et al., “Bounding Data Races in Space and Time”, PLDI’18
  ✦ Strong guarantees (including type safety) under data races
• Low-latency and predictable performance
  ✦ Thanks to the GC design

Slide 10

Slide 10 text

Stock OCaml GC
• A generational, non-moving, incremental, mark-and-sweep GC
• Minor heap
  ✦ Small (2 MB default)
  ✦ Bump pointer allocation
  ✦ Survivors copied to major heap
• Major heap
  ✦ Incremental and non-moving

Slide 11

Slide 11 text

Stock OCaml GC
• A generational, non-moving, incremental, mark-and-sweep GC
• Minor heap
  ✦ Small (2 MB default)
  ✦ Bump pointer allocation
  ✦ Survivors copied to major heap
• Major heap
  ✦ Incremental and non-moving
[Figure: major-cycle timeline. Mutator running; phase Idle at the start of the major cycle.]

Slide 12

Slide 12 text

Stock OCaml GC
• A generational, non-moving, incremental, mark-and-sweep GC
• Minor heap
  ✦ Small (2 MB default)
  ✦ Bump pointer allocation
  ✦ Survivors copied to major heap
• Major heap
  ✦ Incremental and non-moving
[Figure: major-cycle timeline from the start of the major cycle. Phases: Idle, Mark Roots. Mutator interleaved with GC slices: mark roots.]

Slide 13

Slide 13 text

Stock OCaml GC
• A generational, non-moving, incremental, mark-and-sweep GC
• Minor heap
  ✦ Small (2 MB default)
  ✦ Bump pointer allocation
  ✦ Survivors copied to major heap
• Major heap
  ✦ Incremental and non-moving
[Figure: major-cycle timeline from the start of the major cycle. Phases: Idle, Mark Roots, Mark. Mutator interleaved with GC slices: mark roots, mark main.]

Slide 14

Slide 14 text

Stock OCaml GC
• A generational, non-moving, incremental, mark-and-sweep GC
• Minor heap
  ✦ Small (2 MB default)
  ✦ Bump pointer allocation
  ✦ Survivors copied to major heap
• Major heap
  ✦ Incremental and non-moving
[Figure: major-cycle timeline from the start of the major cycle. Phases: Idle, Mark Roots, Mark, Sweep. Mutator interleaved with GC slices: mark roots, mark main, sweep.]

Slide 15

Slide 15 text

Stock OCaml GC
• A generational, non-moving, incremental, mark-and-sweep GC
• Minor heap
  ✦ Small (2 MB default)
  ✦ Bump pointer allocation
  ✦ Survivors copied to major heap
• Major heap
  ✦ Incremental and non-moving
[Figure: major-cycle timeline from the start to the end of the major cycle. Phases: Idle, Mark Roots, Mark, Sweep. Mutator interleaved with GC slices: mark roots, mark main, sweep.]

Slide 16

Slide 16 text

Stock OCaml GC
• A generational, non-moving, incremental, mark-and-sweep GC
• Minor heap
  ✦ Small (2 MB default)
  ✦ Bump pointer allocation
  ✦ Survivors copied to major heap
• Major heap
  ✦ Incremental and non-moving
[Figure: major-cycle timeline from the start to the end of the major cycle. Phases: Idle, Mark Roots, Mark, Sweep. Mutator interleaved with GC slices: mark roots, mark main, sweep.]
• Fast allocations, no read barriers

Slide 17

Slide 17 text

Stock OCaml GC
• A generational, non-moving, incremental, mark-and-sweep GC
• Minor heap
  ✦ Small (2 MB default)
  ✦ Bump pointer allocation
  ✦ Survivors copied to major heap
• Major heap
  ✦ Incremental and non-moving
[Figure: major-cycle timeline from the start to the end of the major cycle. Phases: Idle, Mark Roots, Mark, Sweep. Mutator interleaved with GC slices: mark roots, mark main, sweep.]
• Fast allocations, no read barriers
• Max GC latency < 10 ms, 99th percentile latency < 1 ms
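
To make "bump pointer allocation" and "survivors copied to major heap" concrete, here is a minimal toy model in Python. It is a sketch under stated assumptions, not the OCaml runtime's code: the class name `MinorHeap`, its methods, and the idea of passing the live set in by hand are all hypothetical, and the direction in which the pointer moves is an arbitrary choice of the model.

```python
class MinorHeap:
    """Toy minor heap: a fixed-size arena with a bump pointer (hypothetical model)."""

    def __init__(self, size_words):
        self.size_words = size_words
        self.next_free = 0        # the bump pointer
        self.objects = {}         # offset -> payload, kept only for the model

    def alloc(self, size_words, payload):
        """Return the offset of the new object, or None if the arena is full."""
        if self.next_free + size_words > self.size_words:
            return None           # the caller must run a minor collection first
        offset = self.next_free
        self.next_free += size_words   # allocation is one compare and one add
        self.objects[offset] = payload
        return offset

    def collect(self, live_offsets, major_heap):
        """Copy survivors to the major heap, then reset the bump pointer."""
        for offset in sorted(live_offsets):
            major_heap.append(self.objects[offset])
        self.objects.clear()
        self.next_free = 0


heap = MinorHeap(size_words=8)
major_heap = []
a = heap.alloc(3, "a")
b = heap.alloc(3, "b")
c = heap.alloc(3, "c")            # does not fit: 6 + 3 > 8
heap.collect(live_offsets={b}, major_heap=major_heap)
c_retry = heap.alloc(3, "c")      # fits again after the collection
```

The fast path touches only the bump pointer, which is why allocation in the minor heap is cheap; the cost is paid at collection time, and only for the objects that survive.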

Slide 18

Slide 18 text

Requirements
1. Feature backwards compatibility
  • Serial programs do not break on parallel runtime
  • No separate serial and parallel modes

Slide 19

Slide 19 text

Requirements
1. Feature backwards compatibility
  • Serial programs do not break on parallel runtime
  • No separate serial and parallel modes
2. Performance backwards compatibility
  • Serial programs behave similarly on parallel runtime in terms of running time, GC pause time and memory usage

Slide 20

Slide 20 text

Requirements
1. Feature backwards compatibility
  • Serial programs do not break on parallel runtime
  • No separate serial and parallel modes
2. Performance backwards compatibility
  • Serial programs behave similarly on parallel runtime in terms of running time, GC pause time and memory usage
3. Parallel responsiveness and scalability
  • Parallel programs remain responsive
  • Parallel programs scale with additional cores

Slide 21

Slide 21 text

Multicore OCaml: Major GC
• Multicore-aware allocator
  ✦ Based on Streamflow [Schneider et al. 2006]
  ✦ Thread-local, size-segmented free lists for small objects + malloc for large allocations
  ✦ Sequential performance on par with OCaml’s allocators

Slide 22

Slide 22 text

Multicore OCaml: Major GC
• Multicore-aware allocator
  ✦ Based on Streamflow [Schneider et al. 2006]
  ✦ Thread-local, size-segmented free lists for small objects + malloc for large allocations
  ✦ Sequential performance on par with OCaml’s allocators
• A mostly-concurrent, non-moving, mark-and-sweep collector
  ✦ Based on VCGC [Huelsbergen and Winterbottom 1998]

Slide 23

Slide 23 text

Multicore OCaml: Major GC
• Multicore-aware allocator
  ✦ Based on Streamflow [Schneider et al. 2006]
  ✦ Thread-local, size-segmented free lists for small objects + malloc for large allocations
  ✦ Sequential performance on par with OCaml’s allocators
• A mostly-concurrent, non-moving, mark-and-sweep collector
  ✦ Based on VCGC [Huelsbergen and Winterbottom 1998]
[Figure: Domain 0 and Domain 1 each run Mark Roots, Mark and Sweep phases between the start and the end of a major cycle.]

Slide 24

Slide 24 text

Multicore OCaml: Major GC
• Multicore-aware allocator
  ✦ Based on Streamflow [Schneider et al. 2006]
  ✦ Thread-local, size-segmented free lists for small objects + malloc for large allocations
  ✦ Sequential performance on par with OCaml’s allocators
• A mostly-concurrent, non-moving, mark-and-sweep collector
  ✦ Based on VCGC [Huelsbergen and Winterbottom 1998]
[Figure: Domain 0 and Domain 1 each run Mark Roots, Mark and Sweep phases between the start and the end of a major cycle; mark and sweep phases may overlap.]
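
The allocator bullet above ("size-segmented free lists for small objects + malloc for large allocations") can be illustrated with a small Python sketch. This is a toy model under assumptions, not Streamflow or the Multicore OCaml allocator: the class `SizeClassAllocator`, the chosen size classes, and the `("large", id)` stand-in for a `malloc` call are hypothetical. "Thread-local" is modelled by giving each domain its own instance, so the fast path needs no lock.

```python
class SizeClassAllocator:
    """Toy per-domain allocator with one free list per size class (hypothetical model)."""

    def __init__(self, size_classes=(1, 2, 4, 8, 16)):
        self.size_classes = sorted(size_classes)
        self.free_lists = {c: [] for c in self.size_classes}
        self.next_id = 0          # used to give every fresh block a unique identity

    def size_class(self, size):
        """Smallest class that fits `size`, or None for a large allocation."""
        for c in self.size_classes:
            if size <= c:
                return c
        return None

    def alloc(self, size):
        c = self.size_class(size)
        if c is not None and self.free_lists[c]:
            return self.free_lists[c].pop()      # fast path: reuse a freed block
        kind = "large" if c is None else c       # "large" stands in for malloc
        block = (kind, self.next_id)
        self.next_id += 1
        return block

    def free(self, block):
        kind, _ = block
        if kind != "large":
            self.free_lists[kind].append(block)  # return the block to its class


domain0 = SizeClassAllocator()    # one allocator instance per domain
first = domain0.alloc(3)          # rounded up to size class 4
domain0.free(first)
reused = domain0.alloc(4)         # served from the free list of class 4
big = domain0.alloc(100)          # larger than any class: the large path
```

Segregating by size keeps the free lists simple (any block on a list fits any request of that class), at the price of some internal fragmentation from rounding sizes up.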

Slide 25

Slide 25 text

Multicore OCaml: Major GC

Slide 26

Slide 26 text

Multicore OCaml: Major GC
• Extend support to weak references, ephemerons, (2 different kinds of) finalisers, fibers, lazy values

Slide 27

Slide 27 text

Multicore OCaml: Major GC
• Extend support to weak references, ephemerons, (2 different kinds of) finalisers, fibers, lazy values
• Ephemerons are tricky in a concurrent multicore GC
  ✦ A generalisation of weak references
  ✦ Introduce conjunction in the reachability property
  ✦ Require multiple rounds of ephemeron marking
  ✦ Cycle-delimited handshaking without global barrier

Slide 28

Slide 28 text

Multicore OCaml: Major GC
• Extend support to weak references, ephemerons, (2 different kinds of) finalisers, fibers, lazy values
• Ephemerons are tricky in a concurrent multicore GC
  ✦ A generalisation of weak references
  ✦ Introduce conjunction in the reachability property
  ✦ Require multiple rounds of ephemeron marking
  ✦ Cycle-delimited handshaking without global barrier
• A barrier each for the two kinds of finalisers
  ✦ 3 barriers / cycle worst case

Slide 29

Slide 29 text

Multicore OCaml: Major GC
• Extend support to weak references, ephemerons, (2 different kinds of) finalisers, fibers, lazy values
• Ephemerons are tricky in a concurrent multicore GC
  ✦ A generalisation of weak references
  ✦ Introduce conjunction in the reachability property
  ✦ Require multiple rounds of ephemeron marking
  ✦ Cycle-delimited handshaking without global barrier
• A barrier each for the two kinds of finalisers
  ✦ 3 barriers / cycle worst case
• Verified in the SPIN model checker
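
Why ephemerons need "multiple rounds of ephemeron marking" can be seen in a small sequential model. The sketch below is an assumption-laden illustration in Python, not the concurrent algorithm from the talk: the function `mark_with_ephemerons` and the dictionary encoding of the heap are hypothetical. The "conjunction" is the rule that an ephemeron's value is kept alive only if both the ephemeron and its key are reachable.

```python
def mark_with_ephemerons(roots, edges, ephemerons):
    """Return (marked objects, number of ephemeron rounds).

    edges:      object -> list of strongly referenced objects
    ephemerons: ephemeron -> (key, value); the value is marked only when
                the ephemeron AND its key are both marked.
    """
    marked = set()

    def mark_from(start):
        stack = [start]
        while stack:
            obj = stack.pop()
            if obj in marked:
                continue
            marked.add(obj)
            stack.extend(edges.get(obj, ()))

    for root in roots:
        mark_from(root)

    rounds = 0
    changed = True
    while changed:                # repeat until a round marks nothing new
        changed = False
        rounds += 1
        for eph, (key, value) in ephemerons.items():
            if eph in marked and key in marked and value not in marked:
                mark_from(value)  # may make the key of another ephemeron live
                changed = True
    return marked, rounds


# e1 keeps k2 alive through k1; only then can e2 keep v2 alive through k2.
# e3's key k3 is never reachable, so v3 stays unmarked.
ephemerons = {"e2": ("k2", "v2"), "e1": ("k1", "k2"), "e3": ("k3", "v3")}
edges = {"v2": ["w"]}
marked, rounds = mark_with_ephemerons(["e1", "e2", "e3", "k1"], edges, ephemerons)
```

In this example the first round marks `k2`, the second marks `v2` (and `w` through it), and a third round is needed just to confirm that nothing else changed. In a concurrent collector each domain has to agree that such a fixpoint has been reached, which is what makes the handshaking delicate.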

Slide 30

Slide 30 text

Concurrent Minor GC
• Based on [Doligez and Leroy 1993], but lazier, as in the [Marlow and Peyton Jones 2011] collector for GHC
[Figure: Domains 0 to 3 each have their own minor heap; all share one major heap.]

Slide 31

Slide 31 text

Concurrent Minor GC
• Based on [Doligez and Leroy 1993], but lazier, as in the [Marlow and Peyton Jones 2011] collector for GHC
[Figure: Domains 0 to 3 each have their own minor heap; all share one major heap.]
• Each domain can independently collect its minor heap

Slide 32

Slide 32 text

Concurrent Minor GC
• Based on [Doligez and Leroy 1993], but lazier, as in the [Marlow and Peyton Jones 2011] collector for GHC
[Figure: Domains 0 to 3 each have their own minor heap; all share one major heap.]
• Each domain can independently collect its minor heap
• Major to minor pointers allowed
  ✦ Prevents early promotion & mirrors sequential behaviour
  ✦ Read barrier required for mutable field + promotion

Slide 33

Slide 33 text

Read Barriers
• Stock OCaml does not have read barriers
  ✦ Read barriers need to be efficient for performance backwards compatibility

Slide 34

Slide 34 text

Read Barriers
• Stock OCaml does not have read barriers
  ✦ Read barriers need to be efficient for performance backwards compatibility
• 3 instructions on x86: VMM + bit-twiddling tricks
  ✦ Proof of correctness available in the paper
  ✦ Minimal performance impact on sequential code

Slide 35

Slide 35 text

Read Barriers
• Stock OCaml does not have read barriers
  ✦ Read barriers need to be efficient for performance backwards compatibility
• 3 instructions on x86: VMM + bit-twiddling tricks
  ✦ Proof of correctness available in the paper
  ✦ Minimal performance impact on sequential code
• Unfortunately, read barriers break the C API (feature backwards compatibility)

Slide 36

Slide 36 text

Read Barriers
[Figure: a major heap holding x and y, and the minor heaps of Domain 0 and Domain 1 holding a and b. Domain 0 reads !y while Domain 1 reads !x.]

Slide 37

Slide 37 text

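Read Barriers
[Figure: a major heap holding x and y, and the minor heaps of Domain 0 and Domain 1 holding a and b. Domain 0 reads !y while Domain 1 reads !x; the reads trigger the requests promote (!y) and promote (!x).]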

Slide 38

Slide 38 text

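Read Barriers
• Service promotion requests on read faults to avoid deadlock
  ✦ Mutable reads are GC safe points!
[Figure: a major heap holding x and y, and the minor heaps of Domain 0 and Domain 1 holding a and b. Domain 0 reads !y while Domain 1 reads !x; the reads trigger the requests promote (!y) and promote (!x).]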

Slide 39

Slide 39 text

Read Barriers
• Service promotion requests on read faults to avoid deadlock
  ✦ Mutable reads are GC safe points!
• C API written with explicit knowledge of when GC may happen
  ✦ Need to manually refactor tricky code
[Figure: a major heap holding x and y, and the minor heaps of Domain 0 and Domain 1 holding a and b. Domain 0 reads !y while Domain 1 reads !x; the reads trigger the requests promote (!y) and promote (!x).]

Slide 40

Slide 40 text

Parallel Minor GC
• Stop-the-world parallel minor collection
  ✦ Similar to GHC's minor collection

Slide 41

Slide 41 text

Parallel Minor GC
• Stop-the-world parallel minor collection
  ✦ Similar to GHC's minor collection
[Figure: ConcMinor timeline for Dom 0 and Dom 1, showing mutator, minor GC and major slice segments between Start major and End major.]

Slide 42

Slide 42 text

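Parallel Minor GC
• Stop-the-world parallel minor collection
  ✦ Similar to GHC's minor collection
[Figure: ConcMinor timeline for Dom 0 and Dom 1, showing mutator, minor GC and major slice segments between Start major and End major.]
[Figure: ParMinor timeline, showing mutator and major slice segments between Start major and End major, with minor collections delimited by Start minor and End minor.]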

Slide 43

Slide 43 text

Parallel Minor GC
• Stop-the-world parallel minor collection
  ✦ Similar to GHC's minor collection
[Figure: ConcMinor timeline for Dom 0 and Dom 1, showing mutator, minor GC and major slice segments between Start major and End major.]
[Figure: ParMinor timeline, showing mutator and major slice segments between Start major and End major, with minor collections delimited by Start minor and End minor. Slop space filled with major slices.]

Slide 44

Slide 44 text

Parallel Minor GC
• Stop-the-world parallel minor collection
  ✦ Similar to GHC's minor collection
• No need for read barriers!
[Figure: ConcMinor timeline for Dom 0 and Dom 1, showing mutator, minor GC and major slice segments between Start major and End major.]
[Figure: ParMinor timeline, showing mutator and major slice segments between Start major and End major, with minor collections delimited by Start minor and End minor. Slop space filled with major slices.]

Slide 45

Slide 45 text

Parallel Minor GC
• Stop-the-world parallel minor collection
  ✦ Similar to GHC's minor collection
• No need for read barriers!
• Quickly bring all the domains to a barrier
  ✦ Insert poll points in code for timely inter-domain interrupt handling [Feeley 1993]
[Figure: ConcMinor timeline for Dom 0 and Dom 1, showing mutator, minor GC and major slice segments between Start major and End major.]
[Figure: ParMinor timeline, showing mutator and major slice segments between Start major and End major, with minor collections delimited by Start minor and End minor. Slop space filled with major slices.]
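
The role of poll points in "quickly bring all the domains to a barrier" can be sketched with ordinary threads. The Python model below is an illustration under assumptions, not the OCaml runtime's mechanism: the function `run_stop_the_world` and the helper names are hypothetical, threads stand in for domains, and a separate coordinator thread requests the collection, whereas in the talk's design a mutator domain would do so.

```python
import threading


def run_stop_the_world(n_domains=4):
    """Stop every mutator at a poll point once, then let them all resume."""
    interrupt = threading.Event()     # set when a minor collection is requested
    stop = threading.Event()          # ends the demonstration
    barrier = threading.Barrier(n_domains + 1, timeout=10)   # mutators + coordinator
    stops_seen = [0] * n_domains      # how often each mutator was parked

    def poll(i):
        # A poll point: a cheap check the compiler would insert into code
        # so that no domain runs for long without noticing an interrupt.
        if interrupt.is_set():
            barrier.wait()            # arrive: this domain is now stopped
            stops_seen[i] += 1
            barrier.wait()            # leave: the collection has finished

    def mutator(i):
        while not stop.is_set():
            sum(range(100))           # stand-in for useful work
            poll(i)

    threads = [threading.Thread(target=mutator, args=(i,)) for i in range(n_domains)]
    for t in threads:
        t.start()

    interrupt.set()                   # request the stop-the-world section
    barrier.wait()                    # returns once every mutator is parked
    # ... the parallel minor collection would run here ...
    interrupt.clear()                 # cleared before anyone is released
    barrier.wait()                    # release the mutators
    stop.set()
    for t in threads:
        t.join()
    return stops_seen


result = run_stop_the_world(4)
```

The latency of the stop is bounded by the longest stretch of code without a poll point, which is why the poll points need to be placed so that every loop reaches one regularly.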

Slide 46

Slide 46 text

Evaluation
• 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
  ✦ 24 cores isolated for performance evaluation

Slide 47

Slide 47 text

Evaluation
• 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
  ✦ 24 cores isolated for performance evaluation
• Sequential throughput, compared to stock OCaml
  ✦ ConcMinor 4.9% slower and ParMinor 3.5% slower
  ✦ ConcMinor 54% lower peak memory and ParMinor 61% lower peak memory

Slide 48

Slide 48 text

Evaluation
• 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
  ✦ 24 cores isolated for performance evaluation
• Sequential throughput, compared to stock OCaml
  ✦ ConcMinor 4.9% slower and ParMinor 3.5% slower
  ✦ ConcMinor 54% lower peak memory and ParMinor 61% lower peak memory
• Sequential GC pause times on par with stock OCaml

Slide 49

Slide 49 text

Parallel Scalability

Slide 50

Slide 50 text

Parallel Scalability
• ConcMinor suffers due to read faults

Slide 51

Slide 51 text

Parallel Scalability
• ConcMinor suffers due to read faults
• Unbalanced allocation leads to inopportune minor GCs in ParMinor

Slide 52

Slide 52 text

ParMinor vs ConcMinor
• Parallel GC latency roughly similar between ParMinor and ConcMinor

Slide 53

Slide 53 text

ParMinor vs ConcMinor
• Parallel GC latency roughly similar between ParMinor and ConcMinor
• ParMinor wins over ConcMinor
  ✦ Does not break the C API
  ✦ Performs similarly to ConcMinor on 24 cores

Slide 54

Slide 54 text

ParMinor vs ConcMinor
• Parallel GC latency roughly similar between ParMinor and ConcMinor
• ParMinor wins over ConcMinor
  ✦ Does not break the C API
  ✦ Performs similarly to ConcMinor on 24 cores
• OCaml 5.00 will have multicore support and use ParMinor
  ✦ May revisit ConcMinor later for a manycore future

Slide 55

Slide 55 text

Thanks!
• Multicore OCaml
  ✦ https://github.com/ocaml-multicore/ocaml-multicore
• Sandmark: benchmark suite for (Multicore) OCaml
  ✦ https://github.com/ocaml-bench/sandmark/
• SPIN models
  ✦ https://github.com/ocaml-multicore/multicore-ocaml-verify
• Parallel Programming with Multicore OCaml
  ✦ https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml