
State of Multicore OCaml


Status of the Multicore OCaml project and future directions.

KC Sivaramakrishnan

June 26, 2018

Transcript

  1. State of Multicore OCaml
     KC Sivaramakrishnan, University of Cambridge / OCaml Labs
  2. Outline
     • Overview of the multicore OCaml project
     • Multicore OCaml runtime design
     • Future directions
  3. Multicore OCaml

  5. Multicore OCaml
     • Add native support for concurrency and (shared-memory) parallelism to OCaml
     • History
       ★ Jan 2014: Initiated by Stephen Dolan and Leo White
       ★ Sep 2014: Multicore OCaml design @ OCaml workshop
       ★ Jan 2015: KC joins the project at OCaml Labs
       ★ Sep 2015: Effect handlers @ OCaml workshop
       ★ Jan 2016: Native code backend for AMD64 on Linux and OS X
       ★ Jun 2016: Multicore rebased to 4.02.2 from 4.00.0
       ★ Sep 2016: Reagents library, Multicore backend for Links @ OCaml workshop
       ★ Apr 2017: ARM64 backend

  8. Multicore OCaml
     • History continued…
       ★ Jun 2017: Handlers for Concurrent System Programming @ TFP
       ★ Sep 2017: Memory model proposal @ OCaml workshop
       ★ Sep 2017: CPS translation for handlers @ FSCD
       ★ Apr 2018: Multicore rebased to 4.06.1 (will track releases going forward)
       ★ Jun 2018: Memory model @ PLDI
     • Looking forward…
       ★ Q3’18 – Q4’18: Implement missing features, upstream prerequisites to trunk
       ★ Q1’19 – Q2’19: Submit feature-based PRs to upstream
  14. Components: Multicore Runtime + Domains, Effect Handlers, Effect System
      • Multicore runtime (current implementation)
        ★ Multicore GC + Domains (creating and managing parallel threads)
      • Effect handlers (current implementation)
        ★ Fibers: runtime-system support for linear delimited continuations
      • Effect system (work in progress)
        ★ Track user-defined effects in the type system
        ★ Statically rule out the possibility of unhandled effects
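For a flavour of what effect handlers let a program express, here is a minimal handler sketch. It uses the `Effect` module as it eventually shipped in OCaml 5; the 2018 multicore branch used a dedicated `effect` syntax instead, so treat this purely as an illustration of the feature, not the branch's API:

```ocaml
open Effect
open Effect.Deep

(* A user-defined effect: ask the handler for the next integer. *)
type _ Effect.t += Next : int Effect.t

(* The computation performs the effect without knowing how it is handled. *)
let sum_three () = perform Next + perform Next + perform Next

(* The handler interprets Next by resuming the (one-shot) continuation
   with successive counter values. *)
let run () =
  let counter = ref 0 in
  match_with sum_three ()
    { retc = (fun result -> result);
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Next -> Some (fun (k : (a, _) continuation) ->
            incr counter;
            continue k !counter)
        | _ -> None) }

let () = assert (run () = 6)  (* the three counter values sum to 6 *)
```

The computation and its interpretation are decoupled: the same `sum_three` could be run under a handler that reads integers from a file or a channel instead.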
  20. Multicore GC
      [Diagram: Domains 0–3, each with its own private minor heap, sharing a single major heap]
      • Independent per-domain minor collection
        ★ Read barrier for mutable fields + promotion to major
      • A new major allocator based on Streamflow [1]: lock-free multithreaded allocation
      • A new major GC based on VCGC [2], adapted to fibers, ephemerons, finalisers

      [1] Scott Schneider, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. "Scalable, locality-conscious multithreaded memory allocation." ISMM 2006.
      [2] Lorenz Huelsbergen and Phil Winterbottom. "Very concurrent mark-&-sweep garbage collection without fine-grain synchronization." ISMM 1998.
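Each domain maps to a runtime thread with its own minor heap, so allocation proceeds without cross-domain synchronisation. A minimal spawn/join sketch, using the `Domain` API as it later stabilised in OCaml 5 (the multicore branch's API differed in detail):

```ocaml
let () =
  (* Spawn two domains that compute independently in parallel; each
     allocates in its own minor heap. *)
  let d1 = Domain.spawn (fun () ->
    List.init 1000 (fun i -> i) |> List.fold_left (+) 0) in
  let d2 = Domain.spawn (fun () ->
    String.length (String.make 42 'x')) in
  (* join blocks until the domain finishes and returns its result. *)
  assert (Domain.join d1 = 499500);
  assert (Domain.join d2 = 42)
```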
  21. Major GC
      • Concurrent, incremental, mark-and-sweep
        ★ Uses a deletion (Yuasa) barrier
        ★ Upper bound on marking work per cycle (not fixed, due to weak refs)
      • 3 phases:
        ★ Sweep-and-mark-main
        ★ Mark-final
        ★ Sweep-ephe
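A deletion (Yuasa) barrier shades the value being overwritten, so that concurrent marking cannot lose an object that was reachable when marking began. OCaml-flavoured pseudocode of the idea; the helper names (`gc_phase_is_marking`, `is_unmarked`, `mark_and_push`) are hypothetical, not the runtime's actual C functions:

```ocaml
(* Pseudocode sketch: the real barrier lives in the runtime's C code. *)
let set_field (obj : Obj.t array) (i : int) (v : Obj.t) =
  let old = obj.(i) in
  (* Shade the deleted value: if marking is in progress and [old] is
     still unmarked, mark it and push it on the mark stack before this
     field stops pointing to it. *)
  if gc_phase_is_marking () && is_unmarked old then mark_and_push old;
  obj.(i) <- v
```

This is what makes marking safe while mutators run concurrently: no object reachable at the snapshot can be freed in the current cycle.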
  30. Major GC: Sweep-and-mark-main
      [Timeline diagram: Domains 0 and 1 interleave Mark Roots, Sweep, Mark, Ephe Mark, and Mutator slices, meeting at a global barrier]
      • Domains begin by marking roots
      • Domains alternate between sweeping their own garbage and running the mutator
      • Domains alternate between marking objects and running the mutator
      • Domains alternate between marking ephemerons, marking other objects, and running the mutator
      • Global barrier to switch to the next phase
        ★ Reading weak keys may make unreachable objects reachable
        ★ Verify that the phase-termination conditions hold
  34. Major GC: Mark-final
      [Timeline diagram: both domains run "Update final first", then interleave Mark, Ephe Mark, and Mutator slices up to a global barrier]
      • Domains update the Gc.finalise finalisers (those that take values) and mark those values
        ★ Preserves the per-domain order of evaluation of finalisers, c.f. trunk
  39. Major GC: Sweep-ephe
      [Timeline diagram: both domains run "Update final last", then interleave Ephe Sweep and Mutator slices up to a global barrier]
      • Domains prepare the Gc.finalise_last finaliser list (finalisers that do not take values)
        ★ Preserves the per-domain order of evaluation of finalisers, c.f. trunk
      • Swap the meaning of the GC colour bits
        ★ MARKED → UNMARKED
        ★ UNMARKED → GARBAGE
        ★ GARBAGE → MARKED
      • Major GC algorithm verified in the SPIN model checker
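The colour swap means the collector never has to clear mark bits: at the end of a cycle it rotates the interpretation of the bit patterns, VCGC-style. A minimal sketch of the rotation (variable names hypothetical; the runtime tracks this with bit patterns, not refs):

```ocaml
(* Which bit pattern currently means MARKED / UNMARKED / GARBAGE. *)
let marked = ref 0
let unmarked = ref 1
let garbage = ref 2

let rotate_colours () =
  let old_marked = !marked in
  marked := !garbage;     (* old GARBAGE pattern now means MARKED *)
  garbage := !unmarked;   (* old UNMARKED pattern now means GARBAGE *)
  unmarked := old_marked  (* old MARKED pattern now means UNMARKED *)
```

After rotation, last cycle's survivors (MARKED) are automatically UNMARKED for the new cycle, and anything left UNMARKED is garbage to be swept.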
  43. Memory Model
      • Goal: balance comprehensibility and performance
      • Generalise
        ★ the SC-DRF property
          ✦ Data-race-free programs have sequential semantics
        ★ to local DRF (LDRF)
          ✦ Data-race-free parts of programs have sequential semantics
      • Bounds data races in space and time
        ★ Data races on one location do not affect the sequential semantics of another
        ★ Data races in the past or the future do not affect the sequential semantics of non-racy accesses
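To illustrate the "space" half of the guarantee: a race on one reference cannot produce surprising values at an unrelated, race-free reference. A sketch using the `Domain` API as it later shipped in OCaml 5 (assumption: the multicore branch exposed an equivalent primitive):

```ocaml
let racy = ref 0    (* two domains race on this, unsynchronised *)
let quiet = ref 0   (* only the main domain ever touches this *)

let () =
  let d = Domain.spawn (fun () -> racy := 1) in
  racy := 2;             (* data race with the spawned domain *)
  quiet := !quiet + 1;   (* race-free: LDRF guarantees this access has
                            sequential semantics, whatever the race on
                            [racy] resolves to *)
  Domain.join d;
  assert (!quiet = 1)
```

Under weaker models such as C++'s, the racy program above has undefined behaviour outright; under local DRF only the racy location's value is weakly specified.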
  47. Memory Model
      • We have developed a memory model that has LDRF
        ★ Atomic and non-atomic locations (no relaxed operations yet)
        ★ Proven-correct (on paper) compilation to x86 and ARMv8
      • Is it practical?
        ★ SC has LDRF, and SRA is conjectured to have LDRF, but neither is practical due to performance impact
      • Must preserve load-store ordering
        ★ Most compiler optimisations remain valid (CSE, LICM)
          ✦ No redundant-store elimination across a load
        ★ Free on x86, low-overhead on ARM (0.6% overhead) and POWER (2.9% overhead)
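The restriction on redundant-store elimination can be made concrete. In the sketch below (a hypothetical compiler-facing example, not branch code), the first store to `r` looks dead because it is overwritten, but a load intervenes, and preserving load-store ordering forbids removing it:

```ocaml
(* The store [r := 1] may not be eliminated: the load [!s] sits between
   it and the overwriting store, and load-store order must be preserved
   so another domain observing [s] and [r] sees consistent ordering. *)
let f (r : int ref) (s : int ref) =
  r := 1;
  let x = !s in
  r := 2;
  x

let () =
  let r = ref 0 and s = ref 5 in
  assert (f r s = 5);
  assert (!r = 2)
```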
  52. Runtime support for effect handlers
      • Linear delimited continuations
        ★ Linearity enforced by the runtime
        ★ Raise an exception when a continuation is resumed more than once
        ★ A finaliser discontinues unresumed continuations
      • Fibers: heap-managed stack segments
        ★ Require stack-overflow checks at function entry
        ★ Static analysis removes checks in small leaf functions
      • C calls need to be performed on the C stack
        ★ < 1% performance slowdown on average for this feature
        ★ DWARF magic allows a full backtrace across nested calls of handlers, C calls and callbacks
      • WIP: support capturing continuations that include C frames, c.f. “Threads Yield Continuations”
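Linearity is directly visible to the programmer: resuming the same continuation twice raises. A sketch using OCaml 5's `Effect` module (the 2018 branch's syntax differed, but the one-shot behaviour is the same):

```ocaml
open Effect
open Effect.Deep

type _ Effect.t += Yield : unit Effect.t

let run () =
  try_with
    (fun () -> perform Yield; "done")
    ()
    { effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Yield -> Some (fun (k : (a, _) continuation) ->
            let first = continue k () in
            (* The runtime enforces linearity: resuming [k] a second
               time raises Continuation_already_resumed. *)
            (try ignore (continue k ()); assert false
             with Continuation_already_resumed -> ());
            first)
        | _ -> None) }

let () = assert (run () = "done")
```

One-shot continuations are what let fibers be plain heap-managed stack segments: a resumed fiber's stack can be reused in place rather than copied.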
  53. Status
      • Major GC and fiber implementations are stable modulo bugs
        ★ TODO: effect system
      • Laundry list of minor features
        ★ https://github.com/ocamllabs/ocaml-multicore/projects/3
      • We need
        ★ Benchmarks
        ★ Benchmarking tools and infrastructure
        ★ Performance tuning
  57. Future Directions: Memory Model
      • The memory model only supports atomic and non-atomic locations
        ★ Extend it with weaker atomics and “new ref” while preserving the LDRF theorem
      • Avoid becoming C++: multiple weak atomics with subtle interactions
        ★ Could we expose restricted APIs to the programmer?
      • Verify multicore OCaml programs
        ★ Explore (semi-)automated SMT-aided verification
        ★ Challenge problem: verify the k-CAS at the heart of the Reagents library
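k-CAS generalises single-word compare-and-swap to k locations updated atomically, which is why verifying it is a good challenge problem. For flavour, a single-word CAS retry loop using the `Atomic` module (added to the stdlib in OCaml 4.12; assumption: the multicore branch exposed an equivalent primitive):

```ocaml
(* Lock-free increment: retry until our compare-and-set wins the race. *)
let rec atomic_incr (r : int Atomic.t) =
  let old = Atomic.get r in
  if not (Atomic.compare_and_set r old (old + 1)) then atomic_incr r

let () =
  let r = Atomic.make 41 in
  atomic_incr r;
  assert (Atomic.get r = 42)
```

A k-CAS must make the same succeed-or-retry decision over k locations at once, typically via intermediate descriptor states, which is exactly the kind of subtle interleaving SMT-aided verification would target.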
  61. Future Directions: Multicore MirageOS
      • MirageOS rewrite to take advantage of typed effect handlers and multicore parallelism
        ★ Typed effects for better error handling and concurrency
      • Better concurrency model over Xen block devices
        ★ Extricate ourselves from dependence on the POSIX API
        ★ Distinguish the various concurrency levels (CPU, application, I/O) in the scheduler
        ★ Failure and back pressure as first-class operations
      • Multicore-capable Irmin, a branch-consistent database library
  62. Future Directions: Heterogeneous Systems
      • Programming heterogeneous, non-von-Neumann architectures
        ★ How do we capture the computational model in a richer type system?
        ★ How do we compile efficiently to such systems?