
State of Multicore OCaml

Status of the Multicore OCaml project and future directions.

KC Sivaramakrishnan

June 26, 2018

Transcript

  1. State of Multicore OCaml
    KC Sivaramakrishnan
    University of Cambridge
    OCaml Labs

  2. Outline
    • Overview of the multicore OCaml project
    • Multicore OCaml runtime design
    • Future directions

  3. Multicore OCaml
    • Add native support for concurrency and (shared-memory) parallelism to OCaml
    • History
    ★ Jan 2014: Initiated by Stephen Dolan and Leo White
    ★ Sep 2014: Multicore OCaml design @ OCaml workshop
    ★ Jan 2015: KC joins the project at OCaml Labs
    ★ Sep 2015: Effect handlers @ OCaml workshop
    ★ Jan 2016: Native-code backend for amd64 on Linux and OS X
    ★ Jun 2016: Multicore rebased from 4.00.0 to 4.02.2
    ★ Sep 2016: Reagents library and a multicore backend for Links @ OCaml workshop
    ★ Apr 2017: ARM64 backend

  4. Multicore OCaml
    • History, continued…
    ★ Jun 2017: Handlers for Concurrent System Programming @ TFP
    ★ Sep 2017: Memory model proposal @ OCaml workshop
    ★ Sep 2017: CPS translation for handlers @ FSCD
    ★ Apr 2018: Multicore rebased to 4.06.1 (will track releases going forward)
    ★ Jun 2018: Memory model @ PLDI
    • Looking forward…
    ★ Q3’18 — Q4’18: Implement missing features, upstream prerequisites to trunk
    ★ Q1’19 — Q2’19: Submit feature-based PRs to upstream

  5. Components
    [Diagram: three components — the Multicore Runtime plus Domains and
    Effect Handlers (the current implementation), and the Effect System
    (work in progress)]
    • Multicore Runtime
    ★ Multicore GC + Domains (creating and managing parallel threads)
    • Effect handlers
    ★ Fibers: runtime-system support for linear delimited continuations
    • Effect system
    ★ Track user-defined effects in the type system
    ★ Statically rule out the possibility of unhandled effects
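
To make Domains concrete, here is a minimal sketch of running two computations in parallel, assuming a Domain.spawn/Domain.join API along the lines of the multicore branch's (the exact signatures evolved over time, so treat the names as illustrative):

```ocaml
(* A minimal sketch, assuming Domain.spawn : (unit -> 'a) -> 'a Domain.t
   and Domain.join : 'a Domain.t -> 'a, as in later versions of the API. *)
let parallel_sum () =
  let d1 = Domain.spawn (fun () -> List.fold_left (+) 0 [1; 2; 3]) in
  let d2 = Domain.spawn (fun () -> List.fold_left (+) 0 [4; 5; 6]) in
  (* Each domain runs on its own system thread and may execute in
     parallel; join waits for the domain's result. *)
  Domain.join d1 + Domain.join d2
```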

  6. Multicore GC
    [Diagram: Domains 0–3, each with its own private minor heap, all
    sharing a single major heap]
    • Independent per-domain minor collection
    ★ Read barrier for mutable fields + promotion to the major heap
    • A new major allocator based on StreamFlow [1]: lock-free multithreaded allocation
    • A new major GC based on VCGC [2], adapted to fibers, ephemerons and finalisers
    [1] Scott Schneider, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. "Scalable, locality-conscious multithreaded memory allocation." ISMM 2006.
    [2] Lorenz Huelsbergen and Phil Winterbottom. "Very concurrent mark-&-sweep garbage collection without fine-grain synchronization." ISMM 1998.
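
A toy OCaml model of the read-barrier decision on the slide above: reads of mutable fields must check for pointers into another domain's minor heap and promote before handing the value to the mutator. The real barrier lives in the C runtime; the stub functions below are hypothetical stand-ins for the runtime's address-range check and promotion protocol.

```ocaml
(* Illustrative model only: the real barrier is implemented in the C
   runtime, and the stubs below are hypothetical. *)
type value = Imm of int | Ptr of int          (* toy value representation *)

(* Stub: the runtime checks whether a pointer falls in another domain's
   minor-heap address range. *)
let points_to_foreign_minor_heap (_ : value) = false

(* Stub: the runtime asks the owning domain to copy the object to the
   shared major heap and returns the forwarded pointer. *)
let promote (v : value) = v

(* Reads of mutable fields go through the barrier. *)
let read_barrier (v : value) : value =
  if points_to_foreign_minor_heap v then promote v else v
```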

  7. Major GC
    • Concurrent, incremental mark-and-sweep
    ★ Uses a deletion (Yuasa) barrier
    ★ Upper bound on marking work per cycle (not fixed, due to weak references)
    • Three phases:
    ★ Sweep-and-mark-main
    ★ Mark-final
    ★ Sweep-ephe
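
The phases are global: every domain is in the same phase at the same time, with barriers in between (detailed on the following slides). As a reading aid, the cycle can be summarised as a simple type; the names mirror the slides, not the runtime's C representation.

```ocaml
(* Illustrative summary of the major-GC cycle. *)
type gc_phase =
  | Sweep_and_mark_main  (* sweep own garbage, mark from roots *)
  | Mark_final           (* update Gc.finalise finalisers, mark their values *)
  | Sweep_ephe           (* sweep ephemerons, prepare Gc.finalise_last list *)

(* Phases advance in order, separated by global barriers. *)
let next_phase = function
  | Sweep_and_mark_main -> Mark_final
  | Mark_final -> Sweep_ephe
  | Sweep_ephe -> Sweep_and_mark_main  (* next cycle *)
```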

  8. Major GC: Sweep-and-mark-main
    [Timeline diagram: Domains 0 and 1 each start with a Mark Roots slice,
    then interleave Sweep, Mark, Ephe Mark and Mutator slices, meeting at
    a global Barrier]
    • Domains begin by marking roots
    • Domains alternate between sweeping their own garbage and running the mutator
    • Domains alternate between marking objects and running the mutator
    • Domains alternate between marking ephemerons, marking other objects and running the mutator
    • Global barrier to switch to the next phase
    ★ Reading weak keys may make unreachable objects reachable
    ★ Verify that the phase-termination conditions hold

  9. Major GC: Mark-final
    [Timeline diagram: each domain runs an Update final first slice, then
    interleaves Mark, Ephe Mark and Mutator slices, meeting at a global
    Barrier]
    • Domains update the Gc.finalise finalisers (the ones that take values) and mark those values
    ★ Preserves the per-domain order of evaluation of finalisers, as on trunk

  10. Major GC: Sweep-ephe
    [Timeline diagram: each domain runs an Update final last slice, then
    interleaves Ephe Sweep and Mutator slices, meeting at a global Barrier]
    • Domains prepare the list of Gc.finalise_last finalisers (the ones that do not take values)
    ★ Preserves the per-domain order of evaluation of finalisers, as on trunk
    • Swap the meaning of the GC colour bits:
    ★ MARKED → UNMARKED
    ★ UNMARKED → GARBAGE
    ★ GARBAGE → MARKED
    • Major GC algorithm verified in the SPIN model checker
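
The swap rotates the interpretation of the colour bits rather than rewriting object headers, so it costs nothing per object. A minimal sketch of the rotation (illustrative, not the runtime's bit-level representation):

```ocaml
(* Illustrative sketch: the runtime rotates what each header bit pattern
   means, so no object needs to be touched at the end of a cycle. *)
type colour = Marked | Unmarked | Garbage

let next_cycle_meaning = function
  | Marked   -> Unmarked  (* survivors must be proven live again next cycle *)
  | Unmarked -> Garbage   (* objects never reached are now sweepable *)
  | Garbage  -> Marked    (* the freed pattern becomes next cycle's "live" *)
```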

  11. Memory Model
    • Goal: balance comprehensibility and performance
    • Generalise the SC-DRF property
    ★ SC-DRF: data-race-free programs have sequential semantics
    ★ to local DRF (LDRF): data-race-free parts of programs have sequential semantics
    • Bounds data races in space and time
    ★ Data races on one location do not affect the sequential semantics of another
    ★ Data races in the past or the future do not affect the sequential semantics of non-racy accesses

  12. Memory Model
    • We have developed a memory model that has LDRF
    ★ Atomic and non-atomic locations (no relaxed operations yet)
    ★ Proven-correct (on paper) compilation to x86 and ARMv8
    • Is it practical?
    ★ SC has LDRF, and SRA is conjectured to have it, but neither is practical due to the performance impact
    • Must preserve load-store ordering
    ★ Most compiler optimisations remain valid (CSE, LICM)
    ✦ No redundant-store elimination across a load
    ★ Free on x86; low overhead on ARM (0.6%) and POWER (2.9%)
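
Message passing is the canonical illustration of the atomic/non-atomic split and of local DRF. The sketch below assumes an Atomic module with make/get/set along the lines of the one later standardised in OCaml, so the exact names are illustrative for this branch:

```ocaml
(* Message passing: the payload is a plain (non-atomic) ref; the flag is
   an atomic location. All accesses to [msg] are ordered through the
   atomic flag, so this fragment is data-race-free and keeps sequential
   semantics even if races exist elsewhere in the program (local DRF). *)
let msg = ref 0                (* non-atomic location *)
let flag = Atomic.make false   (* atomic location *)

let producer () =
  msg := 42;                   (* ordered before the atomic store *)
  Atomic.set flag true

let consumer () =
  while not (Atomic.get flag) do () done;
  !msg                         (* guaranteed to read 42 *)
```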

  13. Runtime Support for Effect Handlers
    • Linear delimited continuations
    ★ Linearity enforced by the runtime
    ★ An exception is raised if a continuation is resumed more than once
    ★ A finaliser discontinues continuations that are never resumed
    • Fibers: heap-managed stack segments
    ★ Require stack-overflow checks at function entry
    ★ Static analysis removes the checks in small leaf functions
    • C calls need to be performed on the C stack
    ★ < 1% performance slowdown on average for this feature
    ★ DWARF magic allows a full backtrace across nested handlers, C calls and callbacks
    • WIP: support capturing continuations that include C frames, cf. “Threads Yield Continuations”
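
What the linearity check looks like to the programmer, written in the experimental effect syntax the multicore branch used at the time (effect declarations and try … with effect …; the surface syntax changed later, so treat this as a sketch):

```ocaml
(* Experimental multicore-branch syntax of the period. *)
effect E : unit

let () =
  try perform E with
  | effect E k ->
      continue k ();  (* first resumption: allowed *)
      continue k ()   (* second resumption of a one-shot continuation:
                         the runtime raises an exception here *)
```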

  14. Status
    • Major GC and fiber implementations are stable, modulo bugs
    ★ TODO: effect system
    • Laundry list of minor features
    ★ https://github.com/ocamllabs/ocaml-multicore/projects/3
    • We need
    ★ Benchmarks
    ★ Benchmarking tools and infrastructure
    ★ Performance tuning

  15. Future Directions: Memory Model
    • The memory model currently supports only atomic and non-atomic locations
    ★ Extend it with weaker atomics and “new ref” while preserving the LDRF theorem
    • Avoid becoming C++: multiple weak atomics with subtle interactions
    ★ Could we expose restricted APIs to the programmer?
    • Verify multicore OCaml programs
    ★ Explore (semi-)automated SMT-aided verification
    ★ Challenge problem: verify the k-CAS at the heart of the Reagents library
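
To give a sense of the challenge problem, here is a hypothetical k-CAS interface in the spirit of Reagents; the names and types are illustrative, not the library's actual API. The hard part is proving that an efficient lock-free implementation of kcas really is all-or-nothing.

```ocaml
(* Hypothetical interface, for illustration only. A k-CAS atomically
   compares k locations against expected values and, if every comparison
   succeeds, updates them all; otherwise it changes nothing. *)
module type KCAS = sig
  type 'a loc                       (* a shared mutable location *)
  type cas                          (* one compare-and-set descriptor *)
  val mk_cas : 'a loc -> expected:'a -> desired:'a -> cas
  val kcas : cas list -> bool       (* all-or-nothing; true on success *)
end
```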

  16. Future Directions: Multicore MirageOS
    • MirageOS rewrite to take advantage of typed effect handlers and multicore parallelism
    ★ Typed effects for better error handling and concurrency
    • Better concurrency model over Xen block devices
    ★ Extricate ourselves from dependence on the POSIX API
    ★ Distinguish the various concurrency levels (CPU, application, I/O) in the scheduler
    ★ Failure and back-pressure as first-class operations
    • Multicore-capable Irmin, a branch-consistent database library

  17. Future Directions: Heterogeneous Systems
    • Programming heterogeneous, non-von-Neumann architectures
    ★ How do we capture the computational model in a richer type system?
    ★ How do we compile efficiently to such a system?