[…] Written without concurrency and parallelism in mind
  ✦ Cost of refactoring sequential code itself is prohibitive
• Low-latency and predictable performance
  ✦ Great for applications that require ~10ms latency
• Excellent compatibility with debugging and profiling tools
  ✦ gdb, lldb, perf, libunwind, etc.

Backwards compatibility before scalability
[…] existing code
• Performance backwards compatibility
  ✦ Existing programs run just as fast using just the same memory
• GC latency before multicore scalability
• Compatibility with program inspection tools
• Performant concurrent and parallel programming abstractions
[…] maps onto an OS thread
  ✦ Recommended to have 1 domain per core
• Low-level domain API
  ✦ Spawn & join, wait & notify
  ✦ Domain-local storage
  ✦ Atomic memory operations
    ✤ Dolan et al, “Bounding Data Races in Space and Time”, PLDI’18
• No restrictions on sharing objects between domains
  ✦ But how does it work?
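A minimal sketch of the low-level domain API listed above, assuming OCaml 5.0 or later and only the standard library. The names `counter`, `local_hits` and `work` are illustrative and not from the talk; wait & notify is not shown.

```ocaml
(* Assumes OCaml >= 5.0; standard library only. *)

(* Shared state: Atomic provides race-free read-modify-write operations. *)
let counter = Atomic.make 0

(* Domain-local storage: every domain sees its own copy, initialised to 0. *)
let local_hits : int Domain.DLS.key = Domain.DLS.new_key (fun () -> 0)

(* Work run by each domain: bump the shared counter and a private tally,
   then return the private tally. *)
let work n () =
  for _ = 1 to n do
    Atomic.incr counter;
    Domain.DLS.set local_hits (Domain.DLS.get local_hits + 1)
  done;
  Domain.DLS.get local_hits

(* Spawn two domains, then join them to collect their results. *)
let d1 = Domain.spawn (work 1000)
let d2 = Domain.spawn (work 2000)
let hits1 = Domain.join d1
let hits2 = Domain.join d2

let total = Atomic.get counter
(* The main domain never ran [work], so its own tally is untouched. *)
let main_hits = Domain.DLS.get local_hits

let () =
  Printf.printf "hits1=%d hits2=%d total=%d main_hits=%d\n"
    hits1 hits2 total main_hits
```

Spawning one domain per task, as done here for brevity, goes against the one-domain-per-core recommendation once the task count grows.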
[…] GC
• A generational, non-moving, incremental, mark-and-sweep GC

Minor Heap
• Small (2 MB default)
• Bump pointer allocation
• Survivors copied to major heap

Major Heap
[Figure: timeline from "Start of major cycle" to "End of major cycle", showing the Mutator alongside the GC phases Idle and Mark Roots (mark roots).]

• Fast allocations
• Max GC latency < 10 ms, 99th percentile latency < 1 ms
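The minor heap size quoted above can be inspected at run time through the standard `Gc` module. This is only a sketch: the value it prints depends on the platform word size and on any `OCAMLRUNPARAM` settings, and `minor_heap_bytes` is an illustrative name.

```ocaml
(* The runtime reports the minor heap size in words, not bytes. *)
let minor_heap_bytes =
  let words = (Gc.get ()).Gc.minor_heap_size in
  words * (Sys.word_size / 8)

let () =
  Printf.printf "minor heap: %d KiB\n" (minor_heap_bytes / 1024);
  (* Trigger a minor collection: surviving values move to the major heap. *)
  Gc.minor ();
  Printf.printf "minor collections so far: %d\n"
    (Gc.quick_stat ()).Gc.minor_collections
```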
[…] minor heap
  ✦ 2 global barriers / minor GC
  ✦ On 24 cores, ~10 ms pauses
[Figure: Major Heap and Minor Heap, with regions labelled Dom 0 and Dom 1 and per-domain allocation pointers (Domain 0 allocation pointer, Domain 1 allocation pointer).]
  ✦ All the marking and sweeping work done without synchronization
  ✦ 3 barriers per cycle (worst case) to agree on the end of GC phases
    ✤ 2 barriers for the two kinds of finalisers in OCaml
  ✦ ~5 ms pauses on 24 cores
[Figure: timelines for Domain 0 and Domain 1 between "Start of major cycle" and "End of major cycle", each showing Mutator, Mark Roots, Mark and Sweep; mark and sweep phases may overlap.]
[…] compiler is too low-level
• Domainslib - https://github.com/ocaml-multicore/domainslib
[Figure: Domainslib on top of Domain 0 … Domain N, offering Task Pool, Async/Await and Parallel for.]

Let’s look at examples!
    module T = Domainslib.Task

    (* Sequential Fibonacci, used below the parallel cut-off. *)
    let rec fib_seq n =
      if n < 2 then 1
      else fib_seq (n-1) + fib_seq (n-2)

    (* Parallel Fibonacci: fork both subproblems as tasks, then await them. *)
    let rec fib_par pool n =
      if n <= 40 then fib_seq n
      else
        let a = T.async pool (fun _ -> fib_par pool (n-1)) in
        let b = T.async pool (fun _ -> fib_par pool (n-2)) in
        T.await pool a + T.await pool b

    (* num_domains is not defined in this excerpt. *)
    let fib n =
      let pool = T.setup_pool ~num_domains:(num_domains - 1) in
      let res = fib_par pool n in
      T.teardown_pool pool;
      res
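    (* Sequential version; the start of this snippet is cut off in the source. *)
    […]
      for x = 0 to board_size - 1 do
        for y = 0 to board_size - 1 do
          next_board.(x).(y) <- next_cell cur_board x y
        done
      done;
      ...

    (* Parallel version: the outer loop becomes a parallel_for over rows. *)
    let next () =
      ...
      T.parallel_for pool ~start:0 ~finish:(board_size - 1)
        ~body:(fun x ->
          for y = 0 to board_size - 1 do
            next_board.(x).(y) <- next_cell cur_board x y
          done);
      ...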
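To make the idea behind a parallel for loop concrete, here is a rough sketch built directly on `Domain.spawn` and `Domain.join` (OCaml 5.0 or later assumed). `naive_parallel_for` is a hypothetical helper written for this note; it is not how Domainslib implements `parallel_for`, and it spawns fresh domains on every call, which a task pool avoids.

```ocaml
(* Hypothetical helper: split the inclusive range [start, finish] into
   contiguous chunks and run one chunk per domain. *)
let naive_parallel_for ~num_domains ~start ~finish ~body =
  let n = finish - start + 1 in
  if n > 0 then begin
    let k = max 1 (min num_domains n) in
    let chunk = (n + k - 1) / k in
    let run lo hi () = for i = lo to hi do body i done in
    (* Chunks 1 .. k-1 run on freshly spawned domains. *)
    let workers =
      List.init (k - 1) (fun j ->
        let lo = start + (j + 1) * chunk in
        let hi = min finish (lo + chunk - 1) in
        Domain.spawn (run lo hi))
    in
    (* Chunk 0 runs on the calling domain. *)
    run start (min finish (start + chunk - 1)) ();
    List.iter Domain.join workers
  end

(* Usage: every index is written by exactly one domain, so no data race. *)
let squares = Array.make 100 0

let () =
  naive_parallel_for ~num_domains:4 ~start:0 ~finish:99
    ~body:(fun i -> squares.(i) <- i * i)
```

As in the board example, this is only safe because each iteration writes to a distinct location.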
[…] programming libraries in OCaml
  ✦ Callback-oriented programming with nicer syntax
• Suffers many pitfalls of callback-oriented programming
  ✦ No backtraces, exceptions can’t be used, monadic syntax
• Go (goroutines) and GHC Haskell (threads) have better abstractions: lightweight threads

Parallelism is a performance hack whereas concurrency is a program structuring mechanism

Should we add lightweight threads to OCaml?
[…]defined effects
• Modular basis of non-local control-flow mechanisms
  ✦ Exceptions, generators, lightweight threads, promises, asynchronous IO, coroutines
• Effect declaration separate from interpretation (cf. exceptions)

    effect E : string                 (* effect declaration *)

    let comp () =                     (* computation *)
      print_string "0 ";
      print_string (perform E);       (* suspends current computation *)
      print_string "3 "

    let main () =
      try comp ()
      with effect E k ->              (* handler; k is the delimited continuation *)
        print_string "1 ";
        continue k "2 ";              (* resume suspended computation *)
        print_string "4 "
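The `effect E : string` form above is the Multicore OCaml research syntax. Assuming OCaml 5.0 or later, where effects are exposed through the standard-library `Effect` module, the same program can be sketched as follows. The buffer `out` and the helper `say` are additions for illustration, so that the output can be inspected.

```ocaml
(* Assumes OCaml >= 5.0. *)
open Effect
open Effect.Deep

(* Effect declaration: performing E produces a string. *)
type _ Effect.t += E : string Effect.t

(* Collect output in a buffer instead of printing piecemeal. *)
let out = Buffer.create 16
let say s = Buffer.add_string out s

(* Computation: suspended at [perform E] until a handler resumes it. *)
let comp () =
  say "0 ";
  say (perform E);
  say "3 "

(* Handler: [k] is the delimited continuation captured at [perform E]. *)
let main () =
  try_with comp ()
    { effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | E ->
          Some (fun (k : (a, unit) continuation) ->
            say "1 ";
            continue k "2 ";   (* resume the suspended computation *)
            say "4 ")
        | _ -> None) }

let () =
  main ();
  print_endline (Buffer.contents out)   (* prints: 0 1 2 3 4 *)
```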
    let comp () =
      print_string "0 ";
      print_string (perform E);
      print_string "3 "

    let main () =
      try comp ()
      with effect E k ->
        print_string "1 ";
        continue k "2 ";
        print_string "4 "

[Figure: animation stepping pc and sp through the program, with stack segments labelled main, parent and k; the printed output grows from 0 to 0 1 2 3 4.]

Fiber: A piece of stack + effect handler
    effect Fork : (unit -> unit) -> unit  (* cut off in the source; signature inferred from its uses below *)
    effect Yield : unit

    let run main =
      ... (* assume queue of continuations *)
      let run_next () =
        match dequeue () with
        | Some k -> continue k ()
        | None -> ()
      in
      let rec spawn f =
        match f () with
        | () -> run_next ()
        | effect Yield k -> enqueue k; run_next ()
        | effect (Fork f) k -> enqueue k; spawn f
      in
      spawn main

    let fork f = perform (Fork f)
    let yield () = perform Yield
    let main () =
      fork (fun _ -> print_endline "1.a"; yield (); print_endline "1.b");
      fork (fun _ -> print_endline "2.a"; yield (); print_endline "2.b")
    ;;
    run main

Output:

    1.a
    2.a
    1.b
    2.b

• Direct-style (no monads)
• User-code need not be aware of effects
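For readers on a released compiler, the scheduler can be sketched with the standard-library `Effect` module, assuming OCaml 5.0 or later. The queue of continuations that the slide leaves abstract is filled in here with `Queue`; `trace` and `say` are illustrative additions that record what is printed.

```ocaml
(* Assumes OCaml >= 5.0. *)
open Effect
open Effect.Deep

type _ Effect.t +=
  | Fork : (unit -> unit) -> unit Effect.t
  | Yield : unit Effect.t

let fork f = perform (Fork f)
let yield () = perform Yield

let run main =
  (* FIFO run queue of suspended threads, stored as resumption thunks. *)
  let run_q : (unit -> unit) Queue.t = Queue.create () in
  let enqueue k = Queue.push (fun () -> continue k ()) run_q in
  let run_next () =
    match Queue.take_opt run_q with
    | Some resume -> resume ()
    | None -> ()
  in
  let rec spawn f =
    match_with f ()
      { retc = (fun () -> run_next ());
        exnc = raise;
        effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Yield ->
            Some (fun (k : (a, unit) continuation) ->
              enqueue k; run_next ())
          | Fork f' ->
            Some (fun (k : (a, unit) continuation) ->
              enqueue k; spawn f')
          | _ -> None) }
  in
  spawn main

(* Record each printed line so the interleaving can be checked. *)
let trace = ref []
let say s = trace := s :: !trace; print_endline s

let main () =
  fork (fun () -> say "1.a"; yield (); say "1.b");
  fork (fun () -> say "2.a"; yield (); say "2.b")

let () = run main
```

With the FIFO queue used here, the recorded order is 1.a, 2.a, 1.b, 2.b, matching the slide. The threads themselves only call `fork` and `yield`.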
[…] yielding values
  ✦ Primitives in JavaScript and Python
  ✦ Can be derived automatically from iterator using effect handlers
• Task: traverse a complete binary tree of depth 25
  ✦ 2^26 stack switches
• Iterator: idiomatic recursive traversal
• Generator
  ✦ Hand-written generator (hw-generator)
    ✤ CPS translation + defunctionalization to remove intermediate closure allocation
  ✦ Generator using effect handlers (eh-generator)
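One way to derive a generator from an iterator with effect handlers is sketched below, assuming OCaml 5.0 or later and the standard-library `Effect` module. The tree type, `iter_tree`, `generator_of` and `drain` are illustrative names; this is not the benchmark code behind eh-generator.

```ocaml
(* Assumes OCaml >= 5.0. *)
open Effect
open Effect.Deep

type 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

(* Iterator: idiomatic recursive in-order traversal. *)
let rec iter_tree f = function
  | Leaf -> ()
  | Node (l, x, r) -> iter_tree f l; f x; iter_tree f r

(* Hypothetical helper: turn any iterator into a generator. Each call to
   the returned function yields the next element, or None when done. *)
let generator_of (type a) (iter : (a -> unit) -> unit) : unit -> a option =
  let module M = struct
    type _ Effect.t += Yield : a -> unit Effect.t
  end in
  let next = ref (fun () -> None) in
  let start () =
    match_with iter (fun x -> perform (M.Yield x))
      { retc = (fun () -> next := (fun () -> None); None);
        exnc = raise;
        effc = (fun (type b) (eff : b Effect.t) ->
          match eff with
          | M.Yield x ->
            Some (fun (k : (b, a option) continuation) ->
              (* Suspend the traversal; resume it on the next request. *)
              next := (fun () -> continue k ());
              Some x)
          | _ -> None) }
  in
  next := start;
  fun () -> !next ()

(* Pull every remaining element out of a generator, in order. *)
let rec drain gen acc =
  match gen () with
  | Some x -> drain gen (x :: acc)
  | None -> List.rev acc

let t = Node (Node (Leaf, 1, Leaf), 2, Node (Leaf, 3, Leaf))
let gen = generator_of (fun f -> iter_tree f t)
let first = gen ()
let rest = drain gen []
```

Each element costs a switch out of the traversal's stack and a switch back in, which is where the stack-switch count for the benchmark comes from.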
[…] first
2. Runtime support for effect handlers
  • No effect syntax but all the compiler and runtime bits in
3. Effect system
  a. Track user-defined effects in the type
  b. Track ambient effects (ref, IO) in the type
  c. OCaml becomes a pure language (in the Haskell sense).

    let foo () = print_string "hello, world"
    val foo : unit -[ io ]-> unit

Syntax is still in the works
[…] funding Multicore OCaml development!
• Multicore + Tezos
  ✦ Parallel Lwt preemptive tasks
  ✦ Direct-style asynchronous IO library
    ✤ Bridge the gap between Async and Lwt
  ✦ Parallelising Irmin (storage layer of Tezos)
• An end-to-end Multicore Tezos demonstrator (mid-2021)