
Retrofitting Concurrency -- Lessons from the engine room

ICFP 2022 Keynote

A new major release of the OCaml programming language is on the horizon. OCaml 5.0 brings native support for concurrency and parallelism to OCaml. While recent languages like Go and Rust have been designed with concurrency in mind, OCaml is not so fortunate. There are millions of lines of OCaml code in production, none of which was written with concurrency in mind. Extending OCaml with concurrency brings the challenge of not just maintaining backwards compatibility but also preserving the performance profile of single-threaded applications.

In this talk, I will describe the approach taken by the Multicore OCaml project that has helped deliver OCaml 5.0, focusing on what worked well and what didn’t. I hope that these lessons are useful to other researchers building programming language abstractions with the aim of retrofitting them onto industrial-strength programming languages.

KC Sivaramakrishnan

September 14, 2022

Transcript

  1. Retrofitting Concurrency Lessons from the engine room “KC”

    Sivaramakrishnan Images made with Stable Diffusion
  2. In Sep 2022… OCaml 5.0 Overlapped execution A B A

    C B Time Concurrency Parallelism Effect Handlers
  3. In Sep 2022… OCaml 5.0 Overlapped execution A B A

    C B Time Simultaneous execution A B C Time Concurrency Parallelism Effect Handlers Domains
  4. Eliminate the runtime lock OCaml OCaml OCaml Simultaneous execution A

    B C Time Parallelism Domains GIL Sam Gross, Meta, “Multithreaded Python without the GIL”
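A minimal sketch of simultaneous execution with domains, assuming OCaml 5.0 or later. `Domain.spawn` and `Domain.join` are standard-library calls; `fib` and `parallel_fibs` are names invented here for illustration and do not come from the talk.

```ocaml
(* Requires OCaml >= 5.0. [fib] and [parallel_fibs] are illustrative names. *)

(* Naive Fibonacci: deliberately CPU-bound so each domain has real work. *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

(* Run [fib] on each input in its own domain. With no runtime lock, the
   domains can execute simultaneously when enough cores are available. *)
let parallel_fibs inputs =
  inputs
  |> List.map (fun n -> Domain.spawn (fun () -> fib n))
  |> List.map Domain.join

let () =
  parallel_fibs [ 20; 21; 22 ]
  |> List.iter (fun r -> Printf.printf "%d\n" r)
(* prints 6765, 10946 and 17711, one per line *)
```

Spawning one domain per task is only reasonable for a handful of tasks; domains are heavyweight and are usually kept at or below the number of cores.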
  5. Retrofitting Challenges ~> Approach • Millions of lines

    of legacy software ✦ Most code likely to remain sequential even with multicore • Cost of refactoring is prohibitive
  6. Retrofitting Challenges ~> Approach • Millions of lines

    of legacy software ✦ Most code likely to remain sequential even with multicore • Cost of refactoring is prohibitive Do not break existing code!
  7. Retrofitting Challenges ~> Approach • Low latency and

    predictable performance ✦ Great for ~10ms tolerance
  8. Retrofitting Challenges ~> Approach • Low latency and

    predictable performance ✦ Great for ~10ms tolerance Optimise for GC latency before scalability
  9. Retrofitting Challenges ~> Approach • OCaml core team

    is composed of volunteers ✦ Aim to reduce complexity and maintenance burden
  10. Retrofitting Challenges ~> Approach • OCaml core team

    is composed of volunteers ✦ Aim to reduce complexity and maintenance burden No separate sequential and parallel runtimes Unlike -threaded runtime
  11. Retrofitting Challenges ~> Approach • OCaml core team

    is composed of volunteers ✦ Aim to reduce complexity and maintenance burden No separate sequential and parallel runtimes ⇒ Existing sequential programs run just as fast using just as much memory Unlike -threaded runtime
  12. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 Medieval garbage truck according to Stable Diffusion
  13. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 Medieval garbage truck according to Stable Diffusion
  14. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 Medieval garbage truck according to Stable Diffusion
  15. Access remote objects Domain 0 Domain 1 X Y let

    r = !x Major heap Minor heaps
  16. Access remote objects Domain 0 Domain 1 X Y let

    r = !x promote(y) Major heap Minor heaps
  17. Access remote objects Domain 0 Domain 1 X Y promote(y)

    y Major heap Minor heaps let r = !x
  18. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 Mostly concurrent concurrent Medieval garbage truck according to Stable Diffusion
  19. Parallel Allocator & GC POPL ‘93 JFP ‘14 Intel Single-chip

    Cloud Computer (SCC) Major Heap Minor Heap Minor Heap
  20. Parallel Allocator & GC • Excellent scalability on 128-cores ✦

    Also maintains low latency on large core counts • Mostly retains sequential latency, throughput and memory usage characteristics Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2
  21. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 But …
  22. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 But … Read barrier!
  23. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 But … Read barrier! • Read barrier ✦ Only a branch on the OCaml side for reads ✦ Reads are now GC safe points ✦ Breaks the C FFI invariants about when GC may be performed
  24. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 But … Read barrier! • Read barrier ✦ Only a branch on the OCaml side for reads ✦ Reads are now GC safe points ✦ Breaks the C FFI invariants about when GC may be performed • No push-button fix! ✦ Lots of packages in the ecosystem broke.
  25. Back to the drawing board (~2019) Major Heap Minor Heap

    Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Mostly concurrent Stop-the-world parallel
  26. Back to the drawing board (~2019) Major Heap Minor Heap

    Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Mostly concurrent Stop-the-world parallel Bringing 128 domains to a stop is surprisingly fast
  27. Data Races • Data Race: When two threads perform unsynchronised

    access and at least one is a write. ✦ Non-SC behaviour due to compiler optimisations and relaxed hardware.
  28. Data Races • Data Race: When two threads perform unsynchronised

    access and at least one is a write. ✦ Non-SC behaviour due to compiler optimisations and relaxed hardware. • Enforcing SC behaviour slows down sequential programs! ✦ 85% on ARM64, 41% on PowerPC
  29. Data Races • Data Race: When two threads perform unsynchronised

    access and at least one is a write. ✦ Non-SC behaviour due to compiler optimisations and relaxed hardware. • Enforcing SC behaviour slows down sequential programs! ✦ 85% on ARM64, 41% on PowerPC OCaml needed a relaxed memory model
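The definition above can be sketched in a few lines, assuming OCaml 5.0 or later. The function names and the iteration count are invented for illustration. The first counter is a plain `ref` updated by two domains without synchronisation, which is a data race; the second uses the standard-library `Atomic` module and is race-free.

```ocaml
(* Requires OCaml >= 5.0. Names and the value of [n] are illustrative. *)
let n = 100_000

(* Racy: two domains perform unsynchronised read-modify-write on a plain
   ref. Increments can be lost, so the result may be less than 2 * n. *)
let racy_count () =
  let c = ref 0 in
  let work () = for _ = 1 to n do incr c done in
  let d = Domain.spawn work in
  work ();
  Domain.join d;
  !c

(* Race-free: the same counter stored in an atomic location. *)
let atomic_count () =
  let c = Atomic.make 0 in
  let work () = for _ = 1 to n do Atomic.incr c done in
  let d = Domain.spawn work in
  work ();
  Domain.join d;
  Atomic.get c

let () =
  Printf.printf "racy:   %d (varies from run to run)\n" (racy_count ());
  Printf.printf "atomic: %d\n" (atomic_count ())
```

The racy version still returns an integer and does not crash; it only loses updates. The atomic version always returns `2 * n`.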
  30. Second-mover Advantage • Learn from the other language memory models

    • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong
  31. Second-mover Advantage • Learn from the other language memory models

    • DRF-SC + no crash under data races ✦ But scope of race is not limited in time • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong
  32. Second-mover Advantage • Learn from the other language memory models

    • DRF-SC + no crash under data races ✦ But scope of race is not limited in time • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong • No data races by construction ✦ Unsafe code memory model is ~C++11
  33. Second-mover Advantage • Learn from the other language memory models

    • DRF-SC + no crash under data races ✦ But scope of race is not limited in time Advantage: No Multicore OCaml programs in the wild! • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong • No data races by construction ✦ Unsafe code memory model is ~C++11
  34. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust.
  35. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust. 1.19
  36. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust. 1.19
  37. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust. • Key innovation: Local data race freedom ✦ Permits compositional reasoning 1.19
  38. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust. • Key innovation: Local data race freedom ✦ Permits compositional reasoning • Performance impact ✦ Free on x86 and < 1% on ARM 1.19
  39. • Simple (comprehensible!) operational memory model ✦ Only atomic and

    non-atomic locations ✦ No “out of thin air” values • Interested in extracting f i
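To make the split between atomic and non-atomic locations concrete, here is a small message-passing sketch, assuming OCaml 5.0 or later. The function name and the value 42 are invented for illustration. A plain `ref` carries the data and an `Atomic.t` flag publishes it; because the only cross-domain communication goes through the atomic flag, the program is data-race-free and can be reasoned about as if it were sequentially consistent.

```ocaml
(* Requires OCaml >= 5.0. [message_passing] is an illustrative name. *)
let message_passing () =
  let data = ref 0 in                (* non-atomic location *)
  let ready = Atomic.make false in   (* atomic location *)
  let reader =
    Domain.spawn (fun () ->
      (* Spin until the writer sets the flag. *)
      while not (Atomic.get ready) do
        Domain.cpu_relax ()
      done;
      (* The atomic read that observed [true] orders this plain read
         after the writer's plain write, so there is no race on [data]. *)
      !data)
  in
  data := 42;              (* plain write ... *)
  Atomic.set ready true;   (* ... published by the atomic write *)
  Domain.join reader

let () = Printf.printf "%d\n" (message_passing ())   (* prints 42 *)
```

If `ready` were a plain `bool ref` instead, both locations would be racy and the reader would no longer be guaranteed to see 42.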
  40. Concurrency (~2015) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time
  41. Concurrency (~2015) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Async Lwt >>=
  42. Concurrency (~2015) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Async Lwt >>= Synchronous Asynchronous Normal calls Special calling convention
  43. Concurrency (~2015) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Async Lwt >>= Synchronous Asynchronous Normal calls Special calling convention Eliminate function colours with native concurrency support — Bob Nystrom Overlapped execution A B A C B Time
  44. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library
  45. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library
  46. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library Maintenance Burden C and not Haskell Lack of flexibility
  47. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ M:N scheduling Overlapped execution A B A C B Time Runtime System Language Maintenance Burden C and not Haskell Lack of flexibility
  48. Language & Runtime System Library Scheduler Blackholing (lazy evaluation) Hard

    to undo adding a feature into the RTS Concurrency
  49. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library First-class continuations!
  50. Ease of comprehension • Effect handler ~= Resumable exceptions +

    computation may be resumed later effect E : string let comp () = print_string (perform E) let main () = try comp () with effect E k -> continue k "Handled" exception E let comp () = print_string (raise E) let main () = try comp () with E -> print_string "Raised" Exception Effect handler delimited continuation
  51. Ease of comprehension • Effect handler ~= Resumable exceptions +

    computation may be resumed later • Easier than shift/reset, control/prompt ✦ No prompts or answer-type polymorphism effect E : string let comp () = print_string (perform E) let main () = try comp () with effect E k -> continue k "Handled" exception E let comp () = print_string (raise E) let main () = try comp () with E -> print_string "Raised" Exception Effect handler delimited continuation
  52. Ease of comprehension • Effect handler ~= Resumable exceptions +

    computation may be resumed later • Easier than shift/reset, control/prompt ✦ No prompts or answer-type polymorphism Effect handlers : shift/reset :: while : goto effect E : string let comp () = print_string (perform E) let main () = try comp () with effect E k -> continue k "Handled" exception E let comp () = print_string (raise E) let main () = try comp () with E -> print_string "Raised" Exception Effect handler delimited continuation
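The slide code uses the dedicated `effect` syntax from the Multicore OCaml prototype. OCaml 5.0 itself shipped effects without syntax, through the standard-library `Effect` module. A rough equivalent of the two snippets in that style might look like the sketch below. The names `Ex`, `comp_exn`, `comp_eff`, `main_exn` and `main_eff` are invented here, and the functions return strings rather than printing so that the difference is easy to check.

```ocaml
(* Requires OCaml >= 5.0 (Effect module in the standard library). *)
open Effect
open Effect.Deep

(* An effect E; performing it yields a string chosen by the handler. *)
type _ Effect.t += E : string Effect.t

(* Exception version: raising abandons the rest of the computation. *)
exception Ex

let comp_exn () : string = "got " ^ raise Ex
let main_exn () = try comp_exn () with Ex -> "Raised"

(* Effect version: the handler receives the delimited continuation [k]
   and resumes the computation where it left off. *)
let comp_eff () : string = "got " ^ perform E

let main_eff () =
  try_with comp_eff ()
    { effc =
        (fun (type a) (eff : a Effect.t) ->
          match eff with
          | E -> Some (fun (k : (a, _) continuation) -> continue k "Handled")
          | _ -> None) }

let () =
  print_endline (main_exn ());   (* prints "Raised" *)
  print_endline (main_eff ())    (* prints "got Handled" *)
```

In the exception case the `"got " ^ ...` concatenation never completes. In the effect case it does, because `continue k "Handled"` returns control to the point of `perform E`.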
  53. Retrofitting Effect Handlers • Don’t break existing code

    ⇒ No effect system ✦ No syntax and just functions
  54. Retrofitting Effect Handlers • Don’t break existing code

    ⇒ No effect system ✦ No syntax and just functions • Focus on preserving ✦ Performance of legacy code (< 1% impact) ✦ Compatibility of tools — gdb, perf
  55. Retrofitting Effect Handlers • Don’t break existing code

    ⇒ No effect system • Focus on preserving ✦ Performance of legacy code (< 1% impact) ✦ Compatibility of tools — gdb, perf PLDI ‘21
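Because effects are exposed as plain functions, language-level threads can be built as an ordinary library rather than inside the runtime system. The following is a minimal round-robin scheduler sketch, assuming OCaml 5.0 or later. `Yield`, `Fork`, `run` and `trace` are hypothetical names chosen for this example; this is not the API of Eio or of any other library.

```ocaml
(* Requires OCaml >= 5.0. All names below are illustrative only. *)
open Effect
open Effect.Deep

type _ Effect.t +=
  | Yield : unit Effect.t                    (* let other threads run *)
  | Fork : (unit -> unit) -> unit Effect.t   (* start a new thread *)

let yield () = perform Yield
let fork f = perform (Fork f)

(* Round-robin scheduler: suspended threads wait in a FIFO queue. *)
let run (main : unit -> unit) : unit =
  let ready : (unit -> unit) Queue.t = Queue.create () in
  let next () =
    match Queue.take_opt ready with Some t -> t () | None -> ()
  in
  let rec spawn (f : unit -> unit) : unit =
    match_with f ()
      { retc = (fun () -> next ());   (* thread finished: run the next one *)
        exnc = raise;
        effc =
          (fun (type a) (eff : a Effect.t) ->
            match eff with
            | Yield ->
                Some (fun (k : (a, unit) continuation) ->
                  Queue.push (fun () -> continue k ()) ready;
                  next ())
            | Fork f ->
                Some (fun (k : (a, unit) continuation) ->
                  Queue.push (fun () -> continue k ()) ready;
                  spawn f)
            | _ -> None) }
  in
  spawn main

(* Threads are written in direct style: plain calls, no monadic bind. *)
let trace () =
  let log = ref [] in
  let worker name () =
    for i = 1 to 2 do
      log := Printf.sprintf "%s%d" name i :: !log;
      yield ()
    done
  in
  run (fun () -> fork (worker "a"); fork (worker "b"));
  List.rev !log

let () = print_endline (String.concat " " (trace ()))
(* prints "a1 b1 a2 b2" *)
```

The `worker` function contains no special calling convention: `yield ()` is a normal function call, which is the sense in which native support removes function colours.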
  56. Eio — Direct-style effect-based concurrency HTTP server performance using 24

    cores HTTP server scaling maintaining a constant load of 1.5 million requests per second
  57. Concurrency (~2022) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time
  58. Concurrency (~2022) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time
  59. Concurrency (~2022) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time ~2 days ago 🎉
  60. Care for Users • Transition to the new version should

    be a no-op or push-button solution ✦ Most code likely to remain sequential
  61. Care for Users • Transition to the new version should

    be a no-op or push-button solution ✦ Most code likely to remain sequential • Build tools to ease the transition OPAM Health Check: http://check.ocamllabs.io/
  62. Benchmarking Rigorously, Continuously on Real programs • OCaml users don’t

    just run synthetic benchmarks • Sandmark — Real-world programs picked from wild ✦ Coq ✦ Menhir (parser-generator) ✦ Alt-ergo (solver) ✦ Irmin (database) … and their large set of OPAM dependencies
  63. Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14

    = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant?
  64. Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14

    = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant? • Modern OS, arch, micro-arch effects become significant at small scales ✦ 20% speedup by inserting fences
  65. Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14

    = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant? • Modern OS, arch, micro-arch effects become significant at small scales ✦ 20% speedup by inserting fences • Tune the machine to remove noise
  66. Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14

    = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant? • Modern OS, arch, micro-arch effects become significant at small scales ✦ 20% speedup by inserting fences • Tune the machine to remove noise • Useful to measure instructions retired along with real time
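One common way to approach the significance question is to repeat each measurement several times and compare the two samples with Welch's t statistic. The talk does not say which test the project uses, so the sketch below is only one possible approach, not a description of Sandmark. The timings in it are invented for illustration and are not measurements from the talk.

```ocaml
(* A sketch only: Welch's t statistic over repeated timings. *)

let mean xs = List.fold_left ( +. ) 0.0 xs /. float_of_int (List.length xs)

(* Unbiased sample variance (divides by n - 1); needs at least 2 samples. *)
let variance xs =
  let m = mean xs in
  let ss =
    List.fold_left (fun acc x -> let d = x -. m in acc +. (d *. d)) 0.0 xs
  in
  ss /. float_of_int (List.length xs - 1)

(* Welch's t: difference of means divided by the combined standard error.
   It does not assume the two samples have equal variance. *)
let welch_t a b =
  let se2 xs = variance xs /. float_of_int (List.length xs) in
  (mean a -. mean b) /. sqrt (se2 a +. se2 b)

let () =
  (* Invented wall-clock timings in seconds, for illustration only. *)
  let ocaml_4_14 = [ 19.1; 18.9; 19.0; 19.2; 18.8 ] in
  let ocaml_5_0 = [ 18.1; 17.9; 18.0; 18.2; 17.8 ] in
  Printf.printf "t = %.2f\n" (welch_t ocaml_4_14 ocaml_5_0)
```

A large absolute value of t relative to the run-to-run spread suggests the difference is unlikely to be noise. A single pair of runs, such as 19s versus 18s, gives no variance estimate at all, so it cannot answer the question by itself.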
  67. Invest in tooling Reuse existing tools; if not build them!

    • rr = gdb + record-and-replay debugging
  68. Invest in tooling Reuse existing tools; if not build them!

    • rr = gdb + record-and-replay debugging • OCaml 5 + ThreadSanitizer ✦ Detect data races dynamically
  69. Convincing caml-devel • Quite a challenge maintaining a separate fork

    for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better
  70. Convincing caml-devel • Quite a challenge maintaining a separate fork

    for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better • Peer-reviewed papers add credibility to the effort
  71. Convincing caml-devel • Quite a challenge maintaining a separate fork

    for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better • Peer-reviewed papers add credibility to the effort • Open-source and actively-maintained always ✦ Lots of (academic) users from early days
  72. Convincing caml-devel • Quite a challenge maintaining a separate fork

    for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better • Peer-reviewed papers add credibility to the effort • Open-source and actively-maintained always ✦ Lots of (academic) users from early days • Continuous benchmarking, OPAM health check
  73. Growing the language OCaml 4 OCaml 5 A few researchers

    Lots of Engineers time Independent Contributors
  74. Where do we go from here? OCaml 5.0 Effect System

    Backwards compatibility, polymorphism, modularity & generativity
  75. Where do we go from here? OCaml 5.0 Effect System

    JavaScript target Modal Types + Lexically scoped typed effect handlers Untyped effects
  76. Where do we go from here? OCaml 5.0 Effect System

    JavaScript target Modal Types Unboxed Types Flambda2 Parallelism Control memory layout Avoid heap allocations Aggressive compiler optimisations Rust/C-like performance (on demand), with GC as default, and the ergonomics and safety of classic ML
  77. Where do we go from here? OCaml 5.0 Effect System

    JavaScript target Stack allocation Unboxed Types Flambda2 Parallelism Control memory layout Avoid heap allocations Aggressive compiler optimisations Rust/C-like performance (on demand), with GC as default, and the ergonomics and safety of classic ML