Retrofitting Concurrency -- Lessons from the engine room

ICFP 2022 Keynote

A new major release of the OCaml programming language is on the horizon. OCaml 5.0 brings native support for concurrency and parallelism to OCaml. While recent languages like Go and Rust have been designed with concurrency in mind, OCaml is not so fortunate. There are millions of lines of OCaml code in production, none of which was written with concurrency in mind. Extending OCaml with concurrency brings the challenge of not just maintaining backwards compatibility but also preserving the performance profile of single-threaded applications.

In this talk, I will describe the approach taken by the Multicore OCaml project that has helped deliver OCaml 5.0, focusing on what worked well and what didn’t. I hope that these lessons are useful to other researchers building programming language abstractions with the aim to retrofit them onto industrial-strength programming languages.

KC Sivaramakrishnan

September 14, 2022

Transcript

  1. Retrofitting Concurrency Lessons from the engine room “KC”

    Sivaramakrishnan Images made with Stable Diffusion
  2. In Sep 2022… OCaml 5.0

  3. In Sep 2022… OCaml 5.0 Concurrency Parallelism

  4. In Sep 2022… OCaml 5.0 Overlapped execution A B A

    C B Time Concurrency Parallelism Effect Handlers
  5. In Sep 2022… OCaml 5.0 Overlapped execution A B A

    C B Time Simultaneous execution A B C Time Concurrency Parallelism Effect Handlers Domains
  6. In this talk… OCaml 5.0 OCaml 4.x Multicore OCaml

  7. Backwards Compatibility Data Races Implementation Complexity Performance Stability In this

    talk… OCaml 5.0 OCaml 4.x
  8. Journey Takeaways

  9. In the year 2014… 18-year-old, industrial-strength, pragmatic, functional programming

    language
  10. In the year 2014… 18-year-old, industrial-strength, pragmatic, functional programming

    language Industry Projects
  11. In the year 2014… 18-year-old, industrial-strength, pragmatic, functional programming

    language Industry Projects No multicore support!
  12. Runtime lock OCaml C C C

  13. Runtime lock OCaml C C C GIL

  14. Eliminate the runtime lock OCaml OCaml OCaml Simultaneous execution A

    B C Time Parallelism Domains
  15. Eliminate the runtime lock OCaml OCaml OCaml Simultaneous execution A

    B C Time Parallelism Domains GIL Sam Gross, Meta, “Multithreaded Python without the GIL”
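The slides name domains as the unit of parallelism once the runtime lock is gone. As a minimal sketch, assuming the OCaml 5 standard library's `Domain.spawn` and `Domain.join`, two domains can each sum half of a range simultaneously. The helper names `sum_range` and `parallel_sum` are made up for this illustration and do not come from the talk.

```ocaml
(* Sum the integers in [lo, hi) sequentially. Illustrative helper. *)
let sum_range lo hi =
  let acc = ref 0 in
  for i = lo to hi - 1 do
    acc := !acc + i
  done;
  !acc

(* Split the range into two halves; each half runs on its own domain,
   so the two loops can execute simultaneously on different cores. *)
let parallel_sum n =
  let mid = n / 2 in
  let left = Domain.spawn (fun () -> sum_range 0 mid) in
  let right = Domain.spawn (fun () -> sum_range mid n) in
  Domain.join left + Domain.join right

let () = Printf.printf "sum = %d\n" (parallel_sum 1_000_000)
```

Each domain writes only to its own local accumulator, so the sketch has no shared mutable state and no data race.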
  16. Retrofitting Challenges ~> Approach • Millions of lines

    of legacy software ✦ Most code likely to remain sequential even with multicore • Cost of refactoring is prohibitive
  17. Retrofitting Challenges ~> Approach • Millions of lines

    of legacy software ✦ Most code likely to remain sequential even with multicore • Cost of refactoring is prohibitive Do not break existing code!
  18. Retrofitting Challenges ~> Approach • Low latency and

    predictable performance ✦ Great for ~10ms tolerance
  19. Retrofitting Challenges ~> Approach • Low latency and

    predictable performance ✦ Great for ~10ms tolerance Optimise for GC latency before scalability
  20. Retrofitting Challenges ~> Approach • OCaml core team

    is composed of volunteers ✦ Aim to reduce complexity and maintenance burden
  21. Retrofitting Challenges ~> Approach • OCaml core team

    is composed of volunteers ✦ Aim to reduce complexity and maintenance burden No separate sequential and parallel runtimes Unlike -threaded runtime
  22. Retrofitting Challenges ~> Approach • OCaml core team

    is composed of volunteers ✦ Aim to reduce complexity and maintenance burden No separate sequential and parallel runtimes ⇒ Existing sequential programs run just as fast using just as much memory Unlike -threaded runtime
  23. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 Medieval garbage truck according to Stable Diffusion
  24. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 Medieval garbage truck according to Stable Diffusion
  25. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 Medieval garbage truck according to Stable Diffusion
  26. Access remote objects Domain 0 Domain 1 X Y let

    r = !x Major heap Minor heaps
  27. Access remote objects Domain 0 Domain 1 X Y let

    r = !x promote(y) Major heap Minor heaps
  28. Access remote objects Domain 0 Domain 1 X Y promote(y)

    y Major heap Minor heaps let r = !x
  29. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 Mostly concurrent concurrent Medieval garbage truck according to Stable Diffusion
  30. Parallel Allocator & GC Major Heap Minor

  31. Parallel Allocator & GC POPL ‘93 Major Heap Minor Heap

    Minor Heap
  32. Parallel Allocator & GC ISMM ‘11 Major Heap Minor Heap

    Minor Heap
  33. Parallel Allocator & GC POPL ‘93 JFP ‘14 Intel Single-chip

    Cloud Computer (SCC) Major Heap Minor Heap Minor Heap
  34. Parallel Allocator & GC PPoPP ‘18 H1 H2 H3 H4

    H5 disentanglement MaPLe
  35. Parallel Allocator & GC JFP ‘14 PPoPP ‘18 ICFP ‘22

  36. JFP ‘14 Parallel Allocator & GC POPL ‘93 ISMM ‘11

    JFP ‘14 PPoPP ‘18
  37. Parallel Allocator & GC • Excellent scalability on 128 cores ✦

    Also maintains low latency on large core counts • Mostly retains sequential latency, throughput and memory usage characteristics Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2
  38. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 But …
  39. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 But … Read barrier!
  40. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 But … Read barrier! • Read barrier ✦ Only a branch on the OCaml side for reads ✦ Reads are now GC safe points ✦ Breaks the C FFI invariants about when GC may be performed
  41. Parallel Allocator & GC Major Heap Minor Heap Minor Heap

    Minor Heap Domain 0 Domain 1 Domain 2 But … Read barrier! • Read barrier ✦ Only a branch on the OCaml side for reads ✦ Reads are now GC safe points ✦ Breaks the C FFI invariants about when GC may be performed • No push-button fix! ✦ Lots of packages in the ecosystem broke.
  42. Back to the drawing board (~2019) Major Heap Minor Heap

    Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Mostly concurrent Stop-the-world parallel
  43. Back to the drawing board (~2019) Major Heap Minor Heap

    Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Mostly concurrent Stop-the-world parallel Bringing 128 domains to a stop is surprisingly fast
  44. Back to the drawing board (~2019) Major Heap Minor

  45. Data Races • Data Race: When two threads perform unsynchronised

    access and at least one is a write. ✦ Non-SC behaviour due to compiler optimisations and relaxed hardware.
  46. Data Races • Data Race: When two threads perform unsynchronised

    access and at least one is a write. ✦ Non-SC behaviour due to compiler optimisations and relaxed hardware. • Enforcing SC behaviour slows down sequential programs! ✦ 85% on ARM64, 41% on PowerPC
  47. Data Races • Data Race: When two threads perform unsynchronised

    access and at least one is a write. ✦ Non-SC behaviour due to compiler optimisations and relaxed hardware. • Enforcing SC behaviour slows down sequential programs! ✦ 85% on ARM64, 41% on PowerPC OCaml needed a relaxed memory model
  48. Second-mover Advantage • Learn from the other language memory models

  49. Second-mover Advantage • Learn from the other language memory models

    • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong
  50. Second-mover Advantage • Learn from the other language memory models

    • DRF-SC + no crash under data races ✦ But scope of race is not limited in time • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong
  51. Second-mover Advantage • Learn from the other language memory models

    • DRF-SC + no crash under data races ✦ But scope of race is not limited in time • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong • No data races by construction ✦ Unsafe code memory model is ~C++11
  52. Second-mover Advantage • Learn from the other language memory models

    • DRF-SC + no crash under data races ✦ But scope of race is not limited in time Advantage: No Multicore OCaml programs in the wild! • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong • No data races by construction ✦ Unsafe code memory model is ~C++11
  53. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ Squeeze at most perf ⇒ write that module in C, C++ or Rust.
  54. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ Squeeze at most perf ⇒ write that module in C, C++ or Rust. 1.19
  55. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ Squeeze at most perf ⇒ write that module in C, C++ or Rust. 1.19
  56. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ Squeeze at most perf ⇒ write that module in C, C++ or Rust. • Key innovation: Local data race freedom ✦ Permits compositional reasoning 1.19
  57. OCaml memory model (~2017) • Simple (comprehensible!) operational memory model

    ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ Squeeze at most perf ⇒ write that module in C, C++ or Rust. • Key innovation: Local data race freedom ✦ Permits compositional reasoning • Performance impact ✦ Free on x86 and < 1% on ARM 1.19
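The split between atomic and non-atomic locations can be sketched with the standard library's `Atomic` module. The example below is an illustration under assumptions, not code from the talk: the names `increments`, `count_atomic` and `count_racy` are invented here. The first function synchronises every access through an atomic location, so it is data-race free and DRF-SC applies. The second races on a plain `ref`: under the model described on the slides it should not crash, but increments may be lost.

```ocaml
let increments = 100_000

(* Atomic location: every access is synchronised, so the program is
   data-race free and behaves sequentially consistently (DRF-SC). *)
let count_atomic () =
  let counter = Atomic.make 0 in
  let work () =
    for _i = 1 to increments do
      Atomic.incr counter
    done
  in
  let d1 = Domain.spawn work in
  let d2 = Domain.spawn work in
  Domain.join d1;
  Domain.join d2;
  Atomic.get counter

(* Non-atomic location: the unsynchronised read-modify-write below is a
   data race. The program stays memory safe, but updates can be lost,
   so the final value may be smaller than 2 * increments. *)
let count_racy () =
  let counter = ref 0 in
  let work () =
    for _i = 1 to increments do
      counter := !counter + 1
    done
  in
  let d1 = Domain.spawn work in
  let d2 = Domain.spawn work in
  Domain.join d1;
  Domain.join d2;
  !counter

let () =
  Printf.printf "atomic: %d, racy: %d\n" (count_atomic ()) (count_racy ())
```

Only the atomic count is deterministic; the racy count can differ between runs and machines.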
  58. • Simple (comprehensible!) operational memory model ✦ Only atomic and

    non-atomic locations ✦ No “out of thin air” values • Interested in extracting f i
  59. Concurrency (~2015) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time
  60. Concurrency (~2015) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Async Lwt >>=
  61. Concurrency (~2015) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Async Lwt >>= Synchronous Asynchronous Normal calls Special calling convention
  62. Concurrency (~2015) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Async Lwt >>= Synchronous Asynchronous Normal calls Special calling convention Eliminate function colours with native concurrency support — Bob Nystrom Overlapped execution A B A C B Time
  63. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library
  64. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library
  65. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library Maintenance Burden C and not Haskell Lack of flexibility
  66. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ M:N scheduling Overlapped execution A B A C B Time Runtime System Language Maintenance Burden C and not Haskell Lack of fl
  67. Language & Runtime System Library Scheduler Blackholing (lazy evaluation) Concurrency

  68. Language & Runtime System Library Scheduler Blackholing (lazy evaluation) Hard

    to undo adding a feature into the RTS Concurrency
  69. Concurrency • Parallelism is a resource; concurrency is a programming

    abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library First-class continuations!
  70. How to continue? PLDI ‘96 call/1cc Chez Scheme

  71. call/1cc

  72. call/1cc

  73. call/1cc

  74. Ease of comprehension • Effect handler ~= Resumable exceptions +

    computation may be resumed later Effect handler (k: delimited continuation): effect E : string let comp () = print_string (perform E) let main () = try comp () with effect E k -> continue k "Handled" Exception: exception E let comp () = print_string (raise E) let main () = try comp () with E -> print_string "Raised"
  75. Ease of comprehension • Effect handler ~= Resumable exceptions +

    computation may be resumed later • Easier than shift/reset, control/prompt ✦ No prompts or answer-type polymorphism Effect handler (k: delimited continuation): effect E : string let comp () = print_string (perform E) let main () = try comp () with effect E k -> continue k "Handled" Exception: exception E let comp () = print_string (raise E) let main () = try comp () with E -> print_string "Raised"
  76. Ease of comprehension • Effect handler ~= Resumable exceptions +

    computation may be resumed later • Easier than shift/reset, control/prompt ✦ No prompts or answer-type polymorphism Effect handlers : shift/reset :: while : goto Effect handler (k: delimited continuation): effect E : string let comp () = print_string (perform E) let main () = try comp () with effect E k -> continue k "Handled" Exception: exception E let comp () = print_string (raise E) let main () = try comp () with E -> print_string "Raised"
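The slide code uses the dedicated `effect` syntax of the Multicore OCaml prototype. OCaml 5.0 exposes the same idea through plain functions in the standard library's `Effect` module, so a rough equivalent of the slide's effect-handler example might look like the sketch below. The wrapper name `handle_e` is an assumption added here so the handler can be reused; it is not part of the talk.

```ocaml
open Effect
open Effect.Deep

(* Declare an effect E; performing it yields a string. *)
type _ Effect.t += E : string Effect.t

(* The computation performs E and prints whatever it is resumed with. *)
let comp () = print_string (perform E)

(* Run f under a handler that resumes every E with "Handled".
   handle_e is an illustrative name, not a standard library function. *)
let handle_e (f : unit -> 'r) : 'r =
  try_with f ()
    { effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | E -> Some (fun (k : (a, _) continuation) -> continue k "Handled")
        | _ -> None) }

let main () = handle_e comp

let () = main ()
```

Unlike the exception version, control returns to the point of `perform E`, so `print_string` receives the string supplied by the handler.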
  77. OCaml ‘15 How to continue? One-shot delimited continuations exposed through

    effect handlers
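"One-shot" means a captured continuation may be resumed at most once. A small sketch, assuming the OCaml 5 `Effect` module and its `Continuation_already_resumed` exception, shows what happens when a handler tries to resume the same continuation twice. The effect `Choose` and the function `resume_twice` are hypothetical names used only for this illustration.

```ocaml
open Effect
open Effect.Deep

(* An effect whose result is a boolean chosen by the handler. *)
type _ Effect.t += Choose : bool Effect.t

(* The handler resumes the continuation once, then tries a second time.
   The runtime rejects the second attempt by raising
   Continuation_already_resumed. *)
let resume_twice () =
  try_with
    (fun () -> if perform Choose then "true branch" else "false branch")
    ()
    { effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Choose ->
          Some (fun (k : (a, _) continuation) ->
            let first = continue k true in
            match continue k false with
            | second -> first ^ " / " ^ second
            | exception Continuation_already_resumed ->
              first ^ " / second resume rejected")
        | _ -> None) }

let () = print_endline (resume_twice ())
```

The first resumption runs the computation to completion; the second is refused rather than re-running it, which is the restriction that distinguishes one-shot from multi-shot continuations.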
  78. Ease of comprehension ~> Impact

  79. Ease of comprehension ~> Impact

  80. Retrofitting Effect Handlers • Don’t break existing code

    ⇒ No effect system ✦ No syntax and just functions
  81. Retrofitting Effect Handlers • Don’t break existing code

    ⇒ No effect system ✦ No syntax and just functions • Focus on preserving ✦ Performance of legacy code (< 1% impact) ✦ Compatibility of tools — gdb, perf
  82. Retrofitting Effect Handlers • Don’t break existing code

    ⇒ No effect system • Focus on preserving ✦ Performance of legacy code (< 1% impact) ✦ Compatibility of tools — gdb, perf PLDI ‘21
  83. Eio — Direct-style effect-based concurrency HTTP server performance using 24

    cores HTTP server scaling maintaining a constant load of 1.5 million requests per second
  84. Concurrency (~2022) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time
  85. Concurrency (~2022) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time
  86. Concurrency (~2022) • Parallelism is a resource; concurrency is a

    programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time ~2 days ago 🎉
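Language-level threads can be written as an ordinary library on top of effect handlers. The following is a minimal sketch of a cooperative round-robin scheduler, assuming the OCaml 5 `Effect` module; the effects `Fork` and `Yield`, the `run` function and the `trace` log are illustrative names, and this is not how Eio or any production scheduler is implemented.

```ocaml
open Effect
open Effect.Deep

type _ Effect.t +=
  | Fork : (unit -> unit) -> unit Effect.t  (* start a new thread *)
  | Yield : unit Effect.t                   (* give up the processor *)

let fork f = perform (Fork f)
let yield () = perform Yield

(* Run main and every thread it forks until all of them finish. *)
let run (main : unit -> unit) : unit =
  let run_q : (unit -> unit) Queue.t = Queue.create () in
  (* Suspend a thread by queueing a closure that resumes it. *)
  let enqueue k = Queue.push (fun () -> continue k ()) run_q in
  (* Resume the next runnable thread, if any. *)
  let dequeue () =
    match Queue.take_opt run_q with
    | Some task -> task ()
    | None -> ()
  in
  let rec spawn (f : unit -> unit) : unit =
    match_with f ()
      { retc = (fun () -> dequeue ());
        exnc = raise;
        effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Yield ->
            Some (fun (k : (a, unit) continuation) ->
              enqueue k; dequeue ())
          | Fork g ->
            Some (fun (k : (a, unit) continuation) ->
              enqueue k; spawn g)
          | _ -> None) }
  in
  spawn main

(* Record the interleaving so it can be inspected afterwards. *)
let trace : string list ref = ref []
let log s = trace := s :: !trace

let worker name () =
  log (name ^ ":1");
  yield ();
  log (name ^ ":2")

let demo () =
  trace := [];
  run (fun () -> fork (worker "a"); fork (worker "b"));
  List.rev !trace

let () = print_endline (String.concat " " (demo ()))
```

The workers are written in direct style, with no monadic bind and no special calling convention; the overlapped execution comes entirely from the handler suspending and resuming continuations.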
  87. Takeaways

  88. Care for Users • Transition to the new version should

    be a no-op or push-button solution ✦ Most code likely to remain sequential
  89. Care for Users • Transition to the new version should

    be a no-op or push-button solution ✦ Most code likely to remain sequential • Build tools to ease the transition OPAM Health Check: http://check.ocamllabs.io/
  90. Benchmarking Rigorously, Continuously on Real programs • OCaml users don’t

    just run synthetic benchmarks
  91. Benchmarking Rigorously, Continuously on Real programs • OCaml users don’t

    just run synthetic benchmarks • Sandmark — Real-world programs picked from the wild ✦ Coq ✦ Menhir (parser-generator) ✦ Alt-ergo (solver) ✦ Irmin (database) … and their large set of OPAM dependencies
  92. Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14

    = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant?
  93. Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14

    = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant? • Modern OS, arch, micro-arch effects become significant at small scales ✦ 20% speedup by inserting fences
  94. Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14

    = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant? • Modern OS, arch, micro-arch effects become significant at small scales ✦ 20% speedup by inserting fences • Tune the machine to remove noise
  95. Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14

    = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant? • Modern OS, arch, micro-arch effects become significant at small scales ✦ 20% speedup by inserting fences • Tune the machine to remove noise • Useful to measure instructions retired along with real time
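A single 19s versus 18s measurement cannot answer the significance question; repeated runs and a statistic that accounts for run-to-run variance can. As one possible sketch (the talk does not say which test the project uses), Welch's t statistic compares two sets of timings with possibly unequal variances. The timings below are invented for illustration and are not measurements from the talk.

```ocaml
(* Arithmetic mean of a non-empty list of floats. *)
let mean xs =
  List.fold_left (+.) 0.0 xs /. float_of_int (List.length xs)

(* Unbiased sample variance (divides by n - 1); needs at least 2 samples. *)
let variance xs =
  let m = mean xs in
  let ss = List.fold_left (fun acc x -> acc +. ((x -. m) ** 2.0)) 0.0 xs in
  ss /. float_of_int (List.length xs - 1)

(* Welch's t statistic for two samples with possibly unequal variances. *)
let welch_t xs ys =
  let nx = float_of_int (List.length xs)
  and ny = float_of_int (List.length ys) in
  (mean xs -. mean ys) /. sqrt ((variance xs /. nx) +. (variance ys /. ny))

let () =
  (* Hypothetical wall-clock times in seconds, made up for this example. *)
  let ocaml_4_14 = [19.1; 18.9; 19.3; 19.0; 18.8] in
  let ocaml_5_0 = [18.2; 18.0; 18.4; 17.9; 18.1] in
  Printf.printf "t = %.2f\n" (welch_t ocaml_4_14 ocaml_5_0)
```

A large absolute value of t suggests the difference is unlikely to be noise, but a proper conclusion still needs the matching degrees of freedom and a significance threshold, and the same analysis can be applied to instructions retired as well as to wall-clock time.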
  96. Benchmarking Rigorously, Continuously on Real programs • Are the speedups

    / slowdowns statistically signi f i
  97. Benchmarking Rigorously, Continuously on Real programs • Are the speedups

    / slowdowns statistically signi f i
  98. Benchmarking Rigorously, Continuously on Real programs • Continuous benchmarking as

    a service ✦ sandmark.tarides.com
  99. Invest in tooling Reuse existing tools; if not build them!

  100. Invest in tooling Reuse existing tools; if not build them!

    • rr = gdb + record-and-replay debugging
  101. Invest in tooling Reuse existing tools; if not build them!

    • rr = gdb + record-and-replay debugging • OCaml 5 + ThreadSanitizer ✦ Detect data races dynamically
  102. Invest in tooling Tracing GHC’s ThreadScope

  103. Invest in tooling Tracing GHC’s ThreadScope

  104. Invest in tooling Tracing Runtime Events: CTF-based tracing

  105. Invest in tooling Tracing Runtime Events: CTF-based tracing

  106. Convincing caml-devel • Quite a challenge maintaining a separate fork

    for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better
  107. Convincing caml-devel • Quite a challenge maintaining a separate fork

    for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better • Peer-reviewed papers add credibility to the effort
  108. Convincing caml-devel • Quite a challenge maintaining a separate fork

    for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better • Peer-reviewed papers add credibility to the effort • Open-source and actively-maintained always ✦ Lots of (academic) users from early days
  109. Convincing caml-devel • Quite a challenge maintaining a separate fork

    for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better • Peer-reviewed papers add credibility to the effort • Open-source and actively-maintained always ✦ Lots of (academic) users from early days • Continuous benchmarking, OPAM health check
  110. Growing the language OCaml 4 OCaml 5 time

  111. Growing the language OCaml 4 OCaml 5 time

  112. Growing the language OCaml 4 OCaml 5 time

  113. Growing the language OCaml 4 OCaml 5 A few researchers

    Lots of Engineers time
  114. Growing the language OCaml 4 OCaml 5 A few researchers

    Lots of Engineers time
  115. Growing the language OCaml 4 OCaml 5 A few researchers

    Lots of Engineers time Independent Contributors
  116. Where do we go from here? OCaml 5.0

  117. Where do we go from here? OCaml 5.0

  118. Where do we go from here? OCaml 5.0 Effect System

  119. Where do we go from here? OCaml 5.0 Effect System

    Backwards compatibility, polymorphism, modularity & generativity
  120. Where do we go from here? OCaml 5.0 Effect System

    JavaScript target
  121. Where do we go from here? OCaml 5.0 Effect System

    JavaScript target JFP ‘20
  122. Where do we go from here? OCaml 5.0 Effect System

    JavaScript target Modal Types
  123. Where do we go from here? OCaml 5.0 Effect System

    JavaScript target Modal Types + Lexically scoped typed effect handlers Untyped effects
  124. Where do we go from here? OCaml 5.0 Effect System

    JavaScript target Modal Types Unboxed Types Flambda2 Parallelism Control memory layout Avoid heap allocations Aggressive compiler optimisations Rust/C-like performance (on demand), with GC as default, and the ergonomics and safety of classic ML
  125. Where do we go from here? OCaml 5.0 Effect System

    JavaScript target Stack allocation Unboxed Types Flambda2 Parallelism Control memory layout Avoid heap allocations Aggressive compiler optimisations Rust/C-like performance (on demand), with GC as default, and the ergonomics and safety of classic ML
  126. Enjoy OCaml 5!

  127. Top Secret