Slide 1

Slide 1 text

Retrofitting Concurrency: Lessons from the engine room. “KC” Sivaramakrishnan. Images made with Stable Diffusion.

Slide 2

Slide 2 text

In Sep 2022… OCaml 5.0

Slide 3

Slide 3 text

In Sep 2022… OCaml 5.0 Concurrency Parallelism

Slide 4

Slide 4 text

In Sep 2022… OCaml 5.0 Overlapped execution A B A C B Time Concurrency Parallelism Effect Handlers

Slide 5

Slide 5 text

In Sep 2022… OCaml 5.0 Overlapped execution A B A C B Time Simultaneous execution A B C Time Concurrency Parallelism Effect Handlers Domains

Slide 6

Slide 6 text

In this talk… OCaml 5.0 OCaml 4.x Multicore OCaml

Slide 7

Slide 7 text

Backwards Compatibility Data Races Implementation Complexity Performance Stability In this talk… OCaml 5.0 OCaml 4.x

Slide 8

Slide 8 text

Journey Takeaways

Slide 9

Slide 9 text

In the year 2014… 18-year-old, industrial-strength, pragmatic, functional programming language

Slide 10

Slide 10 text

In the year 2014… 18-year-old, industrial-strength, pragmatic, functional programming language Industry Projects

Slide 11

Slide 11 text

In the year 2014… 18-year-old, industrial-strength, pragmatic, functional programming language Industry Projects No multicore support!

Slide 12

Slide 12 text

Runtime lock OCaml C C C

Slide 13

Slide 13 text

Runtime lock OCaml C C C GIL

Slide 14

Slide 14 text

Eliminate the runtime lock OCaml OCaml OCaml Simultaneous execution A B C Time Parallelism Domains

Slide 15

Slide 15 text

Eliminate the runtime lock OCaml OCaml OCaml Simultaneous execution A B C Time Parallelism Domains GIL Sam Gross, Meta, “Multithreaded Python without the GIL”
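As a minimal sketch of what "Domains" means in code, the example below splits a sum across two domains using `Domain.spawn` and `Domain.join` from the OCaml 5 standard library. The helper names `sum_range` and `parallel_sum` are illustrative and not taken from the talk.

```ocaml
(* Sum the integers in [lo, hi) sequentially. *)
let sum_range lo hi =
  let acc = ref 0 in
  for i = lo to hi - 1 do
    acc := !acc + i
  done;
  !acc

(* Split [0, n) into two halves and sum each half on its own domain.
   Each domain only touches its own local accumulator, so there is
   no shared mutable state and no data race. *)
let parallel_sum n =
  let mid = n / 2 in
  let d1 = Domain.spawn (fun () -> sum_range 0 mid) in
  let d2 = Domain.spawn (fun () -> sum_range mid n) in
  Domain.join d1 + Domain.join d2

let () = Printf.printf "%d\n" (parallel_sum 1_000)
```

With no runtime lock, the two halves can run simultaneously on separate cores; `Domain.join` waits for each result.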

Slide 16

Slide 16 text

Retrofitting Challenges ~> Approach • Millions of lines of legacy software ✦ Most code likely to remain sequential even with multicore • Cost of refactoring is prohibitive

Slide 17

Slide 17 text

Retrofitting Challenges ~> Approach • Millions of lines of legacy software ✦ Most code likely to remain sequential even with multicore • Cost of refactoring is prohibitive Do not break existing code!

Slide 18

Slide 18 text

Retrofitting Challenges ~> Approach • Low latency and predictable performance ✦ Great for ~10ms tolerance

Slide 19

Slide 19 text

Retrofitting Challenges ~> Approach • Low latency and predictable performance ✦ Great for ~10ms tolerance Optimise for GC latency before scalability

Slide 20

Slide 20 text

Retrofitting Challenges ~> Approach • OCaml core team is composed of volunteers ✦ Aim to reduce complexity and maintenance burden

Slide 21

Slide 21 text

Retrofitting Challenges ~> Approach • OCaml core team is composed of volunteers ✦ Aim to reduce complexity and maintenance burden No separate sequential and parallel runtimes Unlike -threaded runtime

Slide 22

Slide 22 text

Retrofitting Challenges ~> Approach • OCaml core team is composed of volunteers ✦ Aim to reduce complexity and maintenance burden No separate sequential and parallel runtimes ⇒ Existing sequential programs run just as fast using just as much memory Unlike -threaded runtime

Slide 23

Slide 23 text

Parallel Allocator & GC Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Medieval garbage truck according to Stable Diffusion

Slide 24

Slide 24 text

Parallel Allocator & GC Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Medieval garbage truck according to Stable Diffusion

Slide 25

Slide 25 text

Parallel Allocator & GC Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Medieval garbage truck according to Stable Diffusion

Slide 26

Slide 26 text

Access remote objects Domain 0 Domain 1 X Y let r = !x Major heap Minor heaps

Slide 27

Slide 27 text

Access remote objects Domain 0 Domain 1 X Y let r = !x promote(y) Major heap Minor heaps

Slide 28

Slide 28 text

Access remote objects Domain 0 Domain 1 X Y promote(y) y Major heap Minor heaps let r = !x

Slide 29

Slide 29 text

Parallel Allocator & GC Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Mostly concurrent concurrent Medieval garbage truck according to Stable Diffusion

Slide 30

Slide 30 text

Parallel Allocator & GC Major Heap Minor

Slide 31

Slide 31 text

Parallel Allocator & GC POPL ‘93 Major Heap Minor Heap Minor Heap

Slide 32

Slide 32 text

Parallel Allocator & GC ISMM ‘11 Major Heap Minor Heap Minor Heap

Slide 33

Slide 33 text

Parallel Allocator & GC POPL ‘93 JFP ‘14 Intel Single-chip Cloud Computer (SCC) Major Heap Minor Heap Minor Heap

Slide 34

Slide 34 text

Parallel Allocator & GC PPoPP ‘18 H1 H2 H3 H4 H5 disentanglement MaPLe

Slide 35

Slide 35 text

Parallel Allocator & GC JFP ‘14 PPoPP ‘18 ICFP ‘22

Slide 36

Slide 36 text

JFP ‘14 Parallel Allocator & GC POPL ‘93 ISMM ‘11 JFP ‘14 PPoPP ‘18

Slide 37

Slide 37 text

Parallel Allocator & GC • Excellent scalability on 128 cores ✦ Also maintains low latency on large core counts • Mostly retains sequential latency, throughput and memory usage characteristics Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2

Slide 38

Slide 38 text

Parallel Allocator & GC Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 But …

Slide 39

Slide 39 text

Parallel Allocator & GC Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 But … Read barrier!

Slide 40

Slide 40 text

Parallel Allocator & GC Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 But … Read barrier! • Read barrier ✦ Only a branch on the OCaml side for reads ✦ Reads are now GC safe points ✦ Breaks the C FFI invariants about when GC may be performed

Slide 41

Slide 41 text

Parallel Allocator & GC Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 But … Read barrier! • Read barrier ✦ Only a branch on the OCaml side for reads ✦ Reads are now GC safe points ✦ Breaks the C FFI invariants about when GC may be performed • No push-button fix! ✦ Lots of packages in the ecosystem broke.

Slide 42

Slide 42 text

Back to the drawing board (~2019) Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Mostly concurrent Stop-the-world parallel

Slide 43

Slide 43 text

Back to the drawing board (~2019) Major Heap Minor Heap Minor Heap Minor Heap Domain 0 Domain 1 Domain 2 Mostly concurrent Stop-the-world parallel Bringing 128 domains to a stop is surprisingly fast

Slide 44

Slide 44 text

Back to the drawing board (~2019) Major Heap Minor

Slide 45

Slide 45 text

Data Races • Data Race: When two threads perform unsynchronised accesses to the same location and at least one is a write. ✦ Non-SC behaviour due to compiler optimisations and relaxed hardware.

Slide 46

Slide 46 text

Data Races • Data Race: When two threads perform unsynchronised accesses to the same location and at least one is a write. ✦ Non-SC behaviour due to compiler optimisations and relaxed hardware. • Enforcing SC behaviour slows down sequential programs! ✦ 85% on ARM64, 41% on PowerPC

Slide 47

Slide 47 text

Data Races • Data Race: When two threads perform unsynchronised accesses to the same location and at least one is a write. ✦ Non-SC behaviour due to compiler optimisations and relaxed hardware. • Enforcing SC behaviour slows down sequential programs! ✦ 85% on ARM64, 41% on PowerPC OCaml needed a relaxed memory model
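To make the definition concrete, here is a small sketch of a data race written against the OCaml 5 `Domain` API: two domains write to the same non-atomic `ref` with no synchronisation between them. The function name `racy_write` is made up for illustration.

```ocaml
(* Two domains write to the same non-atomic location without
   synchronising with each other: by the definition above, a data race. *)
let racy_write () =
  let cell = ref 0 in
  let d1 = Domain.spawn (fun () -> cell := 1) in
  let d2 = Domain.spawn (fun () -> cell := 2) in
  Domain.join d1;
  Domain.join d2;
  (* Both writers have been joined, so this read is ordered after them.
     Which write "wins" is not determined by the program. *)
  !cell

let () =
  let v = racy_write () in
  Printf.printf "final value: %d\n" v
```

The program stays memory safe and the result is always one of the values actually written, but which one is observed can differ from run to run.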

Slide 48

Slide 48 text

Second-mover Advantage • Learn from the other language memory models

Slide 49

Slide 49 text

Second-mover Advantage • Learn from the other language memory models • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong

Slide 50

Slide 50 text

Second-mover Advantage • Learn from the other language memory models • DRF-SC + no crash under data races ✦ But scope of race is not limited in time • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong

Slide 51

Slide 51 text

Second-mover Advantage • Learn from the other language memory models • DRF-SC + no crash under data races ✦ But scope of race is not limited in time • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong • No data races by construction ✦ Unsafe code memory model is ~C++11

Slide 52

Slide 52 text

Second-mover Advantage • Learn from the other language memory models • DRF-SC + no crash under data races ✦ But scope of race is not limited in time Advantage: No Multicore OCaml programs in the wild! • DRF-SC, but catch-fire semantics on data races Well-typed OCaml programs don’t go wrong • No data races by construction ✦ Unsafe code memory model is ~C++11

Slide 53

Slide 53 text

OCaml memory model (~2017) • Simple (comprehensible!) operational memory model ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust.

Slide 54

Slide 54 text

OCaml memory model (~2017) • Simple (comprehensible!) operational memory model ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust. 1.19

Slide 55

Slide 55 text

OCaml memory model (~2017) • Simple (comprehensible!) operational memory model ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust. 1.19

Slide 56

Slide 56 text

OCaml memory model (~2017) • Simple (comprehensible!) operational memory model ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust. • Key innovation: Local data race freedom ✦ Permits compositional reasoning 1.19

Slide 57

Slide 57 text

OCaml memory model (~2017) • Simple (comprehensible!) operational memory model ✦ Only atomic and non-atomic locations ✦ DRF-SC ✦ No “out of thin air” values ✦ To squeeze out the most perf ⇒ write that module in C, C++ or Rust. • Key innovation: Local data race freedom ✦ Permits compositional reasoning • Performance impact ✦ Free on x86 and < 1% on ARM 1.19
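The split into atomic and non-atomic locations can be illustrated with the standard `Atomic` module. In this sketch (the function name and parameters are illustrative), several domains increment one atomic counter, so every access to the shared location is synchronised and the program is data-race free.

```ocaml
(* Several domains increment a shared atomic counter. Because the shared
   location is atomic, the program has no data race and therefore behaves
   as if it were sequentially consistent (DRF-SC). *)
let count_with_atomic ~domains ~iters =
  let counter = Atomic.make 0 in
  let worker () =
    for _ = 1 to iters do
      Atomic.incr counter
    done
  in
  let ds = List.init domains (fun _ -> Domain.spawn worker) in
  List.iter Domain.join ds;
  Atomic.get counter

let () =
  Printf.printf "%d\n" (count_with_atomic ~domains:4 ~iters:10_000)
```

Replacing the `Atomic.t` with a plain `int ref` would make the increments race, and updates could then be lost.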

Slide 58

Slide 58 text

• Simple (comprehensible!) operational memory model ✦ Only atomic and non-atomic locations ✦ No “out of thin air” values • Interested in extracting f i

Slide 59

Slide 59 text

Concurrency (~2015) • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time

Slide 60

Slide 60 text

Concurrency (~2015) • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Async Lwt >>=

Slide 61

Slide 61 text

Concurrency (~2015) • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Async Lwt >>= Synchronous Asynchronous Normal calls Special calling convention

Slide 62

Slide 62 text

Concurrency (~2015) • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Async Lwt >>= Synchronous Asynchronous Normal calls Special calling convention Eliminate function colours with native concurrency support — Bob Nystrom Overlapped execution A B A C B Time

Slide 63

Slide 63 text

Concurrency • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library

Slide 64

Slide 64 text

Concurrency • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library

Slide 65

Slide 65 text

Concurrency • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library Maintenance Burden C and not Haskell Lack of flexibility

Slide 66

Slide 66 text

Concurrency • Parallelism is a resource; concurrency is a programming abstraction ✦ M:N scheduling Overlapped execution A B A C B Time Runtime System Language Maintenance Burden C and not Haskell Lack of fl

Slide 67

Slide 67 text

Language & Runtime System Library Scheduler Blackholing (lazy evaluation) Concurrency

Slide 68

Slide 68 text

Language & Runtime System Library Scheduler Blackholing (lazy evaluation) Hard to undo adding a feature into the RTS Concurrency

Slide 69

Slide 69 text

Concurrency • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time Language & Runtime System Library First-class continuations!
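To show how first-class continuations let language-level threads live in a library rather than in the runtime system, here is a sketch of a round-robin cooperative scheduler built on the OCaml 5 `Effect` module. The effect names `Fork` and `Yield`, the `run` function, and the `log` buffer are illustrative choices, not code from the talk.

```ocaml
open Effect
open Effect.Deep

(* Effects a thread can perform; the names are illustrative. *)
type _ Effect.t +=
  | Fork : (unit -> unit) -> unit Effect.t
  | Yield : unit Effect.t

let fork f = perform (Fork f)
let yield () = perform Yield

(* A round-robin scheduler written as an ordinary library function. *)
let run main =
  let run_q : (unit -> unit) Queue.t = Queue.create () in
  (* Suspend a thread by storing its continuation in the queue. *)
  let enqueue k = Queue.push (fun () -> continue k ()) run_q in
  (* Resume the next suspended thread, if there is one. *)
  let dequeue () =
    if not (Queue.is_empty run_q) then (Queue.pop run_q) ()
  in
  let rec spawn f =
    match_with f ()
      { retc = (fun () -> dequeue ());
        exnc = raise;
        effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Yield -> Some (fun (k : (a, unit) continuation) ->
              enqueue k; dequeue ())
          | Fork g -> Some (fun (k : (a, unit) continuation) ->
              enqueue k; spawn g)
          | _ -> None) }
  in
  spawn main

(* Demo: two threads record their steps in a shared buffer. *)
let log = Buffer.create 16

let worker name () =
  for i = 1 to 2 do
    Buffer.add_string log (Printf.sprintf "%s%d " name i);
    yield ()
  done

let () =
  run (fun () -> fork (worker "A"); fork (worker "B"));
  print_endline (Buffer.contents log)
```

The scheduling policy is just the queue discipline inside `run`, so swapping it for another policy needs no change to the language or the runtime.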

Slide 70

Slide 70 text

How to continue? PLDI ‘96 call/1cc Chez Scheme

Slide 71

Slide 71 text

call/1cc

Slide 72

Slide 72 text

call/1cc

Slide 73

Slide 73 text

call/1cc

Slide 74

Slide 74 text

Ease of comprehension • Effect handler ~= Resumable exceptions + computation may be resumed later
Effect handler:
    effect E : string
    let comp () = print_string (perform E)
    let main () = try comp () with effect E k -> continue k "Handled"
Exception:
    exception E
    let comp () = print_string (raise E)
    let main () = try comp () with E -> print_string "Raised"
delimited continuation

Slide 75

Slide 75 text

Ease of comprehension • Effect handler ~= Resumable exceptions + computation may be resumed later • Easier than shift/reset, control/prompt ✦ No prompts or answer-type polymorphism
Effect handler:
    effect E : string
    let comp () = print_string (perform E)
    let main () = try comp () with effect E k -> continue k "Handled"
Exception:
    exception E
    let comp () = print_string (raise E)
    let main () = try comp () with E -> print_string "Raised"
delimited continuation

Slide 76

Slide 76 text

Ease of comprehension • Effect handler ~= Resumable exceptions + computation may be resumed later • Easier than shift/reset, control/prompt ✦ No prompts or answer-type polymorphism Effect handlers : shift/reset :: while : goto
Effect handler:
    effect E : string
    let comp () = print_string (perform E)
    let main () = try comp () with effect E k -> continue k "Handled"
Exception:
    exception E
    let comp () = print_string (raise E)
    let main () = try comp () with E -> print_string "Raised"
delimited continuation

Slide 77

Slide 77 text

OCaml ‘15 How to continue? One-shot delimited continuations exposed through effect handlers
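The "one-shot" restriction can be observed directly with the OCaml 5 `Effect` module: resuming the same continuation a second time raises `Continuation_already_resumed`. The sketch below uses a made-up effect `Ask` and a made-up function `resume_twice` to show this.

```ocaml
open Effect
open Effect.Deep

(* Illustrative effect that asks the handler for an integer. *)
type _ Effect.t += Ask : int Effect.t

(* Resume the captured continuation once, then try to resume it again.
   Returns the first result and whether the second resume was rejected. *)
let resume_twice () =
  match_with (fun () -> perform Ask + 1) ()
    { retc = (fun v -> (v, false));
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Ask -> Some (fun (k : (a, _) continuation) ->
            (* First resume: the computation finishes with 41 + 1. *)
            let (v, _) = continue k 41 in
            (* Second resume of the same continuation is an error. *)
            (match continue k 0 with
             | _ -> (v, false)
             | exception Continuation_already_resumed -> (v, true)))
        | _ -> None) }

let () =
  let (v, rejected) = resume_twice () in
  Printf.printf "result = %d, second resume rejected = %b\n" v rejected
```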

Slide 78

Slide 78 text

Ease of comprehension ~> Impact

Slide 79

Slide 79 text

Ease of comprehension ~> Impact

Slide 80

Slide 80 text

Retrofitting Effect Handlers • Don’t break existing code ⇒ No effect system ✦ No syntax and just functions

Slide 81

Slide 81 text

Retrofitting Effect Handlers • Don’t break existing code ⇒ No effect system ✦ No syntax and just functions • Focus on preserving ✦ Performance of legacy code (< 1% impact) ✦ Compatibility of tools: gdb, perf

Slide 82

Slide 82 text

Retrofitting Effect Handlers • Don’t break existing code ⇒ No effect system • Focus on preserving ✦ Performance of legacy code (< 1% impact) ✦ Compatibility of tools: gdb, perf PLDI ‘21
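As a sketch of effect handlers exposed as plain functions with no dedicated syntax, the earlier `effect E : string` example can be written against the OCaml 5.0 standard library `Effect` module roughly as follows. The computation is changed to return a string instead of printing so the result is easy to inspect; that change is an assumption made for illustration.

```ocaml
open Effect
open Effect.Deep

(* Declare the effect by extending the extensible type Effect.t. *)
type _ Effect.t += E : string Effect.t

(* The computation performs E and uses the string it is resumed with. *)
let comp () = "comp saw: " ^ perform E

(* Install a handler with the library function try_with. The handler
   receives the delimited continuation k and resumes it with "Handled". *)
let main () =
  try_with comp ()
    { effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | E -> Some (fun (k : (a, _) continuation) -> continue k "Handled")
        | _ -> None) }

let () = print_endline (main ())
```

Since effects are declared with an extensible variant and handled through a record of functions, existing code that never performs an effect is unaffected.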

Slide 83

Slide 83 text

Eio — Direct-style effect-based concurrency HTTP server performance using 24 cores HTTP server scaling maintaining a constant load of 1.5 million requests per second

Slide 84

Slide 84 text

Concurrency (~2022) • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time

Slide 85

Slide 85 text

Concurrency (~2022) • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time

Slide 86

Slide 86 text

Concurrency (~2022) • Parallelism is a resource; concurrency is a programming abstraction ✦ Language-level threads Overlapped execution A B A C B Time ~2 days ago 🎉

Slide 87

Slide 87 text

Takeaways

Slide 88

Slide 88 text

Care for Users • Transition to the new version should be a no-op or push-button solution ✦ Most code likely to remain sequential

Slide 89

Slide 89 text

Care for Users • Transition to the new version should be a no-op or push-button solution ✦ Most code likely to remain sequential • Build tools to ease the transition OPAM Health Check: http://check.ocamllabs.io/

Slide 90

Slide 90 text

Benchmarking Rigorously, Continuously on Real programs • OCaml users don’t just run synthetic benchmarks

Slide 91

Slide 91 text

Benchmarking Rigorously, Continuously on Real programs • OCaml users don’t just run synthetic benchmarks • Sandmark: real-world programs picked from the wild ✦ Coq ✦ Menhir (parser-generator) ✦ Alt-Ergo (solver) ✦ Irmin (database) … and their large set of OPAM dependencies

Slide 92

Slide 92 text

Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14 = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant?

Slide 93

Slide 93 text

Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14 = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant? • Modern OS, arch, micro-arch effects become significant at small scales ✦ 20% speedup by inserting fences

Slide 94

Slide 94 text

Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14 = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant? • Modern OS, arch, micro-arch effects become significant at small scales ✦ 20% speedup by inserting fences • Tune the machine to remove noise

Slide 95

Slide 95 text

Benchmarking Rigorously, Continuously on Real programs Program P: OCaml 4.14 = 19s OCaml 5.0 = 18s Are the speedups / slowdowns statistically significant? • Modern OS, arch, micro-arch effects become significant at small scales ✦ 20% speedup by inserting fences • Tune the machine to remove noise • Useful to measure instructions retired along with real time
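One way to approach the significance question is to repeat each measurement and compare the difference in means against the run-to-run noise. The sketch below computes Welch's t statistic for two sets of timings. Welch's test is a technique chosen here for illustration and is not claimed to be what Sandmark does, and all timing samples are hypothetical.

```ocaml
let mean xs =
  List.fold_left ( +. ) 0.0 xs /. float_of_int (List.length xs)

(* Unbiased sample variance. *)
let variance xs =
  let m = mean xs in
  let ss = List.fold_left (fun acc x -> acc +. ((x -. m) ** 2.0)) 0.0 xs in
  ss /. float_of_int (List.length xs - 1)

(* Welch's t statistic: difference in means, scaled by the combined
   standard error of the two samples. *)
let welch_t a b =
  let na = float_of_int (List.length a) in
  let nb = float_of_int (List.length b) in
  (mean a -. mean b) /. sqrt ((variance a /. na) +. (variance b /. nb))

(* Hypothetical timings in seconds, five runs each, on a quiet machine. *)
let quiet_4_14 = [ 19.0; 19.1; 18.9; 19.0; 19.0 ]
let quiet_5_0 = [ 18.0; 18.1; 17.9; 18.0; 18.0 ]

(* Hypothetical timings with the same means on a noisy machine. *)
let noisy_4_14 = [ 19.0; 21.5; 17.0; 20.5; 17.0 ]
let noisy_5_0 = [ 18.0; 20.0; 16.0; 19.5; 16.5 ]

let () =
  Printf.printf "quiet machine: t = %.2f\n" (welch_t quiet_4_14 quiet_5_0);
  Printf.printf "noisy machine: t = %.2f\n" (welch_t noisy_4_14 noisy_5_0)
```

Both pairs differ by one second on average, yet only the quiet-machine samples give a large t value, which is one reason tuning the machine to remove noise matters before trusting a 19s versus 18s comparison.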

Slide 96

Slide 96 text

Benchmarking Rigorously, Continuously on Real programs • Are the speedups / slowdowns statistically signi f i

Slide 97

Slide 97 text

Benchmarking Rigorously, Continuously on Real programs • Are the speedups / slowdowns statistically signi f i

Slide 98

Slide 98 text

Benchmarking Rigorously, Continuously on Real programs • Continuous benchmarking as a service ✦ sandmark.tarides.com

Slide 99

Slide 99 text

Invest in tooling Reuse existing tools; if not build them!

Slide 100

Slide 100 text

Invest in tooling Reuse existing tools; if not build them! • rr = gdb + record-and-replay debugging

Slide 101

Slide 101 text

Invest in tooling Reuse existing tools; if not build them! • rr = gdb + record-and-replay debugging • OCaml 5 + ThreadSanitizer ✦ Detect data races dynamically

Slide 102

Slide 102 text

Invest in tooling Tracing GHC’s ThreadScope

Slide 103

Slide 103 text

Invest in tooling Tracing GHC’s ThreadScope

Slide 104

Slide 104 text

Invest in tooling Tracing Runtime Events: CTF-based tracing

Slide 105

Slide 105 text

Invest in tooling Tracing Runtime Events: CTF-based tracing

Slide 106

Slide 106 text

Convincing caml-devel • Quite a challenge maintaining a separate fork for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better

Slide 107

Slide 107 text

Convincing caml-devel • Quite a challenge maintaining a separate fork for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better • Peer-reviewed papers add credibility to the effort

Slide 108

Slide 108 text

Convincing caml-devel • Quite a challenge maintaining a separate fork for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better • Peer-reviewed papers add credibility to the effort • Open-source and actively-maintained always ✦ Lots of (academic) users from early days

Slide 109

Slide 109 text

Convincing caml-devel • Quite a challenge maintaining a separate fork for 7+ years! ✦ Multiple rebases to keep it up-to-date with mainline ✦ In hindsight, smaller PRs are better • Peer-reviewed papers add credibility to the effort • Open-source and actively-maintained always ✦ Lots of (academic) users from early days • Continuous benchmarking, OPAM health check

Slide 110

Slide 110 text

Growing the language OCaml 4 OCaml 5 time

Slide 111

Slide 111 text

Growing the language OCaml 4 OCaml 5 time

Slide 112

Slide 112 text

Growing the language OCaml 4 OCaml 5 time

Slide 113

Slide 113 text

Growing the language OCaml 4 OCaml 5 A few researchers Lots of Engineers time

Slide 114

Slide 114 text

Growing the language OCaml 4 OCaml 5 A few researchers Lots of Engineers time

Slide 115

Slide 115 text

Growing the language OCaml 4 OCaml 5 A few researchers Lots of Engineers time Independent Contributors

Slide 116

Slide 116 text

Where do we go from here? OCaml 5.0

Slide 117

Slide 117 text

Where do we go from here? OCaml 5.0

Slide 118

Slide 118 text

Where do we go from here? OCaml 5.0 Effect System

Slide 119

Slide 119 text

Where do we go from here? OCaml 5.0 Effect System Backwards compatibility, polymorphism, modularity & generativity

Slide 120

Slide 120 text

Where do we go from here? OCaml 5.0 Effect System JavaScript target

Slide 121

Slide 121 text

Where do we go from here? OCaml 5.0 Effect System JavaScript target JFP ‘20

Slide 122

Slide 122 text

Where do we go from here? OCaml 5.0 Effect System JavaScript target Modal Types

Slide 123

Slide 123 text

Where do we go from here? OCaml 5.0 Effect System JavaScript target Modal Types + Lexically scoped typed effect handlers Untyped effects

Slide 124

Slide 124 text

Where do we go from here? OCaml 5.0 Effect System JavaScript target Modal Types Unboxed Types Flambda2 Parallelism Control memory layout Avoid heap allocations Aggressive compiler optimisations Rust/C-like performance (on demand), with GC as default, and the ergonomics and safety of classic ML

Slide 125

Slide 125 text

Where do we go from here? OCaml 5.0 Effect System JavaScript target Stack allocation Unboxed Types Flambda2 Parallelism Control memory layout Avoid heap allocations Aggressive compiler optimisations Rust/C-like performance (on demand), with GC as default, and the ergonomics and safety of classic ML

Slide 126

Slide 126 text

Enjoy OCaml 5!

Slide 127

Slide 127 text

Top Secret