Slide 1

Slide 1 text

Bounding Data Races in Space and Time
KC Sivaramakrishnan
University of Cambridge, OCaml Labs
Darwin College, Cambridge
1851 Royal Commission

Slide 5

Slide 5 text

Multicore OCaml
• OCaml is an industrial-strength, functional programming language
  ★ Projects: MirageOS unikernel, Coq proof assistant, F* programming language
  ★ Companies: Facebook (Hack, Flow, Infer, Reason), Microsoft (Everest, F*), Jane Street (all trading & support systems), Docker (Docker for Mac & Windows), Citrix (XenStore)
• No multicore support!
• Multicore OCaml
  ★ Native support for concurrency and parallelism in OCaml
  ★ Led by OCaml Labs, with Jane Street, Microsoft Research and INRIA

Slide 13

Slide 13 text

Modelling Memory
• How do you reason about access to memory?
  ★ Spoiler: No single global sequentially consistent memory
• Modern multicore processors reorder instructions for performance

  Initially a = 0 && b = 0

  Thread 1        Thread 2
  a = 1           b = 1
  r1 = b          r2 = a

  r1 == 0 && r2 == 0 ???
  Allowed under x86, ARM, POWER (write buffering)
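
The litmus test above can be checked mechanically. The following is a minimal Python sketch, not taken from the talk: it enumerates every execution of the two threads, first against a single shared memory (sequential consistency) and then with a hypothetical per-thread FIFO write buffer that drains to memory at arbitrary points. The names `SB` and `outcomes` and the buffer model itself are assumptions made for illustration; real hardware is more subtle.

```python
# Store-buffering litmus test: each thread stores to one location,
# then loads the other. Loads name the register they fill.
SB = (
    (("store", "a", 1), ("load", "b", "r1")),  # Thread 1
    (("store", "b", 1), ("load", "a", "r2")),  # Thread 2
)


def outcomes(program, buffered):
    """Explore every execution and return the set of final (r1, r2) pairs."""
    found = set()

    def step(pcs, bufs, mem, regs):
        if all(pc == len(thread) for pc, thread in zip(pcs, program)):
            found.add((regs["r1"], regs["r2"]))
            return
        for i, thread in enumerate(program):
            # Option 1: thread i executes its next instruction.
            if pcs[i] < len(thread):
                op, loc, arg = thread[pcs[i]]
                new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
                if op == "store" and buffered:
                    # The store sits in the thread's private buffer for now.
                    new_bufs = bufs[:i] + (bufs[i] + ((loc, arg),),) + bufs[i + 1:]
                    step(new_pcs, new_bufs, mem, regs)
                elif op == "store":
                    step(new_pcs, bufs, {**mem, loc: arg}, regs)
                else:
                    # A load sees the thread's own newest buffered store first.
                    pending = [v for (l, v) in bufs[i] if l == loc]
                    value = pending[-1] if pending else mem[loc]
                    step(new_pcs, bufs, mem, {**regs, arg: value})
            # Option 2: the oldest buffered store of thread i reaches memory.
            if bufs[i]:
                (loc, value), rest = bufs[i][0], bufs[i][1:]
                new_bufs = bufs[:i] + (rest,) + bufs[i + 1:]
                step(pcs, new_bufs, {**mem, loc: value}, regs)

    step((0, 0), ((), ()), {"a": 0, "b": 0}, {})
    return found


sc_outcomes = outcomes(SB, buffered=False)
buffered_outcomes = outcomes(SB, buffered=True)
print("SC:      ", sorted(sc_outcomes))
print("buffered:", sorted(buffered_outcomes))
```

Under the single-memory model the pair (0, 0) never appears; once stores may linger in a buffer, both loads can run before either store becomes visible, and (0, 0) becomes reachable.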

Slide 20

Slide 20 text

Modelling Memory
• Compiler optimisations also reorder memory access instructions

  Initially &a == &b && a = b = 1

  Thread 1 (original)     Thread 1 (after CSE)     Thread 2
  r1 = a * 2              r1 = a * 2               b = 0
  r2 = b + 1              r2 = b + 1
  r3 = a * 2              r3 = r1

  Original:  r1 == 2 && r2 == 1 && r3 == 0
  After CSE: r1 == 2 && r2 == 1 && r3 == 2 (not possible in the original program)
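
To see that common subexpression elimination (CSE) introduces a new outcome here, the following Python sketch runs Thread 2's single store at every possible point of Thread 1, once for the original program and once for the transformed one. It is an illustration under stated assumptions: `a` and `b` are modelled as one shared cell because they alias, and the helper names are hypothetical.

```python
def load_a_times_2(reg):
    """Build the instruction reg = a * 2 (a aliases the shared cell)."""
    def instr(mem, regs):
        regs[reg] = mem["cell"] * 2
    return instr


def load_b_plus_1(mem, regs):
    """r2 = b + 1 (b aliases the same shared cell)."""
    regs["r2"] = mem["cell"] + 1


def reuse_r1(mem, regs):
    """r3 = r1: what CSE substitutes for the second a * 2."""
    regs["r3"] = regs["r1"]


original = [load_a_times_2("r1"), load_b_plus_1, load_a_times_2("r3")]
after_cse = [load_a_times_2("r1"), load_b_plus_1, reuse_r1]


def outcomes(thread1):
    """Run Thread 2's store (b = 0) at each point of Thread 1's execution."""
    results = set()
    for position in range(len(thread1) + 1):
        mem = {"cell": 1}  # &a == &b, initially 1
        regs = {}
        for index, instr in enumerate(thread1):
            if index == position:
                mem["cell"] = 0  # Thread 2: b = 0
            instr(mem, regs)
        # position == len(thread1) means the store lands after Thread 1 ends.
        results.add((regs["r1"], regs["r2"], regs["r3"]))
    return results


original_outcomes = outcomes(original)
cse_outcomes = outcomes(after_cse)
print("only after CSE:", sorted(cse_outcomes - original_outcomes))
```

The difference between the two sets is the behaviour that the optimisation adds: `r3` keeps the stale value 2 even though `r2` has already observed Thread 2's write.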

Slide 23

Slide 23 text

Memory Model
• Unambiguous specification of program outcomes
  ★ More than just thread interleavings
• Memory Model Desiderata
  ★ Not too weak (good for programmers)
  ★ Not too strong (good for hardware)
  ★ Admits optimisations (good for compilers)
  ★ Mathematically rigorous (good for verification)
• Difficult to get right
  ★ C/C++11 memory model is flawed
  ★ Java memory model is flawed
  ★ Several papers every year in top PL conferences proposing / fixing models
[Figure labels: Memory model, OCaml compiler]

Slide 28

Slide 28 text

Memory Model: Programmer’s view
• Data race
  ★ Concurrent accesses to the same memory location, at least one of which is a write
• Sequential consistency (SC)
  ★ No intra-thread reordering, only inter-thread interleaving
• DRF-SC: primary tool in the concurrent programmer’s arsenal
  ★ If a program has no races (under SC semantics), then the program has SC semantics
  ★ Well-synchronised programs do not have surprising behaviours
• Our observation: DRF-SC is too weak for programmers

Slide 33

Slide 33 text

C/C++ Memory Model
• C/C++ (C11) memory model offers DRF-SC, but...
  ★ If a program has races (even benign), then the behaviour is undefined!
  ★ Most C/C++ programs have races => most C/C++ programs are allowed to crash and burn
• Races on unrelated locations can affect behaviour
  ★ We would like a memory model where data races are bounded in space

Slide 40

Slide 40 text

Java Memory Model
• Java also offers DRF-SC
  ★ Unlike C++, type safety necessitates defined behaviour under races
  ★ No data races in space, but allows races in time…

  int a;
  volatile boolean flag;

  Thread 1          Thread 2
  a = 1;            a = 2;
  flag = true;      if (flag) {
                      // no race here
                      r1 = a;
                      r2 = a;
                    }

  r1 == 1 && r2 == 2 is allowed
  Races in the past affect the future

Slide 44

Slide 44 text

Java Memory Model
• Future data races can affect the past

  class C { int x; }
  C g;

  Thread 1            Thread 2
  C c = new C();      g.x = 7;
  c.x = 42;
  r1 = c.x;
  g = c;

  Can assert (r1 == 42) fail?

Slide 47

Slide 47 text

Java Memory Model
• Future data races can affect the past

  class C { int x; }
  C g;

  Thread 1            Thread 2
  C c = new C();      g.x = 7;
  c.x = 42;
  r1 = c.x;
  g = c;

  assert (r1 == 42) can fail
• We would like a memory model that bounds data races in time

Slide 52

Slide 52 text

OCaml Memory Model: Goal
• Language memory models should specify behaviours under data races
  ★ Not because they are useful
  ★ But to limit their damage

  If I read a variable twice and there are no concurrent writes, then both reads return the same value

Slide 53

Slide 53 text

OCaml MM: Contributions
• Memory Model Desiderata
  ★ Not too weak (good for programmers)
  ★ Not too strong (good for hardware)
  ★ Admits optimisations (good for compilers)
  ★ Mathematically rigorous (good for verification)
• OCaml Memory model
  ★ Local version of DRF-SC (key discovery)
  ★ Free on x86, 0.6% overhead on ARM, 2.6% overhead on POWER
  ★ Allows most common compiler optimisations
  ★ Simple operational and axiomatic semantics, proved sound (optimisations and compilation to hardware)

Slide 62

Slide 62 text

Local DRF
• If there are no data races,
  ★ on some variables (space)
  ★ in some interval (time)
  ★ then the program has SC behaviour on those variables in that time interval
• Space = {all variables} && Time = whole execution => DRF-SC

  Flag is atomic

  Thread 1        Thread 2
  msg = 1;        b = 1;
  b = 0;          if (Flag) {
  Flag = 1;         r = msg;
                  }

  Due to local DRF, despite the race on b, message-passing idiom still works!
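
The message-passing claim can be explored exhaustively on a toy model. The Python sketch below is an assumption-laden rendering of a history-and-frontier style semantics, written for illustration rather than copied from the talk: each non-atomic location keeps a history of timestamped writes, each thread keeps a frontier (the oldest timestamp it may still read per location), a non-atomic write takes any unused timestamp later than the writer's frontier, and atomic accesses merge frontiers. All names (`T1`, `T2`, `write_slots`, `explore`) are hypothetical.

```python
from fractions import Fraction

ZERO = Fraction(0)

T1 = [("write", "msg", 1), ("write", "b", 0), ("write_atomic", "Flag", 1)]
T2 = [("write", "b", 1), ("read_atomic", "Flag", None), ("read_if_flag", "msg", None)]


def merge(f, g):
    """Pointwise maximum of two frontiers over the same locations."""
    return {loc: max(f[loc], g[loc]) for loc in f}


def write_slots(history, after):
    """Timestamps a new write may take: any unused point later than `after`."""
    points = [after] + sorted(t for t in history if t > after)
    gaps = [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]
    return gaps + [points[-1] + 1]


def explore(pcs, fronts, hist, flag, regs, results):
    """Depth-first search over schedules, write timestamps and read choices."""
    if pcs == (len(T1), len(T2)):
        results.add((regs["seen_flag"], regs["r"]))
        return
    for i, prog in enumerate((T1, T2)):
        if pcs[i] == len(prog):
            continue
        op, loc, val = prog[pcs[i]]
        nxt = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
        mine = fronts[i]

        def with_front(f):
            return fronts[:i] + (f,) + fronts[i + 1:]

        if op == "write":
            for t in write_slots(hist[loc], mine[loc]):
                new_hist = {**hist, loc: {**hist[loc], t: val}}
                explore(nxt, with_front({**mine, loc: t}), new_hist, flag, regs, results)
        elif op == "write_atomic":
            merged = merge(mine, flag[0])  # publish the writer's frontier
            explore(nxt, with_front(merged), hist, (merged, val), regs, results)
        elif op == "read_atomic":
            merged = merge(mine, flag[0])  # acquire the published frontier
            explore(nxt, with_front(merged), hist, flag,
                    {**regs, "seen_flag": flag[1]}, results)
        elif regs["seen_flag"] == 0:
            explore(nxt, fronts, hist, flag, regs, results)  # branch not taken
        else:
            for t, v in hist[loc].items():
                if t >= mine[loc]:  # any write at or after the frontier
                    explore(nxt, fronts, hist, flag, {**regs, "r": v}, results)


start = {"msg": ZERO, "b": ZERO}
results = set()
explore((0, 0), (dict(start), dict(start)),
        {"msg": {ZERO: 0}, "b": {ZERO: 0}},
        (dict(start), 0), {"seen_flag": None, "r": None}, results)
print(sorted(results, key=str))
```

Each result is a pair (value read from Flag, value of r). In this toy model the only pairs are (0, None) and (1, 1): whenever Thread 2 sees the flag set it also sees `msg = 1`, even though every ordering of the racy writes to `b` has been explored.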

Slide 66

Slide 66 text

Formal Memory Model
• Most programmers can live with local DRF
  ★ Experts demand more (concurrency libraries, high-performance code, etc.)
• Simple operational semantics that captures all of the allowed behaviours

Slide 71

Slide 71 text

Visualising operational semantics
[Figure: histories of the non-atomic locations a, b and c (values 1 to 7) laid out along a time axis, with Thread 1 and Thread 2 marked. Steps shown: read(b) -> 3/4/5, then write(c,10).]

Slide 74

Slide 74 text

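Visualising operational semantics
[Figure: histories of the non-atomic locations a, b and c along a time axis, with Thread 1 and Thread 2 marked. Steps shown: read(b) -> 3/4/5 and write(c,10), which adds 10 to the histories. Atomic locations A and B are shown holding 10 and 5.]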

Slide 76

Slide 76 text

Visualising operational semantics
[Figure: histories of the non-atomic locations a, b and c along a time axis, with Thread 1 and Thread 2 marked. Atomic locations: A = 10, B = 5. Step shown: read(B) -> 5.]

Slide 78

Slide 78 text

Visualising operational semantics
[Figure: histories of the non-atomic locations a, b and c along a time axis, with Thread 1 and Thread 2 marked. Atomic locations: A = 10, B = 5. Steps shown: read(B) -> 5, then write(A,20).]

Slide 79

Slide 79 text

Visualising operational semantics
[Figure: histories of the non-atomic locations a, b and c along a time axis, with Thread 1 and Thread 2 marked. After write(A,20) the atomic locations hold A = 20, B = 5.]
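
The steps in these figures can be replayed on a small model. The Python sketch below is a simplification made for illustration and is not the talk's formal semantics: histories are lists whose index serves as the timestamp, a non-atomic write always appends at the end (a fuller model would let it take any unused later timestamp), and the initial contents of a, b and c are guesses at how the values 1 to 7 are distributed. The `Machine` class and the four functions are hypothetical names.

```python
class Machine:
    """Non-atomic locations hold histories; atomic ones hold (frontier, value)."""

    def __init__(self, nonatomic, atomic):
        self.histories = {loc: list(values) for loc, values in nonatomic.items()}
        self.atomics = {loc: ({}, value) for loc, value in atomic.items()}


def merge(f, g):
    """Pointwise maximum of two frontiers (a missing entry means timestamp 0)."""
    return {loc: max(f.get(loc, 0), g.get(loc, 0)) for loc in f.keys() | g.keys()}


def read_na(machine, frontier, loc):
    """Every value a non-atomic read may return: entries at or after the frontier."""
    return machine.histories[loc][frontier.get(loc, 0):]


def write_na(machine, frontier, loc, value):
    """Append a write and advance the writer's frontier to it (simplified)."""
    machine.histories[loc].append(value)
    return {**frontier, loc: len(machine.histories[loc]) - 1}


def read_at(machine, frontier, loc):
    """Atomic read: return the value and absorb the location's frontier."""
    loc_frontier, value = machine.atomics[loc]
    return value, merge(frontier, loc_frontier)


def write_at(machine, frontier, loc, value):
    """Atomic write: merge frontiers both ways and store the new value."""
    loc_frontier, _ = machine.atomics[loc]
    merged = merge(frontier, loc_frontier)
    machine.atomics[loc] = (merged, value)
    return merged


# Assumed initial state: a = [1, 2], b = [3, 4, 5], c = [6, 7]; A = 10, B = 5.
m = Machine({"a": [1, 2], "b": [3, 4, 5], "c": [6, 7]}, {"A": 10, "B": 5})
t1, t2 = {}, {}

b_choices = read_na(m, t2, "b")          # read(b) -> 3/4/5
t1 = write_na(m, t1, "c", 10)            # write(c,10)
c_before = read_na(m, t2, "c")           # Thread 2 may still see old values
b_value, t2 = read_at(m, t2, "B")        # read(B) -> 5
t1 = write_at(m, t1, "A", 20)            # write(A,20) publishes Thread 1's frontier
a_value, t2 = read_at(m, t2, "A")        # Thread 2 picks that frontier up
c_after = read_na(m, t2, "c")            # now only the write of 10 is visible
print(b_choices, c_before, b_value, a_value, c_after)
```

The design point the sketch tries to convey is that non-atomic reads are deliberately loose (several values are acceptable), while atomic accesses are the only operations that move frontier information between threads.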

Slide 87

Slide 87 text

Formalizing Local DRF
[Figure: a trace, with labels "Machine state = State of all threads + Heap" and "Memory access"]
• Pick a set L of locations (space)
• Pick a machine state M where there are no ongoing races in L
  ★ M is said to be L-stable
• Local DRF Theorem
  ★ Starting from an L-stable state M, until the next race on any location in L under SC semantics, the program has SC semantics (time)

Slide 94

Slide 94 text

Performance Implication
• Local DRF prohibits certain hardware and software optimisations
  ★ Preserve load-to-store ordering
• No compiler optimisation that changes load-to-store ordering is allowed

  Example (redundant store elimination):
  r1 = a; b = c; a = r1;   →   r1 = a; b = c;

• ARM & POWER do not preserve load-to-store ordering
  ★ Insert necessary synchronisation between every mutable load and store
  ★ What is the performance cost?

Slide 97

Slide 97 text

Performance
• 0.6% overhead on AArch64 (ARMv8)
• Free on x86, 2.6% on POWER

Slide 99

Slide 99 text

Summary
• OCaml memory model
  ★ Balances comprehensibility (Local DRF theorem) and performance (free on x86, 0.6% on ARMv8, 2.6% on POWER)
  ★ Allows common compiler optimisations
  ★ Compilation + optimisations proved sound
• Proposed as the memory model for OCaml
  ★ Also suitable for other safe languages (Swift, WebAssembly, JavaScript)