Formal Reasoning to Build Subtle Systems in Go

Slide 1

Slide 1 text

Formal Reasoning to Build Subtle Systems in Go Raghav Roy

Slide 2

Slide 2 text

whoami

Slide 3

Slide 3 text

What I will be covering ● Basics of Formal Reasoning

Slide 4

Slide 4 text

What I will be covering ● Basics of Formal Reasoning ● What thinking “above” your code means

Slide 5

Slide 5 text

What I will be covering ● Basics of Formal Reasoning ● What thinking “above” your code means ● Examples of using Formal Methods in Go Concurrency problems

Slide 6

Slide 6 text

What I will be covering ● Basics of Formal Reasoning ● What thinking “above” your code means ● Examples of using Formal Methods in Go Concurrency problems ● Formal Methods used in production

Slide 7

Slide 7 text

What I will not be covering ● In-depth language-specific details

Slide 8

Slide 8 text

What I will not be covering ● In-depth language-specific details ● How to use the tooling around TLA+

Slide 9

Slide 9 text

Software is everywhere

Slide 10

Slide 10 text

(a very scientific graph)

Slide 11

Slide 11 text

Why is reliability hard?

Slide 12

Slide 12 text

Projects get very complex very quickly

Slide 13

Slide 13 text

Often, something as “artistic” as software is difficult to reason about formally

Slide 14

Slide 14 text

Legacy software,

Slide 15

Slide 15 text

Legacy software, No documentation,

Slide 16

Slide 16 text

Legacy software, No documentation, Entropy

Slide 17

Slide 17 text

How to think about software ● ‘What’ do you want it to do, before the ‘How’

Slide 18

Slide 18 text

How to think about software ● ‘What’ do you want it to do, before the ‘How’ ● Informally or Formally writing down the expected behaviour

Slide 19

Slide 19 text

What it comes down to ● How do we ensure that our system is designed in a way that it doesn’t crash or reach incorrect states?

Slide 20

Slide 20 text

“Writing is nature’s way of telling you how sloppy your thinking is” - Dick Guindon

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Concurrency

Slide 23

Slide 23 text

Where can we see it crop up ● Multiple systems, that are running independently, and have a shared global state

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Where can we see it crop up ● Multiple systems that run independently and share global state ● Non-deterministic: Two executions of the same program with the same input can produce different results ● Example: Writers and Readers sharing a single queue; the result can change just by changing the order in which they write to and read from the queue
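
The writers-and-readers example can be sketched in Go as follows. This is only an illustration: the `collect` function and the writer names are hypothetical, and the shared queue is modelled as an unbuffered channel. The same input is used on every run, but the order in which the reader sees the items is decided by the scheduler.

```go
package main

import (
	"fmt"
	"sync"
)

// collect starts one writer goroutine per name. Each writer pushes its name
// onto a shared queue (an unbuffered channel). The returned slice records the
// order in which the single reader happened to receive the names.
func collect(writers []string) []string {
	queue := make(chan string)
	var wg sync.WaitGroup
	for _, name := range writers {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			queue <- n // blocks until the reader takes the value
		}(name)
	}
	// Close the queue once every writer has finished, so the reader stops.
	go func() {
		wg.Wait()
		close(queue)
	}()
	var order []string
	for n := range queue {
		order = append(order, n)
	}
	return order
}

func main() {
	// Same program, same input, possibly a different order on each line.
	for i := 0; i < 3; i++ {
		fmt.Println(collect([]string{"A", "B", "C"}))
	}
}
```

Every run delivers all three items, but nothing in the program fixes their order, which is the non-determinism the slide refers to.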

Slide 26

Slide 26 text

Let’s look at a simple example

Slide 27

Slide 27 text

Ye Olde Banking System Even with a simple monolith architecture, with just a frontend, a backend and a database, there are two points of concurrency.

Slide 28

Slide 28 text

Ye Olde Banking System In a system where Person A can transfer money to Person B

Slide 29

Slide 29 text

Ye Olde Banking System In a system where Person A can transfer money to Person B ■ Bank needs to check if Person A has sufficient funds

Slide 30

Slide 30 text

Ye Olde Banking System In a system where Person A can transfer money to Person B ■ Bank needs to check if Person A has sufficient funds ■ Add amount to Person B’s bank account

Slide 31

Slide 31 text

Ye Olde Banking System In a system where Person A can transfer money to Person B ■ Bank needs to check if Person A has sufficient funds ■ Add amount to Person B’s bank account ■ Deduct amount from Person A’s bank account

Slide 32

Slide 32 text

Ye Olde Banking System ● Even in this simple system, one step may not finish before the next one starts -> Races, Crashes/Partial Failures
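
A minimal sketch of such a race, replayed by hand so that it is deterministic. The `bank` and `transfer` types, the starting balance of 100 and the transfer amount of 80 are all assumptions made for illustration. Each transfer is split into the three steps from the earlier slide (check, credit B, debit A), and `interleaved` runs both checks before either debit.

```go
package main

import "fmt"

// bank holds the balances of Person A and Person B (hypothetical model).
type bank struct{ a, b int }

// transfer moves amount from A to B in three separate steps.
type transfer struct {
	amount int
	ok     bool // result of the funds check
}

// check is step 1: does A have sufficient funds right now?
func (t *transfer) check(bk *bank) { t.ok = bk.a >= t.amount }

// credit is step 2: add the amount to B if the check passed.
func (t *transfer) credit(bk *bank) {
	if t.ok {
		bk.b += t.amount
	}
}

// debit is step 3: deduct the amount from A if the check passed.
func (t *transfer) debit(bk *bank) {
	if t.ok {
		bk.a -= t.amount
	}
}

// sequential runs two transfers one after the other.
func sequential() bank {
	bk := bank{a: 100}
	for i := 0; i < 2; i++ {
		t := &transfer{amount: 80}
		t.check(&bk)
		t.credit(&bk)
		t.debit(&bk)
	}
	return bk
}

// interleaved lets the second check run before the first debit.
func interleaved() bank {
	bk := bank{a: 100}
	t1, t2 := &transfer{amount: 80}, &transfer{amount: 80}
	t1.check(&bk)
	t2.check(&bk) // also passes: A still holds 100
	t1.credit(&bk)
	t1.debit(&bk)
	t2.credit(&bk)
	t2.debit(&bk)
	return bk
}

func main() {
	fmt.Println("sequential: ", sequential())
	fmt.Println("interleaved:", interleaved())
}
```

Run sequentially, the second transfer is rejected and A ends with 20. In the interleaved order both checks pass and A ends with -60, a state the bank never intended to allow.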

Slide 33

Slide 33 text

Ye Olde Banking System ● Can writing Unit Tests solve this issue? For this example, if the number of simultaneous transfers is N,

Slide 34

Slide 34 text

Ye Olde Banking System ● Can writing Unit Tests solve this issue? For this example, if the number of simultaneous transfers is N, the number of unit tests to write is (3N)!/(3!)^N Huge for such a simple system! (1680 for 3 Transactions)
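
The count can be checked with a few lines of Go. The formula (3N)!/(3!)^N is the number of ways to order 3N steps when the three steps of each transfer must keep their own relative order. The function name `interleavings` is an assumption; `math/big` is used only so that larger N do not overflow.

```go
package main

import (
	"fmt"
	"math/big"
)

// interleavings returns (3n)! / (3!)^n: the number of orderings of n
// transfers of three steps each, where every transfer's own steps stay
// in order.
func interleavings(n int) *big.Int {
	num := new(big.Int).MulRange(1, int64(3*n))                       // (3n)!
	den := new(big.Int).Exp(big.NewInt(6), big.NewInt(int64(n)), nil) // (3!)^n
	return num.Quo(num, den)
}

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("N=%d: %s orderings\n", n, interleavings(n).String())
	}
}
```

For N = 3 this is 9!/6^3 = 362880/216 = 1680, and the number grows factorially from there.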

Slide 35

Slide 35 text

Alternative? Formal Specifications

Slide 36

Slide 36 text

Blueprints, and its spectrum *This part of the talk is from Lamport’s Talk

Slide 37

Slide 37 text

Very simple to very complex, where does building concurrent/distributed systems lie? *This part of the talk is from Lamport’s Talk

Slide 38

Slide 38 text

Blueprints, and its spectrum *This part of the talk is from Lamport’s Talk

Slide 39

Slide 39 text

Blueprints, and its spectrum *This part of the talk is from Lamport’s Talk

Slide 40

Slide 40 text

Blueprints, and its spectrum *This part of the talk is from Lamport’s Talk

Slide 41

Slide 41 text

Blueprints, and its spectrum We need tools to check this *This part of the talk is from Lamport’s Talk

Slide 42

Slide 42 text

Modeling Programs ○ Programs can be modeled in a number of ways: Turing Machines, Automata, Programming Languages

Slide 43

Slide 43 text

Modeling Programs ○ Programs can be modeled in a number of ways: Turing Machines, Automata, Programming Languages ○ But all of this can be described in terms of a State Machine

Slide 44

Slide 44 text

Modeling Programs ○ Programs can be modeled in a number of ways: Turing Machines, Automata, Programming Languages ○ But all of this can be described in terms of a State Machine ■ This means describing your program as a set of ‘behaviours’ where each behaviour is a ‘sequence of discrete steps’

Slide 45

Slide 45 text

Modeling Programs ● This requires us to define the initial state of the system, and the next state of the system

Slide 46

Slide 46 text

Modeling Programs ● This requires us to define the initial state of the system, and the next state of the system ● You can have multiple next states for a current state

Slide 47

Slide 47 text

Modeling Programs ● This requires us to define the initial state of the system, and the next state of the system ● You can have multiple next states for a current state (model non-determinism)
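
The idea of an initial state plus a next-state relation can be sketched in Go. The toy machine below is invented for illustration and is not from the talk: `State`, `Init`, `Next` and `Behaviours` are hypothetical names, and the counter rule (add 1 or add 2 until the value reaches 3) exists only to give each state more than one successor.

```go
package main

import "fmt"

// State is the full state of the toy system: a single counter.
type State struct{ X int }

// Init is the initial state.
func Init() State { return State{X: 0} }

// Next returns every state reachable in one step. Two alternatives are
// offered while X < 3, which is how non-determinism is modelled here.
func Next(s State) []State {
	if s.X >= 3 {
		return nil // no further steps
	}
	return []State{{X: s.X + 1}, {X: s.X + 2}}
}

// Behaviours enumerates every maximal sequence of states starting at s.
func Behaviours(s State) [][]State {
	succ := Next(s)
	if len(succ) == 0 {
		return [][]State{{s}}
	}
	var out [][]State
	for _, n := range succ {
		for _, tail := range Behaviours(n) {
			out = append(out, append([]State{s}, tail...))
		}
	}
	return out
}

func main() {
	for _, b := range Behaviours(Init()) {
		fmt.Println(b)
	}
}
```

The program is then the set of behaviours this produces: 0,1,2,3 and 0,1,2,4 and 0,1,3 and 0,2,3 and 0,2,4.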

Slide 48

Slide 48 text

Modeling Programs

Slide 49

Slide 49 text

Modeling Programs OR OR

Slide 50

Slide 50 text

Modeling Programs V V

Slide 51

Slide 51 text

Modeling Programs ● TLA+ gives us the framework to do this

Slide 52

Slide 52 text

What is TLA+

Slide 53

Slide 53 text

Temporal Logic of Actions

Slide 54

Slide 54 text

Euclid’s algorithm to find the greatest common divisor

Slide 55

Slide 55 text

What it looks like in Go

Slide 56

Slide 56 text

What it looks like in Go
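
The code on these slides was an image and is not part of the transcript. As a stand-in, here is a minimal sketch of the subtraction form of Euclid's algorithm, which is not necessarily the version the speaker showed. The function name `GCD` and the choice to return 0 for non-positive input are assumptions.

```go
package main

import "fmt"

// GCD returns the greatest common divisor of two positive integers using
// repeated subtraction. It returns 0 if either argument is not positive,
// because the loop below would not terminate for such inputs.
func GCD(x, y int) int {
	if x <= 0 || y <= 0 {
		return 0
	}
	// Each iteration is one discrete step of the state (x, y).
	for x != y {
		if x > y {
			x -= y
		} else {
			y -= x
		}
	}
	return x
}

func main() {
	fmt.Println(GCD(12, 18))
}
```

Written this way the algorithm is already close to a state machine: the state is the pair (x, y), the initial state is the input, and each loop iteration is a step.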

Slide 57

Slide 57 text

Let’s look at the TLA+ definition

Slide 58

Slide 58 text

Let’s look at the TLA+ definition

Slide 59

Slide 59 text

Let’s look at the TLA+ definition

Slide 60

Slide 60 text

Let’s look at the TLA+ definition

Slide 61

Slide 61 text

Model Checking TLA+ is a language that lets you write specifications formally,

Slide 62

Slide 62 text

Model Checking TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them.

Slide 63

Slide 63 text

Model Checking Model checkers verify the correctness of your specification by running it against all possible executions of your program.

Slide 64

Slide 64 text

Model Checking More Specific Model checkers verify systems by induction, by enumerating possible states a system can take on,

Slide 65

Slide 65 text

Model Checking More Specific Model checkers verify systems by induction, by enumerating possible states a system can take on, and showing that none of the states violate the system requirements.

Slide 66

Slide 66 text

Model Checking More Specific Model checkers verify systems by induction, by enumerating possible states a system can take on, and showing that none of the states violate the system requirements. (Specifications)

Slide 67

Slide 67 text

Model Checking Model checkers can check two things

Slide 68

Slide 68 text

Model Checking Model checkers can check two things ● Liveness: Good things happen

Slide 69

Slide 69 text

Model Checking Model checkers can check two things ● Liveness: Good things happen ● Safety: Bad things won’t happen

Slide 70

Slide 70 text

Model Checking Model checkers can check two things ● Liveness: Good things eventually happen (Temporal logic) ● Safety: Bad things won’t happen

Slide 71

Slide 71 text

Model Checking What can Safety look like?

Slide 72

Slide 72 text

Model Checking What can Safety look like? ■ Two threads can’t both be in a critical section at the same time. ■ Users cannot write to files they don’t have access to. ■ We never use more than 500 KB of RAM. ■ The user_id key in the table is unique. ■ We never add a string to an integer.
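
The first property in the list can be checked by brute force for a small model. The sketch below is a toy explicit-state checker written for illustration; it is not TLC and not how the talk's specs were checked. It assumes a deliberately naive lock in which each of two threads first observes that the lock is free and only later sets it. The names `State`, `Next`, `MutualExclusion` and `Check` are hypothetical.

```go
package main

import "fmt"

// State of the toy system: one lock flag and a program counter per thread.
// PC values: 0 = waiting, 1 = saw the lock free, 2 = in the critical section.
type State struct {
	Locked bool
	PC     [2]int
}

// Next returns all states reachable by letting either thread take one step.
func Next(s State) []State {
	var out []State
	for i := range s.PC {
		n := s
		switch s.PC[i] {
		case 0:
			if s.Locked {
				continue // thread i cannot move while the lock is held
			}
			n.PC[i] = 1
		case 1:
			n.Locked = true // takes the lock without re-checking it
			n.PC[i] = 2
		case 2:
			n.Locked = false
			n.PC[i] = 0
		}
		out = append(out, n)
	}
	return out
}

// MutualExclusion is the safety property: never both in the critical section.
func MutualExclusion(s State) bool { return !(s.PC[0] == 2 && s.PC[1] == 2) }

// Check explores every reachable state breadth-first. It returns a shortest
// trace to a state that violates safe, or nil if no such state is reachable.
func Check(init State, safe func(State) bool) []State {
	parent := map[State]State{}
	seen := map[State]bool{init: true}
	queue := []State{init}
	for len(queue) > 0 {
		s := queue[0]
		queue = queue[1:]
		if !safe(s) {
			trace := []State{s}
			for s != init {
				s = parent[s]
				trace = append([]State{s}, trace...)
			}
			return trace
		}
		for _, n := range Next(s) {
			if !seen[n] {
				seen[n] = true
				parent[n] = s
				queue = append(queue, n)
			}
		}
	}
	return nil
}

func main() {
	for i, s := range Check(State{}, MutualExclusion) {
		fmt.Printf("step %d: %+v\n", i, s)
	}
}
```

The checker finds a four-step counterexample: both threads see the lock free before either one sets it. A safety property is exactly the kind that can be refuted by a finite trace like this one.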

Slide 73

Slide 73 text

Slide 74

Slide 74 text

Model Checking Then what is Liveness?

Slide 75

Slide 75 text

Model Checking Then what is Liveness? Every message is received at least once by each client.

Slide 76

Slide 76 text

Slide 77

Slide 77 text

Model Checking Then what is Liveness? Every message is received at least once by each client. ■ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future. An infinite sequence of steps is required to break it.

Slide 78

Slide 78 text

Slide 79

Slide 79 text

Let’s look at how this can help debug Go concurrency primitives

Slide 80

Slide 80 text

Concurrency Bug : This was an actual error that was found in the very popular Gops library

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

unbuffered buffered

Slide 84

Slide 84 text

First this Then this

Slide 85

Slide 85 text

Only read from after for loop Blocked

Slide 86

Slide 86 text

Concurrency Bug : ● This was an actual error that was found in the very popular Gops library ● Hillel Wayne then demonstrates the bug with his TLA+ spec for the implementation

Slide 87

Slide 87 text

Concurrency Bug : ● This was an actual error that was found in the very popular Gops library ● Hillel Wayne then demonstrates the bug with his TLA+ spec for the implementation and finds the deadlock condition
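
The code screenshots for this bug are not in the transcript. The sketch below is a simplified reconstruction of the pattern that the slide annotations describe (an unbuffered results channel, a buffered channel that limits concurrency, and results that are only read after the loop). It is not the library's actual source, and `findAll`, `finishes` and all parameters are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// findAll starts one worker per item, with at most limit workers in flight.
// resultBuf is the capacity of the results channel; 0 makes it unbuffered.
func findAll(items, limit, resultBuf int) []int {
	limitCh := make(chan struct{}, limit) // buffered: caps concurrent workers
	found := make(chan int, resultBuf)

	for i := 0; i < items; i++ {
		limitCh <- struct{}{} // blocks once limit workers hold a slot
		go func(id int) {
			defer func() { <-limitCh }() // slot is freed only after the send
			found <- id                  // unbuffered: blocks until a receive
		}(i)
	}

	// Results are only read after the loop has finished.
	results := make([]int, 0, items)
	for i := 0; i < items; i++ {
		results = append(results, <-found)
	}
	return results
}

// finishes reports whether findAll returns within a short timeout.
// When it does not, the blocked goroutines are simply left behind.
func finishes(items, limit, resultBuf int) bool {
	done := make(chan struct{})
	go func() {
		findAll(items, limit, resultBuf)
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(200 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println("2 items, limit 2, unbuffered:", finishes(2, 2, 0))
	fmt.Println("5 items, limit 2, unbuffered:", finishes(5, 2, 0))
	fmt.Println("5 items, limit 2, buffered:  ", finishes(5, 2, 5))
}
```

With more items than the limit, every worker is stuck sending on the unbuffered channel, so no slot is ever released and the loop never reaches the code that would receive. Giving the results channel enough capacity removes the block in this sketch; that is one possible remedy and not necessarily the fix adopted upstream.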

Slide 88

Slide 88 text

No content

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

No content

Slide 91

Slide 91 text

Let’s design our own concurrent system

Slide 92

Slide 92 text

Design ● Imagine a system, working a bit like a pipeline consisting of three steps:

Slide 93

Slide 93 text

Slide 94

Slide 94 text

Design ● Imagine a system, working a bit like a pipeline consisting of three steps: Step 1 - The Input: One component handles some incoming data, does some initial processing, and sends an event on a queue. Step 2 - The Processor: A component that receives the event sent in Step 1, does the processing, and sends the result on yet another queue.

Slide 95

Slide 95 text

Slide 96

Slide 96 text

Design

Slide 97

Slide 97 text

Design

Slide 98

Slide 98 text

Design

Slide 99

Slide 99 text

What it looks like in TLA+

Slide 100

Slide 100 text

No content

Slide 101

Slide 101 text

IGNORE if this makes you queasy (like me)

Slide 102

Slide 102 text

No content

Slide 103

Slide 103 text

Running Model Checker

Slide 104

Slide 104 text

Why does it deadlock? The problem was the following sequence of steps:

Slide 105

Slide 105 text

Why does it deadlock? The problem was the following sequence of steps: ● Step 1 would ○ Process input

Slide 106

Slide 106 text

Why does it deadlock? The problem was the following sequence of steps: ● Step 1 would ○ Process input ○ Add it to the shared data

Slide 107

Slide 107 text

Why does it deadlock? The problem was the following sequence of steps: ● Step 1 would ○ Process input ○ Add it to the shared data ○ Send an event to Step 2 containing an identifier

Slide 108

Slide 108 text

Why does it deadlock? The problem was the following sequence of steps: ● Step 2 would ○ Prune the shared data

Slide 109

Slide 109 text

Why does it deadlock? The problem was the following sequence of steps: ● Step 2 would ○ Prune the shared data ○ Remove the object that was added above at Step 1.

Slide 110

Slide 110 text

Why does it deadlock? The problem was the following sequence of steps: ● Step 2 then received the event from Step 1

Slide 111

Slide 111 text

Why does it deadlock? The problem was the following sequence of steps: ● Step 2 then received the event from Step 1 No object in shared data!

Slide 112

Slide 112 text

Why does it deadlock? The problem was the following sequence of steps: Race between ● Step 1 adding the object to the shared data and sending an event to Step 2 ● Step 2 pruning the object before handling that event

Slide 113

Slide 113 text

Modeling the Fix : The Fix! ● The SendIncoming(id) step should only put the identifier on the queue

Slide 114

Slide 114 text

Modeling the Fix : The Fix! ● The SendIncoming(id) step should only put the identifier on the queue ● The ReceiveIncoming step should add the object to, and eventually prune from, the shared storage.
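
The race and the fix can be replayed by hand in Go. This is a rough sketch under stated assumptions, not the system from the referenced article: the `storage` type, the `prune`, `racy` and `fixed` functions and the single identifier 1 are all hypothetical, and the steps run in one goroutine in a fixed order so that the outcome is deterministic.

```go
package main

import "fmt"

// storage stands in for the shared data, keyed by identifier.
type storage map[int]string

// prune removes every object from the shared data.
func prune(s storage) {
	for id := range s {
		delete(s, id)
	}
}

// racy replays the failing order: Step 1 writes to the shared data and sends
// the identifier, then Step 2 prunes before it handles the event.
// It reports whether Step 2 found the object.
func racy() bool {
	shared := storage{}
	queue := make(chan int, 1)

	shared[1] = "object" // Step 1: add to the shared data
	queue <- 1           // Step 1: send the identifier to Step 2

	prune(shared) // Step 2: prune runs first
	id := <-queue // Step 2: now handle the event
	_, ok := shared[id]
	return ok
}

// fixed follows the change described above: sending only enqueues the
// identifier, and the receiving step both adds and later prunes the object.
func fixed() bool {
	shared := storage{}
	queue := make(chan int, 1)

	queue <- 1 // SendIncoming: identifier only

	id := <-queue // ReceiveIncoming: add, use, and eventually prune
	shared[id] = "object"
	_, ok := shared[id]
	prune(shared)
	return ok
}

func main() {
	fmt.Println("racy order finds object: ", racy())
	fmt.Println("fixed order finds object:", fixed())
}
```

Once a single step owns both the add and the prune, no other step can remove the object between the add and the lookup, which is the ordering that the model checker flagged.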

Slide 115

Slide 115 text

Modeling the Fix :

Slide 116

Slide 116 text

Modeling the Fix :

Slide 117

Slide 117 text

New TLA+ spec : Moved

Slide 118

Slide 118 text

Running Model Checker

Slide 119

Slide 119 text

So, was any of that relevant to actual systems in production?

Slide 120

Slide 120 text

Let’s drive this point home

Slide 121

Slide 121 text

Who uses this in production? ● This is not limited to just modeling toy systems, but real systems, here is what Amazon engineers had to say,

Slide 122

Slide 122 text

Who uses this in production? ● This is not limited to just modeling toy systems, but real systems, here is what Amazon engineers had to say, (and they also wrote a paper)

Slide 123

Slide 123 text

No content

Slide 124

Slide 124 text

Who uses this in production? ● They used TLA+ in 10+ large, complex real-world systems

Slide 125

Slide 125 text

Slide 126

Slide 126 text

Who uses this in production? ● They used TLA+ in 10+ large, complex real-world systems ● In every case, TLA+ added significant value by preventing subtle, serious bugs that could have reached production ● Gave them enough understanding and confidence to make aggressive optimisations without sacrificing correctness of their systems

Slide 127

Slide 127 text

Automated Reasoning Group - AWS

Slide 128

Slide 128 text

Thanks for making it this far! Let’s conclude

Slide 129

Slide 129 text

What do programmers need to know about thinking about your code in this way? ● Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong * More Lamport goodness

Slide 130

Slide 130 text

Slide 131

Slide 131 text

What do programmers need to know about thinking about your code in this way? ● Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong ● There is a need to think before coding, or more clearly, the need to Write before you code ● Any piece of code that someone is likely to use or modify, needs to be specified in some way * More Lamport goodness

Slide 132

Slide 132 text

Slide 133

Slide 133 text

What do programmers need to know about thinking about your code in this way? ● There is importance in specifying everything your code does and if required, how it does it * More Lamport goodness

Slide 134

Slide 134 text

What do programmers need to know about thinking about your code in this way? ● There is importance in specifying everything your code does and if required, how it does it ● You should always specify your code “above” the code level, in terms of states and behaviours or input/output behaviours. * More Lamport goodness

Slide 135

Slide 135 text

Slide 136

Slide 136 text

Why write formal specs? ● Disclaimer: finding bugs in code is not the intended purpose of writing a formal spec. Writing a formal spec is hard work,

Slide 137

Slide 137 text

Why write formal specs? ● Disclaimer: finding bugs in code is not the intended purpose of writing a formal spec. Writing a formal spec is hard work; it requires thinking in a way that isn’t intuitive to most programmers

Slide 138

Slide 138 text

Slide 139

Slide 139 text

Why write formal specs? ● Even if the above talk doesn’t apply to you, and you never write complex critical systems, * More Lamport goodness

Slide 140

Slide 140 text

Slide 141

Slide 141 text

Why write formal specs? ● Even if the above talk doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway ● Trace out the possible states your program can be in, and reason about it

Slide 142

Slide 142 text

Slide 143

Slide 143 text

References ● Leslie Lamport’s lectures on TLA+ ● Hillel Wayne’s lecture on “Tackling Concurrency Bugs with TLA+” ● Hillel Wayne’s TLA+ spec and fix for the Gops bug ● Leslie Lamport’s video on Thinking Above the Code ● Gregory Terzian's article for TLA+ spec for modeling concurrency bug and fix ● Gops issue link ● Amazon’s paper on Using Formal Verification Techniques in Production ● Image References ○ Speaker deck from Leslie Lamport’s lectures for Blueprint Spectrum ○ Simplified version of Gops FindAll function screenshots Gopher Credits: Renée French, Tenntenn, Maria Letta, Women Who Go Speaker Deck: speakerdeck.com/royra

Slide 144

Slide 144 text

Thank you!