Slide 1

Slide 1 text

Hardening Go Concurrency, Using Formal Methods of Verification Raghav Roy – VMware

Slide 2

Slide 2 text

whoami

Slide 3

Slide 3 text

What I will be covering ● Basics of Formal Verification

Slide 4

Slide 4 text

What I will be covering ● Basics of Formal Verification ● What thinking “above” your code means

Slide 5

Slide 5 text

What I will be covering ● Basics of Formal Verification ● What thinking “above” your code means ● Examples of using Formal Verification in Go Concurrency problems

Slide 6

Slide 6 text

What I will be covering ● Basics of Formal Verification ● What thinking “above” your code means ● Examples of using Formal Verification in Go Concurrency problems ● Formal verification used in production

Slide 7

Slide 7 text

What I will not be covering ● In depth language specific details

Slide 8

Slide 8 text

What I will not be covering ● In depth language specific details ● How to use Tooling around TLA+ (not covered in detail)

Slide 9

Slide 9 text

Why do we think? *This part of the talk is borrowed from Lamport’s Talk

Slide 10

Slide 10 text

Well, it helps us do things, like building a house *This part of the talk is borrowed from Lamport’s Talk

Slide 11

Slide 11 text

When should we think? *This part of the talk is borrowed from Lamport’s Talk

Slide 12

Slide 12 text

Ideally before you start construction *This part of the talk is borrowed from Lamport’s Talk

Slide 13

Slide 13 text

For programs, you ideally should think about your code before you start writing any code *This part of the talk is borrowed from Lamport’s Talk

Slide 14

Slide 14 text

“Writing is nature’s way of telling you how sloppy your thinking really is” - Guindon *This part of the talk is borrowed from Lamport’s Talk

Slide 15

Slide 15 text

How to think ● ‘What’ do you want it to do. *This part of the talk is borrowed from Lamport’s Talk

Slide 16

Slide 16 text

How to think ● ‘What’ do you want it to do. ● With concurrent or distributed systems, that rough sketch needs to be promoted to maybe a blueprint, or a functional definition of its behaviour *This part of the talk is borrowed from Lamport’s Talk

Slide 17

Slide 17 text

How to think ● ‘What’ do you want it to do. ● With concurrent or distributed systems, that rough sketch needs to be promoted to maybe a blueprint, or a functional definition of its behaviour ○ Design the system in a way that it can run correctly for every state that it can be in *This part of the talk is borrowed from Lamport’s Talk

Slide 18

Slide 18 text

How to think ● How do we ensure that our system is designed in a way that it doesn’t crash or reach incorrect states? *This part of the talk is borrowed from Lamport’s Talk

Slide 19

Slide 19 text

Concurrency

Slide 20

Slide 20 text

Where can we see it crop up ● Multiple systems, that are running independently, and have a shared global state

Slide 21

Slide 21 text

Where can we see it crop up ● Multiple systems, that are running independently, and have a shared global state ● Non-deterministic: Two executions of the same program with the same input can produce different results

Slide 22

Slide 22 text

Where can we see it crop up ● Multiple systems, that are running independently, and have a shared global state ● Non-deterministic: Two executions of the same program with the same input can produce different results ● Example: Writers and Readers from a single queue, results can differ with just changing the order of who writes and who reads from the shared queue
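The writers-and-readers bullet can be sketched in Go. This is a minimal illustration, not code from the talk: the function name `collect` and the writer labels are made up, and a buffered channel stands in for the shared queue.

```go
package main

import (
	"fmt"
	"sync"
)

// collect starts two writers that each push one value onto a shared queue
// (a channel) and returns the values in the order the reader saw them.
func collect() []string {
	queue := make(chan string, 2)
	var wg sync.WaitGroup
	for _, name := range []string{"writer-A", "writer-B"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			queue <- n // which writer gets here first is up to the scheduler
		}(name)
	}
	wg.Wait()
	close(queue)

	var seen []string
	for v := range queue {
		seen = append(seen, v)
	}
	return seen
}

func main() {
	// Same program, same input: the two values can come out in either order.
	fmt.Println(collect())
}
```

Both values always arrive, but nothing in the program fixes their order, which is the non-determinism the slide refers to.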

Slide 23

Slide 23 text

Let’s look at a simple example

Slide 24

Slide 24 text

Ye Olde Banking System ○ Even with a simple monolith architecture, with just a frontend, a backend and a database, there are two points of concurrency.

Slide 25

Slide 25 text

Ye Olde Banking System ○ Even with a simple monolith architecture, with just a frontend, a backend and a database, there are two points of concurrency. ○ In a system where Person A can transfer money to Person B ■ Bank needs to check if Person A has sufficient funds ■ Add amount to Person B’s bank account ■ Deduct amount from Person A’s bank account
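To see why the three steps are risky when they are not atomic, here is a small Go sketch that replays one bad ordering by hand. The account names, amounts and function names are assumptions made for illustration; no real goroutines are involved, so the outcome is deterministic.

```go
package main

import "fmt"

// accounts is the shared state; the names and amounts are made up.
type accounts map[string]int

// The three steps from the slide, kept as separate functions so that the
// steps of two different transfers can be interleaved by hand.
func hasFunds(a accounts, from string, amount int) bool {
	return a[from] >= amount
}

func credit(a accounts, to string, amount int) {
	a[to] += amount
}

func debit(a accounts, from string, amount int) {
	a[from] -= amount
}

// interleaved replays one bad schedule: both transfers run their funds
// check before either of them debits Person A.
func interleaved() accounts {
	a := accounts{"A": 100, "B": 0, "C": 0}
	ok1 := hasFunds(a, "A", 100) // transfer 1: A -> B
	ok2 := hasFunds(a, "A", 100) // transfer 2: A -> C
	if ok1 {
		credit(a, "B", 100)
		debit(a, "A", 100)
	}
	if ok2 {
		credit(a, "C", 100)
		debit(a, "A", 100)
	}
	return a
}

func main() {
	fmt.Println(interleaved()) // Person A ends up overdrawn
}
```

Each transfer is correct on its own; only the ordering of their steps breaks the "sufficient funds" rule.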

Slide 26

Slide 26 text

Ye Olde Banking System ○ Just in this simple system, one step may not finish before the other starts -> Races, Crashes/Partial Failures

Slide 27

Slide 27 text

Ye Olde Banking System ○ Just in this simple system, one step may not finish before the other starts -> Races, Crashes/Partial Failures ○ Can writing Unit Tests solve this issue? For this particular example, if number of simultaneous transfers is N,

Slide 28

Slide 28 text

Ye Olde Banking System ○ Just in this simple system, one step may not finish before the other starts -> Races, Crashes/Partial Failures ○ Can writing Unit Tests solve this issue? For this particular example, if the number of simultaneous transfers is N, the number of step orderings the unit tests would have to cover is (3N)!/(3!)^N - Huge for such a simple system! (1680 for 3 Transactions)
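The count (3N)!/(3!)^N can be evaluated directly. The sketch below uses Go's `math/big` so the factorials do not overflow; the function name `interleavings` and its `steps` parameter are illustrative, not from the talk. For N = 3 it evaluates to 9!/6^3 = 362880/216 = 1680.

```go
package main

import (
	"fmt"
	"math/big"
)

// interleavings returns (steps*n)! / (steps!)^n: the number of ways to
// interleave n transfers of `steps` ordered steps each.
func interleavings(n, steps int64) *big.Int {
	num := new(big.Int).MulRange(1, steps*n)  // (steps*n)!
	stepFact := new(big.Int).MulRange(1, steps) // steps!
	den := new(big.Int).Exp(stepFact, big.NewInt(n), nil)
	return num.Div(num, den) // the division is exact
}

func main() {
	for n := int64(1); n <= 5; n++ {
		fmt.Printf("N=%d: %s orderings\n", n, interleavings(n, 3))
	}
}
```

The growth is what makes hand-written tests impractical: every extra concurrent transfer multiplies the number of orderings to cover.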

Slide 29

Slide 29 text

Alternative? Formal Specifications

Slide 30

Slide 30 text

Blueprints, and its spectrum *This part of the talk is borrowed from Lamport’s Talk

Slide 31

Slide 31 text

Very simple to very complex, where does building concurrent/distributed systems lie? *This part of the talk is borrowed from Lamport’s Talk

Slide 32

Slide 32 text

Blueprints, and its spectrum *This part of the talk is borrowed from Lamport’s Talk

Slide 33

Slide 33 text

Blueprints, and its spectrum *This part of the talk is borrowed from Lamport’s Talk

Slide 34

Slide 34 text

Blueprints, and its spectrum *This part of the talk is borrowed from Lamport’s Talk

Slide 35

Slide 35 text

Blueprints, and its spectrum We need tools to check this *This part of the talk is borrowed from Lamport’s Talk

Slide 36

Slide 36 text

Modeling Programs ○ Programs can be modeled in a number of ways: Turing Machines, Automata, Programming Languages

Slide 37

Slide 37 text

Modeling Programs ○ Programs can be modeled in a number of ways: Turing Machines, Automata, Programming Languages ○ But all of this can be described in terms of a State Machine ■ This means describing your program as a set of “behaviours” where each behaviour is a ‘sequence of discrete steps’

Slide 38

Slide 38 text

Modeling Programs ○ Programs can be modeled in a number of ways: Turing Machines, Automata, Programming Languages ○ But all of this can be described in terms of a State Machine ■ This means describing your program as a set of “behaviours” where each behaviour is a ‘sequence of discrete steps’ ■ This requires us to define the initial state of the system, and the next state of the system

Slide 39

Slide 39 text

Modeling Programs ○ Programs can be modeled in a number of ways: Turing Machines, Automata, Programming Languages ○ But all of this can be described in terms of a State Machine ■ This means describing your program as a set of “behaviours” where each behaviour is a ‘sequence of discrete steps’ ■ This requires us to define the initial state of the system, and the next state of the system ■ You can have multiple next states for a current state

Slide 40

Slide 40 text

Modeling Programs

Slide 41

Slide 41 text

Modeling Programs OR OR

Slide 42

Slide 42 text

Modeling Programs V V

Slide 43

Slide 43 text

Modeling Programs ○ Programs can be modeled in a number of ways: Turing Machines, Automata, Programming Languages ○ But all of this can be described in terms of a State Machine ■ This means describing your program as a set of “behaviours” where each behaviour is a ‘sequence of discrete steps’ ■ This requires us to define the initial state of the system, and the next state of the system ■ You can have multiple next states for a current state (modeling Non-Determinism) ■ TLA+ gives us the framework to do this
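The state-machine view (an initial state, a next-state relation, several possible next states) can be sketched in plain Go. The example below is an assumption chosen for illustration, not taken from the slides: two processes each run `tmp = x; x = tmp + 1` on a shared `x`, and `next` returns every state reachable in one step.

```go
package main

import (
	"fmt"
	"sort"
)

// state of two processes that each run: tmp = x; x = tmp + 1.
// pc[i] is 0 (about to read), 1 (about to write) or 2 (done).
type state struct {
	x   int
	pc  [2]int
	tmp [2]int
}

// initial is the single initial state: x = 0, both processes at step 0.
func initial() state { return state{} }

// next returns every state reachable in one step. More than one successor
// means either process could move next: that is the non-determinism.
func next(s state) []state {
	var out []state
	for i := 0; i < 2; i++ {
		t := s
		switch s.pc[i] {
		case 0: // read the shared variable
			t.tmp[i] = s.x
			t.pc[i] = 1
		case 1: // write back the incremented copy
			t.x = s.tmp[i] + 1
			t.pc[i] = 2
		default: // this process is done
			continue
		}
		out = append(out, t)
	}
	return out
}

// finalValues explores all behaviours breadth-first and returns the values
// x can hold once both processes are done.
func finalValues() []int {
	seen := map[state]bool{initial(): true}
	queue := []state{initial()}
	finals := map[int]bool{}
	for len(queue) > 0 {
		s := queue[0]
		queue = queue[1:]
		succ := next(s)
		if len(succ) == 0 {
			finals[s.x] = true
		}
		for _, t := range succ {
			if !seen[t] {
				seen[t] = true
				queue = append(queue, t)
			}
		}
	}
	var vals []int
	for v := range finals {
		vals = append(vals, v)
	}
	sort.Ints(vals)
	return vals
}

func main() {
	fmt.Println(finalValues())
}
```

Exploring every behaviour shows that `x` can end at 1 as well as 2, the classic lost update. TLA+ expresses the same initial-state and next-state idea as mathematical formulas rather than Go functions.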

Slide 44

Slide 44 text

What is TLA+

Slide 45

Slide 45 text

Temporal Logic of Actions

Slide 46

Slide 46 text

Euclid’s algorithm to find greatest common divisor

Slide 47

Slide 47 text

What it looks like in Go

Slide 48

Slide 48 text

What it looks like in Go
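The code shown on these slides is not part of the extracted text. As a stand-in, here is a minimal Go sketch of Euclid's algorithm in its subtraction form, assuming both inputs are positive integers; it is not necessarily the version the speaker showed.

```go
package main

import "fmt"

// gcd computes the greatest common divisor of two positive integers with
// the subtraction form of Euclid's algorithm: repeatedly subtract the
// smaller number from the larger until both are equal.
func gcd(x, y int) int {
	for x != y {
		if x < y {
			y -= x
		} else {
			x -= y
		}
	}
	return x
}

func main() {
	fmt.Println(gcd(12, 18))
}
```

The subtraction form is used here because each loop iteration is one small, discrete step, which maps naturally onto a state-machine description.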

Slide 49

Slide 49 text

Let’s look at the TLA+ definition

Slide 50

Slide 50 text

Let’s look at the TLA+ definition

Slide 51

Slide 51 text

Let’s look at the TLA+ definition

Slide 52

Slide 52 text

Let’s look at the TLA+ definition
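The spec on these slides is an image and did not survive extraction. A common way to write Euclid's algorithm as a state machine, following the formulation used in Lamport's TLA+ course, is an initial predicate and a next-state relation over two variables. The constants M and N (the two input numbers) are assumptions of this sketch, and primed variables denote values in the next state.

```latex
\begin{aligned}
\mathit{Init} &\triangleq (x = M) \land (y = N) \\
\mathit{Next} &\triangleq \big( x < y \;\land\; y' = y - x \;\land\; x' = x \big) \\
              &\quad \lor \big( y < x \;\land\; x' = x - y \;\land\; y' = y \big)
\end{aligned}
```

When x = y neither disjunct is enabled, so the machine stops, and at that point x is the greatest common divisor of M and N.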

Slide 53

Slide 53 text

Model Checking ○ TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them. ■ Model checkers verify the correctness of your specification by checking it against all possible executions of the system it describes.

Slide 54

Slide 54 text

Model Checking ○ TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them. ■ More Specific: Model checkers verify systems by exhaustively enumerating the states a system can reach, and showing that none of those states violates the system requirements.

Slide 55

Slide 55 text

Model Checking ○ TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them. ■ More Specific: Model checkers verify systems by exhaustively enumerating the states a system can reach, and showing that none of those states violates the system requirements. (Specifications)

Slide 56

Slide 56 text

Model Checking ○ TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them. ■ Model checkers can check two things

Slide 57

Slide 57 text

Model Checking ○ TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them. ■ Model checkers can check two things ● Liveness: Good things happen

Slide 58

Slide 58 text

Model Checking ○ TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them. ■ Model checkers can check two things ● Liveness: Good things happen ● Safety: Bad things won’t happen

Slide 59

Slide 59 text

Model Checking ○ TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them. ■ Model checkers can check two things ● Temporal logic for Liveness: Good things eventually happen ● Safety: Bad things won’t happen

Slide 60

Slide 60 text

Model Checking ○ What can Safety look like: ■ Two threads can’t both be in a critical section at the same time. ■ Users cannot write to files they don’t have access to. ■ We never use more than 500 KB of RAM. ■ The user_id key in the table is unique. ■ We never add a string to an integer.

Slide 61

Slide 61 text

Model Checking ○ What can Safety look like: ■ Two threads can’t both be in a critical section at the same time. ■ Users cannot write to files they don’t have access to. ■ We never use more than 500 KB of RAM. ■ The user_id key in the table is unique. ■ We never add a string to an integer. ○ These are Invariants
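Checking an invariant means confirming it holds in every reachable state. The Go sketch below does this for the first example in the list (two threads, one critical section) with a hand-rolled breadth-first search. It only illustrates the idea and is not how TLC is implemented; the state layout and the `useLock` switch are assumptions of the sketch.

```go
package main

import "fmt"

// state models two threads and one lock. inCS[i] reports whether thread i
// is inside the critical section.
type state struct {
	inCS   [2]bool
	locked bool
}

// next lists the successors of s. With useLock false the threads enter the
// critical section without acquiring the lock: a deliberately broken design.
func next(s state, useLock bool) []state {
	var out []state
	for i := 0; i < 2; i++ {
		t := s
		switch {
		case s.inCS[i]: // leave the critical section and release the lock
			t.inCS[i] = false
			t.locked = false
		case !useLock || !s.locked: // enter (acquiring the lock)
			t.inCS[i] = true
			t.locked = true
		default: // blocked waiting for the lock
			continue
		}
		out = append(out, t)
	}
	return out
}

// mutualExclusion is the invariant: never both threads in the critical section.
func mutualExclusion(s state) bool { return !(s.inCS[0] && s.inCS[1]) }

// holds explores every reachable state and reports whether inv is true in
// all of them.
func holds(useLock bool, inv func(state) bool) bool {
	start := state{}
	seen := map[state]bool{start: true}
	queue := []state{start}
	for len(queue) > 0 {
		s := queue[0]
		queue = queue[1:]
		if !inv(s) {
			return false
		}
		for _, t := range next(s, useLock) {
			if !seen[t] {
				seen[t] = true
				queue = append(queue, t)
			}
		}
	}
	return true
}

func main() {
	fmt.Println("with lock:   ", holds(true, mutualExclusion))
	fmt.Println("without lock:", holds(false, mutualExclusion))
}
```

A single reachable state that breaks the invariant is enough to reject the design, which is why safety violations always come with a finite counterexample.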

Slide 62

Slide 62 text

Model Checking Then what is Liveness? Every message is received at least once by each client.

Slide 63

Slide 63 text

Model Checking Then what is Liveness? Every message is received at least once by each client. ■ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future.

Slide 64

Slide 64 text

Model Checking Then what is Liveness? Every message is received at least once by each client. ■ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future. Temporal Logic

Slide 65

Slide 65 text

Model Checking Then what is Liveness? Every message is received at least once by each client. ■ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future. ■ The only way to break a liveness property is to show that at no point in the future does it ever become true.

Slide 66

Slide 66 text

Let’s look at how this can help debug Go concurrency primitives

Slide 67

Slide 67 text

Concurrency Bug : ● This was an actual error that was found in the very popular Gops library

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

unbuffered buffered

Slide 71

Slide 71 text

First this Then this

Slide 72

Slide 72 text

Only read from after for loop Blocked

Slide 73

Slide 73 text

Concurrency Bug : ● This was an actual error that was found in the very popular Gops library ● Hillel Wayne then demonstrates the bug with his TLA+ spec for the implementation and finds the deadlock condition
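The slide annotations (an unbuffered channel, a buffered channel, results only read after the for loop) describe a pattern that can be reproduced in a few lines. The sketch below is a simplified reconstruction, not the gops source: the names `findAll`, `limitCh`, `found` and the `drainEarly` switch are illustrative, and launching the loop in the background is just one possible fix, not necessarily the one applied upstream.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// findAll caps the number of running workers with a buffered channel and
// has each worker report on an unbuffered channel. With drainEarly false
// the results are only read after the launch loop has finished.
func findAll(items []int, limit int, drainEarly bool) []int {
	found := make(chan int)               // unbuffered
	limitCh := make(chan struct{}, limit) // buffered
	var wg sync.WaitGroup

	launch := func() {
		for _, it := range items {
			limitCh <- struct{}{} // blocks once `limit` workers hold a slot
			wg.Add(1)
			go func(v int) {
				defer wg.Done()
				defer func() { <-limitCh }()
				found <- v // blocks until somebody reads
			}(it)
		}
		go func() { wg.Wait(); close(found) }()
	}

	if drainEarly {
		go launch() // assumed fix: start workers in the background
	} else {
		launch() // bug: never returns when len(items) > limit
	}

	var results []int
	for v := range found {
		results = append(results, v)
	}
	return results
}

// finishes reports whether findAll returns within a short timeout.
func finishes(items []int, limit int, drainEarly bool) bool {
	done := make(chan struct{})
	go func() {
		findAll(items, limit, drainEarly)
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(500 * time.Millisecond):
		return false // the goroutines are stuck (and leaked, in this demo)
	}
}

func main() {
	items := []int{1, 2, 3, 4, 5}
	fmt.Println("read after loop finishes: ", finishes(items, 2, false))
	fmt.Println("read during loop finishes:", finishes(items, 2, true))
}
```

The buggy variant works whenever there are no more items than slots, which is why this kind of deadlock can go unnoticed in testing and only appear on machines with enough matching processes.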

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

Let’s design our own concurrent system in TLA+

Slide 78

Slide 78 text

Design ● Imagine a system, working a bit like a pipeline consisting of three steps:

Slide 79

Slide 79 text

Design ● Imagine a system, working a bit like a pipeline consisting of three steps: ● The Input: where one component handles some incoming data, does some initial processing, and then sends an event on a queue.

Slide 80

Slide 80 text

Design ● Imagine a system, working a bit like a pipeline consisting of three steps: ● The Input: where one component handles some incoming data, does some initial processing, and then sends an event on a queue. ● The processor: a component that will receive the event sent at 1, and do the actual heavy lifting in terms of processing, and then send the result on yet another queue

Slide 81

Slide 81 text

Design ● Imagine a system, working a bit like a pipeline consisting of three steps: ● The Input: where one component handles some incoming data, does some initial processing, and then sends an event on a queue. ● The processor: a component that will receive the event sent at 1, and do the actual heavy lifting in terms of processing, and then send the result on yet another queue. ● The output: where the output from 2 is further handled downstream.
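In Go, the three steps map naturally onto goroutines connected by channels. This is a minimal sketch of that shape only: the stage functions `input`, `processor` and `output`, and the placeholder work they do, are assumptions for illustration.

```go
package main

import "fmt"

// input does light preprocessing and emits one event per incoming item.
func input(items []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, it := range items {
			out <- "event:" + it
		}
	}()
	return out
}

// processor receives events from the first queue, does the heavy lifting
// (here just measuring the event length), and sends results on a second queue.
func processor(in <-chan string) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for ev := range in {
			out <- len(ev)
		}
	}()
	return out
}

// output consumes the processed results downstream.
func output(in <-chan int) []int {
	var results []int
	for r := range in {
		results = append(results, r)
	}
	return results
}

func main() {
	fmt.Println(output(processor(input([]string{"a", "bb", "ccc"}))))
}
```

Each stage closes its outgoing channel when its input is exhausted, so the pipeline shuts down cleanly from front to back.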

Slide 82

Slide 82 text

Design

Slide 83

Slide 83 text

Design

Slide 84

Slide 84 text

Design

Slide 85

Slide 85 text

What it looks like in TLA+

Slide 86

Slide 86 text

No content

Slide 87

Slide 87 text

No content

Slide 88

Slide 88 text

No content

Slide 89

Slide 89 text

Running Model Checker

Slide 90

Slide 90 text

Why does it deadlock? The problem was the following sequence of steps: ● A. Step 1 would process input, add it to the shared data, and send an event to Step 2 containing an identifier.

Slide 91

Slide 91 text

Why does it deadlock? The problem was the following sequence of steps: ● A. Step 1 would process input, add it to the shared data, and send an event to Step 2 containing an identifier. ● B. Step 2 would prune the shared data, and remove the object that was added above at 1.

Slide 92

Slide 92 text

Why does it deadlock? The problem was the following sequence of steps: ● A. Step 1 would process input, add it to the shared data, and send an event to Step 2 containing an identifier. ● B. Step 2 would prune the shared data, and remove the object that was added above at 1. ● C. Step 2 then received the event from step 1, and cannot find the object in the shared data.

Slide 93

Slide 93 text

Why does it deadlock? The problem was the following sequence of steps: ● The problem was a race condition between Step 1 adding the object to the shared data and sending an event to Step 2, and Step 2 pruning the object before handling that event
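The A, B, C sequence can be replayed deterministically. The Go sketch below models the shared storage as a map and the queue as a slice, then runs the steps in the problematic order. The type and method names are illustrative, and the pruning rule (drop everything) is a stand-in, since the real rule is not given in the slides.

```go
package main

import "fmt"

// system holds the shared storage and the queue between Step 1 and Step 2.
type system struct {
	sharedStorage map[int]bool
	queue         []int
}

// sendIncoming is Step 1: store the object, then enqueue its identifier.
func (s *system) sendIncoming(id int) {
	s.sharedStorage[id] = true
	s.queue = append(s.queue, id)
}

// prune is Step 2's housekeeping. In this sketch it drops every object,
// standing in for whatever pruning rule the real system used.
func (s *system) prune() {
	for id := range s.sharedStorage {
		delete(s.sharedStorage, id)
	}
}

// receiveIncoming is Step 2 handling one event. It reports false when the
// object behind the identifier is gone: the point where the system gets stuck.
func (s *system) receiveIncoming() bool {
	id := s.queue[0]
	s.queue = s.queue[1:]
	return s.sharedStorage[id]
}

// badSchedule replays steps A, B, C in the order that triggers the race.
func badSchedule() bool {
	s := &system{sharedStorage: map[int]bool{}}
	s.sendIncoming(1)          // A: add object, send event
	s.prune()                  // B: object removed before the event is handled
	return s.receiveIncoming() // C: event handled, object missing
}

func main() {
	fmt.Println("object found:", badSchedule())
}
```

A model checker finds this ordering automatically; here it is hard-coded to make the failure easy to read.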

Slide 94

Slide 94 text

Modeling the Fix : ● The fix consists of avoiding more than one concurrent step writing to shared_storage. ● The SendIncoming(id) step should only put the identifier on the queue, and only the ReceiveIncoming step should add the object to, and eventually prune from, the shared storage.
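The single-writer rule can be sketched in Go as follows. The method names mirror the step names on the slide; the rest (the map-based storage, pruning immediately after use) is an assumption made to keep the example short.

```go
package main

import "fmt"

// system holds the shared storage and the queue between the two steps.
type system struct {
	sharedStorage map[int]bool
	queue         []int
}

// sendIncoming now only enqueues the identifier; it no longer touches storage.
func (s *system) sendIncoming(id int) {
	s.queue = append(s.queue, id)
}

// receiveIncoming is the single writer: it adds the object, uses it, and
// prunes it afterwards. It reports whether the object was available.
func (s *system) receiveIncoming() bool {
	id := s.queue[0]
	s.queue = s.queue[1:]
	s.sharedStorage[id] = true
	found := s.sharedStorage[id] // processing would happen here
	delete(s.sharedStorage, id)  // eventual pruning, done by the same step
	return found
}

func main() {
	s := &system{sharedStorage: map[int]bool{}}
	s.sendIncoming(1)
	fmt.Println("object found:", s.receiveIncoming())
}
```

Because only one step ever writes to the storage, no ordering of the steps can remove an object between its creation and its use.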

Slide 95

Slide 95 text

Modeling the Fix :

Slide 96

Slide 96 text

Modeling the Fix :

Slide 97

Slide 97 text

New TLA+ spec :

Slide 98

Slide 98 text

Running Model Checker

Slide 99

Slide 99 text

So, was any of that relevant to actual systems in production?

Slide 100

Slide 100 text

Let’s drive this point home

Slide 101

Slide 101 text

Who uses this in production? ● This is not limited to modeling toy systems; it is used on real systems. Here is what Amazon engineers had to say; they also wrote a paper about how they used formal methods at AWS

Slide 102

Slide 102 text

No content

Slide 103

Slide 103 text

Who uses this in production? ● This is not limited to modeling toy systems; it is used on real systems. Here is what Amazon engineers had to say; they also wrote a paper about how they used formal methods at AWS. ● They used TLA+ in 10+ large, complex real-world systems

Slide 104

Slide 104 text

Who uses this in production? ● This is not limited to modeling toy systems; it is used on real systems. Here is what Amazon engineers had to say; they also wrote a paper about how they used formal methods at AWS. ● They used TLA+ in 10+ large, complex real-world systems ● In every case, TLA+ added significant value by preventing subtle, serious bugs that could have reached production

Slide 105

Slide 105 text

Who uses this in production? ● This is not limited to modeling toy systems; it is used on real systems. Here is what Amazon engineers had to say; they also wrote a paper about how they used formal methods at AWS. ● They used TLA+ in 10+ large, complex real-world systems ● In every case, TLA+ added significant value by preventing subtle, serious bugs that could have reached production ● And also gave them enough understanding and confidence to make aggressive optimisations without sacrificing correctness of their systems

Slide 106

Slide 106 text

Thanks for making it this far! Let’s conclude

Slide 107

Slide 107 text

What do programmers need to know about thinking about your code in this way? ● Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong

Slide 108

Slide 108 text

What do programmers need to know about thinking about your code in this way? ● Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong ● There is a need to think before coding, or more clearly, the need to Write before you code

Slide 109

Slide 109 text

What do programmers need to know about thinking about your code in this way? ● Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong ● There is a need to think before coding, or more clearly, the need to Write before you code ● Any piece of code that someone is likely to use or modify, needs to be specified in some way

Slide 110

Slide 110 text

What do programmers need to know about thinking about your code in this way? ● Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong ● There is a need to think before coding, or more clearly, the need to Write before you code ● Any piece of code that someone is likely to use or modify, needs to be specified in some way ● That someone can be you next month

Slide 111

Slide 111 text

What do programmers need to know about thinking about your code in this way? ● There is importance in specifying everything your code does and if required, how it does it

Slide 112

Slide 112 text

What do programmers need to know about thinking about your code in this way? ● There is importance in specifying everything your code does and if required, how it does it ● You should always specify your code “above” the code level, in terms of states and behaviours or input/output behaviours.

Slide 113

Slide 113 text

What do programmers need to know about thinking about your code in this way? ● There is importance in specifying everything your code does and if required, how it does it ● You should always specify your code “above” the code level, in terms of states and behaviours or input/output behaviours. ● Thinking mathematically provides a rigorous way of doing this, in precise terms and being as unambiguous as possible.

Slide 114

Slide 114 text

Why write formal specs? ● Disclaimer: finding bugs in code is not the intended purpose of writing a formal spec. Writing a formal spec is hard work; it requires thinking in a way that isn’t intuitive to most programmers, as we are generally used to implementing the ‘How’

Slide 115

Slide 115 text

Why write formal specs? ● Even if the above talk doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway

Slide 116

Slide 116 text

Why write formal specs? ● Even if the above talk doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway ● We learn to write programs by writing them, running them and then correcting the errors

Slide 117

Slide 117 text

Why write formal specs? ● Even if the above talk doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway ● We learn to write programs by writing them, running them and then correcting the errors ● We can learn to write formal specs by writing them, running them against a model checker and then correcting the errors

Slide 118

Slide 118 text

Why write formal specs? ● Even if the above talk doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway ● We learn to write programs by writing them, running them and then correcting the errors ● We can learn to write formal specs by writing them, running them against a model checker and then correcting the errors ● A great way to document your systems with precise language

Slide 119

Slide 119 text

References ● Leslie Lamport’s lectures on TLA+ ● Hillel Wayne’s lecture on “Tackling Concurrency Bugs with TLA+” ● Hillel Wayne’s TLA+ spec and fix for the Gops bug ● Leslie Lamport’s video on Thinking Above the Code ● Medium article for TLA+ spec for modeling concurrency bug and fix ● Gops issue link ● Amazon’s paper on Using Formal Verification Techniques in Production ● Image References ○ Speaker deck from Leslie Lamport’s lectures for Blueprint Spectrum ○ Simplified version of Gops FindAll function screenshots

Slide 120

Slide 120 text

Thank you!