Slide 1

Slide 1 text

Building confidence in concurrent code using a model checker (aka TLA+ for programmers) @ScottWlaschin fsharpforfunandprofit.com Warning – this talk will have too much information!

Slide 2

Slide 2 text

People who have written concurrent code People who have had weird painful bugs in concurrent code Why concurrent code in particular?

Slide 4

Slide 4 text

People who have written concurrent code People who have had weird painful bugs in concurrent code A perfect circle Why concurrent code in particular?

Slide 5

Slide 5 text

How many programmers are very confident about their code?

Slide 6

Slide 6 text

"This code doesn't work and I don't know why"

Slide 7

Slide 7 text

"This code works and I don't know why"

Slide 8

Slide 8 text

Tools to improve confidence • Design – Domain driven design – Behavior driven design – Rapid prototyping – Modeling with UML etc • Coding – Static typing – Good libraries • Testing – TDD – Property-based testing – Canary testing

Slide 9

Slide 9 text

Tools to improve confidence All of the above, plus • "Model checking"

Slide 11

Slide 11 text

What is "model checking"? • Use a special DSL to design a "model" • Then "check" the model: – Are all the constraints met? – Does anything unexpected happen? – Does it deadlock? • This is part of a "formal methods" approach

Slide 12

Slide 12 text

Two popular model checkers • TLA+ (TLC) – Focuses on temporal properties – Good for modeling concurrent systems • Alloy (Alloy Analyzer) – Focuses on relational logic – Good for modeling structures

Slide 14

Slide 14 text

Here's what TLA+ looks like
Start(s) ==
  serverState[s] = "online_v1"
  /\ ~(\E other \in servers : serverState[other] = "offline")
  /\ serverState' = [serverState EXCEPT ![s] = "offline"]
Finish(s) ==
  serverState[s] = "offline"
  /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]
UpgradeStep == \E s \in servers : Start(s) \/ Finish(s)
Done ==
  \A s \in servers : serverState[s] = "online_v2"
  /\ UNCHANGED serverState
Spec ==
  /\ Init
  /\ [][Next]_serverState
  /\ WF_serverState(UpgradeStep)

Slide 15

Slide 15 text

Here's what TLA+ looks like
Start(s) ==
  serverState[s] = "online_v1"
  /\ ~(\E other \in servers : serverState[other] = "offline")
  /\ serverState' = [serverState EXCEPT ![s] = "offline"]
Finish(s) ==
  serverState[s] = "offline"
  /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]
UpgradeStep == \E s \in servers : Start(s) \/ Finish(s)
Done ==
  \A s \in servers : serverState[s] = "online_v2"
  /\ UNCHANGED serverState
Spec ==
  /\ Init
  /\ [][Next]_serverState
  /\ WF_serverState(UpgradeStep)
By the end of the talk you should be able to make sense of it!

Slide 16

Slide 16 text

Time for some live polling!

Slide 17

Slide 17 text

bit.ly/tlapoll

Slide 18

Slide 18 text

Poll #1 results: "Can you see this poll?" Link to live poll: bit.ly/tlapoll

Slide 19

Slide 19 text

Outline of this talk • How confident are you? • Introducing TLA+ • Examples: – Using TLA+ for a simple model – Checking a Producer/Consumer model – Checking a zero-downtime deployment model

Slide 20

Slide 20 text

Part I How confident are you?

Slide 21

Slide 21 text

To sort a list: 1) If the list is empty or has 1 element, it is already sorted. So just return it unchanged. 2) Otherwise, take the first element (called the "pivot") 3) Divide the remaining elements into two piles: * those < than the pivot * those > than the pivot 4) Sort each of the two piles using this sort algorithm 5) Return the sorted list by concatenating: * the sorted "smaller" list * then the pivot * then the sorted "bigger" list Here's a spec for a sort algorithm

Slide 22

Slide 22 text

To sort a list: 1) If the list is empty or has 1 element, it is already sorted. So just return it unchanged. 2) Otherwise, take the first element (called the "pivot") 3) Divide the remaining elements into two piles: * those < than the pivot * those > than the pivot 4) Sort each of the two piles using this sort algorithm 5) Return the sorted list by concatenating: * the sorted "smaller" list * then the pivot * then the sorted "bigger" list Here's a spec for a sort algorithm Link to live poll: bit.ly/tlapoll

Slide 23

Slide 23 text

Poll #2 results: "What is your confidence in the design of this sort algorithm?" Link to live poll: bit.ly/tlapoll

Slide 24

Slide 24 text

To sort a list: 1) If the list is empty or has 1 element, it is already sorted. So just return it unchanged. 2) Otherwise, take the first element (called the "pivot") 3) Divide the remaining elements into two piles: * those < than the pivot * those > than the pivot 4) Sort each of the two piles using this sort algorithm 5) Return the sorted list by concatenating: * the sorted "smaller" list * then the pivot * then the sorted "bigger" list Here's a spec for a sort algorithm

Slide 25

Slide 25 text

Some approaches to gain confidence • Careful inspection and code review • Create an implementation and then test it thoroughly – e.g. using property-based tests • Use a mathematical proof assistant tool
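
As a rough illustration of the "implement it and test it" approach, here is a minimal Python sketch that transcribes the sort spec from the earlier slide literally and compares it with Python's built-in `sorted`. The names `spec_sort` and `find_counterexample` are made up for this example, and the random loop is a hand-rolled stand-in for a real property-based testing library.

```python
import random

def spec_sort(items):
    # Literal transcription of the five-step spec.
    if len(items) <= 1:                                      # step 1
        return list(items)
    pivot, rest = items[0], items[1:]                        # step 2
    smaller = [x for x in rest if x < pivot]                 # step 3
    bigger = [x for x in rest if x > pivot]
    return spec_sort(smaller) + [pivot] + spec_sort(bigger)  # steps 4 and 5

def find_counterexample(trials=1000, seed=0):
    # Return a random list on which spec_sort disagrees with sorted(), or None.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(0, 5) for _ in range(rng.randint(0, 6))]
        if spec_sort(xs) != sorted(xs):
            return xs
    return None
```

On lists without repeated values the two agree: `spec_sort([3, 1, 2])` gives `[1, 2, 3]`. Read literally, step 3 puts elements equal to the pivot in neither pile, so `spec_sort([2, 1, 2])` gives `[1, 2]`. Random inputs drawn from a small value range tend to surface this quickly.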

Slide 26

Slide 26 text

How confident are you when concurrency is involved?

Slide 27

Slide 27 text

A concurrent producer/consumer system A queue Consumer spec (2 separate steps) 1) Check if queue is not empty 2) If true, then read item from queue Producer spec (2 separate steps) 1) Check if queue is not full 2) If true, then write item to queue Consumer reads from queue Producer writes to queue

Slide 28

Slide 28 text

Given a bounded queue of items And 1 producer, 1 consumer running concurrently Constraints: * never read from an empty queue * never add to a full queue Producer spec (separate steps) 1) Check if queue is not full 2) If true, then write item to queue 3) Go to step 1 Consumer spec (separate steps) 1) Check if queue is not empty 2) If true, then read item from queue 3) Go to step 1 A spec for a producer/consumer system Link to live poll: bit.ly/tlapoll

Slide 29

Slide 29 text

Poll #3 results: "What is your confidence in the design of this producer/consumer system?" Link to live poll: bit.ly/tlapoll

Slide 30

Slide 30 text

Given a bounded queue of items And 2 producers, 2 consumers running concurrently Constraints: * never read from an empty queue * never add to a full queue Producer spec (separate steps) 1) Check if queue is not full 2) If true, then write item to queue 3) Go to step 1 Consumer spec (separate steps) 1) Check if queue is not empty 2) If true, then read item from queue 3) Go to step 1 A spec for a producer/consumer system Link to live poll: bit.ly/tlapoll

Slide 31

Slide 31 text

Poll #4 results: "What is your confidence in the design of this producer/consumer system (now with multiple clients)?"

Slide 32

Slide 32 text

Being confident in the design of concurrent systems is hard

Slide 33

Slide 33 text

How to gain confidence for concurrency? • Careful inspection and code review – Human intuition for concurrency is very bad • Create an implementation and then test it – Many concurrency errors might never show up • Use a mathematical proof assistant tool – A model checker is much easier!

Slide 34

Slide 34 text

Part II Introducing TLA+

Slide 35

Slide 35 text

Stand Back! I'm going to use Mathematics!

Slide 36

Slide 36 text

TLA+ was designed by Leslie Lamport – Famous "Time & Clocks" paper – Paxos algorithm for consensus – Turing award winner – Initial developer of LaTeX

Slide 37

Slide 37 text

TLA+ stands for – Temporal – Logic – of Actions – plus …

Slide 43

Slide 43 text

The "Logic" in TLA+

Slide 44

Slide 44 text

Boolean Logic
Boolean   Mathematics   TLA+      Programming
AND       a ∧ b         a /\ b    a && b
OR        a ∨ b         a \/ b    a || b
NOT       ¬a            ~a        !a; not a
You all know how this works, I hope!

Slide 45

Slide 45 text

Boolean Logic
A "predicate" is an expression that returns a boolean
\* TLA-style definition
operator(a,b,c) == (a /\ b) \/ (a /\ ~c)
// programming language definition
function(a,b,c) { (a && b) || (a && !c) }

Slide 46

Slide 46 text

The "Actions" in TLA+ a.k.a. state transitions

Slide 47

Slide 47 text

State A State B State C Transition from A to B A state machine Transition from B to A Transition from B to C

Slide 48

Slide 48 text

White to play Black to play Game Over White plays and wins Black plays White plays Black plays and wins States and transitions for a chess game

Slide 49

Slide 49 text

Undelivered Out for delivery Delivered Send out for delivery Address not found Signed for Failed Delivery Redeliver States and transitions for deliveries

Slide 50

Slide 50 text

"hello" "goodbye" States and transitions in TLA+ State before State after state = "hello" In TLA+ state' = "goodbye" In TLA+ An "action"

Slide 51

Slide 51 text

"hello" "goodbye" States and transitions in TLA+ Next == state = "hello" /\ state' = "goodbye" In TLA+, define the action "Next" like this Next Or in English: state before is "hello" AND state after is "goodbye"

Slide 52

Slide 52 text

"hello" "goodbye" States and transitions in TLA+ Next == state = "hello" /\ state' = "goodbye" Next

Slide 53

Slide 53 text

"hello" "goodbye" States and transitions in TLA+ Next == state' = "goodbye" /\ state = "hello" Next

Slide 54

Slide 54 text

Actions are not assignments. Actions are tests
state = "hello" /\ state' = "goodbye"
"hello" -> "goodbye": does match
"hello" -> "ciao": doesn't match
"howdy" -> "goodbye": doesn't match
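
For programmers, one way to read "actions are tests" is as a boolean function over a pair of states. The sketch below is only an analogy in Python (the names `next_action` and `candidate_steps` are invented here); it is not how TLA+ evaluates actions internally.

```python
# Mirror of the TLA+ action:  Next == state = "hello" /\ state' = "goodbye"
def next_action(before: str, after: str) -> bool:
    # A predicate over (state before, state after); nothing is assigned.
    return before == "hello" and after == "goodbye"

candidate_steps = [("hello", "goodbye"), ("hello", "ciao"), ("howdy", "goodbye")]

# Keep only the steps that the action accepts.
matches = [step for step in candidate_steps if next_action(*step)]
```

Only the first pair passes the test, which matches the three cases on the slide.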

Slide 55

Slide 55 text

The "Temporal" in TLA+

Slide 56

Slide 56 text

TLA+ models a series of state transitions over time In TLA+ you can ask questions like: • Is something always true? • Is something ever true? • If X happens, must Y happen afterwards?

Slide 57

Slide 57 text

Temporal Logic of Actions Boolean logic of state transitions over time

Slide 61

Slide 61 text

Part III Using TLA+ for a simple model

Slide 62

Slide 62 text

Count to three
1 2 3
// programming language version
var x = 1
x = 2
x = 3

Slide 63

Slide 63 text

Count to three
1 2 3
\* TLA version
Init == \* initial state
  x=1
Next == \* transition
  (x=1 /\ x'=2) \* match step 1
  \/ (x=2 /\ x'=3) \* or match step 2

Slide 64

Slide 64 text

Count to three
1 2 3
Init == x=1
Next ==
  (x=1 /\ x'=2) \* match step 1
  \/ (x=2 /\ x'=3) \* or match step 2

Slide 68

Slide 68 text

A quick refactor

Slide 69

Slide 69 text

Count to three, refactored
1 2 3
Init == x=1
Step1 == x=1 /\ x'=2
Step2 == x=2 /\ x'=3
Next == Step1 \/ Step2
Refactored version. Steps are now explicitly named

Slide 70

Slide 70 text

Count to three, refactored
1 2 3
Init == x=1
Step1 == x=1 /\ x'=2
Step2 == x=2 /\ x'=3
Next == Step1 \/ Step2

Slide 71

Slide 71 text

Introducing the TLA+ Toolbox (the IDE)

Slide 72

Slide 72 text

This is the TLA+ Toolbox app

Slide 73

Slide 73 text

b) Tell the model checker what the initial and next states are

Slide 74

Slide 74 text

c) Run the model checker

Slide 75

Slide 75 text

And if we run this script? • Detects "3 distinct states" – Good – what we expected • But also "Deadlock reached" – Bad!

Slide 77

Slide 77 text

1 2 3 So "Count to three" deadlocks when it reaches 3 If there is no valid transition available, that is what TLA+ calls a "deadlock"
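
To make the deadlock concrete, here is a small Python sketch of what an explicit-state checker does with "Count to three". It is an assumption-laden toy, not TLC: `explore` and the finite `universe` of candidate next values are invented for this example.

```python
def next_action(x, x_next):
    # Next == (x=1 /\ x'=2) \/ (x=2 /\ x'=3)
    return (x == 1 and x_next == 2) or (x == 2 and x_next == 3)

def explore(init, action, universe):
    # Return (reachable states, states with no outgoing transition).
    seen, frontier, deadlocks = {init}, [init], []
    while frontier:
        x = frontier.pop()
        successors = [y for y in universe if action(x, y)]
        if not successors:
            deadlocks.append(x)      # no valid transition from here
        for y in successors:
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen, deadlocks

states, deadlocks = explore(1, next_action, universe=range(0, 10))
```

With this `Next`, the search reaches the three states 1, 2 and 3 and reports 3 as the one state with nowhere to go, which is the situation TLA+ calls a deadlock.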

Slide 78

Slide 78 text

It's important to think of these state machines as an infinite series of state transitions. 1 2 3 ? ? ?

Slide 79

Slide 79 text

When we're "done", we can say that a valid transition is from 3 to 3, forever 1 2 3 3 3 3

Slide 80

Slide 80 text

Updated "Count to three"
1 2 3
Init == x=1
Step1 == x=1 /\ x'=2
Step2 == x=2 /\ x'=3
Done == x=3 /\ UNCHANGED x
Next == Step1 \/ Step2 \/ Done

Slide 81

Slide 81 text

Doing nothing is always an option

Slide 82

Slide 82 text

Staying in the same state is almost always a valid state transition! 1 1 2 2 3 3 What is the difference between these two systems? 1 2 3 1 -> 1 2 -> 2 3 -> 3

Slide 83

Slide 83 text

"Count to three" with stuttering Init == x=1 Step1 == x=1 /\ x'=2 Step2 == x=2 /\ x'=3 Done == x=3 /\ UNCHANGED x Next == Step1 \/ Step2 \/ Done \/ UNCHANGED x 1 2 3

Slide 84

Slide 84 text

Part IV The Power of Temporal Properties

Slide 85

Slide 85 text

Temporal properties A property applies to the whole system over time – Not just to individual states Checking these properties is important – Humans are bad at this – Programming languages are bad at this too – TLA+ is good at this!

Slide 86

Slide 86 text

Useful properties to check • Always true – For all states, "x > 0" • Eventually true – At some point in time, "x = 2" • Eventually always – x eventually becomes 3 and then stays there • Leads to – if x ever becomes 2 then it will become 3 later

Slide 87

Slide 87 text

Properties for "count to three"
In English         Formally          In TLA+
x is always > 0    Always (x > 0)    [] (x > 0)

Slide 88

Slide 88 text

Properties for "count to three"
In English              Formally              In TLA+
x is always > 0         Always (x > 0)        [] (x > 0)
At some point x is 2    Eventually (x = 2)    <> (x = 2)

Slide 89

Slide 89 text

Properties for "count to three"
In English                                      Formally                       In TLA+
x is always > 0                                 Always (x > 0)                 [] (x > 0)
At some point x is 2                            Eventually (x = 2)             <> (x = 2)
x eventually becomes 3 and then stays there.    Eventually (Always (x = 3))    <>[] (x = 3)

Slide 90

Slide 90 text

Properties for "count to three"
In English                                          Formally                       In TLA+
x is always > 0                                     Always (x > 0)                 [] (x > 0)
At some point x is 2                                Eventually (x = 2)             <> (x = 2)
x eventually becomes 3 and then stays there.        Eventually (Always (x = 3))    <>[] (x = 3)
if x ever becomes 2 then it will become 3 later.    (x=2) leads to (x=3)           (x=2) ~> (x=3)

Slide 91

Slide 91 text

Adding properties to the script
\* Always, x >= 1 && x <= 3
AlwaysWithinBounds == [](x >= 1 /\ x <= 3)
\* At some point, x = 2
EventuallyTwo == <>(x = 2)
\* At some point, x = 3 and stays there
EventuallyAlwaysThree == <>[](x = 3)
\* Whenever x=2, then x=3 later
TwoLeadsToThree == (x = 2) ~> (x = 3)

Slide 92

Slide 92 text

Tell the model checker what the properties are, and run the model checker again Adding properties to the model in the TLA+ toolbox

Slide 93

Slide 93 text

Adding properties to the script
\* Always, x >= 1 && x <= 3
AlwaysWithinBounds == [](x >= 1 /\ x <= 3)
\* At some point, x = 2
EventuallyTwo == <>(x = 2)
\* At some point, x = 3 and stays there
EventuallyAlwaysThree == <>[](x = 3)
\* Whenever x=2, then x=3 later
TwoLeadsToThree == (x = 2) ~> (x = 3)
Link to live poll: bit.ly/tlapoll

Slide 94

Slide 94 text

Poll #5 results: "How many of these properties are true?" Link to live poll: bit.ly/tlapoll

Slide 95

Slide 95 text

Oh no! The model checker says there are errors!

Slide 96

Slide 96 text

Who forgot about stuttering? 1 2 3

Slide 97

Slide 97 text

How to fix this? • Make sure every possible transition is followed • Not just stay stuck in an infinite loop! This is called "fairness"
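
Why stuttering breaks the properties can be seen with a toy evaluator. The Python sketch below treats an infinite behavior as a finite `prefix` followed by a `cycle` that repeats forever; that representation and all function names are assumptions made for this example, not part of TLA+ or TLC.

```python
def always(p, prefix, cycle):
    return all(p(s) for s in prefix + cycle)

def eventually(p, prefix, cycle):
    return any(p(s) for s in prefix + cycle)

def eventually_always(p, prefix, cycle):
    # Only the repeating part matters for "from some point on, forever".
    return all(p(s) for s in cycle)

def leads_to(p, q, prefix, cycle):
    path = prefix + cycle
    for i, s in enumerate(path):
        # q must hold at position i or later; the cycle always comes around again.
        if p(s) and not any(q(t) for t in path[i:] + cycle):
            return False
    return True

stuck = ([], [1])        # x stutters at 1 forever
fair = ([1, 2], [3])     # 1, 2, then 3 forever

def is_two(x):
    return x == 2

def is_three(x):
    return x == 3
```

On `stuck`, `eventually(is_two, *stuck)` and `eventually_always(is_three, *stuck)` are both false, so a spec that allows endless stuttering cannot satisfy them. On `fair`, all four kinds of property hold. Fairness is what rules out behaviors like `stuck`.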

Slide 98

Slide 98 text

How can we model fairness in TLA+? We have to do some refactoring first Then we can add fairness to the spec (warning: the syntax is a bit ugly)

Slide 99

Slide 99 text

How to fix?
Refactor #1: change the spec to merge init/next
Init == x=1
Step1 == x=1 /\ x'=2
Step2 == x=2 /\ x'=3
Done == x=3 /\ UNCHANGED x
Next == Step1 \/ Step2 \/ Done
Spec == Init /\ [](Next \/ UNCHANGED x)

Slide 102

Slide 102 text

Refactor #2: Use a special syntax for stuttering
Before: Spec == Init /\ [](Next \/ UNCHANGED x)

Slide 103

Slide 103 text

Refactor #2: Use a special syntax for stuttering
After: Spec == Init /\ [][Next]_x

Slide 104

Slide 104 text

Refactor #3: Now we can add fairness!
Spec == Init /\ [][Next]_x

Slide 105

Slide 105 text

Refactor #3: Now we can add fairness!
With fairness: Spec == Init /\ [][Next]_x /\ WF_x(Next)

Slide 107

Slide 107 text

The complete spec with fairness
Init == x=1
Step1 == x=1 /\ x'=2
Step2 == x=2 /\ x'=3
Done == x=3 /\ UNCHANGED x
Next == Step1 \/ Step2 \/ Done
Spec == Init /\ [][Next]_x /\ WF_x(Next)
\* properties to check
AlwaysWithinBounds == [](x >= 1 /\ x <= 3)
EventuallyTwo == <>(x = 2)
EventuallyAlwaysThree == <>[](x = 3)
TwoLeadsToThree == (x = 2) ~> (x = 3)

Slide 109

Slide 109 text

Part V Using TLA+ to model the producer/consumer examples

Slide 110

Slide 110 text

Modeling a Producer/Consumer system A queue Consumer spec (2 separate steps) 1) Check if queue is not empty 2) If true, then read item from queue Producer spec (2 separate steps) 1) Check if queue is not full 2) If true, then write item to queue Consumer reads from queue Producer writes to queue

Slide 111

Slide 111 text

States for a Producer
ready --CheckWritable--> canWrite --Write--> ready
We're choosing to model this as two distinct state transitions, not one atomic step

Slide 112

Slide 112 text

States for a Producer
ready --CheckWritable--> canWrite --Write--> ready
def CheckWritable():
  if (queueSize < MaxQueueSize) && (producerState = "ready") then
    producerState = "canWrite";
def Write():
  if producerState = "canWrite" then
    producerState = "ready";
    queueSize = queueSize + 1;

Slide 113

Slide 113 text

States for a Producer
ready --CheckWritable--> canWrite --Write--> ready
CheckWritable ==
  producerState = "ready"
  /\ queueSize < MaxQueueSize
  /\ producerState' = "canWrite" \* transition
  /\ UNCHANGED queueSize
Write ==
  producerState = "canWrite"
  /\ producerState' = "ready" \* transition
  /\ queueSize' = queueSize + 1 \* push to queue
ProducerAction == CheckWritable \/ Write
All the valid actions for a producer

Slide 114

Slide 114 text

States for a Consumer
ready --CheckReadable--> canRead --Read--> ready
CheckReadable ==
  consumerState = "ready"
  /\ queueSize > 0
  /\ consumerState' = "canRead" \* transition
  /\ UNCHANGED queueSize
Read ==
  consumerState = "canRead"
  /\ consumerState' = "ready" \* transition
  /\ queueSize' = queueSize - 1 \* pop from queue
ConsumerAction == CheckReadable \/ Read
All the valid actions for a consumer

Slide 115

Slide 115 text

Complete TLA+ script (1/2)
VARIABLES queueSize, producerState, consumerState
MaxQueueSize == 2 \* can be small
Init ==
  queueSize = 0
  /\ producerState = "ready"
  /\ consumerState = "ready"
CheckWritable ==
  producerState = "ready"
  /\ queueSize < MaxQueueSize
  /\ producerState' = "canWrite"
  /\ UNCHANGED queueSize
  /\ UNCHANGED consumerState
Write ==
  producerState = "canWrite"
  /\ producerState' = "ready"
  /\ queueSize' = queueSize + 1
  /\ UNCHANGED consumerState
ProducerAction == CheckWritable \/ Write

Slide 116

Slide 116 text

Complete TLA+ script (2/2)
CheckReadable ==
  consumerState = "ready"
  /\ queueSize > 0
  /\ consumerState' = "canRead"
  /\ UNCHANGED queueSize
  /\ UNCHANGED producerState
Read ==
  consumerState = "canRead"
  /\ consumerState' = "ready"
  /\ queueSize' = queueSize - 1
  /\ UNCHANGED producerState
ConsumerAction == CheckReadable \/ Read
Next == ProducerAction \/ ConsumerAction

Slide 118

Slide 118 text

Complete TLA+ script (2/2)
CheckReadable ==
  consumerState = "ready"
  /\ queueSize > 0
  /\ consumerState' = "canRead"
  /\ UNCHANGED queueSize
  /\ UNCHANGED producerState
Read ==
  consumerState = "canRead"
  /\ consumerState' = "ready"
  /\ queueSize' = queueSize - 1
  /\ UNCHANGED producerState
ConsumerAction == CheckReadable \/ Read
Next ==
  ProducerAction
  \/ ConsumerAction
  \/ (UNCHANGED producerState /\ UNCHANGED consumerState /\ UNCHANGED queueSize)

Slide 119

Slide 119 text

What are the temporal properties for the producer/consumer design?
AlwaysWithinBounds ==
  [] (queueSize >= 0 /\ queueSize <= MaxQueueSize)

Slide 120

Slide 120 text

And if we run this script? • Detects "8 distinct states" – Good • No errors! – Means invariant was always true. – We now have confidence in this design! – But only with a single producer/consumer We don't need to guess, as we did in the earlier poll!

Slide 121

Slide 121 text

Now let's do a concurrent version!

Slide 122

Slide 122 text

Time for the "Plus" in TLA+

Slide 123

Slide 123 text

TLA plus… Set theory
Set theory                                                    Mathematics      TLA+              Programming
e is an element of set S                                      e ∈ S            e \in S
Define a set by enumeration                                   {1,2,3}          {1,2,3}           [1,2,3]
Define a set by predicate "p"                                 { e ∈ S | p }    {e \in S : p}     Set.filter(p)
For all e in Set, some predicate "p" is true                  ∀ e ∈ S : p      \A e \in S : p    Set.all(p)
There exists e in Set such that some predicate "p" is true    ∃ e ∈ S : p      \E e \in S : p    Set.any(p)
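
The "Programming" column can be tried out directly. This short Python sketch uses built-in sets and comprehensions as the nearest equivalents; the set `S` and predicate `p` are arbitrary choices for illustration.

```python
S = {1, 2, 3, 4}                      # define a set by enumeration

def p(e):                             # some predicate "p"
    return e % 2 == 0

is_member = 2 in S                    # 2 \in S
evens = {e for e in S if p(e)}        # {e \in S : p}
all_even = all(p(e) for e in S)       # \A e \in S : p
some_even = any(p(e) for e in S)      # \E e \in S : p
```

With these choices, `evens` is `{2, 4}`, `all_even` is false and `some_even` is true.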

Slide 126

Slide 126 text

• We need – a set of producers – a set of consumers • Need to use the set-description part of TLA+ producers={"p1","p2"} consumers={"c1","c2"}

Slide 127

Slide 127 text

Producer/Consumer Spec, part 1
CONSTANT producers, consumers
\* e.g
\* 2 producers={"p1","p2"}
\* 2 consumers={"c1","c2"}
VARIABLES queueSize, producerState, consumerState
MaxQueueSize == 2
Init ==
  queueSize = 0
  /\ producerState = [p \in producers |-> "ready"]
  \* same as {"p1":"ready","p2":"ready"}
  /\ consumerState = [c \in consumers |-> "ready"]

Slide 128

Slide 128 text

Producer/Consumer Spec, part 1
CONSTANT producers, consumers
\* e.g
\* 2 producers={"p1","p2"}
\* 2 consumers={"c1","c2"}
VARIABLES queueSize, producerState, consumerState
MaxQueueSize == 2
Init ==
  queueSize = 0
  /\ producerState = [p \in producers |-> "ready"]
  \* same as {"p1":"ready","p2":"ready"}
  /\ consumerState = [c \in consumers |-> "ready"]
For each producer, set the state to be "ready"

Slide 129

Slide 129 text

Producer/Consumer Spec, part 2
CheckWritable(p) ==
  producerState[p] = "ready"
  /\ queueSize < MaxQueueSize
  /\ producerState' = [producerState EXCEPT ![p] = "canWrite"]
  /\ UNCHANGED queueSize
  /\ UNCHANGED consumerState

Slide 130

Slide 130 text

CheckWritable(p) ==
  producerState[p] = "ready"
  /\ queueSize < MaxQueueSize
  /\ producerState' = [producerState EXCEPT ![p] = "canWrite"]
  /\ UNCHANGED queueSize
  /\ UNCHANGED consumerState
CheckWritable(p): parameterized by a producer
producerState[p] = "ready": check the state
[producerState EXCEPT ![p] = "canWrite"]: update one element of the state map/dictionary only
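
In programming terms, the function constructor and `EXCEPT` behave roughly like building a dictionary and then making an updated copy of it. The Python sketch below is an analogy only; `except_update` is a name invented for this example.

```python
producers = {"p1", "p2"}

# [p \in producers |-> "ready"]
producer_state = {p: "ready" for p in producers}

def except_update(mapping, key, value):
    # Rough analogue of [mapping EXCEPT ![key] = value]:
    # returns a new dict and leaves the original untouched.
    return {**mapping, key: value}

# producerState' = [producerState EXCEPT !["p1"] = "canWrite"]
producer_state_next = except_update(producer_state, "p1", "canWrite")
```

The copy matters: the "before" state (`producer_state`) and the "after" state (`producer_state_next`) both exist at once, just as `producerState` and `producerState'` do in an action.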

Slide 131

Slide 131 text

Producer/Consumer Spec, part 2
CheckWritable(p) ==
  producerState[p] = "ready"
  /\ queueSize < MaxQueueSize
  /\ producerState' = [producerState EXCEPT ![p] = "canWrite"]
  /\ UNCHANGED queueSize
  /\ UNCHANGED consumerState
Write(p) ==
  producerState[p] = "canWrite"
  /\ queueSize' = queueSize + 1
  /\ producerState' = [producerState EXCEPT ![p] = "ready"]
  /\ UNCHANGED consumerState
ProducerAction ==
  \E p \in producers : CheckWritable(p) \/ Write(p)

Slide 132

Slide 132 text

Producer/Consumer Spec, part 2
CheckWritable(p) ==
  producerState[p] = "ready"
  /\ queueSize < MaxQueueSize
  /\ producerState' = [producerState EXCEPT ![p] = "canWrite"]
  /\ UNCHANGED queueSize
  /\ UNCHANGED consumerState
Write(p) ==
  producerState[p] = "canWrite"
  /\ queueSize' = queueSize + 1
  /\ producerState' = [producerState EXCEPT ![p] = "ready"]
  /\ UNCHANGED consumerState
ProducerAction ==
  \E p \in producers : CheckWritable(p) \/ Write(p)
Find any producer which has a valid action

Slide 133

Slide 133 text

Producer/Consumer Spec, part 3
CheckReadable(c) ==
  consumerState[c] = "ready"
  /\ queueSize > 0
  /\ consumerState' = [consumerState EXCEPT ![c] = "canRead"]
  /\ UNCHANGED queueSize
  /\ UNCHANGED producerState

Slide 134

Slide 134 text

CheckReadable(c) ==
  consumerState[c] = "ready"
  /\ queueSize > 0
  /\ consumerState' = [consumerState EXCEPT ![c] = "canRead"]
  /\ UNCHANGED queueSize
  /\ UNCHANGED producerState
CheckReadable(c): parameterized by a consumer
consumerState[c] = "ready": check the state
[consumerState EXCEPT ![c] = "canRead"]: update one element of the state map/dictionary only

Slide 135

Slide 135 text

Producer/Consumer Spec, part 3
CheckReadable(c) ==
  consumerState[c] = "ready"
  /\ queueSize > 0
  /\ consumerState' = [consumerState EXCEPT ![c] = "canRead"]
  /\ UNCHANGED queueSize
  /\ UNCHANGED producerState
Read(c) ==
  consumerState[c] = "canRead"
  /\ queueSize' = queueSize - 1
  /\ consumerState' = [consumerState EXCEPT ![c] = "ready"]
  /\ UNCHANGED producerState
ConsumerAction ==
  \E c \in consumers : CheckReadable(c) \/ Read(c)

Slide 136

Slide 136 text

CheckReadable(c) ==
  consumerState[c] = "ready"
  /\ queueSize > 0
  /\ consumerState' = [consumerState EXCEPT ![c] = "canRead"]
  /\ UNCHANGED queueSize
  /\ UNCHANGED producerState
Read(c) ==
  consumerState[c] = "canRead"
  /\ queueSize' = queueSize - 1
  /\ consumerState' = [consumerState EXCEPT ![c] = "ready"]
  /\ UNCHANGED producerState
ConsumerAction ==
  \E c \in consumers : CheckReadable(c) \/ Read(c)
Find any consumer which has a valid action

Slide 137

Slide 137 text

And if we run this script? • Run model checker with 2 producers, 2 consumers – And same "AlwaysWithinBounds" property • Detects 38 distinct states now – Too many for human inspection • Error: "Invariant AlwaysWithinBounds is violated" – We are confident that this design doesn't work! We don't need to guess, as we did in the earlier poll!

Slide 138

Slide 138 text

Fixing the error • TLA+ won't tell you how to fix it – You have to think! • But it is easy to test fixes: – Update the model with the fix • Atomic operations (or locks, or whatever) – Then rerun the model checker – You have confidence that the fix works (or not!) • All this in only 50 lines of code
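
To show what "update the model with the fix, then rerun" can look like, here is a Python sketch of the multi-client model with a switch between the original two-step design and one possible fix (check and write/read as a single atomic step). The `atomic` flag, the tuple-based state layout and the helper names are assumptions for this example; a lock-based fix would be modeled differently.

```python
MAX_QUEUE_SIZE = 2

def replace_at(states, i, value):
    # Copy of the tuple with position i changed (like EXCEPT ![i] = value).
    return states[:i] + (value,) + states[i + 1:]

def successors(state, atomic):
    q, prods, cons = state
    result = []
    for i, p in enumerate(prods):
        if atomic:
            if q < MAX_QUEUE_SIZE:                    # check and write in one step
                result.append((q + 1, prods, cons))
        elif p == "ready" and q < MAX_QUEUE_SIZE:     # CheckWritable(p)
            result.append((q, replace_at(prods, i, "canWrite"), cons))
        elif p == "canWrite":                         # Write(p)
            result.append((q + 1, replace_at(prods, i, "ready"), cons))
    for i, c in enumerate(cons):
        if atomic:
            if q > 0:                                 # check and read in one step
                result.append((q - 1, prods, cons))
        elif c == "ready" and q > 0:                  # CheckReadable(c)
            result.append((q, prods, replace_at(cons, i, "canRead")))
        elif c == "canRead":                          # Read(c)
            result.append((q - 1, prods, replace_at(cons, i, "ready")))
    return result

def find_violation(n_producers, n_consumers, atomic):
    # Breadth-first search; returns the first out-of-bounds state, or None.
    init = (0, ("ready",) * n_producers, ("ready",) * n_consumers)
    seen, frontier = {init}, [init]
    while frontier:
        state = frontier.pop(0)
        if not 0 <= state[0] <= MAX_QUEUE_SIZE:
            return state
        for nxt in successors(state, atomic):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None
```

In this sketch `find_violation(2, 2, atomic=False)` returns an out-of-bounds state, while `find_violation(2, 2, atomic=True)` returns `None`. Whether an atomic step is realistic for a given system is a separate design question that the model cannot answer.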

Slide 139

Slide 139 text

Part VI Using TLA+ to model zero-downtime deployment

Slide 140

Slide 140 text

Using TLA+ as a tool to improve design The process is: – Sketch the design in TLA+ – Then check it with the model checker – Then fix it – Then check it again – Repeat until TLA+ says the design is correct Think of it as TDD but for concurrency design Red Green Remodel

Slide 141

Slide 141 text

Modeling a zero-downtime deployment What to model – We have a bunch of servers – Each server must be upgraded from v1 to v2 – Each server goes offline during the upgrade Conditions to check – There must always be an online server – All servers must be upgraded eventually Idea credit: https://www.hillelwayne.com/post/modeling-deployments/

Slide 142

Slide 142 text

Sketching the design
Server state: Online(v1) --Start--> Offline --Finish--> Online(v2) (Done)
\* a dictionary of key/value pairs: server => state
VARIABLES serverState
Init == serverState = [s \in servers |-> "online_v1"]
Start(s) ==
  serverState[s] = "online_v1"
  /\ serverState' = [serverState EXCEPT ![s] = "offline"]
Finish(s) ==
  serverState[s] = "offline"
  /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]

Slide 143

Slide 143 text

Sketching the design
Server state: Online(v1) --Start--> Offline --Finish--> Online(v2) (Done)
\* try to find a server to start or finish
UpgradeStep ==
  \E s \in servers : Start(s) \/ Finish(s)
\* done if ALL servers are finished
Done ==
  \A s \in servers : serverState[s] = "online_v2"
  /\ UNCHANGED serverState
\* overall state transition
Next == UpgradeStep \/ Done

Slide 144

Slide 144 text

Stop and check • Run the script now to check our assumptions – With 1 server: 3 distinct states (as expected) – With 2 servers: 9 distinct states – With 3 servers: 27 distinct states • The number of states gets large very quickly! – Eyeballing for errors will not work
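
The state counts follow from each server moving independently through three states. A brute-force Python sketch (invented helper names, one tuple entry per server; a toy, not TLC) reproduces them:

```python
def successors(state):
    # state is a tuple with one entry per server.
    result = []
    for i, s in enumerate(state):
        if s == "online_v1":                 # Start(s)
            result.append(state[:i] + ("offline",) + state[i + 1:])
        elif s == "offline":                 # Finish(s)
            result.append(state[:i] + ("online_v2",) + state[i + 1:])
    return result

def count_states(n_servers):
    # Count every state reachable from "all servers online at v1".
    init = ("online_v1",) * n_servers
    seen, frontier = {init}, [init]
    while frontier:
        for nxt in successors(frontier.pop()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)

counts = [count_states(n) for n in (1, 2, 3)]
```

`counts` comes out as `[3, 9, 27]`, that is 3 to the power of the number of servers, which is why eyeballing stops working so quickly.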

Slide 145

Slide 145 text

Now let's add some properties • Zero downtime – "Not all servers should be offline at once" • Upgrade should complete – "All servers should eventually be upgraded to v2" Temporal properties

Slide 146

Slide 146 text

Temporal properties
\* It is always true that there exists
\* a server that is not offline (!= is /= in TLA)
ZeroDowntime ==
  [](\E s \in servers : serverState[s] /= "offline")
Always, there exists a server, such that the state for that server is not "offline"

Slide 147

Slide 147 text

Temporal properties
\* It is always true that there exists
\* a server that is not offline (!= is /= in TLA)
ZeroDowntime ==
  [](\E s \in servers : serverState[s] /= "offline")
\* Eventually, all servers will be online at v2
EventuallyUpgraded ==
  <>(\A s \in servers : serverState[s] = "online_v2")
Eventually, for all servers, the state for that server is "online_v2"

Slide 148

Slide 148 text

Running the script
If we run this script with two servers
Error: "Invariant ZeroDowntime is violated"
The model checker trace shows us how:
s1 -> "online_v1", s2 -> "online_v1"
s1 -> "offline", s2 -> "online_v1"
s1 -> "offline", s2 -> "offline" // boom!
No problem, we think we have a fix for this

Slide 149

Slide 149 text

Improving the design with upgrade condition
Start(s) ==
  \* server is ready
  serverState[s] = "online_v1"
  \* NEW: there does not exist any other server which is offline
  /\ ~(\E other \in servers : serverState[other] = "offline")
  \* then transition
  /\ serverState' = [serverState EXCEPT ![s] = "offline"]
A new condition for the Start action: You can only transition to "offline" if no other servers are offline.
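
The effect of the new condition can be checked with a small search that also records how each state was reached. This Python sketch is a toy under stated assumptions: the `guarded` flag switches the new Start condition on or off, and `find_trace` is an invented helper, not a TLC feature.

```python
def successors(state, guarded):
    result = []
    for i, s in enumerate(state):
        if s == "online_v1":                                  # Start(s)
            if guarded and "offline" in state:
                continue                                      # another server is offline
            result.append(state[:i] + ("offline",) + state[i + 1:])
        elif s == "offline":                                  # Finish(s)
            result.append(state[:i] + ("online_v2",) + state[i + 1:])
    return result

def zero_downtime(state):
    # At least one server is not offline.
    return any(s != "offline" for s in state)

def find_trace(n_servers, guarded):
    # Breadth-first search; returns the shortest trace to a violation, or None.
    init = ("online_v1",) * n_servers
    parent, frontier = {init: None}, [init]
    while frontier:
        state = frontier.pop(0)
        if not zero_downtime(state):
            trace = []
            while state is not None:          # walk back to the initial state
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state, guarded):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None
```

With two servers, `find_trace(2, guarded=False)` returns a three-state trace ending with both servers offline, and `find_trace(2, guarded=True)` returns `None`.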

Slide 150

Slide 150 text

Running the script Now re-run this script with two servers • "ZeroDowntime" works – We have confidence in the design! • "EventuallyUpgraded" fails – Because of stuttering – But add fairness and it works again, yay! We now have confidence in the design!

Slide 151

Slide 151 text

Adding another condition
New rule! All online servers must be running the same version
\* Define the set of servers which are online.
OnlineServers ==
  { s \in servers : serverState[s] /= "offline" }
\* It is always true that
\* any two online servers are the same version
SameVersion ==
  [] (\A s1,s2 \in OnlineServers : serverState[s1] = serverState[s2])

Slide 152

Slide 152 text

Running the script
Now run this script with the new property
Error: "Invariant SameVersion is violated"
The model checker trace shows us how:
s1 -> "online_v1", s2 -> "online_v1"
s1 -> "offline", s2 -> "online_v1"
s1 -> "online_v2", s2 -> "online_v1" // boom!
Let's add a load balancer to fix this

Slide 153

Slide 153 text

Improving the design with a load balancer
VARIABLES serverState, loadBalancer
\* initialize all servers to "online_v1"
Init ==
  serverState = [s \in servers |-> "online_v1"]
  /\ loadBalancer = "v1"
\* the online servers depend on the load balancer
OnlineServers ==
  IF loadBalancer = "v1"
  THEN { s \in servers : serverState[s] = "online_v1" }
  ELSE { s \in servers : serverState[s] = "online_v2" }
The load balancer points to only "v1" or "v2" servers

Slide 154

Slide 154 text

Improving the design with a load balancer
Finish(s) ==
  serverState[s] = "offline"
  /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]
  \* and load balancer can point to v2 pool now
  /\ loadBalancer' = "v2"
Then, when one server has successfully upgraded, the load balancer can switch over to using v2
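
A brute-force check of "SameVersion" with and without the load balancer might look like the Python sketch below. It assumes the guarded Start action from the earlier slide, models the state as `(server states, load balancer)`, and uses a `use_lb` flag invented for this example to switch between the two definitions of the online servers.

```python
def successors(state, use_lb):
    servers, lb = state
    result = []
    for i, s in enumerate(servers):
        if s == "online_v1" and "offline" not in servers:     # guarded Start(s)
            result.append((servers[:i] + ("offline",) + servers[i + 1:], lb))
        elif s == "offline":                                  # Finish(s)
            new_lb = "v2" if use_lb else lb                   # switch the pool
            result.append((servers[:i] + ("online_v2",) + servers[i + 1:], new_lb))
    return result

def online_servers(state, use_lb):
    servers, lb = state
    if use_lb:
        # Only servers in the pool the load balancer points to count as online.
        return [s for s in servers if s == "online_" + lb]
    return [s for s in servers if s != "offline"]

def same_version(state, use_lb):
    return len(set(online_servers(state, use_lb))) <= 1

def find_violation(n_servers, use_lb):
    # Breadth-first search; returns the first state breaking SameVersion, or None.
    init = (("online_v1",) * n_servers, "v1")
    seen, frontier = {init}, [init]
    while frontier:
        state = frontier.pop(0)
        if not same_version(state, use_lb):
            return state
        for nxt in successors(state, use_lb):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None
```

Without the load balancer, the search stops at a state with one server on v2 and the other still on v1. With it, no violation is found, largely because the online set is now defined by the pool the load balancer points to.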

Slide 155

Slide 155 text

Running the script Now re-run this script with the load balancer • "ZeroDowntime" works • "EventuallyUpgraded" works • "SameVersion" works

Slide 156

Slide 156 text

Our sketch is complete (for now) Think of TLA+ as "agile" modeling for software systems A few minutes of sketching => much more confidence!

Slide 157

Slide 157 text

Some common questions • How to handle failures? – Just add failure cases to the state diagram! • How does this model convert to code? – It doesn't! Modeling is a tool for thinking, not a code generator. – It's about having confidence in the design.

Slide 158

Slide 158 text

Conclusion • TLA+ and model checking are not that scary – It's just agile modeling for software systems! – For concurrency, it's essential – Check it out! A bigger toolbox is a good thing to have • TLA+ can do much more than I showed today – Not just model checking, but refinements, proofs, etc • More information: – TLA+ Home Page with videos, book, papers, etc – learntla.com book (and trainings!) by Hillel Wayne

Slide 159

Slide 159 text

Slides and video here fsharpforfunandprofit.com/tlaplus Thank you! "Domain Modeling Made Functional" book fsharpforfunandprofit.com/books @ScottWlaschin Me on twitter