TLA+ for programmers

As developers, we have a number of well-known practices to ensure code quality, such as unit tests, code review and so on. But these practices often break down when we need to design concurrent systems. Often, there can be subtle and serious bugs that are not found with conventional practices.

But there’s another approach that you can use -- model-checking -- that can detect potential concurrency errors at design time, and so dramatically increase your confidence in your code. In this talk, I’ll demonstrate and demystify TLA+, a powerful design and model-checking system. We’ll see how it can check your concurrent designs for errors, saving you time up front and frustration later!

Scott Wlaschin

June 10, 2020

Transcript

  1. Building confidence in concurrent code using a model checker (aka

    TLA+ for programmers) @ScottWlaschin fsharpforfunandprofit.com Warning – this talk will have too much information!
  2. People who have written concurrent code People who have had

    weird painful bugs in concurrent code Why concurrent code in particular?
  4. People who have written concurrent code People who have had

    weird painful bugs in concurrent code A perfect circle  Why concurrent code in particular?
  5. How many programmers are very confident about their code?

  6. "This code doesn't work and I don't know why"

  7. "This code works and I don't know why"

  8. Tools to improve confidence • Design – Domain driven design

    – Behavior driven design – Rapid prototyping – Modeling with UML etc • Coding – Static typing – Good libraries • Testing – TDD – Property-based testing – Canary testing
  9. Tools to improve confidence All of the above, plus •

    "Model checking"
  11. What is "model checking"? • Use a special DSL to

    design a "model" • Then "check" the model: – Are all the constraints met? – Does anything unexpected happen? – Does it deadlock? • This is part of a "formal methods" approach
  12. Two popular model checkers • TLA+ (TLC) – Focuses on

    temporal properties – Good for modeling concurrent systems • Alloy (Alloy Analyzer) – Focuses on relational logic – Good for modeling structures
  14. Here's what TLA+ looks like

    Start(s) ==
        serverState[s] = "online_v1"
        /\ ~(\E other \in servers : serverState[other] = "offline")
        /\ serverState' = [serverState EXCEPT ![s] = "offline"]

    Finish(s) ==
        serverState[s] = "offline"
        /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]

    UpgradeStep == \E s \in servers : Start(s) \/ Finish(s)

    Done ==
        \A s \in servers : serverState[s] = "online_v2"
        /\ UNCHANGED serverState

    Spec ==
        /\ Init
        /\ [][Next]_serverState
        /\ WF_serverState(UpgradeStep)

    By the end of the talk you should be able to make sense of it!
  16. Time for some live polling!

  17. bit.ly/tlapoll

  18. Poll #1 results: "Can you see this poll?" Link to

    live poll: bit.ly/tlapoll
  19. Outline of this talk • How confident are you? •

    Introducing TLA+ • Examples: – Using TLA+ for a simple model – Checking a Producer/Consumer model – Checking a zero-downtime deployment model
  20. Part I How confident are you?

  21. Here's a spec for a sort algorithm

    To sort a list:
    1) If the list is empty or has 1 element, it is already sorted. So just return it unchanged.
    2) Otherwise, take the first element (called the "pivot")
    3) Divide the remaining elements into two piles:
       * those < than the pivot
       * those > than the pivot
    4) Sort each of the two piles using this sort algorithm
    5) Return the sorted list by concatenating:
       * the sorted "smaller" list
       * then the pivot
       * then the sorted "bigger" list
  22. To sort a list: 1) If the list is empty

    or has 1 element, it is already sorted. So just return it unchanged. 2) Otherwise, take the first element (called the "pivot") 3) Divide the remaining elements into two piles: * those < than the pivot * those > than the pivot 4) Sort each of the two piles using this sort algorithm 5) Return the sorted list by concatenating: * the sorted "smaller" list * then the pivot * then the sorted "bigger" list Here's a spec for a sort algorithm Link to live poll: bit.ly/tlapoll
  23. Poll #2 results: " What is your confidence in the

    design of this sort algorithm?" Link to live poll: bit.ly/tlapoll
  25. Some approaches to gain confidence

    • Careful inspection and code review
    • Create an implementation and then test it thoroughly
      – E.g. Using property-based tests
    • Use a mathematical proof assistant tool
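
The "create an implementation and then test it" approach can be tried on the sort spec from the earlier slide. The following is a minimal Python sketch, assuming one literal reading of the five steps. The names `sort_per_spec` and `first_counterexample` are made up for this illustration and do not come from the talk. Instead of random property-based testing, it compares the result against Python's built-in `sorted` on every small list.

```python
from itertools import product


def sort_per_spec(items):
    """Sort a list by following the five steps of the spec literally."""
    # Step 1: an empty or single-element list is already sorted.
    if len(items) <= 1:
        return list(items)
    # Step 2: take the first element as the pivot.
    pivot, rest = items[0], items[1:]
    # Step 3: two piles, strictly smaller and strictly bigger than the pivot.
    smaller = [x for x in rest if x < pivot]
    bigger = [x for x in rest if x > pivot]
    # Steps 4 and 5: sort each pile recursively, then concatenate.
    return sort_per_spec(smaller) + [pivot] + sort_per_spec(bigger)


def first_counterexample(max_len=3, values=range(3)):
    """Exhaustively compare against sorted() on all small lists."""
    for length in range(max_len + 1):
        for candidate in product(values, repeat=length):
            if sort_per_spec(list(candidate)) != sorted(candidate):
                return list(candidate)
    return None


print(sort_per_spec([3, 1, 2]))   # [1, 2, 3]
print(first_counterexample())     # [0, 0]
```

Under this reading, elements equal to the pivot land in neither pile, so lists with duplicates come back shorter. Reading the spec alone makes this easy to miss; checking every small input exposes it straight away.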
  26. How confident are you when concurrency is involved?

  27. A concurrent producer/consumer system A queue Consumer spec (2 separate

    steps) 1) Check if queue is not empty 2) If true, then read item from queue Producer spec (2 separate steps) 1) Check if queue is not full 2) If true, then write item to queue Consumer reads from queue Producer writes to queue
  28. Given a bounded queue of items And 1 producer, 1

    consumer running concurrently Constraints: * never read from an empty queue * never add to a full queue Producer spec (separate steps) 1) Check if queue is not full 2) If true, then write item to queue 3) Go to step 1 Consumer spec (separate steps) 1) Check if queue is not empty 2) If true, then read item from queue 3) Go to step 1 A spec for a producer/consumer system Link to live poll: bit.ly/tlapoll
  29. Poll #3 results: "What is your confidence in the design

    of this producer/consumer system?" Link to live poll: bit.ly/tlapoll
  30. Given a bounded queue of items And 2 producers, 2

    consumers running concurrently Constraints: * never read from an empty queue * never add to a full queue Producer spec (separate steps) 1) Check if queue is not full 2) If true, then write item to queue 3) Go to step 1 Consumer spec (separate steps) 1) Check if queue is not empty 2) If true, then read item from queue 3) Go to step 1 A spec for a producer/consumer system Link to live poll: bit.ly/tlapoll
  31. Poll #4 results: " What is your confidence in the

    design of this producer/consumer system (now with multiple clients)?"
  32. Being confident in the design of concurrent systems is hard

  33. How to gain confidence for concurrency?

    • Careful inspection and code review
      – Human intuition for concurrency is very bad
    • Create an implementation and then test it
      – Many concurrency errors might never show up
    • Use a mathematical proof assistant tool
      – A model checker is much easier!
  34. Part II Introducing TLA+

  35. Stand Back! I'm going to use Mathematics!

  36. TLA+ was designed by Leslie Lamport – Famous "Time &

    Clocks" paper – Paxos algorithm for consensus – Turing award winner – Initial developer of LaTeX
  37. TLA+ stands for – Temporal – Logic – of Actions

    – plus …
  43. The "Logic" in TLA+

  44. Boolean Logic

    Boolean    Mathematics    TLA+      Programming
    AND        a ∧ b          a /\ b    a && b
    OR         a ∨ b          a \/ b    a || b
    NOT        ¬a             ~a        !a; not a

    You all know how this works, I hope!
  45. Boolean Logic

    A "predicate" is an expression that returns a boolean

    \* TLA-style definition
    operator(a,b,c) == (a /\ b) \/ (a /\ ~c)

    // programming language definition
    function(a,b,c) { (a && b) || (a && !c) }
  46. The "Actions" in TLA+ a.k.a. state transitions

  47. State A State B State C Transition from A to

    B A state machine Transition from B to A Transition from B to C
  48. White to play Black to play Game Over White plays

    and wins Black plays White plays Black plays and wins States and transitions for a chess game
  49. Undelivered Out for delivery Delivered Send out for delivery Address

    not found Signed for Failed Delivery Redeliver States and transitions for deliveries
  50. "hello" "goodbye" States and transitions in TLA+ State before State

    after state = "hello" In TLA+ state' = "goodbye" In TLA+ An "action"
  51. "hello" "goodbye" States and transitions in TLA+ Next == state

    = "hello" /\ state' = "goodbye" In TLA+, define the action "Next" like this Next Or in English: state before is "hello" AND state after is "goodbye"
  52. "hello" "goodbye" States and transitions in TLA+ Next == state

    = "hello" /\ state' = "goodbye" Next
  53. "hello" "goodbye" States and transitions in TLA+ Next == state'

    = "goodbye" /\ state = "hello" Next
  54. Actions are not assignments. Actions are tests

    state = "hello" /\ state' = "goodbye"

    "hello" -> "goodbye": does match
    "hello" -> "ciao": doesn't match
    "howdy" -> "goodbye": doesn't match
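
For programmers, one way to read "actions are tests" is as a boolean function over a (before, after) pair of states. The sketch below is only an analogy in Python; `next_action` is a made-up name, and nothing in it assigns to the state.

```python
def next_action(state, state_next):
    """The action Next, read as a test on a (before, after) pair of states."""
    return state == "hello" and state_next == "goodbye"


print(next_action("hello", "goodbye"))   # True  (does match)
print(next_action("hello", "ciao"))      # False (doesn't match)
print(next_action("howdy", "goodbye"))   # False (doesn't match)
```

Swapping the two conjuncts, as the slides do, does not change the result, because the function only tests and has no side effects.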
  55. The "Temporal" in TLA+

  56. TLA+ models a series of state transitions over time In

    TLA+ you can ask questions like: • Is something always true? • Is something ever true? • If X happens, must Y happen afterwards?
  57. Temporal Logic of Actions Boolean logic of state transitions over

    time
  61. Part III Using TLA+ for a simple model

  62. Count to three 1 2 3 // programming language version

    var x = 1 x = 2 x = 3
  63. Count to three 1 2 3

    \* TLA version
    Init ==                    \* initial state
        x=1
    Next ==                    \* transition
        (x=1 /\ x'=2)          \* match step 1
        \/ (x=2 /\ x'=3)       \* or match step 2
  64. Count to three 1 2 3 Init == x=1 Next

    == (x=1 /\ x'=2) \* match step 1 \/ (x=2 /\ x'=3) \* or match step 2
  68. A quick refactor

  69. Count to three, refactored 1 2 3 Init == x=1

    Step1 == x=1 /\ x'=2 Step2 == x=2 /\ x'=3 Next == Step1 \/ Step2 Refactored version. Steps are now explicitly named
  70. Count to three, refactored 1 2 3 Init == x=1

    Step1 == x=1 /\ x'=2 Step2 == x=2 /\ x'=3 Next == Step1 \/ Step2
  71. Introducing the TLA+ Toolbox (the IDE)

  72. This is the TLA+ Toolbox app

  73. b) Tell the model checker what the initial and next

    states are
  74. c) Run the model checker

  75. And if we run this script? • Detects "3 distinct

    states" – Good – what we expected • But also "Deadlock reached" – Bad!
  77. 1 2 3 So "Count to three" deadlocks when it

    reaches 3 If there is no valid transition available, that is what TLA+ calls a "deadlock"
  78. It's important to think of these state machines as an

    infinite series of state transitions. 1 2 3 ? ? ?
  79. When we're "done", we can say that a valid transition

    is from 3 to 3, forever 1 2 3 3 3 3
  80. Updated "Count to three" 1 2 3

    Init == x=1
    Step1 == x=1 /\ x'=2
    Step2 == x=2 /\ x'=3
    Done == x=3 /\ UNCHANGED x
    Next == Step1 \/ Step2 \/ Done
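
The deadlock report can be reproduced with a few lines of ordinary code. This is a rough Python sketch of the idea behind explicit-state exploration, not how TLC is actually implemented. The names `successors` and `explore` are invented here, and the `with_done` flag switches between the original and the updated "Count to three".

```python
from collections import deque


def successors(x, with_done):
    """All x' such that Next holds for the pair (x, x')."""
    result = []
    if x == 1:
        result.append(2)          # Step1 == x=1 /\ x'=2
    if x == 2:
        result.append(3)          # Step2 == x=2 /\ x'=3
    if with_done and x == 3:
        result.append(3)          # Done == x=3 /\ UNCHANGED x
    return result


def explore(with_done):
    """Breadth-first search from Init; returns (reachable states, deadlocked states)."""
    seen, deadlocks = {1}, []     # Init == x=1
    queue = deque([1])
    while queue:
        x = queue.popleft()
        nxt = successors(x, with_done)
        if not nxt:
            deadlocks.append(x)   # no valid transition from this state
        for y in nxt:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return seen, deadlocks


print(explore(with_done=False))   # ({1, 2, 3}, [3])
print(explore(with_done=True))    # ({1, 2, 3}, [])
```

Both runs find the same 3 distinct states. Only the version without `Done` has a state with no outgoing transition.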
  81. Doing nothing is always an option

  82. Staying in the same state is almost always a valid

    state transition! 1 1 2 2 3 3 What is the difference between these two systems? 1 2 3 1 -> 1 2 -> 2 3 -> 3
  83. "Count to three" with stuttering Init == x=1 Step1 ==

    x=1 /\ x'=2 Step2 == x=2 /\ x'=3 Done == x=3 /\ UNCHANGED x Next == Step1 \/ Step2 \/ Done \/ UNCHANGED x 1 2 3
  84. Part IV The Power of Temporal Properties

  85. Temporal properties A property applies to the whole system over

    time – Not just to individual states Checking these properties is important – Humans are bad at this – Programming languages are bad at this too – TLA+ is good at this!
  86. Useful properties to check • Always true – For all

    states, "x > 0" • Eventually true – At some point in time, "x = 2" • Eventually always – x eventually becomes 3 and then stays there • Leads to – if x ever becomes 2 then it will become 3 later
  87. Properties for "count to three"

    In English                                          Formally                       In TLA+
    x is always > 0                                     Always (x > 0)                 [] (x > 0)
    At some point x is 2                                Eventually (x = 2)             <> (x = 2)
    x eventually becomes 3 and then stays there.        Eventually (Always (x = 3))    <>[] (x = 3)
    if x ever becomes 2 then it will become 3 later.    (x=2) leads to (x=3)           (x=2) ~> (x=3)
  91. Adding properties to the script

    \* Always, x >= 1 && x <= 3
    AlwaysWithinBounds == [](x >= 1 /\ x <= 3)

    \* At some point, x = 2
    EventuallyTwo == <>(x = 2)

    \* At some point, x = 3 and stays there
    EventuallyAlwaysThree == <>[](x = 3)

    \* Whenever x=2, then x=3 later
    TwoLeadsToThree == (x = 2) ~> (x = 3)
  92. Tell the model checker what the properties are, and run

    the model checker again Adding properties to the model in the TLA+ toolbox
  93. Adding properties to the script \* Always, x >= 1

    && x <= 3 AlwaysWithinBounds == [](x >= 1 /\ x <= 3) \* At some point, x = 2 EventuallyTwo == <>(x = 2) \* At some point, x = 3 and stays there EventuallyAlwaysThree == <>[](x = 3) \* Whenever x=2, then x=3 later TwoLeadsToThree == (x = 2) ~> (x = 3) Link to live poll: bit.ly/tlapoll
  94. Poll #5 results: "How many of these properties are true?"

    Link to live poll: bit.ly/tlapoll
  95. Oh no! The model checker says there are errors!

  96. Who forgot about stuttering? 1 2 3

  97. How to fix this?

    • Make sure every possible transition is followed
    • Not just stay stuck in an infinite loop!

    This is called "fairness"
  98. How can we model fairness in TLA+? We have to

    do some refactoring first Then we can add fairness to the spec (warning: the syntax is a bit ugly)
  99. How to fix? Refactor #1: change the spec to merge init/next

    Init == x=1
    Step1 == x=1 /\ x'=2
    Step2 == x=2 /\ x'=3
    Done == x=3 /\ UNCHANGED x
    Next == Step1 \/ Step2 \/ Done
    Spec == Init /\ [](Next \/ UNCHANGED x)
  102. Refactor #2: Use a special syntax for stuttering

    Before: Spec == Init /\ [](Next \/ UNCHANGED x)

  103. Refactor #2: Use a special syntax for stuttering

    After: Spec == Init /\ [][Next]_x

  104. Refactor #3: Now we can add fairness!

    Spec == Init /\ [][Next]_x

  105. Refactor #3: Now we can add fairness!

    With fairness: Spec == Init /\ [][Next]_x /\ WF_x(Next)
  107. The complete spec with fairness

    Init == x=1
    Step1 == x=1 /\ x'=2
    Step2 == x=2 /\ x'=3
    Done == x=3 /\ UNCHANGED x
    Next == Step1 \/ Step2 \/ Done
    Spec == Init /\ [][Next]_x /\ WF_x(Next)

    \* properties to check
    AlwaysWithinBounds == [](x >= 1 /\ x <= 3)
    EventuallyTwo == <>(x = 2)
    EventuallyAlwaysThree == <>[](x = 3)
    TwoLeadsToThree == (x = 2) ~> (x = 3)
  109. Part V Using TLA+ to model the producer/consumer examples

  110. Modeling a Producer/Consumer system A queue Consumer spec (2 separate

    steps) 1) Check if queue is not empty 2) If true, then read item from queue Producer spec (2 separate steps) 1) Check if queue is not full 2) If true, then write item to queue Consumer reads from queue Producer writes to queue
  111. ready canWrite CheckWritable Write States for a Producer We're choosing

    to model this as two distinct state transitions, not one atomic step
  112. ready canWrite CheckWritable Write States for a Producer def CheckWritable():

    if (queueSize < MaxQueueSize) && (producerState = "ready") then producerState = "canWrite"; def Write(): if producerState = "canWrite" then producerState = "ready"; queueSize = queueSize + 1;
  113. States for a Producer CheckWritable == producerState = "ready" /\

    queueSize < MaxQueueSize /\ producerState' = "canWrite" \* transition /\ UNCHANGED queueSize ready canWrite CheckWritable Write Write == producerState = "canWrite" /\ producerState' = "ready" \* transition /\ queueSize' = queueSize + 1 \* push to queue ProducerAction == CheckWritable \/ Write All the valid actions for a producer
  114. States for a Consumer CheckReadable == consumerState = "ready" /\

    queueSize > 0 /\ consumerState' = "canRead" \* transition /\ UNCHANGED queueSize Read == consumerState = "canRead" /\ consumerState' = "ready" \* transition /\ queueSize' = queueSize - 1 \* pop from queue ConsumerAction == CheckReadable \/ Read ready canRead CheckReadable Read All the valid actions for a consumer
  115. Complete TLA+ script (1/2)

    VARIABLES queueSize, producerState, consumerState

    MaxQueueSize == 2 \* can be small

    Init ==
        queueSize = 0
        /\ producerState = "ready"
        /\ consumerState = "ready"

    CheckWritable ==
        producerState = "ready"
        /\ queueSize < MaxQueueSize
        /\ producerState' = "canWrite"
        /\ UNCHANGED queueSize
        /\ UNCHANGED consumerState

    Write ==
        producerState = "canWrite"
        /\ producerState' = "ready"
        /\ queueSize' = queueSize + 1
        /\ UNCHANGED consumerState

    ProducerAction == CheckWritable \/ Write
  116. Complete TLA+ script (2/2)

    CheckReadable ==
        consumerState = "ready"
        /\ queueSize > 0
        /\ consumerState' = "canRead"
        /\ UNCHANGED queueSize
        /\ UNCHANGED producerState

    Read ==
        consumerState = "canRead"
        /\ consumerState' = "ready"
        /\ queueSize' = queueSize - 1
        /\ UNCHANGED producerState

    ConsumerAction == CheckReadable \/ Read

    Next == ProducerAction \/ ConsumerAction

  118. Complete TLA+ script (2/2), with a stuttering step added to Next

    Next ==
        ProducerAction
        \/ ConsumerAction
        \/ (UNCHANGED producerState /\ UNCHANGED consumerState /\ UNCHANGED queueSize)
  119. AlwaysWithinBounds == [] (queueSize >= 0 /\ queueSize <= MaxQueueSize)

    What are the temporal properties for the producer/consumer design?
  120. And if we run this script? • Detects "8 distinct

    states" – Good • No errors! – Means invariant was always true. – We now have confidence in this design! – But only with a single producer/consumer We don't need to guess, as we did in the earlier poll!
  121. Now let's do a concurrent version!

  122. Time for the "Plus" in TLA+

  123. TLA plus… Set theory

    Set theory                                                    Mathematics      TLA+              Programming
    e is an element of set S                                      e ∈ S            e \in S
    Define a set by enumeration                                   {1,2,3}          {1,2,3}           [1,2,3]
    Define a set by predicate "p"                                 { e ∈ S | p }    {e \in S : p}     Set.filter(p)
    For all e in Set, some predicate "p" is true                  ∀ e ∈ S : p      \A e \in S : p    Set.all(p)
    There exists e in Set such that some predicate "p" is true    ∃ e ∈ S : p      \E e \in S : p    Set.any(p)
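
The "Programming" column above is deliberately generic. As one concrete rendering, here is how the same five ideas might look in Python. The set `S` and the example predicates are arbitrary choices for illustration only.

```python
S = {1, 2, 3}                               # define a set by enumeration

is_member = 2 in S                          # e \in S
evens = {e for e in S if e % 2 == 0}        # {e \in S : p}
all_positive = all(e > 0 for e in S)        # \A e \in S : p
some_big = any(e > 2 for e in S)            # \E e \in S : p

print(is_member, evens, all_positive, some_big)   # True {2} True True
```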
  126. • We need – a set of producers – a

    set of consumers • Need to use the set-description part of TLA+ producers={"p1","p2"} consumers={"c1","c2"}
  127. Producer/Consumer Spec, part 1

    CONSTANT producers, consumers
    \* e.g
    \* 2 producers={"p1","p2"}
    \* 2 consumers={"c1","c2"}

    VARIABLES queueSize, producerState, consumerState

    MaxQueueSize == 2

    Init ==
        queueSize = 0
        /\ producerState = [p \in producers |-> "ready"]  \* same as {"p1":"ready","p2":"ready"}
        /\ consumerState = [c \in consumers |-> "ready"]

    For each producer, set the state to be "ready"
  129. Producer/Consumer Spec, part 2

    CheckWritable(p) ==
        producerState[p] = "ready"
        /\ queueSize < MaxQueueSize
        /\ producerState' = [producerState EXCEPT ![p] = "canWrite"]
        /\ UNCHANGED queueSize
        /\ UNCHANGED consumerState

    Parameterized by a producer
    Check the state
    Update one element of the state map/dictionary only

  131. Producer/Consumer Spec, part 2

    Write(p) ==
        producerState[p] = "canWrite"
        /\ queueSize' = queueSize + 1
        /\ producerState' = [producerState EXCEPT ![p] = "ready"]
        /\ UNCHANGED consumerState

    ProducerAction ==
        \E p \in producers : CheckWritable(p) \/ Write(p)

    Find any producer which has a valid action
  133. Producer/Consumer Spec, part 3

    CheckReadable(c) ==
        consumerState[c] = "ready"
        /\ queueSize > 0
        /\ consumerState' = [consumerState EXCEPT ![c] = "canRead"]
        /\ UNCHANGED queueSize
        /\ UNCHANGED producerState

    Parameterized by a consumer
    Check the state
    Update one element of the state map/dictionary only

  135. Producer/Consumer Spec, part 3

    Read(c) ==
        consumerState[c] = "canRead"
        /\ queueSize' = queueSize - 1
        /\ consumerState' = [consumerState EXCEPT ![c] = "ready"]
        /\ UNCHANGED producerState

    ConsumerAction ==
        \E c \in consumers : CheckReadable(c) \/ Read(c)

    Find any consumer which has a valid action
  137. And if we run this script? • Run model checker

    with 2 producers, 2 consumers – And same "AlwaysWithinBounds" property • Detects 38 distinct states now – Too many for human inspection • Error: "Invariant AlwaysWithinBounds is violated" – We are confident that this design doesn't work! We don't need to guess, as we did in the earlier poll!
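
To see why the multi-client design fails, the same model can be explored by brute force in ordinary code. What follows is a hedged Python sketch of the four actions and the `AlwaysWithinBounds` invariant, using a plain breadth-first search. It is not how TLC works internally, and the helper names (`next_states`, `within_bounds`, `check`) are invented for this example. State counts for the failing case will not match TLC's report, since the search stops at the first violation it happens to reach.

```python
from collections import deque

MAX_QUEUE_SIZE = 2


def next_states(state):
    """Yield every state reachable in one step of Next."""
    queue_size, producers, consumers = state
    for i, p in enumerate(producers):
        if p == "ready" and queue_size < MAX_QUEUE_SIZE:      # CheckWritable(p)
            yield (queue_size, producers[:i] + ("canWrite",) + producers[i + 1:], consumers)
        if p == "canWrite":                                   # Write(p)
            yield (queue_size + 1, producers[:i] + ("ready",) + producers[i + 1:], consumers)
    for i, c in enumerate(consumers):
        if c == "ready" and queue_size > 0:                   # CheckReadable(c)
            yield (queue_size, producers, consumers[:i] + ("canRead",) + consumers[i + 1:])
        if c == "canRead":                                    # Read(c)
            yield (queue_size - 1, producers, consumers[:i] + ("ready",) + consumers[i + 1:])


def within_bounds(state):
    """The AlwaysWithinBounds invariant, evaluated on a single state."""
    return 0 <= state[0] <= MAX_QUEUE_SIZE


def check(n_producers, n_consumers):
    """Breadth-first search; returns (states seen so far, shortest violating trace or None)."""
    init = (0, ("ready",) * n_producers, ("ready",) * n_consumers)
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not within_bounds(state):
            trace = []
            while state is not None:      # walk back to Init
                trace.append(state)
                state = parent[state]
            return len(parent), trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return len(parent), None


print(check(1, 1))                        # (8, None)
_, trace = check(2, 2)
for step in trace:
    print(step)
```

With one producer and one consumer the search finds the same 8 distinct states and no violation. With two of each it returns a trace in which two clients both pass their check before either one acts, so the queue size leaves the range 0..2.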
  138. Fixing the error • TLA+ won't tell you how to

    fix it – You have to think! • But it is easy to test fixes: – Update the model with the fix • Atomic operations (or locks, or whatever) – Then rerun the model checker – You have confidence that the fix works (or not!) • All this in only 50 lines of code
  139. Part VI Using TLA+ to model zero-downtime deployment

  140. Using TLA+ as a tool to improve design The process

    is: – Sketch the design in TLA+ – Then check it with the model checker – Then fix it – Then check it again – Repeat until TLA+ says the design is correct Think of it as TDD but for concurrency design Red Green Remodel
  141. Modeling a zero-downtime deployment What to model – We have

    a bunch of servers – Each server must be upgraded from v1 to v2 – Each server goes offline during the upgrade Conditions to check – There must always be an online server – All servers must be upgraded eventually Idea credit: https://www.hillelwayne.com/post/modeling-deployments/
  142. Sketching the design

    Server state diagram: states Online(v1), Offline, Online(v2); transitions Start, Finish, Done

    \* a dictionary of key/value pairs: server => state
    VARIABLES serverState

    Init == serverState = [s \in servers |-> "online_v1"]

    Start(s) ==
        serverState[s] = "online_v1"
        /\ serverState' = [serverState EXCEPT ![s] = "offline"]

    Finish(s) ==
        serverState[s] = "offline"
        /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]

  143. Sketching the design

    \* try to find a server to start or finish
    UpgradeStep == \E s \in servers : Start(s) \/ Finish(s)

    \* done if ALL servers are finished
    Done ==
        \A s \in servers : serverState[s] = "online_v2"
        /\ UNCHANGED serverState

    \* overall state transition
    Next == UpgradeStep \/ Done
  144. Stop and check • Run the script now to check

    our assumptions – With 1 server: 3 distinct states (as expected) – With 2 servers: 9 distinct states – With 3 servers: 27 distinct states • The number of states gets large very quickly! – Eyeballing for errors will not work
  145. Now let's add some properties • Zero downtime – "Not

    all servers should be offline at once" • Upgrade should complete – "All servers should eventually be upgraded to v2" Temporal properties
  146. Temporal properties

    \* It is always true that there exists
    \* a server that is not offline (!= is /= in TLA+)
    ZeroDowntime ==
        [](\E s \in servers : serverState[s] /= "offline")

    Always, there exists a server, such that the state for that server is not "offline"

  147. Temporal properties

    \* It is always true that there exists
    \* a server that is not offline (!= is /= in TLA+)
    ZeroDowntime ==
        [](\E s \in servers : serverState[s] /= "offline")

    \* Eventually, all servers will be online at v2
    EventuallyUpgraded ==
        <>(\A s \in servers : serverState[s] = "online_v2")

    Eventually, for all servers, the state for that server is "v2"
  148. Running the script

    If we run this script with two servers
    Error: "Invariant ZeroDowntime is violated"

    The model checker trace shows us how:
    s1 -> "online_v1", s2 -> "online_v1"
    s1 -> "offline", s2 -> "online_v1"
    s1 -> "offline", s2 -> "offline" // boom!

    No problem, we think we have a fix for this
  149. Improving the design with upgrade condition

    Start(s) ==
        \* server is ready
        serverState[s] = "online_v1"
        \* NEW: there does not exist any other server which is offline
        /\ ~(\E other \in servers : serverState[other] = "offline")
        \* then transition
        /\ serverState' = [serverState EXCEPT ![s] = "offline"]

    A new condition for the Start action: You can only transition to "offline" if no other servers are offline.
  150. Running the script Now re-run this script with two servers

    • "ZeroDowntime" works – We have confidence in the design! • "EventuallyUpgraded" fails – Because of stuttering – But add fairness and it works again, yay! We now have confidence in the design!
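
The effect of the new `Start` condition on `ZeroDowntime` can also be checked by brute force. Below is a small Python sketch under the same caveats as before: it mimics the model with an explicit breadth-first search, the function names are invented here, and it only covers the safety property. `EventuallyUpgraded` and fairness are liveness concerns that this kind of simple reachability search does not address.

```python
from collections import deque


def next_states(state, guarded):
    """Yield successors under Start(s) / Finish(s); `guarded` enables the new Start condition."""
    any_offline = "offline" in state
    for i, s in enumerate(state):
        if s == "online_v1" and not (guarded and any_offline):   # Start(s)
            yield state[:i] + ("offline",) + state[i + 1:]
        if s == "offline":                                        # Finish(s)
            yield state[:i] + ("online_v2",) + state[i + 1:]


def zero_downtime(state):
    """There exists a server that is not offline."""
    return any(s != "offline" for s in state)


def check(n_servers, guarded):
    """Return (reachable states, reachable states that violate ZeroDowntime)."""
    init = ("online_v1",) * n_servers
    seen = {init}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        for nxt in next_states(state, guarded):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, [s for s in seen if not zero_downtime(s)]


states, bad = check(2, guarded=False)
print(len(states), bad)   # 9 [('offline', 'offline')]
states, bad = check(2, guarded=True)
print(len(states), bad)   # 8 []
```

Without the guard, two servers reach the same 9 states reported earlier, including the all-offline one. With the guard that state becomes unreachable and the invariant holds.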
  151. Adding another condition

    New rule! All online servers must be running the same version

    \* Define the set of servers which are online.
    OnlineServers == { s \in servers : serverState[s] /= "offline" }

    \* It is always true that
    \* any two online servers are the same version
    SameVersion ==
        [] (\A s1,s2 \in OnlineServers : serverState[s1] = serverState[s2])
  152. Running the script

    Now run this script with the new property
    Error: "Invariant SameVersion is violated"

    The model checker trace shows us how:
    s1 -> "online_v1", s2 -> "online_v1"
    s1 -> "offline", s2 -> "online_v1"
    s1 -> "online_v2", s2 -> "online_v1" // boom!

    Let's add a load balancer to fix this
  153. Improving the design with a load balancer

    VARIABLES serverState, loadBalancer

    \* initialize all servers to "online_v1"
    Init ==
        serverState = [s \in servers |-> "online_v1"]
        /\ loadBalancer = "v1"

    \* the online servers depend on the load balancer
    OnlineServers ==
        IF loadBalancer = "v1"
        THEN { s \in servers : serverState[s] = "online_v1" }
        ELSE { s \in servers : serverState[s] = "online_v2" }

    The load balancer points to only "v1" or "v2" servers
  154. Improving the design with a load balancer

    Finish(s) ==
        serverState[s] = "offline"
        /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]
        \* and load balancer can point to v2 pool now
        /\ loadBalancer' = "v2"

    Then, when one server has successfully upgraded, the load balancer can switch over to using v2
  155. Running the script Now re-run this script with the load

    balancer • "ZeroDowntime" works • "EventuallyUpgraded" works • "SameVersion" works
  156. Our sketch is complete (for now) Think of TLA+ as

    "agile" modeling for software systems A few minutes of sketching => much more confidence!
  157. Some common questions • How to handle failures? – Just

    add failure cases to the state diagram! • How does this model convert to code? – It doesn't! Modeling is a tool for thinking, not a code generator. – It's about having confidence in the design.
  158. Conclusion

    • TLA+ and model checking are not that scary
      – It's just agile modeling for software systems!
      – For concurrency, it's essential
      – Check it out! A bigger toolbox is a good thing to have
    • TLA+ can do much more than I showed today
      – Not just model checking, but refinements, proofs, etc
    • More information:
      – TLA+ Home Page with videos, book, papers, etc
      – learntla.com book (and trainings!) by Hillel Wayne
  159. Slides and video here fsharpforfunandprofit.com/tlaplus Thank you! "Domain Modeling Made

    Functional" book fsharpforfunandprofit.com/books @ScottWlaschin Me on twitter