Slide 1

Slide 1 text

A TLA+ intro for Software engineers Frédéric G. MARAND / [email protected] SREDay Amsterdam 2025-06-27

Slide 2

Slide 2 text

Where am I speaking from? ● On-demand software architect / lead dev at OSInet.fr ● Specialty: scaling up backends ● Main stacks ○ Go, Kafka, AWS, IaC ○ PHP, Drupal, MongoDB ● Main apps: publishing, payments, platform infra ● Business domains: ○ Media: LeFigaro, Radio France, FranceTV… ○ Retail: Deliveroo, LeBonCoin ○ Government: Medicare, France.fr, CNRS, Sante.fr, ...

Slide 3

Slide 3 text

Contents 1. Introduction 2. Logic Foundation 3. TLA+ Basics 4. PlusCal 5. In Practice

Slide 4

Slide 4 text

1. Introduction 1. What is TLA+ ? A formal specification language for concurrent/distributed systems 2. Why use formal methods ? Find design bugs before implementation 3. Success stories Amazon S3, Azure Cosmos DB, Apache Kafka 4. Value proposition Cost of bugs vs specification

Slide 5

Slide 5 text

1.1 What is TLA+ ? ● A formal specification language for concurrent and distributed systems ○ NOT a programming language ● Created by Leslie Lamport, ○ Of LaTeX fame. And it shows… TLA+ started as a LaTeX extension ● Mathematical foundation for describing systems ○ Based on the Zermelo–Fraenkel set theory with the axiom of choice (ZFC) ● Focus on system behavior, not implementation ○ So not transpilable to any usual programming language

Slide 6

Slide 6 text

Turing Award: "For fundamental contributions to the theory and practice of distributed and concurrent systems, notably the invention of concepts such as causality and logical clocks, safety and liveness, replicated state machines, and sequential consistency" Image: self-portrait of Leslie Lamport, from the TLA+ course

Slide 7

Slide 7 text

1.2 Why use it ? ● Find design flaws before implementation ○ The best implementation is at most as good as the design underlying it ○ Actual implementations… ● Verify complex algorithms and protocols ○ Models can test all relevant timing combinations ● Complement traditional testing ○ Unit tests only check the cases devs were able to discover ○ Fuzz tests only check the cases devs were able to generate ● Most expensive bugs are design flaws ○ No implementation work can fix an undiscovered race condition

Slide 8

Slide 8 text

1.3 Success stories ● Amazon S3: Found critical bugs in internal protocols ○ How formal methods helped AWS to design amazing services ● Azure Cosmos DB: Verified consistency protocols ○ Understanding Inconsistency in Azure Cosmos DB with TLA+ ● Apache Kafka: Validated replication protocol ○ KIP-966 for KRaft ○ Also other KIPs, e.g. KIP-848: new consumer group rebalance protocol ● Each saved months of debugging and prevented outages

Slide 9

Slide 9 text

1.4 Value proposition ● Cost of fixing bugs: ○ Design phase: 1x ○ Development phase: 10x ○ Production phase: 100x ● TLA+ helps find bugs that testing misses ○ Focus on the sad path ○ Even fuzzing tends to only find crashing bugs, not invariant violations ● Especially valuable for distributed systems ○ Human comprehension is poor regarding relative temporality

Slide 10

Slide 10 text

2. Logic foundation 1. Traditional logic vs temporal logic 2. State machines: System as states 3. Actions: State transitions 4. Simple example: Traffic light 5. Temporal properties: Safety vs Liveness 6. Model checking basics

Slide 11

Slide 11 text

2.1 Traditional logic vs temporal logic ● ∀P: P = P

Slide 12

Slide 12 text

2.1 Traditional logic vs temporal logic ● ∀P: P = P ● Is it ALWAYS true ?

Slide 13

Slide 13 text

2.1 Traditional logic vs temporal logic ● ∀P: P = P ● Is it ALWAYS true ? ● How about evaluating P at two different instants ?

Slide 14

Slide 14 text

2.1 Traditional logic vs temporal logic ● ∀P: P = P ● Is it ALWAYS true ? ● How about evaluating P at two different instants ? ● How about P = “the process is running now” ?

Slide 15

Slide 15 text

2.1 Traditional logic vs temporal logic ● ∀P: P = P ● Is it ALWAYS true ? ● How about evaluating P at two different instants ? ● How about P = “the process is running now” ? ● Traditional logic evaluates at an abstract instant in time

Slide 16

Slide 16 text

2.1 Traditional logic vs temporal logic ● ∀P: P = P ● Is it ALWAYS true ? ● How about evaluating P at two different instants ? ● How about P = “the process is running now” ? ● Traditional logic evaluates at an abstract instant in time (“now”) ● Temporal logic adds the time dimension: “now” changes

Slide 17

Slide 17 text

2.2 Systems as states ● The notion that a predicate can be evaluated at different times adds a quantified time component ● The values of the system at these time quanta are states ○ Within one state, traditional logic applies ● Three main temporal logics: ○ Linear Temporal Logic (Pnueli, 77): evaluates properties over sequences of states ■ That’s the one TLA+ and many derived tools combine with first-order logic ○ Computation Tree Logic (Clarke/Emerson, 81): combines temporal operators with path quantifiers to specify properties over computation trees ○ Branching Temporal Logic (multiple, 80s): evaluates over branching, capturing multiple possible future paths from a state

Slide 18

Slide 18 text

2.2 Example: light switch - two actions

Slide 19

Slide 19 text

2.2 Example: light switch - two actions

Slide 20

Slide 20 text

2.2 Example: light switch - single action

Slide 21

Slide 21 text

2.3 Actions: state transitions ● Actions: ○ Define how the system takes a step from one state to another ○ Described using preconditions and postconditions. ○ Example: Toggle action for the second light switch. ● Formal Representation: use primed variables ○ Precondition: Light = Off : the value in the current state ○ Postcondition: Light' = On : the value in the next state
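A minimal sketch of how the light switch could be written down, assuming a single variable named light with the string values "On" and "Off" (the names are illustrative, not taken from the slide images). It shows both the two-action form and the single Toggle form.

```tla
---- MODULE light_switch ----
VARIABLE light          \* hypothetical name for the switch state

Init == light = "Off"

\* Two actions: each pairs a precondition (unprimed) with a postcondition (primed)
TurnOn  == light = "Off" /\ light' = "On"
TurnOff == light = "On"  /\ light' = "Off"

\* Single-action form: allows the same steps as TurnOn \/ TurnOff
Toggle == light' = IF light = "Off" THEN "On" ELSE "Off"

Next == TurnOn \/ TurnOff
Spec == Init /\ [][Next]_light
====
```

Note that light' = "On" is a statement about the next state, not an assignment: an action is just a formula that is true or false of a pair of states.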

Slide 22

Slide 22 text

2.4 Simple example: traffic lights ● Usually, traffic lights run in a loop ○ Green ○ Orange ○ Red ○ restart from top

Slide 23

Slide 23 text

2.4 Simple example: traffic lights
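One possible way to write the Green, Orange, Red loop as a specification. This is a sketch under assumptions: the variable name light, the string values and the choice of Green as initial state are illustrative.

```tla
---- MODULE traffic_lights ----
VARIABLE light

Colors == {"Green", "Orange", "Red"}

Init == light = "Green"      \* assumed starting colour

\* Exactly one transition is enabled in each state, so the light runs in a loop
Next == \/ light = "Green"  /\ light' = "Orange"
        \/ light = "Orange" /\ light' = "Red"
        \/ light = "Red"    /\ light' = "Green"

Spec == Init /\ [][Next]_light

\* Candidate invariant: the light only ever shows a known colour
TypeOK == light \in Colors
====
```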

Slide 24

Slide 24 text

2.4bis Simple example: Spanish traffic lights In Spain, however, traffic lights go to yellow before going from Red to Green. But this has a problem…

Slide 25

Slide 25 text

2.4bis Simple example: Spanish traffic lights The model allows sequence: 1. Red 2. Yellow 3. Red 4. Yellow 5. … That’s not fair!

Slide 26

Slide 26 text

2.4ter Simple example: Spanish traffic lights For the model to run in a cycle, we can add another variable to the state, forcing the choice on Yellow
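A sketch of what the extra variable could look like. Only the names light and going_to_green come from this deck; the action names, the initial state and the exact shape of the transitions are assumptions, not the author's model.

```tla
---- MODULE traffic_lights_spain ----
VARIABLES light, going_to_green

vars == <<light, going_to_green>>

Init == light = "Red" /\ going_to_green = FALSE   \* assumed initial state

\* Leaving Red or Green records which way the cycle is heading
RedToYellow   == light = "Red"   /\ light' = "Yellow" /\ going_to_green' = TRUE
GreenToYellow == light = "Green" /\ light' = "Yellow" /\ going_to_green' = FALSE

\* On Yellow the recorded direction forces the choice
YellowToGreen == light = "Yellow" /\ going_to_green
                    /\ light' = "Green" /\ UNCHANGED going_to_green
YellowToRed   == light = "Yellow" /\ ~going_to_green
                    /\ light' = "Red" /\ UNCHANGED going_to_green

Next == RedToYellow \/ YellowToGreen \/ GreenToYellow \/ YellowToRed

Spec == Init /\ [][Next]_vars

TypeOK == /\ light \in {"Red", "Yellow", "Green"}
          /\ going_to_green \in BOOLEAN
====
```

Without going_to_green, both Yellow transitions would be enabled at once, which is what lets a model bounce between Red and Yellow forever.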

Slide 27

Slide 27 text

2.5 Temporal Properties: Safety vs Liveness ● Safety Properties: ○ Ensure that something bad never happens. ○ Example: "The traffic light is never both red and green." ● Liveness Properties: ○ Ensure that something good eventually happens. ○ Example: "The traffic light will eventually turn green." ● Which one did our first attempt for Spanish traffic lights fail ? ● But we did not define a rule, so that could not be checked
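To make the distinction concrete, here is a sketch on a deliberately tiny two-colour light. The module and property names are made up for illustration, and the weak fairness conjunct WF_light(Next) is an added assumption: without it, a behaviour that stutters on Red forever would satisfy the spec and violate the liveness properties.

```tla
---- MODULE traffic_properties ----
VARIABLE light

Init == light = "Red"
Next == \/ light = "Red"   /\ light' = "Green"
        \/ light = "Green" /\ light' = "Red"

\* Weak fairness: if Next stays enabled, it must eventually be taken
Spec == Init /\ [][Next]_light /\ WF_light(Next)

\* Safety: something bad never happens
AlwaysValidColor == [](light \in {"Red", "Green"})

\* Liveness: something good eventually happens
EventuallyGreen == <>(light = "Green")
RedLeadsToGreen == (light = "Red") ~> (light = "Green")
====
```

A safety violation always shows up as a finite trace ending in a bad state; a liveness violation needs an infinite behaviour, typically reported by a model checker as a loop.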

Slide 28

Slide 28 text

2.6 Model checking basics ● Model Checking: ○ Automated technique to verify temporal properties. ○ Explores all possible states and transitions. ○ Tools: TLC (TLA+ model checker), Apalache (3rd party model checker) ● Process: ○ Define the system and its properties. ■ In our case: an INVARIANT, a (temporal) PROPERTY, and deadlock detection ○ Use a model checker to explore all possible behaviors. ○ Verify if the properties hold

Slide 29

Slide 29 text

2.6bis Model checking Spanish traffic lights

\* traffic_lights_spain_inv.cfg
INIT Init
NEXT Next
INVARIANT TypeOK
PROPERTY NoRegression
CHECK_DEADLOCK TRUE

% tlc traffic_lights_spain_inv.tla
…snip…
Starting... (2025-02-23 16:58:15)
Computing initial states...
Finished computing initial states: 1 distinct state generated at 2025-02-23 16:58:15.
Model checking completed. No error has been found.
…snip…
5 states generated, 4 distinct states found, 0 states left on queue.
…snip…
Finished in 00s at (2025-02-23 16:58:15)

Slide 30

Slide 30 text

3. TLA+ Basics 1. States and variables 2. Basic operators: ≜, ⇒, ∧, ∨, =, ∈ 3. Actions with primed variables 4. Stuttering steps 5. Temporal formulas 6. Specifications: Init /\ [][Next]_vars 7. Simple example: Mutex 8. Checking Mutex with TLC

Slide 31

Slide 31 text

3.1 States and variables ● States: ○ Represent the current configuration of the system. ○ Defined by the values of variables. ● Variables: ○ Store the state of the system. ○ Can be of various types: integers, sets, sequences, etc. ● Example: ○ Our simple traffic light system: light (with values Red, Green, Yellow). ○ Our Spanish variant: light and going_to_green

Slide 32

Slide 32 text

3.2 Basic operators: ≜, ⇒, ∧, ∨, =, ∈ ● Definition: ≜ (typed == in ASCII) ○ Gives a name to an expression or operator; it is NOT an assignment to a variable. ○ Example: Five ≜ 5 ● Implication: ⇒ ○ Logical implication. ○ Example: A ⇒ B (if A then B) ● Conjunction: ∧ ○ Logical AND. ○ Example: A ∧ B (A and B) ● Disjunction: ∨ ○ Logical OR. ○ Example: A ∨ B (A or B) ● Equality: = ○ Checks for equality. ○ Example: x = 5 ● Membership: ∈ ○ Checks for membership in a set. ○ Example: x ∈ {1, 2, 3}
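In source files these operators are typed in ASCII: == for ≜, => for ⇒, /\ for ∧, \/ for ∨ and \in for ∈. The small module below is only an illustration; every name in it is made up.

```tla
---- MODULE operators ----
EXTENDS Naturals

Small      == {1, 2, 3}                  \* == defines a name, it does not assign
IsSmall(x) == x \in Small                \* set membership
BothSmall  == IsSmall(1) /\ IsSmall(2)   \* conjunction
OneIsTwo   == (1 = 2) \/ (2 = 2)         \* disjunction and equality

\* Implication under a universal quantifier: every 3 in Small is above 2
ThreeIsBig == \A x \in Small : (x = 3) => (x > 2)
====
```

All of these are plain definitions of constant expressions: nothing here changes over time until a VARIABLE and a primed expression appear.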

Slide 33

Slide 33 text

3.3 Actions with primed variables ● Primed Variables: ○ Represent the next state of a variable. ○ Used to describe state transitions. ● Example: ○ light' = "Green" means the next state of light is Green. ● Action Definition: ○ Precondition: light = "Red" ○ Postcondition: light' = "Green"

Slide 34

Slide 34 text

3.4 Stuttering steps ● Stuttering Steps: ○ Steps where the state does not change. ■ No system moves infinitely fast (ever waited on Red ?) ■ It transitions from one state to the same based on time passing ○ Important for modeling concurrent systems. ● Example: ○ In a concurrent system, one process might not change state while another does. ● Formal Representation: ○ light' = light (the light stays the same)
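A short sketch of how stuttering is usually allowed in a specification, assuming a two-colour light. The subscripted form [Next]_light is shorthand for "a Next step, or a step that leaves light unchanged".

```tla
---- MODULE stuttering ----
VARIABLE light

Init == light = "Red"
Next == light' = IF light = "Red" THEN "Green" ELSE "Red"

\* An explicit stuttering step: the light stays the same
Stutter == light' = light

\* The same action as [Next]_light, spelled out (hypothetical name)
NextOrStutter == Next \/ UNCHANGED light

Spec == Init /\ [][Next]_light
====
```

Because every specification written this way permits stuttering, a spec of one component stays valid when other components, which change other variables, are added around it.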

Slide 35

Slide 35 text

3.5 Temporal formulas ● Temporal Operators: ○ [] (always, aka box): Property holds in all states. ○ <> (eventually, aka diamond): Property holds in some future state. ○ ~> (leads to): One property eventually leads to another. ● Example: ○ [](light = "Red" => <>(light = "Green")) ■ Always, if the light is red, it will eventually turn green. ■ Equivalent to (light = "Red") ~> (light = "Green") ■ This is called a liveness property (system will move from such a state)

Slide 36

Slide 36 text

3.6 Specifications ● Specification Structure uses conventional names, defaults in CFG ○ Init: Initial state predicate. ○ Next: Next state relation. ○ Spec: Overall specification. ● Formal Representation: ○ Spec == Init /\ [][Next]_vars ○ Init defines the initial state. ○ Next defines the possible transitions. ○ vars is the tuple (finite sequence) of all variables of the system. ● TLC CFG may either define INIT and NEXT or SPECIFICATION, ○ Defaults to Init, Next, Spec respectively

Slide 37

Slide 37 text

3.7 Simple example: Mutex with 2 processes

Slide 38

Slide 38 text

3.7 Simple example: Mutex with 3 processes ● TLC is a brute-force checker with smart optimizations, so check time depends on the number of states. ● Would most teams identify and test all these states? And for more processes?

Slide 39

Slide 39 text

3.7 Mutex cost: Sweeping costs explode fast

Slide 40

Slide 40 text

3.7 Mutex cost: Sweeping costs explode fast - bis Can you guess when I started TLC to check a model ?

Slide 41

Slide 41 text

3.7 Simple example: Mutex

---- MODULE mutex_sweeping ----
EXTENDS Integers, FiniteSets

CONSTANT N               \* Number of processes
ASSUME N \in Nat         \* N must be a natural number
ASSUME N > 1             \* Need at least 2 processes

VARIABLES flag,          \* Array of flags, one per process
          turn           \* Whose turn is it to enter critical section

Proc == 1..N             \* N processes: that's sweeping
vars == <<flag, turn>>

Init ==
    /\ flag = [i \in Proc |-> FALSE]   \* No one wants to enter initially
    /\ turn = 1                        \* Process 1 goes first

Try(i) ==                \* Process i wants to enter critical section
    /\ flag' = [flag EXCEPT ![i] = TRUE]
    /\ UNCHANGED turn

Give(i) ==               \* Process i gives turn to other process
    /\ flag[i] = TRUE
    /\ \E j \in Proc \ {i} : turn' = j   \* Give turn to any other process
    /\ UNCHANGED flag

Enter(i) ==              \* Process i enters critical section if it's their turn
    /\ flag[i] = TRUE
    /\ turn = i
    /\ \A j \in Proc \ {i} : ~flag[j]    \* No other process wants to enter
    /\ UNCHANGED vars

Exit(i) ==               \* Process i leaves critical section
    /\ flag' = [flag EXCEPT ![i] = FALSE]
    /\ UNCHANGED turn

Next == \E i \in Proc : Try(i) \/ Give(i) \/ Enter(i) \/ Exit(i)

Spec == Init /\ [][Next]_vars

\* Type correctness invariant
TypeOK ==
    /\ flag \in [Proc -> BOOLEAN]
    /\ turn \in Proc

\* Safety: No two processes in critical section simultaneously
Mutex == [][\A i,j \in Proc : i # j => ~(Enter(i) /\ Enter(j))]_vars
====

Slide 42

Slide 42 text

3.7 Checking Mutex with TLC

\* mutex_sweep.cfg
SPECIFICATION Spec
CONSTANT N = 3 \* Try other values
INVARIANT TypeOK
PROPERTY Mutex
CHECK_DEADLOCK TRUE

Run check with: tlc mutex_sweep.tla

● SPECIFICATION references the Spec temporal formula, which MUST be shaped like: Init /\ [][Next]_vars ○ Init holds in the first state AND every step is either a Next step or a stuttering step ○ Because vars == <<flag, turn>>, this is shorthand for Init /\ [](Next \/ (flag' = flag /\ turn' = turn)) ● The value of N is a CONSTANT in TLA+, the model assigns it an actual value ● INVARIANT means TypeOK is true on every state ● PROPERTY describes the (safety: []) property Mutex, which states that no two processes may enter simultaneously. Could also be liveness (<> or ~>), or fairness (WF_vars / SF_vars). ● CHECK_DEADLOCK makes TLC report any reachable state from which no Next step is possible

Slide 43

Slide 43 text

4. PlusCal 1. Bridging code and specifications 2. Algorithm structure 3. Variables and assignments 4. Atomic steps and labels 5. Multi-process algorithms 6. Translation to TLA+

Slide 44

Slide 44 text

4.1 Bridging Code and Specifications ● PlusCal: ○ High-level algorithm language ○ Bridges the gap between code and formal specifications. ○ For programmers, easier to write and understand than raw TLA+ ● Pros: ○ Simplifies the process of writing specifications. ○ Automatically translates to TLA+ for verification. ○ More intuitive for developers familiar with programming languages. ● Cons ○ Less Expressive: Some complex temporal properties and constraints are harder to express in PlusCal compared to raw TLA+. ○ Abstraction Limitations: PlusCal's higher-level abstractions might obscure some details that are explicit in TLA+, potentially leading to misunderstandings. ○ Learning Curve: While easier than raw TLA+, PlusCal still requires learning its syntax and semantics, which might be a barrier for some users. ○ Translation Overhead: The automatic translation to TLA+ can sometimes introduce inefficiencies or complexities in the generated TLA+ code.

Slide 45

Slide 45 text

4.2 Algorithm structure ● Embedded in a specially shaped multiline comment within a TLA+ module ● Structure: ○ Defines algorithms using familiar programming constructs. ○ Includes variables, assignments, conditionals, and loops. ● Example: basic structure of a PlusCal algorithm in P-syntax vs C-syntax. ○ Observe semicolon usage and alignment

\* P-syntax
while x > 0 do
  if y > 0 then y := y-1;
                x := x-1
  else x := x-2
  end if
end while;
print y;

\* C-syntax
while (x > 0) {
  if (y > 0) { y := y-1;
               x := x-1 }
  else x := x-2
} ;
print y;

Slide 46

Slide 46 text

4.3 Variables and assignments ● Variables: ○ Declared and initialized at the beginning, using =, as in: variables x = 5; ○ Can be of various types: integers, sets, sequences, etc. ● Assignments: ○ Use := to assign values to variables, not = ○ Example: x := x + 1. ■ In TLA+ that's really x' = x + 1 🤯 ■ That's how assignment in programming languages is really stepping through time ○ No self-assignment: x := x ○ A given variable may be assigned only ONCE within a "label", see Atomic Steps and Labels on next slide
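A small sketch of these rules in P-syntax, embedded in a TLA+ module comment as PlusCal requires. The module, algorithm and label names are illustrative assumptions.

```tla
---- MODULE counter ----
EXTENDS Naturals

(* --algorithm Counter
variables x = 5;      \* declaration: the initial value uses =
begin
  Incr:
    x := x + 1;       \* assignment uses := and translates to x' = x + 1
  Again:
    x := x + 1;       \* a second assignment to x needs its own label
end algorithm; *)
====
```

Putting both assignments under the single label Incr would be rejected by the translator, since one step can only say once what x' is.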

Slide 47

Slide 47 text

4.4 Atomic steps and labels ● Labels: ○ Used to identify steps in the algorithm ○ Help in referencing and understanding the flow. ● Atomic Steps: ○ All simple statements within a label happen within a single step ○ They are executed atomically (without interruption). ○ Important for modeling concurrent systems. ● Example: the statements under labels A and B do not form one atomic sequence, each label is its own step

--algorithm TwoLabels
variables x = 0;
begin
  A: x := x + 1;
  B: x := x - 1;
end algorithm;

Slide 48

Slide 48 text

4.5 Multi-process algorithms ● Multi-process Algorithms: ○ Define multiple processes that run concurrently. ○ Each process has its own set of variables and steps. ● Example:

--algorithm MultiProcess
variables x = 0;

process P1 = 1
begin
  A: x := x + 1;
end process;

process P2 = 2
begin
  B: x := x - 1;
end process;

end algorithm

Watch out for colors on the TLA+ translation on the next slide

Slide 49

Slide 49 text

4.6 Translation to TLA+ ● PlusCal algorithms are automatically translated to TLA+. ● The translation process is straightforward and preserves the semantics.

VARIABLES pc, x

vars == << pc, x >>

ProcSet == {1} \cup {2}

Init == (* Global variables *)
        /\ x = 0
        /\ pc = [self \in ProcSet |-> CASE self = 1 -> "A"
                                        [] self = 2 -> "B"]

A == /\ pc[1] = "A"
     /\ x' = x + 1
     /\ pc' = [pc EXCEPT ![1] = "Done"]

P1 == A

B == /\ pc[2] = "B"
     /\ x' = x - 1
     /\ pc' = [pc EXCEPT ![2] = "Done"]

P2 == B

(* Allow inf. stuttering: prevent termination deadlock *)
Terminating == /\ \A self \in ProcSet: pc[self] = "Done"
               /\ UNCHANGED vars

Next == P1 \/ P2 \/ Terminating

Spec == Init /\ [][Next]_vars

Termination == <>(\A self \in ProcSet: pc[self] = "Done")

Slide 50

Slide 50 text

5. In practice 1. When to use TLA+ 2. Integration with development workflow 3. Software tools 4. Starting with TLA+ / PlusCal 5. A final hint

Slide 51

Slide 51 text

5.1 Where/When to Use TLA+ / PlusCal ● Complex Concurrent Systems: ○ When designing systems with complex concurrency and distributed algorithms (XBox 360) ○ Examples: Distributed databases/storage (ex: MongoDB, Cosmos DB, AWS S3), consensus algorithms (Paxos variants, Raft, KRaft, Kafka replication), network protocols (Dropbox, Azure DNS) ● Critical Systems: ○ Systems where correctness is crucial, such as financial systems, medical devices, and aerospace. ○ Helps in identifying and fixing design bugs before implementation. ● Early Design Phase: ○ Early stages of design to specify and verify high-level system behavior. ○ Allows for early detection of design flaws and ensures a solid foundation before coding. ● Refactoring and Maintenance: ○ Refactoring or maintaining complex systems to ensure that changes do not introduce new bugs. ○ Helps in understanding and verifying the impact of changes.

Slide 52

Slide 52 text

5.2 Integration with Development Workflow ● Specification Writing: ○ Write TLA+ specifications alongside high-level design documents. ○ Use PlusCal for easier translation of algorithms into TLA+. ● Model Checking: ○ Use TLC or Apalache model checkers to verify TLA+ specifications. ○ Run TLC as part of the continuous integration (CI) pipeline to catch design bugs early. ● Code Generation: ○ Use TLA+ to generate test cases and invariants that can be used in unit tests. ○ TLA+ is not typically used for code generation, but the insights gained from model checking can guide implementation. ● Documentation: ○ Include TLA+ specifications in project documentation to provide a formal reference for system behavior. ○ Use TLA+ to document assumptions, invariants, and safety properties. ● Collaboration: ○ Use TLA+ to facilitate collaboration between developers, architects, and stakeholders. ○ Provides a common language for discussing system behavior and requirements. ● Training and Onboarding: ○ Train new team members in TLA+ to ensure a consistent understanding of system specifications. ○ Use TLA+ examples and exercises to onboard new developers.

Slide 53

Slide 53 text

5.3 Software tools ● From the TLA+ Foundation ○ As of 2025-06-27, current release is Xenophanes, tag 1.7.4 (1.8.0 is still pre-release) ○ CLI tools come together in tla2tools.jar ■ tlc2: model checker - use all the time ■ pcal: the PlusCal to TLA+ transpiler ■ tla2tex: format source TLA+ to math LaTeX. Combine with pdflatex for PDFs. ■ tla2sany: syntax checker - better use your IDE plugin ■ tlc2.REPL: a REPL - new in upcoming Clarke release, tag 1.8.0 - see demo ○ The Toolbox is the default Eclipse-based IDE, not recommended for work, but useful to follow early tutorials and the profiler UI. Comes in the TLAToolbox*.(zip|deb) packages ● Apalache is a symbolic model checker for large models ● The TLA Proof System is useful for infinite models, which a brute-force checker like TLC cannot verify

Slide 54

Slide 54 text

5.3 Software tools: profiler UI in Toolbox Source: INRIA

Slide 55

Slide 55 text

5.4 Starting with TLA+ / PlusCal ● Start Small: ○ Begin with small, critical components of your system. ○ Gradually expand the use of TLA+ as you gain confidence. ● Iterate: ○ Use TLA+ iteratively to refine your specifications and designs. ○ Continuously verify and update your specifications as the system evolves: use CI for this ● Collaborate: ○ Involve your team in writing and reviewing TLA+ specifications. ○ Foster a culture of formal verification.

Slide 56

Slide 56 text

5.5 A final hint From Learn TLA+: “The dirty secret of formal methods is that the only way we know to scale it up is to use state machines.”

Slide 57

Slide 57 text

Just one more thing … ● Read the books (affiliate links) ○ Hillel Wayne’s “Practical TLA+” https://amzn.to/3HYU7Sx ○ The original Leslie Lamport book https://amzn.to/4l31t64 ● Or, better, let’s talk! ○ LinkedIn: https://linkedin.com/in/marand ○ Bluesky: https://bsky.app/profile/fgmarand.bsky.social