A TLA+ intro for Software engineers

This presentation introduces TLA+, a formal specification language for concurrent systems.

Unlike most existing reference sources, it is meant for software developers rather than logicians or mathematicians.

TLA+ helps find design flaws before they are implemented in code, which avoids costly bugs later. Success stories include Amazon S3, Azure Cosmos DB, and Apache Kafka. PlusCal, a companion algorithm language, bridges code and specifications and makes writing TLA+ easier for developers.

TLA+ is recommended for complex, critical systems in the early design phase.

Frédéric G. MARAND

February 23, 2025

Transcript

  1. 1. Introduction 1. What is TLA+ ? A formal specification

    language for concurrent/distributed systems 2. Why use formal methods ? Find design bugs before implementation 3. Success stories Amazon S3, Azure Cosmos DB, Apache Kafka 4. Value proposition Cost of bugs vs specification
  2. 1.1 What is TLA+ ? • A formal specification language

    for concurrent and distributed systems ◦ NOT a programming language • Created by Leslie Lamport, ◦ Of LaTeX fame. And it shows… • Mathematical foundation for describing systems ◦ Based on the Zermelo–Fraenkel set theory with the axiom of choice (ZFC) • Focus on system behavior, not implementation ◦ So not transpilable to any programming language
  3. Leslie Lamport, Turing Award (2013): "For fundamental contributions to the theory and practice

    of distributed and concurrent systems, notably the invention of concepts such as causality and logical clocks, safety and liveness, replicated state machines, and sequential consistency" (Image: self-portrait of Leslie Lamport, from the TLA+ course)
  4. 1.2 Why use it ? • Find design flaws before

    implementation ◦ The best implementation is at most as good as the design underlying it ◦ Actual implementations… • Verify complex algorithms and protocols ◦ Models can test all relevant timing combinations • Complement traditional testing ◦ Unit tests only check the cases devs were able to discover ◦ Fuzz tests only check the cases devs were able to generate • Most expensive bugs are design flaws ◦ No implementation work can fix an undiscovered race condition
  5. 1.3 Success stories • Amazon S3: Found critical bugs in

    internal protocols ◦ How formal methods helped AWS to design amazing services • Azure Cosmos DB: Verified consistency protocols ◦ Understanding Inconsistency in Azure Cosmos DB with TLA+ • Apache Kafka: Validated replication protocol ◦ KIP-966 for KRaft ◦ Also other KIPs, e.g. KIP-848: new consumer group rebalance protocol • Each saved months of debugging and prevented outages
  6. 1.4 Value proposition • Cost of fixing bugs: ◦ Design

    phase: 1x ◦ Development phase: 10x ◦ Production phase: 100x • TLA+ helps find bugs that testing misses ◦ Focus on the sad path ◦ Even fuzzing tends to only find crashing bugs, not invariant violations • Especially valuable for distributed systems ◦ Human comprehension is poor regarding relative temporality
  7. 2. Logic foundation 1. Traditional logic vs temporal logic 2.

    State machines: System as states 3. Actions: State transitions 4. Simple example: Traffic light 5. Temporal properties: Safety vs Liveness 6. Model checking basics
  11. 2.1 Traditional logic vs temporal logic • ∀P: P =

    P • Is it ALWAYS true ? • How about evaluating P at two different instants ? • How about P = “the process is running now” ? • Traditional logic evaluates at an abstract instant in time (“now”) • Temporal logic adds the time dimension: “now” changes
  12. 2.2 Systems as states • The notion that a predicate

    can be evaluated at different times adds a quantified time component • The values of the system at these time quanta are states ◦ Within one, traditional logic applies • Three main temporal logics: ◦ Linear Temporal Logic (Pnueli, 77): evaluates properties over sequences of states ▪ That’s the one TLA+ and many derived tools combine with first-order logic ◦ Computation Tree Logic (Clarke/Emerson, 81): combines temporal operators with path quantifiers to specify properties over computation tree ◦ Branching Temporal Logic (multiple, 80s): evaluates over branching, capturing multiple possible future paths from a state
  13. 2.3 Actions: state transitions • Actions: ◦ Define how the

    system takes a step from one state to another ◦ Described using preconditions and postconditions. ◦ Example: Toggle action for the second light switch. • Formal Representation: use primed variables ◦ Precondition: Light = Off : the value in the current state ◦ Postcondition: Light' = On : the value in the next state
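For readers who think in code: an action is not a command that runs, it is a predicate over a pair of states (current, next). The Python sketch below only illustrates that reading. The dict-based states and the name `toggle_on` are assumptions made for the example, not part of the deck.

```python
# An action is a predicate over a (current, next) pair of states.
# Hypothetical names: states are plain dicts, and `toggle_on` mirrors the
# slide's precondition Light = Off and postcondition Light' = On.

def toggle_on(current: dict, nxt: dict) -> bool:
    """Return True when the step current -> nxt is allowed by the action."""
    precondition = current["light"] == "Off"   # Light = Off
    postcondition = nxt["light"] == "On"       # Light' = On
    return precondition and postcondition


off, on = {"light": "Off"}, {"light": "On"}

print(toggle_on(off, on))    # the step Off -> On satisfies the action
print(toggle_on(on, on))     # precondition fails
print(toggle_on(off, off))   # postcondition fails
```

Nothing is "executed" here: the function only says which steps are acceptable, which is how a model checker uses an action.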
  14. 2.4 Simple example: traffic lights • Usually, traffic lights run

    in a loop ◦ Green ◦ Orange ◦ Red ◦ restart from top
  15. 2.4bis Simple example: Spanish traffic lights In Spain, however, traffic

    lights go to yellow before going from Red to Green. But this has a problem…
  16. 2.4bis Simple example: Spanish traffic lights The model allows sequence:

    1. Red 2. Yellow 3. Red 4. Yellow 5. … That’s not fair!
  17. 2.4ter Simple example: Spanish traffic lights For the model to

    run in a cycle, we can add another variable to the state, forcing the choice on Yellow
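The deck's actual spec is not part of this transcript, so the following Python sketch is a hypothetical reconstruction of the idea: with `light` alone, Yellow has two possible successors, and adding a second variable (called `going_to_green` here, after the variable named later in the deck) removes the choice. The initial value and the exact transitions are assumptions.

```python
# Hypothetical reconstruction; the original TLA+ spec is not in the transcript.

def next_states_naive(light):
    """First attempt: only `light` in the state, so Yellow may go either way."""
    table = {"Red": {"Yellow"}, "Yellow": {"Red", "Green"}, "Green": {"Yellow"}}
    return table[light]


def next_state(state):
    """Second attempt: the state is (light, going_to_green)."""
    light, going_to_green = state
    if light == "Red":
        return ("Yellow", True)       # leaving Red: heading to Green
    if light == "Green":
        return ("Yellow", False)      # leaving Green: heading to Red
    # On Yellow, the extra variable forces the choice.
    return ("Green", True) if going_to_green else ("Red", False)


def run(state, steps):
    """Return the list of states visited in `steps` transitions."""
    trace = [state]
    for _ in range(steps):
        state = next_state(state)
        trace.append(state)
    return trace


trace = run(("Red", False), 4)
print(sorted(next_states_naive("Yellow")))   # two choices: the unfair loop is possible
print([light for light, _ in trace])         # one full cycle back to Red
```

The second model has four distinct states: Yellow appears twice, once for each direction.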
  18. 2.5 Temporal Properties: Safety vs Liveness • Safety Properties: ◦

    Ensure that something bad never happens. ◦ Example: "The traffic light is never both red and green." • Liveness Properties: ◦ Ensure that something good eventually happens. ◦ Example: "The traffic light will eventually turn green." • Which one did our first attempt for Spanish traffic lights fail ? • But we did not define a rule, so that could not be checked
  19. 2.6 Model checking basics • Model Checking: ◦ Automated technique

    to verify temporal properties. ◦ Explores all possible states and transitions. ◦ Tools: TLC (TLA+ model checker), Apalache (3rd party model checker) • Process: ◦ Define the system and its properties. ▪ In our case: an INVARIANT, a (temporal) PROPERTY, and deadlock detection ◦ Use a model checker to explore all possible behaviors. ◦ Verify if the properties hold
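To make "explores all possible states and transitions" concrete, here is a toy explicit-state explorer in Python. It is a sketch of the general idea only, not of how TLC or Apalache are implemented. The function `check` and the two transition tables are assumptions made for the example, and it covers only an invariant and deadlock detection, not temporal properties.

```python
from collections import deque


def check(initial_states, successors, invariant):
    """Explore every reachable state breadth-first.

    Returns (ok, detail). ok is False on the first invariant violation
    or deadlock (a reachable state with no successor).
    """
    seen = set(initial_states)
    queue = deque(initial_states)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return False, f"invariant violated in {state!r}"
        nexts = successors(state)
        if not nexts:
            return False, f"deadlock in {state!r}"
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, f"{len(seen)} distinct states"


# The simple looping traffic light: Green -> Orange -> Red -> Green.
CYCLE = {"Green": ["Orange"], "Orange": ["Red"], "Red": ["Green"]}
# A broken variant in which Red has no way out.
BROKEN = {**CYCLE, "Red": []}


def type_ok(light):
    """Invariant: the light always has one of the known values."""
    return light in CYCLE


print(check(["Green"], lambda s: CYCLE[s], type_ok))
print(check(["Green"], lambda s: BROKEN[s], type_ok))
```

The second call reports a deadlock on Red, the kind of finding deadlock detection is there for.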
  20. 2.6bis Model checking Spanish traffic lights

        \* traffic_lights_spain_inv.cfg
        INIT Init
        NEXT Next
        INVARIANT TypeOK
        PROPERTY NoRegression
        CHECK_DEADLOCK TRUE

        % tlc traffic_lights_spain_inv.tla
        …snip…
        Starting... (2025-02-23 16:58:15)
        Computing initial states...
        Finished computing initial states: 1 distinct state generated at 2025-02-23 16:58:15.
        Model checking completed. No error has been found.
        …snip…
        5 states generated, 4 distinct states found, 0 states left on queue.
        …snip…
        Finished in 00s at (2025-02-23 16:58:15)
  21. 3. TLA+ Basics 1. States and variables 2. Basic operators:

    ≜, ⇒, ∧, ∨, =, ∈ 3. Actions with primed variables 4. Stuttering steps 5. Temporal formulas 6. Specifications: Init /\ [][Next]_vars 7. Simple example: Mutex 8. Checking Mutex with TLC
  22. 3.1 States and variables • States: ◦ Represent the current

    configuration of the system. ◦ Defined by the values of variables. • Variables: ◦ Store the state of the system. ◦ Can be of various types: integers, sets, sequences, etc. • Example: ◦ Our simple traffic light system: light (with values Red, Green, Yellow). ◦ Our Spanish variant: light and going_to_green
  23. 3.2 Basic operators: ≜, ⇒, ∧, ∨, =, ∈ •

    Definition: ≜ (typed == in ASCII) ◦ Gives a name to an expression or operator; it does NOT assign a value to a variable. ◦ Example: Max ≜ 5 (Max now names the value 5) • Implication: ⇒ (typed =>) ◦ Logical implication. ◦ Example: A ⇒ B (if A then B) • Conjunction: ∧ (typed /\) ◦ Logical AND. ◦ Example: A ∧ B (A and B) • Disjunction: ∨ (typed \/) ◦ Logical OR. ◦ Example: A ∨ B (A or B) • Equality: = ◦ Checks for equality. ◦ Example: x = 5 • Membership: ∈ (typed \in) ◦ Checks for membership in a set. ◦ Example: x ∈ {1, 2, 3}
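Of these operators, implication is the one that most often surprises developers: A ⇒ B is false only when A is true and B is false. A minimal Python sketch of its truth table follows. The helper name `implies` is an assumption for illustration.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """A => B is equivalent to (not A) or B."""
    return (not a) or b


# Print the full truth table.
for a, b in product([True, False], repeat=2):
    print(f"{a!s:5} => {b!s:5} : {implies(a, b)}")
```

In particular, a false premise makes the implication true whatever the conclusion, which is why invariants written as `P => Q` hold trivially in states where `P` is false.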
  24. 3.3 Actions with primed variables • Primed Variables: ◦ Represent

    the next state of a variable. ◦ Used to describe state transitions. • Example: ◦ light' = "Green" means the next state of light is Green. • Action Definition: ◦ Precondition: light = "Red" ◦ Postcondition: light' = "Green"
  25. 3.4 Stuttering steps • Stuttering Steps: ◦ Steps where the

    state does not change. ▪ No system moves infinitely fast (ever waited on Red ?) ▪ It transitions from one state to the same based on time passing ◦ Important for modeling concurrent systems. • Example: ◦ In a concurrent system, one process might not change state while another does. • Formal Representation: ◦ light' = light (the light stays the same)
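As a rough illustration of why stuttering matters, the Python sketch below checks a sample behavior against a next-state relation with and without stuttering allowed. The cycle Red, Green, Yellow and the function names are assumptions made for the example.

```python
def next_step(cur, nxt):
    """Next: the light advances along Red -> Green -> Yellow -> Red."""
    order = {"Red": "Green", "Green": "Yellow", "Yellow": "Red"}
    return nxt == order[cur]


def next_or_stutter(cur, nxt):
    """Either a Next step, or a stuttering step where light' = light."""
    return next_step(cur, nxt) or nxt == cur


# A behavior in which the light waits on Red and on Yellow.
behavior = ["Red", "Red", "Red", "Green", "Yellow", "Yellow", "Red"]
steps = list(zip(behavior, behavior[1:]))

print(all(next_step(c, n) for c, n in steps))         # waiting is not a Next step
print(all(next_or_stutter(c, n) for c, n in steps))   # accepted once stuttering is allowed
```

The second check is the same shape as the `[Next]_vars` notation: a step satisfies it if it is a Next step or if the variables are unchanged.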
  26. 3.5 Temporal formulas • Temporal Operators: ◦ [] (always, aka

    box): Property holds in every state of a behavior. ◦ <> (eventually, aka diamond): Property holds in some state, now or later. ◦ ~> (leads to): Whenever one property holds, the other holds then or later. • Example: ◦ [](light = "Red" => <>(light = "Green")) ▪ Always, if the light is red, it will eventually turn green. ▪ Equivalent to (light = "Red") ~> (light = "Green") ▪ This is called a liveness property (the system will eventually move out of such a state)
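Temporal operators talk about infinite behaviors, which code cannot enumerate directly. One common trick, used in the Python sketch below, is to represent a behavior as a "lasso": a finite prefix followed by a cycle repeated forever. The function names and the two sample behaviors are assumptions made for the example. This is an illustration of the operators' meaning, not of how a model checker evaluates them.

```python
# A behavior is approximated by a lasso: `prefix` once, then `cycle` forever.

def always(p, prefix, cycle):
    """[]p : p holds in every state of the behavior."""
    return all(p(s) for s in prefix + cycle)


def eventually(p, prefix, cycle):
    """<>p : p holds in at least one state of the behavior."""
    return any(p(s) for s in prefix + cycle)


def leads_to(p, q, prefix, cycle):
    """p ~> q : whenever p holds, q holds in that state or a later one."""
    for i, s in enumerate(prefix):
        if p(s) and not any(q(t) for t in prefix[i:] + cycle):
            return False
    # From inside the cycle, every cycle state is still in the future.
    if any(p(s) for s in cycle) and not any(q(t) for t in cycle):
        return False
    return True


def is_red(s):
    return s == "Red"


def is_green(s):
    return s == "Green"


fair = ([], ["Red", "Yellow", "Green", "Yellow"])
unfair = (["Green", "Yellow"], ["Red", "Yellow"])   # stuck in Red/Yellow forever

print(leads_to(is_red, is_green, *fair))     # Red always leads to Green
print(leads_to(is_red, is_green, *unfair))   # Red never reaches Green again
print(eventually(is_green, *unfair))         # yet Green did happen once
```

The last two lines show why `<>(light = "Green")` is weaker than `(light = "Red") ~> (light = "Green")`: the unfair behavior satisfies the first and violates the second.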
  27. 3.6 Specifications • Specification Structure uses conventional names, defaults in

    CFG ◦ Init: Initial state predicate. ◦ Next: Next state relation. ◦ Spec: Overall specification. • Formal Representation: ◦ Spec == Init /\ [][Next]_vars ◦ Init defines the initial state. ◦ Next defines the possible transitions. ◦ vars is the tuple (finite sequence) of all variables of the system. • TLC CFG may either define INIT and NEXT or SPECIFICATION, ◦ Defaults to Init, Next, Spec respectively
  28. 3.7 Simple example: Mutex with 3 processes • TLC is

    a brute-force checker with smart optimizations, so check time depends on the number of states. • Would most teams identify and test all these states ? And for more processes ?
  29. 3.7 Simple example: Mutex

        ---- MODULE mutex_sweeping ----
        EXTENDS Integers, FiniteSets

        CONSTANT N          \* Number of processes
        ASSUME N \in Nat    \* N must be a natural number
        ASSUME N > 1        \* Need at least 2 processes

        VARIABLES flag,     \* Array of flags, one per process
                  turn      \* Whose turn is it to enter critical section

        Proc == 1..N        \* N processes: that's sweeping
        vars == <<flag, turn>>

        Init ==
            /\ flag = [i \in Proc |-> FALSE]    \* No one wants to enter initially
            /\ turn = 1                         \* Process 1 goes first

        Try(i) ==           \* Process i wants to enter critical section
            /\ flag' = [flag EXCEPT ![i] = TRUE]
            /\ UNCHANGED turn

        Give(i) ==          \* Process i gives turn to other process
            /\ flag[i] = TRUE
            /\ \E j \in Proc \ {i} : turn' = j  \* Give turn to any other process
            /\ UNCHANGED flag

        Enter(i) ==         \* Process i enters critical section if it's their turn
            /\ flag[i] = TRUE
            /\ turn = i
            /\ \A j \in Proc \ {i} : ~flag[j]   \* No other process wants to enter
            /\ UNCHANGED vars

        Exit(i) ==          \* Process i leaves critical section
            /\ flag' = [flag EXCEPT ![i] = FALSE]
            /\ UNCHANGED turn

        Next == \E i \in Proc : Try(i) \/ Give(i) \/ Enter(i) \/ Exit(i)

        Spec == Init /\ [][Next]_vars

        \* Type correctness invariant
        TypeOK ==
            /\ flag \in [Proc -> BOOLEAN]
            /\ turn \in Proc

        \* Safety: No two processes in critical section simultaneously
        Mutex == [][\A i,j \in Proc : i # j => ~(Enter(i) /\ Enter(j))]_vars
        ====
  30. 3.7 Checking Mutex with TLC

        \* mutex_sweep.cfg
        SPECIFICATION Spec
        CONSTANT N = 3    \* Try other values
        INVARIANT TypeOK
        PROPERTY Mutex
        CHECK_DEADLOCK TRUE

    Run check with: tlc mutex_sweep.tla • SPECIFICATION references the Spec temporal formula, which MUST be shaped like: Init /\ [][Next]_vars ◦ Predicate Init is true in the initial state AND every step either satisfies Next or leaves vars unchanged ◦ Because vars == <<flag, turn>> this is syntactic sugar for Init /\ [](Next \/ (flag' = flag /\ turn' = turn)) • N is declared as a CONSTANT in TLA+; the model assigns it an actual value • INVARIANT means TypeOK must be true in every reachable state • PROPERTY names the temporal property Mutex, here a safety property (shaped like []), which states that no two processes may enter simultaneously. A property could also express liveness (<> or ~>) or fairness (WF_vars / SF_vars). • CHECK_DEADLOCK makes TLC report any reachable state from which no Next step is possible
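To get a feel for how the state count grows with N, the Python sketch below mirrors the Try, Give and Exit actions of the Mutex spec and counts reachable states by brute force. It is an illustrative re-implementation under the assumption that the spec reads as shown in the deck. It is not TLC, and the function names are made up for the example. Enter(i) is left out because it leaves every variable unchanged and so never produces a new state.

```python
def set_flag(flag, i, value):
    """Return a copy of the flag tuple with process i (1-based) set to value."""
    return flag[:i - 1] + (value,) + flag[i:]


def successors(state, n):
    """All states reachable in one step through Try, Give or Exit."""
    flag, turn = state                       # flag: tuple of booleans, turn: 1..n
    result = set()
    for i in range(1, n + 1):
        result.add((set_flag(flag, i, True), turn))     # Try(i)
        result.add((set_flag(flag, i, False), turn))    # Exit(i)
        if flag[i - 1]:                                 # Give(i) needs flag[i] = TRUE
            for j in range(1, n + 1):
                if j != i:
                    result.add((flag, j))
    return result


def reachable(n):
    """Set of all states reachable from Init for n processes."""
    init = ((False,) * n, 1)
    seen, frontier = {init}, [init]
    while frontier:
        state = frontier.pop()
        for nxt in successors(state, n):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


for n in (2, 3, 4):
    print(n, len(reachable(n)))
```

In this mirror every combination of flags and turn is reachable, so the count is N × 2^N: small for N = 3, but exponential in the number of processes.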
  31. 4. PlusCal 1. Bridging code and specifications 2. Algorithm structure

    3. Variables and assignments 4. Atomic steps and labels 5. Multi-process algorithms 6. Translation to TLA+
  32. 4.1 Bridging Code and Specifications • PlusCal: ◦ High-level algorithm

    language ◦ Bridges the gap between code and formal specifications. ◦ For programmers, easier to write and understand than raw TLA+ • Pros: ◦ Simplifies the process of writing specifications. ◦ Automatically translates to TLA+ for verification. ◦ More intuitive for developers familiar with programming languages. • Cons ◦ Less Expressive: Some complex temporal properties and constraints are harder to express in PlusCal compared to raw TLA+. ◦ Abstraction Limitations: PlusCal's higher-level abstractions might obscure some details that are explicit in TLA+, potentially leading to misunderstandings. ◦ Learning Curve: While easier than raw TLA+, PlusCal still requires learning its syntax and semantics, which might be a barrier for some users. ◦ Translation Overhead: The automatic translation to TLA+ can sometimes introduce inefficiencies or complexities in the generated TLA+ code.
  33. 4.2 Algorithm structure • Embedded in a specially shaped multiline

    comment within a TLA+ module • Structure: ◦ Defines algorithms using familiar programming constructs. ◦ Includes variables, assignments, conditionals, and loops. • Example: basic structure of a PlusCal algorithm in P-syntax vs C-syntax. ◦ Observe semicolon usage and alignment

        \* P-syntax
        while x > 0 do
          if y > 0 then y := y-1;
                        x := x-1
          else x := x-2
          end if
        end while;
        print y;

        \* C-syntax
        while (x > 0)
          { if (y > 0) { y := y-1;
                         x := x-1 }
            else x := x-2 } ;
        print y;
  34. 4.3 Variables and assignments • Variables: ◦ Declared and initialized

    at the beginning, using =, as in: variables x = 5; ◦ Can be of various types: integers, sets, sequences, etc. • Assignments: ◦ Use := to assign values to variables, not = ◦ Example: x := x + 1. ▪ In TLA+ that's really x' = x + 1 🤯 ▪ That's how assignment in programming languages is really stepping through time ◦ No self-assignment: x := x ◦ A variable may be assigned only ONCE within a "label", see Atomic Steps and Labels on next slide
  35. 4.4 Atomic steps and labels • Labels: ◦ Used to

    identify steps in the algorithm ◦ Help in referencing and understanding the flow. • Atomic Steps: ◦ All simple statements within a label happen within a single step (one state transition) ◦ They are executed atomically (without interruption). ◦ Important for modeling concurrent systems. • Example: the statements under labels A and B are two separate steps, not one atomic sequence

        --algorithm TwoLabels
        variables x = 0;
        begin
          A: x := x + 1;
          B: x := x - 1;
        end algorithm;
  36. 4.5 Multi-process algorithms • Multi-process Algorithms: ◦ Define multiple processes

    that run concurrently. ◦ Each process has its own set of variables and steps. • Example:

        --algorithm MultiProcess
        variables x = 0;

        process P1 = 1
        begin
          A: x := x + 1;
        end process;

        process P2 = 2
        begin
          B: x := x - 1;
        end process;

        end algorithm

    Watch out for colors on the TLA+ translation on the next slide
  37. 4.6 Translation to TLA+ • PlusCal algorithms are automatically translated

    to TLA+. • The translation process is straightforward and preserves the semantics.

        VARIABLES pc, x

        vars == << pc, x >>

        ProcSet == {1} \cup {2}

        Init == (* Global variables *)
                /\ x = 0
                /\ pc = [self \in ProcSet |-> CASE self = 1 -> "A"
                                                [] self = 2 -> "B"]

        A == /\ pc[1] = "A"
             /\ x' = x + 1
             /\ pc' = [pc EXCEPT ![1] = "Done"]

        P1 == A

        B == /\ pc[2] = "B"
             /\ x' = x - 1
             /\ pc' = [pc EXCEPT ![2] = "Done"]

        P2 == B

        (* Allow inf. stuttering: prevent termination deadlock *)
        Terminating == /\ \A self \in ProcSet: pc[self] = "Done"
                       /\ UNCHANGED vars

        Next == P1 \/ P2
                   \/ Terminating

        Spec == Init /\ [][Next]_vars

        Termination == <>(\A self \in ProcSet: pc[self] = "Done")
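One way to read the translation is as an ordinary state machine over `pc` and `x`. The Python sketch below mirrors the translated actions A, B and Terminating and enumerates every reachable state. It is an illustration written for this transcript, with hypothetical function names, and not output of the PlusCal translator.

```python
# State: (pc1, pc2, x), mirroring pc[1], pc[2] and x in the translation.

def successors(state):
    """States reachable in one step through A, B or Terminating."""
    pc1, pc2, x = state
    result = []
    if pc1 == "A":                         # action A (process P1)
        result.append(("Done", pc2, x + 1))
    if pc2 == "B":                         # action B (process P2)
        result.append((pc1, "Done", x - 1))
    if pc1 == "Done" and pc2 == "Done":    # Terminating: stutter forever
        result.append(state)
    return result


def explore(init):
    """Set of all states reachable from init."""
    seen, frontier = {init}, [init]
    while frontier:
        for nxt in successors(frontier.pop()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


states = explore(("A", "B", 0))
for s in sorted(states):
    print(s)
```

Both interleavings (A then B, or B then A) end in the same final state with x = 0, and the Terminating step is what keeps that final state from being reported as a deadlock.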
  38. 5. In practice 1. When to use TLA+ 2. Integration

    with development workflow 3. Software tools 4. Starting with TLA+ / PlusCal 5. A final hint
  39. 5.1 Where/When to Use TLA+ / PlusCal • Complex Concurrent

    Systems: ◦ When designing systems with complex concurrency and distributed algorithms (XBox 360) ◦ Examples: Distributed databases/storage (ex: MongoDB, Cosmos DB, AWS S3), consensus algorithms (Paxos variants, Raft, KRaft, Kafka replication), network protocols (Dropbox, Azure DNS) • Critical Systems: ◦ Systems where correctness is crucial, such as financial systems, medical devices, and aerospace. ◦ Helps in identifying and fixing design bugs before implementation. • Early Design Phase: ◦ Early stages of design to specify and verify high-level system behavior. ◦ Allows for early detection of design flaws and ensures a solid foundation before coding. • Refactoring and Maintenance: ◦ Refactoring or maintaining complex systems to ensure that changes do not introduce new bugs. ◦ Helps in understanding and verifying the impact of changes.
  40. 5.2 Integration with Development Workflow • Specification Writing: ◦ Write

    TLA+ specifications alongside high-level design documents. ◦ Use PlusCal for easier translation of algorithms into TLA+. • Model Checking: ◦ Use TLC or Apalache model checkers to verify TLA+ specifications. ◦ Run TLC as part of the continuous integration (CI) pipeline to catch design bugs early. • Code Generation: ◦ Use TLA+ to generate test cases and invariants that can be used in unit tests. ◦ TLA+ is not typically used for code generation, but the insights gained from model checking can guide implementation. • Documentation: ◦ Include TLA+ specifications in project documentation to provide a formal reference for system behavior. ◦ Use TLA+ to document assumptions, invariants, and safety properties. • Collaboration: ◦ Use TLA+ to facilitate collaboration between developers, architects, and stakeholders. ◦ Provides a common language for discussing system behavior and requirements. • Training and Onboarding: ◦ Train new team members in TLA+ to ensure a consistent understanding of system specifications. ◦ Use TLA+ examples and exercises to onboard new developers.
  41. 5.3 Software tools • From the TLA+ Foundation ◦ As

    of 2025-02-26, current release is Xenophanes, tag 1.7.4 ◦ CLI tools come together in tlatools.jar ▪ tlc2: model checker - use all the time ▪ pcal: the PlusCal to TLA+ transpiler ▪ tla2tex: format source TLA+ to math LaTeX. Combine with pdflatex for PDFs. ▪ tla2sany: syntax checker - better use your IDE plugin ▪ tlc2.REPL: a REPL - new in upcoming Clarke release, tag 1.8.0 - see demo ◦ The Toolbox is the default Eclipse-based IDE, not recommended for work, but useful to follow early tutorials and the profiler UI. Comes in the TLAToolbox*.(zip|deb) packages • Apalache is a symbolic model checker for large models • The TLA Proof System is useful for infinite models, that a brute force checker like TLC cannot verify
  42. 5.4 Starting with TLA+ / PlusCal • Start Small: ◦

    Begin with small, critical components of your system. ◦ Gradually expand the use of TLA+ as you gain confidence. • Iterate: ◦ Use TLA+ iteratively to refine your specifications and designs. ◦ Continuously verify and update your specifications as the system evolves: use CI for this • Collaborate: ◦ Involve your team in writing and reviewing TLA+ specifications. ◦ Foster a culture of formal verification.
  43. 5.5 A final hint From Learn TLA+: “The dirty secret

    of formal methods is that the only way we know to scale it up is to use state machines.”