A TLA+ intro for Software engineers - 3rd edition - SREDay London 2025

The most expensive bugs are often design flaws, especially in the complex concurrent and distributed systems we build today.

This presentation introduces TLA+, a formal specification language created by Leslie Lamport, designed to find these flaws before a single line of code is written. Learn how the teams behind major services such as Amazon S3, Azure Cosmos DB, and Apache Kafka have used formal methods to validate their protocols, prevent outages, and save months of debugging.

The talk delves into the fundamentals of temporal logic, explaining how TLA+ models systems as state machines with actions defining state transitions. Through simple examples like traffic lights and a mutex, you will understand key concepts such as primed variables, stuttering steps, and the crucial difference between safety ("something bad never happens") and liveness ("something good eventually happens") properties. We'll explore how the TLC model checker automatically explores all possible system states to verify these properties and find bugs that traditional testing often misses.

Beyond the theory, this session provides practical guidance on when to use TLA+ (critical systems, early design phase) and how to integrate it into your development workflow, from specification writing to running checks in a CI pipeline. This third edition of the talk includes new content on the latest "Clarke" release of TLA+ tools, notes on the current capabilities of GenAI for writing specifications, and hard-won lessons from the speaker's personal experience.

Frédéric G. MARAND

September 19, 2025

Transcript

  1. A TLA+ intro for Software engineers
    Frédéric G. MARAND / [email protected]
    SREDay London 2025-09-19
    New in this 3rd edition:
    - Clarke release
    - GenAI notes
    - Lessons from experience
  2. Where am I speaking from ?
    • On-demand software architect / lead platform dev at OSInet.fr
    • Specialty: scaling up backends
    • Main stacks
      ◦ Go, Kafka, AWS, IaC
      ◦ PHP, Drupal, MongoDB
    • Main apps: publishing, payments, platform infra
    • Business domains:
      ◦ Media: LeFigaro, Radio France, FranceTV…
      ◦ Retail: Deliveroo, LeBonCoin
      ◦ Government: Medicare, France.fr, CNRS, Sante.fr, ...
  3. Contents
    1. Introduction
    2. Logic Foundation
    3. TLA+ Basics
    4. PlusCal
    5. In Practice: GenAI, personal experience
  4. 1. Introduction
    1. What is TLA+ ?
    2. Why use formal methods ?
    3. Success stories
    4. Value proposition
  5. 1.1 What is TLA+ ?
    • A formal specification language for concurrent and distributed systems
      ◦ NOT a programming language
    • Created by Leslie Lamport,
      ◦ Of LaTeX fame. And it shows… TLA+ started as a LaTeX extension to typeset formal logic
    • Mathematical foundation for describing systems
      ◦ Based on the Zermelo–Fraenkel set theory with the axiom of choice (ZFC)
    • Focus is on system behavior, not implementation
      ◦ So not transpilable to any usual programming language
  6. Turing Award: "For fundamental contributions to the theory and practice of distributed and concurrent systems, notably the invention of concepts such as causality and logical clocks, safety and liveness, replicated state machines, and sequential consistency"
    Image: Leslie Lamport, self-portrait from the TLA+ course
  7. 1.2 Why use it ?
    • Find design flaws before implementation
      ◦ The best implementation is at most as good as the design underlying it
      ◦ Actual implementations…
    • Verify complex algorithms and protocols
      ◦ Models can test all relevant timing combinations
    • Complement traditional testing
      ◦ Unit tests only check the cases devs were able to discover
      ◦ Fuzz tests only check the cases devs were able to generate
    • Most expensive bugs are design flaws
      ◦ No implementation work can fix an undiscovered race condition
  8. 1.3 Success stories
    • Amazon S3: Found critical bugs in internal protocols
      ◦ How formal methods helped AWS to design amazing services
    • Azure Cosmos DB: Verified consistency protocols
      ◦ Understanding Inconsistency in Azure Cosmos DB with TLA+
    • Apache Kafka: Validated replication protocol
      ◦ KIP-966 for KRaft
      ◦ Also other KIPs, e.g. KIP-848: new consumer group rebalance protocol
    • Each saved months of debugging and prevented outages
  9. 1.4 Value proposition
    • Cost of fixing bugs:
      ◦ Design phase: 1x
      ◦ Development phase: 10x
      ◦ Production phase: 100x
    • TLA+ helps find bugs that testing misses
      ◦ Focus on the sad path
      ◦ Even fuzzing tends to only find crashing bugs, not invariant violations
    • Especially valuable for distributed systems
      ◦ Human comprehension is poor regarding relative temporality
  10. 2. Logic foundation
    1. Traditional logic vs temporal logic
    2. State machines: System as states
    3. Actions: State transitions
    4. Simple example: Traffic light
    5. Temporal properties: Safety vs Liveness
    6. Model checking basics
  14. 2.1 Traditional logic vs temporal logic
    • ∀P: P = P
    • Is it ALWAYS true ?
    • How about evaluating P at two different instants ?
    • How about P = “the process is running now” ?
    • Traditional logic evaluates at an abstract instant in time (“now”)
    • Temporal logic adds the time dimension: “now” changes
  15. 2.2 Systems as states
    • The notion that a predicate can be evaluated at different times adds a quantized (discrete) time component (think Planck time)
    • The values of the system at these time quanta are states
      ◦ Within one, traditional logic applies
    • Three main temporal logics:
      ◦ Linear Temporal Logic (Pnueli, 77): evaluates properties over sequences of states
        ▪ That’s the one TLA+ and many derived tools combine with first-order logic
      ◦ Computation Tree Logic (Clarke/Emerson, 81): combines temporal operators with path quantifiers to specify properties over computation trees
      ◦ Branching Temporal Logic (multiple, 80s): evaluates over branching, capturing multiple possible future paths from a state
  16. 2.3 Actions: state transitions
    • Actions:
      ◦ Define how the system takes a step from one state to another
      ◦ Described using preconditions and postconditions.
      ◦ Example: Toggle action for the second light switch.
    • Formal Representation: use primed variables
      ◦ Precondition: Light = Off : the value in the current state
      ◦ Postcondition: Light' = On : the value in the next state
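The precondition/postcondition pair above can be written directly as a TLA+ action. This is a minimal sketch, assuming a single variable `light` with the string values "Off" and "On"; the module and action names are illustrative and do not come from the talk.

```tla
---- MODULE light_switch_sketch ----
VARIABLE light                  \* hypothetical state variable

Init == light = "Off"

\* Each action pairs a precondition on the current (unprimed) value
\* with a postcondition on the next (primed) value.
TurnOn  == light = "Off" /\ light' = "On"
TurnOff == light = "On"  /\ light' = "Off"

\* Toggle is a step of either kind.
Toggle == TurnOn \/ TurnOff
====
```

Note that `light' = "On"` is not an assignment: it is a Boolean condition that is true of a step exactly when the next state has `light` equal to "On".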
  17. 2.4 Simple example: traffic lights
    • Usually, traffic lights run in a loop
      ◦ Green
      ◦ Orange
      ◦ Red
      ◦ restart from top
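The loop above maps onto a three-state machine. The following is a sketch only, assuming one variable `light` holding a colour name as a string; the slide's own specification is not part of the transcript, so every name here is an assumption.

```tla
---- MODULE traffic_lights_sketch ----
VARIABLE light

Colors == {"Green", "Orange", "Red"}

Init == light = "Green"

\* One disjunct per arrow of the loop: Green -> Orange -> Red -> Green
Next ==
    \/ light = "Green"  /\ light' = "Orange"
    \/ light = "Orange" /\ light' = "Red"
    \/ light = "Red"    /\ light' = "Green"

Spec == Init /\ [][Next]_light

\* A simple invariant: the light always shows a known colour
TypeOK == light \in Colors
====
```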
  18. 2.4bis Simple example: Spanish traffic lights
    In Spain, however, traffic lights go to yellow before going from Red to Green. But this has a problem…
  19. 2.4bis Simple example: Spanish traffic lights
    The model allows sequence:
    1. Red
    2. Yellow
    3. Red
    4. Yellow
    5. …
    That’s not fair!
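One way such a model can arise is sketched below. It assumes the first attempt used a single variable `light` and let Yellow move to either neighbour; this is a guess at the shape of the slide's model, not a copy of it, and all names are illustrative.

```tla
---- MODULE traffic_lights_spain_naive ----
VARIABLE light

Init == light = "Red"

Next ==
    \/ light = "Red"    /\ light' = "Yellow"
    \/ light = "Green"  /\ light' = "Yellow"
    \* From Yellow the model cannot tell where it came from,
    \* so it may go back to Red as well as forward to Green.
    \/ light = "Yellow" /\ light' \in {"Red", "Green"}

Spec == Init /\ [][Next]_light

\* Safety: holds in every reachable state
TypeOK == light \in {"Red", "Yellow", "Green"}

\* Liveness: does NOT hold, because the behavior
\* Red, Yellow, Red, Yellow, ... satisfies Spec and never reaches Green
EventuallyGreen == <>(light = "Green")
====
```

The invariant alone passes, which is why the flaw only shows up once a liveness property is stated and checked.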
  20. 2.4ter Simple example: Spanish traffic lights
    For the model to run in a cycle, we can add another variable to the state, forcing the choice on Yellow
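A sketch of that fix follows. The variable names `light` and `going_to_green` and the names `TypeOK` and `NoRegression` appear elsewhere in the deck, but their definitions are not in the transcript: everything below, and in particular the body of `NoRegression`, is an assumption about what they might look like.

```tla
---- MODULE traffic_lights_spain_sketch ----
VARIABLES light, going_to_green
vars == <<light, going_to_green>>

Init == light = "Red" /\ going_to_green = TRUE

Next ==
    \/ /\ light = "Red"
       /\ light' = "Yellow"
       /\ going_to_green' = TRUE
    \/ /\ light = "Yellow"
       /\ going_to_green              \* remembered direction: towards Green
       /\ light' = "Green"
       /\ going_to_green' = FALSE
    \/ /\ light = "Green"
       /\ light' = "Yellow"
       /\ going_to_green' = FALSE
    \/ /\ light = "Yellow"
       /\ ~going_to_green             \* remembered direction: towards Red
       /\ light' = "Red"
       /\ going_to_green' = TRUE

Spec == Init /\ [][Next]_vars

TypeOK ==
    /\ light \in {"Red", "Yellow", "Green"}
    /\ going_to_green \in BOOLEAN

\* Hypothetical reading of NoRegression: a Yellow reached from Red
\* may only move on to Green (or stutter), never fall back to Red.
NoRegression ==
    [][(light = "Yellow" /\ going_to_green) => light' = "Green"]_vars
====
```

Under these assumptions the model has four reachable states: Red, Yellow heading to Green, Green, and Yellow heading to Red.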
  21. 2.5 Temporal Properties: Safety vs Liveness
    • Safety Properties:
      ◦ Ensure that something bad never happens.
      ◦ Example: "The traffic light is never both red and green."
    • Liveness Properties:
      ◦ Ensure that something good eventually happens.
      ◦ Example: "The traffic light will eventually turn green."
    • Which one did our first attempt for Spanish traffic lights fail ?
      ◦ We did not define a liveness rule, so that could not be checked
  22. 2.6 Model checking basics
    • Model Checking:
      ◦ Automated technique to verify temporal properties.
      ◦ Explores all possible states and transitions.
      ◦ Tools: TLC (TLA+ model checker), Apalache (3rd party model checker)
    • Process:
      ◦ Define the system and its properties.
        ▪ In our case: an INVARIANT, a (temporal) PROPERTY, and deadlock detection
      ◦ Use a model checker to explore all possible behaviors.
      ◦ Verify if the properties hold
  23. 2.6bis Model checking Spanish traffic lights
    \* traffic_lights_spain_inv.cfg
    INIT Init
    NEXT Next
    INVARIANT TypeOK
    PROPERTY NoRegression
    CHECK_DEADLOCK TRUE

    % tlc traffic_lights_spain_inv.tla
    …snip…
    Starting... (2025-02-23 16:58:15)
    Computing initial states...
    Finished computing initial states: 1 distinct state generated at 2025-02-23 16:58:15.
    Model checking completed. No error has been found.
    …snip…
    5 states generated, 4 distinct states found, 0 states left on queue.
    …snip…
    Finished in 00s at (2025-02-23 16:58:15)
  24. 3. TLA+ Basics
    1. States and variables
    2. Basic operators: ≜, ⇒, ∧, ∨, =, ∈
    3. Actions with primed variables
    4. Stuttering steps
    5. Temporal formulas
    6. Specifications: Init /\ [][Next]_vars
    7. Simple example: Mutex
    8. Checking Mutex with TLC
  25. 3.1 States and variables
    • States:
      ◦ Represent the current configuration of the system.
      ◦ Defined by the values of variables.
    • Variables:
      ◦ Store the state of the system.
      ◦ Can be of various types: integers, sets, sequences, etc.
    • Example:
      ◦ Our simple traffic light system: light (with values Red, Green, Yellow).
      ◦ Our Spanish variant: light and going_to_green
  26. 3.2 Basic operators: ≜, ⇒, ∧, ∨, =, ∈
    • Definition: ≜ (typed == in ASCII)
      ◦ Gives a name to an expression or operator. It is not assignment: TLA+ has no assignment, and variables only change through primed expressions in actions.
      ◦ Example: Max ≜ 5 (Max now names the value 5)
    • Implication: ⇒
      ◦ Logical implication.
      ◦ Example: A ⇒ B (if A then B)
    • Conjunction: ∧
      ◦ Logical AND.
      ◦ Example: A ∧ B (A and B)
    • Disjunction: ∨
      ◦ Logical OR.
      ◦ Example: A ∨ B (A or B)
    • Equality: =
      ◦ Checks for equality.
      ◦ Example: x = 5
    • Membership: ∈
      ◦ Checks for membership in a set.
      ◦ Example: x ∈ {1, 2, 3}
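In a `.tla` source file these operators are typed in ASCII. The sketch below shows each one in a definition; the operator names (`Small`, `IsSmall`, and so on) are made up for illustration.

```tla
---- MODULE operators_sketch ----
EXTENDS Integers

Small == {1, 2, 3}                       \* ≜ is typed ==

IsSmall(x) == x \in Small                \* ∈ is typed \in

InRange(x) == (x >= 1) /\ (x <= 3)       \* ∧ is typed /\

Outside(x) == (x < 1) \/ (x > 3)         \* ∨ is typed \/

\* ⇒ is typed =>; = is plain equality and never changes a value
Claim == \A x \in Small : IsSmall(x) => (InRange(x) = TRUE)
====
```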
  27. 3.3 Actions with primed variables
    • Primed Variables:
      ◦ Represent the next state of a variable.
      ◦ Used to describe state transitions.
    • Example:
      ◦ light' = "Green" means the next state of light is Green.
    • Action Definition:
      ◦ Precondition: light = "Red"
      ◦ Postcondition: light' = "Green"
  28. 3.4 Stuttering steps
    • Stuttering Steps:
      ◦ Steps where the state does not change.
        ▪ No system moves infinitely fast (ever waited on Red ?)
        ▪ As time passes, the system may step from a state to that same state
      ◦ Important for modeling concurrent systems.
    • Example:
      ◦ In a concurrent system, one process might not change state while another does.
    • Formal Representation:
      ◦ light' = light (the light stays the same)
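How stuttering enters a specification can be sketched as follows, assuming a two-colour light; the module and definition names are illustrative. The `[A]_v` notation is standard TLA+ shorthand for "an `A` step, or a step that leaves `v` unchanged".

```tla
---- MODULE stuttering_sketch ----
VARIABLE light

Init == light = "Red"

Change ==
    \/ light = "Red"   /\ light' = "Green"
    \/ light = "Green" /\ light' = "Red"

\* [Change]_light abbreviates exactly this disjunction:
ChangeOrStutter == Change \/ UNCHANGED light

\* Allows behaviors that stutter forever on Red
Spec == Init /\ [][Change]_light

\* Weak fairness rules out stuttering forever while Change stays enabled
FairSpec == Spec /\ WF_light(Change)
====
```

Because `Spec` permits endless stuttering, liveness properties generally need a fairness condition such as the one in `FairSpec` before they can hold.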
  29. 3.5 Temporal formulas
    • Temporal Operators:
      ◦ [] (always, aka box): Property holds in all states.
      ◦ <> (eventually, aka diamond): Property holds now or in some future state.
      ◦ ~> (leads to): Whenever one property holds, another holds then or later.
    • Example:
      ◦ [](light = "Red" => <>(light = "Green"))
        ▪ Always, if the light is red, it will eventually turn green.
        ▪ This is called a liveness property (system will move from such a state)
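The three operators can be sketched side by side as named properties. This assumes a variable `light` with string values; the property names are illustrative.

```tla
---- MODULE temporal_sketch ----
VARIABLE light

\* [] : holds in every state of the behavior
AlwaysValid == [](light \in {"Red", "Yellow", "Green"})

\* <> : holds in at least one state of the behavior
EventuallyGreen == <>(light = "Green")

\* ~> : every Red state is followed, then or later, by a Green state
RedLeadsToGreen == (light = "Red") ~> (light = "Green")

\* F ~> G is defined as [](F => <>G), so this states the same property
RedLeadsToGreenLong == []((light = "Red") => <>(light = "Green"))
====
```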
  30. 3.6 Specifications
    • Specification Structure uses conventional names, defaults in CFG
      ◦ Init: Initial state predicate.
      ◦ Next: Next state relation.
      ◦ Spec: Overall specification.
    • Formal Representation:
      ◦ Spec == Init /\ [][Next]_vars
      ◦ Init defines the initial state, Next defines the possible transitions.
      ◦ vars is the tuple (finite sequence) of all variables of the system.
    • TLC CFG may either define INIT and NEXT or SPECIFICATION,
      ◦ Defaults to Init, Next, Spec respectively
  31. 3.7 Simple example: Mutex with 3 processes
    • TLC is a brute-force checker with smart optimizations, so check time depends on the number of states.
    • Would most teams identify and test all these states ? And for more processes ?
  32. 3.7 Mutex cost: Sweeping costs explode fast - bis
    Can you guess when I started TLC to check a model ?
  33. 3.7 Simple example: Mutex in 50 lines
    ---- MODULE mutex_sweeping ----
    EXTENDS Integers, FiniteSets

    CONSTANT N              \* Number of processes
    ASSUME N \in Nat        \* N must be a natural number
    ASSUME N > 1            \* Need at least 2 processes

    VARIABLES
        flag,               \* Array of flags, one per process
        turn                \* Whose turn is it to enter critical section

    Proc == 1..N            \* N processes: that’s sweeping
    vars == <<flag, turn>>

    Init ==
        /\ flag = [i \in Proc |-> FALSE]    \* No one wants to enter initially
        /\ turn = 1                         \* Process 1 goes first

    Try(i) ==       \* Process i wants to enter critical section
        /\ flag' = [flag EXCEPT ![i] = TRUE]
        /\ UNCHANGED turn

    Give(i) ==      \* Process i gives turn to other process
        /\ flag[i] = TRUE
        /\ \E j \in Proc \ {i} : turn' = j  \* Give turn to any other process
        /\ UNCHANGED flag

    Enter(i) ==     \* Process i enters critical section if it's their turn
        /\ flag[i] = TRUE
        /\ turn = i
        /\ \A j \in Proc \ {i} : ~flag[j]   \* No other process wants to enter
        /\ UNCHANGED vars

    Exit(i) ==      \* Process i leaves critical section
        /\ flag' = [flag EXCEPT ![i] = FALSE]
        /\ UNCHANGED turn

    Next == \E i \in Proc : Try(i) \/ Give(i) \/ Enter(i) \/ Exit(i)

    Spec == Init /\ [][Next]_vars

    \* Type correctness invariant
    TypeOK ==
        /\ flag \in [Proc -> BOOLEAN]
        /\ turn \in Proc

    \* Safety: No two processes in critical section simultaneously
    Mutex == [][\A i,j \in Proc : i # j => ~(Enter(i) /\ Enter(j))]_vars
    ====
  34. 3.7 Checking Mutex with TLC
    \* mutex_sweep.cfg
    SPECIFICATION Spec
    CONSTANT N = 3 \* Try other values
    INVARIANT TypeOK
    PROPERTY Mutex
    CHECK_DEADLOCK TRUE

    Run check with: tlc mutex_sweep.tla
    • SPECIFICATION references the Spec temporal formula, which MUST be shaped like: Init /\ [][Next]_vars
      ◦ Predicate Init holds in the first state AND every step either satisfies Next or leaves vars unchanged
      ◦ Because vars == <<flag, turn>> this is syntactic sugar for Init /\ [](Next \/ (flag' = flag /\ turn' = turn))
    • The value of N is a CONSTANT in TLA+, the model assigns it an actual value
    • INVARIANT means TypeOK is true in every reachable state
    • PROPERTY describes the (safety: []) property Mutex, which states that no two processes may enter simultaneously. Could also be liveness (<> or ~>), or fairness (WF_vars / SF_vars).
    • CHECK_DEADLOCK makes TLC report any reachable state that has no successor
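Mutual exclusion can also be stated as a plain state invariant when the model records who is inside the critical section. The sketch below is a deliberately different, simpler algorithm (a round-robin turn) and not the author's model; the variable `in_cs` and every other name are assumptions made for illustration.

```tla
---- MODULE mutex_state_sketch ----
EXTENDS Integers, FiniteSets

CONSTANT N
ASSUME N \in Nat /\ N > 1

VARIABLES
    turn,       \* which process may enter next
    in_cs       \* set of processes currently in the critical section

Proc == 1..N
vars == <<turn, in_cs>>

Init == turn = 1 /\ in_cs = {}

Enter(i) ==
    /\ turn = i
    /\ in_cs = {}
    /\ in_cs' = {i}
    /\ UNCHANGED turn

Exit(i) ==
    /\ i \in in_cs
    /\ in_cs' = in_cs \ {i}
    /\ turn' = (i % N) + 1          \* pass the turn round-robin

Next == \E i \in Proc : Enter(i) \/ Exit(i)

Spec == Init /\ [][Next]_vars

\* Safety as a state predicate: at most one process inside at a time
MutexInv == Cardinality(in_cs) <= 1
====
```

With a model like this, the configuration file would list `MutexInv` under INVARIANT instead of a temporal PROPERTY, at the cost of one extra state variable.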
  35. 4. PlusCal
    1. Bridging code and specifications
    2. Algorithm structure
    3. Variables and assignments
    4. Atomic steps and labels
    5. Multi-process algorithms
    6. Translation to TLA+
    Only in full-length presentations, sorry !
  36. 5. In practice
    1. When to use TLA+
    2. Integration with development workflow
    3. Software tools
    4. Starting with TLA+ / PlusCal
    5. A final hint from LearnTLA+
    6. GenAI ?
    7. Lessons from personal experience
  37. 5.1 Where/When to Use TLA+ / PlusCal
    • Complex Concurrent Systems:
      ◦ When designing systems with complex concurrency and distributed algorithms (XBox 360)
      ◦ Examples: Distributed databases/storage (ex: MongoDB, Cosmos DB, AWS S3), consensus algorithms (Paxos variants, Raft, KRaft, Kafka replication), network protocols (Dropbox, Azure DNS)
    • Critical Systems:
      ◦ Systems where correctness is crucial, such as financial systems, medical devices, and aerospace.
      ◦ Helps in identifying and fixing design bugs before implementation.
    • Early Design Phase:
      ◦ Early stages of design to specify and verify high-level system behavior.
      ◦ Allows for early detection of design flaws and ensures a solid foundation before coding.
    • Refactoring and Maintenance:
      ◦ Refactoring or maintaining complex systems to ensure that changes do not introduce new bugs.
      ◦ Helps in understanding and verifying the impact of changes.
  38. 5.2 Integration with Development Workflow
    • Specification Writing:
      ◦ Write TLA+ specifications alongside high-level design documents,
      ◦ Consider PlusCal for easier translation of algorithms into TLA+.
    • Model Checking:
      ◦ Use TLC or Apalache model checkers to verify TLA+ specifications.
      ◦ Run TLC as part of the (CI) pipeline to catch design bugs early.
    • Code Generation:
      ◦ Use TLA+ to generate test cases and invariants that can be used in unit tests.
      ◦ TLA+ is not transpiled to code, but model checking results can guide implementers
    • Documentation:
      ◦ Include TLA+ specifications in project documentation to provide a formal reference for system behavior.
      ◦ Use TLA+ to document assumptions, invariants, and safety properties.
    • Collaboration:
      ◦ Use TLA+ to facilitate collaboration between developers, architects, and stakeholders.
      ◦ Provides a common language for discussing system behavior and requirements.
    • Training and Onboarding:
      ◦ Train new team members to ensure a consistent understanding of system specifications.
      ◦ Use TLA+ examples/exercises to onboard new devs.
  39. 5.3 Software tools
    • From the TLA+ Foundation
      ◦ As of 2025-09-19, current release is Clarke, tag 1.8.0 (just released two days ago)
      ◦ CLI tools come together in tlatools.jar
        ▪ tlc2: model checker - use all the time
        ▪ tlc2.REPL: a REPL - the big new shiny in Clarke, 5 years in the making - see demo
        ▪ pcal: the PlusCal to TLA+ transpiler
        ▪ tla2tex: format source TLA+ to math LaTeX. Combine with pdflatex for PDFs.
        ▪ tla2sany: syntax checker - better use your IDE plugin
      ◦ The Toolbox is the default Eclipse-based IDE, no longer recommended for work, but useful for following early tutorials and for its profiler UI. Comes in the TLAToolbox*.(zip|deb) packages
    • Apalache is a symbolic model checker for large models, translates for SMT solvers
    • The TLA Proof System is useful for infinite models
      ◦ A brute force checker like TLC cannot verify those, by design
      ◦ Can only check safety properties, ongoing work for liveness properties
  40. 5.4 Starting with TLA+ / PlusCal
    • Start Small:
      ◦ Begin with small, critical components of your system.
      ◦ Gradually expand the use of TLA+ as you gain confidence.
    • Iterate:
      ◦ Use TLA+ iteratively to refine your specifications and designs.
      ◦ Continuously verify / update your specifications using CI as the system evolves
    • Collaborate:
      ◦ Involve your team in writing and reviewing TLA+ specifications.
      ◦ Foster a culture of formal verification.
  41. 5.5 A final hint from “Learn TLA+”
    « The dirty secret of formal methods is that the only way we know to scale it up is to use state machines »
  42. 5.6 GenAI ?
    • Time-dependent expiration: checked in 2025 Q1, Q2, Q3…
    • Code chats (Claude, GPT, Gemini, Codestral) are unbelievably inappropriate
      ◦ They will hallucinate incorrect code, say they can fix it, and emit it again
      ◦ None of them “understands” that the single PlusCal language has two different syntaxes: they will usually use C-syntax while calling it P-syntax, and omit syntactic elements needed for valid code
    • Not worth the time and money using them …
      …at this point in time
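For readers who have not met the two PlusCal syntaxes, here is the same trivial algorithm written both ways. It is a sketch with made-up module, algorithm, and label names; each module would live in its own `.tla` file, and the PlusCal translator would add the generated TLA+ after the comment.

```tla
\* File counter_p.tla: P-syntax, using begin/end keywords
---- MODULE counter_p ----
EXTENDS Integers
(* --algorithm counter_p
variable x = 0;
begin
  Incr:
    while x < 3 do
      x := x + 1;
    end while;
end algorithm; *)
====

\* File counter_c.tla: C-syntax, using braces and a parenthesized test
---- MODULE counter_c ----
EXTENDS Integers
(* --algorithm counter_c {
  variable x = 0;
  {
    Incr:
      while (x < 3) {
        x := x + 1;
      }
  }
} *)
====
```

Mixing the two, for instance braces together with `end while`, is the kind of output that fails translation.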
  43. 5.7 Lessons learned from experience
    • The first time, my code still broke. WTH ?
      ◦ The part I modeled proved to be bug-free, especially as it used a FSM
      ◦ The package containing it was not:
        ▪ I failed to consider what happened after the FSM final state.
        ▪ The implementation had observability dependencies. Guess what…
    • Lessons learned
      ◦ Verify what code does before initial and after final model states
      ◦ Do not stop at that first critical component
      ◦ If your code has dependencies, they must be part of the model
      ◦ A single additional state property can save big on concurrent execution (CPU for GC and scheduling, maintenance)
  44. Just one more thing …
    • Read the books (affiliate links)
      ◦ Hillel Wayne’s “Practical TLA+” https://amzn.to/3HYU7Sx
      ◦ The original Leslie Lamport book https://amzn.to/4l31t64
    • Or, better, let’s talk !
      ◦ LinkedIn: https://linkedin.com/in/marand
      ◦ Bluesky: https://bsky.app/profile/fgmarand.bsky.social