Formal Reasoning to Build Subtle Systems in Go

When you set out to write a program, you start with a blueprint or a rough sketch of the behaviour you want, without worrying yet about how you will implement it. This is a great practice in general, especially when you can conceive of all the possible routes your program can take and are clear about what you want it to achieve. But what if you’re building critical or complex systems?

Go provides out-of-the-box support for building concurrent and distributed systems. These are notoriously hard to build and test, and you quickly realise that you need a way to specify that the system works correctly under every condition it can encounter.

For this, you need tools to check your blueprints and to think about your programs scientifically. The best way to specify “precisely” is to build mathematical models of your programs that define unambiguously what your program should do before you implement the how. TLA+ is one such way of describing your designs.

It uses math, and a language built around it, to formally specify your system’s behaviour, sweep exhaustively through all the states your system can be in, and verify that in every one of them your program behaves as your specification defines.

Describing your systems formally does more than ensure that a critical system is designed correctly. For existing systems, it can uncover subtle bugs that simple testing would not easily have found, and it is a great way to document and explain your system precisely and without ambiguity.

Raghav Roy

June 15, 2024

Transcript

  3. What I will be covering • Basics of Formal Reasoning

    • What thinking “above” your code means • Examples of using Formal Methods in Go Concurrency problems • Formal Methods used in production
  4. What I will not be covering • In-depth language-specific

    details • How to use tooling around TLA+
  5. How to think about software • ‘What’ do you want

    it to do, before the ‘How’ • Informally or Formally writing down the expected behaviour
  6. What it comes down to • How do we ensure

    that our system is designed in a way that it doesn’t crash or reach incorrect states?
  9. Where can we see it crop up • Multiple systems,

    that are running independently, and have a shared global state • Non-deterministic: Two executions of the same program with the same input can produce different results • Example: Writers and Readers from a single queue, results can differ with just changing the order of who writes and who reads from the shared queue
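The writer/reader example on this slide can be sketched in Go. This is a minimal, deterministic simulation rather than real goroutines: the `op` type and `run` function are hypothetical names introduced for illustration, and a read from an empty queue is assumed to yield -1. The same three operations are replayed in different orders to show that the reader's result depends only on the schedule.

```go
package main

import "fmt"

// op is one step a writer or the reader takes on the shared queue.
type op struct {
	write bool
	value int // only meaningful when write is true
}

// run applies the ops in the given order to an empty FIFO queue and
// returns what the reader observed. Reading an empty queue yields -1.
func run(schedule []op) []int {
	var queue, seen []int
	for _, o := range schedule {
		if o.write {
			queue = append(queue, o.value)
			continue
		}
		if len(queue) == 0 {
			seen = append(seen, -1)
			continue
		}
		seen = append(seen, queue[0])
		queue = queue[1:]
	}
	return seen
}

func main() {
	w1, w2, r := op{true, 1}, op{true, 2}, op{write: false}
	fmt.Println(run([]op{w1, w2, r})) // writer 1 goes first
	fmt.Println(run([]op{w2, w1, r})) // writer 2 goes first
	fmt.Println(run([]op{r, w1, w2})) // reader goes first
}
```

With real goroutines the scheduler picks the order, which is why two runs with the same input can disagree.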
  10. Ye Olde Banking System Even with a simple monolith architecture,

    with just a frontend, a backend and a database, there are two points of concurrency.
  14. Ye Olde Banking System In a system where Person A

    can transfer money to Person B ▪ Bank needs to check if Person A has sufficient funds ▪ Add amount to Person B’s bank account ▪ Deduct amount from Person A’s bank account
  15. Ye Olde Banking System • Just in this simple system,

    one step may not finish before the other starts -> Races, Crashes/Partial Failures
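To make the race concrete, here is a small Go sketch that replays every interleaving of two transfers, each following the three steps from the earlier slides (check funds, credit B, debit A). The `bank`, `transfer`, `advance` and `explore` names, the starting balance of 100 and the transfer amount of 100 are all assumptions made for this illustration.

```go
package main

import "fmt"

// bank holds the balances of accounts A and B.
type bank struct{ a, b int }

// transfer is one in-flight transfer of amount from A to B.
type transfer struct {
	amount   int
	step     int // 0: check funds, 1: credit B, 2: debit A, 3: done
	approved bool
}

// advance performs the next step of t against the bank.
func advance(bk *bank, t *transfer) {
	switch t.step {
	case 0:
		t.approved = bk.a >= t.amount
	case 1:
		if t.approved {
			bk.b += t.amount
		}
	case 2:
		if t.approved {
			bk.a -= t.amount
		}
	}
	t.step++
}

// explore tries every interleaving of the remaining steps and returns
// how many complete interleavings exist and how many leave A overdrawn.
func explore(bk bank, ts []transfer) (total, bad int) {
	done := true
	for i := range ts {
		if ts[i].step < 3 {
			done = false
			nb := bk                              // copy the bank state
			nts := append([]transfer(nil), ts...) // copy the transfers
			advance(&nb, &nts[i])
			t, b := explore(nb, nts)
			total += t
			bad += b
		}
	}
	if done {
		if bk.a < 0 {
			return 1, 1
		}
		return 1, 0
	}
	return total, bad
}

func main() {
	total, bad := explore(bank{a: 100}, []transfer{{amount: 100}, {amount: 100}})
	fmt.Printf("%d of %d interleavings overdraw account A\n", bad, total)
}
```

Only the two schedules in which one transfer finishes completely before the other starts keep A's balance non-negative; every other schedule lets both checks pass before either debit lands.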
  17. Ye Olde Banking System • Can writing Unit Tests solve

    this issue? For this example, if the number of simultaneous transfers is N, the number of unit tests to write is (3N)!/(3!)^N. Huge for such a simple system! (1680 for 3 transactions)
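The formula on this slide counts the ways N transfers of three ordered steps each can be interleaved. A short Go sketch using `math/big` evaluates it; the function name `interleavings` and its `steps` parameter are illustrative choices, not something from the talk.

```go
package main

import (
	"fmt"
	"math/big"
)

// interleavings returns (steps*n)! / (steps!)^n, the number of ways to
// interleave n sequences of `steps` ordered steps each.
func interleavings(n, steps int64) *big.Int {
	num := new(big.Int).MulRange(1, n*steps)    // (steps*n)!
	stepFact := new(big.Int).MulRange(1, steps) // steps!
	den := new(big.Int).Exp(stepFact, big.NewInt(n), nil)
	return num.Div(num, den)
}

func main() {
	for n := int64(1); n <= 4; n++ {
		fmt.Printf("N=%d: %s interleavings\n", n, interleavings(n, 3))
	}
}
```

For N = 3 this is 9!/6^3 = 362880/216 = 1680, and for N = 4 it is already 369600.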
  18. Very simple to very complex, where does building concurrent/distributed systems

    lie? *This part of the talk is from Lamport’s Talk
  19. Blueprints, and its spectrum We need tools to check this

    *This part of the talk is from Lamport’s Talk
  22. Modeling Programs ◦ Programs can be modeled in a number

    of ways: Turing Machines, Automata, Programming Languages ◦ But all of these can be described in terms of a State Machine ▪ This means describing your program as a set of ‘behaviours’ where each behaviour is a ‘sequence of discrete steps’
  25. Modeling Programs • This requires us to define the initial

    state of the system, and the next state of the system • You can have multiple next states for a current state (model non-determinism)
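As a rough illustration of this slide, a state machine can be written in Go as an initial state plus a `next` function that returns every possible successor. The example system is an assumption made up for this sketch: two pending messages, each of which is either delivered or dropped. Returning two successors from `next` is how the non-determinism is modeled, and `behaviours` lists every resulting sequence of states.

```go
package main

import "fmt"

// state is a snapshot of the hypothetical system.
type state struct{ pending, delivered int }

// initial returns the single initial state: two messages waiting.
func initial() state { return state{pending: 2} }

// next returns every state reachable in one step. More than one entry
// means the system may take any of them (non-determinism).
func next(s state) []state {
	if s.pending == 0 {
		return nil // nothing left to do
	}
	return []state{
		{s.pending - 1, s.delivered + 1}, // the message is delivered
		{s.pending - 1, s.delivered},     // the message is dropped
	}
}

// behaviours lists every complete sequence of states starting from s.
func behaviours(s state) [][]state {
	succ := next(s)
	if len(succ) == 0 {
		return [][]state{{s}}
	}
	var out [][]state
	for _, n := range succ {
		for _, tail := range behaviours(n) {
			out = append(out, append([]state{s}, tail...))
		}
	}
	return out
}

func main() {
	for _, b := range behaviours(initial()) {
		fmt.Println(b)
	}
}
```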
  26. Model Checking TLA+ is a language that lets you write

    specifications formally. “Formal” specs are needed if you want to apply tools to them.
  27. Model Checking Model checkers verify the correctness of your specification

    by running it against all possible executions of your program.
  30. Model Checking More Specific Model checkers verify systems by induction,

    by enumerating possible states a system can take on, and showing that none of the states violates the system requirements (the specification).
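The idea of enumerating states and checking each one against a requirement can be sketched as a tiny breadth-first explorer in Go. This is only an illustration of the principle, not how TLC or any real model checker is implemented. The system being checked is an assumption chosen for the example: two processes that each read a shared counter and then write back the value plus one, with the requirement that the counter equals 2 once both are done.

```go
package main

import "fmt"

// state holds the shared counter plus, per process, a program counter
// (0: about to read, 1: about to write, 2: done) and its local copy.
type state struct {
	counter int
	pc      [2]int
	local   [2]int
}

// next returns every state reachable by letting one process take a step.
func next(s state) []state {
	var out []state
	for p := 0; p < 2; p++ {
		n := s
		switch s.pc[p] {
		case 0:
			n.local[p] = s.counter // read the shared counter
		case 1:
			n.counter = s.local[p] + 1 // write back local value + 1
		default:
			continue // this process has finished
		}
		n.pc[p]++
		out = append(out, n)
	}
	return out
}

// invariant: once both processes are done, the counter must be 2.
func invariant(s state) bool {
	return s.pc != [2]int{2, 2} || s.counter == 2
}

// check explores reachable states breadth-first. It returns the number
// of states discovered so far and the first violating state, if any.
func check(init state) (discovered int, bad *state) {
	seen := map[state]bool{init: true}
	queue := []state{init}
	for len(queue) > 0 {
		s := queue[0]
		queue = queue[1:]
		if !invariant(s) {
			return len(seen), &s
		}
		for _, n := range next(s) {
			if !seen[n] {
				seen[n] = true
				queue = append(queue, n)
			}
		}
	}
	return len(seen), nil
}

func main() {
	discovered, bad := check(state{})
	fmt.Println("states discovered:", discovered)
	if bad != nil {
		fmt.Printf("invariant violated in %+v\n", *bad)
	}
}
```

The explorer finds the lost-update state in which both processes read 0 before either wrote, leaving the counter at 1.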
  32. Model Checking Model checkers can check two things • Liveness:

    Good things eventually happen (Temporal logic) • Safety: Bad things won’t happen
  34. Model Checking What can Safety look like? ▪ Two threads

    can’t both be in a critical section at the same time. ▪ Users cannot write to files they don’t have access to. ▪ We never use more than 500 kb of RAM. ▪ The user_id key in the table is unique. ▪ We never add a string to an integer. These are Invariants
  36. Model Checking Then what is Liveness? Every message is received

    at least once by each client. ▪ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future. An infinite sequence of steps is required to break this
  37. Model Checking Then what is Liveness? Every message is received

    at least once by each client. ▪ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future. ▪ The only way to break a liveness property is to show that at no point in the future does it ever become true.
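For a finite state graph, "at no point in the future does it ever become true" amounts to finding a behaviour that stays in non-good states forever, either by looping or by getting stuck. The Go sketch below searches for such a behaviour with a depth-first walk. It is a simplified illustration that ignores the fairness assumptions a real TLA+ spec would normally state; the `graph` type, the function names and the two example systems are hypothetical.

```go
package main

import "fmt"

// graph maps each state to its possible next states.
type graph map[string][]string

// canAvoidForever reports whether some behaviour starting at s never
// reaches a state where good holds: it either loops through non-good
// states or gets stuck in one.
func canAvoidForever(g graph, s string, good func(string) bool, onPath, done map[string]bool) bool {
	if good(s) {
		return false
	}
	if onPath[s] {
		return true // a cycle made only of non-good states
	}
	if done[s] {
		return false // already explored, no violation from here
	}
	if len(g[s]) == 0 {
		return true // stuck forever in a non-good state
	}
	onPath[s] = true
	for _, n := range g[s] {
		if canAvoidForever(g, n, good, onPath, done) {
			return true
		}
	}
	onPath[s] = false
	done[s] = true
	return false
}

// eventually reports whether every behaviour from init reaches a good state.
func eventually(g graph, init string, good func(string) bool) bool {
	return !canAvoidForever(g, init, good, map[string]bool{}, map[string]bool{})
}

func main() {
	delivered := func(s string) bool { return s == "delivered" }

	// A lossy channel can lose and resend the message forever.
	lossy := graph{"sending": {"delivered", "lost"}, "lost": {"sending"}}
	// A reliable channel always delivers.
	reliable := graph{"sending": {"delivered"}}

	fmt.Println("lossy:", eventually(lossy, "sending", delivered))
	fmt.Println("reliable:", eventually(reliable, "sending", delivered))
}
```

The counterexample for the lossy channel is the infinite loop sending, lost, sending, lost; no finite prefix of it is enough to show the violation, which is the point the slide makes.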
  40. Concurrency Bug : • This was an actual error that

    was found in the very popular Gops library • Hillel Wayne then demonstrates the bug with his TLA+ spec for the implementation and finds the deadlock condition
  44. Design • Imagine a system, working a bit like a

    pipeline consisting of three steps: Step 1 - The Input: One component handles some incoming data, does some initial processing, and sends an event on a queue. Step 2 - The Processor: a component that will receive the event sent at 1, and do the processing, and send the result on yet another queue. Step 3 - The Output: where the output from 2 is further handled downstream.
  47. Why does it deadlock? The problem was the following sequence

    of steps: • Step 1 would ◦ Process input ◦ Add it to the shared data ◦ Send an event to Step 2 containing an identifier
  49. Why does it deadlock? The problem was the following sequence

    of steps: • Step 2 would ◦ Prune the shared data ◦ Remove the object that was added above at Step 1.
  51. Why does it deadlock? The problem was the following sequence

    of steps: • Step 2 then received the event from Step 1 No object in shared data!
  52. Why does it deadlock? The problem was the following sequence

    of steps: Race between • Step 1 adding the object to the shared data and sending an event to Step 2 • Step 2 pruning the object before handling that event
  54. Modeling the Fix : The Fix! • The SendIncoming(id) step

    should only put the identifier on the queue • The ReceiveIncoming step should add the object to, and eventually prune from, the shared storage.
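The problematic sequence and the fix can be replayed in a small Go simulation. This is a sketch under stated assumptions, not the referenced TLA+ spec: `sendIncoming` and `receiveIncoming` mirror the step names on the slide, while the `system` type, the `prune` policy (remove everything) and the `run` helper are hypothetical. The schedule is fixed to the bad order from the earlier slides: send, then prune, then receive.

```go
package main

import "fmt"

// system models the shared storage and the queue between Step 1 and Step 2.
type system struct {
	shared map[int]string // shared data, keyed by identifier
	queue  []int          // identifiers sent from Step 1 to Step 2
	fixed  bool           // true selects the fixed design
}

// sendIncoming is Step 1. In the buggy design it also writes the object
// to shared storage; in the fixed design it only enqueues the identifier.
func (s *system) sendIncoming(id int) {
	if !s.fixed {
		s.shared[id] = "object"
	}
	s.queue = append(s.queue, id)
}

// prune is run by Step 2 and, in this sketch, removes everything.
func (s *system) prune() {
	for id := range s.shared {
		delete(s.shared, id)
	}
}

// receiveIncoming is Step 2 handling one event. In the fixed design it
// adds the object itself. It reports whether the object was found.
func (s *system) receiveIncoming() bool {
	id := s.queue[0]
	s.queue = s.queue[1:]
	if s.fixed {
		s.shared[id] = "object"
	}
	_, ok := s.shared[id]
	return ok
}

// run replays the bad schedule: send, prune, then receive.
func run(fixed bool) bool {
	s := &system{shared: map[int]string{}, fixed: fixed}
	s.sendIncoming(1)
	s.prune()
	return s.receiveIncoming()
}

func main() {
	fmt.Println("buggy design finds object:", run(false))
	fmt.Println("fixed design finds object:", run(true))
}
```

Giving a single step ownership of both adding and pruning removes the window in which a prune can slip between the add and the event being handled.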
  56. Who uses this in production? • This is not limited

    to modeling toy systems; it applies to real systems too. Here is what Amazon engineers had to say (they also wrote a paper)
  59. Who uses this in production? • They used TLA+ in

    10+ large, complex real-world systems • In every case, TLA+ added significant value by preventing subtle, serious bugs that could have reached production • Gave them enough understanding and confidence to make aggressive optimisations without sacrificing correctness of their systems
  63. What do programmers need to know about thinking about your

    code in this way? • Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong • There is a need to think before coding, or more clearly, the need to Write before you code • Any piece of code that someone is likely to use or modify, needs to be specified in some way • That someone can be you next month * More Lamport goodness
  66. What do programmers need to know about thinking about your

    code in this way? • There is importance in specifying everything your code does and if required, how it does it • You should always specify your code “above” the code level, in terms of states and behaviours or input/output behaviours. • Thinking mathematically provides a rigorous way of doing this, in precise terms and being as unambiguous as possible. * More Lamport goodness
  69. Why write formal specs? • Disclaimer: finding bugs in code

    is not the intended purpose of writing a formal spec. Writing a formal spec is hard work; it requires thinking in a way that isn’t intuitive to most programmers, as we are generally used to implementing the ‘How’
  71. Why write formal specs? • Even if the above talk

    doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway * More Lamport goodness
  73. Why write formal specs? • Even if the above talk

    doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway • Trace out the possible states your program can be in, and reason about it • A great way to document your systems with precise language
  74. References • Leslie Lamport’s lectures on TLA+ • Hillel Wayne’s

    lecture on “Tackling Concurrency Bugs with TLA+” • Hillel Wayne’s TLA+ spec and fix for the Gops bug • Leslie Lamport’s video on Thinking Above the Code • Gregory Terzian's article for TLA+ spec for modeling concurrency bug and fix • Gops issue link • Amazon’s paper on Using Formal Verification Techniques in Production • Image References ◦ Speaker deck from Leslie Lamport’s lectures for Blueprint Spectrum ◦ Simplified version of Gops FindAll function screenshots Gopher Credits: Renée French, Tenntenn, Maria Letta, Women Who Go Speaker Deck: speakerdeck.com/royra