Hardening Go Concurrency, Using Formal Methods of Verifying Correctness

Raghav Roy
September 15, 2023

When you write a program, you usually start with a blueprint or a rough sketch of the behaviour you want, without yet worrying about how you will implement it. This is good practice in general, especially when you can foresee every route your program can take and are clear about what you want it to achieve. But what if you are building critical or complex systems?

Go supports building concurrent and distributed systems out of the box. Such systems are notoriously hard to build and test, and you quickly realise that you need to specify that the system works correctly under every condition it can encounter.

For this, you need tools to check your blueprints and to think about your programs scientifically. The best way to specify precisely is to build mathematical models of your programs, defining unambiguously what your program should do before you implement the how. TLA+ is one such way of describing your designs.

It uses mathematics, and a language built around it, to formally specify your system’s behaviour, sweep exhaustively through all the states your system can be in, and verify that the design behaves as your specification requires in every one of them.

Describing your systems formally does more than ensure that a critical system is designed correctly. For existing systems, it can uncover subtle bugs that simple testing would not easily find, and it provides a precise, unambiguous way of documenting and explaining your system.

Transcript

  1. What I will be covering • Basics of Formal Verification

    • What thinking “above” your code means
  2. What I will be covering • Basics of Formal Verification

    • What thinking “above” your code means • Examples of using Formal Verification in Go Concurrency problems
  3. What I will be covering • Basics of Formal Verification

    • What thinking “above” your code means • Examples of using Formal Verification in Go Concurrency problems • Formal verification used in production
  4. What I will not be covering • In-depth language-specific details

    • How to use the tooling around TLA+ (not covered in detail)
  5. Why do we think? *This part of the talk is

    borrowed from Lamport’s Talk
  6. Well, it helps us do things, like building a house

    *This part of the talk is borrowed from Lamport’s Talk
  7. When should we think? *This part of the talk is

    borrowed from Lamport’s Talk
  8. For programs, you ideally should think about your code before

    you start writing any code *This part of the talk is borrowed from Lamport’s Talk
  9. “Writing is nature’s way of telling you how sloppy your

    thinking really is” - Guindon *This part of the talk is borrowed from Lamport’s Talk
  10. How to think • ‘What’ do you want it to

    do. *This part of the talk is borrowed from Lamport’s Talk
  11. How to think • ‘What’ do you want it to

    do. • With concurrent or distributed systems, that rough sketch needs to be promoted to maybe a blueprint, or a functional definition of its behaviour *This part of the talk is borrowed from Lamport’s Talk
  12. How to think • ‘What’ do you want it to

    do. • With concurrent or distributed systems, that rough sketch needs to be promoted to maybe a blueprint, or a functional definition of its behaviour ◦ Design the system in a way that it can run correctly for every state that it can be in *This part of the talk is borrowed from Lamport’s Talk
  13. How to think • How do we ensure that our

    system is designed in a way that it doesn’t crash or reach incorrect states? *This part of the talk is borrowed from Lamport’s Talk
  14. Where can we see it crop up • Multiple systems,

    that are running independently, and have a shared global state
  15. Where can we see it crop up • Multiple systems,

    that are running independently, and have a shared global state • Non-deterministic: Two executions of the same program with the same input can produce different results
  16. Where can we see it crop up • Multiple systems,

    that are running independently, and have a shared global state • Non-deterministic: Two executions of the same program with the same input can produce different results • Example: Writers and Readers from a single queue, results can differ with just changing the order of who writes and who reads from the shared queue
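The writers-and-readers example above can be sketched deterministically in Go by replaying the same three operations in different orders. This is only an illustration: the operation names and the `run` helper are hypothetical and do not come from the talk.

```go
package main

import "fmt"

// op is one step taken by a writer or a reader on the shared queue.
type op string

const (
	writeA op = "writer A enqueues 1"
	writeB op = "writer B enqueues 2"
	read   op = "reader dequeues"
)

// run applies the operations in the given order to an empty FIFO queue
// and returns what the reader observed (-1 means the queue was empty).
func run(order []op) int {
	var queue []int
	seen := -1
	for _, o := range order {
		switch o {
		case writeA:
			queue = append(queue, 1)
		case writeB:
			queue = append(queue, 2)
		case read:
			if len(queue) > 0 {
				seen = queue[0]
				queue = queue[1:]
			}
		}
	}
	return seen
}

func main() {
	// Same operations, three different schedules.
	schedules := [][]op{
		{writeA, writeB, read},
		{writeB, writeA, read},
		{read, writeA, writeB},
	}
	for _, s := range schedules {
		fmt.Println(s, "=> reader saw", run(s))
	}
}
```

With real goroutines the scheduler picks the order, so the same program can yield any of these results from one run to the next.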
  17. Ye Olde Banking System ◦ Even with a simple monolith

    architecture, with just a frontend, a backend and a database, there are two points of concurrency.
  18. Ye Olde Banking System ◦ Even with a simple monolith

    architecture, with just a frontend, a backend and a database, there are two points of concurrency. ◦ In a system where Person A can transfer money to Person B ▪ Bank needs to check if Person A has sufficient funds ▪ Add amount to Person B’s bank account ▪ Deduct amount from Person A’s bank account
  19. Ye Olde Banking System ◦ Just in this simple system,

    one step may not finish before the other starts -> Races, Crashes/Partial Failures
  20. Ye Olde Banking System ◦ Just in this simple system,

    one step may not finish before the other starts -> Races, Crashes/Partial Failures ◦ Can writing Unit Tests solve this issue? For this particular example, if number of simultaneous transfers is N,
  21. Ye Olde Banking System ◦ Just in this simple system,

    one step may not finish before the other starts -> Races, Crashes/Partial Failures ◦ Can writing Unit Tests solve this issue? For this particular example, if the number of simultaneous transfers is N, the number of step orderings the tests would have to cover is (3N)!/(3!)^N - Huge for such a simple system! (1680 for 3 Transactions)
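To get a feel for how quickly (3N)!/(3!)^N grows, the formula can be evaluated with Go's `math/big`. The function name `interleavings` is made up for this sketch; it assumes every transfer consists of the same number of ordered steps (three in the banking example).

```go
package main

import (
	"fmt"
	"math/big"
)

// interleavings returns (steps*n)! / (steps!)^n: the number of ways to
// interleave n transfers that each consist of `steps` ordered steps.
func interleavings(n, steps int64) *big.Int {
	total := new(big.Int).MulRange(1, n*steps)     // (steps*n)!
	perTransfer := new(big.Int).MulRange(1, steps) // steps!
	denom := new(big.Int).Exp(perTransfer, big.NewInt(n), nil)
	return total.Quo(total, denom)
}

func main() {
	for n := int64(1); n <= 5; n++ {
		fmt.Printf("N=%d: %s orderings\n", n, interleavings(n, 3).String())
	}
}
```

For N = 3 this gives 9!/6^3 = 362880/216 = 1680 orderings.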
  22. Very simple to very complex, where does building concurrent/distributed systems

    lie? *This part of the talk is borrowed from Lamport’s Talk
  23. Blueprints, and its spectrum We need tools to check this

    *This part of the talk is borrowed from Lamport’s Talk
  24. Modeling Programs ◦ Programs can be modeled in a number

    of ways: Turing Machines, Automata, Programming Languages
  25. Modeling Programs ◦ Programs can be modeled in a number

    of ways: Turing Machines, Automata, Programming Languages ◦ But all of this can be described in terms of a State Machine ▪ This means describing your program as a set of “behaviours” where each behaviour is a ‘sequence of discrete steps’
  26. Modeling Programs ◦ Programs can be modeled in a number

    of ways: Turing Machines, Automata, Programming Languages ◦ But all of this can be described in terms of a State Machine ▪ This means describing your program as a set of “behaviours” where each behaviour is a ‘sequence of discrete steps’ ▪ This requires us to define the initial state of the system, and the next state of the system
  27. Modeling Programs ◦ Programs can be modeled in a number

    of ways: Turing Machines, Automata, Programming Languages ◦ But all of this can be described in terms of a State Machine ▪ This means describing your program as a set of “behaviours” where each behaviour is a ‘sequence of discrete steps’ ▪ This requires us to define the initial state of the system, and the next state of the system ▪ You can have multiple next states for a current state
  28. Modeling Programs ◦ Programs can be modeled in a number

    of ways: Turing Machines, Automata, Programming Languages ◦ But all of this can be described in terms of a State Machine ▪ This means describing your program as a set of “behaviours” where each behaviour is a ‘sequence of discrete steps’ ▪ This requires us to define the initial state of the system, and the next state of the system ▪ You can have multiple next states for a current state (modeling Non-Determinism) ▪ TLA+ gives us the framework to do this
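The state-machine view can be sketched in plain Go, without TLA+. The example below is an assumption made for illustration, not taken from the talk: two processes each increment a shared counter in two steps (read, then write). `Init` gives the initial state, and `Next` returns every possible successor, so a state with two successors models the scheduler's non-deterministic choice.

```go
package main

import "fmt"

// State is one snapshot of the system: a shared counter plus, for each of
// two processes, a program counter and a local copy of the counter.
type State struct {
	Counter int
	PC      [2]int // 0 = about to read, 1 = about to write, 2 = done
	Local   [2]int
}

// Init is the single initial state.
func Init() State { return State{} }

// Next returns every state reachable in one step. More than one successor
// means either process may run next: this models non-determinism.
func Next(s State) []State {
	var out []State
	for p := 0; p < 2; p++ {
		n := s // arrays are copied by value
		switch s.PC[p] {
		case 0: // read the shared counter into a local copy
			n.Local[p] = s.Counter
			n.PC[p] = 1
		case 1: // write back the local copy plus one
			n.Counter = s.Local[p] + 1
			n.PC[p] = 2
		default: // this process has finished
			continue
		}
		out = append(out, n)
	}
	return out
}

// finalCounters walks every behaviour starting at s and records the
// counter value found in each terminal state.
func finalCounters(s State, seen map[int]bool) {
	succ := Next(s)
	if len(succ) == 0 {
		seen[s.Counter] = true
		return
	}
	for _, n := range succ {
		finalCounters(n, seen)
	}
}

func main() {
	seen := map[int]bool{}
	finalCounters(Init(), seen)
	fmt.Println("possible final counter values:", seen)
}
```

Walking all behaviours shows that the counter can end at either 1 or 2, depending on how the steps interleave. A single test run would observe only one of those outcomes.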
  29. Model Checking ◦ TLA+ is a language that lets you

    write specifications formally, “formal” specs are needed if you want to apply tools to them. ▪ Model checkers verify the correctness of your specification by running it against all possible executions of your program.
  30. Model Checking ◦ TLA+ is a language that lets you

    write specifications formally, “formal” specs are needed if you want to apply tools to them. ▪ More Specific: Model checkers verify systems by exhaustively enumerating the states a system can reach and showing that none of those states violates the system requirements.
  31. Model Checking ◦ TLA+ is a language that lets you

    write specifications formally, “formal” specs are needed if you want to apply tools to them. ▪ More Specific: Model checkers verify systems by exhaustively enumerating the states a system can reach and showing that none of those states violates the system requirements. (Specifications)
  32. Model Checking ◦ TLA+ is a language that lets you

    write specifications formally, “formal” specs are needed if you want to apply tools to them. ▪ Model checkers can check two things
  33. Model Checking ◦ TLA+ is a language that lets you

    write specifications formally, “formal” specs are needed if you want to apply tools to them. ▪ Model checkers can check two things • Liveness: Good things happen
  34. Model Checking ◦ TLA+ is a language that lets you

    write specifications formally, “formal” specs are needed if you want to apply tools to them. ▪ Model checkers can check two things • Liveness: Good things happen • Safety: Bad things won’t happen
  35. Model Checking ◦ TLA+ is a language that lets you

    write specifications formally, “formal” specs are needed if you want to apply tools to them. ▪ Model checkers can check two things • Temporal logic for Liveness: Good things eventually happen • Safety: Bad things won’t happen
  36. Model Checking ◦ What can Safety look like: ▪ Two

    threads can’t both be in a critical section at the same time. ▪ Users cannot write to files they don’t have access to. ▪ We never use more than 500 KB of RAM. ▪ The user_id key in the table is unique. ▪ We never add a string to an integer.
  37. Model Checking ◦ What can Safety look like: ▪ Two

    threads can’t both be in a critical section at the same time. ▪ Users cannot write to files they don’t have access to. ▪ We never use more than 500 KB of RAM. ▪ The user_id key in the table is unique. ▪ We never add a string to an integer. ◦ These are Invariants
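As a rough picture of what checking an invariant involves, here is a hand-rolled breadth-first explorer in Go for the first safety property above (mutual exclusion). It is a toy, not how TLC or any TLA+ tool is implemented, and the lock model, `next`, `check` and `mutualExclusion` are all hypothetical names invented for this sketch.

```go
package main

import "fmt"

// State models two threads competing for one lock.
// PC: 0 = idle, 1 = saw the lock free, 2 = in critical section, 3 = done.
type State struct {
	Locked bool
	PC     [2]int
}

// next builds the successor function. With atomic=false, checking the lock
// and taking it are two separate steps; with atomic=true they are one step.
func next(atomic bool) func(State) []State {
	return func(s State) []State {
		var out []State
		for t := 0; t < 2; t++ {
			n := s
			switch {
			case s.PC[t] == 0 && !s.Locked && atomic:
				n.Locked, n.PC[t] = true, 2
			case s.PC[t] == 0 && !s.Locked:
				n.PC[t] = 1
			case s.PC[t] == 1:
				n.Locked, n.PC[t] = true, 2
			case s.PC[t] == 2:
				n.Locked, n.PC[t] = false, 3
			default: // blocked on the lock, or finished
				continue
			}
			out = append(out, n)
		}
		return out
	}
}

// mutualExclusion is the invariant: never both threads in the critical section.
func mutualExclusion(s State) bool {
	return !(s.PC[0] == 2 && s.PC[1] == 2)
}

// check explores every reachable state breadth-first and returns the first
// state that violates inv, if there is one.
func check(init State, next func(State) []State, inv func(State) bool) (State, bool) {
	visited := map[State]bool{init: true}
	queue := []State{init}
	for len(queue) > 0 {
		s := queue[0]
		queue = queue[1:]
		if !inv(s) {
			return s, true
		}
		for _, n := range next(s) {
			if !visited[n] {
				visited[n] = true
				queue = append(queue, n)
			}
		}
	}
	return State{}, false
}

func main() {
	for _, atomic := range []bool{false, true} {
		bad, found := check(State{}, next(atomic), mutualExclusion)
		fmt.Printf("atomic=%v violation=%v state=%+v\n", atomic, found, bad)
	}
}
```

The non-atomic variant reaches a state with both threads in the critical section; the atomic variant does not. Because the explorer visits every reachable state, a clean result covers all interleavings of this small model.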
  38. Model Checking Then what is Liveness? Every message is received

    at least once by each client. ▪ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future.
  39. Model Checking Then what is Liveness? Every message is received

    at least once by each client. ▪ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future. Temporal Logic
  40. Model Checking Then what is Liveness? Every message is received

    at least once by each client. ▪ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future. ▪ The only way to break a liveness property is to show that at no point in the future does it ever become true.
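Liveness is harder to check than an invariant because a counterexample is an infinite behaviour rather than a single bad state. On a finite state graph, one simplified way to look for such a behaviour is to search for a cycle, or a stuck state, that never reaches the goal. The Go sketch below does that for a hypothetical lossy-channel model; it ignores fairness assumptions, which a real TLA+ spec would state explicitly, and all names in it are invented for illustration.

```go
package main

import "fmt"

// State models one message in flight over a lossy channel.
type State struct {
	Drops     int
	Delivered bool
}

// next builds the successor function. maxDrops < 0 means the channel may
// drop the message forever; otherwise it drops at most maxDrops times.
func next(maxDrops int) func(State) []State {
	return func(s State) []State {
		if s.Delivered {
			return nil
		}
		deliver := State{Drops: s.Drops, Delivered: true}
		switch {
		case maxDrops < 0:
			return []State{deliver, s} // a drop leaves the state unchanged
		case s.Drops < maxDrops:
			return []State{deliver, {Drops: s.Drops + 1}}
		default:
			return []State{deliver}
		}
	}
}

func delivered(s State) bool { return s.Delivered }

// eventually reports whether every behaviour starting at init reaches a
// state satisfying goal. A terminal state counts as stuttering forever.
func eventually(init State, next func(State) []State, goal func(State) bool) bool {
	const (
		inProgress = 1
		done       = 2
	)
	color := map[State]int{}
	var ok func(s State) bool
	ok = func(s State) bool {
		if goal(s) {
			return true
		}
		switch color[s] {
		case inProgress:
			return false // found a cycle that avoids the goal
		case done:
			return true
		}
		color[s] = inProgress
		succ := next(s)
		if len(succ) == 0 {
			return false // stuck without ever reaching the goal
		}
		for _, n := range succ {
			if !ok(n) {
				return false
			}
		}
		color[s] = done
		return true
	}
	return ok(init)
}

func main() {
	fmt.Println("unbounded drops:", eventually(State{}, next(-1), delivered))
	fmt.Println("at most 3 drops:", eventually(State{}, next(3), delivered))
}
```

With unbounded drops the check fails, because the behaviour "drop forever" never delivers the message. With a bounded number of drops, every behaviour eventually delivers it.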
  41. Concurrency Bug : • This was an actual error that

    was found in the very popular Gops library
  42. Concurrency Bug : • This was an actual error that

    was found in the very popular Gops library • Hillel Wayne then demonstrates the bug with his TLA+ spec for the implementation and finds the deadlock condition
  43. Design • Imagine a system, working a bit like a

    pipeline consisting of three steps:
  44. Design • Imagine a system, working a bit like a

    pipeline consisting of three steps: • The Input: where one component handles some incoming data, does some initial processing, and then sends an event on a queue.
  45. Design • Imagine a system, working a bit like a

    pipeline consisting of three steps: • The Input: where one component handles some incoming data, does some initial processing, and then sends an event on a queue. • The processor: a component that will receive the event sent at 1, and do the actual heavy lifting in terms of processing, and then send the result on yet another queue
  46. Design • Imagine a system, working a bit like a

    pipeline consisting of three steps: • The Input: where one component handles some incoming data, does some initial processing, and then sends an event on a queue. • The processor: a component that will receive the event sent at 1, and do the actual heavy lifting in terms of processing, and then send the result on yet another queue. • The output: where the output from 2 is further handled downstream.
  47. Why does it deadlock? The problem was the following sequence

    of steps: • A. Step 1 would process input, add it to the shared data, and send an event to Step 2 containing an identifier.
  48. Why does it deadlock? The problem was the following sequence

    of steps: • A. Step 1 would process input, add it to the shared data, and send an event to Step 2 containing an identifier. • B. Step 2 would prune the shared data, and remove the object that was added above at 1.
  49. Why does it deadlock? The problem was the following sequence

    of steps: • A. Step 1 would process input, add it to the shared data, and send an event to Step 2 containing an identifier. • B. Step 2 would prune the shared data, and remove the object that was added above at 1. • C. Step 2 then received the event from step 1, and cannot find the object in the shared data.
  50. Why does it deadlock? The problem was the following sequence

    of steps: • The problem was a race condition between Step 1 adding the object to the shared data and sending an event to Step 2, and Step 2 pruning the object before handling that event
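The A, B, C sequence above can be replayed deterministically in a toy Go model. This is not the code from the bug report: the `system` type, the `prune` helper and the string schedule are assumptions made for illustration, and only the names `sendIncoming` and `receiveIncoming` echo the step names used in the talk.

```go
package main

import "fmt"

// system is a toy model of the pipeline: a shared store that both steps
// write to, and a queue of identifiers from Step 1 to Step 2.
type system struct {
	storage map[int]string
	queue   []int
}

// sendIncoming is Step 1: store the object, then enqueue its identifier.
func (s *system) sendIncoming(id int) {
	s.storage[id] = "payload"
	s.queue = append(s.queue, id)
}

// prune is Step 2's housekeeping: drop everything from the shared store.
func (s *system) prune() {
	for id := range s.storage {
		delete(s.storage, id)
	}
}

// receiveIncoming is Step 2 handling one event. handled is false when the
// queue was empty; found reports whether the object was still in the store.
func (s *system) receiveIncoming() (found, handled bool) {
	if len(s.queue) == 0 {
		return false, false
	}
	id := s.queue[0]
	s.queue = s.queue[1:]
	_, found = s.storage[id]
	return found, true
}

// run replays one schedule and returns false if Step 2 handled an event
// whose object was missing from the shared store.
func run(schedule []string) bool {
	s := &system{storage: map[int]string{}}
	for _, step := range schedule {
		switch step {
		case "send":
			s.sendIncoming(1)
		case "prune":
			s.prune()
		case "receive":
			if found, handled := s.receiveIncoming(); handled && !found {
				return false
			}
		}
	}
	return true
}

func main() {
	for _, schedule := range [][]string{
		{"send", "receive", "prune"},
		{"prune", "send", "receive"},
		{"send", "prune", "receive"}, // steps A, B, C from the slides
	} {
		fmt.Println(schedule, "ok:", run(schedule))
	}
}
```

Only the schedule where the prune lands between the send and the receive fails, which is why the problem is easy to miss in ordinary testing.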
  51. Modeling the Fix: • The fix is to ensure that only

    one concurrent step writes to shared_storage. • The SendIncoming(id) step should only put the identifier on the queue, and only the ReceiveIncoming step should add the object to, and eventually prune from, the shared storage.
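In Go, one common way to get a single writer is to let one goroutine own the storage and have every other goroutine talk to it over a channel. The sketch below assumes that design; `processor` and `runPipeline` are hypothetical names, and the code is not the actual fix applied to the library.

```go
package main

import (
	"fmt"
	"sync"
)

// processor is the only goroutine that touches storage, so adding,
// looking up and pruning can never interleave with another writer.
func processor(incoming <-chan int, results chan<- string) {
	storage := map[int]string{} // owned by this goroutine only
	for id := range incoming {
		storage[id] = fmt.Sprintf("object-%d", id) // add on receive
		obj := storage[id]                         // lookup cannot miss
		delete(storage, id)                        // prune after handling
		results <- obj
	}
	close(results)
}

// runPipeline starts n concurrent senders that enqueue identifiers only,
// and returns everything the processor handled.
func runPipeline(n int) []string {
	incoming := make(chan int)
	results := make(chan string)
	go processor(incoming, results)

	var wg sync.WaitGroup
	for id := 1; id <= n; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			incoming <- id // the sender never writes to storage
		}(id)
	}
	go func() {
		wg.Wait()
		close(incoming)
	}()

	var handled []string
	for r := range results {
		handled = append(handled, r)
	}
	return handled
}

func main() {
	handled := runPipeline(3)
	fmt.Println("events handled:", len(handled), handled)
}
```

The order in which events are handled still varies between runs, but every event finds its object because the add, the lookup and the prune all happen in the same goroutine.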
  52. Who uses this in production? • This is not limited

    to toy systems; it is also used on real ones. Here is what Amazon engineers had to say, and they also wrote a paper about how they used formal methods at AWS
  53. Who uses this in production? • This is not limited

    to toy systems; it is also used on real ones. Here is what Amazon engineers had to say, and they also wrote a paper about how they used formal methods at AWS. • They used TLA+ in 10+ large, complex real-world systems
  54. Who uses this in production? • This is not limited

    to toy systems; it is also used on real ones. Here is what Amazon engineers had to say, and they also wrote a paper about how they used formal methods at AWS. • They used TLA+ in 10+ large, complex real-world systems • In every case, TLA+ added significant value by preventing subtle, serious bugs that could have reached production
  55. Who uses this in production? • This is not limited

    to toy systems; it is also used on real ones. Here is what Amazon engineers had to say, and they also wrote a paper about how they used formal methods at AWS. • They used TLA+ in 10+ large, complex real-world systems • In every case, TLA+ added significant value by preventing subtle, serious bugs that could have reached production • And also gave them enough understanding and confidence to make aggressive optimisations without sacrificing correctness of their systems
  56. What do programmers need to know about thinking about your

    code in this way? • Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong
  57. What do programmers need to know about thinking about your

    code in this way? • Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong • There is a need to think before coding, or more clearly, the need to Write before you code
  58. What do programmers need to know about thinking about your

    code in this way? • Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong • There is a need to think before coding, or more clearly, the need to Write before you code • Any piece of code that someone is likely to use or modify, needs to be specified in some way
  59. What do programmers need to know about thinking about your

    code in this way? • Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong • There is a need to think before coding, or more clearly, the need to Write before you code • Any piece of code that someone is likely to use or modify, needs to be specified in some way • That someone can be you next month
  60. What do programmers need to know about thinking about your

    code in this way? • There is importance in specifying everything your code does and if required, how it does it
  61. What do programmers need to know about thinking about your

    code in this way? • There is importance in specifying everything your code does and if required, how it does it • You should always specify your code “above” the code level, in terms of states and behaviours or input/output behaviours.
  62. What do programmers need to know about thinking about your

    code in this way? • There is importance in specifying everything your code does and if required, how it does it • You should always specify your code “above” the code level, in terms of states and behaviours or input/output behaviours. • Thinking mathematically provides a rigorous way of doing this, in precise terms and being as unambiguous as possible.
  63. Why write formal specs? • Disclaimer, finding bugs in code

    is not the intended purpose of writing a formal spec. Writing a formal spec is hard work: it requires thinking in a way that isn’t intuitive to most programmers, as we are generally used to implementing the ‘How’
  64. Why write formal specs? • Even if the above talk

    doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway
  65. Why write formal specs? • Even if the above talk

    doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway • We learn to write programs by writing them, running them and then correcting the errors
  66. Why write formal specs? • Even if the above talk

    doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway • We learn to write programs by writing them, running them and then correcting the errors • We can learn to write formal specs by writing them, running them against a model checker and then correcting the errors
  67. Why write formal specs? • Even if the above talk

    doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write informal specs that you need to write anyway • We learn to write programs by writing them, running them and then correcting the errors • We can learn to write formal specs by writing them, running them against a model checker and then correcting the errors • A great way to document your systems with precise language
  68. References • Leslie Lamport’s lectures on TLA+ • Hillel Wayne’s

    lecture on “Tackling Concurrency Bugs with TLA+” • Hillel Wayne’s TLA+ spec and fix for the Gops bug • Leslie Lamport’s video on Thinking Above the Code • Medium article for TLA+ spec for modeling concurrency bug and fix • Gops issue link • Amazon’s paper on Using Formal Verification Techniques in Production • Image References ◦ Speaker deck from Leslie Lamport’s lectures for Blueprint Spectrum ◦ Simplified version of Gops FindAll function screenshots