Hardening Go Concurrency, Using Formal Methods of Verifying Correctness

Hardening Go Concurrency, Using Formal Methods of Verification
Raghav Roy – VMware

whoami

What I will be covering
• Basics of Formal Verification
• What thinking “above” your code means
• Examples of using Formal Verification in Go Concurrency problems
• Formal verification used in production

What I will not be covering
• In-depth language-specific details
• How to use tooling around TLA+ (not covered in detail)

Why do we think?
*This part of the talk is borrowed from Lamport’s Talk

Well, it helps us do things, like building a house

When should we think?

Ideally before you start construction

For programs, you should ideally think about your code before you start writing any of it

“Writing is nature’s way of telling you how sloppy your thinking really is” - Guindon

How to think
• ‘What’ do you want it to do.
• With concurrent or distributed systems, that rough sketch needs to be promoted to a blueprint, or a functional definition of its behaviour
  ◦ Design the system so that it runs correctly in every state it can be in

How to think
• How do we ensure that our system is designed so that it doesn’t crash or reach incorrect states?

Concurrency

Where can we see it crop up
• Multiple systems that run independently and share global state
• Non-deterministic: two executions of the same program with the same input can produce different results
• Example: writers and readers on a single queue. Results can differ just by changing the order of who writes to and who reads from the shared queue
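The queue example can be replayed deterministically to see why ordering matters. This is a minimal sketch, not code from the talk: the `run` function and the `"w1"`, `"w2"`, `"r"` operation names are made up for illustration. Each schedule stands for one order the scheduler might have picked.

```go
package main

import "fmt"

// run replays one schedule against a single shared FIFO queue and returns
// what the reader observed. "w1" and "w2" are two writers enqueueing the
// values 1 and 2; "r" is the reader dequeueing the head if there is one.
// All names here are hypothetical.
func run(schedule []string) []int {
	var queue, reads []int
	for _, op := range schedule {
		switch op {
		case "w1":
			queue = append(queue, 1)
		case "w2":
			queue = append(queue, 2)
		case "r":
			if len(queue) > 0 {
				reads = append(reads, queue[0])
				queue = queue[1:]
			}
		}
	}
	return reads
}

func main() {
	// Same operations, three different orders, three different results.
	fmt.Println(run([]string{"w1", "w2", "r"})) // [1]
	fmt.Println(run([]string{"w2", "w1", "r"})) // [2]
	fmt.Println(run([]string{"r", "w1", "w2"})) // []
}
```

In a real program the runtime picks the schedule, so any of these outcomes can appear from one run to the next.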

Let’s look at a simple example

Ye Olde Banking System
◦ Even with a simple monolith architecture, with just a frontend, a backend and a database, there are two points of concurrency.
◦ In a system where Person A can transfer money to Person B
  ▪ Bank needs to check if Person A has sufficient funds
  ▪ Add amount to Person B’s bank account
  ▪ Deduct amount from Person A’s bank account

Ye Olde Banking System
◦ Even in this simple system, one step may not finish before the other starts -> Races, Crashes/Partial Failures
◦ Can writing unit tests solve this issue? For this particular example, if the number of simultaneous transfers is N, the number of orderings to test is (3N)!/(3!)^N. Huge for such a simple system! (1680 for 3 transfers)
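The count can be checked with a few lines of Go. This is a small sketch of the arithmetic only; the `interleavings` helper is a name chosen for illustration. It evaluates (3N)!/(3!)^N, the number of ways to interleave N transfers of three ordered steps each.

```go
package main

import (
	"fmt"
	"math/big"
)

// interleavings returns (steps*n)! / (steps!)^n: the number of distinct
// orderings of n transfers when each transfer has `steps` ordered steps.
// The function name is illustrative, not from the talk.
func interleavings(n, steps int64) *big.Int {
	total := new(big.Int).MulRange(1, steps*n)     // (steps*n)!
	perTransfer := new(big.Int).MulRange(1, steps) // steps!
	denom := new(big.Int).Exp(perTransfer, big.NewInt(n), nil)
	return total.Div(total, denom)
}

func main() {
	for n := int64(1); n <= 5; n++ {
		fmt.Printf("N=%d: %s orderings\n", n, interleavings(n, 3))
	}
}
```

For N=3 this gives 9!/6^3 = 362880/216 = 1680, and the value grows factorially from there.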

Alternative? Formal Specifications

Blueprints, and their spectrum
Very simple to very complex: where does building concurrent/distributed systems lie?
We need tools to check this

Modeling Programs
◦ Programs can be modeled in a number of ways: Turing machines, automata, programming languages
◦ But all of these can be described in terms of a state machine
  ▪ This means describing your program as a set of “behaviours”, where each behaviour is a ‘sequence of discrete steps’
  ▪ This requires us to define the initial state of the system, and the next state of the system
  ▪ You can have multiple next states for a current state (modeling non-determinism)
  ▪ TLA+ gives us the framework to do this
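To make the state-machine view concrete, here is a rough Go sketch of the same idea applied to the banking example: an initial state, a `next` function that returns every possible successor, and a breadth-first search that checks a property in every reachable state. It is a toy stand-in for what a model checker does, not how TLA+ tooling is implemented; the balances, the amount and all identifiers are assumptions made for illustration.

```go
package main

import "fmt"

const amount = 8 // each transfer moves 8 from A to B (illustrative value)

// state is one snapshot of the modelled system.
type state struct {
	a, b int    // account balances
	pc   [2]int // per-transfer step: 0=check, 1=credit B, 2=debit A, 3=done
}

// next returns every state reachable from s in one step. More than one
// result means the scheduler had a choice: that is the non-determinism.
func next(s state) []state {
	var out []state
	for i := range s.pc {
		t := s // arrays are values, so this copies pc as well
		switch s.pc[i] {
		case 0: // check funds
			if s.a >= amount {
				t.pc[i] = 1
			} else {
				t.pc[i] = 3 // transfer rejected
			}
		case 1: // credit B
			t.b += amount
			t.pc[i] = 2
		case 2: // debit A
			t.a -= amount
			t.pc[i] = 3
		default:
			continue // this transfer has finished
		}
		out = append(out, t)
	}
	return out
}

// invariant is the property to check: A never goes below zero.
func invariant(s state) bool { return s.a >= 0 }

// check explores all reachable states breadth-first. It returns a
// violating state and false, or the zero state and true.
func check(init state) (state, bool) {
	seen := map[state]bool{init: true}
	queue := []state{init}
	for len(queue) > 0 {
		s := queue[0]
		queue = queue[1:]
		if !invariant(s) {
			return s, false
		}
		for _, n := range next(s) {
			if !seen[n] {
				seen[n] = true
				queue = append(queue, n)
			}
		}
	}
	return state{}, true
}

func main() {
	bad, ok := check(state{a: 10})
	fmt.Println("invariant holds:", ok, "counterexample:", bad)
}
```

With a starting balance of 10, both transfers can pass the funds check before either debit happens, so the search finds a state where A is overdrawn. Starting from 16 the invariant holds in every reachable state.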

What is TLA+

Temporal Logic of Actions

Euclid’s algorithm to find the greatest common divisor

How it looks in Go
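The code from this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of Euclid’s algorithm in Go, using the repeated-subtraction form, which is an assumption about the version shown. It expects positive integers.

```go
package main

import "fmt"

// gcd returns the greatest common divisor of two positive integers using
// repeated subtraction. It does not terminate for zero or negative input.
func gcd(x, y int) int {
	for x != y {
		if x > y {
			x -= y
		} else {
			y -= x
		}
	}
	return x
}

func main() {
	fmt.Println(gcd(12, 18)) // 6
}
```

Each loop iteration is one discrete step from the state (x, y) to a next state, which is the shape a state-machine description of the algorithm takes.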

Let’s look at the TLA+ definition

Model Checking
◦ TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them.
  ▪ Model checkers verify the correctness of your specification by checking it against all possible executions.
  ▪ More specifically: model checkers verify systems by enumerating the possible states a system can take on, and showing that none of the states violate the system requirements (specifications).
  ▪ Model checkers can check two things
    • Liveness: good things eventually happen (expressed in temporal logic)
    • Safety: bad things won’t happen

Model Checking
◦ What can Safety look like:
  ▪ Two threads can’t both be in a critical section at the same time.
  ▪ Users cannot write to files they don’t have access to.
  ▪ We never use more than 500 KB of RAM.
  ▪ The user_id key in the table is unique.
  ▪ We never add a string to an integer.
◦ These are Invariants

Model Checking
Then what is Liveness? Every message is received at least once by each client.
  ▪ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future. (Temporal Logic)
  ▪ The only way to break a liveness property is to show that at no point in the future does it ever become true.

Let’s look at how this can help debug Go concurrency primitives

Concurrency Bug:
• This was an actual error that was found in the very popular Gops library
• [Code screenshots of a simplified Gops FindAll function, annotated: “unbuffered”, “buffered”, “First this”, “Then this”, “Only read from after for loop”, “Blocked”]
• Hillel Wayne then demonstrates the bug with his TLA+ spec for the implementation and finds the deadlock condition

Let’s design our own concurrent system in TLA+

Design
• Imagine a system, working a bit like a pipeline consisting of three steps:
• The input (step 1): one component handles some incoming data, does some initial processing, and then sends an event on a queue.
• The processor (step 2): a component that receives the event sent in step 1, does the actual heavy lifting in terms of processing, and then sends the result on yet another queue.
• The output (step 3): the output from step 2 is further handled downstream.

Design

How it looks in TLA+

Running Model Checker

Why does it deadlock? The problem was the following sequence of steps:
• A. Step 1 would process input, add it to the shared data, and send an event to Step 2 containing an identifier.
• B. Step 2 would prune the shared data, and remove the object that was added in A.
• C. Step 2 would then receive the event from Step 1, and could not find the object in the shared data.
• The problem was a race condition between Step 1 adding the object to the shared data and sending an event to Step 2, and Step 2 pruning the object before handling that event
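The bad interleaving can be replayed step by step in plain Go. This is a hand-written sketch of the sequence A, B, C above, not a translation of the TLA+ spec; the `system` type and its method names are hypothetical, loosely echoing the step names used in the talk.

```go
package main

import "fmt"

// system is a hypothetical in-memory model of the pipeline's shared parts.
type system struct {
	storage map[int]string // shared data, written by both steps in this design
	queue   []int          // events from step 1 to step 2
}

// sendIncoming is step 1: store the object, then enqueue its identifier.
func (s *system) sendIncoming(id int, obj string) {
	s.storage[id] = obj
	s.queue = append(s.queue, id)
}

// prune is step 2's housekeeping: drop everything currently in storage.
func (s *system) prune() {
	for id := range s.storage {
		delete(s.storage, id)
	}
}

// receiveIncoming is step 2 handling one queued event. ok is false when
// the object it refers to is no longer in storage. It assumes the queue
// is not empty.
func (s *system) receiveIncoming() (obj string, ok bool) {
	id := s.queue[0]
	s.queue = s.queue[1:]
	obj, ok = s.storage[id]
	return obj, ok
}

func main() {
	s := &system{storage: map[int]string{}}
	s.sendIncoming(1, "payload")     // A: add object, send event
	s.prune()                        // B: prune removes the object
	_, ok := s.receiveIncoming()     // C: event handled, object missing
	fmt.Println("object found:", ok) // object found: false
}
```

A model checker finds this ordering automatically by trying every interleaving; here it is written out by hand.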

Modeling the Fix:
• The fix consists of avoiding more than one concurrent step writing to shared_storage.
• The SendIncoming(id) step should only put the identifier on the queue, and only the ReceiveIncoming step should add the object to, and eventually prune from, the shared storage.
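In Go terms the fixed design might look like the sketch below. It is an illustration under assumptions, not the spec from the talk: the `pipeline` type, the `replay` helper and the string step names are invented, and the object is modelled as nothing more than its identifier. Step 1 only enqueues; step 2 is the sole writer of storage.

```go
package main

import "fmt"

// pipeline is a hypothetical model of the fixed design: the queue carries
// identifiers only, and step 2 is the only writer of storage.
type pipeline struct {
	storage map[int]bool
	queue   []int
}

// sendIncoming (step 1) only enqueues the id; it never touches storage.
func (p *pipeline) sendIncoming(id int) { p.queue = append(p.queue, id) }

// prune (step 2) empties storage.
func (p *pipeline) prune() {
	for id := range p.storage {
		delete(p.storage, id)
	}
}

// receiveIncoming (step 2) adds the object for the next queued id and
// reports whether it is present afterwards. It assumes a non-empty queue.
func (p *pipeline) receiveIncoming() bool {
	id := p.queue[0]
	p.queue = p.queue[1:]
	p.storage[id] = true
	return p.storage[id]
}

// replay runs one schedule of "send", "prune" and "receive" steps and
// reports whether every received event found its object. A "receive"
// must come after a matching "send".
func replay(schedule []string) bool {
	p := &pipeline{storage: map[int]bool{}}
	ok := true
	for _, step := range schedule {
		switch step {
		case "send":
			p.sendIncoming(1)
		case "prune":
			p.prune()
		case "receive":
			ok = ok && p.receiveIncoming()
		}
	}
	return ok
}

func main() {
	fmt.Println(replay([]string{"send", "prune", "receive"})) // true
	fmt.Println(replay([]string{"prune", "send", "receive"})) // true
}
```

Because pruning and receiving now belong to the same step, a prune can no longer slip in between an object being stored and its event being handled.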

Modeling the Fix :

New TLA+ spec :

Running Model Checker

So, was any of that relevant to actual systems in production?

Let’s drive this point home

Who uses this in production?
• This is not limited to modeling toy systems; it applies to real systems. Here is what Amazon engineers had to say. They also wrote a paper about how they used formal methods at AWS.
• They used TLA+ in 10+ large, complex real-world systems
• In every case, TLA+ added significant value by preventing subtle, serious bugs that could have reached production
• It also gave them enough understanding and confidence to make aggressive optimisations without sacrificing correctness of their systems

Thanks for making it this far! Let’s conclude

What do programmers need to know about thinking about your code in this way?
• Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong
• There is a need to think before coding, or more precisely, a need to write before you code
• Any piece of code that someone is likely to use or modify needs to be specified in some way
• That someone can be you next month
• It is important to specify everything your code does and, if required, how it does it
• You should always specify your code “above” the code level, in terms of states and behaviours or input/output behaviours.
• Thinking mathematically provides a rigorous way of doing this, in precise terms and as unambiguously as possible.

Why write formal specs?
• Disclaimer: finding bugs in code is not the intended purpose of writing a formal spec. Writing a formal spec is hard work; it requires thinking in a way that isn’t intuitive to most programmers, as we are generally used to implementing the ‘How’
• Even if the above talk doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write the informal specs that you need to write anyway
• We learn to write programs by writing them, running them and then correcting the errors
• We can learn to write formal specs by writing them, running them against a model checker and then correcting the errors
• A great way to document your systems with precise language

References
• Leslie Lamport’s lectures on TLA+
• Hillel Wayne’s lecture on “Tackling Concurrency Bugs with TLA+”
• Hillel Wayne’s TLA+ spec and fix for the Gops bug
• Leslie Lamport’s video on Thinking Above the Code
• Medium article for TLA+ spec for modeling concurrency bug and fix
• Gops issue link
• Amazon’s paper on Using Formal Verification Techniques in Production
• Image References
  ◦ Speaker deck from Leslie Lamport’s lectures for Blueprint Spectrum
  ◦ Simplified version of Gops FindAll function screenshots

Thank you!
