Formal Reasoning to Build Subtle Systems in Go

Formal Reasoning to Build Subtle Systems in Go Raghav Roy

whoami

What I will be covering
• Basics of Formal Reasoning
• What thinking “above” your code means
• Examples of using Formal Methods in Go Concurrency problems
• Formal Methods used in production

What I will not be covering
• In-depth language-specific details
• How to use the tooling around TLA+

Software is everywhere

(a very scientific graph)

Why is reliability hard?

Projects get very complex very quickly

Often, something as “artistic” as software is difficult to reason about formally

Legacy software, No documentation, Entropy

How to think about software
• ‘What’ do you want it to do, before the ‘How’
• Informally or formally writing down the expected behaviour

What it comes down to • How do we ensure
that our system is designed in a way that it doesn’t crash or reach incorrect states?

“Writing is nature’s way of telling you how sloppy your
thinking is” - Dick Guindon

Concurrency

Where can we see it crop up
• Multiple systems that run independently and share a global state
• Non-deterministic: two executions of the same program with the same input can produce different results
• Example: writers and readers on a single shared queue; the results can differ just by changing the order in which they write to and read from the queue

Let’s look at a simple example

Ye Olde Banking System Even with a simple monolith architecture,
with just a frontend, a backend and a database, there are two points of concurrency.

Ye Olde Banking System
In a system where Person A can transfer money to Person B:
▪ Bank needs to check if Person A has sufficient funds
▪ Add amount to Person B’s bank account
▪ Deduct amount from Person A’s bank account

Ye Olde Banking System
• Just in this simple system, one step may not finish before the other starts -> Races, Crashes/Partial Failures

Ye Olde Banking System
• Can writing Unit Tests solve this issue? For this example, each transfer has 3 steps, so if the number of simultaneous transfers is N, the number of unit tests to write (one per ordering of the steps) is (3N)!/(3!)^N. Huge for such a simple system! (1680 for 3 transfers)
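The count above can be checked with a few lines of Go. This is a minimal sketch, assuming every transfer has the same number of ordered steps; the function name `interleavings` is made up for illustration.

```go
package main

import (
	"fmt"
	"math/big"
)

// interleavings returns (steps*n)! / (steps!)^n: the number of ways to
// interleave n independent transfers that each run `steps` ordered steps.
func interleavings(n, steps int64) *big.Int {
	total := new(big.Int).MulRange(1, n*steps)     // (steps*n)!
	perTransfer := new(big.Int).MulRange(1, steps) // steps!
	denom := new(big.Int).Exp(perTransfer, big.NewInt(n), nil)
	return total.Div(total, denom)
}

func main() {
	// Print how quickly the number of orderings grows with N.
	for n := int64(1); n <= 5; n++ {
		fmt.Printf("N=%d: %s orderings\n", n, interleavings(n, 3))
	}
}
```

For N = 3 this gives 9!/(3!)^3 = 362880/216 = 1680, and the count grows factorially from there.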

Alternative? Formal Specifications

Blueprints, and its spectrum
Very simple to very complex, where does building concurrent/distributed systems lie?
We need tools to check this
*This part of the talk is from Lamport’s Talk

Modeling Programs
◦ Programs can be modeled in a number of ways: Turing Machines, Automata, Programming Languages
◦ But all of this can be described in terms of a State Machine
▪ This means describing your program as a set of ‘behaviours’ where each behaviour is a ‘sequence of discrete steps’

Modeling Programs
• This requires us to define the initial state of the system, and the next state of the system
• You can have multiple next states for a current state (to model non-determinism)
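To make the initial-state / next-state idea concrete, here is a small Go sketch. The counter machine, and the names `State`, `Init`, `Next` and `Behaviours`, are hypothetical and chosen only for illustration; they are not taken from the talk.

```go
package main

import "fmt"

// State is the whole system state; here a single hypothetical counter.
type State int

// Init returns the initial states of the machine.
func Init() []State { return []State{0} }

// Next returns every state reachable from s in one step.
// Returning two successors models non-determinism: either step may happen.
func Next(s State) []State {
	if s >= 3 {
		return nil // no further steps are possible
	}
	return []State{s + 1, s + 2}
}

// Behaviours lists every sequence of states that starts in an initial
// state and ends in a state with no successors.
func Behaviours() [][]State {
	var out [][]State
	var walk func(path []State)
	walk = func(path []State) {
		succ := Next(path[len(path)-1])
		if len(succ) == 0 {
			out = append(out, append([]State(nil), path...)) // record a copy
			return
		}
		for _, n := range succ {
			walk(append(path, n))
		}
	}
	for _, s := range Init() {
		walk([]State{s})
	}
	return out
}

func main() {
	for _, b := range Behaviours() {
		fmt.Println(b)
	}
}
```

Each printed slice is one behaviour. The machine has one initial state but several behaviours, because `Next` offers a choice at every step.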

Modeling Programs
(Diagram: the alternative next states of a state, joined by “OR”, then written with the logical-or symbol ∨.)

Modeling Programs • TLA+ gives us the framework to do
this

What is TLA+

Temporal Logic of Actions

Euclid’s algorithm to find the greatest common divisor

What it looks like in Go
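The code shown on this slide is not part of the transcript. As a stand-in, here is a minimal sketch of Euclid’s algorithm in Go; the function name `gcd` and the choice of the remainder-based variant are assumptions.

```go
package main

import "fmt"

// gcd returns the greatest common divisor of a and b using Euclid's
// algorithm. It assumes a and b are non-negative and not both zero.
func gcd(a, b int) int {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

func main() {
	fmt.Println(gcd(12, 18))
}
```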

Let’s look at the TLA+ definition
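The TLA+ spec on this slide is not part of the transcript either. As a rough Go rendering of the same idea, the sketch below describes Euclid’s algorithm as a state machine with an initial state, a next-state relation and an invariant. It assumes the subtraction-based form of the algorithm; all names are hypothetical.

```go
package main

import "fmt"

// state holds the two variables of the hypothetical spec.
type state struct{ x, y int }

// initState plays the role of the initial-state predicate: x = m and y = n.
func initState(m, n int) state { return state{m, n} }

// next plays the role of the next-state relation: subtract the smaller
// variable from the larger. ok is false when x == y and no step is enabled.
func next(s state) (state, bool) {
	switch {
	case s.x < s.y:
		return state{s.x, s.y - s.x}, true
	case s.y < s.x:
		return state{s.x - s.y, s.y}, true
	}
	return s, false
}

// gcdRef is an independent reference, used only to state the invariant.
func gcdRef(a, b int) int {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

// run follows the behaviour from the initial state and checks in every
// state that gcd(x, y) still equals gcd(m, n). It assumes m, n > 0.
func run(m, n int) (int, error) {
	want := gcdRef(m, n)
	s := initState(m, n)
	for {
		if gcdRef(s.x, s.y) != want {
			return 0, fmt.Errorf("invariant violated in state %+v", s)
		}
		t, ok := next(s)
		if !ok {
			return s.x, nil // x == y: both hold the answer
		}
		s = t
	}
}

func main() {
	fmt.Println(run(12, 18))
}
```

The invariant is the “what”: it says why the algorithm is correct without referring to loops or assignments. The `next` function is the “how”.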

Model Checking
TLA+ is a language that lets you write specifications formally; “formal” specs are needed if you want to apply tools to them.

Model Checking
Model checkers verify the correctness of your specification by checking it against every possible execution of the system it describes.

Model Checking
More specifically: model checkers verify systems by exhaustively enumerating the possible states a system can take on, and showing that none of the states violate the system requirements (Specifications).
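The enumeration described above can be sketched in Go as a breadth-first search over reachable states. This is a toy illustration, not how TLC is implemented. It models two concurrent transfers from the banking example (check funds, credit B, debit A); the state layout, the amount, and the invariant “A’s balance never goes negative” are assumptions made for the example.

```go
package main

import "fmt"

const amount = 10 // hypothetical transfer amount

type state struct {
	a, b int    // balances of Person A and Person B
	pc   [2]int // per-transfer step: 0 check, 1 credit B, 2 debit A, 3 done
}

// next returns every state reachable in one step: any unfinished
// transfer may take its next step.
func next(s state) []state {
	var out []state
	for i := range s.pc {
		t := s // copy, including the pc array
		switch s.pc[i] {
		case 0: // check funds; give up if they are insufficient
			if s.a >= amount {
				t.pc[i] = 1
			} else {
				t.pc[i] = 3
			}
		case 1:
			t.b += amount
			t.pc[i] = 2
		case 2:
			t.a -= amount
			t.pc[i] = 3
		default:
			continue // this transfer is done
		}
		out = append(out, t)
	}
	return out
}

// nonNegative is the safety invariant being checked.
func nonNegative(s state) bool { return s.a >= 0 }

// check explores all reachable states breadth-first and returns the
// first state that violates the invariant, if there is one.
func check(start state, invariant func(state) bool) (state, bool) {
	seen := map[state]bool{start: true}
	queue := []state{start}
	for len(queue) > 0 {
		s := queue[0]
		queue = queue[1:]
		if !invariant(s) {
			return s, false
		}
		for _, n := range next(s) {
			if !seen[n] {
				seen[n] = true
				queue = append(queue, n)
			}
		}
	}
	return state{}, true
}

func main() {
	bad, ok := check(state{a: 10}, nonNegative)
	fmt.Println("invariant holds:", ok, "counterexample:", bad)
}
```

Starting with a balance of 10, both transfers can pass the funds check before either one debits, so the search reaches a state where A is overdrawn. A unit test would have to pick exactly that ordering to see the problem.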

Model Checking
Model checkers can check two things
• Liveness: Good things eventually happen (Temporal logic)
• Safety: Bad things won’t happen

Model Checking
What can Safety look like?
▪ Two threads can’t both be in a critical section at the same time.
▪ Users cannot write to files they don’t have access to.
▪ We never use more than 500 KB of RAM.
▪ The user_id key in the table is unique.
▪ We never add a string to an integer.
These are Invariants

Model Checking
Then what is Liveness?
Every message is received at least once by each client.
▪ No finite sequence of steps can break this property. Even if the message isn’t delivered now, it might be delivered in the future. An infinite sequence of steps is required to break this.
▪ The only way to break a liveness property is to show that at no point in the future does it ever become true.

Let’s look at how this can help debug Go concurrency
primitives

Concurrency Bug : This was an actual error that was
found in the very popular Gops library

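(Code screenshot with annotations: “unbuffered” and “buffered” mark the two channels; “First this” and “Then this” mark the order of the steps; “Only read from after for loop” and “Blocked” mark where the goroutines get stuck.)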

Concurrency Bug :
• This was an actual error that was found in the very popular Gops library
• Hillel Wayne then demonstrates the bug with his TLA+ spec for the implementation and finds the deadlock condition
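The pattern the annotations describe can be reproduced in a few lines of Go. This is a simplified, hypothetical reconstruction, not the actual Gops source: the names `collect`, `limit` and `finishes` are made up, and the buffered-results variant is only one possible fix, not necessarily the one applied upstream.

```go
package main

import (
	"fmt"
	"time"
)

const limit = 2                           // hypothetical concurrency limit
const demoTimeout = 200 * time.Millisecond // how long we wait before calling it stuck

// collect starts one worker per item. Each worker takes a slot in the
// buffered limitCh and sends its result on found, which is only read
// after the loop. With foundCap == 0 (unbuffered) and more items than
// limit, workers block on found and the loop blocks on limitCh.
func collect(items []int, foundCap int) []int {
	found := make(chan int, foundCap)
	limitCh := make(chan struct{}, limit)
	for _, it := range items {
		limitCh <- struct{}{} // blocks once `limit` workers are stuck
		go func(v int) {
			defer func() { <-limitCh }()
			found <- v // blocks until someone reads, if found is unbuffered
		}(it)
	}
	var out []int
	for range items {
		out = append(out, <-found) // only read from after the for loop
	}
	return out
}

// finishes reports whether f returns within demoTimeout.
func finishes(f func()) bool {
	done := make(chan struct{})
	go func() { f(); close(done) }()
	select {
	case <-done:
		return true
	case <-time.After(demoTimeout):
		return false
	}
}

func main() {
	items := []int{1, 2, 3, 4, 5}
	fmt.Println("unbuffered finishes:", finishes(func() { collect(items, 0) }))
	fmt.Println("buffered finishes:  ", finishes(func() { collect(items, len(items)) }))
}
```

With five items and a limit of two, the unbuffered variant is expected to time out, while the variant with room for every result completes. The bug only shows up when there are more items than the limit, which is why it can survive casual testing.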

Let’s design our own concurrent system

Design
• Imagine a system, working a bit like a pipeline consisting of three steps:
Step 1 - The Input: One component handles some incoming data, does some initial processing, and sends an event on a queue.
Step 2 - The Processor: A component that receives the event sent at Step 1, does the processing, and sends the result on yet another queue.
Step 3 - The Output: The output from Step 2 is further handled downstream.
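The three steps map naturally onto goroutines connected by channels. The sketch below only illustrates the shape of such a pipeline; the function names and the string processing are placeholders, and the shared data that causes trouble later in the talk is deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// input is Step 1: it does some initial processing on the incoming data
// and sends an event on a queue.
func input(raw []string) <-chan string {
	events := make(chan string)
	go func() {
		defer close(events)
		for _, r := range raw {
			events <- strings.TrimSpace(r)
		}
	}()
	return events
}

// processor is Step 2: it receives each event, processes it, and sends
// the result on a second queue.
func processor(events <-chan string) <-chan string {
	results := make(chan string)
	go func() {
		defer close(results)
		for e := range events {
			results <- strings.ToUpper(e)
		}
	}()
	return results
}

// output is Step 3: it handles the results downstream (here it just
// collects them).
func output(results <-chan string) []string {
	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(output(processor(input([]string{" a ", "b "}))))
}
```

Each stage closes its outgoing channel when it is done, so the next stage’s `range` loop ends cleanly.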

Design

What it looks like in TLA+

IGNORE if this makes you queasy (like me)

Running Model Checker

Why does it deadlock?
The problem was the following sequence of steps:
• Step 1 would
◦ Process input
◦ Add it to the shared data
◦ Send an event to Step 2 containing an identifier
• Step 2 would
◦ Prune the shared data
◦ Remove the object that was added above at Step 1.
• Step 2 then received the event from Step 1
No object in shared data!
Race between
• Step 1 adding the object to the shared data and sending an event to Step 2
• Step 2 pruning the object before handling that event

Modeling the Fix : The Fix!
• The SendIncoming(id) step should only put the identifier on the queue
• The ReceiveIncoming step should add the object to, and eventually prune from, the shared storage.
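In Go terms, the fix amounts to giving the storage a single owner. The sketch below is a loose illustration of that idea, not a translation of the TLA+ spec: `sendIncoming` and `receiveIncoming` borrow the step names from the slide, while the storage layout and object contents are assumptions.

```go
package main

import "fmt"

// sendIncoming is Step 1 after the fix: it only puts identifiers on the
// queue and never touches the storage.
func sendIncoming(ids []int) <-chan int {
	queue := make(chan int)
	go func() {
		defer close(queue)
		for _, id := range ids {
			queue <- id
		}
	}()
	return queue
}

// receiveIncoming is Step 2 after the fix: it alone adds objects to the
// storage and prunes them, so an object cannot vanish between being
// added and being handled.
func receiveIncoming(queue <-chan int) (handled []string, missing int) {
	storage := map[int]string{} // owned by this step only
	for id := range queue {
		storage[id] = fmt.Sprintf("object-%d", id) // add on receive
		obj, ok := storage[id]
		if !ok {
			missing++ // the "No object in shared data!" case; unreachable here
			continue
		}
		handled = append(handled, obj)
		delete(storage, id) // prune only after handling
	}
	return handled, missing
}

func main() {
	handled, missing := receiveIncoming(sendIncoming([]int{1, 2, 3}))
	fmt.Println(handled, missing)
}
```

Because adding, handling and pruning all happen in one step, there is no interleaving in which the prune runs between the add and the lookup.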

Modeling the Fix :

New TLA+ spec : Moved

Running Model Checker

So, was any of that relevant to actual systems in
production?

Let’s drive this point home

Who uses this in production?
• This is not limited to just modeling toy systems, but real systems. Here is what Amazon engineers had to say (and they also wrote a paper):

Who uses this in production?
• They used TLA+ in 10+ large, complex real-world systems
• In every case, TLA+ added significant value by preventing subtle, serious bugs that could have reached production
• Gave them enough understanding and confidence to make aggressive optimisations without sacrificing correctness of their systems

Automated Reasoning Group - AWS

Thanks for making it this far! Let’s conclude

What do programmers need to know about thinking about your code in this way?
• Everyone thinks they are thinking, but if you don’t write down your thoughts you can really go wrong
• There is a need to think before coding, or more clearly, the need to write before you code
• Any piece of code that someone is likely to use or modify needs to be specified in some way
• That someone can be you next month
* More Lamport goodness

What do programmers need to know about thinking about your code in this way?
• It is important to specify everything your code does and, if required, how it does it
• You should always specify your code “above” the code level, in terms of states and behaviours or input/output behaviours.
• Thinking mathematically provides a rigorous way of doing this, in precise terms and being as unambiguous as possible.
* More Lamport goodness

Why write formal specs?
• Disclaimer: finding bugs in code is not the intended purpose of writing a formal spec. Writing a formal spec is hard work; it requires thinking in a way that isn’t intuitive to most programmers, as we are generally used to implementing the ‘How’

Why write formal specs?
• Even if the above talk doesn’t apply to you, and you never write complex critical systems, learning to write formal specs helps you write the informal specs that you need to write anyway
• Trace out the possible states your program can be in, and reason about them
• A great way to document your systems with precise language
* More Lamport goodness

References
• Leslie Lamport’s lectures on TLA+
• Hillel Wayne’s lecture on “Tackling Concurrency Bugs with TLA+”
• Hillel Wayne’s TLA+ spec and fix for the Gops bug
• Leslie Lamport’s video on Thinking Above the Code
• Gregory Terzian's article for TLA+ spec for modeling concurrency bug and fix
• Gops issue link
• Amazon’s paper on Using Formal Verification Techniques in Production
• Image References
◦ Speaker deck from Leslie Lamport’s lectures for Blueprint Spectrum
◦ Simplified version of Gops FindAll function screenshots
Gopher Credits: Renée French, Tenntenn, Maria Letta, Women Who Go
Speaker Deck: speakerdeck.com/royra

Thank you!
