[RubyConf 2023] The Secret Ingredient: How to Understand and Resolve Just about Any Flaky Test

It’s after 4 o’clock. The release was due an hour
ago. You’ve got less than an hour to leave, or you’re going to be late for that thing… You can feel the clock ticking…

Slack clicks to life… You check your messages… < HEAVY
SIGH >

The build just failed. A-gain. You look at the build.
You look at the clock. < SHAKING HEAD > You don’t have time for flakiness…

So, you re-run the build. A-gain. Two builds. Five different
failing specs. None of them have anything to do with your commit. All you can think about is how you can’t be late to another thing…

If only you knew the secret ingredient that all flaky
tests have in common… You might be on your way to that thing right now…

Hello! My name is Alan Ridlehoover. I’m an Engineering Manager
at Cisco Meraki — the largest Rails shop you’ve never heard of. And, though I’m not a baker, I do know a thing or two about flakiness. In fact… sometimes…

It’s all I can think about! Seriously! Since I started
automating tests over 20 years ago, I’ve written my fair share of flaky specs. Daylight Saving Time is my personal nemesis. I can’t tell you how many times I’ve tripped on that. Let’s just say I’m well into the “shame on me” part of that relationship. Or, I was… But, I’m getting ahead of myself. Let’s start with a definition. What is a flaky test?

A flaky spec is one that changes state without modification
to either the test itself or the code being tested.

So, if you write a spec A…

And a method #foo…

that makes the spec pass… Then you can expect that
as long as…

The spec…

and the method remain unchanged…

The spec should continue to pass.

It’s when the spec…

And the method stay the same…

but the result changes…

That’s when you know you have a flaky spec. But,
how does this happen?

Well, it happens because of the secret ingredient that all
flaky tests have in common… But, what is it? What is that secret ingredient?

It’s an assumption. All flaky tests make invalid assumptions about
their environment. They assume their environment will be in a particular state when they begin. But that assumption is rendered incorrect by some change in the environment between or during test runs.

Ok. But, what causes that change to the environment? Well,
there are three recipes: * Non-determinism * Order dependence, and * Race conditions Let’s take a look at each of these, along with some examples in code… Starting with…

…non-determinism. So, what is non-determinism? For that matter, what is
determinism?

Well, a deterministic algorithm is one that, given the same
inputs, always produces the same output. For example...

If I take these parameters,

and pass them to a method called add,

it should always return 2,

no matter how many

times you call it.

But, what if there were a method #foo that

always returned true

until it didn’t. That’s the definition of non-determinism: an algorithm
that, given the same inputs, does not always produce the same output. But, how could this be?

Well, it might sound obvious, but utilizing non-deterministic features of
the environment leads to non-deterministic code, including… * Random number generators - clearly - these are intended to be, well, random * The system clock - we don’t always think of this, but it’s always changing * Network connections - that might be up one minute and down the next * Floating point precision - it’s not guaranteed These are just a few examples, I’m sure this list is not exhaustive. But, what if our code relies on these things? How can we possibly write deterministic tests?

The trick is to remove the non-determinism from the test
by stubbing it, or to account for it by using advanced matchers so that the spec produces consistent results from one run to the next. To do that… * You can stub the random number generator to return a specific number * You can mock (or “freeze”) time * You can stub network responses * And, for floats, you can leverage some of RSpec’s more advanced matchers, like: be_witihin, and be_between. And, please! < ANIMATE > Don’t forget to document the undocumented use case with a spec!

Ok. So, while that build is running, let’s see if
we can fix some of those flaky specs that are making you late for that thing… First, a bit of context… The code we’re about to look at is entirely made up. Well, I guess, technically, all code is made up. But, what I mean is that this code was made up fresh, just for this talk. By me. With TDD. Not ChatGPT. It’s not production code. But, it was inspired by real code that I’ve personally worked on in production applications. It’s a bit of a hodge podge. It’s a class called RubyConf that provides some utility methods that might be useful for running the conference.

Here’s a bit of code to determine whether or not
the conference is currently in progress.

Notice that the method uses the system clock to determine
the current date and time. Simple enough. Let’s look at the specs…

Oh… There’s only one spec. Hmmm… O…K… Well, let’s take
a look at it.

It says that the #in_progress? Method returns false before the
conference begins. Ok. That makes sense. But, it does seem like the author forgot at least two other cases: during the conference and after the conference. But you know what? This is a common problem I see with date based specs. The author of the spec is living in the now. They aren’t thinking about the future. In fact, this is exactly what happens to me with Daylight Saving Time. I forget about it and never write a spec that proves the software still works after the clock changes. I bet this spec ran fine before the conference. But, it’s failing now that the conference is actually in progress. Let’s play with the system clock to see if I’m right…

Ok. As predicted, it passes if I set the clock
to October 21, 2023 - well before the conference. So, we know this is a flaky test because it was passing and now it’s failing despite there being no modifications to the code or tests.

And, if I set the clock to the first day
of the conference? It fails. This spec is flaky. It depends on the system clock being in a particular state. Ok, how do we fix this? Remember, whenever we’re facing non-deterministic flakiness, we want to mock the non-determinism to make it deterministic. In this case, that’s the system clock…

So, here’s the code and the spec as they were…

And, here it is with the fixed spec. Notice that
the only difference here is that the new spec is mocking (or freezing) time. The code didn’t change at all. It’s fine. Only the spec changed. It sets time to a specific date so that the spec will never run outside that context. It does this for the duration of the block, then it returns the system to it’s normal state. Ok. Let’s see if that fixed it…

Great! So, now the specs pass on November 13th, the
first day of the conference. And, I even added specs for the missing use cases by freezing time and setting the dates appropriately. Like this…

These specs freeze time the same way the other spec
freezes time, just with different dates and expected results.

Ok. Next let’s look at fixing a test that fails
when the network goes down…

It’s not uncommon for code to need to call external
services across a network. In this session_description method,

we’re calling an API to fetch the conference schedule.

Then we’re parsing it, finding the session that matches the
title parameter, and returning it’s description.

And, here’s the spec.

With WiFi enabled, this spec passes.

Note that the call to the network adds over a
second to the runtime. And, that’s when it succeeds. Most get requests default to a 60 second timeout when waiting for the other service to respond. Plus…

With WiFi turned off, the spec fails.

Fortunately, it fails quickly because the network failure is on
my end. Now, these are particularly nasty tests to debug because the loss in connectivity is neither logged, nor persistent. So, by the time you’re debugging the failure, it may not be possible to reproduce. Pay attention to HTTP calls, or any other type of call that crosses a network (e.g. GRPC). And, try running your specs with WiFi turned off to see if you can catch any failures.

Alright. Let’s fix this spec… Here it is a bit
smaller than before. Same code. Different font size, because the fix is a bit large…

Notice that again, the code itself is not changing. The
problem is with the spec. Most of the changes to the spec are setting up data to stub a response from the API. Here’s where we’re actually creating the stub. This allows the spec to validate that we’re parsing the results correctly. That’s the code we care about. We don’t actually care whether the external service is up and running when we’re running our test suite. It shouldn’t matter.

Now, sometime I get a bit of pushback when I
offer this advice. Folks ask, “What if the API changes? How will we know if we’re mocking the response?” My answer to that is that each spec should have one and only one reason to exist. This spec’s reason to exist is to verify the code we wrote works the way we expect. It is a unit test. We want these to have as few dependencies as possible so they’ll run fast. You may ALSO require a spec to validate that an API’s schema has not changed. That’s a different reason for a spec to exist. So, that’s not this spec. In fact, that’s not even a unit test. It’s an integration test. And, it’s one that’s designed to fail in order to catch changes to the API. So, we probably don’t want to run it with our unit tests, which are designed to pass. Maybe the integration tests should be run separately, on a schedule, rather than intermingled with our unit tests on every build. Alright. Let’s run the specs.

Alright! Running the specs with the WiFi turned off proves
that the stubbed response prevented the spec from flaking.

It’s also important to point out the difference in time
between the live version and the stubbed version. The live version took 1.3 seconds to execute. This version took less than 1/100th of a second. Those decisions really add up as your test suite grows. They can become a real problem when you hit one hundred thousand specs like we did a few months ago.

Ok. It’s now 4:15. Those specs took us about 10
minutes to resolve. That wasn’t so bad. But, you can still feel the clock ticking. Are you going to be able to make it to the thing on time?

Next, let’s take a look at order dependence, starting with
a definition…

Order dependent specs are specs that pass in isolation, but
fail when run with other specs in a specific order. So, for example,

If Test A and B both pass when run in
alphabetical order.

But, Test A fails when run after Test B.

That makes Test A flaky, and Test B leaky. But,
what does that mean? Leaky?

Well, remember, these specs are making an invalid assumption about
their environment. Their environment includes all of the shared state they have access to. It works like this…

Let’s pretend this blue square is the starting point for
the shared environment. Spec A runs first, so it gets the blue square environment and passes.

Spec A does not modify the environment, so spec B
runs in the same context as spec A. It also passes.

But imagine, if spec B runs first… It gets the
starting environment… The blue square.

And, it changes the environment to a pink hexagon

causing spec A to fail.

So, what’s happening is that state from spec B is
leaking into the environment, causing spec A to flake. For this reason, this class of specs are often referred to as “leaky.”

So, isn’t the leaky spec the real problem here? Not
really. Both specs are to blame. Only one is breaking your build. Fix the broken spec first. Often you’ll find that fixing the broken spec will point to a broader solution. But, how do you do fix order dependent flakiness? Well, first, let’s take a look at what causes these kinds of failures…

Order dependent failures are caused by mutable state that is
shared across specs. This could be in the form of: * Broadly scoped variables, like global or class variables * Databases * Key/value stores * Caches * Or, even the DOM, if you’re writing JavaScript tests. Alright, that’s what causes order dependency, but…

How do you reproduce these things?

First, eliminate non-determinism by running the failing spec repeatedly, in
isolation. If it fails, that’s non-determinism.

If not, then run ALL the specs that ran together
with the failing spec. One of them is leaking state into the failing spec.

If running the specs in the default order doesn’t reproduce
the failure, randomize the order in which the specs are run using the rspec order random option. Keep running it until you find a seed that consistently causes the failure.

Next, locate the leaky spec or specs. I say specs,
plural, because sometimes it takes several specs running in a specific order to produce the failure. You can use rspec bisect to find the leaky specs for you. I’ll show you how in a minute…

But, first, how can we fix order dependent failures? You
can remove the shared state, make it immutable, or isolate it… * Don’t use broadly scoped variables * Mock the shared data store (which you can do easily with a layer of abstraction, like the repository pattern). * Use database transactions, or * Reset the shared state between specs

Alright. Let’s see if we can fix another one of
those flaky specs that are keeping you from that thing…

Here’s a simple getter and setter to store a favorite
session.

Notice that both getter and setter leverage an object called
Cache. And, they are calling a class method named `instance`. What is that? Let’s take a look.

The Cache class is a simple, in-memory, key/value store backed
by a hash.

The `instance` method effectively turns this class into a singleton
so that every reference to the Cache.instance method is getting the same instance.

Here’s the specs for the `favorite_session` getter and setter… These
specs pass when run in this order: getter first, then setter. But, they’ll fail in the opposite order because

we’re storing the title of the session in the cache.

So, the getter will return the value in the cache,
not “Matz’s Keynote”. To prove that, let’s run rspec with the order random option to see if we can get it to fail…

So, here you can see… I ran the specs with
the order random option. And,

RSpec chose the random seed 12322. And, The getter ran
before the setter, so it passed. Let’s try that again…

The getter ran before the setter, so it passed. Let’s
try that again…

Ok. I’m still running rspec with the order random option.

This time rspec chose the seed 63603. And…

Low and behold! The setter spec ran first…

Causing the getter spec to fail. So, how do you
go about fixing this?

Well, we know that one of the specs that ran
before the getter spec must have polluted the environment. In this case, we’re pretty sure the setter spec is the culprit because of the memoization. But, what if you didn’t know which of the specs was to blame for modifying the environment? That’s where rspec bisect comes in. Let’s take a look…

Here, I’m running rspec bisect with the same order clause
and seed that produced the failure. This is important, because bisect won’t work unless it can reproduce the failure.

The first thing bisect does is to find the failing
spec…

Next, it analyzes whether or not the failure appears to
be order dependent. In this case, it does.

So, it performs a binary search, looking for the spec
or specs that need to be run first in order for the failure to happen. Note that this can take a very long time if the set of candidate leaky specs is large.

Then, finally, it reports the minimal command required to reproduce
the problem. Run that command and you’ll see exactly which order the specs ran in to cause the issue.

So, here we go… I’ve run the command that RSpec
bisect gave us.

And, sure enough, the setter spec is the culprit. So,
how do we fix it?

Here we are back at the beginning. The font is
smaller because the solution here is bigger. One way we could approach this would be to call Cache dot clear in between specs. But, because our specs are currently sharing state, that would likely lead to a race condition on the build server where we’re probably running the specs in parallel. So, the solution I prefer is actually dependency injection. That’s a simple technique where I just pass the Cache object into the RubyConf object when it is created. So each spec can have it’s own cache. Here’s what that looks like in the code…

First, here’s the new initializer. Notice that the cache parameter
defaults to Cache.instance. So, if we don’t pass anything, the code will just use the singleton, which is what we want. By doing this we’ve now created a seam in the software that allows the specs to use their own cache objects. This prevents state from leaking between the specs, without modifying the behavior of the production code. To finish up, we need to modify the specs…

Like this…

Here, we’re creating a new instance of the Cache class
and passing it to the RubyConf object when we create it. That’s it. That’s all there is to dependency injection. And, because each spec has it’s own cache, they no longer share state. Let’s run the specs again…

I’m running the specs again — with the same randomized
seed that caused them to fail in the first place. Now, they pass, even though

The setter ran before the getter Voila!

Ok! We’re making good progress, and it’s only 4:30! You
might actually make that thing, after all! Just a couple more broken specs.

Finally, let’s look at race conditions.

What is a race condition? A race condition is what
can happen when parallel processes compete over a scarce, shared resource. Let’s look at how that happens with a file…

Let’s start by looking at two specs running in sequence.
In this example, Spec 1…

first writes to the file,

then reads from it, checks the result,

and passes. Once Spec 1 finishes, Spec 2 runs…

first it writes to the same file,

twice,

then reads from it, checks the result,

and passes. But…

When you run these same specs in parallel…

Spec 1 writes to the file

Then, spec 2 writes to the file

Then, spec 1 reads from the file, checks the result,

And, fails — because there were two rows, not one

Then spec 2 writes to the file again

Then spec 2 reads from the file, checks the result,

And fails — because there were three rows, not two

So, both specs in this case, are susceptible to parallel
flakiness due to a race condition. But, since this is asynchronous code, it’s entirely possible that the specs could pass.

This is why race conditions are notoriously hard to reproduce.
So, how can you go about debugging them if you can’t reproduce them? Well, you want to take a methodical approach.

The first thing to do is to eliminate non-determinism. Run
the failing spec repeatedly in isolation. If it fails, that’s non-determinism.

If not, try to eliminate order dependence. Run the failing
spec and all the specs that ran with it repeatedly in different orders. If you can repro the failure, that’s order dependence.

If that doesn’t work, then run the specs repeatedly in
parallel with the Parallel RSpec gem. I specifically mention that gem because it seems better suited for running the specs locally than Parallel Test or other options like Knapsack, which seem targeted at Rails apps running on CI. It’s best to debug this locally if at all possible.

If you still can’t reproduce it, you can try randomizing
the order in which the specs run in parallel.

Once you’ve reproduced it, or even if you can’t, what
should you look for to fix? The main cause of race conditions on build servers is asynchronous code competing over scarce, shared resources. Those resources might include: * File, or * Socket IO * Thread pools * Connection pools, or even * Low memory Once you have a suspect, how do you fix it?

Well… * For IO-based issues, you can substitute StringIO for
other kinds of IO in your spec. I’ll share an example of this in a moment. * You can test that the correct messages are being sent between collaborating objects, rather than testing the return value of a method. * You can write thread-safe code. * You can test threaded code synchronously - by extracting the guts of the thread into a plain old Ruby object (or PORO) and testing that, or * You can switch to fibers instead of threads Fibers are cool, because you can test them synchronously which is awesome. They significantly reduce the chances of a race condition, because they’re in control of when they relinquish the CPU back to the OS. So, atomic operations can complete without interruption. * Finally, you can always add more resources to your test environment. Though, that’s a bit of an arms race. You’ll most likely end up come back and increasing it again, and again, and again.

Ok. Let’s take a look at the last two flaky
specs that are keeping you from that thing…

So, this feature of the app manages a list of
reservations. There are two methods, one to reserve a seat. The other to get a list of the attendees. As you can see, we’re just…

writing the names to a file and

reading from that file. Let’s take a look at the
specs…

Ok. So…

The first spec ensures that writing “Mickey Mouse” to the
file grows the number of attendees by 1.

The second spec ensures that when writing multiple lines, Donald
and Goofy, the attendee count goes up accordingly. These are fine specs. Let’s run them…

Hey! They pass! When run in sequence. In fact, they
will even pass in the opposite order. But…

If I break out the `parallel_rspec` gem and run them
in parallel, they both fail.

In this case, the second spec actually finished first. It
failed because the attendee count to grew by 3 not 2.

And, the first spec, which finished second thanks to parallelism,
failed because the count grew by 2 not 1. We’ve already seen how that can happen, but let’s walk through it again…

Before we get into how the specs failed, let’s look
at how this RSpec code works. It’s a little bit complicated.

Here’s the expectation in the second spec. It defines two
blocks of code.

The first block is passed to the expect method. And,

The second is passed to the change method. The way
RSpec handles this is to execute the change block (to get the initial value),

then the expect block,

and then the change block again (to get the final
value). Finally, RSpec subtracts the initial value from the final value to get the delta, which needs to match the “by” clause, which in this case is 2.

Ok. Here we are back at the beginning… This time,
let’s track the order of operations on this timeline.

First the second spec reads the file to grab the
attendee count, which should be 0.

Next, the first write happens in the second spec…

Next, the first spec reads the file. This time it
gets 1.

Next, the other writes occur. No telling what order. You’d
need to look at the file. But, they both happen…

And, finally, both reads happen again. Here, we know that
the second spec finished first, because it’s output appeared first. But, it doesn’t really matter.

Ok. So, now that we know how it failed, let’s
go back to the beginning and show you the solution. Turns out, the Ruby core team thought of this. They knew that testing asynchronous File IO would be a challenge. So, they included a class called StringIO to simulate other kinds of IO in specs. StringIO is a string, but with the interface of a file.

So, what we want to do is to… Allow File
to receive open and yield the a StringIO object.

So, now, when the code calls File.open, the actual object
that it will receive is a StringIO object. One caveat: Because this string behaves like a file, it has a cursor.

So, after writing to the string, we need to rewind
it before we can read it. That’s not necessary prior to introducing the StringIO because the file was being closed when the object fell out of scope. Here, the StringIO object isn’t closed because it hasn’t fallen out of scope, because it was declared in the spec. So, we need to rewind, before we can read… Alright! The proof is in the pudding. Did we fix the race condition?

So, here we are, 40 minutes in, and we’ve found
and resolved ALL of the flaky specs that were keeping you from that thing! Time to wrap things up so we can get to that thing in the lunchroom!

‘Cause, I don’t know about you, but this talk always
makes me hungry…

Ok. Here’s a cheat sheet for the entire talk…

Non-deterministic flakiness reproduces in isolation. Look for interactions with non-deterministic
elements of the environment. To fix this kind of flakiness, mock the non-determinism to make it deterministic. Don’t forget about Timecop when working with date or time related specs. There are also tools like webMock and VCR for handling specs that require network connections. I didn’t show them here because I prefer to just use RSpec like I did in this presentation. But, lots of folks find those tools very useful.

Order dependent flakiness only reproduces with other specs, run in
a certain order. It will not reproduce in isolation. Look for state that is shared across tests. To fix order dependency, remove shared state, make it immutable, or isolate it. RSpec order random can help you reproduce the failures. And, RSpec bisect can help you locate the leaky specs.

And, race conditions only reproduce with other specs when run
in parallel, not in isolation. Look for asynchronous code, or exhaustible shared resources. To fix race conditions, isolate things from one another (like we did with StringIO), or use fibers instead of threads. Seriously, they’re amazing! Finally, you can use Parallel RSpec to repro the failures locally instead of on your build server.

And keep in mind that the secret ingredient in every
flaky spec is an invalid assumption about the environment in which it is running. Sometimes, just remembering that fact will help you identify and resolve the flakiness. Ask yourself, how can I ensure that the environment for this test is what it expects?

Oh, and one more thing… I have a bit of
a hot take… Debugging this stuff is hard enough. But, it gets one hundred times harder if your specs are too DRY. So, avoid the use of these features of RSpec. They seem harmless — useful even — when you’re writing the specs. But, ultimately they make debugging way too hard. So, try to avoid… * Shared specs * Shared contexts * Nested contexts, and * Let statements Your specs should be super communicative. They are, after all, the executable documentation for your code. If you have to scroll all over the place or open a ton of files to write the specs, you can be guaranteed that you’ll be doing the same when you’re trying to understand and debug them when they fail.

Don’t get me wrong. I love RSpec. But, it’s best
to leave your tests WET. And, I’m not alone in this belief. The fine folks at thoughtbot have written about it. And, in fact, I honestly think that DRY might be the worst programming advice ever. I told you it was a hot take. If you disagree, come find me so I can change your mind.

Again, my name is Alan Ridlehoover. I do know a
thing or two about flakiness. But, it took me 20+ years to get here. Hopefully, this talk has short circuited that for you…

As I mentioned at the beginning of the talk, I
work for Cisco Meraki. So, I also know a thing or two about connectivity! Here’s how to connect with me. And, that last item is where you can find the source code for this talk. There’s even a bonus flaky spec regarding a raffle winner that I didn’t have time to cover today. Cisco Meraki is probably the largest Rails shop you’ve never heard of. And, we’re growing. We are currently hiring for a limited number of roles. Come chat with us to find out what it’s like to work at Meraki.

Finally, a little shameless self promotion… My friend, Fito von
Zastrow, and I love Ruby so much, we occasionally release something into the wild in the hopes that folks will find it useful. You can find links to our stuff at first try dot software, including Rubyist, the opinionated VS Code color theme I used in this talk. We’d love for you to check it out.

Thank you so much for coming! If you have questions,
come chat with me offstage. Or find me at lunch.

[RubyConf 2023] The Secret Ingredient: How to U...

[RubyConf 2023] The Secret Ingredient: How to Understand and Resolve Just about Any Flaky Test

Video

More Decks by Alan Ridlehoover

Other Decks in Programming

Featured

Transcript