The Secret Ingredient: How to Understand and Resolve Just about Any Flaky Test

Flaky tests are an inscrutable bane. Hard to understand. Annoying. And, so frustrating! My personal nemesis is Daylight Saving Time. I can’t tell you how many times I’ve tripped over it. Let’s just say I was well into the “shame on me” part of that relationship, until I discovered the secret ingredient that nearly all flaky tests have in common. Turns out they only seem inscrutable. It really is possible to understand and resolve just about any flaky test.

Alan Ridlehoover

November 15, 2023

Transcript

  1. It’s after 4 o’clock. The release was due an hour

    ago. You’ve got less than an hour to leave, or you’re going to be late for that thing… You can feel the clock ticking…
  2. The build just failed. A-gain. You look at the build.

    You look at the clock. < SHAKING HEAD > You don’t have time for flakiness…
  3. So, you re-run the build. A-gain. Two builds. Five different

    failing specs. None of them have anything to do with your commit. All you can think about is how you can’t be late to another thing…
  4. If only you knew the secret ingredient that all flaky

    tests have in common… You might be on your way to that thing right now…
  5. Hello! My name is Alan Ridlehoover. I’m an Engineering Manager

    at Cisco Meraki — the largest Rails shop you’ve never heard of. And, though I’m not a baker, I do know a thing or two about flakiness. In fact… sometimes…
  6. It’s all I can think about! Seriously! Since I started

    automating tests over 20 years ago, I’ve written my fair share of flaky specs. Daylight Saving Time is my personal nemesis. I can’t tell you how many times I’ve tripped on that. Let’s just say I’m well into the “shame on me” part of that relationship. Or, I was… But, I’m getting ahead of myself. Let’s start with a definition. What is a flaky test?
  7. A flaky spec is one that changes state without modification

    to either the test itself or the code being tested.
  8. Well, it happens because of the secret ingredient that all

    flaky tests have in common… But, what is it? What is that secret ingredient?
  9. It’s an assumption. All flaky tests make invalid assumptions about

    their environment. They assume their environment will be in a particular state when they begin. But that assumption is rendered incorrect by some change in the environment between or during test runs.
  10. Ok. But, what causes that change to the environment? Well,

    there are three recipes:
      * Non-determinism
      * Order dependence, and
      * Race conditions
    Let’s take a look at each of these, along with some examples in code… Starting with…
  11. Well, a deterministic algorithm is one that, given the same

    inputs, always produces the same output. For example...
  12. until it didn’t. That’s the definition of non-determinism: an algorithm

    that, given the same inputs, does not always produce the same output. But, how could this be?
  13. Well, it might sound obvious, but utilizing non-deterministic features of

    the environment leads to non-deterministic code, including…
      * Random number generators - clearly - these are intended to be, well, random
      * The system clock - we don’t always think of this, but it’s always changing
      * Network connections - that might be up one minute and down the next
      * Floating point precision - it’s not guaranteed
    These are just a few examples, I’m sure this list is not exhaustive. But, what if our code relies on these things? How can we possibly write deterministic tests?
  14. The trick is to remove the non-determinism from the test

    by stubbing it, or to account for it by using advanced matchers so that the spec produces consistent results from one run to the next. To do that…
      * You can stub the random number generator to return a specific number
      * You can mock (or “freeze”) time
      * You can stub network responses
      * And, for floats, you can leverage some of RSpec’s more advanced matchers, like be_within and be_between.
    (A rough sketch of these techniques appears below.) And, please! < ANIMATE > Don’t forget to document the undocumented use case with a spec!
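
    For instance, a spec file along these lines could apply those ideas (illustrative only; the Raffle and Badge classes and their methods are invented for this sketch, not taken from the slides):

      RSpec.describe "taming non-determinism" do
        it "stubs the random number generator" do
          raffle = Raffle.new
          allow(raffle).to receive(:rand).and_return(4) # always the same "random" number
          expect(raffle.winning_ticket).to eq(4)        # assumes winning_ticket delegates to rand
        end

        it "freezes the clock" do
          frozen = Time.new(2023, 11, 13, 9, 0, 0)
          allow(Time).to receive(:now).and_return(frozen) # the system clock no longer moves
          expect(Badge.new.printed_at).to eq(frozen)      # assumes printed_at reads Time.now
        end

        it "tolerates floating point imprecision" do
          expect(0.1 + 0.2).to be_within(0.0001).of(0.3) # instead of eq(0.3)
        end
      end
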
  15. Ok. So, while that build is running, let’s see if

    we can fix some of those flaky specs that are making you late for that thing… First, a bit of context… The code we’re about to look at is entirely made up. Well, I guess, technically, all code is made up. But, what I mean is that this code was made up fresh, just for this talk. By me. With TDD. Not ChatGPT. It’s not production code. But, it was inspired by real code that I’ve personally worked on in production applications. It’s a bit of a hodge podge. It’s a class called RubyConf that provides some utility methods that might be useful for running the conference.
  16. Here’s a bit of code to determine whether or not

    the conference is currently in progress.
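
    The slide code itself isn’t captured in this transcript, but the idea is roughly this (the dates and constant names are illustrative):

      class RubyConf
        STARTS_AT = Time.new(2023, 11, 13)             # first day of the conference
        ENDS_AT   = Time.new(2023, 11, 15, 23, 59, 59) # last day of the conference

        def in_progress?
          now = Time.now # reads the system clock (a non-deterministic input)
          now >= STARTS_AT && now <= ENDS_AT
        end
      end
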
  17. Notice that the method uses the system clock to determine

    the current date and time. Simple enough. Let’s look at the specs…
  18. It says that the #in_progress? method returns false before the

    conference begins. Ok. That makes sense. But, it does seem like the author forgot at least two other cases: during the conference and after the conference. But you know what? This is a common problem I see with date-based specs. The author of the spec is living in the now. They aren’t thinking about the future. In fact, this is exactly what happens to me with Daylight Saving Time. I forget about it and never write a spec that proves the software still works after the clock changes. I bet this spec ran fine before the conference. But, it’s failing now that the conference is actually in progress. Let’s play with the system clock to see if I’m right…
  19. Ok. As predicted, it passes if I set the clock

    to October 21, 2023 - well before the conference. So, we know this is a flaky test because it was passing and now it’s failing despite there being no modifications to the code or tests.
  20. And, if I set the clock to the first day

    of the conference? It fails. This spec is flaky. It depends on the system clock being in a particular state. Ok, how do we fix this? Remember, whenever we’re facing non-deterministic flakiness, we want to mock the non-determinism to make it deterministic. In this case, that’s the system clock…
  21. And, here it is with the fixed spec. Notice that

    the only difference here is that the new spec is mocking (or freezing) time. The code didn’t change at all. It’s fine. Only the spec changed. It sets time to a specific date so that the spec will never run outside that context. It does this for the duration of the block, then it returns the system to its normal state. Ok. Let’s see if that fixed it…
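
    One plausible shape for that fix, using a plain RSpec stub to pin the clock (a sketch based on the RubyConf approximation above, not the exact slide code):

      RSpec.describe RubyConf do
        it "returns false before the conference begins" do
          allow(Time).to receive(:now).and_return(Time.new(2023, 10, 21)) # clock pinned for this example only
          expect(RubyConf.new.in_progress?).to eq(false)
        end
      end
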
  22. Great! So, now the specs pass on November 13th, the

    first day of the conference. And, I even added specs for the missing use cases by freezing time and setting the dates appropriately. Like this…
  23. These specs freeze time the same way the other spec

    freezes time, just with different dates and expected results.
  24. It’s not uncommon for code to need to call external

    services across a network. In this session_description method,
  25. Then we’re parsing it, finding the session that matches the

    title parameter, and returning its description.
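
    Again, the slide code isn’t in the transcript; the shape is roughly this (the URL and JSON structure are invented for illustration):

      require "net/http"
      require "json"

      class RubyConf
        SCHEDULE_URI = URI("https://example.com/rubyconf/2023/schedule.json") # illustrative URL

        def session_description(title)
          body     = Net::HTTP.get(SCHEDULE_URI) # real network call: slow, and it can fail
          sessions = JSON.parse(body)["sessions"]
          session  = sessions.find { |s| s["title"] == title }
          session && session["description"]
        end
      end
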
  26. Note that the call to the network adds over a

    second to the runtime. And, that’s when it succeeds. Most GET requests default to a 60-second timeout when waiting for the other service to respond. Plus…
  27. Fortunately, it fails quickly because the network failure is on

    my end. Now, these are particularly nasty tests to debug because the loss in connectivity is neither logged nor persistent. So, by the time you’re debugging the failure, it may not be possible to reproduce. Pay attention to HTTP calls, or any other type of call that crosses a network (e.g., gRPC). And, try running your specs with WiFi turned off to see if you can catch any failures.
  28. Alright. Let’s fix this spec… Here it is a bit

    smaller than before. Same code. Different font size, because the fix is a bit large…
  29. Notice that again, the code itself is not changing. The

    problem is with the spec. Most of the changes to the spec are setting up data to stub a response from the API. Here’s where we’re actually creating the stub. This allows the spec to validate that we’re parsing the results correctly. That’s the code we care about. We don’t actually care whether the external service is up and running when we’re running our test suite. It shouldn’t matter.
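
    A sketch of that kind of stub, matching the hypothetical session_description method above:

      require "json"

      RSpec.describe RubyConf do
        it "returns the description for the requested session" do
          payload = {
            "sessions" => [
              { "title" => "Opening Keynote", "description" => "Welcome to RubyConf!" }
            ]
          }.to_json

          allow(Net::HTTP).to receive(:get).and_return(payload) # no real network involved

          expect(RubyConf.new.session_description("Opening Keynote"))
            .to eq("Welcome to RubyConf!")
        end
      end
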
  30. Now, sometimes I get a bit of pushback when I

    offer this advice. Folks ask, “What if the API changes? How will we know, if we’re mocking the response?” My answer to that is that each spec should have one and only one reason to exist. This spec’s reason to exist is to verify the code we wrote works the way we expect. It is a unit test. We want these to have as few dependencies as possible so they’ll run fast. You may ALSO require a spec to validate that an API’s schema has not changed. That’s a different reason for a spec to exist. So, that’s not this spec. In fact, that’s not even a unit test. It’s an integration test. And, it’s one that’s designed to fail in order to catch changes to the API. So, we probably don’t want to run it with our unit tests, which are designed to pass. Maybe the integration tests should be run separately, on a schedule, rather than intermingled with our unit tests on every build. Alright. Let’s run the specs.
  31. Alright! Running the specs with the WiFi turned off proves

    that the stubbed response prevented the spec from flaking.
  32. It’s also important to point out the difference in time

    between the live version and the stubbed version. The live version took 1.3 seconds to execute. This version took less than 1/100th of a second. Those decisions really add up as your test suite grows. They can become a real problem when you hit one hundred thousand specs like we did a few months ago.
  33. Ok. It’s now 4:15. Those specs took us about 10

    minutes to resolve. That wasn’t so bad. But, you can still feel the clock ticking. Are you going to be able to make it to the thing on time?
  34. Order dependent specs are specs that pass in isolation, but

    fail when run with other specs in a specific order. So, for example,
  35. Well, remember, these specs are making an invalid assumption about

    their environment. Their environment includes all of the shared state they have access to. It works like this…
  36. Let’s pretend this blue square is the starting point for

    the shared environment. Spec A runs first, so it gets the blue square environment and passes.
  37. Spec A does not modify the environment, so spec B

    runs in the same context as spec A. It also passes.
  38. But imagine, if spec B runs first… It gets the

    starting environment… The blue square.
  39. So, what’s happening is that state from spec B is

    leaking into the environment, causing spec A to flake. For this reason, this class of specs are often referred to as “leaky.”
  40. So, isn’t the leaky spec the real problem here? Not

    really. Both specs are to blame. Only one is breaking your build. Fix the broken spec first. Often you’ll find that fixing the broken spec will point to a broader solution. But, how do you fix order dependent flakiness? Well, first, let’s take a look at what causes these kinds of failures…
  41. Order dependent failures are caused by mutable state that is

    shared across specs. This could be in the form of:
      * Broadly scoped variables, like global or class variables
      * Databases
      * Key/value stores
      * Caches
      * Or, even the DOM, if you’re writing JavaScript tests.
    Alright, that’s what causes order dependency, but…
  42. First, eliminate non-determinism by running the failing spec repeatedly, in

    isolation. If it fails, that’s non-determinism.
  43. If not, then run ALL the specs that ran together

    with the failing spec. One of them is leaking state into the failing spec.
  44. If running the specs in the default order doesn’t reproduce

    the failure, randomize the order in which the specs are run using the rspec order random option. Keep running it until you find a seed that consistently causes the failure.
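
    In practice that looks something like this (the seed shown is illustrative; use whatever seed RSpec prints for a failing run):

      bundle exec rspec --order random                # let RSpec pick a seed; note the seed it prints
      bundle exec rspec --order random --seed 20412   # re-run seeds until one consistently reproduces the failure
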
  45. Next, locate the leaky spec or specs. I say specs,

    plural, because sometimes it takes several specs running in a specific order to produce the failure. You can use rspec bisect to find the leaky specs for you. I’ll show you how in a minute…
  46. But, first, how can we fix order dependent failures? You

    can remove the shared state, make it immutable, or isolate it…
      * Don’t use broadly scoped variables
      * Mock the shared data store (which you can do easily with a layer of abstraction, like the repository pattern).
      * Use database transactions, or
      * Reset the shared state between specs
  47. Alright. Let’s see if we can fix another one of

    those flaky specs that are keeping you from that thing…
  48. Notice that both getter and setter leverage an object called

    Cache. And, they are calling a class method named `instance`. What is that? Let’s take a look.
  49. The `instance` method effectively turns this class into a singleton

    so that every reference to the Cache.instance method is getting the same instance.
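
    A minimal sketch of that pattern (the read and write methods are assumed for illustration; the slide code may differ):

      class Cache
        def self.instance
          @instance ||= new # memoized: every caller gets this same object
        end

        def initialize
          @data = {}
        end

        def read(key)
          @data[key]
        end

        def write(key, value)
          @data[key] = value
        end
      end
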
  50. Here’s the specs for the `favorite_session` getter and setter… These

    specs pass when run in this order: getter first, then setter. But, they’ll fail in the opposite order because
  51. So, the getter will return the value in the cache,

    not “Matz’s Keynote”. To prove that, let’s run rspec with the order random option to see if we can get it to fail…
  52. So, here you can see… I ran the specs with

    the order random option. And,
  53. RSpec chose the random seed 12322. And, the getter ran

    before the setter, so it passed. Let’s try that again…
  54. Well, we know that one of the specs that ran

    before the getter spec must have polluted the environment. In this case, we’re pretty sure the setter spec is the culprit because of the memoization. But, what if you didn’t know which of the specs was to blame for modifying the environment? That’s where rspec bisect comes in. Let’s take a look…
  55. Here, I’m running rspec bisect with the same order clause

    and seed that produced the failure. This is important, because bisect won’t work unless it can reproduce the failure.
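
    For example, reusing the (illustrative) seed that reproduced the failure:

      bundle exec rspec --order random --seed 20412 --bisect
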
  56. Next, it analyzes whether or not the failure appears to

    be order dependent. In this case, it does.
  57. So, it performs a binary search, looking for the spec

    or specs that need to be run first in order for the failure to happen. Note that this can take a very long time if the set of candidate leaky specs is large.
  58. Then, finally, it reports the minimal command required to reproduce

    the problem. Run that command and you’ll see exactly which order the specs ran in to cause the issue.
  59. Here we are back at the beginning. The font is

    smaller because the solution here is bigger. One way we could approach this would be to call Cache.clear in between specs. But, because our specs are currently sharing state, that would likely lead to a race condition on the build server where we’re probably running the specs in parallel. So, the solution I prefer is actually dependency injection. That’s a simple technique where I just pass the Cache object into the RubyConf object when it is created. So each spec can have its own cache. Here’s what that looks like in the code…
  60. First, here’s the new initializer. Notice that the cache parameter

    defaults to Cache.instance. So, if we don’t pass anything, the code will just use the singleton, which is what we want. By doing this we’ve now created a seam in the software that allows the specs to use their own cache objects. This prevents state from leaking between the specs, without modifying the behavior of the production code. To finish up, we need to modify the specs…
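
    Roughly like this, building on the Cache sketch above (not the exact slide code):

      class RubyConf
        def initialize(cache: Cache.instance) # production callers still get the shared singleton
          @cache = cache                      # specs can inject an isolated cache instead
        end

        def favorite_session
          @cache.read(:favorite_session)
        end

        def favorite_session=(title)
          @cache.write(:favorite_session, title)
        end
      end
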
  61. Here, we’re creating a new instance of the Cache class

    and passing it to the RubyConf object when we create it. That’s it. That’s all there is to dependency injection. And, because each spec has its own cache, they no longer share state. Let’s run the specs again…
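
    Something like this (again, a sketch built on the code above):

      RSpec.describe RubyConf do
        it "returns the favorited session" do
          conf = RubyConf.new(cache: Cache.new) # this example's own cache; nothing leaks between specs
          conf.favorite_session = "Matz's Keynote"
          expect(conf.favorite_session).to eq("Matz's Keynote")
        end
      end
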
  62. I’m running the specs again — with the same randomized

    seed that caused them to fail in the first place. Now, they pass, even though
  63. Ok! We’re making good progress, and it’s only 4:30! You

    might actually make that thing, after all! Just a couple more broken specs.
  64. What is a race condition? A race condition is what

    can happen when parallel processes compete over a scarce, shared resource. Let’s look at how that happens with a file…
  65. So, both specs in this case, are susceptible to parallel

    flakiness due to a race condition. But, since this is asynchronous code, it’s entirely possible that the specs could pass.
  66. This is why race conditions are notoriously hard to reproduce.

    So, how can you go about debugging them if you can’t reproduce them? Well, you want to take a methodical approach.
  67. The first thing to do is to eliminate non-determinism. Run

    the failing spec repeatedly in isolation. If it fails, that’s non-determinism.
  68. If not, try to eliminate order dependence. Run the failing

    spec and all the specs that ran with it repeatedly in different orders. If you can repro the failure, that’s order dependence.
  69. If that doesn’t work, then run the specs repeatedly in

    parallel with the Parallel RSpec gem. I specifically mention that gem because it seems better suited for running the specs locally than Parallel Test or other options like Knapsack, which seem targeted at Rails apps running on CI. It’s best to debug this locally if at all possible.
  70. If you still can’t reproduce it, you can try randomizing

    the order in which the specs run in parallel.
  71. Once you’ve reproduced it, or even if you can’t, what

    should you look for to fix? The main cause of race conditions on build servers is asynchronous code competing over scarce, shared resources. Those resources might include:
      * File or socket IO
      * Thread pools
      * Connection pools, or even
      * Low memory
    Once you have a suspect, how do you fix it?
  72. Well…
      * For IO-based issues, you can substitute StringIO for other kinds of IO in your spec. I’ll share an example of this in a moment.
      * You can test that the correct messages are being sent between collaborating objects, rather than testing the return value of a method.
      * You can write thread-safe code.
      * You can test threaded code synchronously - by extracting the guts of the thread into a plain old Ruby object (or PORO) and testing that, or
      * You can switch to fibers instead of threads. Fibers are cool, because you can test them synchronously, which is awesome. They significantly reduce the chances of a race condition, because they’re in control of when they relinquish the CPU back to the OS. So, atomic operations can complete without interruption.
      * Finally, you can always add more resources to your test environment. Though, that’s a bit of an arms race. You’ll most likely end up coming back and increasing it again, and again, and again.
  73. Ok. Let’s take a look at the last two flaky

    specs that are keeping you from that thing…
  74. So, this feature of the app manages a list of

    reservations. There are two methods, one to reserve a seat. The other to get a list of the attendees. As you can see, we’re just…
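
    The transcript trails off here; the idea is roughly this (the file name and details are invented for illustration):

      class RubyConf
        RESERVATIONS_FILE = "reservations.txt" # illustrative path

        def reserve(name)
          File.open(RESERVATIONS_FILE, "a") { |file| file.puts(name) } # append one attendee per line
        end

        def attendees
          File.open(RESERVATIONS_FILE, "r") { |file| file.readlines.map(&:chomp) }
        end
      end
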
  75. The first spec ensures that writing “Mickey Mouse” to the

    file grows the number of attendees by 1.
  76. The second spec ensures that when writing multiple lines, Donald

    and Goofy, the attendee count goes up accordingly. These are fine specs. Let’s run them…
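
    Those two specs might look like this (a sketch using the reservation code above):

      RSpec.describe RubyConf do
        it "adds a single attendee" do
          conf = RubyConf.new
          expect { conf.reserve("Mickey Mouse") }
            .to change { conf.attendees.count }.by(1)
        end

        it "adds multiple attendees" do
          conf = RubyConf.new
          expect {
            conf.reserve("Donald Duck")
            conf.reserve("Goofy")
          }.to change { conf.attendees.count }.by(2)
        end
      end
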
  77. Hey! They pass! When run in sequence. In fact, they

    will even pass in the opposite order. But…
  78. In this case, the second spec actually finished first. It

    failed because the attendee count grew by 3, not 2.
  79. And, the first spec, which finished second thanks to parallelism,

    failed because the count grew by 2, not 1. We’ve already seen how that can happen, but let’s walk through it again…
  80. Before we get into how the specs failed, let’s look

    at how this RSpec code works. It’s a little bit complicated.
  81. The second is passed to the change method. The way

    RSpec handles this is to execute the change block (to get the initial value),
  82. and then the change block again (to get the final

    value). Finally, RSpec subtracts the initial value from the final value to get the delta, which needs to match the “by” clause, which in this case is 2.
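
    In other words, the matcher’s bookkeeping amounts to something like this (a simplification, not RSpec’s actual implementation):

      conf = RubyConf.new
      initial = conf.attendees.count   # the change block, evaluated before the expect block runs
      conf.reserve("Donald Duck")      # the expect block runs...
      conf.reserve("Goofy")
      final = conf.attendees.count     # ...then the change block is evaluated again
      expect(final - initial).to eq(2) # the delta has to match the `by` clause
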
  83. Ok. Here we are back at the beginning… This time,

    let’s track the order of operations on this timeline.
  84. First, the second spec reads the file to grab the

    attendee count, which should be 0.
  85. Next, the other writes occur. No telling what order. You’d

    need to look at the file. But, they both happen…
  86. And, finally, both reads happen again. Here, we know that

    the second spec finished first, because its output appeared first. But, it doesn’t really matter.
  87. Ok. So, now that we know how it failed, let’s

    go back to the beginning and show you the solution. Turns out, the Ruby core team thought of this. They knew that testing asynchronous File IO would be a challenge. So, they included a class called StringIO to simulate other kinds of IO in specs. StringIO is a string, but with the interface of a file.
  88. So, what we want to do is to… Allow File

    to receive open and yield a StringIO object.
  89. So, now, when the code calls File.open, the actual object

    that it will receive is a StringIO object. One caveat: Because this string behaves like a file, it has a cursor.
  90. So, after writing to the string, we need to rewind

    it before we can read it. That wasn’t necessary prior to introducing StringIO, because the file was being closed when the object fell out of scope. Here, the StringIO object isn’t closed, because it was declared in the spec and hasn’t fallen out of scope. So, we need to rewind before we can read… Alright! The proof is in the pudding. Did we fix the race condition?
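
    Putting that together, the fixed spec might look roughly like this (again built on the sketches above, not the exact slide code):

      require "stringio"

      RSpec.describe RubyConf do
        it "adds multiple attendees" do
          reservations = StringIO.new # a string that behaves like a file
          allow(File).to receive(:open).and_yield(reservations) # every File.open in the code gets this object

          conf = RubyConf.new
          expect {
            conf.reserve("Donald Duck")
            conf.reserve("Goofy")
            reservations.rewind # move the cursor back to the start so the final read sees the writes
          }.to change { conf.attendees.count }.by(2)
        end
      end
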
  91. So, here we are, 40 minutes in, and we’ve found

    and resolved ALL of the flaky specs that were keeping you from that thing! Time to wrap things up so we can get to that thing in the lunchroom!
  92. Non-deterministic flakiness reproduces in isolation. Look for interactions with non-deterministic

    elements of the environment. To fix this kind of flakiness, mock the non-determinism to make it deterministic. Don’t forget about Timecop when working with date- or time-related specs. There are also tools like WebMock and VCR for handling specs that require network connections. I didn’t show them here because I prefer to just use RSpec like I did in this presentation. But, lots of folks find those tools very useful.
  93. Order dependent flakiness only reproduces with other specs, run in

    a certain order. It will not reproduce in isolation. Look for state that is shared across tests. To fix order dependency, remove shared state, make it immutable, or isolate it. RSpec order random can help you reproduce the failures. And, RSpec bisect can help you locate the leaky specs.
  94. And, race conditions only reproduce with other specs when run

    in parallel, not in isolation. Look for asynchronous code, or exhaustible shared resources. To fix race conditions, isolate things from one another (like we did with StringIO), or use fibers instead of threads. Seriously, they’re amazing! Finally, you can use Parallel RSpec to repro the failures locally instead of on your build server.
  95. And keep in mind that the secret ingredient in every

    flaky spec is an invalid assumption about the environment in which it is running. Sometimes, just remembering that fact will help you identify and resolve the flakiness. Ask yourself, how can I ensure that the environment for this test is what it expects?
  96. Oh, and one more thing… I have a bit of

    a hot take… Debugging this stuff is hard enough. But, it gets one hundred times harder if your specs are too DRY. So, avoid the use of these features of RSpec. They seem harmless — useful even — when you’re writing the specs. But, ultimately they make debugging way too hard. So, try to avoid…
      * Shared specs
      * Shared contexts
      * Nested contexts, and
      * let statements
    Your specs should be super communicative. They are, after all, the executable documentation for your code. If you have to scroll all over the place or open a ton of files to write the specs, you can be guaranteed that you’ll be doing the same when you’re trying to understand and debug them when they fail.
  97. Don’t get me wrong. I love RSpec. But, it’s best

    to leave your tests WET. And, I’m not alone in this belief. The fine folks at thoughtbot have written about it. And, in fact, I honestly think that DRY might be the worst programming advice ever. I told you it was a hot take. If you disagree, come find me so I can change your mind.
  98. Again, my name is Alan Ridlehoover. I do know a

    thing or two about flakiness. But, it took me 20+ years to get here. Hopefully, this talk has short-circuited that for you…
  99. As I mentioned at the beginning of the talk, I

    work for Cisco Meraki. So, I also know a thing or two about connectivity! Here’s how to connect with me. And, that last item is where you can find the source code for this talk. There’s even a bonus flaky spec regarding a raffle winner that I didn’t have time to cover today. Cisco Meraki is probably the largest Rails shop you’ve never heard of. And, we’re growing. We are currently hiring for a limited number of roles. Come chat with us to find out what it’s like to work at Meraki.
  100. Finally, a little shameless self promotion… My friend, Fito von

    Zastrow, and I love Ruby so much, we occasionally release something into the wild in the hopes that folks will find it useful. You can find links to our stuff at first try dot software, including Rubyist, the opinionated VS Code color theme I used in this talk. We’d love for you to check it out.
  101. Thank you so much for coming! If you have questions,

    come chat with me offstage. Or find me at lunch.