Lessons learnt from running all the tests all the time
About a year ago, here on the Cloud Foundry container team, we decided to start running every one of our test suites every 20 minutes. The implications of such madness have been profound. This article details everything the team and I have learnt from the whole ordeal so far.
Let’s begin by discussing why we did this and what we hoped to achieve. The goal was to reduce the number of flaky tests in our codebase to near zero. Why? Because flakes are bad, they:
👎 Screw up our CI pipelines
👎 Make refactoring difficult and error-prone
👎 Lead to reduced confidence in the codebase
👎 Make me feel like an amateur who doesn’t know how to write good tests
👎 Waste valuable engineering time
👎 Are just generally the worst
And we’d had quite enough of all that. But before we could deal with the flakes we first needed to flush them out. And what better way to flush them out than by running all our test suites every 20 minutes, forever!
This was the genius idea of the team’s Anchor at the time, Will Martin, who then immediately stepped down as Anchor and left me to deal with the fallout.
Fast forward to the present day and we’ve been running our tests like this for over a year now. So, what lessons have we learnt?
The first and most obvious lesson learnt was that more test runs = more flakes. This was of course expected, but what was less expected was the sheer quantity of flakes that started to crop up. In fact it was absolutely relentless, and it pushed the team’s strategy for dealing with flakes to the limit.
In the past, our flake strategy had always been as follows:
➡️ Engineer notices a flaky test
➡️ Engineer creates a chore in the backlog for the flake
➡️ The chore is immediately prioritised at the top of the backlog
That’s a solid strategy to have in place as it forces flakes to be dealt with immediately, i.e. before context is lost on recent code changes that likely introduced the flake in the first place. However, this strategy is less effective when suddenly faced with a thousand new flakes that have been hiding in the test suites for a long time.
If we had prioritised every new flake at the top of the backlog, we wouldn’t have been able to deliver any new features for weeks. Now, there is a super interesting question regarding whether or not we should have stopped work on new features and addressed the flakes before continuing, but that’s a discussion for another day.
For better or worse we continued with feature development and dedicated a track of work to dealing with the flakes – the much-feared flake track. The pair working on the flake track were responsible for identifying the flakiest of flakes and triaging them into actionable chores in the backlog.
This in itself was a challenge and highlighted the need for better tooling around detection and triage of flaky tests. We did briefly experiment with a tool called concourse-flake-hunter, but it turns out that accurately automating the detection of flakes is a lot harder than you might imagine…
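To give a flavour of why automated detection is harder than it looks, here’s a minimal sketch of the simplest useful heuristic: a test that both passed and failed against the same commit almost certainly flaked, since the code under test didn’t change. (This is a Python illustration, not how concourse-flake-hunter actually works; all names and data shapes here are made up.)

```python
from collections import defaultdict

def find_flaky_tests(results):
    """Flag tests that both passed and failed at the same commit --
    the strongest automatable signal that a failure is a flake
    rather than a genuine regression."""
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in results:
        outcomes[(test_name, commit_sha)].add(passed)
    return sorted({name for (name, _), seen in outcomes.items() if len(seen) == 2})

runs = [
    ("creates a container", "abc123", True),
    ("creates a container", "abc123", False),  # same commit, different outcome
    ("destroys a container", "abc123", True),
]
print(find_flaky_tests(runs))  # -> ['creates a container']
```

The hard part, of course, is everything this heuristic misses: genuine failures fixed by an unrelated infrastructure change, flakes that only ever fail, and so on.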
The need for such tooling was also called into question, the assumption being that surely it’s better to focus on preventing flakes rather than building tools to help treat them. While I do agree with that sentiment, it was already far too late for that by the time we kicked off this little experiment. But flake prevention is an important topic, and it’s one we’ve learnt a lot about over the past year or so, which brings us on nicely to…
The second lesson learnt was that when dealing with chaos (i.e. a sudden downpour of flakes) it can be hugely beneficial to take a step back and to look at the bigger picture. At the time we were so preoccupied with fixing flakes on a case-by-case basis that we were failing to see the patterns that were emerging.
On reflection, it’s now much clearer.
The majority of flakes we saw can be categorised into one or more of the following:
❄️ Timeout flake – Occurs when some timeout in the test is exceeded
❄️ Network flake – Occurs when there is a connectivity issue on the network
❄️ Heisenflake – Occurs only when you are not looking at it
❄️ State flake – Occurs when a test relies on the execution environment to be in a particular state when it is not
❄️ Dependency flake – Occurs when some external dependency is not available at test time (e.g. DockerHub, S3, etc.)
And each type of flake requires a certain type of response. Let’s start with timeout flakes, which were by far the most prevalent.
Now the simplest thing to do when confronted with a timeout flake is to increase the timeout. That’s a perfectly acceptable solution to a one-off timeout flake. Much less so when you’re dealing with your thousandth one.
One idea we experimented with was removing timeouts from our tests entirely. Can’t get a timeout flake if you don’t have a timeout! I actually really like this approach. Sure, it may result in occasionally-slower-to-fail tests, but I’d argue that that’s much more desirable than a test that’s going to flake all the time.
Of course, when removing test timeouts you do then run the risk of some test runs hanging forever. In order to combat this risk we developed a little helper script called slowmobius. slowmobius watches over all our test runs and notifies us of any that are taking an unusually long time to complete. The end result is no more test timeouts + confidence that we are still made aware of hanging tests 👌.
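The core idea behind slowmobius can be sketched in a few lines: rather than imposing a hard deadline on each test, keep a history of run durations and flag anything that blows well past the historical norm. (This is an illustrative Python sketch, not the actual slowmobius code; the class name and threshold factor are invented.)

```python
import statistics

class RunWatcher:
    """Flags test runs that exceed a multiple of the historical median
    duration, instead of killing them with a hard timeout."""

    def __init__(self, factor=3.0):
        self.durations = []  # completed run durations, in seconds
        self.factor = factor

    def record(self, seconds):
        self.durations.append(seconds)

    def is_suspicious(self, elapsed):
        if len(self.durations) < 5:  # not enough history to judge yet
            return False
        return elapsed > self.factor * statistics.median(self.durations)

watcher = RunWatcher()
for duration in [60, 65, 58, 70, 62]:
    watcher.record(duration)

print(watcher.is_suspicious(80))   # -> False (within normal range)
print(watcher.is_suspicious(600))  # -> True  (probably hanging, go look)
```

The key property is that a slow run only triggers a notification, never a failure, so a legitimately slow test still gets to finish.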
Let’s look at dependency flakes next. Garden supports running containers from Docker images. Because of this a large number of our tests have a dependency on DockerHub. For example, it’s quite common for our tests to create a container from an image pulled straight from DockerHub.
Given the frequency at which our tests are now running, it’s not super uncommon for a test to flake with a “could not connect to DockerHub” error.
This can be very frustrating, especially given we have no control over the availability of DockerHub. One idea we discussed was to internalise the dependency and to setup our own Docker registry for use in our tests. I’m not particularly keen on that idea though. I think it makes our tests less “realistic” and moves them further away from how the majority of end users will use our product.
If an external dependency is so flaky that you are considering workarounds like this, maybe that’s a signal that some bigger change is required. For example, maybe you need to introduce some retry logic to the production code. Or maybe you should consider alternative dependencies entirely.
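If you do go down the retry route, the shape of it is roughly this: retry the flaky external call a bounded number of times with exponential backoff before surfacing the error. (A Python sketch under invented names; the `pull_image` stub below just simulates a registry outage.)

```python
import time

def with_retries(operation, attempts=3, base_delay=1.0):
    """Retry a flaky external call (e.g. an image pull) with
    exponential backoff, re-raising only once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

def pull_image():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("could not connect to registry")  # simulated outage
    return "image pulled"

print(with_retries(pull_image, base_delay=0.01))  # -> image pulled
```

Note that this belongs in the production code, not the tests: if users hit the same transient registry errors, papering over them in the test suite alone would be hiding a real product gap.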
These days DockerHub is pretty stable so we’ve been happy to keep things as they are. Besides, whenever DockerHub does go down, our tests don’t really “flake” as such. Rather they continually fail for some period of time before then all coming back online. That’s not so bad because it doesn’t require as much engineering time to investigate.
Heisenflakes are the worst of all flakes. They are impossible to reproduce and occur only once in a lifetime. We literally had to investigate a flake that would occur once in every ~3000 runs. COME ON!
But there is hope. The pattern we’ve adopted for dealing with Heisenflakes is to not even try to reproduce them, as, by definition, they will not occur when you are trying to debug them. Instead we focus on improving our logging/debug output so that we are in a better position to understand the flake if/when it ever occurs again. This has saved us many, many hours of painfully boring debug time over the past year or so.
As for Network flakes, well to be honest there’s not a whole lot you can do about those. The best advice I can give is to make sure you’re running your tests on a solid network connection and to only use the network when absolutely necessary.
And finally state flakes – these are usually the result of improper test setup or teardown. Or perhaps an accidental and implicit dependency on the order in which tests must be run.
The best way to combat state flakes is to ensure your tests are run in a completely random order. While this may lead to the occasional Heisenflake, it will result in a healthier test suite in the long run.
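Most test frameworks can do this for you, but the mechanism is simple enough to sketch: shuffle the tests with a recorded seed, so that when a run does fail you can replay the exact order that triggered it. (Illustrative Python; the test list and runner shape are made up.)

```python
import random

def run_in_random_order(tests, seed=None):
    """Shuffle tests before running, printing the seed so a failing
    order can be reproduced exactly."""
    if seed is None:
        seed = random.randrange(2**32)
    rng = random.Random(seed)
    order = list(tests)
    rng.shuffle(order)
    print(f"test order seed: {seed}")
    return [(name, fn()) for name, fn in order]

tests = [
    ("test_create", lambda: "pass"),
    ("test_delete", lambda: "pass"),
    ("test_list", lambda: "pass"),
]
results = run_in_random_order(tests, seed=42)
```

The seed is the important bit: random order without reproducibility just converts state flakes into Heisenflakes.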
Ok, what’s next? Your CI infrastructure will die! This was something we overlooked for far too long. By running our tests periodically we had inadvertently increased the load on our CI machines. This in turn led to, you guessed it, more flakes!
In other words, our big attempt to minimise flaky test runs directly contributed to the generation of even more flaky test runs… GOOD ONE!
The key thing we were missing here was proper monitoring of our CI. Once we got some graphs in place we saw that every 20 minutes the load average would jump to over 100 and it became very clear that we needed to scale up.
Once we did that the number of flakes started to drop quite significantly. Something that now seems so obvious …
There’s time for one more lesson.
This one’s a bit embarrassing…
We use Slack messages to notify us of any flaky test runs. Each notification contains all the information we could possibly need to identify the build: the job name, a link to view the build output in a browser, and a copy-paste string we can use to jump into the build container.
Of course, during what has come to be known as “flakemageddon”, we were getting a lot of these notifications. Which in itself isn’t necessarily a problem, but we’d overlooked the fact that we weren’t the only ones who could see these notifications…
Our Slack channel is completely public, which means anyone can jump in to ask us questions, or to just hang out in what is clearly the coolest Slack channel in all of #cloudfoundry.
What’s less cool is jumping in to be greeted by pages and pages of messages about failing test runs… Not sure that’d fill me with a whole lot of confidence in the product 😬.
For some ridiculous reason we were posting flake notifications to our main channel rather than a separate, notification-only channel. I think maybe there was some concern that we would be less likely to keep an eye on a separate channel? Either way this was clearly a bad impression to be sending out and so we have since moved all notifications to a separate channel.
This actually had an unintended side effect of helping to deal with the flake triaging as well. Now we were able to see all notifications at a glance in a single place, which helped us to spot the ones that were most troublesome. While this helped us out a bit, I still reckon there’s room for improvement in the general flake triaging area.
Let’s wrap this up.
So, was this all worth the effort? The short answer is yes, yes it was. Our test suites are now much, much healthier and I also learnt a huge amount about how to write better tests. I will 100% take this practice with me onto new teams and new codebases in the future.
One really important point I’ve yet to mention is that almost all of our time spent working on this was spent fixing up the test code. There were very few changes (if any) to our production code. On the one hand that’s obviously a good thing as it means our production code is solid. But on the other hand we did invest a lot of engineering time into this.
In my opinion it was absolutely worth it. While the only “actual” output may have been improvements to our test suites, the amount we all learnt about our codebase through flake debugging has been absolutely invaluable.
In summary, we decided to run all our tests all the time and as a result are now able to better focus our engineering efforts on the things that really matter.