cmd/go: add support for dealing with flaky tests #62244
Comments
I've investigated an awful lot of test flakes in the Go project over the past few years. Some observations:
The common elements of these points are that flaky tests often need to be retried at a fine granularity and with a progressively increasing timeout, rather than retried a fixed number of times in the same configuration. To me, that suggests an API addition to the testing package. |
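As a rough illustration of the kind of retry loop people hand-roll today for this (a sketch only; doFlakyOp and the attempt count are made up, and only standard library APIs are used):

```go
package example

import (
	"context"
	"testing"
	"time"
)

// doFlakyOp stands in for whatever operation occasionally flakes.
func doFlakyOp(ctx context.Context) error { return nil }

func TestFlakyOp(t *testing.T) {
	const maxAttempts = 3
	timeout := 2 * time.Second
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err = doFlakyOp(ctx)
		cancel()
		if err == nil {
			return // fine-grained success: only this operation was retried
		}
		t.Logf("attempt %d failed with timeout %v: %v", attempt, timeout, err)
		timeout *= 2 // progressively increase the timeout, per the observation above
	}
	t.Fatalf("failed after %d attempts: %v", maxAttempts, err)
}
```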
I'm sure we all agree on that, Brad, but it doesn't follow that we should add special functionality to Go to make it easier for developers to live with flaky tests.
I'm not sure why you expect a different answer this time. Frankly, if having flaky tests annoys you, or is inconvenient, I'd say that's a good thing.
Agreed. The solution is to delete them instead, not to mask the flakiness with automatic retries. |
I wholeheartedly support this proposal. Addressing flaky tests is a real-world necessity and providing a structured way to manage them in the standard library is a step in the right direction. |
I agree, I'd rather have a conservative opt-in way to retry test failures, so that flakiness "policy" is defined per test and is in full control of the developer:

```go
t.RunRetry("name", func(t *testing.T, attempt int) {
	if attempt > 3 {
		// Give up and forgive.
		t.Skip()
		// Or give up and fail hard.
		// t.FailNow()
	}
	// Iterate with increasing tolerance depending on attempt.
	t.Fail()
})
```
|
This proposal has been added to the active column of the proposals project |
This is a really important point that I think is worth elaborating. The impact of flaky tests on the contributors to a software project is often directly correlated with the constraints imposed by the organization responsible for the project.

Many software projects won't suffer from flaky tests, maybe because the code has minimal concurrency, the scope of the problem is small, or because the code isn't critical enough to require comprehensive end-to-end testing of all the components. Other projects with significant concurrency, large scope, and large end-to-end tests may encounter the occasional flaky test, but also don't suffer from flaky tests because the team has sufficient time and resources to deal with them. The Go project has some flaky tests, but appears to have significant infrastructure in place to identify and triage the flaky tests, and also to mitigate the impact of those flakes on other contributors. Not every project is so fortunate.

If your experience is working only on projects in these two categories, it can be easy to dismiss the problem of flaky tests as not significant. Maybe the problem of flaky tests is not significant to the majority of projects, but it can be very real for some.

A few years ago I joined a team that was working on a well known open source project. The project has thousands of tests, many of which could flake. The team had attempted to deal with the flaky tests for years, but could never get them into a reasonable state. The project had been around for 6 years already, had hundreds of contributors who no longer worked on it (many from the open source community), and most of the flaky tests were written by someone who was no longer involved with the project. The cost of fixing a flaky test was high enough that by the time one or two were fixed, new ones had appeared. Contributors were forced to either click "rerun job" in CI multiple times (and wait hours to get a green run), or assume the test was flaky and merge with a failing CI check.

Multiple attempts were made to manually identify flaky tests by creating GitHub issues to track them. This helped a bit, but by itself didn't really address the problem. Creating GitHub issues for every CI run was still taking time from every contributor, and it was often easier to click re-run and not create a GitHub issue. GitHub issues on their own did not help prioritize the most flaky of tests. This manual approach was not enough. This experience is what motivated me to write the
This feature allowed the team to make real progress against fixing flaky tests. Instead of manually ticketing and re-running failures, we could use the

I've seen some interest in this feature from the community. This Sourcegraph search shows at least a few recognizable projects using it. I imagine there are others that are not indexed by Sourcegraph, not open source, or are using the flag in a script that doesn't match the search query. |
Given my experience with flaky tests above, I have a few questions about the proposal.

Why only re-run marked tests? My rationale for re-running flaky tests has generally been that an existing flaky test should not introduce friction for other contributors. If a contributor has to create a GitHub issue and edit a test they are not familiar with, that is non-zero friction and they may be tempted to just re-run the CI job instead. I would generally prefer to have an automated system notice the flakes by reading the

Some indication of re-run tests is important, but given the nature of a flaky test, it seems unlikely that the person adding the mark is going to be the same person that introduced the flake. Is the marking of a test more for the developer reading the test code, so that they understand it's a flaky test?

Do we need new output to show "that a test was flaky and failed but eventually passed"? If a test appears multiple times in the |
I think @vearutop is on the right track with t.RunRetry. I suggest the following additions to testing.T:

```go
// Restart is equivalent to Retry followed by Skip.
func (*T) Restart(args ...any)

// Restartf is equivalent to Retry followed by Skipf.
func (*T) Restartf(format string, args ...any)

// Retry marks the function as unable to produce a useful signal,
// but continues execution.
// If the test runs to completion or is skipped, and is not marked as
// having failed, it will be re-run.
func (*T) Retry()

// Retries reports the number of times the current test or subtest has been
// re-run due to a call to Retry. The first run of each test reports zero retries.
func (*T) Retries() int64
```

The same methods could also be added to

The idea is that when a test flakes, it calls |
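For instance, a flaky dial might be written like this under the proposed API (a sketch only: Restartf and Retries do not exist in testing today, and the address and retry cap are placeholders):

```go
func TestDialService(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", "localhost:8080") // placeholder address
	if errors.Is(err, context.DeadlineExceeded) && t.Retries() < 3 {
		// Proposed Restartf: mark this run as producing no useful signal and skip,
		// so the test is re-run rather than reported as failed.
		t.Restartf("dial timed out (known flake); re-running")
	}
	if err != nil {
		t.Fatalf("dial: %v", err) // any other failure is a real failure
	}
	conn.Close()
}
```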
The marking of a test is so that the flake-retry mechanism doesn't mask actual regressions — such as data races or logical races — that may manifest as low-but-nonzero-probability failures in tests that were previously not flaky at all. |
Why return int64 instead of int? |
Because a very short test can plausibly fail more than 2^31 times before it times out on a 32-bit machine? (But |
I was thinking the number would be on the order of tens at most. 2^31 retries would be quite the flaky (and short) test. It sounded like the usual source is timeouts, which suggested longer tests to me. |
Agreed, although I think it's important that whatever API we add is also able to gracefully handle other kinds of flakes. (And note that for the “flake manifests as a deadlock, race, or runtime throw” sort of test, generally the test would need to reinvoke itself as a subprocess, run that subprocess using |
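For completeness, the subprocess pattern mentioned here usually looks something like the sketch below; the environment variable name, timeout, and inner body are illustrative, not part of any proposal:

```go
package example

import (
	"context"
	"os"
	"os/exec"
	"testing"
	"time"
)

// runCodeThatMayDeadlock stands in for the real, occasionally-deadlocking test body.
func runCodeThatMayDeadlock() {}

func TestMayDeadlock(t *testing.T) {
	if os.Getenv("MAY_DEADLOCK_INNER") == "1" {
		runCodeThatMayDeadlock()
		return
	}
	// Re-invoke this test in a subprocess with its own deadline, so a deadlock
	// kills only the subprocess instead of hanging the whole test binary.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, os.Args[0], "-test.run=^TestMayDeadlock$", "-test.v")
	cmd.Env = append(os.Environ(), "MAY_DEADLOCK_INNER=1")
	if out, err := cmd.CombinedOutput(); err != nil {
		t.Fatalf("subprocess failed: %v\n%s", err, out)
	}
}
```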
Data races and logical races are also common causes for flakes. Is the distinction here "a flake caused by production code" indicating a bug that needs to be fixed vs. "a flake caused by test code" indicating a lower priority to fix?

Is the marking of tests actually effective in preventing masking of those failures? Consider these scenarios.

Scenario 1: I contribute a change that introduces a new logical race. I run the tests a few times but the race only causes the test to fail 1 in 20 runs, so it never fails on my change request. The change is merged into main. A few days later it fails on a change request from a different contributor. That contributor doesn't recognize the error. They have a few options.

In this scenario the test wasn't automatically re-run, but due to the nature of flaky tests not happening frequently, the flake is masked either way. In the best case the bug isn't fixed until the appropriate party becomes aware and has time to look into it.

Scenario 2: As above, I contribute a change that introduces a new logical race, it doesn't fail on my change request, and the change is merged into main. This time tests are being retried without having to mark them. When tests are retried they are reported on the change request, and when new ones appear that same automation opens an issue for them. In this scenario the appropriate party still becomes aware of the problem at about the same time (when new issues are reported), and that may even happen earlier, because you aren't relying on every single contributor to do the right thing. The automation handles it.

Scenario 3: I contribute a change that introduces a regression, causing requests to time out more often. The tests that cover this change are already marked as flaky, so the tests are retried and they pass after a couple of attempts. In this scenario flaky test markers are used, but they still mask the regression. Until someone looks at the frequency of flakes, they don't notice the regression.

In these scenarios the marking of tests doesn't seem to prevent failures from being masked. In the unlikely scenario where a test flakes on the change request that introduces the bug, it could mask the problem, but only if the retried tests are not made visible. Scenario 3 already requires that any retried tests are made visible, which seems like it prevents the failure from being masked.

The problem of flaky tests is arguably often more a cultural one than a technical one. That's why this point from the original post is so important:
The API proposed here seems like it would be great when the flaky behaviour in a test is already well understood. From the

If the tests aren't retried because of any

In my experience, and based on my read of the original proposal, this doesn't really address the most painful parts of dealing with flaky tests. If |
For some kinds of tests, particularly end-to-end tests of large systems, any failure would benefit from being retried. It's easy to introduce flaky behaviour into these kinds of tests, and a human is going to have to rerun them to determine whether the failure is reliable or flaky. Rerunning these automatically would be helpful. Maybe a new

Something like

That approach should provide good signals on new failures (they aren't masked as passing tests) while still making it easier for developers to work on projects that are prone to flaky tests. |
No. The distinction is “a flake caused by something known to be a preexisting issue” vs. “a flake whose cause has not been triaged”. If no one has even looked at the failure mode before, how can we assume anything at all about its priority?
Marking of tests? No, because if you mark the whole test as flaky then any failure at all will be retried. But marking of failure modes is effective: if the test is only marked as a flake if its failure matches a known issue, then a new failure mode will still cause the failure to be reported. |
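Concretely, marking a failure mode rather than a whole test could look like the following sketch (Retry is the method proposed earlier in this thread; errKnownFlake and fetch are placeholders):

```go
func TestFetchReport(t *testing.T) {
	_, err := fetch(context.Background()) // placeholder for the flaky operation
	if errors.Is(err, errKnownFlake) {
		// Only the failure mode tracked in an existing issue triggers a retry.
		t.Logf("known flake, see tracking issue: %v", err)
		t.Retry()
		t.SkipNow()
	}
	if err != nil {
		// A new, untriaged failure mode is still reported as a real failure.
		t.Fatalf("fetch: %v", err)
	}
}
```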
@dnephin, regarding your more specific scenarios.
If someone is doing code reviews, ideally they will notice the CI failure when they review the change. But yes: if contributors and maintainers are in the habit of hitting a “retry” button without also reporting the test failure, then the test failure will presumably go unreported. That's why CI systems are usually run as “continuous” (post-commit) tests rather than just pre-commit. If the failure happens during post-commit testing, then it will presumably be reported in the usual way.
That assumes that someone is triaging the failures reported by the automation. If a project isn't keeping up with triage for its test failures, why would we expect it to keep up with triage for automated failure reports? (Aren't those more-or-less equivalent?)
Yes, that's the downside of marking a given failure mode as a flake: it masks all other regressions with the same failure mode. To minimize the masking effect, it is important to mark the most specific failure mode that is feasible. But this is closely related to performance monitoring in general. If you want to test timing- or performance-related characteristics, you need a testing setup that is fundamentally different from the usual heavily-loaded CI environment. |
In my experience, the most painful part of dealing with flaky tests is the time and effort it takes to understand the failure, not to mark it. It is usually much easier to mark a failure mode as flaky than to diagnose and fix its underlying cause. |
Thank you for the detailed response. I now better understand the motivation for the proposed solution.
The number of flaky test failures can often be due to historical compromises. It's common for organizations to make technical compromises to deliver a product. Even if a team is able to spend enough time to deal with new flakes being caused today, that doesn't mean they're able to immediately address the large number of flaky tests built up over previous cycles due to those compromises. The key difference is that the "always retry and track using automation" approach gives the team flexibility about when they spend the time to triage. The investigation can be scheduled for a convenient time, and all the flakes don't need to be fixed each time.
No doubt it is difficult to assume too much without doing some investigation, but there are some things that can help prioritize the investigation. A flake in a test for a critical system is going to be higher priority than a test failure in a less critical system. A flake that happens 1/10 runs is also potentially higher priority than one that occurs 1/1000. When a flake first appeared can also help prioritize. A flake that has existed for years is likely lower priority than one that just started in the last week.
Definitely the most time consuming part of dealing with flaky tests is understanding the failure, but I would argue that doesn't necessarily make it the most painful. I often enjoy the challenge, especially if I have the support of my team and organization to do that work. I would argue the painful part of dealing with flaky tests is the disruption to other work tasks. Someone trying to contribute a new feature or a bug fix shouldn't be interrupted with having to understand a test failure in a completely different part of the system. If marking a test is simply adding a line at the beginning of the test (ex:
|
It seems like there are two main proposals on the table: Brad's and Bryan's. They differ in a few details but it sounds like there is general agreement we should try to do something here. If I can simplify them to their cores:
For both assume an f form too: t.Flakyf and t.Retryf. Let me know if I've grossly misunderstood or oversimplified.

A benefit of t.Flaky is that monitoring can notice purely from the test logs that a particular test is marked as flaky but not flaking. To do this with t.Retry, you'd need both test logs and a scan of the source code to find the Retry calls and check for ones that never appear in recent logs.

A benefit of t.Retry is that it is more precise about the specific reason for the flake. If the test starts failing for some other reason, that's a fail, not a retry. For example, in this Tailscale code there are many t.Fatal lines, and probably only a couple are implicated in flakiness. For example, probably key.ParseNodePublicUntyped is not flaky.

But this precision comes at a cost. One might say that a benefit of t.Flaky is that it's not precise, so that you don't have to chase down and rewrite a specific t.Fatal line just to mark the test flaky. This is not a big deal if the t.Fatal line is in the test itself, but the t.Fatal line may be in a helper package that is less easily modified. (On the other hand, maybe that's the fault of the helper package, but people do like their assert packages.)

Bryan's full proposal also gives the test more control over what happens next, with the Retries method. That said, I am not sure exactly how a test would use that extra control, so I left it out above. It seems better for the testing package and the command line to control retries.

Looking at Go's main repo, we use SkipFlaky and SkipFlakyNet almost exactly like t.Retry, which is not terribly surprising. Looking at Tailscale uses of flakytest.Mark, it does give me pause not to know where exactly the flaky failures happen. So I would lean toward trying t.Retry first.

What do people think about adding just the one method t.Retry, as described above, along with the -maxretry flag? |
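To make the two shapes concrete side by side (both methods are still proposals, not existing testing APIs; the issue URL and helper names are placeholders):

```go
// t.Flaky shape: mark the whole test; any failure may be retried.
func TestMarkedWholeTest(t *testing.T) {
	t.Flaky("https://go.dev/issue/NNNNN") // hypothetical method, placeholder issue URL
	// ... test body unchanged ...
}

// t.Retry shape: mark only the failure site that is known to flake.
func TestMarkedFailureSite(t *testing.T) {
	err := doFlakyThing()               // placeholder operation
	if err != nil && isKnownFlake(err) { // placeholder predicate
		t.Retry() // hypothetical method: this failure site requests a re-run
		t.SkipNow()
	}
	if err != nil {
		t.Fatal(err) // any other failure is a real failure, not a retry
	}
}
```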
I think a

Also note that for many classes of test, arbitrarily many calls to

The main purpose of the |
Curiously, the
Adding the timeout gives the test enough time for a meaningful retry after a deadlock, but introduces the possibility of timeouts in normal operation of the test. An exponentially increasing timeout allows the test to bound the amount of time wasted on these normal-operation retries to be linearly proportional to the running time of the success case. (And, again, the |
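A sketch of what a Retries-scaled, exponentially increasing timeout might look like inside a test body (again assuming the proposed Retries and Restartf methods; the base timeout and doSlowOp are made up):

```go
func TestWithScaledTimeout(t *testing.T) {
	// Double the per-attempt deadline on every re-run: 10s, 20s, 40s, ...
	timeout := 10 * time.Second << t.Retries()
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	if err := doSlowOp(ctx); errors.Is(err, context.DeadlineExceeded) {
		// Time spent on normal-operation retries stays proportional to the
		// success-case running time, as described above.
		t.Restartf("timed out after %v; re-running with a doubled timeout", timeout)
	} else if err != nil {
		t.Fatal(err)
	}
}
```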
I looked at another random flaky Tailscale failure and found a test that uses c.Assert everywhere. There's nowhere to put the t.Retry method, because the Fatal is buried in the Assert. The same problem happens with any test helpers, not just ones as general as "Assert".

Brad also pointed out that adding t.Flaky to the top of a test is something an automated process can do, whereas finding the right place to put t.Retry is not (in the Assert case, even a human can't do it). His vision is automation to file a bug and add the t.Flaky(bugURL), and then also automation to remove the t.Flaky when the flake is fixed. This does sound like a good set of tools to enable building.

So I am leaning more toward t.Flaky again rather than t.Retry. |
It seems to me that there is a very minor change to

Then the “tool-automated tagging helper” use-case looks like:

```go
func Mark(t testing.TB, issue string) {
	t.Cleanup(func() {
		if t.Failed() {
			t.Logf("flakytest: retrying flaky test (issue %v)", issue)
			t.Retry()
		}
	})
}
```

That comes at the expense of needing somewhat more careful usage to avoid retrying unexpected failures:

```go
if errors.Is(err, context.DeadlineExceeded) && !t.Failed() {
	t.Logf("retrying with longer timeout")
	t.Retry()
}
```

Crucially, though, it still does support the use-case of retrying in normal operation. As far as I can tell, |
Nice observation. So if we go with the t.Retry, are there any remaining concerns? |
Would someone mind summarizing what would be included with

(For example, would it just be |
I think it would rely on per-test timeouts as @bcmills pointed out, but that means we should probably still include Retries() int too. |
Note that individual tests (or test helpers) could add their own flag for that:

```go
if t.Failed() && t.Retries() < *maxRetry {
	t.Logf("flakytest: retrying flaky test (issue %v)", issue)
	t.Retry()
}
```
|
Packages that add flags to test binaries are hostile to using |
@ChrisHines, so use an environment variable instead? (My point is just that if a test helper wants to set an arbitrary cap on the number of retries, it can do that fairly easily using the |
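For illustration, a helper that caps retries via an environment variable instead of a flag might look like this sketch (the variable name and default are made up; Retry and Retries are the proposed methods):

```go
// maxRetries reads a per-CI retry cap from the environment, defaulting to 3.
func maxRetries() int64 {
	if v := os.Getenv("FLAKYTEST_MAX_RETRIES"); v != "" { // hypothetical variable name
		if n, err := strconv.ParseInt(v, 10, 64); err == nil {
			return n
		}
	}
	return 3
}

func Mark(t *testing.T, issue string) {
	t.Cleanup(func() {
		if t.Failed() && t.Retries() < maxRetries() {
			t.Logf("flakytest: retrying flaky test (issue %v)", issue)
			t.Retry()
		}
	})
}
```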
Fair enough, thanks for clarifying your point. 👍 I guess the question becomes: how important is it to have a standard "global" way to set the max retries vs. potentially multiple ways, which could allow different helpers to have different configurations but also make it hard to know all the different knobs to set when setting up CI pipelines? |
Based on the discussion above, this proposal seems like a likely accept. Add to testing.T only (not B or F):
In the testing output, calling Retry results in an output line like:
In the test2json output, this line produces
|
No change in consensus, so accepted. 🎉 Add to testing.T only (not B or F):
In the testing output, calling Retry results in an output line like:
In the test2json output, this line produces
|
Hmm. I don't think

(An exception for “already cleaning up and exiting” seems too fragile and too ambiguous to me — what would that mean if

I think that would be something like:
|
We should also specify the behavior for subtests. Probably we retry the individual subtest function, rather than retrying the entire chain of parents. But then what happens if a subtest fails and its parent test calls |
I think both @bcmills's comments are correct and should be reflected in the final implementation, whenever that happens. In particular if someone is foolish enough to call t.Retry in a parent after a subtest has failed, seeing an overall PASS seems OK. |
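A sketch of the subtest case being discussed (Retry is the proposed method; the behavior in the comments is what the sketch assumes, not something implemented today):

```go
func TestParent(t *testing.T) {
	ok := t.Run("child", func(t *testing.T) {
		// Under the behavior discussed above, a Retry here would re-run only
		// this subtest function, not the whole chain of parents.
		childBody(t) // placeholder
	})
	if !ok {
		// Calling Retry in the parent after a failed subtest is the dubious
		// case noted above; per the comment, reporting an overall PASS in
		// that situation seems acceptable.
		t.Retry()
	}
}
```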
Background
First off, flaky tests (a test that usually passes but sometimes fails) are the worst; you should not write them, and you should fix them immediately.
That said, the larger teams & projects & tests get, the more likely flaky tests are to get introduced. And sometimes it's technically and/or politically hard to get them fixed (e.g. the person who wrote the test originally left the company or works on a new team). You're then left deciding whether to skip/delete the entire test (which might be otherwise super useful), or teach your team to become immune to test failures and start ignoring them, which is the worst outcome: teams start submitting when CI is red and become blind to real test failures.
Google doesn't have this problem internally because Google's build system supports detecting & annotating flaky tests: https://bazel.build/reference/be/common-definitions
Tailscale has its own test wrapper that retries tests annotated with flakytest.Mark as flaky. We can't use go test -exec=... unfortunately, as that prevents caching (#27207). So instead we need a separate tool that wraps cmd/go (kinda awkwardly, as it turns out).

Go internally also uses these flaky test markers 79 times:

So, clearly flaky tests are a problem. And cmd/go isn't helping.

Proposal
I propose that Go:

- add a way for a test to be marked as flaky in the testing package (e.g. tb.MarkFlaky(urlOrMessage string)); a sketch follows below
- have cmd/go support re-running marked flaky tests up to N times (default 3, like Bazel?). Maybe add -retrycount or -maxcount alongside the existing -count flag?
- have cmd/go emit new output (that test2json recognizes) to say that a test was flaky and failed but eventually passed, that users can plug into their dashboards/analytics, to find tests that are no longer flaky
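A minimal sketch of how a marked test and its CI invocation might look under this proposal (MarkFlaky and -retrycount are proposed here, not existing APIs; the issue URL is a placeholder):

```go
func TestProxyEndToEnd(t *testing.T) {
	// Proposed marker: records that this test is a known flake, tracked in the
	// linked issue, and lets cmd/go re-run it on failure up to the retry limit.
	t.MarkFlaky("https://github.com/example/project/issues/123")

	// ... the usual (occasionally flaky) end-to-end test body ...
}
```

CI would then invoke something like go test -retrycount=3 ./... (again, a flag name this proposal only floats) and consume the extra test2json events to spot marked tests that no longer flake.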
FAQ

Won't this encourage writing flaky tests? No. Everybody hates flaky tests.
Won't this discourage fixing flaky tests? No. We're already retrying flaking tests. We're just fighting cmd/go to do so.
Why do you have flaky tests? Unit tests for pure functions aren't flaking. The tests usually flaking are big and involve timeouts and the network and subprocesses, even hitting localhost servers started by the tests. Many CI systems are super slow & oversubscribed. Even super generous network timeouts can fail.
Wasn't there already a proposal about this? Yes, in 2018: #27181. The summary was to not write flaky tests. I'm trying again.