CI: collect/aggregate results to highlight flakiest tests #19182
One possibility here would be:
As next steps beyond 2, it could surface the results more obviously, for instance as a comment in a GH discussion, or by splatting an HTML file to a GH Pages site (or an S3 bucket) for us to browse. For the initial version, I'd imagine just summarising all failures would be sufficient, on the assumption that 'real' failures (e.g. a test being broken in PR CI) will be far less common than flaky ones for any given test. That is, if a test is flaky, it'll fail regularly across all the builds done by Pants' CI and so sit higher up the list of "most failures", while a real failure might only pop up once or twice in a PR. If that turns out to be a problem, one way to reduce the 'real' failure rate would be only looking at the …
We should definitely do this, but I'm not sure we'll discover that there is a small fixed set of flaky tests. I have a hunch that it's just arbitrary long-running tests getting resource-starved on the weak-ass GHA machines, which is why extending timeouts on specific tests tends not to fix this. But even the info on which tests are the longest-running will be really useful for deciding whether they're worth it, and rewriting them if not.
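A minimal sketch of the kind of aggregation discussed above, assuming the CI runs have already uploaded JUnit-style XML reports into a local `reports/` directory (the directory layout and the choice of ranking by failure count plus total runtime are illustrative assumptions, not the actual Pants tooling):

```python
# Sketch: rank tests by failure count (flakiness signal) and total runtime
# (slowness signal) across a directory of downloaded JUnit XML reports.
# Assumes one report file per CI run under reports/ -- a hypothetical layout.
import glob
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

failure_counts = Counter()          # test id -> number of failed/errored runs
run_counts = Counter()              # test id -> total number of runs seen
total_runtime = defaultdict(float)  # test id -> summed duration in seconds

for report in glob.glob("reports/**/*.xml", recursive=True):
    for case in ET.parse(report).iter("testcase"):
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        run_counts[test_id] += 1
        total_runtime[test_id] += float(case.get("time") or 0.0)
        # A <failure> or <error> child element means the test did not pass.
        if case.find("failure") is not None or case.find("error") is not None:
            failure_counts[test_id] += 1

print("Most frequently failing tests (failures/runs, total seconds):")
for test_id, failures in failure_counts.most_common(20):
    print(f"{failures:>3}/{run_counts[test_id]:<4} {total_runtime[test_id]:8.1f}s  {test_id}")
```

Ranking by failure count surfaces flaky tests (they fail across many unrelated builds), while the runtime column helps with the "is this test worth it?" question raised above.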
@benjyw did this in #19262, uploading to …
i.e. some flaky failures on main, but apparently no timeouts (or maybe I have a bug with the logic there).
Hm, actually, reopening: this task covers summarising too.
Is your feature request related to a problem? Please describe.
Pants' CI is quite flaky at the moment, with a lot of failures due to tests exceeding their timeouts, and some other types of flakes too. I don't think there's currently a good way to get insight into what's flaky other than experiencing the flakes yourself (and/or clicking through a lot of builds), so people might do spot fixes, but it's not easy to get a systematic view.
Describe the solution you'd like
Some sort of 'dashboard' (even an ad hoc/on-demand one) summarising the most common test timeouts and failures over the past month (or some other window); see the rendering sketch below.
Describe alternatives you've considered
None
Additional context
N/A
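As a hedged sketch of what the 'dashboard' output could look like, the aggregated counts from the earlier sketch could be rendered as a Markdown table, suitable for posting as a GH discussion comment or committing to a GH Pages branch (the function name, column choices, and sample test id below are hypothetical):

```python
from collections import Counter


def render_summary(failure_counts, run_counts, total_runtime, limit=20):
    """Render the aggregated per-test stats as a Markdown table."""
    rows = [
        "| Failures | Runs | Total time (s) | Test |",
        "| --- | --- | --- | --- |",
    ]
    for test_id, failures in failure_counts.most_common(limit):
        rows.append(
            f"| {failures} | {run_counts.get(test_id, 0)} "
            f"| {total_runtime.get(test_id, 0.0):.1f} | `{test_id}` |"
        )
    return "\n".join(rows)


# Example with made-up numbers and a hypothetical test id, just to show the shape.
print(
    render_summary(
        Counter({"tests/python/foo_test.py::test_bar": 7}),
        Counter({"tests/python/foo_test.py::test_bar": 120}),
        {"tests/python/foo_test.py::test_bar": 954.3},
    )
)
```

Keeping the output as plain Markdown/HTML keeps the first version ad hoc: it can be regenerated on demand from whatever reports have been uploaded, with no extra infrastructure beyond a place to put the file.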