
ci: Out of the tarpit #2923

Merged 15 commits from marco/out-of-the-tarpit into master on Aug 20, 2024

Conversation

@MarcoPolo (Collaborator) commented Aug 15, 2024

This PR gets us out of the tarpit of flaky tests. It reruns failed tests to see if they are flaky. If they consistently fail, the job fails. If they are flaky, they are recorded as such, but the job succeeds. The job outputs a sqlite database of all test runs; for each test, the table records whether it passed or failed and how many times.
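As a rough illustration of how those per-test counts could be pulled back out of the uploaded database, here is a small Go sketch; the file name, table name, and columns are assumptions for this example, not the PR's actual schema.

```go
// Sketch: read per-test pass/fail counts out of the uploaded sqlite artifact.
// The schema (test_results with package/test/outcome/attempt columns) is an
// assumption for illustration only.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed sqlite driver
)

func main() {
	db, err := sql.Open("sqlite3", "test_results.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count how many times each test passed and failed across all attempts.
	rows, err := db.Query(`
		SELECT test,
		       SUM(outcome = 'pass') AS passes,
		       SUM(outcome = 'fail') AS failures
		FROM test_results
		GROUP BY test
		ORDER BY failures DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var test string
		var passes, failures int
		if err := rows.Scan(&test, &passes, &failures); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d passed, %d failed\n", test, passes, failures)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Something along these lines is presumably what the summary and dashboard aggregate from each run.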

The summary now includes which specific tests failed, so you don't have to hunt through the logs. It will also link you to an Observable dashboard that lets you dive deeper (example).

I think it's worth experimenting with this, and we can always revert it later.

Some valid concerns:

Q: Why do this?
A: With regard to CI, making changes has felt like walking through a tarpit. And a bad CI desensitizes you to actual failures; CI failures should be very high signal and very low noise.

Q: Won't this hide flaky tests?
A: No more than we already do by rerunning the tests. I'd even argue it hides them less, because all the flakiness is presented on a single page rather than spread out across runs.

Q: What happens when a test is so flaky it fails every time?
A: Sounds like an easy test to fix. Most, if not all, of our flakes happen less than 20% of the time.
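To make that concrete (the rerun count here is an assumption for illustration, not a number taken from this PR): a test that flakes 20% of the time and gets two reruns only blocks the job if it fails all three attempts, which happens roughly 0.2 × 0.2 × 0.2 = 0.8% of the time.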

Q: Does this replace the Flaky Tests dashboard?
A: Not yet, although I have a FlakyTestsV2 dashboard that makes better use of the sqlite format. This approach is much easier to iterate on and much faster than the original Flaky Tests Dashboard.

Q: Why are you "forking" go-test-template.yml?
A: There's no other way to make the changes I want. I think go-libp2p can graduate to owning its own go-test workflow, and I prefer the strategy of a simple workflow calling composite actions over a reusable workflow shared across many projects. For now, the differences between the in-tree go-test-template.yml and the upstream one are minimal. In the future I may remove even more things from it.

This PR isn't intended to be the final state. I'd like to clean up the summaries and dashboard a lot more. But I do think this current branch is a lot better than what we have.


How it works:

It introduces a go test helper script that:

  1. Runs all the tests, i.e. go test ./...
  2. Saves the test results in a sqlite file.
  3. If there are any failures, it reruns the failed tests n more times (a sketch of this flow follows the list).
  4. Uploads the sqlite db as an artifact of the job.
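Here is a minimal Go sketch of that flow, under a couple of assumptions: a hypothetical test_results table (matching the read-side example above), the mattn/go-sqlite3 driver, and a retry count of 2. It is not the actual scripts/test_analysis implementation, just an outline of running go test -json, recording every outcome, and rerunning failures.

```go
// Sketch of the rerun-and-record flow; table name, columns, and retry count
// are assumptions for illustration.
package main

import (
	"bufio"
	"bytes"
	"database/sql"
	"encoding/json"
	"fmt"
	"log"
	"os/exec"

	_ "github.com/mattn/go-sqlite3" // assumed sqlite driver
)

// testEvent mirrors the JSON objects emitted by `go test -json`.
type testEvent struct {
	Action  string `json:"Action"`
	Package string `json:"Package"`
	Test    string `json:"Test"`
}

// runTests runs `go test -json` with the given args, records every terminal
// pass/fail outcome in the db, and returns the tests that failed.
func runTests(db *sql.DB, attempt int, args ...string) ([]testEvent, error) {
	// A non-zero exit here usually just means some test failed, so the error is ignored.
	out, _ := exec.Command("go", append([]string{"test", "-json"}, args...)...).Output()
	var failures []testEvent
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		var ev testEvent
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil || ev.Test == "" {
			continue // skip non-JSON lines and package-level events
		}
		if ev.Action != "pass" && ev.Action != "fail" {
			continue
		}
		if _, err := db.Exec(`INSERT INTO test_results(package, test, outcome, attempt) VALUES (?, ?, ?, ?)`,
			ev.Package, ev.Test, ev.Action, attempt); err != nil {
			return nil, err
		}
		if ev.Action == "fail" {
			failures = append(failures, ev)
		}
	}
	return failures, sc.Err()
}

func main() {
	db, err := sql.Open("sqlite3", "test_results.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS test_results(package TEXT, test TEXT, outcome TEXT, attempt INT)`); err != nil {
		log.Fatal(err)
	}

	// 1. Run everything once and record the results.
	failed, err := runTests(db, 0, "./...")
	if err != nil {
		log.Fatal(err)
	}

	// 2. Rerun only the failed tests a few more times (retry count is an assumption).
	const retries = 2
	for attempt := 1; attempt <= retries && len(failed) > 0; attempt++ {
		var stillFailing []testEvent
		for _, f := range failed {
			again, err := runTests(db, attempt, "-run", fmt.Sprintf("^%s$", f.Test), f.Package)
			if err != nil {
				log.Fatal(err)
			}
			stillFailing = append(stillFailing, again...)
		}
		failed = stillFailing
	}

	// 3. Tests that failed on every attempt fail the job; flakes that eventually passed do not.
	if len(failed) > 0 {
		log.Fatalf("%d test(s) failed consistently", len(failed))
	}
}
```

In CI, the resulting test_results.db would then be uploaded as a job artifact (step 4 above).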

There's also an Observable dashboard that:

  1. Parses all the sqlite db artifacts.
  2. Shows which tests failed along with the relevant logs.
  3. Derives a "flakiness score", which answers: for the given state of flakiness, how likely is a job to fail because of these flakes? (n.b. it's a lot like rolling dice with disadvantage; one way to compute it is sketched below.)
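The dashboard's exact formula isn't shown in this thread, so the following Go sketch is only one way to read the "dice with disadvantage" analogy: a flaky test blocks the job only if it fails the initial run and every rerun, and the job fails if at least one flaky test does so.

```go
package main

import (
	"fmt"
	"math"
)

// jobFailureProbability estimates how likely a job is to fail purely because
// of flakes: each test must fail the initial run and every rerun to block the
// job, and the job fails if at least one test does so. failureRates are the
// observed per-test failure rates; reruns is the number of extra attempts a
// failed test gets. This is an assumed interpretation, not the dashboard's
// actual formula.
func jobFailureProbability(failureRates []float64, reruns int) float64 {
	jobPasses := 1.0
	for _, p := range failureRates {
		failsEveryAttempt := math.Pow(p, float64(reruns+1))
		jobPasses *= 1 - failsEveryAttempt
	}
	return 1 - jobPasses
}

func main() {
	// Three hypothetical tests flaking 20%, 10%, and 5% of the time, with two
	// reruns each: the job still fails roughly 0.9% of the time.
	fmt.Printf("%.4f\n", jobFailureProbability([]float64{0.2, 0.1, 0.05}, 2))
}
```

Under that reading, rerunning is exactly what keeps moderately flaky tests (the sub-20% kind mentioned above) from failing jobs, while a test that fails every time still fails the job immediately.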

@MarcoPolo requested a review from sukunrt on August 15, 2024 21:22
@sukunrt (Member) left a comment

Can't wait to start using this ❤️

scripts/test_analysis/main.go (review thread, resolved)
scripts/test_analysis/cmd/gotest2sql/main.go (review thread, resolved)
scripts/test_analysis/main.go (review thread, resolved)
@MarcoPolo merged commit 6c12e22 into master on Aug 20, 2024
13 checks passed
@MarcoPolo deleted the marco/out-of-the-tarpit branch on August 20, 2024 18:13
sukunrt pushed a commit that referenced this pull request Aug 30, 2024
* Lint fixes

* Use latest go version for go-check

Fixes nil pointer issue in staticcheck

* Add test_analysis helper script

* Use custom go-test-template

* Add some tests to the test_analysis script

* Always upload test_results db

* Attempt to fix test on windows

* Better if statement

* Try to fix flaky test

* Disable caching setup-go on Windows

* Better if statement

* Tweak

* Always upload summary and artifact

* Close db

* No extra newline
MarcoPolo added a commit that referenced this pull request Sep 4, 2024