
Track which tests are run in CI #95995

Open

jyn514 opened this issue Apr 13, 2022 · 9 comments
Labels: A-testsuite (Area: The testsuite used to check the correctness of rustc), T-bootstrap (Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap))

Comments

jyn514 (Member) commented Apr 13, 2022

Currently, it's pretty easy to accidentally miss tests in CI - e.g. I recently found that compiler doctests haven't been enforced in a while: #95994. It would be great to track which tests are being run.

@Mark-Simulacrum did you have some ideas for this already in mind? How were you expecting the generated output to look / what sort of granularity did you want? I think just tracking Step runs should be pretty easy, but maybe not as useful as tracking cargo invocations? Having our own build system gives us some flexibility here.

Originally posted by @Mark-Simulacrum in #95994 (comment)

jyn514 (Member, Author) commented Apr 13, 2022

@rustbot label +A-testsuite +A-rustbuild

rustbot added the T-bootstrap and A-testsuite labels on Apr 13, 2022
Mark-Simulacrum (Member) commented Apr 13, 2022

I think there have been several offhand comments but nothing super well documented. Likely #93717 or something like it would be done to prepare files, though for this we'd want to integrate into compiletest and/or libtest to get per-test results.

Generally my thinking is that each test would get an identifier of some kind and then we'd just upload files to the CI bucket for "ran these tests". After that something like toolstate - though maybe running externally rather than on master CI in this repo - would do a diff and file issues/alert/something if there's a negative delta.
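
A minimal sketch of that diff step, assuming each CI run uploads a plain newline-separated list of test identifiers (the file names and one-name-per-line format here are hypothetical, not an existing artifact):

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;

/// Read a newline-separated list of test identifiers, e.g. one fully
/// qualified test name per line (hypothetical format).
fn read_test_list(path: &str) -> io::Result<BTreeSet<String>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .map(str::to_owned)
        .collect())
}

fn main() -> io::Result<()> {
    // Hypothetical file names: the list from the previous master build and
    // the list produced by the current CI run.
    let before = read_test_list("tests-previous.txt")?;
    let after = read_test_list("tests-current.txt")?;

    // A "negative delta" is any test that ran before but not now.
    let missing: Vec<_> = before.difference(&after).collect();
    if missing.is_empty() {
        println!("no tests disappeared");
    } else {
        println!("{} test(s) ran previously but not in this run:", missing.len());
        for name in missing {
            println!("  {name}");
        }
        // A real tool would file an issue or alert here instead of just failing.
        std::process::exit(1);
    }
    Ok(())
}
```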

Shouldn't be too hard but a good bit of plumbing to do for sure.

A second aspect would be for the negative diff tool to have some asserts - e.g., I would like to know that tier 2 targets are actually getting built and distributed, tier 1 tested, etc.
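
One way to picture those asserts, as a sketch only (the target names and expectations below are illustrative, not an actual policy):

```rust
/// Illustrative per-target expectations; a real policy would live in a config
/// file maintained alongside the CI definitions.
struct Expectation {
    target: &'static str,
    must_build: bool,
    must_dist: bool,
    must_test: bool,
}

const EXPECTATIONS: &[Expectation] = &[
    // Tier 1: must be built, distributed, and tested.
    Expectation { target: "x86_64-unknown-linux-gnu", must_build: true, must_dist: true, must_test: true },
    // Tier 2 (example): must be built and distributed, testing not required.
    Expectation { target: "aarch64-unknown-linux-musl", must_build: true, must_dist: true, must_test: false },
];

/// `built`/`distributed`/`tested` would come from the results CI uploaded.
fn check(target: &str, built: bool, distributed: bool, tested: bool) -> Result<(), String> {
    let exp = EXPECTATIONS
        .iter()
        .find(|e| e.target == target)
        .ok_or_else(|| format!("no expectations recorded for {target}"))?;
    if exp.must_build && !built {
        return Err(format!("{target} was not built"));
    }
    if exp.must_dist && !distributed {
        return Err(format!("{target} was not distributed"));
    }
    if exp.must_test && !tested {
        return Err(format!("{target} was not tested"));
    }
    Ok(())
}

fn main() {
    // Example: CI reported that the tier 1 target built and shipped but never ran tests.
    if let Err(msg) = check("x86_64-unknown-linux-gnu", true, true, false) {
        eprintln!("assertion failed: {msg}");
    }
}
```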

A third aspect would be a dry run of some kind to give us a complete list of possible tests - ideally pretty "simplistic" (like `find`) to validate that we're running all tests at least once or so; we've had e.g. gdb version requirements that are never satisfied, or the missing rustdoc invocations you noted before. This is probably hard, and hopefully negative diffs help catch most of these - at least in the regression sense - but they don't help as much when adding tests.
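
For the dry-run side, a find-style sketch that just enumerates candidate test files (assuming test sources live under src/test and treating every .rs file as a candidate, which is a deliberate simplification):

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;
use std::path::Path;

/// Recursively collect every `.rs` file under `dir` as a rough stand-in for
/// "every test that could possibly run". Real discovery would also need to
/// understand compiletest suites, ignore directives, doctests, and so on.
fn collect_rs_files(dir: &Path, out: &mut BTreeSet<String>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_rs_files(&path, out)?;
        } else if path.extension().map_or(false, |ext| ext == "rs") {
            out.insert(path.display().to_string());
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut candidates = BTreeSet::new();
    // Assumed location of the test sources; adjust for the real tree layout.
    collect_rs_files(Path::new("src/test"), &mut candidates)?;
    println!("{} candidate test files found", candidates.len());
    // These candidates would then be diffed against the set of tests that CI
    // actually reported running, to spot suites that never execute.
    Ok(())
}
```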

jyn514 (Member, Author) commented Apr 13, 2022

Ooh, #93717 looks really cool :) I'll keep an eye on that.

Oh wow, per-test granularity! That would be awesome if it worked :D what would we compare it to? Is the idea to make sure that all CI runners are running the same tests? More tests are being added constantly so we couldn't use a golden reference or anything like that.

Mark-Simulacrum (Member):

(updated comment).

jyn514 (Member, Author) commented Apr 13, 2022

> Generally my thinking is that each test would get an identifier of some kind and then we'd just upload files to the CI bucket for "ran these tests". After that something like toolstate - though maybe running externally rather than on master CI in this repo - would do a diff and file issues/alert/something if there's a negative delta.

Got it, makes sense :) I am worried though about what happens if tests are renamed - git rename detection is pretty imprecise, and it's not terribly rare to rename tests (e.g. #91037).

Doctests are also likely to cause trouble, although we probably don't need to know each individual code block being run, just whether the test runner itself is built / run.

> A third aspect would be a dry run of some kind to give us a complete list of possible tests - ideally pretty "simplistic" (like `find`) to validate that we're running all tests at least once or so; we've had e.g. gdb version requirements that are never satisfied, or the missing rustdoc invocations you noted before.

Just to clarify - is this meant to be "all tests that we could possibly support"? Or "all tests that x.py currently supports"? We can get a pretty coarse idea of the latter today from the help paths (`x test --help -v`). Neither "all current tests" nor "all paths and crates" is a perfect subset of the other, unfortunately - e.g. things like html-checker and expand-yaml-anchors that use `cargo run` instead of `cargo test` will be pretty hard to detect automatically, I think. Maybe those are less important than the stuff in src/test, though?

An example of "possible tests not currently supported" is the `codegen_xxx` crates: #95518

Mark-Simulacrum (Member):

My hope is that libtest integration (probably an environment variable or so, and dumping JSON into a file) is enough to work for all of compiletest, rustdoc, and direct libtest usage - which would work out great for us in terms of ease of deploying it. (Those files would be consumed and repackaged by rustbuild to associate them with target and other metadata, probably.)
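
As an illustration of how such a dump could be consumed, here is a sketch that reads newline-delimited JSON test events and collects the names of tests that actually ran. The event shape is an assumption loosely modeled on libtest's unstable JSON output; the real schema and file path may differ:

```rust
use std::collections::BTreeSet;
use std::fs;

// Requires the `serde_json` crate.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path where a test runner dumped one JSON event per line.
    let raw = fs::read_to_string("test-events.jsonl")?;

    let mut ran = BTreeSet::new();
    for line in raw.lines() {
        // Skip anything that isn't a JSON object (headers, blank lines, ...).
        let Ok(event) = serde_json::from_str::<serde_json::Value>(line) else {
            continue;
        };
        // Assumed event shape: {"type": "test", "event": "ok" | "failed", "name": "..."}
        if event["type"] == "test" {
            if let (Some(kind), Some(name)) = (event["event"].as_str(), event["name"].as_str()) {
                if kind == "ok" || kind == "failed" {
                    ran.insert(name.to_owned());
                }
            }
        }
    }

    println!("{} tests ran", ran.len());
    for name in &ran {
        println!("{name}");
    }
    Ok(())
}
```

rustbuild could then tag each of these name sets with the target and job it came from before uploading.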

I'm not particularly worried about renames. If we get a false positive, it just means closing the issue - no big deal for an obvious PR. If the rate is high enough we can explore how to reduce it, but I suspect this is less common than you might think.

For the third aspect, I would say best effort. The more tests we pick up, the better, of course. Note that this is intended to be very granular - e.g., each ui test, not ui tests as a concept. (And maybe with added manual matrices of "compare-mode"...?)

My hope is that once the underlying infra is in place - even with the simplest of tests - it will be easy for contributors to help plug gaps, and going forward we can try to mandate PRs to plug gaps by linting against unknown tests that were run or so.

bjorn3 (Member) commented Apr 13, 2022

Maybe we could have a FUSE filesystem that checks whether all files are accessed at least once and, if not, prints out a list of those that weren't? That wouldn't help with doctests, but it may help with other kinds of tests, and with dead files in general.

jyn514 (Member, Author) commented Apr 13, 2022

@bjorn3 FUSE sounds fun, but it seems like strictly more work since we already have pretty intimate access to compiletest and libtest. Also, I think FUSE only works on Linux, and one of the goals here is to compare tests between different OSes.

jyn514 (Member, Author) commented Jun 5, 2023

@pietroalbini mentions that since #111936 we track individual test names in build-metrics, which should make this a lot easier :)
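
A hedged sketch of how one might pull per-test names back out of such a metrics file; the build/metrics.json path and the tests/name field names below are assumptions about the layout, not a documented schema:

```rust
use std::fs;

// Requires the `serde_json` crate.
// Walk the JSON tree and collect the `name` field of every object found in an
// array stored under a `tests` key (assumed layout of the metrics file).
fn collect_test_names(value: &serde_json::Value, out: &mut Vec<String>) {
    match value {
        serde_json::Value::Object(map) => {
            if let Some(serde_json::Value::Array(tests)) = map.get("tests") {
                for test in tests {
                    if let Some(name) = test.get("name").and_then(|n| n.as_str()) {
                        out.push(name.to_owned());
                    }
                }
            }
            for child in map.values() {
                collect_test_names(child, out);
            }
        }
        serde_json::Value::Array(items) => {
            for item in items {
                collect_test_names(item, out);
            }
        }
        _ => {}
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = fs::read_to_string("build/metrics.json")?; // hypothetical path
    let root: serde_json::Value = serde_json::from_str(&raw)?;
    let mut names = Vec::new();
    collect_test_names(&root, &mut names);
    println!("{} test names found in the metrics file", names.len());
    Ok(())
}
```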
