Proposed Testing Strategy #471
-
Personally, I think this discussion would benefit from us defining different kinds of tests. Off the top of my head, there are three important ones to cite here:
To me, our test suite currently consists mainly of unittests, i.e. tests that exercise a single unit and no interaction between units. I think we should rename the `tests` folder to `unittests` and move all non-unittests to an `integrationtests` folder. IMO, there are a couple of points I disagree with in your proposition:
I disagree with this list. I think there is value in simple tests that fail in a very simple setting (i.e. unittests). They give you very straightforward information on what failed, whereas finding a simple issue needs more debugging steps with all of the points on your list. Also, if you notice that a simple test fails, you don't need to look at the complex tests failing, whereas in your example you might get failing tests for all aspects while the actual issue is simple.
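To make the distinction concrete, here is a toy sketch (not xDSL code; the parse/print helpers are made up for illustration) of how a unittest and an integration test localize a failure differently:

```python
# Toy stand-ins for two separate units; not real xDSL code.
def parse_width(text: str) -> int:
    """Toy "parser": turns "i32" into 32."""
    return int(text.removeprefix("i"))


def print_width(width: int) -> str:
    """Toy "printer": turns 32 back into "i32"."""
    return f"i{width}"


def test_parse_width():
    # Unittest: only the parser is involved, so a failure points
    # straight at the parsing logic.
    assert parse_width("i32") == 32


def test_parse_print_roundtrip():
    # Integration test: parser and printer interact, so a failure only
    # says "the round trip broke", not which unit is at fault.
    assert print_width(parse_width("i32")) == "i32"
```

If the toy parser regresses, both tests fail, but only the unittest's failure already names the culprit.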
I strongly disagree with this point. IMO, we want our tests to execute a lot of (actually all of the) code to make sure we catch bugs introduced by changes/refactors. For example, we are currently rethinking/rewriting operation instantiation, moving away from `build` and towards […]. On the other hand, if you just define "enough" integration tests to cover all lines, I think you are just making it harder to find the actual bug when a test fails. The workflow then becomes taking the integration test, reducing it to a small failing test that exposes the issue, and then solving it, whereas with unittests the small failing test is already given to you: it is the actual failing test.
IMO, it is generally ill-advised to optimise a test suite for runtime as long as runtime does not become an extreme issue (which it certainly isn't in our case). I do agree, though, that we should consider removing a lot of the simple filecheck tests, as many of them are covered by pytests, apart from the MLIR interoperation tests. On the topic of dependencies, I still feel that we should introduce a testing dialect to remove unnecessary dependencies. Why should a memref test fail if there is an issue in the arith dialect? This once again makes the test suite and its error reports confusing and hard to understand, making hunting down bugs even more of a pain.
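As a rough illustration of the dependency argument (everything below is a made-up toy, not a proposal for the dialect's actual API): the operand fed into the code under test comes from a trivially correct test helper instead of another production component, so a bug in that component cannot fail this test.

```python
# Toy sketch of the "testing dialect" idea; all names are hypothetical.
def arith_style_constant() -> int:
    # Stand-in for producing the operand via another production dialect
    # (e.g. arith); a bug here would leak into unrelated tests.
    return 42


def test_value() -> int:
    # Stand-in for a dedicated test-dialect op: trivially correct and
    # with no dependency on other dialects.
    return 42


def double(x: int) -> int:
    # Stand-in for the (memref-style) code actually under test.
    return 2 * x


def test_double_with_neutral_operand():
    # Only `double` and the trivial helper are exercised, so a bug in
    # `arith_style_constant` cannot make this test fail.
    assert double(test_value()) == 84
```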
-
[deleted] Wanted to post as a reply to the initial post; it's posted there.
-
My approach to tests in general is that they are code that helps us find bugs. Like all code, it costs to run and maintain, so we should have a good reason to write it in the first place. A very good reason is that it will help prevent breaking guarantees we've previously given, whether that's an API or a behaviour change. This is especially useful when working on the same codebase as a team, as the test itself may be the closest thing we have to documentation of the behaviour. When I write tests, I try to formulate them in a way that either a collaborator or myself a month or two from now would trigger if any of the important behaviours were broken. Sometimes this is hard to predict. When in doubt, I usually leave out the test and try to add one when I find a bug: a test that would've prevented the regression. The best tests are the ones we don't have to write due to compiler support, which is where type checking is so useful, among other analyses.
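For instance (a made-up example, not from the xDSL codebase), a regression test added alongside a bug fix both documents the guaranteed behaviour and keeps the same break from reoccurring:

```python
# Toy example: the helper and the bug it guards against are hypothetical.
def normalize_name(name: str) -> str:
    # Fixed bug: an earlier version only stripped trailing whitespace.
    return name.strip().lower()


def test_normalize_name_strips_leading_whitespace():
    # Regression test: documents the guaranteed behaviour and fails
    # loudly if the old behaviour is ever reintroduced.
    assert normalize_name("  Foo") == "foo"
```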
-
Testing xDSL
I think that each test should either:
What I like to think about:
But we should keep in mind that it's okay for a failing test not to tell us immediately which exact line of code failed. We have error messages and debuggers to pinpoint problems. If every test only covered a handful of lines of code, we'd have thousands of tests, and I would argue that we wouldn't gain anything significant from that.
Tests I don't want to have:
Insubstantial tests:
A test that only uniquely covers ~3 LOC (meaning that taking that test away "uncovers" only ~3 LOC) is "insubstantial" in my eyes. For example, a test that covers the `get` method of a specific Operation might be insubstantial if the `get` method is just a trivial passing-on of some arguments to `build`.
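As a toy illustration (hypothetical classes, not the real xDSL `Operation` API), such a test might look like this:

```python
# Hypothetical operation class whose `get` only forwards to `build`.
from dataclasses import dataclass


@dataclass
class DummyOp:
    operands: list

    @classmethod
    def build(cls, operands: list) -> "DummyOp":
        return cls(operands)

    @classmethod
    def get(cls, *operands) -> "DummyOp":
        # Trivial forwarding: essentially nothing here can go wrong.
        return cls.build(list(operands))


def test_dummy_op_get():
    # "Insubstantial": the only lines this test uniquely covers are the
    # trivial forwarding lines in `get`.
    assert DummyOp.get(1, 2).operands == [1, 2]
```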
Why I don't like these kinds of tests:
If only that test fails, it tells me that I changed the way the `get` of that specific operation works. So I learn nothing new from this test failing. This also means that I can't break the test "by accident". I feel that this is a bad property, although I can't really state why.

Proposed solution:
My proposed solution for these tests is to collect them into bigger "usecase" tests that:
Tests with too much "accidental" coverage:
A test usually intends to cover some specific behavior of code.
For example, a recent version of my `matmul` test wanted to check that the operations are instantiated correctly.

Why I don't like these kinds of tests:
This test goes through the printing code, while it never intended to test the printing code. This increases test runtime and makes this test dependent on the `Print` module, which we never wanted to test in the first place! The printer has its own test suite, and there are enough tests that cover it! This goes against our "Cover a specific complex behavior" rule!
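As a toy illustration of the problem (made-up helpers, not the actual matmul test or the real `Print` module), compare a test that asserts on printed output with one that checks the constructed structure directly:

```python
# Toy stand-ins: `build_add` is the instantiation code under test,
# `print_op` is the printer we never intended to exercise.
def build_add(lhs: int, rhs: int) -> tuple:
    return ("add", lhs, rhs)


def print_op(op: tuple) -> str:
    return f"{op[0]}({op[1]}, {op[2]})"


def test_add_via_printer():
    # Intends to test instantiation, but asserts on the printed string,
    # so it accidentally depends on `print_op` as well.
    assert print_op(build_add(1, 2)) == "add(1, 2)"


def test_add_directly():
    # Same intent, but only the instantiation code is exercised.
    assert build_add(1, 2) == ("add", 1, 2)
```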
Proposed solution:
I think tests like that are really valuable, because they show a specific usecase of the dialect. If one is smart about the example, one can cover all of the different permutations of operations that are relevant and thereby cover lots of small `get` and `from_...` functions that would be insubstantial on their own.
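Sketched with toy constructors (purely hypothetical, not xDSL code), one such "usecase" test could exercise several trivial constructors in one realistic example:

```python
# Toy constructors that would each be insubstantial to test on their own.
def make_const(value: int) -> tuple:
    return ("const", value)


def make_add(lhs: tuple, rhs: tuple) -> tuple:
    return ("add", lhs, rhs)


def make_mul(lhs: tuple, rhs: tuple) -> tuple:
    return ("mul", lhs, rhs)


def test_expression_usecase():
    # One slightly larger, realistic expression covers all three
    # constructors together instead of one test per constructor.
    expr = make_mul(make_add(make_const(1), make_const(2)), make_const(3))
    assert expr == ("mul", ("add", ("const", 1), ("const", 2)), ("const", 3))
```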
Tests I want to have:
Complex Behavior Covering Tests:
These tests cover a specific complex thing.
These tests should not cover too much; we should try to find a subset of the complexity that is still sufficiently interesting, but small enough that the test failing tells you something about where the error occurred.
Examples:
Usecase Covering Tests:
These tests usually cover significant parts of a single module, but try to minimize dependencies on other code.
These tests should not uniquely cover a lot of interesting or complex code, meaning that for any part that is sufficiently complex, we should have a separate test.
Examples:
End-To-End Tests:
Our filechecks are great examples of End-To-End tests.
We want to test a specific usecase of our program from the perspective of the end user. We input information into our software in the same way an end user does, and we retrieve information in the same way an end user does.
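As a rough pytest-flavoured sketch of the same idea (the `xdsl-opt` invocation, the input syntax, and the expected output here are assumptions for illustration, not a description of the existing filecheck setup):

```python
# Drive the tool the way a user would: feed it a file on the command line
# and inspect what it prints. Tool behaviour here is an assumption.
import subprocess


def test_module_roundtrip_via_cli(tmp_path):
    source = tmp_path / "input.mlir"
    source.write_text("builtin.module {\n}\n")
    result = subprocess.run(
        ["xdsl-opt", str(source)],
        capture_output=True,
        text=True,
        check=True,
    )
    # The module should survive the parse/print round trip.
    assert "builtin.module" in result.stdout
```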