Proposed Testing Strategy #471
-
Personally, I think this discussion would benefit from us defining different kinds of tests. Off the top of my head, there are three important ones to cite here:
To me, our test suite currently consists mainly of unittests, i.e. tests that exercise a single unit and no interaction between units. I think we should rename the `tests` folder to `unittests` and move all non-unittests to an `integrationtests` folder. IMO, there are a couple of points I disagree with in your proposition:
I disagree with this list. I think there is value in simple tests that fail in a very simple setting (i.e. unittests). They give you very straightforward information on what failed, whereas finding a simple issue needs more debugging steps with all of the points on your list. Also, if you notice that a simple test fails, you don't need to look at the complex tests failing, whereas in your example you might get failing tests for all aspects while the actual issue is simple.
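To make the distinction concrete, here is a toy sketch (not xDSL code; the parse/print helpers are made up for illustration) of how a unittest and an integration test localize a failure differently:

```python
# Toy stand-ins for two separate units; not real xDSL code.
def parse_width(text: str) -> int:
    """Toy "parser": turns "i32" into 32."""
    return int(text.removeprefix("i"))


def print_width(width: int) -> str:
    """Toy "printer": turns 32 back into "i32"."""
    return f"i{width}"


def test_parse_width():
    # Unittest: only the parser is involved, so a failure points
    # straight at the parsing logic.
    assert parse_width("i32") == 32


def test_parse_print_roundtrip():
    # Integration test: parser and printer interact, so a failure only
    # says "the round trip broke", not which unit is at fault.
    assert print_width(parse_width("i32")) == "i32"
```

If the toy parser regresses, both tests fail, but only the unittest's failure already names the culprit.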
I strongly disagree with this point. IMO, we want our tests to execute a lot of (actually all of the) code to make sure we catch bugs introduced by changes/refactors. For example, we are currently rethinking/rewriting operation instantiation, moving away from `build` and towards […]. On the other hand, if you just define "enough" integration tests to cover all lines, I think you are just making it harder to find the actual bug when a test fails. The workflow then becomes taking the integration test, reducing it to a small failing test that exposes the issue, and then solving it, whereas with unittests the small failing test is already given to you: it is the actual failing test.
IMO, it is generally ill-advised to optimise a test suite for runtime as long as runtime does not become an extreme issue (which it certainly isn't in our case). I do agree, though, that we should consider removing a lot of the simple filecheck tests, as many of them are covered by pytests, apart from the MLIR interoperation tests. On the topic of dependencies, I still feel that we should introduce a testing dialect to remove unnecessary dependencies. Why should a memref test fail if there is an issue in the arith dialect? This once again makes the test suite and its error reports confusing and hard to understand, making hunting down bugs even more of a pain.
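As a rough illustration of the dependency argument (everything below is a made-up toy, not a proposal for the dialect's actual API): the operand fed into the code under test comes from a trivially correct test helper instead of another production component, so a bug in that component cannot fail this test.

```python
# Toy sketch of the "testing dialect" idea; all names are hypothetical.
def arith_style_constant() -> int:
    # Stand-in for producing the operand via another production dialect
    # (e.g. arith); a bug here would leak into unrelated tests.
    return 42


def test_value() -> int:
    # Stand-in for a dedicated test-dialect op: trivially correct and
    # with no dependency on other dialects.
    return 42


def double(x: int) -> int:
    # Stand-in for the (memref-style) code actually under test.
    return 2 * x


def test_double_with_neutral_operand():
    # Only `double` and the trivial helper are exercised, so a bug in
    # `arith_style_constant` cannot make this test fail.
    assert double(test_value()) == 84
```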
-
[deleted] Wanted to post as a reply to the initial post; it's posted there.
-
My approach to tests in general is that they are code that helps us find bugs. Like all code, it costs to run and maintain, so we should have a good reason to write it in the first place. A very good reason is that it will help prevent breaking guarantees we've previously given, whether that's an API or a behaviour change. This is especially useful when working on the same codebase as a team, as the test itself may be the closest thing we have to documentation of the behaviour. When I write tests, I try to formulate them in a way that either a collaborator or myself a month or two from now would trigger if any of the important behaviours were broken. Sometimes this is hard to predict. When in doubt, I usually leave out the test and try to add one when I find a bug: a test that would've prevented the regression. The best tests are the ones we don't have to write due to compiler support, which is where type checking is so useful, among other analyses.
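For instance (a made-up example, not from the xDSL codebase), a regression test added alongside a bug fix both documents the guaranteed behaviour and keeps the same break from reoccurring:

```python
# Toy example: the helper and the bug it guards against are hypothetical.
def normalize_name(name: str) -> str:
    # Fixed bug: an earlier version only stripped trailing whitespace.
    return name.strip().lower()


def test_normalize_name_strips_leading_whitespace():
    # Regression test: documents the guaranteed behaviour and fails
    # loudly if the old behaviour is ever reintroduced.
    assert normalize_name("  Foo") == "foo"
```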
-
Testing xDSL
I think that each test should either:
What I like to think about:
But we should keep in mind that it's okay for a failing test not to tell us immediately which exact line of code failed. We have error messages and debuggers to pinpoint problems. If every test only covered a handful of lines of code, we'd have thousands of tests, and I would argue that we wouldn't gain anything significant from that.
Tests I don't want to have:
Insubstantial tests:
A test that only uniquely covers ~3 LOC (meaning that taking that test away "uncovers" only ~3 LOC) is "insubstantial" in my eyes. For example, a test that covers the `get` method of a specific Operation might be insubstantial if the `get` method is just a trivial passing-on of some arguments to `build`.
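As a toy illustration (hypothetical classes, not the real xDSL `Operation` API), such a test might look like this:

```python
# Hypothetical operation class whose `get` only forwards to `build`.
from dataclasses import dataclass


@dataclass
class DummyOp:
    operands: list

    @classmethod
    def build(cls, operands: list) -> "DummyOp":
        return cls(operands)

    @classmethod
    def get(cls, *operands) -> "DummyOp":
        # Trivial forwarding: essentially nothing here can go wrong.
        return cls.build(list(operands))


def test_dummy_op_get():
    # "Insubstantial": the only lines this test uniquely covers are the
    # trivial forwarding lines in `get`.
    assert DummyOp.get(1, 2).operands == [1, 2]
```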
Why I don't like these kinds of tests:
If only that test fails, it tells me that I changed the way the `get` of that specific operation works. So I learn nothing new from this test failing. This also means that I can't break the test "by accident". I feel that this is a bad property, although I can't really state why.

Proposed solution:
My proposed solution for these tests is to collect them into bigger "usecase" tests that:
Tests with too much "accidental" coverage:
A test usually intends to cover some specific behavior of code.
For example, a recent version of my `matmul` test wanted to check that the operations are instantiated correctly.

Why I don't like these kinds of tests:
This test goes through the printing code, while it never intended to test the printing code. This increases test runtime and makes this test dependent on the `Print` module, which we never wanted to test in the first place! The printer has its own test suite, and there are enough tests that cover it! This goes against our "Cover a specific complex behavior" rule!
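As a toy illustration of the problem (made-up helpers, not the actual matmul test or the real `Print` module), compare a test that asserts on printed output with one that checks the constructed structure directly:

```python
# Toy stand-ins: `build_add` is the instantiation code under test,
# `print_op` is the printer we never intended to exercise.
def build_add(lhs: int, rhs: int) -> tuple:
    return ("add", lhs, rhs)


def print_op(op: tuple) -> str:
    return f"{op[0]}({op[1]}, {op[2]})"


def test_add_via_printer():
    # Intends to test instantiation, but asserts on the printed string,
    # so it accidentally depends on `print_op` as well.
    assert print_op(build_add(1, 2)) == "add(1, 2)"


def test_add_directly():
    # Same intent, but only the instantiation code is exercised.
    assert build_add(1, 2) == ("add", 1, 2)
```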
Proposed solution:
I think tests like that are really valuable, because they show a specific usecase of the dialect. If one is smart about the example, one can cover all of the different permutations of operations that are relevant and thereby cover lots of small `get` and `from_...` functions that would be insubstantial on their own.
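Sketched with toy constructors (purely hypothetical, not xDSL code), one such "usecase" test could exercise several trivial constructors in one realistic example:

```python
# Toy constructors that would each be insubstantial to test on their own.
def make_const(value: int) -> tuple:
    return ("const", value)


def make_add(lhs: tuple, rhs: tuple) -> tuple:
    return ("add", lhs, rhs)


def make_mul(lhs: tuple, rhs: tuple) -> tuple:
    return ("mul", lhs, rhs)


def test_expression_usecase():
    # One slightly larger, realistic expression covers all three
    # constructors together instead of one test per constructor.
    expr = make_mul(make_add(make_const(1), make_const(2)), make_const(3))
    assert expr == ("mul", ("add", ("const", 1), ("const", 2)), ("const", 3))
```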
Tests I want to have:
Complex Behavior Covering Tests:
These tests cover a specific complex thing.
These tests should not cover too much; we should try to find a subset of the complexity that is still sufficiently interesting, but small enough that the test failing tells you something about where the error occurred.
Examples:
Usecase Covering Tests:
These tests usually cover significant parts of a single module, but try to minimize dependencies on other code.
These tests should not uniquely cover a lot of interesting or complex code, meaning that for any part that is sufficiently complex, we should have a separate test.
Examples:
End-To-End Tests:
Our filechecks are great examples of End-To-End tests.
We want to test a specific usecase of our program from the perspective of the end user. We input information into our software in the same way an end user does, and we retrieve information in the same way an end user does.
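As a rough pytest-flavoured sketch of the same idea (the `xdsl-opt` invocation, the input syntax, and the expected output here are assumptions for illustration, not a description of the existing filecheck setup):

```python
# Drive the tool the way a user would: feed it a file on the command line
# and inspect what it prints. Tool behaviour here is an assumption.
import subprocess


def test_module_roundtrip_via_cli(tmp_path):
    source = tmp_path / "input.mlir"
    source.write_text("builtin.module {\n}\n")
    result = subprocess.run(
        ["xdsl-opt", str(source)],
        capture_output=True,
        text=True,
        check=True,
    )
    # The module should survive the parse/print round trip.
    assert "builtin.module" in result.stdout
```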