Failed tests leak CSI resources #260

timoreimann · 2020-04-17T16:10:00Z

The vast majority (all?) of the tests in csi-test do not seem to clean up resources properly when an assertion failure occurs. I'm not too familiar with Gomega, but my understanding is that once an EXPECT fails, the test execution comes to a halt immediately, so none of the cleanup logic usually done by the end of a test (e.g., deleting volumes or snapshots) gets a chance to execute.

This is a problem because the created CSI resources leaking this way may affect other tests to the extent that they may now fail as well even though they otherwise wouldn't. Specifically, a test that does not create any fixtures would now have left-over objects from the failed tests and thus fail too, aggravating identification of why a test run fails.

It'd be good to come up with a means to reliably clean up resources regardless of what the outcome of a test (and any of its assertions) is.

The text was updated successfully, but these errors were encountered:

msau42 · 2020-04-17T16:11:07Z

/kind bug
/help

k8s-ci-robot · 2020-04-17T16:11:08Z

@msau42:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/kind bug
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

timoreimann · 2020-04-20T05:46:58Z

@pohly you mentioned on Slack that there is some kind of cleanup taking place already. Looking forward to hearing about it.

pohly · 2020-04-20T08:57:25Z

That code is in https://github.com/kubernetes-csi/csi-test/blob/master/pkg/sanity/cleanup.go

Whenever you see a call to MaybeRegisterVolume you can be sure that the volume is going to be deleted even if the test code itself doesn't do it.

pohly · 2020-04-20T08:58:19Z

Have you encountered a specific test where the cleanup didn't work?

timoreimann · 2020-04-22T09:04:08Z

@pohly I had experienced the problem when I was running ListSnapshots-related tests a few days ago. Oddly, I can't reproduce the issue anymore locally right now, although it seems as if it should be fairly easy to do so: I don't see a register functionality for snapshots, so presumably if snapshots aren't cleaned up, they should possibly affect other tests?
Example: test A creates a snapshot but fails its assertion so the cleanup step is skipped. If test B is executed afterwards, doesn't create any snapshots, and expects ListSnapshots to return an empty list, it would fail now because it still has the snapshot from test A stored.

What am I missing? Is this possibly as mock driver vs. hostpath driver issue?

pohly · 2020-04-22T09:30:02Z

I don't see a register functionality for snapshots, so presumably if snapshots aren't cleaned up, they should possibly affect other tests?

Yes. Looks like those tests were written without taking into account how cleanup is supposed to work. So this issue is valid, we just need to be more specific about which tests have to be updated.

timoreimann · 2020-04-22T10:07:56Z

@pohly at a quick glance, it looks like all *Snapshot* tests (e.g., ListSnapshots, CreateSnapshot) are affected to the extent that create volumes are not registered and snapshots cannot be registered.

Other tests may be affected as well. I spotted the CreateAndControllerPublishVolume helper function that creates a volume but does not register it.

AFAICS, there are two TODOs:

Ensure all existing code registers volumes properly.
Create a cleanup manager for snapshots and use it.

Going further, I wonder if we should extend the (test) API so that test authors are less susceptible to missing resource registrations. For instance, we could provide a convenience function that creates a resource and registers it for cleanup in one go. This may also allow us to get rid of MaybeRegisterVolume because the convenience function would encapsulate the check for error being non-nil and act accordingly. What do you think?

Feel free to assign me to this one. I should have some bandwidth by the end of this week to work on whatever we determine is the most desirable approach, and update all volume/snapshot usages in the test code systematically.

pohly · 2020-04-22T10:23:58Z

Going further, I wonder if we should extend the (test) API so that test authors are less susceptible to missing resource registrations. For instance, we could provide a convenience function that creates a resource and registers it for cleanup in one go.

That sounds useful.

Feel free to assign me to this one. I should have some bandwidth by the end of this week to work on whatever we determine is the most desirable approach, and update all volume/snapshot usages in the test code systematically.

Thanks! I'm not sure whether I can assign it to you, but we'll see:

/assign @timoreimann

NicolasT · 2020-04-28T15:19:19Z

Running into a similar issue: I have a CSI plugin that's far from complete/spec-compliant, so many tests like Ensure VolumeId is not provided are still failing.

Now, when such test fails, it appears /tmp/csi-mount gets created but not cleaned up. Any subsequent test which attempts to create the very same directory then fails (file exists).

pohly · 2020-04-29T06:25:20Z

@NicolasT: if the default implementation for directory creation/deletion doesn't work for you, then you can also use scripts. For example, the sanity command runs unprivileged and thus cannot clean up after a privileged CSI driver mounted something inside the directory. But a custom script could use "sudo" to clean up.

Also, in PMEM-CSI, we are using a create script for the directories which uses mktemp to create unique directories for each test: https://github.com/intel/pmem-csi/blob/dfdd0a0fb3a3e957ad91ebdd5d93f6d64caea77f/test/e2e/storage/sanity.go#L171-L177

That still leaves garbage behind after one test failure, but at least the others can still run.

fejta-bot · 2020-07-28T06:54:47Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

timoreimann · 2020-07-29T20:18:05Z

/remove-lifecycle stale

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Apr 17, 2020

pohly mentioned this issue Apr 22, 2020

Improve ListSnapshot filtering tests #259

Merged

k8s-ci-robot assigned timoreimann Apr 22, 2020

timoreimann mentioned this issue Apr 27, 2020

Centralize cleanup of created resources #261

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 28, 2020

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2020

k8s-ci-robot closed this as completed in #261 Aug 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failed tests leak CSI resources #260

Failed tests leak CSI resources #260

timoreimann commented Apr 17, 2020

msau42 commented Apr 17, 2020

k8s-ci-robot commented Apr 17, 2020

timoreimann commented Apr 20, 2020

pohly commented Apr 20, 2020

pohly commented Apr 20, 2020

timoreimann commented Apr 22, 2020

pohly commented Apr 22, 2020

timoreimann commented Apr 22, 2020

pohly commented Apr 22, 2020

NicolasT commented Apr 28, 2020

pohly commented Apr 29, 2020

fejta-bot commented Jul 28, 2020

timoreimann commented Jul 29, 2020

Failed tests leak CSI resources #260

Failed tests leak CSI resources #260

Comments

timoreimann commented Apr 17, 2020

msau42 commented Apr 17, 2020

k8s-ci-robot commented Apr 17, 2020

timoreimann commented Apr 20, 2020

pohly commented Apr 20, 2020

pohly commented Apr 20, 2020

timoreimann commented Apr 22, 2020

pohly commented Apr 22, 2020

timoreimann commented Apr 22, 2020

pohly commented Apr 22, 2020

NicolasT commented Apr 28, 2020

pohly commented Apr 29, 2020

fejta-bot commented Jul 28, 2020

timoreimann commented Jul 29, 2020