Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed tests leak CSI resources #260

Closed
timoreimann opened this issue Apr 17, 2020 · 13 comments · Fixed by #261
Closed

Failed tests leak CSI resources #260

timoreimann opened this issue Apr 17, 2020 · 13 comments · Fixed by #261
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.

Comments

@timoreimann
Copy link
Contributor

The vast majority (all?) of the tests in csi-test do not seem to clean up resources properly when an assertion failure occurs. I'm not too familiar with Gomega, but my understanding is that once an EXPECT fails, the test execution comes to a halt immediately, so none of the cleanup logic usually done by the end of a test (e.g., deleting volumes or snapshots) gets a chance to execute.

This is a problem because the created CSI resources leaking this way may affect other tests to the extent that they may now fail as well even though they otherwise wouldn't. Specifically, a test that does not create any fixtures would now have left-over objects from the failed tests and thus fail too, aggravating identification of why a test run fails.

It'd be good to come up with a means to reliably clean up resources regardless of what the outcome of a test (and any of its assertions) is.

@msau42
Copy link
Collaborator

msau42 commented Apr 17, 2020

/kind bug
/help

@k8s-ci-robot
Copy link
Contributor

@msau42:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/kind bug
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Apr 17, 2020
@timoreimann
Copy link
Contributor Author

@pohly you mentioned on Slack that there is some kind of cleanup taking place already. Looking forward to hearing about it.

@pohly
Copy link
Contributor

pohly commented Apr 20, 2020

That code is in https://github.com/kubernetes-csi/csi-test/blob/master/pkg/sanity/cleanup.go

Whenever you see a call to MaybeRegisterVolume you can be sure that the volume is going to be deleted even if the test code itself doesn't do it.

@pohly
Copy link
Contributor

pohly commented Apr 20, 2020

Have you encountered a specific test where the cleanup didn't work?

@timoreimann
Copy link
Contributor Author

@pohly I had experienced the problem when I was running ListSnapshots-related tests a few days ago. Oddly, I can't reproduce the issue anymore locally right now, although it seems as if it should be fairly easy to do so: I don't see a register functionality for snapshots, so presumably if snapshots aren't cleaned up, they should possibly affect other tests?
Example: test A creates a snapshot but fails its assertion so the cleanup step is skipped. If test B is executed afterwards, doesn't create any snapshots, and expects ListSnapshots to return an empty list, it would fail now because it still has the snapshot from test A stored.

What am I missing? Is this possibly as mock driver vs. hostpath driver issue?

@pohly
Copy link
Contributor

pohly commented Apr 22, 2020

I don't see a register functionality for snapshots, so presumably if snapshots aren't cleaned up, they should possibly affect other tests?

Yes. Looks like those tests were written without taking into account how cleanup is supposed to work. So this issue is valid, we just need to be more specific about which tests have to be updated.

@timoreimann
Copy link
Contributor Author

@pohly at a quick glance, it looks like all *Snapshot* tests (e.g., ListSnapshots, CreateSnapshot) are affected to the extent that create volumes are not registered and snapshots cannot be registered.

Other tests may be affected as well. I spotted the CreateAndControllerPublishVolume helper function that creates a volume but does not register it.

AFAICS, there are two TODOs:

  1. Ensure all existing code registers volumes properly.
  2. Create a cleanup manager for snapshots and use it.

Going further, I wonder if we should extend the (test) API so that test authors are less susceptible to missing resource registrations. For instance, we could provide a convenience function that creates a resource and registers it for cleanup in one go. This may also allow us to get rid of MaybeRegisterVolume because the convenience function would encapsulate the check for error being non-nil and act accordingly. What do you think?

Feel free to assign me to this one. I should have some bandwidth by the end of this week to work on whatever we determine is the most desirable approach, and update all volume/snapshot usages in the test code systematically.

@pohly
Copy link
Contributor

pohly commented Apr 22, 2020

Going further, I wonder if we should extend the (test) API so that test authors are less susceptible to missing resource registrations. For instance, we could provide a convenience function that creates a resource and registers it for cleanup in one go.

That sounds useful.

Feel free to assign me to this one. I should have some bandwidth by the end of this week to work on whatever we determine is the most desirable approach, and update all volume/snapshot usages in the test code systematically.

Thanks! I'm not sure whether I can assign it to you, but we'll see:

/assign @timoreimann

@NicolasT
Copy link
Contributor

Running into a similar issue: I have a CSI plugin that's far from complete/spec-compliant, so many tests like Ensure VolumeId is not provided are still failing.

Now, when such test fails, it appears /tmp/csi-mount gets created but not cleaned up. Any subsequent test which attempts to create the very same directory then fails (file exists).

@pohly
Copy link
Contributor

pohly commented Apr 29, 2020

@NicolasT: if the default implementation for directory creation/deletion doesn't work for you, then you can also use scripts. For example, the sanity command runs unprivileged and thus cannot clean up after a privileged CSI driver mounted something inside the directory. But a custom script could use "sudo" to clean up.

Also, in PMEM-CSI, we are using a create script for the directories which uses mktemp to create unique directories for each test: https://github.com/intel/pmem-csi/blob/dfdd0a0fb3a3e957ad91ebdd5d93f6d64caea77f/test/e2e/storage/sanity.go#L171-L177

That still leaves garbage behind after one test failure, but at least the others can still run.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 28, 2020
@timoreimann
Copy link
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants