Container storage leak when the podman process is SIGKILL #3906
Hi @TristanCacqueray, thanks for opening the issue. There's something odd with the version, since the podman version output claims something different than expected. Can you check if you used the correct binary?
Oops, it was a devel version in ~/.local/bin... Using the system podman resulted in the same behavior. Here is the new debug info:
If we finally have a consistent reproducer for this, that's a very good thing: removing storage happens before containers are evicted from the database, so figuring out why it's failing even though the container is no longer in the Libpod DB might let us fix this for good.
@mheon, I can share some of my cycles. Want me to have a look?
Here is a new sample log using the system podman:

$ python3 repro.py
Press ctrl-C if http server start...
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
^C
Keyboard interrupt received, exiting.
Error: no container with name or ID test-server found: no such container
Press ctrl-C if http server start...
Error: error creating container storage: the container name "test-server" is already in use by "1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841". You have to remove that container to be able to reuse that name.: that name is already in use
Press ctrl-C if http server start...
Error: error creating container storage: the container name "test-server" is already in use by "1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841". You have to remove that container to be able to reuse that name.: that name is already in use
Press ctrl-C if http server start...
^CError: no container with name or ID test-server found: no such container
^CTraceback (most recent call last):
File "repro.py", line 22, in <module>
time.sleep(1)
KeyboardInterrupt
$ podman container inspect 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841
Error: error looking up container "1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841": no container with name or ID 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841 found: no such container
$ podman container inspect test-server
Error: error looking up container "test-server": no container with name or ID test-server found: no such container
$ podman rm test-server
Error: no container with name or ID test-server found: no such container
$ podman rm 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841
Error: no container with name or ID 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841 found: no such container
$ podman rm 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841^C
$ podman run -it --rm --name test-server fedora
Error: error creating container storage: the container name "test-server" is already in use by "1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841". You have to remove that container to be able to reuse that name.: that name is already in use
$ podman rm -f 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841
Error: no container with name or ID 1c5629586bd866f80c05cc90d38b6e395c0114aac01c000446c52dfe3708d841 found: no such container
$ podman rm -f test-server
Error: no container with name or ID test-server found: no such container
$ podman rm -f --storage test-server
test-server
$ podman run -it --rm --name test-server fedora echo it is working
it is working
@mheon perhaps the
podman should treat /var/lib/containers/storage/overlay-containers/containers.json with ACID guarantees. SIGTERM or SIGKILL shouldn't corrupt containers.json. Having containers.lock is not enough to implement ACID.
I haven't seen any corrupted containers.json myself. @holser, if you have a corrupted containers.json…
I've identified something that may be part of the problem: Podman may not be unmounting the SHM mount before the removal. I'm adding some patches that might address it to #3931.
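If someone wants to check whether a leftover SHM mount is what they are hitting, a small read-only check along these lines may help. This is my own sketch, not anything from libpod; the /var/lib/containers/storage path and the userdata/shm suffix are assumptions based on c/storage's usual rootful layout.

```python
#!/usr/bin/env python3
# Hypothetical helper: list SHM mounts left behind under the assumed default
# rootful c/storage location by scanning /proc/self/mountinfo.
STORAGE_ROOT = "/var/lib/containers/storage/overlay-containers/"

def leaked_shm_mounts(mountinfo_path="/proc/self/mountinfo"):
    mounts = []
    with open(mountinfo_path) as f:
        for line in f:
            fields = line.split()
            mountpoint = fields[4]  # fifth field of mountinfo is the mount point
            if mountpoint.startswith(STORAGE_ROOT) and mountpoint.endswith("/userdata/shm"):
                mounts.append(mountpoint)
    return mounts

if __name__ == "__main__":
    for mnt in leaked_shm_mounts():
        print(mnt)
```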
* Update tripleo-heat-templates from branch 'master'
  - Merge "container-puppet: run podman rm with --storage"
  - container-puppet: run podman rm with --storage

    There is an ongoing bug in libpod where the removal of a container can randomly (probably under heavy load) leak and fail to finish the tasks, leaving leftovers in the podman database, which causes a container creation with the same name to fail.

    It is related to this issue: containers/podman#3906
    And originally reported here as well: https://bugzilla.redhat.com/show_bug.cgi?id=1747885

    This patch is a mitigation that has been validated by the Container Team where we use the --storage option to force the low-level removal of the container by using the containers/storage project to really remove all the data.

    Closes-Bug: #1840691
    Change-Id: I711f460bd51747c3985b95784d19560d1be2028a
I'm going to test this now that #3931 is merged and see if we can call it fixed.
Does not appear to be fixed. Looking deeper...
SHM is still mounted, but I'm not seeing any error messages out of Podman.
Alright, I think the reproducer here has introduced a weird race.

In this case, the cleanup process then arrives and attempts to remove the container, but it's already gone from the database; that's one of the first steps we take, to ensure that we don't leave unusable, partially-deleted containers around. As such, I don't know how well we can handle this.

Perhaps the simplest takeaway here is to not whack the podman process with SIGKILL while it is cleaning up the container.

I'll think about this more, but I don't know if there's an easy solution on our end.
(For reference - adding a
Please note that we cannot, in general, protect podman from a sudden SIGKILL. AFAIK a fenced controller would end up in the same situation, except via a forced power-off instead of SIGKILL. Or systemd might send SIGKILL when it times out shutting down some container via its service unit. So there needs to be some transaction replaying or something similar. Not sure, but Galera DB recovers from such crashes well, for example...
@mheon SIGKILL is a valid signal which can be used by the user. There is no way to intercept it, that's true, though there are many ways to handle it. A lot of software guarantees data consistency whatever happens (databases, file systems, clustering software); that's why I mentioned ACID. For instance, there could be a transaction log (journal) of podman operations. If podman is killed by SIGKILL, the transaction log won't be clean, which means extra steps should be taken to ensure that consistency is restored and nothing will blow up.
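To make the journaling idea concrete, here is a minimal sketch of the write-ahead pattern being described. It is purely illustrative — the JOURNAL path and record format are invented for the example, and nothing like this exists in podman today: log the intent and fsync before acting, and on the next start replay any entry that never reached the done state.

```python
import json
import os

JOURNAL = "/tmp/podman-demo-journal.log"  # illustrative path, not a real podman file

def journal_append(entry):
    # Append the intent and fsync before performing the operation, so a
    # SIGKILL between "started" and "done" leaves evidence to replay.
    with open(JOURNAL, "a") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())

def remove_container(name):
    journal_append({"op": "remove", "name": name, "state": "started"})
    # ... remove storage, evict from the DB, unmount SHM ...
    journal_append({"op": "remove", "name": name, "state": "done"})

def recover():
    # On startup, any "started" entry without a matching "done" entry
    # indicates an interrupted removal that must be finished or undone.
    started, done = set(), set()
    if os.path.exists(JOURNAL):
        with open(JOURNAL) as f:
            for line in f:
                e = json.loads(line)
                (started if e["state"] == "started" else done).add(e["name"])
    return started - done  # removals that need to be replayed
```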
Not sure if my issue is related to the SIGKILL issue here, but I'm encountering a case where the container doesn't seem to have made it to executing the command inside the container.
I'm assuming that if podman didn't start the container successfully, it won't be able to clean up the SHM mounts afterwards. In my case, the container was still in containers.json, so I had to remove it using the --storage workaround described above. Here are my podman info details.
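As an aside, a quick way to see whether a name is still pinned in c/storage even though libpod no longer knows about it is to read containers.json directly. This is only a sketch of mine, assuming the default rootful storage path and the usual id/names fields; rootless storage lives under ~/.local/share/containers/storage instead.

```python
import json

# Assumed default rootful path; adjust for rootless setups.
CONTAINERS_JSON = "/var/lib/containers/storage/overlay-containers/containers.json"

def find_by_name(name):
    # containers.json is a JSON array of c/storage container records.
    with open(CONTAINERS_JSON) as f:
        entries = json.load(f)
    return [c for c in entries if name in c.get("names", [])]

if __name__ == "__main__":
    for entry in find_by_name("test-server"):
        print(entry["id"])  # an ID found here can be removed with `podman rm --storage <id>`
```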
This issue had no activity for 30 days. In the absence of activity or the "do-not-close" label, the issue will be automatically closed within 7 days.
Oops. I thought the bot closed this. Adding do-not-close label.
When Libpod removes a container, there is the possibility that removal will not fully succeed. The most notable problems are storage issues, where the container cannot be removed from c/storage.

When this occurs, we were faced with a choice. We can keep the container in the state, appearing in `podman ps` and available for other API operations, but likely unable to do any of them as it's been partially removed. Or we can remove it very early and clean up after it's already gone. We have, until now, used the second approach.

The problem that arises is intermittent problems removing storage. We end up removing a container, failing to remove its storage, and ending up with a container permanently stuck in c/storage that we can't remove with the normal Podman CLI, can't use the name of, and generally can't interact with. A notable cause is when Podman is hit by a SIGKILL midway through removal, which can consistently cause `podman rm` to fail to remove storage.

We now add a new state for containers that are in the process of being removed, ContainerStateRemoving. We set this at the beginning of the removal process. It notifies Podman that the container cannot be used anymore, but preserves it in the DB until it is fully removed. This will allow Remove to be run on these containers again, which should successfully remove storage if it fails.

Fixes containers#3906

Signed-off-by: Matthew Heon <[email protected]>
Thanks @mheon !
Took us half a year, but it's finally done 😄
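Purely as an illustration of the approach in the commit message above (the real fix is Go code in libpod, not this), the "removing" state boils down to: mark the container as being removed in the DB first, tear everything down, and only delete the record at the end, so a failed removal can simply be retried instead of leaking the name.

```python
# Illustrative sketch only -- not libpod code, just the state idea described
# in the commit message above.
db = {}  # name -> {"state": ..., "storage": bool}

class StorageError(Exception):
    pass

def remove_container(name, remove_storage):
    ctr = db[name]
    # 1. Mark the container as Removing so nothing else can use it,
    #    but keep it in the DB so removal can be retried.
    ctr["state"] = "removing"
    # 2. Tear down storage; if this fails (e.g. the process is SIGKILLed or
    #    the unmount fails), the record survives with state "removing".
    remove_storage(ctr)
    ctr["storage"] = False
    # 3. Only evict the record from the DB once everything else is gone.
    del db[name]

# Usage: a failed removal leaves a "removing" record that a second call can
# finish, instead of leaking the name in storage.
db["test-server"] = {"state": "running", "storage": True}
def flaky_remove(ctr):
    raise StorageError("simulated c/storage failure")
try:
    remove_container("test-server", flaky_remove)
except StorageError:
    pass
assert db["test-server"]["state"] == "removing"
remove_container("test-server", lambda ctr: None)  # retry succeeds
assert "test-server" not in db
```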
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
When the podman process is SIGKILLed midway through container cleanup, the container storage leaks and it is not obvious how to clean it up.
Steps to reproduce the issue:
Run this script:
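The script itself was attached to the original report and is not reproduced in this thread; the sketch below is a reconstruction based on the sample output, so the image, the http.server command, and the exact timing are assumptions.

```python
#!/usr/bin/env python3
# Hypothetical reconstruction of repro.py: start a container, let Ctrl-C
# interrupt it, then SIGKILL the podman client while it is cleaning up,
# which is what appears to leak the container storage.
import signal
import subprocess
import time

while True:
    print("Press ctrl-C if http server start...")
    proc = subprocess.Popen(
        ["podman", "run", "--rm", "--name", "test-server",
         "fedora", "python3", "-m", "http.server"])
    try:
        while proc.poll() is None:
            time.sleep(1)
    except KeyboardInterrupt:
        # Ctrl-C already delivered SIGINT to the foreground process group,
        # so podman is now tearing the container down. Kill it mid-cleanup.
        proc.send_signal(signal.SIGKILL)
        proc.wait()
    # Expected to fail with "no such container" once the storage has leaked.
    subprocess.run(["podman", "rm", "-f", "test-server"])
    time.sleep(1)
```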
Sample output:
Describe the results you received:
Once this happens, to be able to re-use the test-server name, the container storage needs to be removed manually:

Describe the results you expected:

podman inspect should report that the container storage still exists instead of "no such container".

Additional information you deem important (e.g. issue happens only occasionally):
The reproducer is probably not correct as it SIGKILLs podman while it is cleaning up the container, but perhaps podman could remove the storage first?
Output of podman version:

Output of podman info --debug:

Package info (e.g. output of rpm -q podman or apt list podman):

Additional environment details (AWS, VirtualBox, physical, etc.):
The host is running in OpenStack