Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ContainerStateRemoving #4493

Merged
merged 2 commits into from
Dec 2, 2019

Conversation

mheon
Copy link
Member

@mheon mheon commented Nov 11, 2019

When Libpod removes a container, there is the possibility that removal will not fully succeed. The most notable problems are storage issues, where the container cannot be removed from c/storage.

When this occurs, we were faced with a choice. We can keep the container in the state, appearing in podman ps and available for other API operations, but likely unable to do any of them as it's been partially removed. Or we can remove it very early and clean up after it's already gone. We have, until now, used the second approach.

The problem that arises is intermittent problems removing storage. We end up removing a container, failing to remove its storage, and ending up with a container permanently stuck in c/storage that we can't remove with the normal Podman CLI, can't use the name of, and generally can't interact with. A notable cause is when Podman is hit by a SIGKILL midway through removal, which can consistently cause podman rm to fail to remove storage.

We now add a new state for containers that are in the process of being removed, ContainerStateRemoving. We set this at the beginning of the removal process. It notifies Podman that the container cannot be used anymore, but preserves it in the DB until it is fully removed. This will allow Remove to be run on these containers again, which should successfully remove storage if it fails.

Fixes #3906

@mheon
Copy link
Member Author

mheon commented Nov 11, 2019

WIP: Needs tests, needs integration with ps, and there's probably locking issues I need to solve around a dependency race (container A is being removed, container B is being added with A as a dependency, A deallocates lock, B grabs lock, B is added to state, A tries to remove but fails as a dependency has been added; subsequent attempts to remove A or B will cause deadlocks as both have the same lock ID)

@rhatdan
Copy link
Member

rhatdan commented Nov 11, 2019

LGTM

@mheon
Copy link
Member Author

mheon commented Nov 11, 2019

Fixed the race by moving lock removal to after the container is removed from the database. Could potentially result in leaking locks is we get a SIGKILL precisely between the database removal and attempting to free the lock, but does solve issues where we could try to double-free a lock if RemoveContainer is run twice on the same container, once partially successful, the second time running to the end.

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L labels Nov 11, 2019
@mheon
Copy link
Member Author

mheon commented Nov 11, 2019

Good news: this seems to work.

Bad news: it potentially introduces 4 new database writes during removal to ensure atomicity of removal operations. It doesn't seem to have a significant effect on removal time on my laptop, but four disk writes can't be free.

@mheon mheon changed the title [WIP] Add ContainerStateRemoving Add ContainerStateRemoving Nov 11, 2019
When Libpod removes a container, there is the possibility that
removal will not fully succeed. The most notable problems are
storage issues, where the container cannot be removed from
c/storage.

When this occurs, we were faced with a choice. We can keep the
container in the state, appearing in `podman ps` and available for
other API operations, but likely unable to do any of them as it's
been partially removed. Or we can remove it very early and clean
up after it's already gone. We have, until now, used the second
approach.

The problem that arises is intermittent problems removing
storage. We end up removing a container, failing to remove its
storage, and ending up with a container permanently stuck in
c/storage that we can't remove with the normal Podman CLI, can't
use the name of, and generally can't interact with. A notable
cause is when Podman is hit by a SIGKILL midway through removal,
which can consistently cause `podman rm` to fail to remove
storage.

We now add a new state for containers that are in the process of
being removed, ContainerStateRemoving. We set this at the
beginning of the removal process. It notifies Podman that the
container cannot be used anymore, but preserves it in the DB
until it is fully removed. This will allow Remove to be run on
these containers again, which should successfully remove storage
if it fails.

Fixes containers#3906

Signed-off-by: Matthew Heon <[email protected]>
If the container is running and we need to get its netns and
can't, that is a serious bug deserving of errors.

If it's not running, that's not really a big deal. Log an error
and continue.

Signed-off-by: Matthew Heon <[email protected]>
@mheon
Copy link
Member Author

mheon commented Nov 22, 2019

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM but tests are missing.

@TomSweeneyRedHat
Copy link
Member

LGTM too, added tests are always welcomed.

@baude
Copy link
Member

baude commented Dec 2, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 2, 2019
@openshift-merge-robot openshift-merge-robot merged commit e4275b3 into containers:master Dec 2, 2019
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Container storage leak when the podman process is SIGKILL
7 participants