
kill cpcontainer: could not be stopped... sending SIGKILL... container state improper #16142

Closed
edsantiago opened this issue Oct 12, 2022 · 17 comments · Fixed by #16177 or #17126
Labels: flakes (Flakes from Continuous Integration) · locked - please file new issue/PR

Comments

@edsantiago
Member

[+0401s] not ok 127 podman cp file from host to container
....
# podman cp /tmp/podman_bats.rNSXlF/cp-test-file-host-to-ctr/hostfile0 cpcontainer:/IdoNotExist/
Error: "/IdoNotExist/" could not be found on container cpcontainer: no such file or directory
[ rc=125 (expected) ]
# podman kill cpcontainer
cpcontainer
# podman rm -t 0 -f cpcontainer
open pidfd: No such process
Error: cannot remove container <ID> as it could not be stopped: sending SIGKILL to container <ID>: container state improper
[ rc=2 (** EXPECTED 0 **) ]

[sys] 127 podman cp file from host to container

Only two instances so far, both on f36 aarch64 root.
@edsantiago added the flakes (Flakes from Continuous Integration) label Oct 12, 2022
@vrothberg
Member

        # # podman kill cpcontainer
        # cpcontainer
        # # podman rm -t 0 -f cpcontainer
        # open pidfd: No such process
        # Error: cannot remove container 1ade81d6e299e45fdb601a576ea555afcd674f4670b428dd62385cb54da77041 as it could not be stopped: sending SIGKILL to container 1ade81d6e299e45fdb601a576ea555afcd674f4670b428dd62385cb54da77041: container state improper
        # [ rc=2 (** EXPECTED 0 **) ]

That's interesting: the podman kill succeeded, yet the following podman rm failed.

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 13, 2022
To improve the error message reported in containers#16142 where the container is
reported to be in the wrong state but we do not know which.  This is not
a fix for containers#16142 but will hopefully aid in better understanding what's
going on if it flakes again.

[NO NEW TESTS NEEDED] as hitting the condition is inherently racy.

Signed-off-by: Valentin Rothberg <[email protected]>
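
For readers following along, here is a minimal Go sketch of the idea in that commit: append the container's actual state to the "container state improper" error so flake logs show, e.g., "container state improper: stopped". The types and names below are illustrative stand-ins, not the real libpod code:

```go
package main

import (
	"errors"
	"fmt"
)

// errCtrStateInvalid mirrors the generic "container state improper" error.
var errCtrStateInvalid = errors.New("container state improper")

type ctrState string

const (
	stateRunning ctrState = "running"
	stateStopped ctrState = "stopped"
)

type container struct {
	id    string
	state ctrState
}

// kill names the offending state in the error, so a flake log tells us
// which state confused the backend instead of leaving us guessing.
func (c *container) kill() error {
	if c.state != stateRunning {
		return fmt.Errorf("sending SIGKILL to container %s: %w: %s",
			c.id, errCtrStateInvalid, c.state)
	}
	return nil // deliver the signal...
}

func main() {
	c := &container{id: "1ade81d6e299", state: stateStopped}
	fmt.Println(c.kill())
	// sending SIGKILL to container 1ade81d6e299: container state improper: stopped
}
```
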
@vrothberg
Member

I opened #16151 to improve the error message. I really want to know which state the container is in.

@edsantiago
Member Author

You got lucky:

Error: cannot remove container [...]: container state improper: stopped

@vrothberg
Member

Wuaaaaaaa ... I expected everything but that O.O

@edsantiago
Member Author

Failed in nightly cron, f36 rootless. Again, the state is stopped. So far we've seen this on root and rootless, f36 amd64 and aarch64.

@vrothberg
Member

#16177 should fix the issue. I am surprised this didn't flake more.

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 14, 2022
Make sure to wait for the container to exit after kill. While the
cleanup process will eventually take care of transitioning the state,
we need to guarantee that the container is left in the expected state
once the (kill) command has finished.

The issue could be observed in a flaking test (containers#16142) where
`podman rm -f -t0` failed because the preceding `podman kill`
left the container in "running" state which ultimately confused
the "stop" backend.

Note that we should only wait for the container to exit when SIGKILL is
being used.  Other signals have different semantics.

[NO NEW TESTS NEEDED] as I do not know how to reliably reproduce the
issue.  If containers#16142 stops flaking, we are good.

Fixes: containers#16142
Signed-off-by: Valentin Rothberg <[email protected]>
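
A short sketch of the behavior the commit message describes, with hypothetical names and an invented 10-second timeout: send the signal, and only for SIGKILL poll until the container has exited before returning:

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

type container struct {
	id string
}

// hasExited would consult the runtime's state database; stubbed here.
func (c *container) hasExited() bool { return true }

// signal would deliver sig to the container's init process; stubbed here.
func (c *container) signal(sig syscall.Signal) error { return nil }

// kill sends sig and, for SIGKILL only, waits for the container to exit.
// Other signals may legitimately leave the container running (the workload
// might handle SIGHUP, say), so waiting on them could hang forever.
func (c *container) kill(sig syscall.Signal) error {
	if err := c.signal(sig); err != nil {
		return err
	}
	if sig != syscall.SIGKILL {
		return nil
	}
	deadline := time.Now().Add(10 * time.Second) // timeout invented for the sketch
	for !c.hasExited() {
		if time.Now().After(deadline) {
			return fmt.Errorf("container %s did not exit after SIGKILL", c.id)
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

func main() {
	c := &container{id: "cpcontainer"}
	fmt.Println(c.kill(syscall.SIGKILL)) // <nil>: safe to `rm -f -t0` afterwards
}
```
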
mheon pushed a commit to mheon/libpod that referenced this issue Oct 18, 2022

mheon pushed a commit to mheon/libpod that referenced this issue Oct 18, 2022
@edsantiago
Member Author

Reopening, sorry. Just saw this on #16275, which was based on 589ff20; that commit is very recent and includes #16177. f36-aarch64 remote root:

not ok 346 podman selinux: inspect multiple labels
...
# podman-remote  rm -t 0 -f myc
Error: cannot remove container <CID> as it could not be stopped: 
     sending SIGKILL to container <CID>: container state improper: stopped

@edsantiago reopened this Oct 26, 2022
@vrothberg
Member

> Reopening, sorry.

Monsieur, I love your reports and bookkeeping. Some flakes are more stubborn than others. The error message strongly suggests another fart/race in the kill code. I will take a look today.

@vrothberg
Member

#16320 should fix the issue. It's slightly different but effectively a similar scenario to the one before. When sending signals to the container, Podman releases the lock to prevent podman stop from blocking any other command trying to do something with the container. In the last failure above, the container had already exited, so it wasn't running anymore.
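
To illustrate the race being described, here is a hedged sketch with invented names: the lock is dropped while the signal is delivered, so on reacquiring it the code must re-check the state and treat "already stopped" as success rather than as an improper state:

```go
package main

import (
	"fmt"
	"sync"
)

type container struct {
	mu    sync.Mutex
	state string // "running", "stopped", ...
}

func (c *container) stop() error {
	c.mu.Lock()
	if c.state != "running" {
		c.mu.Unlock()
		return fmt.Errorf("container state improper: %s", c.state)
	}
	// Drop the lock so a slow kill does not block every other command
	// touching this container (ps, inspect, ...).
	c.mu.Unlock()
	// ...signal delivery happens here; meanwhile the cleanup process
	// (or the container exiting on its own) may have moved the state
	// to "stopped" already...
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state == "stopped" {
		return nil // already in the desired end state: not an error
	}
	c.state = "stopped"
	return nil
}

func main() {
	c := &container{state: "running"}
	fmt.Println(c.stop()) // <nil>
}
```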

@edsantiago
Member Author

That was quick! Thank you!

@edsantiago
Member Author

Another "container state improper" flake, but I'm not sure if it's the same cause. This time in int remote f37 root but it's in podman restart:

# podman-remote [options] start 755ed1a8ac74984410bd7817cec045e56038e06a2133ab9592e70bc9da623c62
755ed1a8ac74984410bd7817cec045e56038e06a2133ab9592e70bc9da623c62
# podman-remote [options] restart 755ed1a8ac74984410bd7817cec045e56038e06a2133ab9592e70bc9da623c62
open pidfd: No such process
Error: sending SIGKILL to container 755ed1a8ac74984410bd7817cec045e56038e06a2133ab9592e70bc9da623c62: container state improper: stopped

@vrothberg
Member

This one looks different to me. Would you create a new issue for it?

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

"container state improper" is one of the symptoms of the everything-hosed issue (#15367). Yesterday I assigned these three to that PR:

Is it possible that those are manifestations of this bug instead? If so, should I remove "cpcontainer" from the title and reassign those three?

@vrothberg
Member

Yes, it looks different to me.

@edsantiago
Member Author

Sorry, I'm having trouble interpreting that. Can you clarify whether I should

  1. assign those three failures to this issue (16142), or
  2. leave those three assigned to the everything-hosed issue (15367)?

@vrothberg
Member

Apologies, I think they look more like #15367, but things are getting a bit blurry. Some symptoms tend to occur simultaneously at times.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jan 16, 2023
The container lock is released before stopping/killing which implies
certain race conditions with, for instance, the cleanup process changing
the container state to stopped, exited or other states.

The (remaining) flakes seen in containers#16142 and containers#15367 strongly indicate a
race between stopping/killing a container and the cleanup
process.  To fix the flake, make sure to ignore invalid-state errors.
An alternative fix would be to change `KillContainer` to not return such
errors at all, but commit c77691f indicates an explicit desire to
have these errors reported in the sig proxy.

[NO NEW TESTS NEEDED] as it's a race already covered by the system
tests.

Fixes: containers#16142
Fixes: containers#15367
Signed-off-by: Valentin Rothberg <[email protected]>
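
The shape of that fix, sketched with illustrative names only: the stop path swallows invalid-state errors from the kill path, while the kill path itself keeps returning them (per commit c77691f, the sig proxy relies on seeing them):

```go
package main

import (
	"errors"
	"fmt"
)

var errCtrStateInvalid = errors.New("container state improper")

// killContainer keeps reporting invalid-state errors; here we pretend
// the cleanup process won the race and stopped the container already.
func killContainer(id string) error {
	return fmt.Errorf("sending SIGKILL to container %s: %w: stopped",
		id, errCtrStateInvalid)
}

// stopContainer ignores invalid-state errors: losing the race to the
// cleanup process means the container is already where stop wants it.
func stopContainer(id string) error {
	if err := killContainer(id); err != nil && !errors.Is(err, errCtrStateInvalid) {
		return err
	}
	return nil
}

func main() {
	fmt.Println(stopContainer("cpcontainer")) // <nil>
}
```
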
@github-actions bot added the locked - please file new issue/PR label Sep 4, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Sep 4, 2023