
kill cpcontainer: could not be stopped... sending SIGKILL... container state improper #16142

Closed
edsantiago opened this issue Oct 12, 2022 · 17 comments · Fixed by #16177 or #17126
Labels: flakes (Flakes from Continuous Integration) · locked - please file new issue/PR

Comments

@edsantiago
Member

[+0401s] not ok 127 podman cp file from host to container
....
# podman cp /tmp/podman_bats.rNSXlF/cp-test-file-host-to-ctr/hostfile0 cpcontainer:/IdoNotExist/
Error: "/IdoNotExist/" could not be found on container cpcontainer: no such file or directory
[ rc=125 (expected) ]
# podman kill cpcontainer
cpcontainer
# podman rm -t 0 -f cpcontainer
open pidfd: No such process
Error: cannot remove container <ID> as it could not be stopped: sending SIGKILL to container <ID>: container state improper
[ rc=2 (** EXPECTED 0 **) ]

[sys] 127 podman cp file from host to container

Only two instances so far, both on f36 aarch64 root.
@edsantiago added the flakes (Flakes from Continuous Integration) label Oct 12, 2022
@vrothberg
Member

        # # podman kill cpcontainer
        # cpcontainer
        # # podman rm -t 0 -f cpcontainer
        # open pidfd: No such process
        # Error: cannot remove container 1ade81d6e299e45fdb601a576ea555afcd674f4670b428dd62385cb54da77041 as it could not be stopped: sending SIGKILL to container 1ade81d6e299e45fdb601a576ea555afcd674f4670b428dd62385cb54da77041: container state improper
        # [ rc=2 (** EXPECTED 0 **) ]

That's interesting: the podman kill succeeded, yet the following podman rm failed.

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 13, 2022
To improve the error message reported in containers#16142 where the container is
reported to be in the wrong state but we do not know which.  This is not
a fix for containers#16142 but will hopefully aid in better understanding what's
going on if it flakes again.

[NO NEW TESTS NEEDED] as hitting the condition is inherently racy.

Signed-off-by: Valentin Rothberg <[email protected]>
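
For readers following along, here is a minimal Go sketch of the idea in that commit: append the container's actual state to the "container state improper" error so flake logs show, e.g., "container state improper: stopped". The types and names below are illustrative stand-ins, not the real libpod code:

```go
package main

import (
	"errors"
	"fmt"
)

// errCtrStateInvalid mirrors the generic "container state improper" error.
var errCtrStateInvalid = errors.New("container state improper")

type ctrState string

const (
	stateRunning ctrState = "running"
	stateStopped ctrState = "stopped"
)

type container struct {
	id    string
	state ctrState
}

// kill names the offending state in the error, so a flake log tells us
// which state confused the backend instead of leaving us guessing.
func (c *container) kill() error {
	if c.state != stateRunning {
		return fmt.Errorf("sending SIGKILL to container %s: %w: %s",
			c.id, errCtrStateInvalid, c.state)
	}
	return nil // deliver the signal...
}

func main() {
	c := &container{id: "1ade81d6e299", state: stateStopped}
	fmt.Println(c.kill())
	// sending SIGKILL to container 1ade81d6e299: container state improper: stopped
}
```
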
@vrothberg
Member

I opened #16151 to improve the error message. I really want to know which state the container is in.

@edsantiago
Member Author

You got lucky:

Error: cannot remove container [...]: container state improper: stopped

@vrothberg
Member

Wuaaaaaaa ... I expected everything but that O.O

@edsantiago
Member Author

Failed in nightly cron, f36 rootless. Again, the state is stopped. So far we've seen this on root and rootless, f36 amd64 and aarch64.

@vrothberg
Member

#16177 should fix the issue. I am surprised this didn't flake more.

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 14, 2022
Make sure to wait for the container to exit after kill. While the
cleanup process will eventually take care of transitioning the state,
we need to guarantee that the container is left in the expected state
once the (kill) command has finished.

The issue could be observed in a flaking test (containers#16142) where
`podman rm -f -t0` failed because the preceding `podman kill`
left the container in "running" state which ultimately confused
the "stop" backend.

Note that we should only wait for the container to exit when SIGKILL is
being used.  Other signals have different semantics.

[NO NEW TESTS NEEDED] as I do not know how to reliably reproduce the
issue.  If containers#16142 stops flaking, we are good.

Fixes: containers#16142
Signed-off-by: Valentin Rothberg <[email protected]>
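
A short sketch of the behavior the commit message describes, with hypothetical names and an invented 10-second timeout: send the signal, and only for SIGKILL poll until the container has exited before returning:

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

type container struct {
	id string
}

// hasExited would consult the runtime's state database; stubbed here.
func (c *container) hasExited() bool { return true }

// signal would deliver sig to the container's init process; stubbed here.
func (c *container) signal(sig syscall.Signal) error { return nil }

// kill sends sig and, for SIGKILL only, waits for the container to exit.
// Other signals may legitimately leave the container running (the workload
// might handle SIGHUP, say), so waiting on them could hang forever.
func (c *container) kill(sig syscall.Signal) error {
	if err := c.signal(sig); err != nil {
		return err
	}
	if sig != syscall.SIGKILL {
		return nil
	}
	deadline := time.Now().Add(10 * time.Second) // timeout invented for the sketch
	for !c.hasExited() {
		if time.Now().After(deadline) {
			return fmt.Errorf("container %s did not exit after SIGKILL", c.id)
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

func main() {
	c := &container{id: "cpcontainer"}
	fmt.Println(c.kill(syscall.SIGKILL)) // <nil>: safe to `rm -f -t0` afterwards
}
```
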
mheon pushed a commit to mheon/libpod that referenced this issue Oct 18, 2022

mheon pushed a commit to mheon/libpod that referenced this issue Oct 18, 2022
@edsantiago
Member Author

Reopening, sorry. Just saw this on #16275, which was based on 589ff20; that commit is very recent and includes #16177. f36-aarch64 remote root:

not ok 346 podman selinux: inspect multiple labels
...
# podman-remote  rm -t 0 -f myc
Error: cannot remove container <CID> as it could not be stopped: 
     sending SIGKILL to container <CID>: container state improper: stopped

@edsantiago reopened this Oct 26, 2022
@vrothberg
Member

> Reopening, sorry.

Monsieur, I love your reports and bookkeeping. Some flakes are more stubborn than others. The error message strongly suggests another fart/race in the kill code. I will take a look today.

@vrothberg
Member

#16320 should fix the issue. It's slightly different but effectively a similar scenario to the one before. When sending signals to the container, Podman releases the lock to prevent podman stop from blocking any other command trying to do something with the container. In the last failure above, the container had already exited, so it wasn't running anymore.
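
To illustrate the race being described, here is a hedged sketch with invented names: the lock is dropped while the signal is delivered, so on reacquiring it the code must re-check the state and treat "already stopped" as success rather than as an improper state:

```go
package main

import (
	"fmt"
	"sync"
)

type container struct {
	mu    sync.Mutex
	state string // "running", "stopped", ...
}

func (c *container) stop() error {
	c.mu.Lock()
	if c.state != "running" {
		c.mu.Unlock()
		return fmt.Errorf("container state improper: %s", c.state)
	}
	// Drop the lock so a slow kill does not block every other command
	// touching this container (ps, inspect, ...).
	c.mu.Unlock()
	// ...signal delivery happens here; meanwhile the cleanup process
	// (or the container exiting on its own) may have moved the state
	// to "stopped" already...
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state == "stopped" {
		return nil // already in the desired end state: not an error
	}
	c.state = "stopped"
	return nil
}

func main() {
	c := &container{state: "running"}
	fmt.Println(c.stop()) // <nil>
}
```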

@edsantiago
Member Author

That was quick! Thank you!

@edsantiago
Member Author

Another "container state improper" flake, but I'm not sure if it's the same cause. This time in int remote f37 root but it's in podman restart:

# podman-remote [options] start 755ed1a8ac74984410bd7817cec045e56038e06a2133ab9592e70bc9da623c62
755ed1a8ac74984410bd7817cec045e56038e06a2133ab9592e70bc9da623c62
# podman-remote [options] restart 755ed1a8ac74984410bd7817cec045e56038e06a2133ab9592e70bc9da623c62
open pidfd: No such process
Error: sending SIGKILL to container 755ed1a8ac74984410bd7817cec045e56038e06a2133ab9592e70bc9da623c62: container state improper: stopped

@vrothberg
Member

This one looks different to me. Would you create a new issue for it?

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

"container state improper" is one of the symptoms of the everything-hosed issue (#15367). Yesterday I assigned these three to that PR:

Is it possible that those are manifestations of this bug instead? If so, should I remove "cpcontainer" from the title and reassign those three?

@vrothberg
Member

Yes, it looks different to me.

@edsantiago
Member Author

Sorry, I'm having trouble interpreting that. Can you clarify whether I should

  1. assign those three failures to this issue (16142), or
  2. leave those three assigned to the everything-hosed issue (15367)?

@vrothberg
Member

Apologies, I think they look more like #15367, but things are getting a bit blurry. Some symptoms tend to occur simultaneously at times.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jan 16, 2023
The container lock is released before stopping/killing which implies
certain race conditions with, for instance, the cleanup process changing
the container state to stopped, exited or other states.

The (remaining) flakes seen in containers#16142 and containers#15367 strongly indicate a
race between stopping/killing a container and the cleanup
process.  To fix the flake, make sure to ignore invalid-state errors.
An alternative fix would be to change `KillContainer` to not return such
errors at all, but commit c77691f indicates an explicit desire to
have these errors reported in the sig proxy.

[NO NEW TESTS NEEDED] as it's a race already covered by the system
tests.

Fixes: containers#16142
Fixes: containers#15367
Signed-off-by: Valentin Rothberg <[email protected]>
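
The shape of that fix, sketched with illustrative names only: the stop path swallows invalid-state errors from the kill path, while the kill path itself keeps returning them (per commit c77691f, the sig proxy relies on seeing them):

```go
package main

import (
	"errors"
	"fmt"
)

var errCtrStateInvalid = errors.New("container state improper")

// killContainer keeps reporting invalid-state errors; here we pretend
// the cleanup process won the race and stopped the container already.
func killContainer(id string) error {
	return fmt.Errorf("sending SIGKILL to container %s: %w: stopped",
		id, errCtrStateInvalid)
}

// stopContainer ignores invalid-state errors: losing the race to the
// cleanup process means the container is already where stop wants it.
func stopContainer(id string) error {
	if err := killContainer(id); err != nil && !errors.Is(err, errCtrStateInvalid) {
		return err
	}
	return nil
}

func main() {
	fmt.Println(stopContainer("cpcontainer")) // <nil>
}
```
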
@github-actions bot added the locked - please file new issue/PR label Sep 4, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Sep 4, 2023