podman-remote rmi: device or resource busy #3870
Thanks Ed. Thinking about this more, I'm not seeing those particular failures anywhere else. So there must be something special about PR #3754. That could simply be new/different VM images - including images with embedded bugs. Note: the two failures are on F29 and F30 (not Ubuntu) images:
I'll try your reproducer (above) using those images vs what we have on master...
@edsantiago good news: I was able to reproduce the error using your script above and the F30 VM image from my PR. Trying the VM image from master (expecting no error) now...

...as expected, no error using the VM image from master (fedora-30-libpod-5751722641719296). Excellent, so now (hopefully) it's a matter of comparing differences. Since the libpod code is common to both VMs and they're running in the same cloud environment, there must be a configuration or packaging difference, or something like that.
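One low-tech way to hunt for such a packaging difference, sketched here with made-up filenames (both images are Fedora, so rpm is available):

```bash
# On each VM, dump a sorted list of installed packages:
rpm -qa | sort > /tmp/pkgs-working.txt    # on the master-image VM
rpm -qa | sort > /tmp/pkgs-breaking.txt   # on the PR-image VM

# Copy both lists to one host, then compare:
diff /tmp/pkgs-working.txt /tmp/pkgs-breaking.txt
```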
Same on both VMs:
hrmmmmm.
Ran the reproducer again; it worked fine for a while, failing after maybe 10 passes. So there's definitely some race condition here that's somehow more likely with the images in my PR vs. from master, even with the exact same libpod code 😖 Maybe I should try taking the podman and podman-remote binaries from the breaking VM over to the working VM to see if the behavior moves/changes? Not sure what else could be involved here, @mheon have any ideas?
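A minimal sketch of that kind of retry loop, assuming Ed's reproducer is saved as ./repro.sh (a hypothetical path):

```bash
#!/usr/bin/env bash
# Re-run the reproducer until it fails; here the race surfaced after
# roughly 10 clean passes. ./repro.sh stands in for the actual script.
set -u
count=0
while ./repro.sh; do
    count=$((count + 1))
    echo "pass $count OK"
done
echo "failed after $count clean passes" >&2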
confirmed: it's not the binaries, behavior remains the same after copying from breaking -> working VM. Noticed this difference though: the breaking VM is running
JJF**($!!343$hg80*&#//**** 😠 updated both VMs to

This could definitely cause a race.
@mheon Is my assumption correct? Does podman container cleanup take care of the --rm as well?
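If that theory holds, the suspected sequence looks roughly like this (a sketch with illustrative names; the exact cleanup invocation conmon uses is an assumption on my part):

```bash
# Client returns as soon as the container's main process exits:
podman-remote run --rm build_test cat /some_file

# ...meanwhile conmon, in a separate process, is still running
# something like:
#     podman container cleanup --rm <container-id>
# to unmount storage and honor --rm. If we reach the next command
# first, a container still holds the image:
podman-remote rmi -f build_test    # -> "device or resource busy"
```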
Thanks @rhatdan. If this is the case, would it be possible to have podman-remote wait for that cleanup to finish before exiting? And if that's not possible, I'd like to request a clear documentation update stating something like:
Yes, I agree it should be cleaned up, but I am not sure if my theory is correct.
Well @edsantiago is seeing what looks like a race when using the varlink connection.
Small correction: I've never seen it; only @cevich has seen it.
@mheon wrong answer. The correct answer is: Oh, I saw this problem yesterday and just merged a fix 😀

@mheon remember this is podman-remote -> podman. Perhaps podman-remote is exiting before podman/conmon finish their job? With Ed's reproducer script, this very reliably reproduces for me (using hack/get_ci_vm.sh on the images mentioned). Would running in debug mode help at all? How can we prove/disprove Dan's theory? Or maybe there's a way we can tell the audit subsystem to log process exits (with timestamps)?
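For the audit idea, something along these lines might work (a sketch; it assumes root, a running auditd, and an x86_64 host, and the key name is arbitrary):

```bash
# Log every exit_group syscall under a searchable key:
auditctl -a always,exit -F arch=b64 -S exit_group -k podman-race

# ...run the reproducer, then read back timestamped exit records:
ausearch -k podman-race -i | grep -E 'podman|conmon'

# Delete the rule when finished:
auditctl -d always,exit -F arch=b64 -S exit_group -k podman-race
```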
@mheon I managed to catch this in the act by instrumenting the system test that reproduces it:

```diff
--- a/test/system/070-build.bats
+++ b/test/system/070-build.bats
@@ -26,9 +26,11 @@ EOF
     PODMAN_TIMEOUT=240 run_podman build -t build_test --format=docker $tmpdir
     is "$output" ".*STEP 4: COMMIT" "COMMIT seen in log"
+    run_podman ps -a
     run_podman run --rm build_test cat /$rand_filename
     is "$output" "$rand_content" "reading generated file in image"
+    run_podman ps -a
     run_podman rmi -f build_test
 }
```
You can see what the test is doing in the diff. When it reproduces, it happens on that last rmi -f line:

So clearly a race here is possible, but perhaps (somehow) more likely when podman-remote is being used? Is it impossible for the
Have we ever seen this without remote being involved? This should not be possible for local Podman.
@mheon As best as I can recall, I have not seen it. Even with podman-remote it doesn't always happen. Is it possible for podman-remote to special-case this, and verify removal happens before exiting (assuming there was no networking error)?
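Until then, a caller could guard against the race by hand; a hypothetical sketch (container and image names are illustrative):

```bash
# After run --rm returns, poll until the container is really gone
# before touching the image:
podman-remote run --rm --name rmtest build_test cat /some_file
for _ in $(seq 1 50); do
    [ -z "$(podman-remote ps -aq --filter name=rmtest)" ] && break
    sleep 0.1
done
podman-remote rmi -f build_test
```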
Looking at the difference between pkg/adapter/containers.go and pkg/adapter/containers_remote.go: the local path waits for the container to be fully removed before returning. And containers_remote.go does not. I would guess this is the issue.
Oh. That sounds like it.

Oh that's a HUGE relief, thanks @rhatdan!
Closing as #3934 is merged |
@cevich is seeing frequent system-test errors [1] with podman-remote. This is (I think) Ubuntu. I cannot reproduce on my Fedora laptop, even running the varlink server with nice 19, and even with cores disabled to emulate a slow system. My attempted reproducer is a simplified variant of the build system test:

My only guess is that this might be a race condition with podman-remote run --rm: does anyone know if it is guaranteed to block?

master @ 1ad8fe5. I am sorry for not being able to reproduce it, but Chris seems to run into it constantly.
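For reference, roughly how the server side was run while trying to reproduce (a sketch; podman 1.x-era varlink, with what I recall as the usual default socket URI, and --timeout 0 to disable the idle exit):

```bash
# Start the varlink server deliberately de-prioritized:
nice -n 19 podman varlink --timeout 0 unix:/run/podman/io.podman
```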