--userns=keep-id storage-chown-by-maps kills machine with large images #16541
More details on the commands being run (particularly the |
Looking at scrollback when I had the hang, the 4 podman commands that were running are below (I edited out some paths and IDs). I am not sure why there are identical pairs… Looking at paths, both
|
Oh, apparently that's just how it works. Running a mere |
FTR: running the same |
I'm experiencing the same hanging behaviour when I run a Tomcat container. I know the container is doing a LOT of processing on startup (indexing 14TB of JSON files into a Postgres DB) and I expect it to un-hang in the future (possibly days from now), as I have run the same container with smaller datasets and it releases in time (maybe minutes or hours). While this is happening I can get nothing from podman ps, podman top, podman logs and probably others. They all hang.
iotop on the host OS does show disk IO for the container in question, so it's doing something
|
Podman should not hang because a container is processing IO. A Podman hang is almost certainly because something is holding a lock; we don't hold locks during steady-state, only during certain setup operations. |
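The locking point above can be illustrated with a toy model. This is purely an illustration in Python, not podman's actual code: one operation holding a shared lock through a long setup step (say, a giant chown) makes an otherwise-instant query wait out the entire operation.

```python
import threading
import time

container_lock = threading.Lock()  # stand-in for a per-container lock

def setup_container():
    # Setup operations (e.g. storage-chown-by-maps on a huge image)
    # hold the lock for their entire duration.
    with container_lock:
        time.sleep(0.5)  # pretend this is minutes of chown work

def status_query(waits):
    # Commands like `podman ps` only need the lock for an instant...
    start = time.time()
    with container_lock:
        pass
    # ...but they still block until setup releases it.
    waits.append(time.time() - start)

waits = []
setup = threading.Thread(target=setup_container)
query = threading.Thread(target=status_query, args=(waits,))
setup.start()
time.sleep(0.1)          # the query arrives mid-setup
query.start()
setup.join()
query.join()
print(f"status query blocked for {waits[0]:.2f}s")
```

This matches the behaviour reported in the thread: `podman ps`, `podman top` and even `podman version` appear frozen while one long setup operation holds the lock.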
Are the containers running (as in, do you see Tomcat in |
On closer inspection @mheon I cannot actually see a Tomcat process running yet. Is there anything else I can do to help debug the situation? From the host:
Obviously I can't run podman top as it's still hanging from hours ago:
|
How big is the volume ( |
The hang I got right now is ridiculous. There is only one command running:
(as we found out earlier, podman commands duplicate themselves in the process list, so this is just one command). And it somehow managed to hang 😄 🤷‍♂️ Running |
Oh, I don't know if that points to something, but after I Ctrl+C'd the command above (I had run it manually), some seconds later I got this message:
|
It hung again as part of CI, and sending a |
Got a stacktrace from the process that hung first and now doesn't react to
|
It's a 13TB volume with at least 50 million small JSON and text files and maybe 10 million larger binary files. I let the process run all weekend but it was still hung this morning, so I Ctrl+C'd it. I followed your advice and replaced the |
I'm going to go ahead and close this then. @Hi-Angel can open a fresh issue for their hang (which seems like a deadlock, not the slow chown). I'll also look into |
@mheon err. What? 😄 Why? I don't get it, why do you want me to create a duplicate? Do you want me to copy comments over there as well? Err, I am so confused. |
Because it won't be a duplicate. A new bug requires that you provide the full bug template (including full |
I did that, didn't I? I provided all that info when I created this report.
Right. So… why did you close an unrelated bug report?? 😂😂😂 I mean, don't get me wrong, I'm glad that @Surfrdan's issues are getting resolved, but they never even created a bug report. This one is about something similar to their problem, but, as you noted, not really the same one.
So, based on my replies above, why wouldn't it? |
Perhaps you confused the author? @Surfrdan wasn't the author here, I was |
Yep, sorry for the runaround. Need more coffee. Reopening. @Surfrdan Mind opening an RFE issue for additional logging around |
No problem, happens with the best of us ;) |
Sorry for the confusion guys. I genuinely thought we had the same issue at the outset. I'll take mine to a new ticket as requested. |
Have a similar problem #16062 (podman 4.3.1, kernel 6.0.8, FCOS 37.20221121.20.0) |
I'm having the same issue. These work:
I've recreated the machine a few times. It works for a few minutes, though the behaviour is strange as described above. I'm on macOS Monterey 12.6
Can you SSH into the machine with |
I did attempt to ssh into the machine at one point, when podman was in the "hanging" state. I was able to ssh in; however, no shell prompt appeared.
I can comment on that. I tried making that the default and described my experience in that comment some time ago. The quote below summarizes the experience:
|
Have similar problems with |
Just came across this, |
Here is a similar report. I first reported it at #16062 (comment) but maybe it belongs here (too). In #16062 (comment) this issue is considered a locking bug inside podman. We seem to have problems regarding a deadlock in the XFS file system, triggered by podman; I am wondering whether both have the same underlying cause. The issue occurs when starting a container for RabbitMQ with --user rabbitmq --userns=keep-id, on newer kernels, when the native overlayfs is used and not fuse-overlayfs. One thing that is sub-optimal is that RabbitMQ needs its home directory mounted in the container (/var/lib/rabbitmq), but this is also where Podman stores all the container files; so effectively the container files are mounted into the container. The kernel warns about processes being stuck; here is one of them:
After this, more and more processes get blocked as they try to access the file system. We are currently working around it by forcing the use of fuse-overlayfs instead of the native one. The version of Podman we have is fairly old (3.4.2), but Ubuntu doesn't seem to have packaged a newer version for Ubuntu 20.04, so we could not try whether this fix [updating to the latest 4.x version] also works for us. |
Ubuntu always has outdated packages, even on the latest release (excluding perhaps browsers). The only way to get the latest version of whatever software you're interested in is either using a PPA or packaging it yourself. For Ubuntu there's a PPA mentioned in the Podman installation instructions |
I encountered issues with this on low-end hardware in the past, but now it has manifested in a different and more severe manner that showcases the potential of the problem. TL;DR: I managed to bork my SSD with this issue.
I had a rootless Podman container running through a systemd service within a pod using the "--userns nomap" flag. About seven days ago, podman-auto-update.service updated the container. However, something went awry during this process, causing the Podman/container service to enter an endless loop of storage-chown-by-maps operations (see log). Consequently, my SSD suffered a massive data write of ~23TB over the past week. Repeatedly restarting the service only resulted in the same problematic state. However, when I manually stopped the service and ran the container directly, it successfully completed the storage-chown-by-maps operation, ultimately resolving the issue.
container-in_app.service.log (unfortunately only the last few hours)
systemd units:
# pod-invoiceninja.service
# autogenerated by Podman 4.3.1
# Sat Dec 10 16:10:57 CET 2022
[Unit]
Description=Podman pod-invoiceninja.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=/run/user/1000/containers
Wants=container-in_app.service container-in_db.service container-in_web.service
Before=container-in_app.service container-in_db.service container-in_web.service
[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm \
-f %t/pod-invoiceninja.pid %t/pod-invoiceninja.pod-id
ExecStartPre=/usr/bin/podman pod create \
--infra-conmon-pidfile %t/pod-invoiceninja.pid \
--pod-id-file %t/pod-invoiceninja.pod-id \
--exit-policy=stop \
--replace \
--name invoiceninja \
--infra-name in-infra \
--network proxy,invoiceninja \
--userns nomap
ExecStart=/usr/bin/podman pod start \
--pod-id-file %t/pod-invoiceninja.pod-id
ExecStop=/usr/bin/podman pod stop \
--ignore \
--pod-id-file %t/pod-invoiceninja.pod-id \
-t 10
ExecStopPost=/usr/bin/podman pod rm \
--ignore \
-f \
--pod-id-file %t/pod-invoiceninja.pod-id
PIDFile=%t/pod-invoiceninja.pid
Type=forking
[Install]
WantedBy=default.target
# container-in_app.service
# autogenerated by Podman 4.3.1
# Sat Dec 10 16:10:57 CET 2022
[Unit]
Description=Podman container-in_app.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers
BindsTo=pod-invoiceninja.service
After=pod-invoiceninja.service
[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=always
TimeoutStopSec=70
ExecStartPre=/bin/rm \
-f %t/%n.ctr-id
ExecStart=/usr/bin/podman run \
--cidfile=%t/%n.ctr-id \
--cgroups=no-conmon \
--rm \
--pod-id-file %t/pod-invoiceninja.pod-id \
--sdnotify=conmon \
-d \
--replace \
-t \
--name=in_app \
--env-file /var/home/puser/podman/invoiceninja/env \
-v /var/home/puser/podman/invoiceninja/docker/app/public:/var/www/app/public:z,rw \
-v /var/home/puser/podman/invoiceninja/docker/app/storage:/var/www/app/storage:Z,rw \
--label io.containers.autoupdate=image docker.io/invoiceninja/invoiceninja:5
ExecStop=/usr/bin/podman stop \
--ignore -t 10 \
--cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm \
-f \
--ignore -t 10 \
--cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all
[Install]
WantedBy=default.target
# container-in_db.service
# autogenerated by Podman 4.3.1
# Sat Dec 10 16:10:57 CET 2022
[Unit]
Description=Podman container-in_db.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers
BindsTo=pod-invoiceninja.service
After=pod-invoiceninja.service
[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm \
-f %t/%n.ctr-id
ExecStart=/usr/bin/podman run \
--cidfile=%t/%n.ctr-id \
--cgroups=no-conmon \
--rm \
--pod-id-file %t/pod-invoiceninja.pod-id \
--sdnotify=conmon \
-d \
--replace \
-t \
--name=in_db \
-v /var/home/puser/podman/invoiceninja/docker/mysql/data:/var/lib/mysql:Z,rw \
--label io.containers.autoupdate=image docker.io/mariadb:10.4
ExecStop=/usr/bin/podman stop \
--ignore -t 10 \
--cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm \
-f \
--ignore -t 10 \
--cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all
[Install]
WantedBy=default.target
# container-in_web.service
# autogenerated by Podman 4.3.1
# Sat Dec 10 16:10:57 CET 2022
[Unit]
Description=Podman container-in_web.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers
BindsTo=pod-invoiceninja.service
After=pod-invoiceninja.service
[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm \
-f %t/%n.ctr-id
ExecStart=/usr/bin/podman run \
--cidfile=%t/%n.ctr-id \
--cgroups=no-conmon \
--rm \
--pod-id-file %t/pod-invoiceninja.pod-id \
--sdnotify=conmon \
-d \
--replace \
-t \
--name=in_web \
-v /var/home/puser/podman/invoiceninja/docker/app/public:/var/www/app/public:z,ro \
--label io.containers.autoupdate=image docker.io/nginx:alpine
ExecStop=/usr/bin/podman stop \
--ignore -t 10 \
--cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm \
-f \
--ignore -t 10 \
--cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all
[Install]
WantedBy=default.target
podman info:
host:
arch: amd64
buildahVersion: 1.30.0
cgroupControllers:
- cpuset
- cpu
- io
- memory
- pids
cgroupManager: systemd
cgroupVersion: v2
conmon:
package: conmon-2.1.7-2.fc38.x86_64
path: /usr/bin/conmon
version: 'conmon version 2.1.7, commit: '
cpuUtilization:
idlePercent: 97.47
systemPercent: 1.89
userPercent: 0.64
cpus: 4
databaseBackend: boltdb
distribution:
distribution: fedora
variant: iot
version: "38"
eventLogger: journald
idMappings:
gidmap:
- container_id: 0
host_id: 1000
size: 1
- container_id: 1
host_id: 100000
size: 65536
uidmap:
- container_id: 0
host_id: 1000
size: 1
- container_id: 1
host_id: 100000
size: 65536
kernel: 6.2.15-300.fc38.x86_64
linkmode: dynamic
logDriver: journald
memFree: 679092224
memTotal: 8059166720
networkBackend: netavark
ociRuntime:
name: crun
package: crun-1.8.4-1.fc38.x86_64
path: /usr/bin/crun
version: |-
crun version 1.8.4
commit: 5a8fa99a5e41facba2eda4af12fa26313918805b
rundir: /run/user/1000/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
os: linux
remoteSocket:
exists: true
path: /run/user/1000/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: true
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: true
serviceIsRemote: false
slirp4netns:
executable: /usr/bin/slirp4netns
package: slirp4netns-1.2.0-12.fc38.x86_64
version: |-
slirp4netns version 1.2.0
commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
libslirp: 4.7.0
SLIRP_CONFIG_VERSION_MAX: 4
libseccomp: 2.5.3
swapFree: 8049127424
swapTotal: 8058302464
uptime: 3h 16m 14.00s (Approximately 0.12 days)
plugins:
authorization: null
log:
- k8s-file
- none
- passthrough
- journald
network:
- bridge
- macvlan
- ipvlan
volume:
- local
registries:
search:
- registry.fedoraproject.org
- registry.access.redhat.com
- docker.io
- quay.io
store:
configFile: /var/home/puser/.config/containers/storage.conf
containerStore:
number: 17
paused: 0
running: 8
stopped: 9
graphDriverName: overlay
graphOptions: {}
graphRoot: /home/puser/.local/share/containers/storage
graphRootAllocated: 170259980288
graphRootUsed: 67244826624
graphStatus:
Backing Filesystem: extfs
Native Overlay Diff: "true"
Supports d_type: "true"
Using metacopy: "false"
imageCopyTmpDir: /var/tmp
imageStore:
number: 73
runRoot: /run/user/1000/containers
transientStore: false
volumePath: /home/puser/.local/share/containers/storage/volumes
version:
APIVersion: 4.5.0
Built: 1681486942
BuiltTime: Fri Apr 14 17:42:22 2023
GitCommit: ""
GoVersion: go1.20.2
Os: linux
OsArch: linux/amd64
Version: 4.5.0
It might be beneficial to warn users in the documentation that these (userns/idmap) options can be problematic when using the (default) native overlay storage driver. |
I came across this issue when I started testing out podman and wanted to use the keep-id option. Here was my initial result of running a fresh ubi9 container with
Running with
For comparison, here is the result of using a combination of
Now, the first run only took 14 seconds, and the container appears to be the same from the user namespace perspective. One difference is that this approach does not copy in the host user's group name, but that seems to be doable with the
After having run the image once with the "long" user namespace setup, I could now use the
Now the startup time was under 1 second and only one additional overlay was created, and still no overlay image owned by the host user:
However, the slowdown does reappear if creating a new container with the default user namespace, which leads to the long chown copying, so there is still an issue if you need the same image with two different user namespaces:
I would be interested to know whether this is a viable workaround for |
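For reference, the three-range mapping that keep-id builds (and that the manual --uidmap flags discussed above approximate) can be sketched like this. This is an illustrative reconstruction of the scheme, not podman's code; ranges are expressed relative to rootless podman's intermediate namespace, where the user's own uid is 0 and the subordinate ids start at 1, and the exact sizes depend on your subuid allocation.

```python
def keep_id_uidmap(uid, size=65536):
    """Sketch of a keep-id style mapping: container ids below and above
    `uid` come from the subordinate range, while container id `uid` is
    backed by the user's real uid (id 0 in the intermediate namespace)."""
    assert 0 < uid < size
    return [
        (0, 1, uid),                         # container 0..uid-1   <- subuids
        (uid, 0, 1),                         # container uid        <- real user
        (uid + 1, uid + 1, size - uid - 1),  # container uid+1..    <- subuids
    ]

print(keep_id_uidmap(1000))
```

The three triples cover the whole 0..size-1 container range contiguously, with the single fixed point at the user's own uid.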
Linux 5.12 introduced id mapped mounts. If that is available, that could also avoid the costly chown? The latter would have to be kept as a fallback for a few years for older kernels. |
Idmapped file systems are not allowed in rootless mode yet. Podman supports them in rootful mode. |
I ran into this issue when trying to set up a ZFS zvol specifically for Podman to get better performance than with fuse-overlayfs. Since I can easily swap filesystems and test mount options, I was able to try out a few different configurations. Unfortunately the version of ZFS I am using at the moment cannot be used with overlayfs, but as of ZFS 2.2, which hopefully will be backported to my Linux distribution soon, overlayfs is supported -- so I'll re-run my tests then.
With an ext4 volume for Podman storage, I saw the most abysmal performance with default options. My tests were with the Home Assistant image. With default options, starting a pre-pulled image was extremely slow. I tested disabling the ext4 journal as well as other options; the only option I found that made any meaningful difference was a single mount option.
Then I moved on to testing btrfs. btrfs performed the best. Starting the Home Assistant container with a pre-pulled image on btrfs took 11 seconds (and 13 seconds on a subsequent test). btrfs also had the best deduplication performance.
Finally, I tested XFS. XFS performance was in between ext4 and btrfs. I did not see any deadlock like reported by other users, but I'm not entirely sure XFS was using reflinks with Podman. My first test with XFS started a pre-pulled Home Assistant container in 18 seconds, but a subsequent test took nearly twice as long.
I'm hopeful that ZFS 2.2 will perform similarly to btrfs, but it's possible that I'll see similar issues to those of XFS, given the interaction with overlayfs. I understand that this is a complicated problem, and that fuse-overlayfs can be used as a workaround. But the lockup of podman is a bit of an issue, and it'd be nice if there were some way to speed up the process when it is known ahead of time that an expensive chown will be needed. It sounds like rootless idmapped file systems are also a potential solution -- is there somewhere to follow along on that progress? Would that be a kernel change, or a Podman change? |
@mhoran I guess you are running rootless? In this case, idmapped mounts won't help, since the kernel doesn't support it. At the moment, with rootless, the only way to have a dynamic ID mapping at runtime is to use fuse-overlayfs for the storage. It is a FUSE file system, so it is inherently slower than native overlay, but it works quite well if you use the image only for reading files and use volumes for I/O-intensive operations. Idmapped mounts should work fine for root; note that support is file-system dependent, but btrfs, XFS and ext4 should work fine with it. I am not sure about ZFS. |
How do I even pass something like |
podman compose is just going to execute docker-compose which will communicate with the podman service, which would require the change. If you want keep-id as the default then you can change your containers.conf to set that as default. |
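For completeness, making keep-id the default in ~/.config/containers/containers.conf would look roughly like this (a sketch of the suggestion above; check containers.conf(5) for the values your podman version accepts):

```toml
[containers]
# Default --userns mode for rootless containers; can still be
# overridden per `podman run` with the --userns flag.
userns = "keep-id"
```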
I just ran into this problem as well. I can confirm that native overlay causes HUGE impacts to container build time when paired with --userns keep-id. Here's what we tried and found:
Problem
We had two Gentoo systems on ZFS, same OS (from the same tarball), but one ran into this issue (A) and the other didn't (B). We determined that (A) was using native overlay but (B) was using fuse-overlayfs; B had pre-existing containers from 2024-09-23. We compared with a third machine on SUSE Aeon (btrfs). If I'm not mistaken, fuse overlay and native overlay correspond to the Native Overlay Diff setting, which can be checked with: podman info --debug | grep Na
What we Found
We experimented with the following commands: podman pull debian
podman image save -o /tmp/debian.tar debian
podman system reset
podman load -i /tmp/debian.tar
time podman run --rm --userns keep-id -v /home/:/home/ debian date
We found the time to create that debian container varied a lot:
When I ran:
conf=$HOME/.config/containers/storage.conf
mv $conf $conf.bak
cat <<EOF >> $conf
[storage]
driver = "overlay"
[storage.options]
mount_program = "/usr/bin/fuse-overlayfs"
EOF
We repeated with some larger images and roughly observed:
The time taken to load the images wasn't impacted by the FS or by native vs. fuse overlay.
Alternative User Mapping
I should note, the advice given above by @arroyoj [1] didn't work:
time podman run --uidmap 0:1:1000 --uidmap 1000:1000:64536 --gidmap 0:1:1000 --gidmap 1000:1000:64536 --user 1000:1000 debian /bin/bash -c "cat /proc/self/uid_map /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
Had it worked, we could have created containers like that and cloned them.
We finally tried without any id maps at all; this, as expected, worked:
time podman run debian /bin/bash -c "cat /proc/self/uid_map /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
Conclusion
Using --userns keep-id with native overlay was dramatically slower than with fuse-overlayfs in our tests.
See Also
Footnotes |
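The cost being measured throughout this thread is the chown pass itself, which must visit every inode in the image. A rough Python sketch of what a storage-chown-by-maps-style walk amounts to (illustrative only; the real implementation lives in Go inside containers/storage and is more involved):

```python
import os

def remap_id(host_id, id_map):
    # id_map: (container_start, host_start, length) triples,
    # the same shape as the lines in /proc/*/uid_map.
    for c_start, h_start, length in id_map:
        if h_start <= host_id < h_start + length:
            return c_start + (host_id - h_start)
    return None  # id not covered by the mapping

def chown_by_maps(root, uid_map, gid_map, apply=False):
    """Visit every inode under `root` and re-own it per the id maps.
    Cost is O(number of inodes) -- hence minutes for large images."""
    visited = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            uid = remap_id(st.st_uid, uid_map)
            gid = remap_id(st.st_gid, gid_map)
            if apply and uid is not None and gid is not None:
                os.lchown(path, uid, gid)  # at least one syscall per inode
            visited += 1
    return visited
```

With tens of millions of files, even a fraction of a millisecond per inode adds up to minutes or hours, which matches the timings reported above.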
@RyanGreenup is machine C (btrfs) different in hardware from machine B? I'm asking because if it is, it's hard to judge how much difference there is between native overlay and FUSE when btrfs is involved.
Barring my previous comment, I'd presume the reason for seeing difference in performance may be that BTRFS supports reflinks, whereas ZFS as of today to my knowledge still doesn't. |
Yes, B and C are different hardware, so these numbers aren't going to be sharp enough to split hairs. The lack of reflink in ZFS is unfortunate indeed; as a result, I suspect it will be a while before we can use native overlay with userns on ZFS. Edit: I should clarify though,
I would have expected Machine C to have about equal performance in terms of hardware, the difference between 1 and 10 s is arguably significant. |
Same issue. A chown of a single file takes seconds (verified with a separate tool).
EDIT to add command line that causes the issue:
|
I enabled block cloning on ZFS, and it seems to have solved the issue. |
Can we close this issue? |
Long-time watcher (and victim) of this issue here chiming in… this issue always bites me when running podman on low-end hardware like the Raspberry Pi 3 and 4 (with an ext4 fs, fwiw).
I subscribed to this issue when @rhatdan made that comment last year. If I understand the various threads, supporting id mapped file systems in rootless mode would solve/mitigate this issue, right? Is there an issue I can subscribe to that will indicate when that feature is available? If so, I'm all for closing this issue because it's grown quite gnarly with lots of input and no clear direction. Edit: also this comment succinctly summarized the issue from my POV #16541 (comment) |
Have there been changes that should fix it? |
No idea; I saw someone mention a potential workaround/fix.
I presume you meant the comment about block cloning on ZFS? "Block cloning" I presume means reflinks, and in my older steps-to-reproduce I used XFS, which supports that. So it's likely the comment author just has a fast SSD or a small image, but the problem is still there. |
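On "block cloning": reflink-capable filesystems (btrfs, XFS with reflink enabled, ZFS 2.2+ with block cloning) let a copy clone extents instead of rewriting data, which is why the copy-up triggered by the chown is so much cheaper there. A small Python sketch using Linux's FICLONE ioctl, falling back to a plain byte copy where cloning isn't supported (the constant and behaviour are Linux-specific):

```python
import fcntl
import shutil

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>

def reflink_copy(src, dst):
    """Clone src to dst via reflink when the filesystem supports it,
    otherwise fall back to an ordinary (data-rewriting) copy."""
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return "reflink"  # extents are shared; no data was rewritten
        except OSError:
            pass  # e.g. ext4 or older ZFS: the clone request is refused
    shutil.copyfile(src, dst)
    return "copy"
```

On a filesystem without reflink support, every copy-up rewrites the file's data in full, so the chown pass scales with data size rather than just inode count.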
@rhatdan Last week in our meeting, when speaking specifically about idmap in rootless mode, you told me it will never be implemented, as it would allow a user to write files on disk as root, and obviously this would be a huge security issue. Thinking a bit more since then: do we have no way in the kernel to restrict rootless idmapped mounts to only mappings which "make sense" from a security point of view? I mean, each time I start a container, it creates a uid_map already, as a rootless user. Playing with the --userns flag I can create a different user id mapping. Typically, if I can "show" to the kernel that my user has a running process with a given uid_map, why wouldn't it be OK to create another idmapped mount using that specific uid_map? Maybe I don't use the right words, sorry for that; my knowledge in that area is not huge. |
@giuseppe what do you think of the chances of the kernel allowing idmapped file mappings only for UIDs defined within the user namespace? |
Idmapped mounts inside a user namespace are restricted to the file systems whose superblock is owned by the user namespace itself (like a tmpfs or FUSE file system created inside the user namespace). My impression is that it is unlikely that the restriction will be lifted to allow idmapped mounts for other file systems. |
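For anyone experimenting with these mappings, the mapping the kernel applied to the current process can be read back directly; here is a small Python helper that parses /proc/self/uid_map (Linux-only):

```python
def read_id_map(path="/proc/self/uid_map"):
    """Return (container_start, host_start, length) triples describing
    the calling process's user namespace, as reported by the kernel."""
    triples = []
    with open(path) as f:
        for line in f:
            inside, outside, length = (int(x) for x in line.split())
            triples.append((inside, outside, length))
    return triples

print(read_id_map())  # e.g. [(0, 0, 4294967295)] in the initial namespace
```

Running this inside a `podman run` with different `--userns` values shows exactly which mapping each mode produced.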
/kind bug
Description
While migrating a CI from Docker to Podman, I'm occasionally stumbling upon freezes of Podman commands. They may take dozens (!!!) of minutes, with Podman not doing anything at all.
The hangs aren't specific to any commands. E.g. right as I'm writing this text, I see two jobs, one with podman run … and another with podman inspect, both frozen. So I connected to the server with ssh and tried running a time podman inspect foobar (literally a request for the non-existing foobar image), and it hung as well. podman ps hangs, and podman version
even hangs!!
Basically, to be able to create this report I had to kill a podman process. I had 2
podman run
processes and 2 podman inspect
s. I killed one of the
processes, and a little later CI finally proceeded and podman commands started working.Steps to reproduce the issue:
I'm afraid I couldn't find any. It seems to be happening when multiple podman processes are run, but my attempts simulating that in different ways didn't succeed. It just happens from time to time as part of CI, in which case CI basically breaks completely.Steps to reproduce were found as part of this duplicate issue and are copied below:
Output of podman version:
Output of podman info:
Package info:
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)
Yes