Error: Failed to retrieve PID of executing child process #12084
This comes up from time to time and is usually a process on your system periodically cleaning the tmp directory.
@tomponline Thanks! That seems about right, and could also be the reason why I experience this now after switching to the snap LXD version. But I'm a bit confused: what should the contents of
Should I change the D! to an x or append the line
I've added it above and tested with
But does the third line mean that the snap.lxd directory was still cleaned, even though D! was specified for its parent dir? Just trying to understand and make sure it won't happen again.
Looking at
So you would want to add a new line to
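For reference, a minimal sketch of what such an exclusion could look like as a drop-in under /etc/tmpfiles.d/. The exact path being cleaned here is an assumption (check systemd-tmpfiles --cat-config to see which shipped rules apply); an 'x' entry tells systemd-tmpfiles to skip that path and its contents during cleanup, whereas the '!' in 'D!' only affects what happens at boot.

# /etc/tmpfiles.d/snap-lxd-exclude.conf (illustrative file name; path is an assumption)
# Type  Path           Mode  User  Group  Age  Argument
x       /tmp/snap.lxd  -     -     -      -    -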
Chatting with the snapd team, there's been a suggestion that perhaps LXD should use the XDG_RUNTIME_DIR path for its runtime state files (i.e. /run/user/0/snap. in the case of daemons) to avoid systemd-tmpfiles periodically cleaning up these files.
Unfortunately it didn't solve the error first described; after a few days the "Error: Failed to retrieve PID..." is back. Contents of
Trying console instead
You reloaded the systemd-tmpfiles service after making the change?
Yes, pretty sure I did. Reloaded systemd with and just did it again on all LXD hosts. I'll restart all the faulty containers and see if it happens again within a day or two.
You could try stopping and disabling the service for a while too and see if that helps.
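A rough sketch of how to do that, assuming it is the standard systemd-tmpfiles-clean.timer that fires the periodic cleanup on these hosts:

systemctl list-timers systemd-tmpfiles-clean.timer   # when it last ran and when it fires next
sudo systemctl stop systemd-tmpfiles-clean.timer     # stop the periodic clean for now
sudo systemctl mask systemd-tmpfiles-clean.timer     # keep it off across reboots; undo with 'unmask'
systemd-tmpfiles --cat-config | grep -i snap         # list the shipped rules that touch snap paths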
Instances restarted two days ago are now giving this error again. It's not all of the restarted containers that give the error; for some of them I'm still able to use "lxc exec", so it's a bit inconsistent. I'm beginning to doubt it has anything to do with the systemd-tmpfiles-clean job. Is anything special happening with, or "inside", the snap package? It was an older version though, but I never had the issue with non-snap LXD.
What's strange is that I have one LXD cluster host where not a single instance has this issue and containers have been running for many days. I haven't spotted the difference between that cluster host and the others yet, other than it has the "database standby" role.
Found this thread (https://discuss.linuxcontainers.org/t/forkexec-failed-to-load-config-file/16220/9) and looked at my hosts to see if this was also the case here. On one of the failing LXD hosts, this is the content of
There are only
Could this be the reason I'm getting this error? After restarting the container, all these files are present again
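For anyone wanting to spot affected containers without trying lxc exec on each one, a small check along these lines could work (the snap log path is taken from later in this thread; adjust it for non-snap installs):

for d in /var/snap/lxd/common/lxd/logs/*/; do
    [ -e "$d/lxc.conf" ] || echo "missing lxc.conf in $d"
done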
We've also been experiencing the same problem in our environments. Basically, we created clusters, and even out of the box, if we leave the cluster alone for a while we see the same error when trying to exec into the containers. We can reboot the containers to bash into them again, but that is not convenient in a production environment for obvious reasons. It would be great if we could be given some instructions on how to go about troubleshooting this.
Manually restoring it gets things working again. So the question is, why does this file suddenly disappear...?
Thank you @KLIM8D. We have the same issue and are thinking of a workaround for now: keep a copy of lxc.conf when a container starts, have a way of alerting when lxc.conf disappears, and send it back. Having it in a logs directory is certainly odd; a /run subdir would be more appropriate, I would say. On top of that, we seem to occasionally have issues with some log rotation in this directory. We had lxc.log files recreated with 0 bytes while the lxc monitor process keeps the old file open, filling up /var, and only a restart of the container releases the space of the already deleted file. I am not sure whether this misbehaviour of log rotation may also delete the lxc.conf in the same directory. I have not figured out where the log rotation configuration for the log files here is kept.
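A minimal sketch of that backup-and-restore idea, to run from cron or a systemd timer. The snap log path comes from this thread, while the backup location and the logger tag are just placeholders:

#!/bin/sh
# Workaround sketch, not a fix: back up each container's lxc.conf and put it
# back if it goes missing.
LOGDIR=/var/snap/lxd/common/lxd/logs    # /var/log/lxd on non-snap installs
BACKUP=/var/backups/lxd-lxc-conf        # hypothetical backup location
mkdir -p "$BACKUP"
for d in "$LOGDIR"/*/; do
    name=$(basename "$d")
    if [ -s "$d/lxc.conf" ]; then
        cp "$d/lxc.conf" "$BACKUP/$name.conf"       # refresh the backup while the file exists
    elif [ -f "$BACKUP/$name.conf" ]; then
        logger -t lxcconf-watch "lxc.conf missing for $name, restoring from backup"
        cp "$BACKUP/$name.conf" "$d/lxc.conf"
    fi
done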
Apparently it is the entire content, or at least all the container directories in Tried setting a rule in auditd to see what deleted the @petrosssit So maybe you should check in your workaround script whether the directory is there or not.
After exec for
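For reference, the kind of auditd watch meant above might look like this (snap path assumed, auditd must be installed, and the key name is arbitrary):

auditctl -w /var/snap/lxd/common/lxd/logs/ -p wa -k lxd-logs   # record writes, attribute changes and deletions in the log dir
ausearch -k lxd-logs -i                                        # later: see which process touched or removed files there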
Same issue with LXC/D 5.0.2 from Debian Bookworm... lxc.conf (written in /var/log/lxd/[container]/lxc.conf) disappears. Any more answers on this?
To be more precise on my setup: I work with Ceph as the storage backend. To go further, I've added a script that periodically checks for the presence of lxc.conf (1 test / 10s) and keeps all API logs: the file disappears when the cleanup task for expired backups starts. I've also seen this commit, which is not present in the refactored code (protection of lxc.conf on delete operations): https://github.com/canonical/lxd/pull/4010/files?diff=split&w=0 Hope we can find out more about this bug... Nice workaround to regenerate lxc.conf: lxc config show [container] > /var/log/lxd/[container]/lxc.conf
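A rough sketch of that periodic check combined with the regeneration workaround, using the non-snap path from the comment above; the interval and the logger tag are arbitrary, and per a later reply even an empty file created with touch seems to be enough:

while true; do
    for d in /var/log/lxd/*/; do
        c=$(basename "$d")
        if [ ! -f "$d/lxc.conf" ]; then
            logger -t lxcconf-watch "lxc.conf for $c disappeared, regenerating"
            lxc config show "$c" > "$d/lxc.conf"
        fi
    done
    sleep 10
done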
@nfournil A simple empty lxc.conf file works too: touch /var/snap/lxd/common/lxd/logs/[container]/lxc.conf ¯\_(ツ)_/¯ Edit: "lxc exec" with an empty lxc.conf works, but data is added to that lxc.conf file later (right after I did an lxc exec), so it doesn't actually stay empty.
Hi, please could you clarify what you mean here?
If you go to this file (which seems to be the 5.x place for this function):
https://github.com/canonical/lxd/blob/f14fc05ed333006ffe344e7268bbc7fda3994596/lxd/instance_logs.go#L295
you find the protection for the lxc.log file, but not for lxc.conf anymore...
I had added more aggressive logging (1 check per second on whether the lxc.conf file exists) and I can confirm the file disappears when the API call with location: srv05-r2b-fl1 comes in; less than half a second later, the file doesn't exist anymore...
@mihalicyn I think if we cherry-pick this lxc/incus#361 it should help.
Reopening due to #12084 (comment) |
Looks like (see lines 344 to 346 in 8f2d07c):
One solution to this might be to stop using tmp and log dirs for .conf files, similar to lxc/incus#426
Ran into the same error on snap LXD 6.1, but seemingly not for the same reason: deleting a container got stuck and broke it. A simple LXD daemon reload (not a restart, which kills containers) helped there -
5.15.0-76-generic #83~20.04.1-Ubuntu
I have a cluster with 4 nodes. There are 44 containers spread across the nodes, and at the moment I'm getting this error message when trying to run
lxc exec
on 19 of the containers. Apart from 3 containers, the other 41 use Ceph RBD for storage. I can start the container and exec works for a while, and then all of a sudden I get the error message. Restarting the container works and I can get a shell or whatever again, but then after X amount of hours it happens again.
lxc info
lxc config show amqp-1 --expanded
lxc monitor