Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent podman varlink socket fight #3998

Merged

Conversation

cevich
Copy link
Member

@cevich cevich commented Sep 11, 2019

When enabled, it's desired for the podman-varlink process to startup on
boot or upon socket-activation, whichever happens first. However,
with KillMode=none systemd will never kill any podman-varlink
processes. This makes it easily possible for multiple podman-varlink
processes to be running, and fight each other to service a single socket.


For example:

Prior to this commit, this will result in four podman-varlink processes
being run:

systemctl enable io.podman.socket
systemctl enable io.podman.service
systemctl start io.podman.socket
systemctl start io.podman.service
systemctl start io.podman.service

Fix this by setting KillMode=process and TimeoutStopSec=30 (default
is 90). This results in podman-varlink exiting on its own after a minute
of being idle (--timeout=60000). Alternatively, systemd will manage the
service stop by sending a SIGTERM, then if podman-varlink has not exited
within TimeoutStopSec, a SIGKILL will be sent.

Signed-off-by: Chris Evich [email protected]

@mheon
Copy link
Member

mheon commented Sep 11, 2019

/approve

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2019
@cevich
Copy link
Member Author

cevich commented Sep 11, 2019

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2019
@rhatdan
Copy link
Member

rhatdan commented Sep 12, 2019

Looks like you pulled in part of my patches.

@cevich
Copy link
Member Author

cevich commented Sep 12, 2019

Looks like you pulled in part of my patches.

Aye, did that on purpose, at least temporarily to remove that variable from the equation - Nothing but @edsantiago's system-tests actually exercises the systemd+varlink stuff.

Still, there's too much systemd $MAGIC_VOODOO in this fix for my liking. I think it fixes a problem, but I don't fully understand the how/why part (i.e. correctness of my commit message).

@edsantiago has played with systemd more than me. Is what I wrote in the commit message correct for the changes I made in 'Prevent podman varlink service/socket fight'?

note: There's another problem here as well which we haven't sussed out yet. @baude thinks varlink is actually giving EOF incorrectly/unexpectedly sometimes. So more debugging and understanding work is in flight on that end.

@cevich
Copy link
Member Author

cevich commented Sep 12, 2019

N/m @edsantiago I think I got a handle on this now after reading man systemd.kill. That's the actual cause. Updating commit messages...

@cevich cevich changed the title DO NOT MERGE: WIP: Prevent podman varlink service/socket fight Prevent podman varlink socket fight Sep 12, 2019
@cevich
Copy link
Member Author

cevich commented Sep 12, 2019

@baude okay, this def. fixes the multiple-processes problem. However there is still some varlink issue WRT Error: Virtual Read failed, 0: EOF and Error: unexpected EOF errors with the reproducer: while podman-remote run -it --rm alpine echo "hi"; do :; done. I'll open an issue on that for tracking.

When enabled, it's desired for the podman-varlink process to startup on
boot or upon socket-activation, whichever happens first.  However,
with `KillMode=none` systemd will never kill any podman-varlink
processes.  This makes it easily possible for multiple podman-varlink
processes to be running, and fight each other to service a single socket.

---
For example:

Prior to this commit, this will result in four podman-varlink processes
being run:

```
systemctl enable io.podman.socket
systemctl enable io.podman.service
systemctl start io.podman.socket
systemctl start io.podman.service
systemctl start io.podman.service
```

Fix this by setting `KillMode=process` and `TimeoutStopSec=30` (default
is 90).  This results in podman-varlink exiting on its own after a minute
of being idle (--timeout=60000).  Alternatively, systemd will manage the
service stop by sending a SIGTERM, then if podman-varlink has not exited
within `TimeoutStopSec`, a SIGKILL will be sent.

Signed-off-by: Chris Evich <[email protected]>
@cevich
Copy link
Member Author

cevich commented Sep 12, 2019

note: I believe this CI run may have hit the problem reported in #4005

@cevich
Copy link
Member Author

cevich commented Sep 12, 2019

(rebased and force-pushed without Dan's patch)

@cevich
Copy link
Member Author

cevich commented Sep 12, 2019

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 12, 2019
@rhatdan
Copy link
Member

rhatdan commented Sep 12, 2019

LGTM
@baude @mheon @edsantiago PTAL

@edsantiago
Copy link
Member

LGTM. Thanks for all your work on this.

@rhatdan
Copy link
Member

rhatdan commented Sep 12, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2019
@openshift-merge-robot openshift-merge-robot merged commit 6ad1762 into containers:master Sep 12, 2019
jlebon added a commit to jlebon/libpod that referenced this pull request Oct 8, 2019
Using `Also=` means that the target unit will also be
installed/uninstalled together with our unit. Doing
`Also=multi-user.target` essentially says: disable `multi-user.target`
if `io.podman.socket` is disabled, which sounds... not at all like
what we want.

In practice, systemd thankfully ignores this (likely because it's the
default target). I think having `Also=io.podman.socket` in the
`io.podman.service` already does what we want here: it gets installed
under `sockets.target` whenever the service is. (And the fact that
systemd ignored this means that it wasn't actually playing a role in
resolving containers#3998.)

This was causing `systemctl preset-all` to dump core in Fedora CoreOS:
coreos/fedora-coreos-tracker#290

(Likely there's a systemd bug around here too.)

Signed-off-by: Jonathan Lebon <[email protected]>
@cevich cevich deleted the idiot_proof_systemd_unit branch June 30, 2021 18:11
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants