[4.12 upgrade] Node upgrade fails because of SELinux policies preventing `nm-dispatcher` from working #1475

rassie · 2023-01-23T12:59:29Z

Describe the bug

Upgrading OKD from 4.11 to 4.12, I'm blocked by kubelets not starting on both master and worker nodes. The problem is the same on each: the file /run/resolv-prepender-kni-conf-done is never created, so kubelet's pre-condition prevents it from starting. The logs are full of SELinux denials preventing nm-dispatcher from reading NetworkManager's configuration:

Jan 22 21:50:02 okd-xwwxf-master-2 audit[1087]: AVC avc:  denied  { read } for  pid=1087 comm="nm-dispatcher" name="dispatcher.d" dev="sda4" ino=90264444 scontext=system_u:system_r:NetworkManager_dispatcher_t:s0 tcontext=system_u:object_r:NetworkManager_initrc_exec_t:s0 tclass=dir permissive=0
Jan 22 21:50:02 okd-xwwxf-master-2 audit[1087]: SYSCALL arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=561faac18790 a2=90800 a3=0 items=0 ppid=1 pid=1087 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="nm-dispatcher" exe="/usr/libexec/nm-dispatcher" subj=system_u:system_r:NetworkManager_dispatcher_t:s0 key=(null)
Jan 22 21:50:02 okd-xwwxf-master-2 audit: PROCTITLE proctitle="/usr/libexec/nm-dispatcher"
Jan 22 21:50:02 okd-xwwxf-master-2 nm-dispatcher[1087]: req:53 'connectivity-change': find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory "/etc/NetworkManager/dispatcher.d": Permission denied
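
For anyone triaging similar denials, here is a minimal Python sketch (not part of the original report) that pulls the relevant fields out of the AVC line above. The helper name `parse_avc` is hypothetical; the sample line is copied from the log with only the timestamp and hostname dropped.

```python
import re

# AVC denial copied from the report, minus timestamp and hostname.
AVC_LINE = (
    'audit[1087]: AVC avc:  denied  { read } for  pid=1087 comm="nm-dispatcher" '
    'name="dispatcher.d" dev="sda4" ino=90264444 '
    'scontext=system_u:system_r:NetworkManager_dispatcher_t:s0 '
    'tcontext=system_u:object_r:NetworkManager_initrc_exec_t:s0 '
    'tclass=dir permissive=0'
)


def context_type(context):
    """An SELinux context is user:role:type:level; return the type field."""
    return context.split(":")[2]


def parse_avc(line):
    """Return the key fields of an AVC denial line, or None if it is not one."""
    if "avc:" not in line or "denied" not in line:
        return None
    perms = re.search(r"\{([^}]*)\}", line)
    # key=value pairs; values are either quoted strings or bare tokens.
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    return {
        "permissions": perms.group(1).split() if perms else [],
        "comm": fields.get("comm", "").strip('"'),
        "name": fields.get("name", "").strip('"'),
        "source_type": context_type(fields["scontext"]),
        "target_type": context_type(fields["tcontext"]),
        "tclass": fields["tclass"],
    }


denial = parse_avc(AVC_LINE)
```

Here `denial["source_type"]` is the domain of the process (nm-dispatcher) and `denial["target_type"]` is the label on the directory it was refused access to, which is the pair the rest of this thread revolves around.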

Version

IPI with vSphere, 4.11.0-0.okd-2023-01-14-152430 updating to 4.12.0-0.okd-2023-01-21-055900.

How reproducible

100% so far. Adding a node works, but only with an earlier version of Fedora CoreOS, which will probably get updated in time and fail too.

Log bundle

https://drive.google.com/file/d/16oVumQ6SAHoiP2FlvItbAsIY87CvcW64/view?usp=sharing

vrutkovs · 2023-01-23T13:06:06Z

MCO operator says

pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node okd-xwwxf-worker-p8pjt is reporting: "failed to drain node: okd-xwwxf-worker-p8pjt after 1 hour. Please see machine-config-controller logs for more information""

MCO controller says:

2023-01-23T09:31:21.543073234Z E0123 09:31:21.543021       1 drain_controller.go:110] error when evicting pods/"rook-ceph-osd-1-c7b8c8b49-8xm58" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2023-01-23T09:31:21.543325781Z E0123 09:31:21.543290       1 drain_controller.go:110] error when evicting pods/"rook-ceph-osd-2-56bd8bd885-zbd6k" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2023-01-23T09:31:26.544001981Z I0123 09:31:26.543939       1 drain_controller.go:110] evicting pod rook-ceph/rook-ceph-osd-1-c7b8c8b49-8xm58
2023-01-23T09:31:26.544041726Z I0123 09:31:26.543996       1 drain_controller.go:110] evicting pod rook-ceph/rook-ceph-osd-2-56bd8bd885-zbd6k
2023-01-23T09:31:26.544041726Z I0123 09:31:26.544023       1 drain_controller.go:139] node okd-xwwxf-worker-p8pjt: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: [error when evicting pods/"rook-ceph-osd-1-c7b8c8b49-8xm58" -n "rook-ceph": global timeout reached: 1m30s, error when evicting pods/"rook-ceph-osd-2-56bd8bd885-zbd6k" -n "rook-ceph": global timeout reached: 1m30s]
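
The controller log above shows the drain being held up by PodDisruptionBudgets rather than by the SELinux problem itself. As a rough sketch (the helper name `pdb_blocked_pods` is made up for illustration, and the pattern is only matched against the message format seen in this log), the blocked pods can be extracted like this:

```python
import re

# Two lines copied from the machine-config-controller log above
# (leading RFC 3339 timestamps dropped).
SAMPLE = '''E0123 09:31:21.543021       1 drain_controller.go:110] error when evicting pods/"rook-ceph-osd-1-c7b8c8b49-8xm58" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0123 09:31:21.543290       1 drain_controller.go:110] error when evicting pods/"rook-ceph-osd-2-56bd8bd885-zbd6k" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
'''

EVICT_BLOCKED = re.compile(
    r'error when evicting pods/"(?P<pod>[^"]+)" -n "(?P<ns>[^"]+)" '
    r'\(will retry after [^)]+\): Cannot evict pod'
)


def pdb_blocked_pods(log_text):
    """Return sorted, de-duplicated (namespace, pod) pairs whose eviction was refused."""
    return sorted({(m["ns"], m["pod"]) for m in EVICT_BLOCKED.finditer(log_text)})


blocked = pdb_blocked_pods(SAMPLE)
```

In this thread both blocked pods are rook-ceph OSDs, so the worker drain failure is a separate symptom from the master node that does not come back after its reboot.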

vrutkovs · 2023-01-23T13:10:19Z

Checking why okd-xwwxf-master-2 is not coming back from the reboot

vrutkovs · 2023-01-23T13:50:52Z

In the 4.11 -> 4.12 upgrade we move from F36 to F37. The NM dispatcher on F37 expects scripts to be labelled with system_u:object_r:NetworkManager_dispatcher_script_t:s0, but on F36 the files are labelled with system_u:object_r:NetworkManager_initrc_exec_t:s0.
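
A small Python sketch of how one might list the entries that still carry the old label before relabelling. It is an illustration only: the function names are hypothetical, it reads the `security.selinux` extended attribute directly (Linux only, needs permission to read it), and it merely looks for the stale type named in this comment. It does not consult the loaded policy, so `restorecon` remains the authority on what the correct label is.

```python
import os

STALE_TYPE = "NetworkManager_initrc_exec_t"  # old label named in this comment


def context_type(context):
    """Return the type field of an SELinux context (user:role:type:level)."""
    return context.split(":")[2]


def read_context(path):
    """Read a path's SELinux context from its security.selinux xattr."""
    raw = os.getxattr(path, "security.selinux", follow_symlinks=False)
    return raw.decode().rstrip("\x00")  # the kernel returns a NUL-terminated value


def walk_contexts(root):
    """Yield (path, context) for root, its subdirectories and their files.

    Symlinks to directories are not followed and not reported.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        yield dirpath, read_context(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            yield path, read_context(path)


def stale_entries(pairs, stale_type=STALE_TYPE):
    """Return the paths whose context still carries the stale type."""
    return [path for path, context in pairs if context_type(context) == stale_type]
```

On an affected node, `stale_entries(walk_contexts("/etc/NetworkManager/dispatcher.d"))` would be expected to return the directory and its scripts; an empty list suggests the labels have already been restored.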

Workaround:

boot with enforcing=0
ssh on the node, run restorecon -R -v /etc/NetworkManager/dispatcher.d/ and reboot

Not sure why MCD/rpm-ostree rebase didn't update the labels. Possibly an rpm-ostree/mco regression? cc @cgwalters

cgwalters · 2023-01-23T14:20:30Z

> boot with selinux=0

No, that's a one-way transition effectively. We want enforcing=0 here.

As for the incorrect label... hmm, that definitely needs some debugging. Does `ostree admin config-diff | grep selinux` show that you have a modified policy?

Does `restorecon -R -v /etc/NetworkManager/dispatcher.d/` show anything?

rassie · 2023-01-23T14:31:57Z

@cgwalters

On an upgrading node after OS update restart:

[root@okd-xwwxf-master-1 ~]# rpm-ostree status
State: idle
Deployments:
● ostree-unverified-registry:quay.io/openshift/okd-content@sha256:01d90a996a2e78a0486616a0adba0733db428f5a1074976054cedff62f17b2ac
                   Digest: sha256:01d90a996a2e78a0486616a0adba0733db428f5a1074976054cedff62f17b2ac
                  Version: 37.20221225.3.0 (2023-01-23T14:28:02Z)

  pivot://quay.io/openshift/okd-content@sha256:bc4fe370cd76415d045b6cc2cf08e5f696ece912661cfe4370910020be9fe0b6
             CustomOrigin: Managed by machine-config-operator
                  Version: 411.36.202301141513-0 (2023-01-14T15:17:08Z)

[root@okd-xwwxf-master-1 ~]# ostree admin config-diff |grep selinux
M    selinux/targeted/active/commit_num
A    selinux/targeted/semanage.read.LOCK
A    selinux/targeted/semanage.trans.LOCK

[root@okd-xwwxf-master-1 ~]# restorecon -R -v /etc/NetworkManager/dispatcher.d/
Relabeled /etc/NetworkManager/dispatcher.d from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0
Relabeled /etc/NetworkManager/dispatcher.d/pre-up.d from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0
Relabeled /etc/NetworkManager/dispatcher.d/pre-up.d/10-ofport-request.sh from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0
Relabeled /etc/NetworkManager/dispatcher.d/30-resolv-prepender from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0
Relabeled /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl from system_u:object_r:NetworkManager_initrc_exec_t:s0 to system_u:object_r:NetworkManager_dispatcher_script_t:s0

cgwalters · 2023-01-23T16:31:37Z

> but on F36 files are labelled with system_u:object_r:NetworkManager_initrc_exec_t:s0.

I booted 36.20221030.3.0 and that doesn't seem to be true. I see:

[root@cosa-devsh ~]# ls -alZ /etc/NetworkManager/dispatcher.d/
total 4
drwxr-xr-x. 5 root root system_u:object_r:NetworkManager_dispatcher_script_t:s0         111 Nov 11 15:55 .
drwxr-xr-x. 7 root root system_u:object_r:NetworkManager_etc_t:s0                       134 Nov 11 15:55 ..
-rwxr--r--. 1 root root system_u:object_r:NetworkManager_dispatcher_console_script_t:s0 506 Nov 11 15:55 90-console-login-helper-messages-gensnippet_if
drwxr-xr-x. 2 root root system_u:object_r:NetworkManager_dispatcher_script_t:s0           6 Nov 11 15:55 no-wait.d
drwxr-xr-x. 2 root root system_u:object_r:NetworkManager_dispatcher_script_t:s0           6 Nov 11 15:55 pre-down.d
drwxr-xr-x. 2 root root system_u:object_r:NetworkManager_dispatcher_script_t:s0           6 Nov 11 15:55 pre-up.d

This is on a stock node.

dustymabe · 2023-01-23T16:52:41Z

Some dispatcher related issues were documented in coreos/fedora-coreos-tracker#1218. Not sure if that's part of the problem here or not.

tthrone-atomic · 2023-01-26T23:34:54Z

Running the following worked for my 4.11 to 4.12 upgrade (vSphere IPI). I did not need to set enforcing=0 on boot:
restorecon -R -v /etc/NetworkManager/dispatcher.d/

kalik1 · 2023-02-08T16:18:30Z

Hi, I had the same issue. For anyone stuck with a blocked update, the workaround at the following URL worked in my case: #1317 (comment)

nate-duke · 2023-03-02T18:05:11Z

We've recently hit this while updating one of our clusters and are a bit concerned with the impact this has on MachineSet scaling or other "new provisioning" scenarios in existing clusters. Are there any potential workarounds aside from the MachineConfig workaround in #1317 (comment)?

We've done some limited testing of that workaround and it doesn't appear to work for new systems. What we've seen is that systems get provisioned but never make it to a running Node. We're going to do some more testing to see what additional issues are encountered, but given that we're in uncharted territory, I'm reluctant to post an issue about an environment that has had a workaround applied to it.

nate-duke · 2023-03-15T11:13:18Z

Is there anything else we can do to determine the cause of this? It seems to still be impacting new Machine builds in 4.12.0-0.okd-2023-03-05-022504.

There's an FCOS issue mentioned upthread, and then there are #1438 and #1450, where SELinux seems to be at play as in this issue, but there's no clear identification (to me at least) of where the root of the issue is and thus where we can focus for a fix.

Happy to help test in any way that we can.

Bengrunt · 2023-03-27T14:46:18Z

Hello, I hit the same issue when upgrading a cluster from 4.11 to 4.12 and tried to apply the workaround mentioned here:

> Hi, I had the same issue. For anyone stuck with a blocked update, the workaround at the following URL worked in my case: #1317 (comment)

However, I now hit another issue related to OVN, with no idea whatsoever how to debug it :/

ovsdb-server[1248]: ovs|00002|stream_ssl|ERR|SSL_use_certificate_file: error:80000002:system library::No such file or directory
ovsdb-server[1248]: ovs|00003|stream_ssl|ERR|SSL_use_PrivateKey_file: error:10080002:BIO routines::system lib
ovs-ctl[1202]: Starting ovsdb-server.
[...]
ovs-vswitchd[1328]: ovs|00007|stream_ssl|ERR|SSL_use_certificate_file: error:80000002:system library::No such file or directory
ovs-vswitchd[1328]: ovs|00008|stream_ssl|ERR|SSL_use_PrivateKey_file: error:10080002:BIO routines::system lib
ovs-vswitchd[1328]: ovs|00009|stream_ssl|ERR|failed to load client certificates from /ovn-ca/ca-bundle.crt: error:0A080002:SSL routines::system lib

Then I get no network connectivity on the node. Or maybe OVN fails to start because NetworkManager didn't start properly, but that does not appear in the logs?

Thanks a lot in advance for any help regarding this.

nate-duke · 2023-03-27T16:10:12Z

@Bengrunt To be clear, you shelled into the broken node(s) and executed the restorecon spell and then rebooted and were met with the above error in ovs?

(if so you'll probably want to open a new issue and attach a must-gather to get some visibility. also be sure to mention that it's currently an ovn issue!)

Bengrunt · 2023-03-27T20:06:37Z

@nate-duke Not exactly, what I did was:

1. Pause MCPs (master and worker) since I had two nodes stuck in the middle of the upgrade.
2. Create the two machine configs mentioned in this other issue and referred to as a possible workaround above
3. Reboot the failing node on the previous rpm-ostree release (eg. FCOS 36/OKD 4.11)
4. Wait for new MCs to be rendered and include the fix
5. Unpause the MCPs
6. Hope for the best 🤞
7. End up with this new error 😢
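
The pause and unpause steps in the list above can be sketched as follows. This is an illustration rather than what was actually run: it assumes the usual approach of patching `spec.paused` on each MachineConfigPool with `oc patch`, the function names are hypothetical, and the two MachineConfigs from #1317 are not reproduced here.

```python
import json
import subprocess


def mcp_pause_command(pool, paused):
    """Build the `oc patch` argv that sets spec.paused on a MachineConfigPool."""
    patch = json.dumps({"spec": {"paused": bool(paused)}})
    return ["oc", "patch", f"machineconfigpool/{pool}",
            "--type", "merge", "--patch", patch]


def set_paused(pools, paused, dry_run=True):
    """Print (dry_run=True) or execute the patch command for each pool.

    Returns the list of commands so callers can inspect or log them.
    """
    commands = [mcp_pause_command(pool, paused) for pool in pools]
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # Requires a logged-in `oc` on PATH; raises if the patch fails.
            subprocess.run(argv, check=True)
    return commands
```

Calling `set_paused(["master", "worker"], True)` prints the two pause commands without touching the cluster; passing `dry_run=False` actually runs them, and `paused=False` reverses the change once the new rendered configs are in place.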

But maybe you're right I should rather open a new bug, sorry about that.

Bengrunt · 2023-04-13T12:10:01Z

Hello, just to let other users who hit this issue know: I eventually managed to make the above-mentioned workaround work by running it manually in single-user mode on the nodes and overriding MCD's validation process.

Thus, I managed to carry out the cluster upgrade process and then run two successive cluster upgrades without any issue.

So I imagine that others with clusters deployed back in 4.6 or 4.7 could work around this issue using the same technique.

Feels like I learned a lot about FCOS and rpm-ostree and the MCO/MCD in the process 😆

MattPOlson · 2023-04-18T14:37:52Z

I posted this in the bug referenced above, but the service "rhcos-selinux-policy-upgrade.service" is supposed to be rebuilding the SELinux policy, and it's not running because it's trying to use a variable that doesn't exist in FCOS.

RHEL_VERSION=$(. /usr/lib/os-release && echo ${RHEL_VERSION:-})
echo -n "RHEL_VERSION=${RHEL_VERSION:-}"

It probably needs to be updated to just use VERSION, which FCOS's os-release does define:

NAME="Fedora Linux"
VERSION="37.20230303.3.0 (CoreOS)"
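
To make the point concrete, here is a small Python sketch that parses the two os-release lines quoted above and mirrors the shell's `${RHEL_VERSION:-}` lookup. The parser is a simplified stand-in written for this illustration; what the service does with the value afterwards is not shown in this thread and is not modelled here.

```python
import shlex

# The two os-release lines quoted in this comment (FCOS 37).
FCOS_OS_RELEASE = '''NAME="Fedora Linux"
VERSION="37.20230303.3.0 (CoreOS)"
'''


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict, honouring shell quoting."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)
        values[key] = parts[0] if parts else ""
    return values


info = parse_os_release(FCOS_OS_RELEASE)

# Equivalent of ${RHEL_VERSION:-}: empty string when the key is not defined.
rhel_version = info.get("RHEL_VERSION", "")
fedora_version = info.get("VERSION", "")
```

With the FCOS file, `rhel_version` comes out empty while `fedora_version` holds the full version string, which matches the observation that the script has nothing to work with on FCOS.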

alexzose · 2023-05-10T07:01:37Z

Hello, we experience the same issue with an IPI installation on OpenStack.

The initial cluster version was 4.8, and we have been updating since then.

After the update to 4.12, kubelet fails to start because the NetworkManager scripts have incorrect SELinux labels, and the file /run/resolv-prepender-kni-conf-done is never created.

Running restorecon -vR /etc/NetworkManager/dispatcher.d/ seems to fix the issue for the kubelet, which then starts normally, but afterburn-hostname.service then fails on boot. A manual restart of afterburn-hostname.service runs OK, though.

danielchristianschroeter · 2023-05-24T08:09:57Z

I ran into the same error situation when upgrading from 4.11.0-0.okd-2023-01-14-152430 to 4.12.0-0.okd-2023-04-16-041331.
The incorrect labels result in:
May 24 06:45:02 okd-01-zvldl-worker-fxl6k systemd[1]: kubelet.service: Failed with result 'exit-code'.
May 24 06:45:02 okd-01-zvldl-worker-fxl6k systemd[1]: Failed to start kubelet.service - Kubernetes Kubelet.

After executing `restorecon -vR /etc/NetworkManager/dispatcher.d/; semodule -B; systemctl restart NetworkManager; systemctl restart kubelet`, it works temporarily.

During the upgrade process, whenever a node switched to the "Not Ready" state, running `restorecon -vR /etc/NetworkManager/dispatcher.d/; semodule -B` was enough for the upgrade to continue. The labels are reset at the beginning of the upgrade, so running this command on every node beforehand does not work.

nate-duke · 2023-10-12T11:13:25Z

So, we're still dealing with this on every new node provision (and nearly if not every update?). Is there a recommended place we can file an issue to get this fixed in FCOS as mentioned in #1475 (comment)?

LorbusChris · 2023-10-12T11:23:29Z

ah yes. Please file an issue on https://github.com/openshift/os/

Something like:
"rhcos-selinux-policy-upgrade.service broken on OKD"

Please also include a link to this issue here.

JaimeMagiera · 2024-08-15T13:34:55Z

Hi,

We are not working on FCOS builds of OKD any more. Please see these documents...

https://okd.io/blog/2024/06/01/okd-future-statement
https://okd.io/blog/2024/07/30/okd-pre-release-testing

Please test with the OKD SCOS nightlies and file a new issue as needed.

Many thanks,

Jaime

vrutkovs closed this as completed Jan 23, 2023

vrutkovs reopened this Jan 23, 2023

vrutkovs pinned this issue Jan 23, 2023

rassie mentioned this issue Jan 23, 2023

[4.12 upgrade] CoreDNS pods consume a lot of CPU #1476

Closed

vrutkovs mentioned this issue Mar 1, 2023

update from 4.11.0-0.okd-2023-01-14-152430 to 4.12.0-0.okd-2023-02-18-033438 failing #1527

Closed

vrutkovs mentioned this issue Apr 16, 2023

Pod to Pod Communcation severely degraded in 4.11 on vSphere #1550

Closed

nate-duke mentioned this issue Oct 12, 2023

rhcos-selinux-policy-upgrade.service broken on OKD/FCOS openshift/os#1381

Closed

vrutkovs unpinned this issue Oct 29, 2023

JaimeMagiera closed this as completed Aug 15, 2024
