Nodes become unhealthy after upgrading from 4.11 to 4.12 #2003
hamidostad started this conversation in General
Replies: 1 comment
-
Hi, we are not working on FCOS builds of OKD any more. Please see these documents: https://okd.io/blog/2024/06/01/okd-future-statement
We will be providing documentation on upgrading clusters from 4.15 FCOS to 4.16 SCOS. For clusters that are older, you may be able to get help from community members, so I'll convert this to a discussion to facilitate that.
Many thanks, Jaime
-
The previous cluster version was 4.11.0-0.okd-2022-12-02-145640. After upgrading the cluster to any 4.12 version, the nodes become unhealthy. When we check the nodes, we find that the underlying EC2 instances are unhealthy as well. Inspecting the instances and their services, we see a NetworkManager error: no IP is assigned to the instance, and the kubelet service is not running. The error ultimately points to ovsdb-server: the user and group "openvswitch:hugetlbfs" do not exist on the instance, which causes ovsdb-server and openvswitch to fail.
Creating the missing user and group solves the problem.
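For reference, a minimal sketch of that workaround, run as root on each affected node. The account names come from the ovsdb-server error; the system-account flags, nologin shell, and exact service unit names are assumptions on our part, not taken from the original report.
# hypothetical workaround sketch; adjust flags and unit names to your environment
groupadd -r hugetlbfs
useradd -r -g hugetlbfs -s /sbin/nologin openvswitch
systemctl restart ovsdb-server openvswitch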
The question is:
Why does upgrading to 4.12 cause this problem? The cluster does not have this issue when applying patch upgrades within 4.11.
ovsdb-server log (attachment not reproduced here)
Cluster upgrade history (attachment not reproduced here)
Version
from: 4.11.0-0.okd-2022-12-02-145640
to: 4.12.0-0.okd-2023-03-18-084815
How to reproduce
oc adm upgrade --to="4.12.0-0.okd-2023-03-18-084815"
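After the upgrade, the symptom can be confirmed on an affected node with checks along these lines (assumed diagnostic commands, not part of the original report):
# verify the missing account and the failing services
id openvswitch || echo "openvswitch user missing"
getent group hugetlbfs || echo "hugetlbfs group missing"
systemctl status ovsdb-server kubelet --no-pager
journalctl -u ovsdb-server --no-pager | tail -n 50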