-
Are the nodes up? If the master is up, it has a local kubeconfig at
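For illustration, checking cluster state from the master with that local kubeconfig could look roughly like this (the placeholder path below is hypothetical; substitute the actual file on the master):

```sh
# On the master node: point oc at the node-local kubeconfig
# (replace <local-kubeconfig-path> with the path mentioned above).
export KUBECONFIG=<local-kubeconfig-path>

# Then check node and operator status from there.
oc get nodes -o wide
oc get clusteroperators
```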
-
Hello, we found this SELinux
This was on the failed worker node.
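As a sketch of how such SELinux denials can be inspected on the node (standard audit/journal tooling, not taken from the report above):

```sh
# On the affected node (e.g. via the hardware console):
# list recent AVC denials from the audit log.
sudo ausearch -m avc -ts recent

# Confirm whether SELinux is currently enforcing.
getenforce

# Watch for new denials live while reproducing the problem.
sudo journalctl -f | grep -i avc
```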
-
Hello there,
Describe the bug
We've been running an OKD 4.7 cluster for a few months, carrying out a few updates (stable releases) successfully.
However, when upgrading from 4.7.0-0.okd-2021-05-22-050008 to 4.7.0-0.okd-2021-06-04-191031 a few days ago, the process failed.
The operator upgrades all succeeded, but the node upgrades failed, and we ended up stuck in the middle of the process with one master and one worker node unavailable:
We tried to SSH into the affected nodes to see what happened, but the SSH server was not available. Using the hardware console we saw that the machines went through an FCOS upgrade to FCOS 34.20210518.3.0; after the node boots, it immediately loses network connectivity and its hostname gets set to localhost (visible on the prompt). The issue seems to occur as soon as the FCOS 34 boot process finishes.
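As a sketch of what can be checked from the hardware console in that state (standard FCOS/NetworkManager tooling; nothing here is taken from our logs):

```sh
# Which deployment actually booted (FCOS 33 vs 34)?
rpm-ostree status

# Hostname state, and whether it fell back to the transient "localhost".
hostnamectl

# Network interface state and NetworkManager logs from this boot.
nmcli device status
journalctl -b -u NetworkManager --no-pager | tail -n 100
```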
From there, we don't really know what else to try, and any help would be greatly appreciated.
Version
UPI bare-metal cluster with 3 master and 3 worker nodes, running 4.7.0-0.okd-2021-05-22-050008.
How reproducible
The issue occurs every time. We also tried reinstalling the nodes (PXE installs) directly from the FCOS 34 image, but the node ends up downgraded to FCOS 33.20210426.3.0, then upgraded back to 34.20210518.3.0, and we lose connectivity again.
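A possible workaround sketch from the console, untested here and using standard rpm-ostree commands; note that the machine-config operator will likely try to reconcile the node back to the desired OS afterwards:

```sh
# List the deployments kept on disk (the previous FCOS 33 one should still be there).
rpm-ostree status

# Make the previous deployment the default again and reboot into it,
# just to regain networking on the node while debugging.
sudo rpm-ostree rollback
sudo systemctl reboot
```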
Log bundle
We cannot complete the `oc adm must-gather` process since it can't reach some of the nodes. I'll try to gather a console boot log from our remote console.
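As a sketch of how logs can still be captured from the remote console when SSH and must-gather cannot reach the node (file names are just examples):

```sh
# From the hardware console of the affected node: dump the journal
# for the current and previous boots to files we can copy off later.
sudo journalctl -b 0 --no-pager > /var/tmp/journal-current-boot.log
sudo journalctl -b -1 --no-pager > /var/tmp/journal-previous-boot.log

# Kernel messages for the current boot as well.
sudo dmesg > /var/tmp/dmesg.log
```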
Thanks a lot! :)