Cannot get cluster machine-config operator out of degraded state after fixing reason for an upgrade failure #1261
Comments
I don't follow - which log needs to be changed?
Here it reports that the machine config is degraded, and that it is retrying. In reality, apparently due to the timeout, the master MachineConfigPool paused itself. So you have to go in and unpause it once the underlying problem (in our case the SELinux policy) is fixed.
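For anyone landing here with the same symptom, the unpause step described above could look roughly like the sketch below. The pool name `master` comes from this thread; the `POOL` and `APPLY` variables and the dry-run guard are assumptions added for illustration, so by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: POOL/APPLY and the dry-run guard are illustrative assumptions.
POOL="${POOL:-master}"
PATCH='{"spec":{"paused":false}}'

if [ "${APPLY:-0}" = "1" ] && command -v oc >/dev/null 2>&1; then
  # Show the current pause flag (prints "true" when the pool is paused).
  oc get machineconfigpool "$POOL" -o jsonpath='{.spec.paused}{"\n"}'
  # Clear the flag with a merge patch so the pool can resume rolling out.
  oc patch machineconfigpool "$POOL" --type merge --patch "$PATCH"
else
  # Dry run: print the commands instead of touching the cluster.
  echo "oc get machineconfigpool $POOL -o jsonpath='{.spec.paused}'"
  echo "oc patch machineconfigpool $POOL --type merge --patch '$PATCH'"
fi
```

Run it with `APPLY=1` against a cluster you are logged in to; without it, nothing is changed.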
managedFields says "Mozilla" client has set
OK, then I know the problem: there is a 'Pause update' button in the Cluster Settings menu during an update. It must have been this, because frankly I did not even know you could pause the MCP from the GUI in the MachineConfigPools screen.
Thanks, I had the same problem.
Due to issue coreos/fedora-coreos-tracker#701, we had an upgrade failure in our cluster: kubelet would not start because of SELinux denials after machine config rebooted the nodes.
We were able to fix that by restoring the policy to the version delivered by FCOS before re-applying our necessary changes (we run some EDA tools that need execheap and execmod on nfs_t).
Using this method, our first master finished the upgrade and the upgrade progressed through the worker nodes flawlessly. The 2 remaining masters, though, do not seem to be upgrading, because the machine-config operator is permanently in degraded mode and I have no idea how to get it out of that state. Other than that the cluster works perfectly and jobs are running, but the update isn't progressing.
I have seen some messages regarding etcd but I don't have any reason to believe it is degraded as it is running on all nodes, guards are there and no degradation is reported.
All other operators are already at the new version and running fine, so this is the state:
Please help to get out of this mess. I also tried to retrigger machine-config on the first master by using
sudo touch /run/machine-config-daemon-force
and removing the daemon pod to force re-creation. This led to a reboot and re-application of the config, but it ended up in the same state.
Describe the bug
After fixing the SELinux issue, the worker node updates progressed and finished, but the remaining 2 masters will not update because I cannot get the machine-config operator out of degraded mode.
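A few read-only checks that may help narrow down a state like this are sketched below. The `run` helper and the `APPLY` switch are hypothetical additions so the script prints the commands by default; the resource names (`machine-config` operator, `master` pool) are the ones discussed in this issue.

```shell
#!/bin/sh
# Read-only sketch; run() and APPLY are illustrative assumptions.
CO="machine-config"
COND_FMT='{range .status.conditions[*]}{.type}={.status} {.message}{"\n"}{end}'

run() {
  # Execute the command only when APPLY=1 and oc is installed; otherwise print it.
  if [ "${APPLY:-0}" = "1" ] && command -v oc >/dev/null 2>&1; then
    "$@"
  else
    echo "$*"
  fi
}

# Conditions (Available, Progressing, Degraded, ...) reported by the operator.
run oc get clusteroperator "$CO" -o "jsonpath=$COND_FMT"
# Overview of all pools, including their updated/updating/degraded status.
run oc get machineconfigpool
# Pause flag of the master pool; "true" means the rollout is on hold.
run oc get machineconfigpool master -o "jsonpath={.spec.paused}"
```

If the last command prints `true`, the pool is paused and will not roll out further changes until it is unpaused.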
Version
4.10.0-0.okd-2022-05-28-062148 in upgrade towards 4.10.0-0.okd-2022-06-10-131327
Log bundle
https://next.mkcloud.dynu.net/index.php/s/MeDEWsPgFgJnnKg