
Cannot create new worker node after upgrade from 4.7.0-0.okd-2021-06-04-191031 - VMWare IPI #699

Closed
aalgera opened this issue Jun 17, 2021 · 8 comments

Comments

@aalgera

aalgera commented Jun 17, 2021

After upgrading from 4.7.0-0.okd-2021-06-04-191031 to 4.7.0-0.okd-2021-06-13-090745, I am no longer able to create new worker nodes. I can access the new machine and see that its configuration halts due to an error in machine-config-daemon-firstboot.service.
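For reference, the failing unit can be inspected on the machine with something like the following (a sketch; the node IP and the core user are assumptions for a stock FCOS-based IPI install):

# SSH to the partially provisioned machine (IP from the Machine object or the vSphere console)
ssh core@<new-worker-ip>

# Check the unit that applies the rendered MachineConfig on first boot
systemctl status machine-config-daemon-firstboot.service
journalctl -u machine-config-daemon-firstboot.service -b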

In the logs of this service I found the following error:
jun 17 20:54:35 localhost machine-config-daemon[2363]: I0617 20:54:35.639020 2363 rpm-ostree.go:261] Running captured: rpm-ostree update --install NetworkManager-ovs --install glusterfs --install glusterfs-fuse --install qemu-guest-agent
jun 17 20:54:36 localhost machine-config-daemon[2363]: I0617 20:54:36.370100 2363 update.go:439] Rolling back applied changes to OS due to error: error running rpm-ostree update --install NetworkManager-ovs --install glusterfs --install glusterfs-fuse --install qemu-guest-agent: error: "NetworkManager-ovs" is already provided by: NetworkManager-ovs-1:1.30.4-1.fc34.x86_64. Use --allow-inactive to explicitly require it.
jun 17 20:54:36 localhost machine-config-daemon[2363]: : exit status 1
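The rejected layering step can be reproduced by hand on the node (a sketch, mirroring the command from the log above; on FCOS 34 NetworkManager-ovs already ships in the base image, which is why rpm-ostree refuses the plain install and suggests --allow-inactive):

# The exact command the MCD runs, taken from the log above; fails on FCOS 34
sudo rpm-ostree update --install NetworkManager-ovs --install glusterfs --install glusterfs-fuse --install qemu-guest-agent

# rpm-ostree's own suggestion for a package the base image already provides
sudo rpm-ostree install --allow-inactive NetworkManager-ovs

Whether --allow-inactive is the right fix is up to the MCO / okd-machine-os side; this just shows why the firstboot service bails out.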
I did a clean install starting with 4.7.0-0.okd-2021-05-22-050008, updated to 4.7.0-0.okd-2021-06-04-191031, and then to 4.7.0-0.okd-2021-06-13-090745. I didn't see this problem in 4.7.0-0.okd-2021-05-22-050008 or 4.7.0-0.okd-2021-06-04-191031.

Version
4.7.0-0.okd-2021-06-13-090745 - VMware IPI

How reproducible
100% reproducible. I did a clean install of 4.7.0-0.okd-2021-05-22-050008 and upgraded through 4.7.0-0.okd-2021-06-04-191031 to 4.7.0-0.okd-2021-06-13-090745.

Log bundle

@vrutkovs
Member

vrutkovs commented Jun 17, 2021

Relevant Slack thread: https://kubernetes.slack.com/archives/C6BRQSH2S/p1623781138240100

Interestingly enough, I can't reproduce it on a clean install of 4.7.0-0.okd-2021-06-13-090745.
I'll check if it's happening on the 2021-06-04 -> 2021-06-13 upgrade and then try the scale-up.

@aalgera
Author

aalgera commented Jun 17, 2021

I started with 4.7.0-0.okd-2021-05-22-050008, which still uses FCOS 33.

@Udbv

Udbv commented Jun 18, 2021

@aalgera you could check the MachineConfig 99-worker-okd-extensions, and if NetworkManager-ovs is listed there, remove it.
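Something along these lines should show whether the extension is still carried by the worker pool (a sketch; adjust the object name if your pool differs):

# Inspect the extensions in the MachineConfig
oc get machineconfig 99-worker-okd-extensions -o yaml

# Drop NetworkManager-ovs from the list if it is present
oc edit machineconfig 99-worker-okd-extensions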

@vrutkovs
Member

vrutkovs commented Jun 18, 2021

This should be resolved by openshift/okd-machine-os#145 and vrutkovs/machine-config-operator@90c5c91

@uselessidbr

@aalgera you could check the MachineConfig 99-worker-okd-extensions, and if NetworkManager-ovs is listed there, remove it.

I removed all extensions, and then the MCO went into a degraded state because the existing nodes could not uninstall packages that were no longer requested. Since the node isn't in production yet, I just deleted the node and the MCO reconciled. I think there is also a Red Hat KB article with a workaround that avoids removing the node.

After that I just scaled up new nodes with success.
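For reference, the workaround above boils down to roughly this (a sketch; node and machine names are placeholders):

# Confirm the machine-config operator / worker pool is degraded
oc get clusteroperator machine-config
oc get machineconfigpool worker

# The stuck worker was not in production, so deleting it let the MCO reconcile
oc delete node <stuck-worker-node>
oc delete machine <stuck-worker-machine> -n openshift-machine-api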

@vrutkovs
Member

Verified the scale-up works again: installed 06-04, upgraded to 06-13 and then to the 2021-06-19-191547 nightly. The upgrade worked correctly with 99-okd-master/worker-extensions, and the machineset can be scaled up.
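The scale-up check itself is just the usual machineset bump (a sketch; the machineset name and replica count are placeholders):

# Bump the worker machineset and watch the new Machine and Node come up
oc get machinesets -n openshift-machine-api
oc scale machineset <worker-machineset> -n openshift-machine-api --replicas=<n>
oc get machines -n openshift-machine-api -w
oc get nodes -w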


@yaroslavkasatikov

yaroslavkasatikov commented Jul 14, 2021

Hi @vrutkovs

We are seeing this issue again in #441.
