-
A recent MicroOS update (after March 10th, 2024) left my ARM nodes unable to connect to the internet. The nodes otherwise seem to work perfectly fine and are reachable via the VNET, but have no internet connectivity. This has happened on two nodes so far, and a manual rollback fixes the issue. I haven't had time to investigate yet, but maybe someone else is facing a similar issue?
-
I'm also facing this issue with one of my nodes. I noticed a difference between the secondary (internal) interface names. The working node has an interface called
-
I'm having the same issue. Unfortunately, I never set up backups or snapshots, too bad :-) I managed to SSH into the control planes via a jump host, but I'm not sure how to resolve the situation. Ideas are much appreciated.
-
@mysticaltech I narrowed the issue down, and it looks like in my case the /etc directory is more or less lost during the upgrade. I noticed that the [...] This seems to be a really weird problem, somehow related to transactional-update and the /etc overlayfs. I can't grasp how this happens. Any ideas?
-
Hi all! I've run into a similar problem. After upgrading k3s to 1.29.3, an agent node became unavailable because of connection problems: SSH over the external IP fails with connection timeout errors. However, I can still reach the node via SSH through another node over the local (internal) IP. Once on the node, I can see that there is no longer an eth1 interface:
And I have no idea how to deal with it. My thoughts are:
And I don't know whether that would work or not.
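For anyone comparing a broken node against a working one, here is a minimal Python sketch of the two checks that keep coming up in this thread: which interfaces exist, and whether any of them carries an IPv4 default route. It assumes a Linux node, where `/proc/net/route` lists routes with the destination in hex (`00000000` is the default route). The function names and the sample table are made up for illustration and are not taken from the project.

```python
import socket


def default_route_ifaces(route_table: str) -> list[str]:
    """Return interface names that carry an IPv4 default route.

    `route_table` is the text of /proc/net/route: a header line followed
    by whitespace-separated rows (Iface, Destination, Gateway, ...).
    """
    ifaces = []
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Destination 00000000 means 0.0.0.0, i.e. the default route.
        if len(fields) >= 2 and fields[1] == "00000000" and fields[0] not in ifaces:
            ifaces.append(fields[0])
    return ifaces


def report() -> None:
    """Print the interfaces present on this node and its default routes."""
    names = [name for _, name in socket.if_nameindex()]
    print("interfaces:", names)
    with open("/proc/net/route") as fh:
        routes = default_route_ifaces(fh.read())
    print("default route via:", routes or "none found")


# Hypothetical sample in /proc/net/route format, for illustration only.
SAMPLE = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "eth0\t00000000\t0101A8C0\t0003\t0\t0\t100\t00000000\t0\t0\t0\n"
    "eth1\t0000000A\t00000000\t0001\t0\t0\t0\t0000FFFF\t0\t0\t0\n"
)
```

Calling `report()` on both a healthy and a broken node and diffing the output should show whether an interface has disappeared or been renamed, and whether the default route is gone. A missing default route is only one possible cause of lost connectivity, so treat this as a first check rather than a diagnosis.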
-
@andi0b @v-petukhov FYI, in #697 (comment) @maaft confirmed that re-applying the cloud-init commands manually works. For the time being, I will make sure that happens automatically, so that whatever (hopefully temporary) issue there is with snapshot syncing is not felt. @jhass FYI, see the thread above: @andi0b has pretty much cornered the underlying issue.
-
@andi0b @v-petukhov If you could try running those commands and let me know, it would be great (in case your nodes happen to still be up).
-
@mysticaltech I think it's better to continue the discussion here (from #1324 (comment))
Sadly, I think this is a conceptual issue. I don't know if you took the time to read the long posts I wrote earlier in this discussion, but it all comes down to the /etc rsync of transactional-update (at least in my case). It's this catch block in the source code: https://github.com/openSUSE/transactional-update/blob/2d77e2b24a3e958ebede49622c68f46e9adc377f/lib/Overlay.cpp#L106 If the base snapshot is no longer there (which happens after 10-20 days without a reboot), /etc is no longer merged/rsynced. I don't really get why, because the base snapshot isn't actually required: the /etc overlays are kept even after the snapshot is removed. I thought about opening an issue with transactional-update, but I don't feel confident enough to do that, because I'm still missing a lot of details about its inner workings.
-
I finally drained all the broken nodes and had some time to test. I ran the cloud-init commands before rebooting, after rebooting (via a jump host on the private network), and after another reboot:
cloud-init single --frequency always --name write_files
cloud-init single --frequency always --name runcmd
I didn't notice any changes, and the nodes are still far from fine (no internet connectivity, k3s not starting up). Switching to new nodes worked without issues, so no surprises there.
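In case others want to repeat the same test on their own nodes, the two `cloud-init single` invocations quoted above can be wrapped in a short Python sketch. The flags and module names are exactly the ones from those commands. The helper names are hypothetical, and the script would need to run as root on the node itself.

```python
import subprocess

# Module names taken from the two commands quoted above.
MODULES = ("write_files", "runcmd")


def cloud_init_single_cmd(module: str) -> list[str]:
    """Build the `cloud-init single` command line for one module."""
    return ["cloud-init", "single", "--frequency", "always", "--name", module]


def rerun_modules(modules=MODULES, runner=subprocess.run) -> list[int]:
    """Run each module in order and return the exit codes.

    `runner` is injectable so the function can be dry-run on a machine
    without cloud-init installed; by default it is subprocess.run.
    """
    codes = []
    for module in modules:
        result = runner(cloud_init_single_cmd(module), check=False)
        codes.append(result.returncode)
    return codes
```

`rerun_modules()` returns one exit code per module, which makes it easier to tell a command that failed outright from one that ran but changed nothing, as happened on my nodes.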
-
A new transactional-update release (version 4.6.8) fixing the error is currently staged in https://build.opensuse.org/request/show/1172470. I'm really sorry for the trouble!