Properly handle overlay syncing failures #116
Previously, the code assumed that syncing always succeeds and preserved only the lowest layer of the parent snapshot. As a result, the data of the dropped layers was lost whenever syncing did not actually happen. Detect if syncing did not happen and preserve the layers in that case.
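The decision described above can be illustrated with a minimal Python sketch. This is not the tukit code (which is written in C++); the function name `lower_layers_for_new_snapshot`, the `synced` flag, and the representation of the overlay stack as a list of paths ordered from topmost to lowest are all assumptions made purely for illustration.

```python
from typing import List


def lower_layers_for_new_snapshot(parent_layers: List[str], synced: bool) -> List[str]:
    """Return the overlay lower layers a new snapshot should keep.

    Hypothetical model: `parent_layers` is the parent snapshot's overlay
    stack, ordered from the topmost layer to the lowest one.
    """
    if not parent_layers:
        return []
    if synced:
        # The upper layers were synced, so only the lowest layer of the
        # parent snapshot is still needed (the old, unconditional behavior).
        return [parent_layers[-1]]
    # Syncing did not happen: dropping layers would lose their data,
    # so the whole stack is preserved.
    return list(parent_layers)
```

With a three-layer stack, the sketch keeps only the lowest layer after a successful sync and keeps all three layers otherwise.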
Seems like we're affected by the same issue.
I'm currently working on a rework of overlay handling so that it no longer relies on older snapshots. However, I'm wondering, @andreygolev: the problem Vogtinator was fixing only occurs when the parent snapshot of the current one is deleted. This only happens when you create multiple (by default > 5) new snapshots before a reboot, delete the previous snapshot manually, or configure snapper to preserve only one snapshot. Is this the case in your setup?
According to the logs, there was just 1 reboot in 7 days for the last affected node, while transactional-update is running daily.
It looks like users of the kube-hetzner project are strongly affected by this or a similar issue. Some nodes seem to revert to the stock /etc after a reboot, which is a catastrophic situation: no services start up and even the network settings are gone (node unreachable). My working theory of how we run into this issue is roughly:
This working theory is supported by finding messages like
We know that it is recommended to reboot as soon as possible after running transactional-update, but the reality is that the reboot does not always happen in a timely manner. There is a longer discussion here: kube-hetzner/terraform-hcloud-kube-hetzner#1287 Users have reported a few more issues, but most remain unsolved, because people might just recreate the nodes, switch to another project, or give up instead of investigating thoroughly.
@Vogtinator @sysrich That will be a life saver for our project https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner, we are losing nodes because of that issue.
Thanks a lot to all involved (and especially @Vogtinator for the patch and @andi0b for the detailed breakdown here and in kube-hetzner/terraform-hcloud-kube-hetzner#1287 (reply in thread)). It seems this problem affects several people, so I won't wait for the reworked overlay handling, but will apply the pull request immediately.
@laenion Thanks! I just want to add that I didn't test this PR at all, or test if this fixes our issue. I stumbled upon it and wanted to highlight the severity of the issue.
No worries: I tested it and also verified that it actually solves the problem ;-)
https://build.opensuse.org/request/show/1172470 by user fos + dimstar_suse
- Version 4.6.8
  - tukit: Properly handle overlay syncing failures: If the system would not be rebooted and several snapshots accumulated in the meantime, it was possible that the previous base snapshot - required for /etc syncing - was deleted already. In that case changes in /etc might have been reset. [gh#openSUSE/transactional-update#116] [gh#kube-hetzner/terraform-hcloud-kube-hetzner#1287]
  - soft-reboot: Log requested reboot type
  - soft-reboot: Don't force hard reboot on version change only
- Version 4.6.7
  - Add support for snapper 0.11.0; also significantly decreases cleanup time [boo#1223504]