S3 Snapshot Retention policy is too aggressive #9866
Comments
This is discussed in some detail at rancher/rke2#5216 (comment), which I see you already linked. As mentioned there, the current behavior is intentional; we will likely eventually add a separate control over the number of S3 snapshots. Multiplying the retention count by the number of etcd nodes is also an interesting idea, but this could lead to overly aggressive pruning following a restore when the number of etcd nodes is temporarily reduced. There is also another related issue:
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Not stale
hello friendly bot - s3 backup retention policies are still broken, no direct workaround available. Let's hope everyone has soft-delete features in their s3-buckets to make up for it.
that said, does anyone know how to make an 'undeleted' backup known to rancher so it can be restored?
not stale
Not stale. Q: for the use case with multiple control planes, would it be helpful if each cluster used a different path/prefix in the S3 bucket? That might avoid the issue in the linked RKE issue (about data from deleted nodes never getting pruned), but also prevent clusters from pruning data belonging to each other. If this cannot be done, I would wonder if it's even safe to have multiple clusters sharing the same S3 bucket, considering that "misconfiguring" the pruning on one of them could have a wide impact on all of their snapshots.
No, all the nodes should use the same bucket and prefix. It is assumed (and we should probably document this as a requirement) that all nodes use the same bucket and prefix. If this does not happen, nodes will remove S3 snapshot records created by other nodes, thinking that they are "missing" from S3, since they are all expected to have the same view of what exists on S3. The snapshots on S3 do not "belong" to the node that uploaded them; they are accessible to, and can be restored/listed/deleted/pruned by, any node. Having them owned by a node would not make a lot of sense, given that S3 is supposed to be a stable external system available to any cluster member, even one that was just created from scratch and perhaps needs to restore an old snapshot in order to recover the cluster from complete loss of all nodes.
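To make that requirement concrete, here is a minimal sketch of the S3-related server configuration that would need to be identical on every etcd node of a given cluster; the endpoint, bucket, and folder values below are placeholders, not settings taken from this issue:

```yaml
# /etc/rancher/k3s/config.yaml -- identical on every server (etcd) node of the cluster
etcd-s3: true
etcd-s3-endpoint: "s3.example.com"   # placeholder endpoint
etcd-s3-bucket: "etcd-snapshots"     # placeholder bucket; shared by all nodes
etcd-s3-folder: "prod-cluster"       # placeholder prefix; must match on all nodes
etcd-s3-access-key: "<access-key>"
etcd-s3-secret-key: "<secret-key>"
```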
Would it make more sense then to disable the built-in retention altogether and instead use retention policies built into the S3 backend? Is k3s/rke2 able to handle that?
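For reference, the bucket-side retention being proposed would look roughly like the following S3 lifecycle rule (the prefix and expiration window are placeholders, applied with something like `aws s3api put-bucket-lifecycle-configuration`); whether k3s/rke2 keeps its snapshot records consistent when objects are expired behind its back is exactly the open question here:

```json
{
  "Rules": [
    {
      "ID": "expire-old-etcd-snapshots",
      "Status": "Enabled",
      "Filter": { "Prefix": "prod-cluster/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```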
Describe the bug:
A PR was introduced that prunes S3 etcd backups by date:
* https://docs.k3s.io/release-notes/v1.24.X#release-v12417k3s1
The default `etcd-snapshot-retention` is 5. Since this PR, only 5 total snapshots are allowed in S3 buckets, rather than 5 per control plane. This means that for my cluster with three control planes, I only have one backup for one of them, and the backups for the others are only from today (total of 5). As I add more control planes, presumably some are not going to be backed up at all, as they will be pruned immediately in favour of the latest by date (one more control plane would push out that top fsn1 backup completely).
If my understanding here is correct, this is a significant breaking change in the patch release that has reduced the number of backups available for people's clusters, potentially leading to nasty surprises when they go back to check them.
A potential solution is to manually change `etcd-snapshot-retention` to `number-of-control-planes * 5`, but that is a lot of custom configuration and I'm not sure this need is understood/documented. A similar issue is reported here: rancher/rke2#5216
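For anyone applying that workaround in the meantime, it amounts to a one-line change per server node; the value below is just the arithmetic for a three-control-plane cluster (3 × 5 = 15), not an upstream recommendation, and it presumably also raises the number of local snapshots each node keeps:

```yaml
# /etc/rancher/k3s/config.yaml on each server node
# 3 control planes x 5 snapshots each = 15 snapshots retained across the bucket
etcd-snapshot-retention: 15
```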
Expected behavior:
5 backups per control-plane.
Perhaps the default of 5 should be raised to at least cover the common scenario of three control planes; or the S3 retention should be `etcd-snapshot-retention x number of control planes`. Or perhaps restore the original behaviour and implement a different process for cleaning up orphaned snapshots with `age > n`. Ideally, there should also be some way to communicate the impact this is having on existing clusters.