
S3 Snapshot Retention policy is too aggressive #9866

Open · maggie44 opened this issue Apr 4, 2024 · 12 comments

maggie44 commented Apr 4, 2024

Describe the bug:
A PR was introduced that prunes S3 etcd backups by date.

The default etcd-snapshot-retention is 5.

Since this PR, only 5 total snapshots are allowed in S3 buckets, rather than 5 per control plane. This means that for my cluster with three control planes, I only have one backup for one of them, and the backups for the others are only from today (total of 5). As I add more control planes, presumably some are not going to be backed up at all as they will be pruned immediately in favour of the latest by date (one more control plane would push out that top fsn1 backup completely):

[Screenshot 2024-04-04 at 17:16:40: S3 snapshot listing]

If my understanding here is correct, this is a significant breaking change in the patch release that has reduced the number of backups available for people's clusters, potentially leading to nasty surprises when they go back to check them.

A potential workaround is to manually set etcd-snapshot-retention to number-of-control-planes * 5, but that is a lot of custom configuration, and I'm not sure this need is understood or documented.
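For illustration, a minimal sketch of that workaround as a k3s config file, assuming three control-plane nodes and otherwise stock snapshot settings (bucket and endpoint are placeholders); note that the same knob also governs local snapshot retention:

```yaml
# /etc/rancher/k3s/config.yaml on each server (control-plane) node.
# Workaround sketch: set retention to roughly (number of control planes) * 5
# so that date-based pruning on S3 still leaves ~5 snapshots per node.
etcd-snapshot-retention: 15                   # 3 control planes * 5
etcd-snapshot-schedule-cron: "0 */12 * * *"   # default schedule, shown for context
etcd-s3: true
etcd-s3-bucket: my-k3s-snapshots              # placeholder
etcd-s3-endpoint: s3.example.com              # placeholder
```

The drawback described above still applies: local retention rises along with S3 retention, and the multiplier has to be kept in sync by hand as nodes are added or removed.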

A similar issue is reported here: rancher/rke2#5216

Expected behavior:
5 backups per control-plane.

Perhaps the default of 5 should be raised to at least cover the common scenario of three control planes, or the S3 retention should be etcd-snapshot-retention × number of control planes. Or perhaps the original behaviour should be restored and a separate process added for cleaning up orphaned snapshots older than some threshold. Ideally there would also be some way to communicate the impact this is having on existing clusters.

brandond (Member) commented Apr 5, 2024

This is discussed in some detail at rancher/rke2#5216 (comment), which I see you already linked. As mentioned there, the current behavior is intentional; we will likely eventually add a separate control over the number of s3 snapshots.

Multiplying the retention count by the number of etcd nodes is also an interesting idea, but this could lead to overly aggressive pruning following a restore when the number of etcd nodes is temporarily reduced.

There is also another related issue:

brandond changed the title from "Etcd S3 snapshots being pruned by date has reduced number of backups" to "S3 Snapshot Retention policy is too aggressive" on Apr 5, 2024
github-actions bot commented

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

HoustonDad commented

Not stale


horihel commented Jul 9, 2024

hello friendly bot - s3 backup retention policies are still broken, no direct workaround available. Let's hope everyone has soft-delete features in their s3-buckets to make up for it.

horihel commented Jul 9, 2024

That said, does anyone know how to make an 'undeleted' backup known to Rancher so it can be restored?
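This isn't a Rancher-side answer, but for reference, a sketch of checking what k3s itself can see on S3 from a server node, using the --s3* flag spellings from the k3s etcd-snapshot docs (bucket, endpoint, and credentials are placeholders); a restored object that shows up here may then be selectable as a restore target:

```sh
# List the etcd snapshots this node knows about, including those on S3.
# All --s3-* values below are placeholders.
k3s etcd-snapshot ls \
  --s3 \
  --s3-bucket=my-k3s-snapshots \
  --s3-endpoint=s3.example.com \
  --s3-access-key=REDACTED \
  --s3-secret-key=REDACTED
```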


HoustonDad commented

Not stale


bwenrich commented

Not stale.

Q: for the use case with multiple control planes, would it be helpful if each cluster used a different path/prefix in the S3 bucket?

That might avoid the problem in the linked RKE2 issue (data from deleted nodes never getting pruned), and also prevent clusters from pruning data belonging to each other.

If this cannot be done, I would wonder if it's even safe to have multiple clusters sharing the same S3 bucket, considering that "misconfiguring" the pruning on one of them could have a wide impact to all of their snapshots.

brandond (Member) commented Oct 17, 2024

No, all nodes should use the same bucket and prefix; this is assumed (and we should probably document it as a requirement). If they do not, nodes will remove S3 snapshot records created by other nodes, thinking that they are "missing" from S3, since all nodes are expected to have the same view of what exists on S3.

The snapshots on S3 do not "belong" to the node that uploaded them, they are accessible to and can be restored/listed/deleted/pruned by any node. Having them owned by a node would not make a lot of sense, given that S3 is supposed to be a stable external system available to any cluster member, even one that was just created from scratch that perhaps needs to restore an old snapshot in order to restore the cluster from complete loss of all nodes.
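For reference, that "restore from complete loss" path is the documented cluster-reset flow; a sketch on a freshly installed server node, with every value below a placeholder:

```sh
# Restore a cluster onto a brand-new server node from a snapshot stored on S3.
# Snapshot name, bucket, endpoint, and credentials are all placeholders.
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=etcd-snapshot-fsn1-cp1-1712245500 \
  --etcd-s3 \
  --etcd-s3-bucket=my-k3s-snapshots \
  --etcd-s3-endpoint=s3.example.com \
  --etcd-s3-access-key=REDACTED \
  --etcd-s3-secret-key=REDACTED
```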

horihel commented Oct 18, 2024

Would it make more sense, then, to disable the built-in retention altogether and instead use retention policies built into the S3 backend? Is k3s/RKE2 able to handle that?
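For what it's worth, a sketch of what the S3-side approach could look like with an AWS-style lifecycle rule on a hypothetical snapshot prefix (applied with something like aws s3api put-bucket-lifecycle-configuration); the catch, per the comment above, is that objects expired behind k3s's back would be treated as "missing" snapshot records, so it's not clear this can simply replace the built-in retention:

```json
{
  "Rules": [
    {
      "ID": "expire-old-etcd-snapshots",
      "Status": "Enabled",
      "Filter": { "Prefix": "k3s-snapshots/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```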
