
S3 Snapshot Retention policy is too aggressive #9866

Open · maggie44 opened this issue Apr 4, 2024 · 12 comments

maggie44 commented Apr 4, 2024

Describe the bug:
A PR was introduced that prunes S3 etcd backups by date.

The default etcd-snapshot-retention is 5.

Since this PR, only 5 total snapshots are allowed in S3 buckets, rather than 5 per control plane. This means that for my cluster with three control planes, I only have one backup for one of them, and the backups for the others are only from today (total of 5). As I add more control planes, presumably some are not going to be backed up at all as they will be pruned immediately in favour of the latest by date (one more control plane would push out that top fsn1 backup completely):

[Screenshot 2024-04-04 at 17:16:40: S3 snapshot listing]

If my understanding here is correct, this is a significant breaking change in the patch release that has reduced the number of backups available for people's clusters, potentially leading to nasty surprises when they go back to check them.

A potential workaround is to manually set etcd-snapshot-retention to number-of-control-planes * 5, but that is a lot of custom configuration, and I'm not sure this need is understood or documented.
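For illustration, a minimal sketch of that workaround as a k3s config file, assuming three control-plane nodes and otherwise stock snapshot settings (bucket and endpoint are placeholders); note that the same knob also governs local snapshot retention:

```yaml
# /etc/rancher/k3s/config.yaml on each server (control-plane) node.
# Workaround sketch: set retention to roughly (number of control planes) * 5
# so that date-based pruning on S3 still leaves ~5 snapshots per node.
etcd-snapshot-retention: 15                   # 3 control planes * 5
etcd-snapshot-schedule-cron: "0 */12 * * *"   # default schedule, shown for context
etcd-s3: true
etcd-s3-bucket: my-k3s-snapshots              # placeholder
etcd-s3-endpoint: s3.example.com              # placeholder
```

The drawback described above still applies: local retention rises along with S3 retention, and the multiplier has to be kept in sync by hand as nodes are added or removed.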

A similar issue is reported here: rancher/rke2#5216

Expected behavior:
5 backups per control-plane.

Perhaps the default of 5 should be raised to at least cover the common scenario of three control planes, or the S3 retention should be etcd-snapshot-retention × number of control planes. Or perhaps the original behaviour should be restored and a separate process added for cleaning up orphaned snapshots older than some threshold. Ideally there would also be some way to communicate the impact this is having on existing clusters.

brandond (Member) commented Apr 5, 2024

This is discussed in some detail at rancher/rke2#5216 (comment), which I see you already linked. As mentioned there, the current behavior is intentional; we will likely eventually add a separate control over the number of s3 snapshots.

Multiplying the retention count by the number of etcd nodes is also an interesting idea, but this could lead to overly aggressive pruning following a restore when the number of etcd nodes is temporarily reduced.

There is also another related issue:

brandond changed the title from "Etcd S3 snapshots being pruned by date has reduced number of backups" to "S3 Snapshot Retention policy is too aggressive" on Apr 5, 2024
github-actions bot commented

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

HoustonDad commented

Not stale


horihel commented Jul 9, 2024

hello friendly bot - s3 backup retention policies are still broken, no direct workaround available. Let's hope everyone has soft-delete features in their s3-buckets to make up for it.

horihel commented Jul 9, 2024

That said, does anyone know how to make an 'undeleted' backup known to Rancher so it can be restored?
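This isn't a Rancher-side answer, but for reference, a sketch of checking what k3s itself can see on S3 from a server node, using the --s3* flag spellings from the k3s etcd-snapshot docs (bucket, endpoint, and credentials are placeholders); a restored object that shows up here may then be selectable as a restore target:

```sh
# List the etcd snapshots this node knows about, including those on S3.
# All --s3-* values below are placeholders.
k3s etcd-snapshot ls \
  --s3 \
  --s3-bucket=my-k3s-snapshots \
  --s3-endpoint=s3.example.com \
  --s3-access-key=REDACTED \
  --s3-secret-key=REDACTED
```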


HoustonDad commented

Not stale


bwenrich commented

Not stale.

Q: for the use case with multiple control planes, would it be helpful if each cluster used a different path/prefix in the S3 bucket?

That might avoid the problem in the linked RKE2 issue (data from deleted nodes never getting pruned), and also prevent clusters from pruning data belonging to each other.

If this cannot be done, I would wonder if it's even safe to have multiple clusters sharing the same S3 bucket, considering that "misconfiguring" the pruning on one of them could have a wide impact to all of their snapshots.

brandond (Member) commented Oct 17, 2024

No, all nodes should use the same bucket and prefix; this is assumed (and we should probably document it as a requirement). If they do not, nodes will remove S3 snapshot records created by other nodes, thinking that they are "missing" from S3, since all nodes are expected to have the same view of what exists on S3.

The snapshots on S3 do not "belong" to the node that uploaded them, they are accessible to and can be restored/listed/deleted/pruned by any node. Having them owned by a node would not make a lot of sense, given that S3 is supposed to be a stable external system available to any cluster member, even one that was just created from scratch that perhaps needs to restore an old snapshot in order to restore the cluster from complete loss of all nodes.
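For reference, that "restore from complete loss" path is the documented cluster-reset flow; a sketch on a freshly installed server node, with every value below a placeholder:

```sh
# Restore a cluster onto a brand-new server node from a snapshot stored on S3.
# Snapshot name, bucket, endpoint, and credentials are all placeholders.
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=etcd-snapshot-fsn1-cp1-1712245500 \
  --etcd-s3 \
  --etcd-s3-bucket=my-k3s-snapshots \
  --etcd-s3-endpoint=s3.example.com \
  --etcd-s3-access-key=REDACTED \
  --etcd-s3-secret-key=REDACTED
```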

horihel commented Oct 18, 2024

Would it make more sense, then, to disable the built-in retention altogether and instead use retention policies built into the S3 backend? Is k3s/RKE2 able to handle that?
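For what it's worth, a sketch of what the S3-side approach could look like with an AWS-style lifecycle rule on a hypothetical snapshot prefix (applied with something like aws s3api put-bucket-lifecycle-configuration); the catch, per the comment above, is that objects expired behind k3s's back would be treated as "missing" snapshot records, so it's not clear this can simply replace the built-in retention:

```json
{
  "Rules": [
    {
      "ID": "expire-old-etcd-snapshots",
      "Status": "Enabled",
      "Filter": { "Prefix": "k3s-snapshots/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```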
