etcd snapshot cleanup fails if node name changes #3714
Comments
I'll talk this over with the team. On the S3 side, the correct behavior is probably to retain
Still needs to be worked on.
It not only fails to upload new backups, but also fills up the masters' disk space with local snapshots that are not cleaned up once the ConfigMap grows too large and fails to apply. This leads to an incident, as it puts the master nodes under disk pressure.
@riuvshyn we are working on this separately from the snapshot list configmap issue. This issue will serve only to track the problem of snapshot cleanup handling only snapshots whose name contains the current node's hostname.
/backport v1.26.8+rke2r1
/backport v1.25.13+rke2r1
/backport v1.24.17+rke2r1
Validated on master branch with commit c3ec545
Environment Details
Infrastructure
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Config.yaml: Main ETCD SERVER (+CONTROL PLANE) CONFIG:
Sample Secondary Etcd, control plane config.yaml:
AGENT CONFIG:
Additional files
Testing Steps
Note: First round node-names:
Using Version:
4a. Also check the s3 bucket/folder in aws to see the snapshots listed.
7a. Also check the s3 bucket/folder in aws to see the snapshots listed.
Replication Results:
SETUP:
Node-names in order of update for the main etcd server:
Final output of snapshot list - after multiple node name changes:
As we can see above, previous snapshots with different node-names are still listed and not cleaned up.
Validation Results:
Node names in order of update for the main etcd server:
After updating node-names 2 times, the snapshots listed are:
As we can see, the previous snapshots with old node-names are no longer retained and get cleaned up.
Environmental Info:
RKE2 Version:
rke2 version v1.21.14+rke2r1 (514ae51)
go version go1.16.14b7
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
We have multiple rke2 clusters, but all of them have at least 3 control plane nodes and multiple workers
Describe the bug:
We have multiple rke2 clusters and all of them have automatic etcd snapshots enabled (taken every 5 hours). We also configured s3 uploading of those snapshots. Recently, we found that no s3 snapshots were being uploaded anymore. We investigated the issue and found the following rke2-server output:
I checked the code and found that rke2 is leveraging the etcd snapshot capabilities from k3s for this. A function is executed periodically on all control plane nodes. The function takes local snapshots, uploads them to s3 (if configured) and also reconciles a configmap which contains all snapshots and metadata about them. Looking at the code, it seems that the reconciliation of that "sync" configmap is based on the name of the node which executes the etcd snapshot. The same goes for the s3 retention functions (only old objects which contain the node name will be cleaned up). As we are replacing all our nodes in the clusters whenever there is a new flatcar version, the node names change quite often. This leads to orphaned entries in the config map and also orphaned objects in the s3 buckets (although the latter could be worked around with a lifecycle policy).
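To illustrate the behavior, here is a minimal Go sketch (not the actual k3s/rke2 retention code; the naming scheme `etcd-snapshot-<node>-<timestamp>` and the `pruneForNode` helper are assumptions made for the example) showing how filtering retention candidates by the current hostname means snapshots taken under an old node name are never selected for deletion:

```go
// Hypothetical illustration of node-name based retention; not the actual
// k3s/rke2 code. Snapshot names are assumed to follow the pattern
// "etcd-snapshot-<node>-<timestamp>".
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pruneForNode keeps at most `retention` snapshots whose name embeds the
// current nodeName and returns the ones that would be deleted. Snapshots
// taken under an old node name never match the prefix, so they are never
// selected for deletion and accumulate indefinitely.
func pruneForNode(snapshots []string, nodeName string, retention int) []string {
	prefix := fmt.Sprintf("etcd-snapshot-%s-", nodeName)
	var mine []string
	for _, s := range snapshots {
		if strings.HasPrefix(s, prefix) {
			mine = append(mine, s)
		}
	}
	sort.Strings(mine) // unix timestamps sort lexically within one node name
	if len(mine) <= retention {
		return nil
	}
	return mine[:len(mine)-retention]
}

func main() {
	snapshots := []string{
		"etcd-snapshot-old-master-1-1650000000",
		"etcd-snapshot-old-master-1-1650018000",
		"etcd-snapshot-new-master-1-1650036000",
		"etcd-snapshot-new-master-1-1650054000",
		"etcd-snapshot-new-master-1-1650072000",
	}
	// Only snapshots matching the current node name are pruned; the two
	// "old-master-1" snapshots are skipped and stay behind forever.
	fmt.Println(pruneForNode(snapshots, "new-master-1", 2))
}
```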
Are there any ideas what could be done to fix this?
I found this bug report in the rancher repo which describes the configmap growing too large.
Steps To Reproduce:
Enable etcd snapshots and s3 uploading. After replacing the control plane nodes with new machines (new names), there will be orphaned entries in the 'rke2-etcd-snapshots' configmap. Once the configmap grows too large, no new snapshots will be uploaded to s3 anymore. A sample configuration is sketched below.
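For reference, scheduled snapshots and S3 uploads can be enabled with something like the following in `/etc/rancher/rke2/config.yaml` (bucket, region, folder, credentials, and schedule are placeholder values, not the configuration from the affected clusters):

```yaml
# Example values only -- adjust bucket, region, credentials and schedule.
etcd-snapshot-schedule-cron: "0 */5 * * *"   # snapshot every 5 hours
etcd-snapshot-retention: 5
etcd-s3: true
etcd-s3-bucket: my-etcd-snapshots
etcd-s3-region: eu-central-1
etcd-s3-folder: cluster-a
etcd-s3-access-key: <access-key>
etcd-s3-secret-key: <secret-key>
```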
Expected behavior:
The sync configmap should only contain the snapshots of the cluster's current nodes; entries for all other nodes should be removed.