[wip]: Provide automated cluster backups #359
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: retroflexer

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
> ## Proposal
>
> ### A new container is added to the etcd static pod to run automated backups.
> The objectives of the new container are:
This might be easier to visualize with a tree view of the backup directory to help establish the overall structure and then note details below
Added a screenshot with tree diagram.
> * Add a subcommand to `cluster-etcd-operator`
> * Change the etcd spec `etcd-pod.yaml` to run the subcommand as a separate container.
>
> ## Risks and Mitigations
Some risks of the sidecar pattern in this context:
- Future changes to the sidecar container spec will require a new etcd pod revision even though the backup mechanism is actually orthogonal to the managed etcd
- Resource requirements of the backup controller are conflated with the requirements for etcd within the pod (what are the possible impacts?)
- Health/liveness checks of the backup controller are conflated with etcd within the pod
Are there any backup controller failure modes which could either take down the etcd pod itself or cause the etcd pod to erroneously report trouble with etcd itself?
Another general set of risk-related thoughts I wanted to get out of my head...
I think these statements are true:
- The impact of the platform accidentally killing (or otherwise blowing up) any given etcd container or pod is extremely high
- The impact of such a problem manifesting against all etcd pods simultaneously is probably catastrophic
If so, then I also think it follows that the baseline risk of introducing any additional stuff to the etcd pod is probably pretty high to begin with.
The sidecar approach means adding a lot of new code inside all etcd pods. New code, and code which will have bugs. How confident are we in the boundaries between this sidecar and etcd?
All this is to say, let's pay careful attention to the risks section and try to think of all the ways this could go very wrong.
Add performance concerns around snapshot on each node.
Added a list of concerns.
> If so, then I also think it follows that the baseline risk of introducing any additional stuff to the etcd pod is probably pretty high to begin with.
This is not a natural result of those statements. Simple counter-example: point-to-point network checks. Your concerns are good, but your conclusion doesn't appear to flow to me.
> The sidecar approach means adding a lot of new code inside all etcd pods. New code, and code which will have bugs. How confident are we in the boundaries between this sidecar and etcd?
We are extremely confident in the kubelet's correct handling of a container-sized blast radius. You can see multiple sidecars, our willingness to add more, and the low risk presented by individual failures in the kube-apiserver.
Thanks for the thoughts, very helpful
> ## Drawbacks
>
> ## Alternatives
Need to talk about the non-sidecar approaches here
Listed them. Will elaborate.
A few notes:
> 1. On all masters, it writes a `/etc/kubernetes/static-pod-backups/backup-N/backup.env` file containing 4 environment variables: `CREATED`, `OCP_VERSION`, `ETCD_REVISION`, and `APISERVER_REVISION`.
> 2. On all masters, take an etcd snapshot: `etcdctl snapshot save /etc/kubernetes/static-pod-backups/backup-N/etcd-data/backup.db`.
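
For concreteness, a rough Go sketch of those two steps as they might run inside the backup container. The version and revision values, the `backup-1` directory, and the bare `etcdctl` invocation (no TLS flags or endpoints) are placeholders, not the actual implementation:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

// takeBackup writes the backup.env metadata file and then shells out to
// etcdctl for a snapshot. The version/revision values would be discovered
// from the cluster in a real implementation; here they are plain arguments.
func takeBackup(backupDir, ocpVersion, etcdRev, apiserverRev string) error {
	dataDir := filepath.Join(backupDir, "etcd-data")
	if err := os.MkdirAll(dataDir, 0o700); err != nil {
		return err
	}

	// backup.env with the four variables named in the proposal.
	env := fmt.Sprintf("CREATED=%s\nOCP_VERSION=%s\nETCD_REVISION=%s\nAPISERVER_REVISION=%s\n",
		time.Now().UTC().Format(time.RFC3339), ocpVersion, etcdRev, apiserverRev)
	if err := os.WriteFile(filepath.Join(backupDir, "backup.env"), []byte(env), 0o600); err != nil {
		return err
	}

	// etcdctl is already present in the etcd pod image; TLS flags and
	// endpoints are omitted here but would be required in practice.
	cmd := exec.Command("etcdctl", "snapshot", "save", filepath.Join(dataDir, "backup.db"))
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Illustrative values only; backup-N numbering and discovery are out of scope here.
	if err := takeBackup("/etc/kubernetes/static-pod-backups/backup-1", "4.4.0", "7", "12"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```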
We ideally want a backup on each node, but we don't want to cause performance issues; this needs to be considered.
> 1. Create a subcommand in cluster-etcd-operator to run as a container in the etcd static pod.
> 2. It runs in auto-pilot mode without requiring any new OpenShift API (for 4.4).
> 3. In future releases, cluster-etcd-operator will subsume this functionality with a backup controller (4.6+).
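
As a rough illustration of item 1, such a subcommand could be wired into the operator binary along the following lines. This assumes a cobra-style CLI and uses hypothetical flag names; it is not the actual cluster-etcd-operator code:

```go
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/spf13/cobra"
)

// runBackupLoop stands in for the real work: write backup.env, take an etcd
// snapshot, and prune old copies (each sketched elsewhere in this thread).
func runBackupLoop(backupDir string, interval time.Duration) error {
	for {
		fmt.Printf("would back up to %s\n", backupDir)
		time.Sleep(interval)
	}
}

// newBackupCommand registers a hypothetical "backup" subcommand.
func newBackupCommand() *cobra.Command {
	var interval time.Duration
	var backupDir string
	cmd := &cobra.Command{
		Use:   "backup",
		Short: "Periodically snapshot etcd from within the etcd static pod",
		RunE: func(cmd *cobra.Command, args []string) error {
			return runBackupLoop(backupDir, interval)
		},
	}
	cmd.Flags().DurationVar(&interval, "interval", time.Hour, "time between backups")
	cmd.Flags().StringVar(&backupDir, "backup-dir", "/etc/kubernetes/static-pod-backups", "directory for snapshots")
	return cmd
}

func main() {
	if err := newBackupCommand().Execute(); err != nil {
		os.Exit(1)
	}
}
```

The container in `etcd-pod.yaml` would then simply invoke the operator image with this subcommand, keeping the backup logic out of etcd itself.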
we need to stub out pruning
Yes. Added it.
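
A minimal sketch of what such a pruning stub might look like, assuming backups land in `backup-N` directories and retention is simply "keep the newest few"; both assumptions are illustrative rather than taken from the proposal:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// pruneBackups keeps the newest `keep` backup-* directories under dir and
// removes the rest. The retention count and naming scheme are assumptions.
func pruneBackups(dir string, keep int) error {
	matches, err := filepath.Glob(filepath.Join(dir, "backup-*"))
	if err != nil {
		return err
	}
	// Sort oldest-first by modification time so the tail of the slice is kept.
	sort.Slice(matches, func(i, j int) bool {
		fi, errI := os.Stat(matches[i])
		fj, errJ := os.Stat(matches[j])
		if errI != nil || errJ != nil {
			return matches[i] < matches[j]
		}
		return fi.ModTime().Before(fj.ModTime())
	})
	if len(matches) <= keep {
		return nil
	}
	for _, old := range matches[:len(matches)-keep] {
		if err := os.RemoveAll(old); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := pruneBackups("/etc/kubernetes/static-pod-backups", 3); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```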
The branch was force-pushed from a80e58f to 496278c, and later from 496278c to a7c959c.
So, @retroflexer early on wanted to explore the idea of cron jobs generally, and we've had a chance to talk about that approach more in depth. Summarizing some discussions we've had about leveraging CronJob as the "simplest thing that could possibly work"... Some possible constraints we have identified which can inform the design:
CronJob seems to offer some capabilities that match the needs:
So something like a CronJob that has an affinity for master nodes hosting etcd and some anti-affinity rules to help enforce serialization might be a simple path forward to consider in more depth.
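
For a sense of what that could look like, here is a hedged Go sketch of a CronJob pinned to master nodes with concurrent runs forbidden, which is what would keep backups serialized. The namespace, schedule, labels, image, and subcommand name are all assumptions, and as the following comments make clear this is not the approach the thread settled on:

```go
package main

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// backupCronJob sketches the shape discussed above: scheduled backups that run
// only on master nodes and never concurrently. All concrete values are illustrative.
func backupCronJob() *batchv1.CronJob {
	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-backup", Namespace: "openshift-etcd"},
		Spec: batchv1.CronJobSpec{
			Schedule:          "0 * * * *",              // hourly, illustrative
			ConcurrencyPolicy: batchv1.ForbidConcurrent, // serialize backups
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyOnFailure,
							NodeSelector:  map[string]string{"node-role.kubernetes.io/master": ""},
							Tolerations: []corev1.Toleration{{
								Key:      "node-role.kubernetes.io/master",
								Operator: corev1.TolerationOpExists,
							}},
							Containers: []corev1.Container{{
								Name:    "backup",
								Image:   "cluster-etcd-operator:latest",          // placeholder image
								Command: []string{"cluster-etcd-operator", "backup"}, // hypothetical subcommand
							}},
						},
					},
				},
			},
		},
	}
}

func main() { _ = backupCronJob() }
```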
This approach appears to create a cycle between the etcd operator, the kube-controller-manager, and the kube-scheduler. Currently, while running in steady state, the only dependency is the kube-apiserver. Increasing that surface area to produce cycles with more components requires significant forcing functions. Especially when the alternative is a container which runs:

- check if I'm the leader; if not, sleep 5 minutes
- if I'm the leader:
  1. sleep 30 minutes
  2. take backup
  3. sleep 30 minutes

This looks pretty safe, simple, single per cluster, doesn't coincide immediately with startup, and maintains the "about an hour latency" requested by @eparis. I don't understand the drive to add complexity that isn't required to satisfy the use-case. Given a fair amount of time and many configuration options, I can see value in a generalized backup via more complex APIs, but I don't see the justification for it here.
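
A minimal Go sketch of that loop, with the leader check left as a hypothetical helper, to show that the only runtime dependency remains etcd itself:

```go
package main

import (
	"fmt"
	"time"
)

// isLeader stands in for however the container would decide whether its local
// etcd member currently holds leadership (for example, by comparing the local
// member ID with the leader ID reported by etcd's status API). Hypothetical.
func isLeader() bool { return true }

// run follows the loop described in the comment above: non-leaders back off,
// the leader waits, takes a backup, then waits again, yielding roughly one
// backup per hour from a single member per cluster.
func run(takeBackup func() error) {
	for {
		if !isLeader() {
			time.Sleep(5 * time.Minute)
			continue
		}
		time.Sleep(30 * time.Minute)
		if err := takeBackup(); err != nil {
			// Log and continue; a failed backup must never disturb etcd itself.
			fmt.Println("backup failed:", err)
		}
		time.Sleep(30 * time.Minute)
	}
}

func main() {
	run(func() error { return nil }) // placeholder backup step
}
```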
If a leader process solves the coordination problem locally, then I'm good with the sidecar.
Had offline discussion, summary is:
Seems like we have consensus that the sidecar will be simple and reliable for this purpose.
Hi folks, just checking whether you have reviewed the approach being taken by OSD for this use-case: https://github.com/openshift/managed-velero-operator /cc @cblecker
OSD took the Velero approach, as backing up the data out of etcd rather than backing up etcd itself gave us more flexibility and decoupled us from changes to the underlying storage mechanism. /cc @jwmatthews
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now, please do so with `/close`.
/lifecycle stale
There is an effort under the name of 'OpenShift API for Data Protection' (OADP) that is delivering a Velero operator to community-operators, focused on specific namespace/application backup/restore. The intent of OADP is to deliver Velero with a set of maintained backup/restore plugins focused on addressing edge cases where core Velero functionality was not sufficient for OpenShift (backup/restore of internal images, for example).
Operator:
Plugins:
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. If this issue is safe to close now, please do so with `/close`.
/lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting `/reopen`.
/close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The current cluster backup procedure requires SSH access and manual commands run from a node, which is unsafe and error prone. Furthermore, a periodic backup taken automatically will allow the user to restore from the most recent backup when the cluster gets into unexpected quorum loss or other unrecoverable data loss situations.