-
Notifications
You must be signed in to change notification settings - Fork 471
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Enhancement to provide automated cluster backups
- Loading branch information
retroflexer
committed
Jun 4, 2020
1 parent
f681a13
commit 496278c
Showing
2 changed files
with
151 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
--- | ||
title: automated-cluster-backups | ||
|
||
authors: | ||
- "@skolicha" | ||
- "@dmace" | ||
- "@hexfusion" | ||
reviewers: | ||
- "@deads2k" | ||
- "@hexfusion" | ||
approvers: | ||
- TBD | ||
creation-date: 2020-06-04 | ||
last-updated: 2020-06-04 | ||
status: provisional | ||
see-also: | ||
- "https://github.com/kubernetes/enhancements/blob/master/etcd/disaster-recovery-with-ceo.md" | ||
--- | ||
|
||
# automated-cluster-backups | ||
|
||
## Release Signoff Checklist | ||
|
||
- [ ] Enhancement is `implementable` | ||
- [ ] Design details are appropriately documented from clear requirements | ||
- [ ] Test plan is defined | ||
- [ ] Graduation criteria for dev preview, tech preview, GA | ||
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) | ||
|
||
## Summary | ||
A periodic backup taken automatically will allow the user to restore from the most recent backup when cluster gets into | ||
unexpected quorum loss or other unrecoverable data loss situations. | ||
|
||
## Motivation | ||
The current cluster backup procedure requires SSH and manual user commands from a node which are unsafe and error prone. | ||
Customers are recommended to avoid sshing to nodes. Requiring an ssh session on a host (master) to take a backup goes | ||
against recommendations with SSH. In addition, | ||
|
||
The immediate need for automated backups is to provide a path to upgrade from 4.4 to 4.5, with the knowledge that if the | ||
upgrade fails for some reason, there is no easy path to rollback, as the etcd version 3.4.x (used in OCP 4.5) is | ||
incompatible with the etcd version of 3.3.x (used in 4.4.x). Since restoring a backup is the safest way to rollback, it | ||
is important to have automated periodical backups to protect the users from unexpected data loss in such scenarios. | ||
|
||
## Goals | ||
|
||
1. Automatic periodic backups and automatic pruning | ||
1. Make most recent backups readily available | ||
1. Eliminate SSH requirement | ||
|
||
## Non-Goals | ||
|
||
* Automated restoration is not considered for this enhancement. | ||
|
||
## Currently Supported Functionality | ||
|
||
Currently the scripts available support the following functionality: | ||
### Cluster Backup | ||
Takes a snapshot of cluster’s etcd data along with static-pod-resources at the time of the backup. | ||
|
||
### Cluster Restore | ||
Restores the etcd data from a backup snapshot. It also restores the static pod resources while deleting all the newer | ||
revisions. | ||
|
||
## Proposal | ||
|
||
### A new container is added to etcd static pod to run automated backups. | ||
The objectives of the new container are: | ||
1. On all etcd static pods, run a subcommand of cluster-etcd-operator to produce automated backups. | ||
1. On all masters it creates a new backup revision `/etc/kubernetes/static-pod-backups/backup-N`. | ||
1. On all masters it write `/etc/kubernetes/static-pod-backups/backup-N/backup.env` file containing 3 environmental | ||
variables `CREATED`, `OCP_VERSION`, `ETCD_REVISION` along with REVISION numbers for all other static pods. | ||
1. On all masters take an etcd snapshot `etcdctl snapshot save | ||
/etc/kubernetes/static-pod-backups/backup-N/snapshot<date-time-string>.db`. | ||
1. On all masters copy all static pod resources to `/etc/kubernetes/static-pod-backups/backup-N/`. | ||
1. On all masters symbolically link the directory `/etc/kubernetes/static-pod-backups/latest-backup` with the directory | ||
containing the most recent `/etc/kubernetes/static-pod-manifests/backup-N` directory. | ||
1. Also be responsible for pruning older backups to keep no more than X number of backups. | ||
|
||
![Backup tree structure](backup-tree-structure.jpg) | ||
## User Stories [optional] | ||
|
||
### Security | ||
|
||
Your clusters backup data is as secure as your cluster. If someone were to root the system they would have direct | ||
access to all data. | ||
|
||
### Availability | ||
|
||
Your data is as resilient as your cluster. We make N copies of your data so in the case of failure you dont have to | ||
worry about your last backup location. | ||
|
||
### Recovery Automation | ||
|
||
If the cluster were to lose quorum and every master is seeded with data required to restore. Automation of recovery | ||
tasks becomes easier. | ||
|
||
|
||
## Implementation Plan | ||
|
||
1. Create a subcommand to cluster-etcd-operator to run as a container in the etcd static pod. | ||
2. It runs in an auto-pilot mode without requiring the addition of any new OpenShift API (for 4.4) | ||
3. In future releases, cluster-etcd-operator will subsume this functionality with a backup controller (4.6+). | ||
|
||
## Implementation Details/Notes/Constraints | ||
|
||
* Add a subcommand to `cluster-etcd-operator` | ||
* The subcommand could look basically like any other controller, using a sync loop driven by a workqueue, but instead | ||
of the queue being pushed by k8s informers, a timer ticker enqueues the sync events | ||
* The sync function attempts to find the backup for the current hour and create if doesn't already exist | ||
* On successful backup, prune the backups to keep them to be less than or equal to the allowed number of backups. | ||
* Change the etcd spec `etcd-pod.yaml` to run the subcommand as a separate container. | ||
|
||
## Risks and Mitigations | ||
|
||
* Performance impact | ||
|
||
Taking snapshots are disk intensive, and can impact the performance of the etcd member. If all the members take | ||
snapshot at the same time, it could trigger a leader election possibly causing cluster-wide failures to access etcd. | ||
* Disks running out of space | ||
|
||
If they backups are not pruned properly, they could overfill the disk. | ||
* Future changes to this sidecar container spec will require a new etcd pod revision | ||
* Resource requirements of the backup controller are conflated with the requirements for etcd | ||
* Health/liveness checks of the backup controller are conflated with etcd within the pod | ||
|
||
## Design Details | ||
|
||
## Test Plan | ||
|
||
## Graduation Criteria | ||
|
||
## Upgrade / Downgrade Strategy | ||
|
||
## Version Skew Strategy | ||
|
||
|
||
## Implementation History | ||
|
||
## Drawbacks | ||
|
||
## Alternatives | ||
* A controller that deploys a CronJob or statefulset in 4.4.z and removes in 4.5. | ||
* An upgrade process it takes a backup before attempting to do any upgrade. | ||
## Infrastructure Needed [optional] | ||
|
||
### New github projects: | ||
|
||
|
||
### New images built: | ||
|
||
|
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.