Enhancement to provide automated cluster backups
retroflexer committed Jun 4, 2020
1 parent f681a13 commit 496278c
Showing 2 changed files with 151 additions and 0 deletions.
151 changes: 151 additions & 0 deletions enhancements/etcd/automated-cluster-backups.md
@@ -0,0 +1,151 @@
---
title: automated-cluster-backups

authors:
- "@skolicha"
- "@dmace"
- "@hexfusion"
reviewers:
- "@deads2k"
- "@hexfusion"
approvers:
- TBD
creation-date: 2020-06-04
last-updated: 2020-06-04
status: provisional
see-also:
- "https://github.com/kubernetes/enhancements/blob/master/etcd/disaster-recovery-with-ceo.md"
---

# automated-cluster-backups

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary
A periodic, automatically taken backup allows the user to restore from the most recent backup when the cluster suffers
an unexpected quorum loss or another unrecoverable data-loss situation.

## Motivation
The current cluster backup procedure requires SSH access and manual commands run from a node, which is unsafe and
error prone. Customers are advised to avoid SSHing to nodes, so requiring an SSH session on a master host to take a
backup goes against that recommendation.

The immediate need for automated backups is to provide a safe path for upgrading from 4.4 to 4.5. If the upgrade fails
for some reason, there is no easy path to roll back, because etcd 3.4.x (used in OCP 4.5) is incompatible with
etcd 3.3.x (used in 4.4.x). Since restoring a backup is the safest way to roll back, it is important to have automated
periodic backups to protect users from unexpected data loss in such scenarios.

## Goals

1. Automatic periodic backups and automatic pruning
1. Make most recent backups readily available
1. Eliminate SSH requirement

## Non-Goals

* Automated restoration is not considered for this enhancement.

## Currently Supported Functionality

The currently available scripts support the following functionality:

### Cluster Backup
Takes a snapshot of the cluster's etcd data along with the static-pod-resources at the time of the backup.

### Cluster Restore
Restores the etcd data from a backup snapshot. It also restores the static pod resources while deleting all the newer
revisions.

## Proposal

### A new container is added to the etcd static pod to run automated backups
The objectives of the new container are:
1. On all etcd static pods, run a subcommand of cluster-etcd-operator to produce automated backups.
1. On all masters, create a new backup revision directory `/etc/kubernetes/static-pod-backups/backup-N`.
1. On all masters, write a `/etc/kubernetes/static-pod-backups/backup-N/backup.env` file containing three environment
variables, `CREATED`, `OCP_VERSION`, and `ETCD_REVISION`, along with the REVISION numbers of all other static pods.
1. On all masters, take an etcd snapshot with
`etcdctl snapshot save /etc/kubernetes/static-pod-backups/backup-N/snapshot<date-time-string>.db`.
1. On all masters, copy all static pod resources to `/etc/kubernetes/static-pod-backups/backup-N/`.
1. On all masters, symbolically link `/etc/kubernetes/static-pod-backups/latest-backup` to the most recent
`/etc/kubernetes/static-pod-backups/backup-N` directory.
1. Prune older backups so that no more than X backups are kept.
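
The directory handling in the steps above can be sketched as follows. This is a minimal illustration in Python, not the operator's actual code (which would be a Go subcommand of cluster-etcd-operator). The function names, the `CREATED` timestamp format, and the temporary-symlink trick for repointing `latest-backup` are assumptions made for the example. The etcd snapshot, the static pod resource copy, and the REVISION entries for the other static pods are left out.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def next_backup_dir(root: Path) -> Path:
    """Return root/backup-N, where N is one more than the highest existing N."""
    numbers = []
    for entry in root.glob("backup-*"):
        suffix = entry.name.split("-", 1)[1]
        if entry.is_dir() and suffix.isdigit():
            numbers.append(int(suffix))
    return root / f"backup-{max(numbers, default=0) + 1}"


def write_backup_env(backup_dir: Path, ocp_version: str,
                     etcd_revision: int, created: datetime) -> Path:
    """Write backup.env with the three variables named in the proposal."""
    lines = [
        # Assumption: CREATED is stored as an ISO 8601 UTC timestamp.
        f"CREATED={created.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"OCP_VERSION={ocp_version}",
        f"ETCD_REVISION={etcd_revision}",
    ]
    env_path = backup_dir / "backup.env"
    env_path.write_text("\n".join(lines) + "\n")
    return env_path


def point_latest_at(root: Path, backup_dir: Path) -> None:
    """Repoint root/latest-backup at backup_dir via a rename, so readers
    never observe a missing link (an assumption, not part of the proposal)."""
    tmp_link = root / ".latest-backup.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(backup_dir.name, tmp_link)  # relative target inside root
    os.replace(tmp_link, root / "latest-backup")


def create_backup_revision(root: Path, ocp_version: str,
                           etcd_revision: int, created: datetime) -> Path:
    """Create backup-N, write backup.env, and update latest-backup."""
    root.mkdir(parents=True, exist_ok=True)
    backup_dir = next_backup_dir(root)
    backup_dir.mkdir()
    write_backup_env(backup_dir, ocp_version, etcd_revision, created)
    # The real flow would take the etcd snapshot and copy the static pod
    # resources into backup_dir at this point.
    point_latest_at(root, backup_dir)
    return backup_dir


if __name__ == "__main__":
    # Use a scratch directory instead of /etc/kubernetes/static-pod-backups.
    scratch = Path(tempfile.mkdtemp()) / "static-pod-backups"
    created_dir = create_backup_revision(
        scratch, "4.5.0", 3, datetime.now(timezone.utc))
    print(created_dir, "->", os.readlink(scratch / "latest-backup"))
```

Writing `backup.env` before repointing `latest-backup` means the link only ever refers to a directory whose metadata is already in place.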

![Backup tree structure](backup-tree-structure.png)

## User Stories [optional]

### Security

Your cluster's backup data is as secure as your cluster. Anyone who gains root access to the system has direct
access to all data.

### Availability

Your data is as resilient as your cluster. We make N copies of your data, so in the case of a failure you don't have
to worry about the location of your last backup.

### Recovery Automation

If the cluster loses quorum, every master is already seeded with the data required to restore, which makes automating
recovery tasks easier.


## Implementation Plan

1. Create a subcommand of cluster-etcd-operator that runs as a container in the etcd static pod.
2. The subcommand runs in an auto-pilot mode without requiring the addition of any new OpenShift API (for 4.4).
3. In future releases (4.6+), cluster-etcd-operator will subsume this functionality with a backup controller.

## Implementation Details/Notes/Constraints

* Add a subcommand to `cluster-etcd-operator`.
* The subcommand could look like any other controller, using a sync loop driven by a workqueue, but instead of k8s
informers pushing to the queue, a timer ticker enqueues the sync events.
* The sync function attempts to find the backup for the current hour and creates one if it doesn't already exist.
* After a successful backup, prune the backups so that their number does not exceed the allowed number of backups.
* Change the etcd pod spec `etcd-pod.yaml` to run the subcommand as a separate container.
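
The sync and prune behaviour described above can be sketched as a pure function that a ticker calls on every tick. This is an illustration in Python under stated assumptions, not the operator's Go implementation: the `Backup` record, the `sync` signature, and the choice to match "the current hour" by comparing `created` timestamps truncated to the hour are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


@dataclass(frozen=True)
class Backup:
    name: str          # e.g. "backup-3"
    created: datetime  # timezone-aware creation time


def same_hour(a: datetime, b: datetime) -> bool:
    """True when both timestamps fall within the same clock hour."""
    def truncate(t: datetime) -> datetime:
        return t.replace(minute=0, second=0, microsecond=0)
    return truncate(a) == truncate(b)


def sync(backups: List[Backup], now: datetime,
         max_backups: int) -> Tuple[List[Backup], List[Backup]]:
    """One sync pass: ensure a backup exists for the current hour, then
    prune the oldest backups beyond max_backups. Returns (kept, pruned)."""
    current = list(backups)
    if not any(same_hour(b.created, now) for b in current):
        highest = max((int(b.name.rsplit("-", 1)[1]) for b in current),
                      default=0)
        current.append(Backup(f"backup-{highest + 1}", now))
    current.sort(key=lambda b: b.created)
    excess = max(len(current) - max_backups, 0)
    return current[excess:], current[:excess]


if __name__ == "__main__":
    # Simulate a ticker firing every 20 minutes for four hours.
    state: List[Backup] = []
    start = datetime(2020, 6, 4, 10, 0, tzinfo=timezone.utc)
    for tick in range(12):
        now = start + timedelta(minutes=20 * tick)
        state, pruned = sync(state, now, max_backups=3)
        for old in pruned:
            print("pruned", old.name)
    print("kept", [b.name for b in state])
```

Because `sync` only creates a backup when none exists for the current hour, the ticker can fire more often than hourly without producing duplicates.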

## Risks and Mitigations

* Performance impact

  Taking snapshots is disk intensive and can impact the performance of the etcd member. If all the members take a
  snapshot at the same time, it could trigger a leader election, possibly causing cluster-wide failures to access etcd.
* Disks running out of space

  If the backups are not pruned properly, they could fill up the disk.
* Future changes to this sidecar container spec will require a new etcd pod revision.
* Resource requirements of the backup controller are conflated with the requirements for etcd.
* Health/liveness checks of the backup controller are conflated with etcd within the pod.
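
The proposal does not specify a mitigation for simultaneous snapshots. One possible approach, shown here purely as an assumption, is to give each member a fixed offset within the backup interval so that no two members snapshot at the same moment. The function name, the member names, and the one-hour window are hypothetical.

```python
from typing import Dict, Iterable


def stagger_offsets(member_names: Iterable[str],
                    window_seconds: int = 3600) -> Dict[str, int]:
    """Spread members evenly across the window.

    Every member can compute the same mapping independently as long as
    it sees the same member list, because the names are sorted first.
    """
    ordered = sorted(member_names)
    if not ordered:
        raise ValueError("at least one member is required")
    step = window_seconds // len(ordered)
    return {name: index * step for index, name in enumerate(ordered)}


if __name__ == "__main__":
    offsets = stagger_offsets(["master-2", "master-0", "master-1"])
    for name, seconds in sorted(offsets.items()):
        print(f"{name}: start snapshot {seconds}s into the hour")
```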

## Design Details

## Test Plan

## Graduation Criteria

## Upgrade / Downgrade Strategy

## Version Skew Strategy


## Implementation History

## Drawbacks

## Alternatives
* A controller that deploys a CronJob or StatefulSet in 4.4.z and removes it in 4.5.
* An upgrade process that takes a backup before attempting any upgrade.

## Infrastructure Needed [optional]

### New github projects:


### New images built:


Binary file added enhancements/etcd/backup-tree-structure.png
