
[wip]: Provide automated cluster backups #359

Closed

Conversation

retroflexer
Contributor

The current cluster backup procedure requires SSH access and manual commands run from a node, which is unsafe and error prone. Furthermore, a periodic backup taken automatically will allow the user to restore from the most recent backup when the cluster gets into an unexpected quorum loss or other unrecoverable data loss situation.

@openshift-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jun 4, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: retroflexer
To complete the pull request process, please assign ashcrow
You can assign the PR to them by writing /assign @ashcrow in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

enhancements/etcd/automated-cluster-backups.md
## Proposal

### A new container is added to the etcd static pod to run automated backups.
The objectives of the new container are:
Contributor

This might be easier to visualize with a tree view of the backup directory to help establish the overall structure and then note details below

Contributor Author

Added a screenshot with a tree diagram.
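A sketch of the layout implied by the backup paths quoted later in this review (the `backup-N` naming comes from the proposal; the exact structure shown here is illustrative):

```bash
# Illustrative on-disk layout, reconstructed from the paths in the proposal.
tree /etc/kubernetes/static-pod-backups
# /etc/kubernetes/static-pod-backups
# └── backup-N
#     ├── backup.env        # CREATED, OCP_VERSION, ETCD_REVISION, APISERVER_REVISION
#     └── etcd-data
#         └── backup.db     # output of `etcdctl snapshot save`
```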

* Add a subcommand to `cluster-etcd-operator`
* Change the etcd spec `etcd-pod.yaml` to run the subcommand as a separate container.

## Risks and Mitigations
Contributor

Some risks of the sidecar pattern in this context:

  • Future changes to the sidecar container spec will require a new etcd pod revision even though the backup mechanism is actually orthogonal to the managed etcd
  • Resource requirements of the backup controller are conflated with the requirements for etcd within the pod (what are the possible impacts?)
  • Health/liveness checks of the backup controller are conflated with etcd within the pod

Are there any backup controller failure modes which could either take down the etcd pod itself or cause the etcd pod to erroneously report trouble with etcd itself?

Contributor

Another general set of risk related thoughts I wanted to get out of my head...

I think these statements are true:

  1. The impact of the platform accidentally killing (or otherwise blowing up) any given etcd container or pod is extremely high
  2. The impact of such a problem manifesting against all etcd pods simultaneously is probably catastrophic

If so, then I also think it follows that the baseline risk of introducing any additional stuff to the etcd pod is probably pretty high to begin with.

The sidecar approach means adding a lot of new code inside all etcd pods. New code, and code which will have bugs. How confident are we in the boundaries between this sidecar and etcd?

All this is to say, let's pay careful attention to the risks section and try to think of all the ways this could go very wrong.

Contributor

Add performance concerns around snapshot on each node.

Contributor Author

Added a list of concerns.

Contributor

If so, then I also think it follows that the baseline risk of introducing any additional stuff to the etcd pod is probably pretty high to begin with.

This is not a natural result of those statements. Simple counter-example: point-to-point network checks. Your concerns are good, but your conclusion doesn't appear to follow, to me.

The sidecar approach means adding a lot of new code inside all etcd pods. New code, and code which will have bugs. How confident are we in the boundaries between this sidecar and etcd?

We are extremely confident in the kubelet's correct handling of a container-sized blast radius. You can see multiple sidecars, our willingness to add more, and the low risk presented by individual failures in the kube-apiserver.

Contributor

Thanks for the thoughts, very helpful


## Drawbacks

## Alternatives
Contributor

Need to talk about the non-sidecar approaches here

Contributor Author

Listed them. Will elaborate.

@retroflexer
Contributor Author

/cc @hexfusion @ironcladlou @deads2k

Contributor

@hexfusion left a comment

few notes

enhancements/etcd/automated-cluster-backups.md
1. On all masters it writes a `/etc/kubernetes/static-pod-backups/backup-N/backup.env` file containing four environment variables: `CREATED`, `OCP_VERSION`, `ETCD_REVISION`, and `APISERVER_REVISION`.
1. On all masters it takes an etcd snapshot: `etcdctl snapshot save /etc/kubernetes/static-pod-backups/backup-N/etcd-data/backup.db`.
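For illustration, the `backup.env` written in the first step might look like the following; only the four variable names come from the proposal, the values here are made up. The second step is the quoted `etcdctl` command:

```bash
# Hypothetical contents of /etc/kubernetes/static-pod-backups/backup-N/backup.env
# (variable names from the proposal; values are illustrative)
CREATED="2020-06-04T12:00:00Z"
OCP_VERSION="4.5.0"
ETCD_REVISION="7"
APISERVER_REVISION="12"

# Snapshot step, as quoted above:
etcdctl snapshot save /etc/kubernetes/static-pod-backups/backup-N/etcd-data/backup.db
```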
Contributor

We ideally want a backup on each node, but we don't want to cause performance issues; this needs to be considered.


1. Create a subcommand to cluster-etcd-operator to run as a container in the etcd static pod.
2. It runs in an auto-pilot mode without requiring any new OpenShift API (for 4.4).
3. In future releases, cluster-etcd-operator will subsume this functionality with a backup controller (4.6+).
Contributor

we need to stub out pruning

Contributor Author

Yes. Added it.
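For concreteness, a minimal pruning sketch; the retention count and the mtime-based ordering are illustrative assumptions, not part of the proposal:

```bash
# Keep only the RETAIN most recent backup-N directories (illustrative policy).
RETAIN=5
BACKUP_DIR=/etc/kubernetes/static-pod-backups
ls -1dt "${BACKUP_DIR}"/backup-* 2>/dev/null \
  | tail -n +$((RETAIN + 1)) \
  | xargs -r rm -rf
```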

* Add a subcommand to `cluster-etcd-operator`
* Change the etcd spec `etcd-pod.yaml` to run the subcommand as a separate container.

## Risks and Mitigations
Contributor

Add performance concerns around snapshot on each node.

@retroflexer changed the title from [wip]: Enhancement to provide automated cluster backups to [wip]: Provide automated cluster backups on Jun 4, 2020
@ironcladlou
Contributor

So, @retroflexer early on wanted to explore the idea of cron jobs generally, and we've had a chance to talk about that approach more in depth. Summarizing some discussions we've had about leveraging CronJob as the "simplest thing that could possibly work"...

Some possible constraints we have identified which can inform the design:

  • We only really need a single on-host backup each interval
  • We want to avoid concurrent expensive etcd snapshots to avoid elections or other issues
  • Regardless of the ability to inspect backup details from in-cluster, we need a backup observing approach for DR scenarios where etcd may not even be available anymore — so a means to inspect on-disk backups on nodes (e.g. at well-known locations) is necessary, and an in-cluster API or other means to observe backups could be deferred

CronJob seems to offer some capabilities that match the needs:

  • Basic Kube primitive, obviates need to reinvent any aspect of job scheduling
  • Basic out of the box in-cluster status reporting through the job resource itself
  • Serialized execution (modulo some edge cases which maybe could be worked around with affinity rules)
  • Stronger boundary between the backup process and the etcd pod

So something like a CronJob that has an affinity for master nodes hosting etcd and some anti-affinity rules to help enforce serialization might be a simple path forward to consider in more depth.
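For concreteness, a minimal sketch of such a CronJob; the name, namespace, image, schedule, and command are placeholders, and `concurrencyPolicy: Forbid` is the CronJob-level knob that prevents concurrent runs (it does not by itself spread runs across masters, which is where the affinity discussion comes in):

```bash
# Illustrative only: a CronJob pinned to master nodes with concurrent runs forbidden.
oc apply -f - <<'EOF'
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup            # placeholder name
  namespace: openshift-etcd
spec:
  schedule: "0 * * * *"        # hourly, illustrative
  concurrencyPolicy: Forbid    # never run two backup jobs at once
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
          - key: node-role.kubernetes.io/master
            operator: Exists
            effect: NoSchedule
          restartPolicy: Never
          containers:
          - name: backup
            image: example.com/etcd-backup:latest   # placeholder image
            command: ["/usr/local/bin/backup.sh"]   # placeholder command
EOF
```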

@deads2k
Contributor

deads2k commented Jun 5, 2020

CronJob seems to offer some capabilities that match the needs:

  • Basic Kube primitive, obviates need to reinvent any aspect of job scheduling
  • Basic out of the box in-cluster status reporting through the job resource itself
  • Serialized execution (modulo some edge cases which maybe could be worked around with affinity rules)
  • Stronger boundary between the backup process and the etcd pod

This approach appears to create a cycle between the etcd operator, the kube-controller-manager, and the kube-scheduler. Currently, while running in steady state, the only dependency is the kube-apiserver. Increasing that surface area to produce cycles with more components requires significant forcing functions. Especially when the alternative is a container which runs

check if I'm the leader, if not sleep 5 minutes
if I'm the leader
1. sleep 30 minutes
2. take backup
3. sleep 30 minutes

This looks pretty safe and simple, produces a single backup per cluster, doesn't coincide immediately with startup, and maintains the "about an hour latency" requested by @eparis.

I don't understand the drive to add complexity that isn't required to satisfy the use-case. Given a fair amount of time and many configuration options, I can see value in a generalized backup via more complex APIs, but I don't see the justification for it here.
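For reference, a minimal shell sketch of the leader-gated loop described above. Assumptions: `etcdctl` and `jq` are available in the backup container, the usual `ETCDCTL_*` environment variables point at the local member, and the destination path and naming are illustrative rather than taken from the proposal:

```bash
#!/usr/bin/env bash
# Minimal sketch of a leader-gated backup loop (see assumptions above).
while true; do
  if ! status="$(etcdctl endpoint status --write-out=json)"; then
    sleep 300; continue                 # etcd unreachable: retry in 5 minutes
  fi
  member_id="$(echo "${status}" | jq '.[0].Status.header.member_id')"
  leader_id="$(echo "${status}" | jq '.[0].Status.leader')"
  if [ "${member_id}" != "${leader_id}" ]; then
    sleep 300; continue                 # not the leader: check again in 5 minutes
  fi
  sleep 1800                            # 30 minutes
  dest="/etc/kubernetes/static-pod-backups/backup-$(date +%s)"
  mkdir -p "${dest}/etcd-data"
  etcdctl snapshot save "${dest}/etcd-data/backup.db"
  sleep 1800                            # 30 minutes
done
```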

@ironcladlou
Contributor

If a leader process solves the coordination problem locally, then I'm good with the sidecar

@ironcladlou
Contributor

Had offline discussion, summary is:

  • Concerns about etcd exposure to sidecars are already addressed in other comments here
  • Agreed that sidecars can reliably avoid contention through some gating logic, eliminating any need for an external coordinator (i.e. the operator)

Seems like we have consensus that the sidecar will be simple and reliable for this purpose.

@jeremyeder
Contributor

Hi folks - just checking if you have reviewed the approach being taken by OSD for this use-case? https://github.com/openshift/managed-velero-operator

/cc @cblecker

@cblecker
Member

OSD took the Velero approach, as backing up the data out of etcd rather than backing up etcd itself gave us more flexibility and decoupled us from changes to the underlying storage mechanism.

/cc @jwmatthews
The CAM team has also been discussing this. There is a backup/restore SIG as well internally.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) on Oct 20, 2020
@jwmatthews
Member

There is an effort under the name of 'OpenShift API for Data Protection' (OADP) that is delivering a Velero operator to community-operators, focused on specific namespace/application backup/restore.

The intent of OADP is to deliver Velero with a set of maintained backup/restore plugins focused on addressing edge cases where core Velero functionality was not sufficient for OpenShift (backup/restore of internal Images for example).

Operator:
https://github.com/konveyor/oadp-operator

Plugins:
https://github.com/konveyor/openshift-velero-plugin

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Nov 19, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
