
[wip]: Provide automated cluster backups #359

Closed

Conversation

retroflexer
Contributor

The current cluster backup procedure requires SSH access and manual commands run from a node, which is unsafe and error prone. Furthermore, a periodic backup taken automatically will allow the user to restore from the most recent backup when the cluster gets into an unexpected quorum loss or other unrecoverable data loss situation.

@openshift-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jun 4, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: retroflexer
To complete the pull request process, please assign ashcrow
You can assign the PR to them by writing /assign @ashcrow in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

enhancements/etcd/automated-cluster-backups.md
## Proposal

### A new container is added to the etcd static pod to run automated backups.
The objectives of the new container are:
Contributor

This might be easier to visualize with a tree view of the backup directory to help establish the overall structure and then note details below

Contributor Author

Added a screenshot with a tree diagram.
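A sketch of the layout implied by the backup paths quoted later in this review (the `backup-N` naming comes from the proposal; the exact structure shown here is illustrative):

```bash
# Illustrative on-disk layout, reconstructed from the paths in the proposal.
tree /etc/kubernetes/static-pod-backups
# /etc/kubernetes/static-pod-backups
# └── backup-N
#     ├── backup.env        # CREATED, OCP_VERSION, ETCD_REVISION, APISERVER_REVISION
#     └── etcd-data
#         └── backup.db     # output of `etcdctl snapshot save`
```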

* Add a subcommand to `cluster-etcd-operator`
* Change the etcd spec `etcd-pod.yaml` to run the subcommand as a separate container.

## Risks and Mitigations
Contributor

Some risks of the sidecar pattern in this context:

  • Future changes to the sidecar container spec will require a new etcd pod revision even though the backup mechanism is actually orthogonal to the managed etcd
  • Resource requirements of the backup controller are conflated with the requirements for etcd within the pod (what are the possible impacts?)
  • Health/liveness checks of the backup controller are conflated with etcd within the pod

Are there any backup controller failure modes which could either take down the etcd pod itself or cause the etcd pod to erroneously report trouble with etcd itself?

Contributor

Another general set of risk related thoughts I wanted to get out of my head...

I think these statements are true:

  1. The impact of the platform accidentally killing (or otherwise blowing up) any given etcd container or pod is extremely high
  2. The impact of such a problem manifesting against all etcd pods simultaneously is probably catastrophic

If so, then I also think it follows that the baseline risk of introducing any additional stuff to the etcd pod is probably pretty high to begin with.

The sidecar approach means adding a lot of new code inside all etcd pods. New code, and code which will have bugs. How confident are we in the boundaries between this sidecar and etcd?

All this is to say, let's pay careful attention to the risks section and try to think of all the ways this could go very wrong.

Contributor

Add performance concerns around snapshot on each node.

Contributor Author

Added a list of concerns.

Contributor

If so, then I also think it follows that the baseline risk of introducing any additional stuff to the etcd pod is probably pretty high to begin with.

This is not a natural result of those statements. Simple counter-example: point-to-point network checks. Your concerns are good, but your conclusion doesn't appear to follow, to me.

The sidecar approach means adding a lot of new code inside all etcd pods. New code, and code which will have bugs. How confident are we in the boundaries between this sidecar and etcd?

We are extremely confident in the kubelet's correct handling of a container-sized blast radius. You can see multiple sidecars, our willingness to add more, and the low risk presented by individual failures in the kube-apiserver.

Contributor

Thanks for the thoughts, very helpful


## Drawbacks

## Alternatives
Contributor

Need to talk about the non-sidecar approaches here

Contributor Author

Listed them. Will elaborate.

@retroflexer
Contributor Author

/cc @hexfusion @ironcladlou @deads2k

Contributor

@hexfusion left a comment

few notes

enhancements/etcd/automated-cluster-backups.md
1. On all masters it writes a `/etc/kubernetes/static-pod-backups/backup-N/backup.env` file containing four environment variables: `CREATED`, `OCP_VERSION`, `ETCD_REVISION`, and `APISERVER_REVISION`.
1. On all masters it takes an etcd snapshot: `etcdctl snapshot save /etc/kubernetes/static-pod-backups/backup-N/etcd-data/backup.db`.
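For illustration, the `backup.env` written in the first step might look like the following; only the four variable names come from the proposal, the values here are made up. The second step is the quoted `etcdctl` command:

```bash
# Hypothetical contents of /etc/kubernetes/static-pod-backups/backup-N/backup.env
# (variable names from the proposal; values are illustrative)
CREATED="2020-06-04T12:00:00Z"
OCP_VERSION="4.5.0"
ETCD_REVISION="7"
APISERVER_REVISION="12"

# Snapshot step, as quoted above:
etcdctl snapshot save /etc/kubernetes/static-pod-backups/backup-N/etcd-data/backup.db
```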
Contributor

We ideally want a backup on each node, but we don't want to cause performance issues; this needs to be considered.


1. Create a subcommand to cluster-etcd-operator to run as a container in the etcd static pod.
2. It runs in an auto-pilot mode without requiring any new OpenShift API (for 4.4).
3. In future releases, cluster-etcd-operator will subsume this functionality with a backup controller (4.6+).
Contributor

we need to stub out pruning

Contributor Author

Yes. Added it.
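For concreteness, a minimal pruning sketch; the retention count and the mtime-based ordering are illustrative assumptions, not part of the proposal:

```bash
# Keep only the RETAIN most recent backup-N directories (illustrative policy).
RETAIN=5
BACKUP_DIR=/etc/kubernetes/static-pod-backups
ls -1dt "${BACKUP_DIR}"/backup-* 2>/dev/null \
  | tail -n +$((RETAIN + 1)) \
  | xargs -r rm -rf
```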

* Add a subcommand to `cluster-etcd-operator`
* Change the etcd spec `etcd-pod.yaml` to run the subcommand as a separate container.

## Risks and Mitigations
Contributor

Add performance concerns around snapshot on each node.

@retroflexer changed the title from [wip]: Enhancement to provide automated cluster backups to [wip]: Provide automated cluster backups on Jun 4, 2020
@ironcladlou
Contributor

So, @retroflexer early on wanted to explore the idea of cron jobs generally, and we've had a chance to talk about that approach more in depth. Summarizing some discussions we've had about leveraging CronJob as the "simplest thing that could possibly work"...

Some possible constraints we have identified which can inform the design:

  • We only really need a single on-host backup each interval
  • We want to avoid concurrent expensive etcd snapshots to avoid elections or other issues
  • Regardless of the ability to inspect backup details from in-cluster, we need a backup observing approach for DR scenarios where etcd may not even be available anymore — so a means to inspect on-disk backups on nodes (e.g. at well-known locations) is necessary, and an in-cluster API or other means to observe backups could be deferred

CronJob seems to offer some capabilities that match the needs:

  • Basic Kube primitive, obviates need to reinvent any aspect of job scheduling
  • Basic out of the box in-cluster status reporting through the job resource itself
  • Serialized execution (modulo some edge cases which maybe could be worked around with affinity rules)
  • Stronger boundary between the backup process and the etcd pod

So something like a CronJob that has an affinity for master nodes hosting etcd and some anti-affinity rules to help enforce serialization might be a simple path forward to consider in more depth.
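For concreteness, a minimal sketch of such a CronJob; the name, namespace, image, schedule, and command are placeholders, and `concurrencyPolicy: Forbid` is the CronJob-level knob that prevents concurrent runs (it does not by itself spread runs across masters, which is where the affinity discussion comes in):

```bash
# Illustrative only: a CronJob pinned to master nodes with concurrent runs forbidden.
oc apply -f - <<'EOF'
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup            # placeholder name
  namespace: openshift-etcd
spec:
  schedule: "0 * * * *"        # hourly, illustrative
  concurrencyPolicy: Forbid    # never run two backup jobs at once
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
          - key: node-role.kubernetes.io/master
            operator: Exists
            effect: NoSchedule
          restartPolicy: Never
          containers:
          - name: backup
            image: example.com/etcd-backup:latest   # placeholder image
            command: ["/usr/local/bin/backup.sh"]   # placeholder command
EOF
```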

@deads2k
Contributor

deads2k commented Jun 5, 2020

CronJob seems to offer some capabilities that match the needs:

  • Basic Kube primitive, obviates need to reinvent any aspect of job scheduling
  • Basic out of the box in-cluster status reporting through the job resource itself
  • Serialized execution (modulo some edge cases which maybe could be worked around with affinity rules)
  • Stronger boundary between the backup process and the etcd pod

This approach appears to create a cycle between the etcd operator, the kube-controller-manager, and the kube-scheduler. Currently, while running in steady state, the only dependency is the kube-apiserver. Increasing that surface area to produce cycles with more components requires significant forcing functions. Especially when the alternative is a container which runs

check if I'm the leader, if not sleep 5 minutes
if I'm the leader
1. sleep 30 minutes
2. take backup
3. sleep 30 minutes

This looks pretty safe and simple, produces a single backup per cluster, doesn't coincide immediately with startup, and maintains the "about an hour latency" requested by @eparis.

I don't understand the drive to add complexity that isn't required to satisfy the use-case. Given a fair amount of time and many configuration options, I can see value in a generalized backup via more complex APIs, but I don't see the justification for it here.
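For reference, a minimal shell sketch of the leader-gated loop described above. Assumptions: `etcdctl` and `jq` are available in the backup container, the usual `ETCDCTL_*` environment variables point at the local member, and the destination path and naming are illustrative rather than taken from the proposal:

```bash
#!/usr/bin/env bash
# Minimal sketch of a leader-gated backup loop (see assumptions above).
while true; do
  if ! status="$(etcdctl endpoint status --write-out=json)"; then
    sleep 300; continue                 # etcd unreachable: retry in 5 minutes
  fi
  member_id="$(echo "${status}" | jq '.[0].Status.header.member_id')"
  leader_id="$(echo "${status}" | jq '.[0].Status.leader')"
  if [ "${member_id}" != "${leader_id}" ]; then
    sleep 300; continue                 # not the leader: check again in 5 minutes
  fi
  sleep 1800                            # 30 minutes
  dest="/etc/kubernetes/static-pod-backups/backup-$(date +%s)"
  mkdir -p "${dest}/etcd-data"
  etcdctl snapshot save "${dest}/etcd-data/backup.db"
  sleep 1800                            # 30 minutes
done
```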

@ironcladlou
Contributor

If a leader process solves the coordination problem locally, then I'm good with the sidecar

@ironcladlou
Contributor

Had offline discussion, summary is:

  • Concerns about etcd exposure to sidecars are already addressed in other comments here
  • Agreed that sidecars can reliably avoid contention through some gating logic, eliminating any need for an external coordinator (i.e. the operator)

Seems like we have consensus that the sidecar will be simple and reliable for this purpose.

@jeremyeder
Contributor

Hi folks - just checking if you have reviewed the approach being taken by OSD for this use-case? https://github.com/openshift/managed-velero-operator

/cc @cblecker

@cblecker
Member

OSD took the Velero approach, as backing up the data out of etcd rather than backing up etcd itself gave us more flexibility and decoupled us from changes to the underlying storage mechanism.

/cc @jwmatthews
The CAM team has also been discussing this. There is a backup/restore SIG as well internally.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) on Oct 20, 2020
@jwmatthews
Member

There is an effort under the name of 'OpenShift API for Data Protection' (OADP) that is delivering a Velero operator to community-operators, focused on specific namespace/application backup/restore.

The intent of OADP is to deliver Velero with a set of maintained backup/restore plugins focused on addressing edge cases where core Velero functionality was not sufficient for OpenShift (backup/restore of internal Images for example).

Operator:
https://github.com/konveyor/oadp-operator

Plugins:
https://github.com/konveyor/openshift-velero-plugin

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Nov 19, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
