Automated solution to back up etcd on a schedule from within the cluster #609

SriRamanujam · 2021-04-27T18:03:01Z

SriRamanujam
Apr 27, 2021

The current documentation on backing up the etcd cluster notes that there is a script on control plane nodes that can be used to take a snapshot of the etcd data and write it to a local directory on the node. It does not provide any guidance on how to manage this backup process effectively. Since backups are basically the only recourse available in disaster recovery scenarios, it is very important to do them correctly. Additionally, it is imperative that the backups be stored outside the cluster itself.

At the April 27th OKD Working Group meeting, there was some discussion of how to automate and manage this procedure. Several ideas were floated:

An operator that manages a periodic cron job that runs the backup script and ships the snapshot off-site somehow
A systemd unit + timer that could be dropped in via machine config
Automating the backup of the etcd cluster to another etcd cluster

If there's community interest around one of these options (or perhaps another option entirely!) we can collaborate on putting something simple together to point people at when they ask about backups, or recommend all OKD users deploy into their clusters for peace of mind.

vrutkovs · 2021-04-27T18:52:40Z

vrutkovs
Apr 27, 2021
Maintainer

Another idea worth investigating: a Tekton pipeline, which can be triggered ad hoc (or via a CronJob)

0 replies

hafe · 2021-04-27T19:03:17Z

hafe
Apr 27, 2021

Also need to consider the keys for an encrypted etcd snapshot

2 replies

SriRamanujam Apr 29, 2021
Author

I've never deployed a cluster with encrypted etcd. Are the keys dumped by the backup script along with everything else?

duritong Jul 28, 2021

Afaik they are included in the snapshot.

staranto · 2021-04-28T12:53:27Z

staranto
Apr 28, 2021

Very much a work in progress and the codebase is a mess, but I'm working on two methods -- a systemd approach and a "k8s-native" approach. Both drive /usr/local/bin/cluster-backup.sh on a schedule and then manage the lifecycle of the snapshots. Works for my use cases, YMMV.

https://github.com/staranto/ocp4-etcd-snapshot
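For readers wondering what "managing the lifecycle of the snapshots" could look like, here is a minimal sketch of a keep-the-newest-N retention step. It is not taken from the linked repository. The function name prune_keep_newest is hypothetical, and the file naming (snapshot_<timestamp>.db, with timestamps that sort lexically by age) is an assumption about what cluster-backup.sh writes, so check it against your own backup directory first.

```shell
#!/bin/sh
# Sketch only: keep the newest KEEP files matching PATTERN in DIR, delete the rest.
# Assumes file names embed a sortable timestamp, so glob order equals age order.
prune_keep_newest() (
    dir=$1      # directory that holds the snapshots
    pattern=$2  # glob for one kind of file, e.g. 'snapshot_*.db' (assumed naming)
    keep=$3     # number of newest files to keep

    # Load the matching files, oldest first, into the positional parameters.
    set -- "$dir"/$pattern

    # If nothing matched, the glob stays literal and there is nothing to prune.
    [ -e "$1" ] || exit 0

    # Drop the oldest file until only KEEP files remain.
    while [ "$#" -gt "$keep" ]; do
        rm -f -- "$1"
        shift
    done
)
```

Calling prune_keep_newest /home/core/assets/backup 'snapshot_*.db' 7 after each backup run would leave the seven most recent snapshots in place. If the script also writes a companion archive per snapshot, a second call with a matching pattern would be needed.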

1 reply

SriRamanujam Apr 29, 2021
Author

This seems like exactly the kind of thing I was thinking of! Awesome to see that someone already got around to making it :D

Which approach has worked better for you?

danielchristianschroeter · 2021-07-21T18:32:35Z

danielchristianschroeter
Jul 21, 2021

We manage the continuous etcd backup process with this cron one-liner on an external server. Of course, this external server needs SSH access to the node and, optionally, access via oc.
/usr/bin/ssh -i '/root/.ssh/id_rsa' -o 'StrictHostKeyChecking=no' core@/usr/local/bin/oc get node -ojsonpath='{.items[0].metadata.name}' \"/usr/bin/sudo -E /usr/bin/mkdir -p /home/core/assets/backup && /usr/bin/sudo -E /usr/bin/mount -t nfs external-nfs-backupserver.com:/okd_backup/<environment> /home/core/assets/backup && /usr/bin/sudo -E /usr/local/bin/cluster-backup.sh /home/core/assets/backup && /usr/bin/sudo -E /usr/bin/find /home/core/assets/backup -type f -mtime +6 -delete && /usr/bin/sudo -E /usr/bin/umount /home/core/assets/backup\"
GitHub messed up the character escaping; here is the same command correctly escaped: https://pastebin.com/p1nn3W67
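Since the one-liner above is hard to read with the escaping lost, here is a rough sketch of the same sequence of steps split into two shell functions. It is a reading of the posted command, not the author's script: the function names remote_backup_cmd and run_backup are hypothetical, host key checking is left at the ssh default, and the node lookup is assumed to be a command substitution around the oc call.

```shell
#!/bin/sh
# Sketch only: the steps of the cron one-liner, split up for readability.

# Print the command chain that runs on the node. As in the original,
# the steps are joined with &&, so a failing step skips the unmount.
remote_backup_cmd() (
    nfs_export=$1  # e.g. external-nfs-backupserver.com:/okd_backup/<environment>
    backup_dir=$2  # e.g. /home/core/assets/backup
    keep_days=$3   # files older than this many days are deleted

    printf '%s && ' \
        "sudo -E mkdir -p $backup_dir" \
        "sudo -E mount -t nfs $nfs_export $backup_dir" \
        "sudo -E /usr/local/bin/cluster-backup.sh $backup_dir" \
        "sudo -E find $backup_dir -type f -mtime +$keep_days -delete"
    printf '%s\n' "sudo -E umount $backup_dir"
)

# Look up the first node with oc, then run the chain there over ssh.
run_backup() (
    ssh_key=$1
    node=$(oc get node -o jsonpath='{.items[0].metadata.name}') || exit 1
    ssh -i "$ssh_key" "core@$node" "$(remote_backup_cmd "$2" "$3" "$4")"
)
```

A cron entry on the external server could then call something like run_backup /root/.ssh/id_rsa external-nfs-backupserver.com:/okd_backup/<environment> /home/core/assets/backup 6, with the values taken from the one-liner.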

0 replies
