Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] During CPM of hibernated clusters, two snapshots can be taken at the same time causing issues when etcd-main is restored #763

Open
plkokanov opened this issue Aug 21, 2024 · 0 comments
Assignees
Labels

Comments

@plkokanov
Copy link
Contributor

Describe the bug:

During the deletion of a hibernated cluster for which the control plane was previously migrated to a different seed, the backup-restore container of the etcd-main-0 pod could not be started successfully with the following error:

time="2024-08-20T18:30:10Z" level=error msg="Failed initialization: error while restoring corrupt data: failed to restore snapshot: mismatched event revision while applying delta snapshot, expected 210 but applied 419 " actor=backup-restore-server

This issue can occur if two snapshots are taken at the same time. One possibility for this occurring during control plane migration is that in the migrate phase, gardenlet instructs etcd-main to take a special final full snapshot. It is possible that during this time (or shortly before that) an incremental (delta) snapshot is created in-memory and the process of pushing it to the bucket had already started.

So far we have only observed this behaviour for hibernated shoots. This means that during the migrate phase the etcd-main-0 pod will be created (woken up), gardenlet will take a final full snapshot, and then the etcd-main-0 pod will be deleted.

Expected behavior:

An incremental and full (final) snapshot should not be pushed to the backup bucket with the same timestamp and overlapping revisions so that etcd-main-0 can be successfully restored from backup during the restore phase of control plane migration.

How To Reproduce (as minimally and precisely as possible):

This issue occurs rarely and is hard to reproduce

Logs:
Here are logs from an occurrence of this issue in one of our testmachinery tests:

time="2024-08-20T18:30:06Z" level=info msg="Restoring from base snapshot: Full-00000000-00000210-1719664221.gz" actor=restorer
time="2024-08-20T18:30:04Z" level=info msg="Restoring from base snapshot: Full-00000000-00000210-1719664221.gz" actor=restorer
time="2024-08-20T18:30:04Z" level=info msg="start decompressing the snapshot with gzip compressionPolicy" actor=de-compressor
time="2024-08-20T18:30:04Z" level=info msg="successfully fetched data of base snapshot in 0.0054225 seconds [CompressionPolicy:gzip]" actor=restorer
{"level":"info","ts":1724178604.1535509,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
time="2024-08-20T18:30:04Z" level=info msg="Attempting to apply 1 delta snapshots for restoration." actor=restorer
time="2024-08-20T18:30:04Z" level=info msg="Creating temporary directory /var/etcd/data/restoration.temp for persisting delta snapshots locally." actor=restorer
time="2024-08-20T18:30:04Z" level=info msg="Starting an embedded etcd server..." actor=restorer
...
time="2024-08-20T18:30:06Z" level=info msg="Applying delta snapshots..." actor=restorer
time="2024-08-20T18:30:06Z" level=info msg="Applying first delta snapshot Incr-00000002-00000210-1719664221.gz" actor=restorer
time="2024-08-20T18:30:06Z" level=info msg="start decompressing the snapshot with gzip compressionPolicy" actor=de-compressor
time="2024-08-20T18:30:06Z" level=info msg="successfully decompressed data of delta snapshot in 0.002845945 seconds [CompressionPolicy:gzip]" actor=restorer
time="2024-08-20T18:30:06Z" level=info msg="Applying first delta snapshot Incr-00000002-00000210-1719664221.gz" actor=restorer
time="2024-08-20T18:30:07Z" level=info msg="Responding to status request with: Progress" actor=backup-restore-server
{"level":"info","ts":"2024-08-20T18:30:07.101Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"default","data-dir":"/var/etcd/data/new.etcd.part","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"embed/etcd.go:558","msg":"stopping serving peer traffic","address":"127.0.0.1:43937"}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"embed/etcd.go:565","msg":"stopped serving peer traffic","address":"127.0.0.1:43937"}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"default","data-dir":"/var/etcd/data/new.etcd.part","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
time="2024-08-20T18:30:07Z" level=error msg="Failed initialization: error while restoring corrupt data: failed to restore snapshot: mismatched event revision while applying delta snapshot, expected 210 but applied 419 " actor=backup-restore-server

The snapshots that were present in the bucket were:
image

Sadly logs from the migrate phase were no-longer present

Environment (please complete the following information):

Anything else we need to know?:

@plkokanov plkokanov added the kind/bug Bug label Aug 21, 2024
@renormalize renormalize self-assigned this Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants