Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: DeleteBeforeRestore is ignored in backup engines #16811

Closed
rvrangel opened this issue Sep 20, 2024 · 4 comments
Closed

Bug Report: DeleteBeforeRestore is ignored in backup engines #16811

rvrangel opened this issue Sep 20, 2024 · 4 comments

Comments

@rvrangel
Copy link
Contributor

Overview of the Issue

Pulling this discussion out of another PR: Vitess has a flag called DeleteBeforeRestore that is supposed to signal the backup engines if they should wipe out the current MySQL data before restoring a backup, but it is currently being ignored by all backup engines (including the new one introduce in #16295)

We decided to keep the behaviour consistent for now, but this issue is to continue the discussion if we should remove this or fix anything that might be breaking as a result of it.

Reproduction Steps

From the description in here, it seems vtop would be relying on the backup engines ignoring that value so that when the vttablet starts with the --restore-from-backup it still deletes and restores the databases despite the vtdataroot already being filled.

Binary Version

current version on (`main`)

Operating System and Environment details

not OS dependant

Log Fragments

No response

@rvrangel rvrangel added Type: Bug Needs Triage This issue needs to be correctly labelled and triaged labels Sep 20, 2024
@rvrangel
Copy link
Contributor Author

when working on #16295, we noticed both xtrabackup and builtin ignore that flag and when we used it as expected in mysqlshell, @frouioui saw the issues with the restore failing because the data was still there when the restore started.

For us it wasn't an issue since we either restore on an empty tablet or we call the RestoreFromBackup RPC (which would cause it to be wiped), but it seems that the vtop tooling doesn't clean tablets once they are restored and relies on this "bug" to complete the restore?

@deepthi
Copy link
Member

deepthi commented Sep 25, 2024

TL;DR there is no bug.

Let's look at these functions from restore.go

func (tm *TabletManager) RestoreData(...
	deleteBeforeRestore bool,...

and

func (tm *TabletManager) restoreDataLocked(..., deleteBeforeRestore bool, ...

RestoreData is called from tm_init.go. Whenever a tablet is started with --restore_from_backup, we call this with deleteBeforeRestore = false. This is because on a simple restart we would not want to delete an existing database. Instead, later on in the call stack, we'll check whether we shouldRestore, and part of that check is to check whether data dir (or database) is non-empty. If the database is empty, we restore from backup, otherwise we just continue with tablet initialization.
restoreDataLocked is called from both RestoreData and from the tablet RPC RestoreFromBackup. When a tablet is told via RPC to restore from a backup, the understanding is that we want to throw away whatever we currently have and start over from a backup. So in that call path, restoreDataLocked is called with deleteBeforeRestore = true.
Regardless of how vttablet is being restarted, whether by vtop or something else, --restore_from_backup does not mean that you throw away existing data. Any change to this behavior and expectation will break most existing Vitess deployments.
The reason this was implemented in this fashion is because it is not easy to change vttablet startup parameters such that the first time a new tablet comes up, it restores from a backup, but subsequently only restarts. So even though the flag is called --restore_from_backup it needs to be understood that the restore only happens if the data dir / database is not already present.

@rvrangel
Copy link
Contributor Author

thanks for looking into @deepthi. My understanding and @frouioui can probably weigh in since it was him that originally reported this behaviour, was that when creating a new tablet in the vitess operator, it would bring up the pod with existing data, then start the vttablet with --restore_from_backup and expect a backup to be restored (even though as you mentioned above, this shouldn't happen).

In that particular case, it seems ShouldRestore() inside restoreDataLocked() returned true even though it should have returned false because deleteBeforeRestore = false and the database already existed on MySQL, but the code was still reaching all the way up to the backup engine's ExecuteRestore()?

@harshit-gangal harshit-gangal removed the Needs Triage This issue needs to be correctly labelled and triaged label Oct 9, 2024
@frouioui
Copy link
Member

Moreover, because we defined a backup configuration on our K8S yaml file, we will start vttablet with the --restore_from_backup flag. Which will try to load any existing backup after the tablets are up and running.

Is what I said in the comment you linked @rvrangel, which is actually wrong and was based on bad assumptions. When the data directory of the tablet is not empty, we will restart the tablet from that directory and not from backup. I created a test cluster to illustrate it:

$ kubectl get pods
NAME                                                         READY   STATUS             RESTARTS        AGE
...
example-vttablet-zone1-0790125915-4e37d9d5                   3/3     Running            1 (7m21s ago)   9m7s
...

# I decide to restart that pod to simulate a vttablet that restarts
$ kubectl delete pod example-vttablet-zone1-0790125915-4e37d9d5
pod "example-vttablet-zone1-0790125915-4e37d9d5" deleted

# Let's look at the logs
$ kubectl logs example-vttablet-zone1-0790125915-4e37d9d5
...
I1021 20:17:03.882470       1 restore.go:261] Attempting to restore, but mysqld already contains data. Assuming vttablet was just restarted.
...

ShouldRestore returns false as there is a database and this condition is met:

// Returns (false, nil) if the check succeeds but the condition is not
// satisfied (there is a DB).

I am going to close this issue for now.

@frouioui frouioui closed this as not planned Won't fix, can't repro, duplicate, stale Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants