Bug Report: vtorc ERS incorrectly flags replica as having errant GTIDs #16724

Open
deepthi opened this issue Sep 6, 2024 · 3 comments
Labels
Component: VTorc Vitess Orchestrator integration Type: Bug

Comments

@deepthi
Member

deepthi commented Sep 6, 2024

Overview of the Issue

When ERS is evaluating candidates for promotion, it checks whether any of the candidates has an errant GTID. The way this computation is done can lead to false positives when there are only two candidates. This can lead to ERS choosing to promote a replica that is not actually the most advanced.
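A minimal sketch of how such a false positive can arise with only two candidates. It does not reproduce the Vitess code: it assumes a simplified rule in which any GTID that a candidate has and no other candidate has is treated as errant. The `gtidSet` type, the helper names, and the sequence ranges 1-84 and 1-78 are hypothetical. The UUID is borrowed from the log fragment in this issue purely as sample data.

```go
package main

import "fmt"

// gtid identifies one transaction: the originating server UUID plus a sequence number.
type gtid struct {
	uuid string
	seq  int64
}

// gtidSet is a deliberately simple stand-in for a MySQL GTID set (hypothetical type).
type gtidSet map[gtid]struct{}

// rangeSet builds the set uuid:from-to (inclusive).
func rangeSet(uuid string, from, to int64) gtidSet {
	s := gtidSet{}
	for i := from; i <= to; i++ {
		s[gtid{uuid, i}] = struct{}{}
	}
	return s
}

// naiveErrant returns the GTIDs that candidate has and that none of the
// other candidates has. This is the assumed, simplified rule, not Vitess code.
func naiveErrant(candidate gtidSet, others ...gtidSet) gtidSet {
	errant := gtidSet{}
	for g := range candidate {
		seen := false
		for _, o := range others {
			if _, ok := o[g]; ok {
				seen = true
				break
			}
		}
		if !seen {
			errant[g] = struct{}{}
		}
	}
	return errant
}

func main() {
	// Sample data: both replicas only ever applied transactions from one source.
	const sourceUUID = "8e166b50-d4e3-11ee-9779-e2b8a56b2179"
	advanced := rangeSet(sourceUUID, 1, 84) // fully caught-up replica
	lagging := rangeSet(sourceUUID, 1, 78)  // replica that is behind

	// The caught-up replica is ahead by 79-84, which the naive rule calls errant.
	fmt.Println(len(naiveErrant(advanced, lagging))) // 6
	// The lagging replica has nothing the other lacks, so it looks clean.
	fmt.Println(len(naiveErrant(lagging, advanced))) // 0
}
```

With a third candidate that had also received 79-84, the same rule would report nothing for the caught-up replica, which is why the problem shows up with exactly two candidates.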

Reproduction Steps

This is not easy to reproduce, but it can probably be done as follows:

  • Run a cluster with 3 tablets: 1 primary, 2 replicas
  • Manually delay replication on one of the replicas
  • Take down the primary MySQL and let ERS promote
  • You should see that it promotes the lagging replica

Binary Version

v18

Operating System and Environment details

Any

Log Fragments

E0905 23:15:34.245238       1 replication.go:126] skipping zone1-100 because we detected errant GTIDs - 8e166b50-d4e3-11ee-9779-e2b8a56b2179:79-84
@deepthi added the Type: Bug, Needs Triage, and Component: VTorc labels and removed the Needs Triage label on Sep 6, 2024
@deepthi self-assigned this on Sep 6, 2024
@deepthi
Member Author

deepthi commented Sep 6, 2024

I'll leave this open until we resolve #16725 (comment)

@shlomi-noach self-assigned this on Sep 8, 2024
@shlomi-noach
Contributor

Assigned myself to look into the ERS logic and see what scenarios are broken.

@shlomi-noach
Contributor

So the current logic (and in particular before #16725) is flawed, and I believe contrary to the correct logic.

  • It only looks at the relay log GTID, but that's the least interesting part when investigating errant GTIDs, as those are generally created on the replica itself. Therefore, we must use @@gtid_executed rather than the relay log GTID.
  • It's OK to then union the relay log GTID as a "total-would-be-GTID"
  • The current logic (a bit mitigated by FindErrantGTIDs: superset is not an errant GTID situation #16725) prefers promoting a replica that has fewer relay log GTIDs. It should prefer promoting the replica that has the largest GTID set (executed+relay)
  • Since our topology is always flat (all replicas connect directly to the Primary, never sub-replicating from another replica) it is not so important to do specific UUID analysis as described in Find Errant GTIDs #6296 (review)
  • We also need to consider Find Errant GTIDs #6296 (comment):

    "One thing to note is whether you necessarily intend to wait for the replica to consume its relay logs (hence, its retrieved_gtid_set)"

I'll discuss with @GuptaManan100

I'm gonna work on improving the logic and add some more challenging testing scenarios.
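The selection rule proposed in the list above could be sketched roughly as follows. This is one reading of the comment, not the eventual Vitess implementation: the `candidate` struct, its field names, `mostAdvanced`, the alias `zone1-101`, and all sequence ranges are hypothetical. Set size is used as a crude stand-in for "largest GTID set".

```go
package main

import "fmt"

// gtid identifies one transaction: the originating server UUID plus a sequence number.
type gtid struct {
	uuid string
	seq  int64
}

// gtidSet is a deliberately simple stand-in for a MySQL GTID set (hypothetical type).
type gtidSet map[gtid]struct{}

// rangeSet builds the set uuid:from-to (inclusive).
func rangeSet(uuid string, from, to int64) gtidSet {
	s := gtidSet{}
	for i := from; i <= to; i++ {
		s[gtid{uuid, i}] = struct{}{}
	}
	return s
}

// union returns a new set containing every GTID in a or b.
func union(a, b gtidSet) gtidSet {
	out := gtidSet{}
	for g := range a {
		out[g] = struct{}{}
	}
	for g := range b {
		out[g] = struct{}{}
	}
	return out
}

// candidate is a hypothetical view of one replica during ERS.
type candidate struct {
	alias    string
	executed gtidSet // what @@gtid_executed would report
	relay    gtidSet // GTIDs retrieved into the relay log
}

// total is the "total-would-be-GTID" set: executed plus relay log.
func (c candidate) total() gtidSet { return union(c.executed, c.relay) }

// mostAdvanced returns the candidate with the largest combined set.
// It assumes cands is non-empty; ties keep the earlier candidate.
func mostAdvanced(cands []candidate) candidate {
	best := cands[0]
	for _, c := range cands[1:] {
		if len(c.total()) > len(best.total()) {
			best = c
		}
	}
	return best
}

func main() {
	const sourceUUID = "8e166b50-d4e3-11ee-9779-e2b8a56b2179" // sample data
	cands := []candidate{
		{alias: "zone1-101", executed: rangeSet(sourceUUID, 1, 78), relay: gtidSet{}},
		{alias: "zone1-100", executed: rangeSet(sourceUUID, 1, 80), relay: rangeSet(sourceUUID, 81, 84)},
	}
	fmt.Println(mostAdvanced(cands).alias) // zone1-100
}
```

Comparing by size alone would favor a replica whose set is inflated by genuinely errant transactions, so a real implementation would presumably also need a containment check between the candidates' sets. Whether to wait for the relay log to be applied before promoting is the open question quoted above.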
