Bug Report: vtorc ERS incorrectly flags replica as having errant GTIDs #16724

deepthi · 2024-09-06T21:12:22Z

Overview of the Issue

When ERS is evaluating candidates for promotion, it checks whether any of the candidates has an errant GTID. The way this computation is done can lead to false positives when there are only two candidates. This can lead to ERS choosing to promote a replica that is not actually the most advanced.
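To illustrate how a false positive can arise with two candidates, here is a minimal sketch. It assumes the check subtracts the other candidates' GTIDs from each candidate's set and treats any remainder as errant; the actual Vitess implementation may differ. The `gtidSet` type and every helper below are hypothetical, and the starting transaction number is made up. Only the UUID and the flagged range 79-84 come from the log fragment in this report.

```go
package main

import "fmt"

// gtidSet is a toy model of a GTID set: source UUID -> set of transaction numbers.
// Real GTID sets store intervals; this type is hypothetical and for illustration only.
type gtidSet map[string]map[int64]bool

// rangeSet builds a gtidSet holding transactions first..last for one UUID.
func rangeSet(uuid string, first, last int64) gtidSet {
	txns := make(map[int64]bool)
	for n := first; n <= last; n++ {
		txns[n] = true
	}
	return gtidSet{uuid: txns}
}

// union returns a new set containing every GTID found in a or b.
func union(a, b gtidSet) gtidSet {
	out := gtidSet{}
	for _, s := range []gtidSet{a, b} {
		for uuid, txns := range s {
			if out[uuid] == nil {
				out[uuid] = make(map[int64]bool)
			}
			for n := range txns {
				out[uuid][n] = true
			}
		}
	}
	return out
}

// difference returns the GTIDs that are in a but not in b.
func difference(a, b gtidSet) gtidSet {
	out := gtidSet{}
	for uuid, txns := range a {
		for n := range txns {
			if b[uuid][n] {
				continue
			}
			if out[uuid] == nil {
				out[uuid] = make(map[int64]bool)
			}
			out[uuid][n] = true
		}
	}
	return out
}

// count returns the total number of GTIDs in s.
func count(s gtidSet) int {
	total := 0
	for _, txns := range s {
		total += len(txns)
	}
	return total
}

// naiveErrant models the assumed check: whatever the candidate has that
// no other candidate has is reported as errant.
func naiveErrant(candidate gtidSet, others []gtidSet) gtidSet {
	seen := gtidSet{}
	for _, o := range others {
		seen = union(seen, o)
	}
	return difference(candidate, seen)
}

func main() {
	const uuid = "8e166b50-d4e3-11ee-9779-e2b8a56b2179"
	ahead := rangeSet(uuid, 1, 84)   // the more advanced replica
	lagging := rangeSet(uuid, 1, 78) // the delayed replica

	fmt.Println("flagged on ahead:  ", count(naiveErrant(ahead, []gtidSet{lagging})))
	fmt.Println("flagged on lagging:", count(naiveErrant(lagging, []gtidSet{ahead})))
}
```

Under this assumption the more advanced replica is the one flagged (transactions 79-84), simply because it is a superset of the only other candidate, so the lagging replica is left as the only promotable one.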

Reproduction Steps

This is not easy to reproduce, but can probably be done as follows:

1. Run a cluster with 3 tablets: 1 primary, 2 replicas.
2. Manually delay replication on one of the replicas.
3. Take down the primary MySQL and let ERS promote.
4. You should see that it promotes the lagging replica.

Binary Version

v18

Operating System and Environment details

Any

Log Fragments

E0905 23:15:34.245238       1 replication.go:126] skipping zone1-100 because we detected errant GTIDs - 8e166b50-d4e3-11ee-9779-e2b8a56b2179:79-84

deepthi · 2024-09-06T23:37:12Z

I'll leave this open until we resolve #16725 (comment)

shlomi-noach · 2024-09-08T10:31:32Z

Assigned myself to look into the ERS logic and see what scenarios are broken.

shlomi-noach · 2024-09-08T12:35:39Z

So the current logic (and in particular the logic before #16725) is flawed and, I believe, contrary to the correct logic.

- It only looks at the relay log GTID, but that's the least interesting part when investigating errant GTIDs, as those are generally created on the replica itself. Therefore, we must use @@gtid_executed rather than the relay log GTID.
- It's OK to then union the relay log GTID as a "total-would-be-GTID".
- The current logic (a bit mitigated by "FindErrantGTIDs: superset is not an errant GTID situation" #16725) prefers promoting a replica that has fewer relay log GTIDs. It should prefer promoting the replica that has the largest GTID set (executed + relay).
- Since our topology is always flat (all replicas connect directly to the primary, never sub-replicating from another replica), it is not so important to do specific UUID analysis as described in "Find Errant GTIDs" #6296 (review).
- We also need to consider "Find Errant GTIDs" #6296 (comment):

  "One thing to note is whether you necessarily intend to wait for the replica to consume its relay logs (hence, its retrieved_gtid_set)."

  I'll discuss with @GuptaManan100.

I'm gonna work on improving the logic and add some more challenging testing scenarios.
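The selection rule proposed above can be sketched roughly as follows. This is not the Vitess implementation: the `gtidSet` and `candidate` types and all helpers are hypothetical, "largest" is approximated by counting GTIDs, and the transaction ranges and the alias zone1-101 are made up (zone1-100 and the UUID are taken from the log fragment).

```go
package main

import "fmt"

// gtidSet is a toy model of a GTID set: source UUID -> set of transaction numbers.
type gtidSet map[string]map[int64]bool

// rangeSet builds a gtidSet holding transactions first..last for one UUID.
func rangeSet(uuid string, first, last int64) gtidSet {
	txns := make(map[int64]bool)
	for n := first; n <= last; n++ {
		txns[n] = true
	}
	return gtidSet{uuid: txns}
}

// union returns a new set containing every GTID found in a or b.
func union(a, b gtidSet) gtidSet {
	out := gtidSet{}
	for _, s := range []gtidSet{a, b} {
		for uuid, txns := range s {
			if out[uuid] == nil {
				out[uuid] = make(map[int64]bool)
			}
			for n := range txns {
				out[uuid][n] = true
			}
		}
	}
	return out
}

// size returns the total number of GTIDs in s.
func size(s gtidSet) int {
	total := 0
	for _, txns := range s {
		total += len(txns)
	}
	return total
}

// candidate is a hypothetical view of one replica considered for promotion.
type candidate struct {
	alias    string
	executed gtidSet // what @@gtid_executed would report
	relay    gtidSet // GTIDs retrieved into the relay log
}

// total is the "total-would-be" set: executed GTIDs plus relay log GTIDs.
func (c candidate) total() gtidSet { return union(c.executed, c.relay) }

// mostAdvanced returns the candidate whose total set holds the most GTIDs.
// ok is false when no candidates are given; ties keep the earlier candidate.
func mostAdvanced(cands []candidate) (best candidate, ok bool) {
	for _, c := range cands {
		if !ok || size(c.total()) > size(best.total()) {
			best, ok = c, true
		}
	}
	return best, ok
}

func main() {
	const uuid = "8e166b50-d4e3-11ee-9779-e2b8a56b2179"
	cands := []candidate{
		{alias: "zone1-100", executed: rangeSet(uuid, 1, 80), relay: rangeSet(uuid, 81, 84)},
		{alias: "zone1-101", executed: rangeSet(uuid, 1, 78), relay: gtidSet{}},
	}
	if best, ok := mostAdvanced(cands); ok {
		fmt.Println("promote:", best.alias)
	}
}
```

Counting GTIDs is only a stand-in for a proper superset comparison, and the sketch does not address the open question of whether to wait for the replica to apply its relay logs before promotion.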

deepthi added the Type: Bug, Needs Triage, and Component: VTorc labels and removed the Needs Triage label Sep 6, 2024

deepthi self-assigned this Sep 6, 2024

deepthi mentioned this issue Sep 6, 2024: FindErrantGTIDs: superset is not an errant GTID situation #16725 (Merged)

shlomi-noach self-assigned this Sep 8, 2024

GuptaManan100 mentioned this issue Sep 26, 2024: Fix: Errant GTID detection on the replicas when they set replication source #16833 (Draft)
