When ERS is evaluating candidates for promotion, it checks whether any of the candidates has an errant GTID. The way this computation is done can lead to false positives when there are only two candidates. This can lead to ERS choosing to promote a replica that is not actually the most advanced.
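The failure mode can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a candidate's errant GTIDs are computed as its own GTID set minus the union of the other candidates' sets; this is an assumption, not code taken from Vitess. The helper names (`parse_gtid_set`, `naive_errant`, and so on) and the ranges `1-84` and `1-78` are hypothetical, chosen so that the result matches the range shown in the log fragment of this report.

```python
from collections import defaultdict


def parse_gtid_set(text):
    """Parse 'uuid:1-5:8,uuid2:3' into {uuid: set of transaction numbers}."""
    result = defaultdict(set)
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        for interval in intervals:
            start, _, end = interval.partition("-")
            result[uuid].update(range(int(start), int(end or start) + 1))
    return dict(result)


def union(*gtid_sets):
    """Union of several parsed GTID sets."""
    out = defaultdict(set)
    for gtid_set in gtid_sets:
        for uuid, txns in gtid_set.items():
            out[uuid] |= txns
    return dict(out)


def subtract(a, b):
    """Transactions present in a but not in b."""
    out = {}
    for uuid, txns in a.items():
        rest = txns - b.get(uuid, set())
        if rest:
            out[uuid] = rest
    return out


def format_gtid_set(gtid_set):
    """Render a parsed GTID set back into 'uuid:start-end' notation."""
    parts = []
    for uuid in sorted(gtid_set):
        nums = sorted(gtid_set[uuid])
        if not nums:
            continue
        intervals = []
        start = prev = nums[0]
        for n in nums[1:]:
            if n != prev + 1:
                intervals.append((start, prev))
                start = n
            prev = n
        intervals.append((start, prev))
        text = ":".join(str(s) if s == e else f"{s}-{e}" for s, e in intervals)
        parts.append(f"{uuid}:{text}")
    return ",".join(parts)


def naive_errant(candidate, others):
    """ASSUMED check: anything the other candidates lack counts as errant."""
    return subtract(candidate, union(*others))


# Hypothetical positions: one replica is simply ahead of the other.
SOURCE_UUID = "8e166b50-d4e3-11ee-9779-e2b8a56b2179"
advanced = parse_gtid_set(f"{SOURCE_UUID}:1-84")
lagging = parse_gtid_set(f"{SOURCE_UUID}:1-78")

# With only two candidates, the more advanced replica looks errant...
print(format_gtid_set(naive_errant(advanced, [lagging])))
# prints 8e166b50-d4e3-11ee-9779-e2b8a56b2179:79-84

# ...while the lagging replica looks clean and would be the one promoted.
print(naive_errant(lagging, [advanced]))
# prints {}
```

Under this assumption, transactions `79-84` are legitimate but are only present on one candidate, so a check of this shape cannot tell them apart from errant ones.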
Reproduction Steps
This is not easy to reproduce, but it can probably be done as follows:
1. Run a cluster with 3 tablets: 1 primary, 2 replicas.
2. Manually delay replication on one of the replicas.
3. Take down the primary MySQL and let ERS promote a replica.
4. You should see that ERS promotes the lagging replica.
Binary Version
v18
Operating System and Environment details
Any
Log Fragments
E0905 23:15:34.245238 1 replication.go:126] skipping zone1-100 because we detected errant GTIDs - 8e166b50-d4e3-11ee-9779-e2b8a56b2179:79-84
The current logic (in particular as it stood before #16725) is flawed, and I believe it runs contrary to the correct logic.
It only looks at the relay log GTID set, but that is the least interesting part when investigating errant GTIDs, since those are generally created on the replica itself. Therefore, we must use @@gtid_executed rather than the relay log GTID set.
It's OK to then union in the relay log GTID set as a "total-would-be-GTID".
Since our topology is always flat (all replicas connect directly to the primary and never sub-replicate from another replica), it is not so important to do the specific UUID analysis described in Find Errant GTIDs #6296 (review).
One thing to note is whether you necessarily intend to wait for the replica to consume its relay logs (hence, its retrieved_gtid_set).
I'll discuss with @GuptaManan100.
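The difference between the two GTID sources can be sketched in Python as follows. This is only an illustration under stated assumptions: the replica UUID, the transaction ranges, and the helper names are hypothetical, and the strings stand in for values that would be read from `@@gtid_executed` and from the replica's retrieved GTID set.

```python
from collections import defaultdict


def parse_gtid_set(text):
    """Parse 'uuid:1-5:8,uuid2:3' into {uuid: set of transaction numbers}."""
    result = defaultdict(set)
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        for interval in intervals:
            start, _, end = interval.partition("-")
            result[uuid].update(range(int(start), int(end or start) + 1))
    return dict(result)


def union(a, b):
    """Union of two parsed GTID sets."""
    out = defaultdict(set)
    for gtid_set in (a, b):
        for uuid, txns in gtid_set.items():
            out[uuid] |= txns
    return dict(out)


PRIMARY_UUID = "8e166b50-d4e3-11ee-9779-e2b8a56b2179"
REPLICA_UUID = "aaaaaaaa-0000-0000-0000-000000000001"  # hypothetical

# Hypothetical replica state: it applied primary transactions 1-80, wrote one
# transaction locally (the errant one), and has fetched up to 84 into its
# relay logs without applying 81-84 yet.
gtid_executed = parse_gtid_set(f"{PRIMARY_UUID}:1-80,{REPLICA_UUID}:1")
retrieved_gtid_set = parse_gtid_set(f"{PRIMARY_UUID}:1-84")

# The relay log only holds what came from the primary, so the local write
# is invisible there.
print(REPLICA_UUID in retrieved_gtid_set)  # prints False

# The executed set does record the local write.
print(REPLICA_UUID in gtid_executed)  # prints True

# "Total-would-be-GTID": where the replica ends up once relay logs are applied.
would_be = union(gtid_executed, retrieved_gtid_set)
print(max(would_be[PRIMARY_UUID]), sorted(would_be[REPLICA_UUID]))
# prints 84 [1]
```

In this sketch an errant-GTID check based on the retrieved set alone would miss the local transaction, while the union still reflects how far the replica would advance after consuming its relay logs.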
I'm gonna work on improving the logic and add some more challenging testing scenarios.