Bug Report: Each replication repair resets the replica logs #13276

GuptaManan100 · 2023-06-08T17:05:32Z

Overview of the Issue

Whenever VTOrc (or anything else, such as a manual operation or vtctld) repairs a replica to fix a replication-related failure, it ends up resetting all of the replica's relay logs.
In and of itself, this isn't a huge deal, since the I/O thread can simply read the logs from the source again. But if users have enabled the replication reporter (--enable_replication_reporter) on the vttablet, it becomes an issue. The replication reporter uses Seconds_Behind_Source as its source of information for calculating replication lag. According to the MySQL docs for this field:

In essence, this field measures the time difference in seconds between the replication SQL (applier) thread and the replication I/O (receiver) thread. If the network connection between source and replica is fast, the replication receiver thread is very close to the source, so this field is a good approximation of how late the replication applier thread is compared to the source. If the network is slow, this is not a good approximation; the replication applier thread may quite often be caught up with the slow-reading replication receiver thread, so Seconds_Behind_Source often shows a value of 0, even if the replication receiver thread is late compared to the source. In other words, this column is useful only for fast networks.

Resetting the relay logs essentially resets the I/O thread too, which can lead to replication lag being reported incorrectly.
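To make the misreporting concrete, here is a minimal toy model of the behavior described in the quoted docs. It is only a sketch: the `Replica` class, its fields, and the timestamps are hypothetical, and it assumes that discarding relay logs leaves the receiver thread at the applier's position until it fetches events again.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Toy replica: timestamps (in seconds) of the newest event each thread has handled."""

    received_ts: int  # newest event fetched by the I/O (receiver) thread
    applied_ts: int   # newest event executed by the SQL (applier) thread

    def reported_lag(self) -> int:
        # Approximation of Seconds_Behind_Source per the quoted docs:
        # the applier is measured against the receiver, not the source.
        return self.received_ts - self.applied_ts

    def true_lag(self, source_ts: int) -> int:
        # What an operator actually cares about: applier measured against the source.
        return source_ts - self.applied_ts

    def reset_relay_logs(self) -> None:
        # Assumption: dropping the relay logs discards every received-but-unapplied
        # event, so the receiver restarts from the applier's position.
        self.received_ts = self.applied_ts


source_ts = 1_000
replica = Replica(received_ts=1_000, applied_ts=940)

before = (replica.reported_lag(), replica.true_lag(source_ts))
replica.reset_relay_logs()
after = (replica.reported_lag(), replica.true_lag(source_ts))

print(before, after)  # (60, 60) (0, 60)
```

In this model the replica is still 60 seconds behind the source after the reset, but the reported value drops to 0 until the receiver catches up again.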

Reproduction Steps

1. Run a cluster.
2. Stop replication on a tablet and observe that, after VTOrc repairs it, the relay logs are gone.

Binary Version

main

Operating System and Environment details

all

Log Fragments

No response

deepthi · 2023-06-23T16:52:25Z

This can be more serious than just a reporting issue. If it so happens that VTOrc performs a replication repair on all replicas just before a primary failure, we can end up losing data upon ERS (EmergencyReparentShard).
We need to backport whatever fix we come up with all the way back to the release that introduced the RESET during replication repair. That was in #10943.
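The data-loss scenario above can be illustrated with a small sketch. It is a deliberately simplified model, not the actual ERS implementation: positions are plain integer transaction counters rather than GTID sets, the function names and replica names are hypothetical, and it assumes the promotion step can use everything a replica has already received into its relay logs.

```python
def reset_relay_logs(replicas):
    """Model a replication repair: received-but-unapplied transactions are discarded."""
    return {
        name: {"applied": state["applied"], "received": state["applied"]}
        for name, state in replicas.items()
    }


def promote_most_advanced(replicas):
    """Pick the replica holding the most transactions once its relay logs are applied.

    Assumption: the candidate applies everything it has received before promotion.
    """
    name = max(replicas, key=lambda r: replicas[r]["received"])
    return name, replicas[name]["received"]


primary_committed = 100  # last transaction committed on the failed primary
replicas = {
    "replica-1": {"applied": 90, "received": 100},
    "replica-2": {"applied": 95, "received": 98},
}

# Primary fails with relay logs intact: replica-1 still holds everything.
_, kept_without_reset = promote_most_advanced(replicas)

# Primary fails right after every replica was repaired (relay logs reset).
_, kept_with_reset = promote_most_advanced(reset_relay_logs(replicas))

lost_without_reset = primary_committed - kept_without_reset
lost_with_reset = primary_committed - kept_with_reset
print(lost_without_reset, lost_with_reset)  # 0 5
```

In this model, transactions 96 through 100 existed only in relay logs, so resetting them on every replica just before the failure makes those transactions unrecoverable.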

GuptaManan100 · 2023-06-26T06:56:44Z

That PR was merged into release-15.0 onwards. So we'll have to backport the fix to 17, 16 and 15.

GuptaManan100 added Type: Bug Component: Cluster management labels Jun 8, 2023

GuptaManan100 changed the title ~~Bug Report: Each VTOrc repair resets the replica logs~~ Bug Report: Each replication repair resets the replica logs Jun 8, 2023

GuptaManan100 mentioned this issue Jun 26, 2023

Prevent resetting replication every time we set replication source #13377

Merged

GuptaManan100 closed this as completed in #13377 Jun 27, 2023
