Bug Report: Each replication repair resets the replica logs #13276

Closed
GuptaManan100 opened this issue Jun 8, 2023 · 2 comments · Fixed by #13377

GuptaManan100 commented Jun 8, 2023

Overview of the Issue

Whenever VTOrc (or anyone else: a manual operation, vtctld, etc.) fixes a replica to repair replication-related failures, it ends up resetting all of the replica's relay logs.
This in and of itself isn't a huge deal, since the I/O thread can just read the logs again. But if users run the vttablet with the replication reporter (--enable_replication_reporter), it becomes a problem. The replication reporter uses Seconds_Behind_Source as its source of information for calculating the replication lag. According to the MySQL docs for this field:

In essence, this field measures the time difference in seconds between the replication SQL (applier) thread and the replication I/O (receiver) thread. If the network connection between source and replica is fast, the replication receiver thread is very close to the source, so this field is a good approximation of how late the replication applier thread is compared to the source. If the network is slow, this is not a good approximation; the replication applier thread may quite often be caught up with the slow-reading replication receiver thread, so Seconds_Behind_Source often shows a value of 0, even if the replication receiver thread is late compared to the source. In other words, this column is useful only for fast networks. 

Resetting the relay logs essentially resets the I/O thread too, which can lead to the replication lag being reported incorrectly.
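
To make the failure mode concrete, here is a minimal toy model of the approximation described in the quoted docs. It is only a sketch: the `ToyReplica` class, its fields, and the event timestamps are hypothetical, and it assumes that after a relay log reset the receiver has to start again from the last applied position. It is not how MySQL or vttablet actually compute the value.

```python
from dataclasses import dataclass


@dataclass
class ToyReplica:
    """Hypothetical replica; positions are indexes into the source's event timestamps."""
    received: int  # last event fetched by the I/O (receiver) thread
    applied: int   # last event executed by the SQL (applier) thread

    def reported_lag(self, event_ts):
        # Approximation from the quoted docs: applier compared to receiver.
        return event_ts[self.received] - event_ts[self.applied]

    def true_lag(self, event_ts):
        # What an operator actually cares about: applier compared to source.
        return event_ts[-1] - event_ts[self.applied]

    def reset_relay_logs(self):
        # Assumption: the reset discards received-but-unapplied events,
        # so the receiver restarts from the applied position.
        self.received = self.applied


event_ts = [0, 10, 20, 30, 40, 50]  # one source event every 10 seconds
replica = ToyReplica(received=5, applied=1)

before = (replica.reported_lag(event_ts), replica.true_lag(event_ts))
replica.reset_relay_logs()
after = (replica.reported_lag(event_ts), replica.true_lag(event_ts))

print("before reset (reported, true):", before)
print("after reset  (reported, true):", after)
```

In this model the reported lag drops to 0 right after the reset while the true lag is unchanged, which is the kind of misreporting the replication reporter would pick up.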

Reproduction Steps

  1. Run a cluster
  2. Stop replication on a tablet and see that, after VTOrc repairs it, the relay logs are gone.

Binary Version

main

Operating System and Environment details

all

Log Fragments

No response

GuptaManan100 changed the title from "Bug Report: Each VTOrc repair resets the replica logs" to "Bug Report: Each replication repair resets the replica logs" on Jun 8, 2023

deepthi commented Jun 23, 2023

This can be more serious than just a reporting issue. If VTOrc happens to perform replication repair on all replicas just before a primary failure, we can end up losing data upon ERS.
We need to backport any fix we come up with all the way back to when we introduced the RESET during replication repair. That was in #10943.
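
A rough sketch of that data-loss scenario, under stated assumptions: transaction numbers are hypothetical, the primary is taken to be unrecoverable, and the failure is assumed to happen before any replica re-fetches what the reset discarded. The helper names are made up for illustration and do not reflect the actual ERS logic.

```python
def best_candidate_position(replicas):
    """Highest transaction any replica can reach from its own local data."""
    return max(max(r["applied"], r["relay_log_up_to"]) for r in replicas)


def reset_relay_logs(replicas):
    # Assumption: a reset discards everything received but not yet applied.
    return [
        {"applied": r["applied"], "relay_log_up_to": r["applied"]}
        for r in replicas
    ]


primary_committed = 6  # hypothetical: transactions 1..6 committed on the primary
replicas = [
    {"applied": 3, "relay_log_up_to": 6},
    {"applied": 4, "relay_log_up_to": 5},
]

lost_without_reset = primary_committed - best_candidate_position(replicas)
lost_with_reset = primary_committed - best_candidate_position(
    reset_relay_logs(replicas)
)

print("transactions lost without reset:", lost_without_reset)
print("transactions lost with reset:   ", lost_with_reset)
```

Without the reset, the most advanced replica can still reach transaction 6 from its relay log. With the reset applied to every replica, the best candidate only has what it had already applied, so the remaining transactions are gone with the primary.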

GuptaManan100 commented

That PR has been part of every release from release-15.0 onwards, so we'll have to backport the fix to 17, 16, and 15.
