[BUG] [Remote Store] Timeout on shard relocation #10727
Comments
@sachinpkale has started to look into this, and it appears that it is timing out on the translog transfer (not the segment file transfer).
So, we hold the translog that corresponds to the (n-1)th refresh (the last but one). Because of this, the translog holds almost twice as much data as the translog flush threshold size. There is also a possibility that the user can pass in higher values for
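To make the "almost twice the flush threshold" point concrete, here is a rough back-of-the-envelope sketch. The throughput and timeout figures are made-up assumptions for illustration, not values taken from this issue, and the helper names are hypothetical.

```python
MIB = 1024 * 1024


def retained_translog_bytes(flush_threshold_bytes: int, retained_factor: float = 2.0) -> float:
    """Upper-bound estimate of translog data held when the (n-1)th refresh is retained.

    retained_factor=2.0 encodes the "almost twice the threshold" observation above.
    """
    return flush_threshold_bytes * retained_factor


def transfer_seconds(size_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Time needed to move size_bytes at a constant throughput."""
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return size_bytes / throughput_bytes_per_sec


def exceeds_timeout(flush_threshold_bytes: int,
                    throughput_bytes_per_sec: float,
                    timeout_seconds: float) -> bool:
    """True if transferring the retained translog would outlast the timeout."""
    size = retained_translog_bytes(flush_threshold_bytes)
    return transfer_seconds(size, throughput_bytes_per_sec) > timeout_seconds


# Hypothetical numbers: 512 MiB threshold, 1 MiB/s effective throughput, 900 s timeout.
print(exceeds_timeout(512 * MIB, 1 * MIB, 900))
```

With these assumed numbers the retained translog is about 1024 MiB, so the transfer takes about 1024 s and outlasts the 900 s budget. A higher threshold makes the gap proportionally worse.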
@ashking94 Is there anything that can be done as an immediate mitigation when this scenario occurs?
@andrross I am raising a fix for this and will need your help with the review.
I think we still have a possibility of timing out due to segment sync here as well. The sync step before peer recovery starts has a heartbeat callback setting last access time, but there is an additional segment sync call that does not here during
With that said, the timeout here is on the source shard, so that heartbeat wouldn't reset that timeout time. The issue is that the
This sync call is for incremental data, since we have already downloaded the segments and translog before starting the peer recovery from the source.
Yes, that is correct. I am exploring whether there is any knob (configuration) that can be used to tune this in the short term, and eventually move this to an approach where the target can inform the source about finishing the download without blocking the transport call.
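For readers following along, the heartbeat idea discussed above (a progress callback that resets a "last access time") can be sketched roughly as follows. This is a generic illustration and not the OpenSearch implementation: `ActivityTimeout`, `FakeClock`, and `download` are hypothetical names, and the chunk durations are invented. As noted earlier in the thread, a heartbeat on the target side would not by itself reset a timeout that lives on the source shard.

```python
import time
from typing import Callable, Iterable, Optional


class FakeClock:
    """Manually advanced clock so the example runs instantly and deterministically."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class ActivityTimeout:
    """Expires only when no heartbeat has been seen for timeout_seconds."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_access = clock()

    def heartbeat(self) -> None:
        # Record activity; this is the "set last access time" step.
        self._last_access = self._clock()

    def expired(self) -> bool:
        return self._clock() - self._last_access > self._timeout


def download(chunk_seconds: Iterable[float], monitor: ActivityTimeout, clock: FakeClock,
             on_progress: Optional[Callable[[], None]] = None) -> bool:
    """Simulate a chunked download; return False if the monitor expires first."""
    for cost in chunk_seconds:
        clock.advance(cost)          # time spent downloading this chunk
        if monitor.expired():        # the monitor would have fired during the chunk
            return False
        if on_progress is not None:
            on_progress()            # heartbeat after each completed chunk
    return True


clock = FakeClock()
monitor = ActivityTimeout(5, clock)
print(download([4, 4, 4], monitor, clock, on_progress=monitor.heartbeat))
```

In this sketch the same 12 seconds of work succeeds when each chunk reports progress and fails when no callback is wired in, which mirrors the concern about the sync call that lacks a heartbeat.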
Describe the bug
Triggering a relocation event for a large shard results in a failure caused by a timeout.
To Reproduce
Observe the following exception in the logs:
Additional context
The cause of this may be the synchronous call to download segments from the remote store in PeerRecoveryTargetService.