
[BUG] [Remote Store] Timeout on shard relocation #10727

Closed · andrross opened this issue Oct 19, 2023 · 7 comments · Fixed by #10761
Labels: bug (Something isn't working) · Storage:Durability (Issues and PRs related to the durability framework) · v2.12.0 (Issues and PRs related to version 2.12.0)

@andrross (Member) commented:

Describe the bug
Triggering a relocation event for a large shard results in a failure caused by a timeout.

To Reproduce

  • Create a 2 node cluster.
  • Create an index with 2 primary shards and ensure each node gets 1 shard (an example index creation request is shown after this list).
  • Index about 25GB of data into each shard.
  • Create an exclusion rule to force one of the shards to relocate, e.g.:
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type:application/json' -d '{
   "persistent":{
      "cluster.routing.allocation.exclude._name": "8714cc6c80113d27569bbf190a52a669"
   }
}'
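
For reference, a minimal sketch of the index creation request for step 2, plus a way to watch the resulting relocation; the shard and replica counts are assumptions based on the 2-primary, 0-replica scenario discussed in this issue, and the index name is taken from the logs below:

curl -s -XPUT 'localhost:9200/nyc_taxis' -H 'Content-Type:application/json' -d '{
   "settings":{
      "index.number_of_shards": 2,
      "index.number_of_replicas": 0
   }
}'

# Follow recovery/relocation progress while the shard moves
curl -s 'localhost:9200/_cat/recovery?v&active_only=true'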

Observe the following exception in the logs:

[2023-10-19T02:05:46,382][WARN ][o.o.i.c.IndicesClusterStateService] [30ddfff069f4928ae20b3f7f9046e41f] [nyc_taxis][2] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[nyc_taxis][2]: Recovery failed from {, remote_
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:134)
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:177)
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212)
        at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:743)
        at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:669)
        at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:412)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1526)
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:438)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:858)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: RecoveryFailedException[[nyc_taxis][2]: Recovery failed from {8714cc6c80113d27569bbf190a52a669}{
        ... 9 more
Caused by: RemoteTransportException[[8714cc6c80113d27569bbf190a52a669][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog fail
Caused by: [nyc_taxis/FPAU1P1hTH-DcuCuzkHWjA][[nyc_taxis][2]] RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: ReceiveTimeoutTransportException[[30ddfff069f4928ae20b3f7f9046e41f][
        at org.opensearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$22(RecoverySourceHandler.java:629)
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90)
        at org.opensearch.core.action.ActionListener$4.onFailure(ActionListener.java:192)
        at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311)
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218)
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210)
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75)
        at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:412)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1526)
        at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1417)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:858)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(Thread.java:833)
Caused by: ReceiveTimeoutTransportException[[30ddfff069f4928ae20b3f7f9046e41f][][internal:index/shard/recovery/prepare_translog] request_id [1304756] timed out after [60018ms]]
        at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1420)
        ... 4 more

Additional context
The cause of this may be the synchronous call to download segments from the remote store in PeerRecoveryTargetService.

andrross added the bug (Something isn't working) and untriaged labels Oct 19, 2023
@andrross (Member, Author) commented:

@sachinpkale has started to look into this, and it appears that it is timing out on the translog transfer (not the segment file transfer).

@ashking94 (Member) commented:

So, we hold the translog that corresponds to the (n-1)th refresh (the last but one). Due to this, we are holding almost twice as much data in the translog as the translog flush threshold size. There is also the possibility that a user passes a higher value for index.translog.flush_threshold_size, in which case the translog download can fail again. There are two things we should solve here:

  1. Even with no replicas, we should have refreshes running in the background. This will remove the bimodal behaviour of not uploading segments when there are 0 replicas.
  2. While checking the flush threshold size, we should consider the minimum referenced generation when computing the translog size. This will lead to flushes happening correctly.
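
For context, a hedged example of the index.translog.flush_threshold_size setting mentioned above, plus a manual flush that shrinks the translog before triggering a relocation; the 512mb value is the documented default, and whether this fully avoids the timeout in this scenario is an assumption:

# Set the flush threshold explicitly for the index (512mb is the documented default)
curl -s -XPUT 'localhost:9200/nyc_taxis/_settings' -H 'Content-Type:application/json' -d '{
   "index.translog.flush_threshold_size": "512mb"
}'

# Force a flush so less translog has to be transferred during recovery
curl -s -XPOST 'localhost:9200/nyc_taxis/_flush'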

@andrross (Member, Author) commented:

@ashking94 Is there anything that can be done as an immediate mitigation when this scenario occurs?

@ashking94 (Member) commented:

@andrross I am raising a fix for this and will need your help with the review.

@mch2 (Member) commented Oct 19, 2023:

I think we still have a possibility of timing out due to segment sync here as well. The sync step before peer recovery starts has a heartbeat callback that sets the last access time, but the additional segment sync call during innerOpenEngineAndTranslog does not.

@mch2 (Member) commented Oct 19, 2023:

With that said, the timeout here is on the source shard, so that heartbeat wouldn't reset this timeout. The issue is that the prepareForTranslogOperations step on the source previously only opened the engine without any file transfer; now it is opening the engine and syncing from the remote store.

@ashking94 (Member) commented:

> I think we still have a possibility of timing out due to segment sync here as well. The sync step before peer recovery starts has a heartbeat callback that sets the last access time, but the additional segment sync call during innerOpenEngineAndTranslog does not.

This sync call is for incremental data, since we have already downloaded the segments and translog before starting the peer recovery from the source.

> With that said, the timeout here is on the source shard, so that heartbeat wouldn't reset this timeout. The issue is that the prepareForTranslogOperations step on the source previously only opened the engine without any file transfer; now it is opening the engine and syncing from the remote store.

Yes, that is correct. I am exploring whether there is any knob (configuration) that can be used to tune this in the short term, and eventually we should move to an approach where the target can inform the source that the download has finished without blocking the transport call.
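
As an illustration of the blocking vs. non-blocking shape described above (this is not the actual OpenSearch implementation; only ActionListener is a real type from the stack trace, and the remaining names are hypothetical), a minimal Java sketch:

// Hypothetical sketch; RemoteStore and Channel are invented for illustration.
import org.opensearch.core.action.ActionListener;

class PrepareTranslogSketch {

    // Blocking shape: the handler waits for the remote download before responding,
    // so the source's prepare_translog request can time out on large shards.
    void prepareBlocking(RemoteStore remoteStore, Channel channel) {
        remoteStore.downloadTranslogAndSegments(); // may take minutes for ~25GB shards
        channel.sendResponse();                    // response arrives only after the download
    }

    // Non-blocking shape: start the download and respond from a callback,
    // so the transport call itself stays short and the source is notified when ready.
    void prepareAsync(RemoteStore remoteStore, Channel channel) {
        remoteStore.downloadAsync(ActionListener.wrap(
            ignored -> channel.sendResponse(),     // target informs source once the download finishes
            channel::sendFailure                   // propagate failures instead of timing out
        ));
    }

    // Invented collaborators, declared only so the sketch compiles on its own.
    interface RemoteStore {
        void downloadTranslogAndSegments();
        void downloadAsync(ActionListener<Void> listener);
    }

    interface Channel {
        void sendResponse();
        void sendFailure(Exception e);
    }
}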

ashking94 added the Storage:Durability (Issues and PRs related to the durability framework) and v2.12.0 (Issues and PRs related to version 2.12.0) labels and removed the Cluster Manager label Oct 20, 2023