
[BUG] [Remote Store] Timeout on shard relocation #10727

Closed · andrross opened this issue Oct 19, 2023 · 7 comments · Fixed by #10761
Labels: bug (Something isn't working) · Storage:Durability (Issues and PRs related to the durability framework) · v2.12.0 (Issues and PRs related to version 2.12.0)

@andrross (Member) commented:

Describe the bug
Triggering a relocation event for a large shard results in a failure caused by a timeout.

To Reproduce

  • Create a 2 node cluster.
  • Create an index with 2 primary shards and ensure each node gets 1 shard (an example index creation request is shown after this list).
  • Index about 25GB of data into each shard.
  • Create an exclusion rule to force one of the shards to relocate, e.g.:
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type:application/json' -d '{
   "persistent":{
      "cluster.routing.allocation.exclude._name": "8714cc6c80113d27569bbf190a52a669"
   }
}'
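
For reference, a minimal sketch of the index creation request for step 2, plus a way to watch the resulting relocation; the shard and replica counts are assumptions based on the 2-primary, 0-replica scenario discussed in this issue, and the index name is taken from the logs below:

curl -s -XPUT 'localhost:9200/nyc_taxis' -H 'Content-Type:application/json' -d '{
   "settings":{
      "index.number_of_shards": 2,
      "index.number_of_replicas": 0
   }
}'

# Follow recovery/relocation progress while the shard moves
curl -s 'localhost:9200/_cat/recovery?v&active_only=true'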

Observe the following exception in the logs:

[2023-10-19T02:05:46,382][WARN ][o.o.i.c.IndicesClusterStateService] [30ddfff069f4928ae20b3f7f9046e41f] [nyc_taxis][2] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[nyc_taxis][2]: Recovery failed from {, remote_
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:134)
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:177)
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212)
        at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:743)
        at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:669)
        at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:412)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1526)
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:438)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:858)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: RecoveryFailedException[[nyc_taxis][2]: Recovery failed from {8714cc6c80113d27569bbf190a52a669}{
        ... 9 more
Caused by: RemoteTransportException[[8714cc6c80113d27569bbf190a52a669][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog fail
Caused by: [nyc_taxis/FPAU1P1hTH-DcuCuzkHWjA][[nyc_taxis][2]] RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: ReceiveTimeoutTransportException[[30ddfff069f4928ae20b3f7f9046e41f][
        at org.opensearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$22(RecoverySourceHandler.java:629)
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90)
        at org.opensearch.core.action.ActionListener$4.onFailure(ActionListener.java:192)
        at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311)
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218)
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210)
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75)
        at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:412)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1526)
        at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1417)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:858)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(Thread.java:833)
Caused by: ReceiveTimeoutTransportException[[30ddfff069f4928ae20b3f7f9046e41f][][internal:index/shard/recovery/prepare_translog] request_id [1304756] timed out after [60018ms]]
        at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1420)
        ... 4 more

Additional context
The cause of this may be the synchronous call to download segments from the remote store in PeerRecoveryTargetService.

andrross added the bug (Something isn't working) and untriaged labels Oct 19, 2023
@andrross (Member, Author) commented:

@sachinpkale has started to look into this, and it appears that it is timing out on the translog transfer (not the segment file transfer).

@ashking94 (Member) commented:

So, we hold the translog that corresponds to the (n-1)th refresh (the last but one). Due to this, we are holding almost twice as much data in the translog as the translog flush threshold size. There is also the possibility that a user passes a higher value for index.translog.flush_threshold_size, in which case the translog download can fail again. There are two things we should solve here:

  1. Even with no replicas, we should have refreshes running in the background. This will remove the bimodal behaviour of not uploading segments when there are 0 replicas.
  2. While checking the flush threshold size, we should consider the minimum referenced generation when computing the translog size. This will lead to flushes happening correctly.
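
For context, a hedged example of the index.translog.flush_threshold_size setting mentioned above, plus a manual flush that shrinks the translog before triggering a relocation; the 512mb value is the documented default, and whether this fully avoids the timeout in this scenario is an assumption:

# Set the flush threshold explicitly for the index (512mb is the documented default)
curl -s -XPUT 'localhost:9200/nyc_taxis/_settings' -H 'Content-Type:application/json' -d '{
   "index.translog.flush_threshold_size": "512mb"
}'

# Force a flush so less translog has to be transferred during recovery
curl -s -XPOST 'localhost:9200/nyc_taxis/_flush'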

@andrross (Member, Author) commented:

@ashking94 Is there anything that can be done as an immediate mitigation when this scenario occurs?

@ashking94 (Member) commented:

@andrross I am raising a fix for this and will need your help with the review.

@mch2 (Member) commented Oct 19, 2023:

I think we still have a possibility of timing out due to segment sync here as well. The sync step before peer recovery starts has a heartbeat callback that sets the last access time, but the additional segment sync call during innerOpenEngineAndTranslog does not.

@mch2 (Member) commented Oct 19, 2023:

With that said, the timeout here is on the source shard, so that heartbeat wouldn't reset this timeout. The issue is that the prepareForTranslogOperations step on the source previously only opened the engine without any file transfer; now it is opening the engine and syncing from the remote store.

@ashking94 (Member) commented:

> I think we still have a possibility of timing out due to segment sync here as well. The sync step before peer recovery starts has a heartbeat callback that sets the last access time, but the additional segment sync call during innerOpenEngineAndTranslog does not.

This sync call is for incremental data, since we have already downloaded the segments and translog before starting the peer recovery from the source.

> With that said, the timeout here is on the source shard, so that heartbeat wouldn't reset this timeout. The issue is that the prepareForTranslogOperations step on the source previously only opened the engine without any file transfer; now it is opening the engine and syncing from the remote store.

Yes, that is correct. I am exploring whether there is any knob (configuration) that can be used to tune this in the short term, and eventually we should move to an approach where the target can inform the source that the download has finished without blocking the transport call.
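
As an illustration of the blocking vs. non-blocking shape described above (this is not the actual OpenSearch implementation; only ActionListener is a real type from the stack trace, and the remaining names are hypothetical), a minimal Java sketch:

// Hypothetical sketch; RemoteStore and Channel are invented for illustration.
import org.opensearch.core.action.ActionListener;

class PrepareTranslogSketch {

    // Blocking shape: the handler waits for the remote download before responding,
    // so the source's prepare_translog request can time out on large shards.
    void prepareBlocking(RemoteStore remoteStore, Channel channel) {
        remoteStore.downloadTranslogAndSegments(); // may take minutes for ~25GB shards
        channel.sendResponse();                    // response arrives only after the download
    }

    // Non-blocking shape: start the download and respond from a callback,
    // so the transport call itself stays short and the source is notified when ready.
    void prepareAsync(RemoteStore remoteStore, Channel channel) {
        remoteStore.downloadAsync(ActionListener.wrap(
            ignored -> channel.sendResponse(),     // target informs source once the download finishes
            channel::sendFailure                   // propagate failures instead of timing out
        ));
    }

    // Invented collaborators, declared only so the sketch compiles on its own.
    interface RemoteStore {
        void downloadTranslogAndSegments();
        void downloadAsync(ActionListener<Void> listener);
    }

    interface Channel {
        void sendResponse();
        void sendFailure(Exception e);
    }
}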

ashking94 added the Storage:Durability (Issues and PRs related to the durability framework) and v2.12.0 (Issues and PRs related to version 2.12.0) labels and removed the Cluster Manager label Oct 20, 2023