
[BUG] RemoteStoreRestoreIT tests are flaky - Missing cluster-manager, expected nodes #11085

Open

peternied opened this issue Nov 3, 2023 · 10 comments

Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), Storage:Remote

@peternied (Member) commented:

Describe the bug

  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithRefreshedDataPrimaryReplicaDown
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithCommittedData
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithRefreshedData
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithNoDataPostCommit
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreDataOnlyInTranslog
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithCommittedDataPrimaryReplicaDown

These tests are flaky.

Stacktrace

This stacktrace is consistent across all of the above tests except testRTSRestoreWithRefreshedDataPrimaryReplicaDown:

java.lang.AssertionError: Missing cluster-manager, expected nodes: [{node_s1}{GD7uyrgiTxSKXQLlfKgK_A}{EpXKR9vyS6mfjExohbCtsA}{127.0.0.1}{127.0.0.1:39151}{d}{shard_indexing_pressure_enabled=true}, {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}, {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}] and actual cluster states [cluster uuid: AzELHIGZT6uSizEPPUXcuA [committed: true]
version: 3
state uuid: borCjOpOQFWpk6JLeG0TmQ
from_diff: false
meta data version: 2
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      last_accepted_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      voting tombstones: []
metadata customs:
   repositories: {"test-remote-store-repo-2":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/SbGDsuuFWS"},"generation":-2,"pending_generation":-1},"test-remote-store-repo":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/DVrmjempuo"},"generation":-2,"pending_generation":-1}}   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}, local, cluster-manager
   {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}
routing_table (version 1):
routing_nodes:
-----node_id[QOKIN5VyR06merR9uOI4bw][V]
---- unassigned
, cluster uuid: _na_ [committed: false]
version: 0
state uuid: DDhrUTuzS0-TkQaTHp03EQ
from_diff: false
meta data version: 0
   coordination_metadata:
      term: 0
      last_committed_config: VotingConfiguration{}
      last_accepted_config: VotingConfiguration{}
      voting tombstones: []
metadata customs:
   index-graveyard: IndexGraveyard[[]]
blocks: 
   _global_:
      1,state not recovered / initialized, blocks READ,WRITE,METADATA_READ,METADATA_WRITE,CREATE_INDEX      2,no cluster-manager, blocks METADATA_WRITE
nodes: 
   {node_s1}{GD7uyrgiTxSKXQLlfKgK_A}{EpXKR9vyS6mfjExohbCtsA}{127.0.0.1}{127.0.0.1:39151}{d}{shard_indexing_pressure_enabled=true}, local
routing_table (version 0):
routing_nodes:
-----node_id[GD7uyrgiTxSKXQLlfKgK_A][V]
---- unassigned
, cluster uuid: AzELHIGZT6uSizEPPUXcuA [committed: true]
version: 3
state uuid: borCjOpOQFWpk6JLeG0TmQ
from_diff: false
meta data version: 2
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      last_accepted_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      voting tombstones: []
metadata customs:
   repositories: {"test-remote-store-repo-2":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/SbGDsuuFWS"},"generation":-2,"pending_generation":-1},"test-remote-store-repo":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/DVrmjempuo"},"generation":-2,"pending_generation":-1}}   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}, cluster-manager
   {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}, local
routing_table (version 1):
routing_nodes:
-----node_id[QOKIN5VyR06merR9uOI4bw][V]
---- unassigned
]

From testRTSRestoreWithRefreshedDataPrimaryReplicaDown

java.lang.AssertionError:  inconsistent generation 

To Reproduce
CI - https://build.ci.opensearch.org/job/gradle-check/29522/testReport/

Expected behavior
Tests should always pass

@linuxpi (Collaborator) commented Jan 12, 2024

The most recently reported failure is due to a suite timeout:

java.lang.Exception: Test abandoned because suite timeout was reached.
	at __randomizedtesting.SeedInfo.seed([ADC655A786161FFE]:0)

Found one blocked thread:

 12) Thread[id=4135, name=opensearch[node_s2][generic][T#3], state=BLOCKED, group=TGRP-RemoteStoreRestoreIT]
        at org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
        at org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
        at org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)
        at org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4690)
        at org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4684)
        at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
        at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
        at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
        at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1862)
        at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1977)
        at org.opensearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:589)
        at org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:546)
        at org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:146)
        at org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4812)
        at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:720)
        at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4079)
        at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4049)
        at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$33(IndexShard.java:4000)
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
        at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

@linuxpi (Collaborator) commented Jan 12, 2024

The test was not able to properly terminate node_s2, which lingered for ~20 minutes:

[2023-12-22T10:14:00,407][INFO ][o.o.t.InternalTestCluster] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] Closing filtered random node [node_s2] 
[2023-12-22T10:14:00,408][INFO ][o.o.n.Node               ] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] stopping ...
[2023-12-22T10:14:00,408][INFO ][o.o.c.c.Coordinator      ] [node_s2] cluster-manager node [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{jBjkBSbZSL-3AlK2H7Z0gg}{127.0.0.1}{127.0.0.1:35721}{m}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.transport.NodeDisconnectedException: [node_s0][127.0.0.1:35721][disconnected] disconnected
[2023-12-22T10:14:00,409][INFO ][o.o.c.c.FollowersChecker ] [node_s0] FollowerChecker{discoveryNode={node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{nOLSTsBLR7-smx0D64pDvA}{127.0.0.1
...[truncated 1074993 chars]...
v8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:28,614][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[... the same "cluster-manager not discovered yet" warning from node_s2 repeats roughly every 10 seconds ...]
[2023-12-22T10:33:58,638][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:58,924][INFO ][o.o.n.Node               ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] stopped
[2023-12-22T10:33:58,924][INFO ][o.o.n.Node               ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] closing ...
[2023-12-22T10:33:58,931][INFO ][o.o.n.Node               ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] closed
[2023-12-22T10:33:59,438][INFO ][o.o.r.RemoteStoreRestoreIT] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] [RemoteStoreRestoreIT#testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown]: cleaning up after test
[2023-12-22T10:33:59,439][INFO ][o.o.r.RemoteStoreRestoreIT] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] after test

@linuxpi (Collaborator) commented Jan 12, 2024

Looks like the thread was blocked on:

@Override
public GatedCloseable<SegmentInfos> getSegmentInfosSnapshot() {
    synchronized (engineMutex) {
        if (newEngineReference.get() == null) {
            throw new AlreadyClosedException("engine was closed");
        }
        return newEngineReference.get().getSegmentInfosSnapshot();
    }
}

@linuxpi (Collaborator) commented Jan 12, 2024

We are terminating the nodes in the order replica then primary, but we only check shard 0. The node hosting shard 0's replica would have had primaries for other shards; terminating it leads to relocation, and during relocation we create the read-only engine (code pointer in the previous comment), which is where our locks are blocking the thread.

/**
 * Helper function to test restoring an index having replicas from remote store when all the nodes housing the primary/replica drop.
 * @param numberOfIterations Number of times a refresh/flush should be invoked, followed by indexing some data.
 * @param invokeFlush If true, a flush is invoked. Otherwise, a refresh is invoked.
 * @throws IOException IO Exception.
 */
private void testRestoreFlowBothPrimaryReplicasDown(int numberOfIterations, boolean invokeFlush, int shardCount) throws Exception {
    prepareCluster(1, 2, INDEX_NAME, 1, shardCount);
    Map<String, Long> indexStats = indexData(numberOfIterations, invokeFlush, INDEX_NAME);
    assertEquals(shardCount * 2, getNumShards(INDEX_NAME).totalNumShards);
    internalCluster().stopRandomNode(InternalTestCluster.nameFilter(replicaNodeName(INDEX_NAME)));
    internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNodeName(INDEX_NAME)));
    ensureRed(INDEX_NAME);
    internalCluster().startDataOnlyNodes(2);
    restoreAndVerify(shardCount, 1, indexStats);
}
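
For context, the replicaNodeName/primaryNodeName helpers used above resolve the node from the routing table of shard 0 only. A rough sketch of that kind of lookup follows (the class, method name, and exact calls are illustrative assumptions, not the verified OpenSearch helper):

import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.routing.ShardRouting;

final class ShardZeroNodeLookup {
    // Rough sketch (assumed, not the verified helper): resolve the node currently holding
    // the replica of shard 0. With shardCount > 1, that node can still host primaries of
    // other shards, so stopping it triggers relocation/promotion for those shards as well.
    static String replicaNodeNameForShardZero(ClusterState state, String indexName) {
        ShardRouting replica = state.routingTable().index(indexName).shard(0).replicaShards().get(0);
        return state.nodes().get(replica.currentNodeId()).getName();
    }
}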

@linuxpi (Collaborator) commented Jan 12, 2024

Index shutdown and replica-to-primary promotion are causing a deadlock:

"opensearch[node_s2][generic][T#3]" ID=4135 BLOCKED on java.lang.Object@3c0fa470 owned by "opensearch[node_s2][indices_shutdown][T#1]" ID=4183
	at app//org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
	- blocked on java.lang.Object@3c0fa470
	at app//org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
	at app//org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)
	at app//org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4690)
	at app//org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4684)
	at app//org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
	at app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
	at app//org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
	at app//org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1862)
	at app//org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1977)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:589)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:546)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:146)
	at app//org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4812)
	at app//org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:720)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca36238.run(Unknown Source)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4079)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4049)
	at app//org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$33(IndexShard.java:4000)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca11738.accept(Unknown Source)
	at app//org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at app//org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
	at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at [email protected]/java.lang.Thread.runWith(Thread.java:1596)
	at [email protected]/java.lang.Thread.run(Thread.java:1583)
	Locked synchronizers:
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@28675992
	- java.util.concurrent.ThreadPoolExecutor$Worker@1197c3a3
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@57ac517a

  20) Thread[id=4183, name=opensearch[node_s2][indices_shutdown][T#1], state=WAITING, group=TGRP-RemoteStoreRestoreIT]
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
        at java.base/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
        at org.opensearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:69)
        at org.opensearch.index.engine.Engine.close(Engine.java:2070)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:89)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:131)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:81)
        at org.opensearch.index.shard.IndexShard$11.close(IndexShard.java:4785)
        at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:2059)
        at org.opensearch.index.shard.IndexShard.close(IndexShard.java:2003)
        at org.opensearch.index.IndexService.closeShard(IndexService.java:642)
        at org.opensearch.index.IndexService.removeShard(IndexService.java:618)
        at org.opensearch.index.IndexService.close(IndexService.java:389)
        at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1047)
        at org.opensearch.indices.IndicesService.lambda$doStop$3(IndicesService.java:542)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

In the IndexShard.close flow we acquire engineMutex and then try to acquire the write lock on the engine. In the replica-to-primary promotion flow we acquire the read lock on the engine first and then try to acquire engineMutex. The read and write locks are shared via a ReentrantReadWriteLock, leading to deadlock.
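
To make the lock-ordering inversion concrete, here is a minimal, self-contained sketch in plain Java (a simplified analogy, not the actual OpenSearch code): one thread models the close flow (monitor first, then write lock), the other models the promotion flow (read lock first, then monitor). Running it leaves both threads blocked forever.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class EngineCloseDeadlockSketch {
    private static final Object engineMutex = new Object();                          // analog of IndexShard.engineMutex
    private static final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();  // analog of the engine's read/write lock
    private static final CountDownLatch bothStarted = new CountDownLatch(2);

    public static void main(String[] args) throws InterruptedException {
        Thread closeFlow = new Thread(() -> {
            synchronized (engineMutex) {          // close flow: take the monitor first...
                bothStarted.countDown();
                awaitQuietly();
                rwl.writeLock().lock();           // ...then wait for the write lock
                try { /* close engine */ } finally { rwl.writeLock().unlock(); }
            }
        }, "indices_shutdown");

        Thread promotionFlow = new Thread(() -> {
            rwl.readLock().lock();                // promotion/refresh path: already holds the read lock...
            try {
                bothStarted.countDown();
                awaitQuietly();
                synchronized (engineMutex) {      // ...then needs the monitor (getSegmentInfosSnapshot)
                    /* read segment infos */
                }
            } finally {
                rwl.readLock().unlock();
            }
        }, "generic");

        closeFlow.start();
        promotionFlow.start();
        closeFlow.join();                         // never returns: classic ABBA deadlock
    }

    private static void awaitQuietly() {
        try { bothStarted.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}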

@linuxpi (Collaborator) commented Jan 12, 2024

Created a separate bug to track the deadlock issue, since this issue is tracking the other reasons the restore tests are failing.

#11869

@linuxpi (Collaborator) commented Feb 5, 2024

I wasn't able to reproduce this even after 1K iterations. Since there have been only 2 occurrences and the last one was almost 2 months ago, I'm closing the issue. Feel free to reopen if you encounter the same issue again.

@linuxpi closed this as completed on Feb 6, 2024
@shwetathareja (Member) commented:

org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown failed again here - #12252 (comment)

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3006, name=opensearch[node_s2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-RemoteStoreRestoreIT]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73:F08AC3EEAC6BC269]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7384c70f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@57463c3d[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007ff968b34230@56239546]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@294dab55[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 13]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73]:0)
	at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:469)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:125)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:178)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at [email protected]/java.lang.Thread.run(Thread.java:1583)

@linuxpi (Collaborator) commented Apr 22, 2024

All 3 recent failures were for

org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown

Failure trace:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3006, name=opensearch[node_s2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-RemoteStoreRestoreIT]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73:F08AC3EEAC6BC269]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7384c70f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@57463c3d[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007ff968b34230@56239546]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@294dab55[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 13]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73]:0)
	at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:469)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:125)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:178)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at [email protected]/java.lang.Thread.run(Thread.java:1583)

@linuxpi (Collaborator) commented May 1, 2024

Ran around 5000 iterations locally and I cannot reproduce this. I will add a trace-logging annotation on this test to capture more details in PR build failures and help debug.
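
For reference, a minimal sketch of what such an annotation could look like, assuming the test framework's @TestLogging annotation; the class name and logger names below are illustrative, not the actual change:

import org.opensearch.test.OpenSearchIntegTestCase;
import org.opensearch.test.junit.annotations.TestLogging;

public class RemoteStoreRestoreTraceIT extends OpenSearchIntegTestCase {

    // Raises the named loggers to TRACE only for the duration of this test method,
    // so gradle-check output carries more detail when the test fails in CI.
    @TestLogging(value = "org.opensearch.index.shard:TRACE,org.opensearch.indices.recovery:TRACE", reason = "debugging flaky remote store restore failures")
    public void testRestoreWithTraceLogging() throws Exception {
        // test body elided in this sketch
    }
}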
