[BUG] RemoteStoreRestoreIT tests are flaky - Missing cluster-manager, expected nodes #11085
Comments
The most recently reported failure is due to a suite timeout.
Found one blocked thread.
The test was not able to properly terminate node_s2, which lingered on for 20 minutes.
Looks like the thread was blocked on OpenSearch/server/src/main/java/org/opensearch/index/shard/IndexShard.java, lines 4752 to 4763 in 5c82ab8.
We terminate the nodes in replica-then-primary order, but we only check shard 0. The node holding the replica for shard 0 would also have held primaries for other shards. Terminating it leads to relocation, and during relocation we create the read-only engine (code pointer in the last comment), which is where our locks block the thread: OpenSearch/server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreRestoreIT.java, lines 137 to 156 in 5c82ab8.
Index shutdown and replica-to-primary promotion are causing a deadlock.
In the IndexShard.close flow we acquire engineMutex and then try to acquire the write lock on the engine. In the replica-to-primary promotion flow we acquire the read lock on the engine first and then try to acquire engineMutex. The read and write locks belong to the same ReentrantReadWriteLock, so each thread holds what the other one needs, leading to a deadlock.
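The lock-order inversion described above can be reproduced in isolation. The following is a minimal sketch, not the actual IndexShard code: the class `LockOrderSketch`, its methods, and the use of a `ReentrantLock` in place of `engineMutex` are all assumptions made for illustration. It uses `tryLock` with a timeout so that the demo reports the blocked threads instead of hanging the way the test suite did.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the two flows; not OpenSearch code.
final class LockOrderSketch {
    // Assumed analogues of IndexShard's engineMutex and the engine's read/write lock.
    private final Lock engineMutex = new ReentrantLock();
    private final ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();

    /** Runs both flows with opposite lock orders; returns how many threads could not get their second lock. */
    int runInvertedOrder() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();
        // "close" flow: engineMutex first, then the engine write lock.
        Thread close = new Thread(() ->
            attempt(engineMutex, engineLock.writeLock(), bothHoldFirst, bothAttempted, blocked));
        // "promotion" flow: engine read lock first, then engineMutex.
        Thread promote = new Thread(() ->
            attempt(engineLock.readLock(), engineMutex, bothHoldFirst, bothAttempted, blocked));
        runBoth(close, promote);
        return blocked.get();
    }

    /** Runs both flows taking engineMutex first; returns true if both threads finished. */
    boolean runConsistentOrder() {
        Thread close = new Thread(() -> lockInOrder(engineMutex, engineLock.writeLock()));
        Thread promote = new Thread(() -> lockInOrder(engineMutex, engineLock.readLock()));
        return runBoth(close, promote);
    }

    private static void attempt(Lock first, Lock second, CountDownLatch bothHoldFirst,
                                CountDownLatch bothAttempted, AtomicInteger blocked) {
        first.lock();
        try {
            // Wait until the other thread also holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // With a plain lock() this call would never return.
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            // Keep holding the first lock until both threads have made their attempt.
            bothAttempted.countDown();
            bothAttempted.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    private static void lockInOrder(Lock first, Lock second) {
        first.lock();
        try {
            second.lock();
            second.unlock();
        } finally {
            first.unlock();
        }
    }

    private static boolean runBoth(Thread a, Thread b) {
        a.start();
        b.start();
        try {
            a.join(5_000);
            b.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !a.isAlive() && !b.isAlive();
    }

    public static void main(String[] args) {
        LockOrderSketch sketch = new LockOrderSketch();
        System.out.println("blocked with inverted order: " + sketch.runInvertedOrder());
        System.out.println("completed with consistent order: " + sketch.runConsistentOrder());
    }
}
```

Taking the locks in the same order in both flows, as `runConsistentOrder` does, is one common way to avoid this kind of deadlock; whether that is the right fix for IndexShard is tracked in the separate bug and is not settled here.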
Created a separate bug to track the deadlock issue, since this bug tracks the other reasons the restore tests are failing.
I wasn't able to reproduce this even after 1K iterations. Since there have been only 2 occurrences and the last one was almost 2 months ago, I am closing the issue. Feel free to reopen if you encounter the same issue again.
org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown failed again here - #12252 (comment)
All 3 recent failures were for failure trace
Ran around 5000 iterations locally and I cannot reproduce this. I will add a trace logging annotation on this test to get more details from the PR build failures and help debugging.
Describe the bug
RemoteStoreRestoreIT tests are flaky.
Stacktrace
Consistent with all but
testRTSRestoreWithRefreshedDataPrimaryReplicaDown
From
testRTSRestoreWithRefreshedDataPrimaryReplicaDown
To Reproduce
CI - https://build.ci.opensearch.org/job/gradle-check/29522/testReport/
Expected behavior
Test should always pass