
[BUG] org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness flaky #3603

Open
reta opened this issue Jun 16, 2022 · 17 comments
Labels: bug, flaky-test, >test-failure

Comments

reta (Collaborator) commented Jun 16, 2022

Describe the bug
New flaky test after #3563 got merged:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness" -Dtests.seed=CD3B9289D31206B8 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl -Dtests.timezone=Asia/Katmandu -Druntime.java=17

org.opensearch.cluster.allocation.AwarenessAllocationIT > testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness FAILED
    java.lang.AssertionError: unexpected
        at org.opensearch.test.InternalTestCluster.removeExclusions(InternalTestCluster.java:1912)
        at org.opensearch.test.InternalTestCluster.stopNodesAndClients(InternalTestCluster.java:1777)
        at org.opensearch.test.InternalTestCluster.stopNodesAndClient(InternalTestCluster.java:1764)
        at org.opensearch.test.InternalTestCluster.stopRandomNode(InternalTestCluster.java:1672)
        at org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness(AwarenessAllocationIT.java:425)

        Caused by:
        java.util.concurrent.ExecutionException: MasterNotDiscoveredException[null]
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:286)
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:273)
            at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:104)
            at org.opensearch.test.InternalTestCluster.removeExclusions(InternalTestCluster.java:1910)
            ... 4 more

            Caused by:
            MasterNotDiscoveredException[null]
                at app//org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:282)
                at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
                at app//org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
                at app//org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697)
                at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739)
                at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
                at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
                at [email protected]/java.lang.Thread.run(Thread.java:833)

    MasterNotDiscoveredException[null]
        at app//org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:282)
        at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
        at app//org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
        at app//org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697)
        at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at [email protected]/java.lang.Thread.run(Thread.java:833)

To Reproduce
Steps to reproduce the behavior:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness" -Dtests.seed=CD3B9289D31206B8 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl -Dtests.timezone=Asia/Katmandu -Druntime.java=17

Expected behavior
On CI, the test is flaky; see https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6061.log for an example.

Plugins
Standard.

Host/Environment

  • CI

dreamer-89 (Member) commented Sep 7, 2022

Seed for repro

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness" -Dtests.seed=6191C1DB4D571AC -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=lv -Dtests.timezone=Asia/Anadyr -Druntime.java=17

nknize (Collaborator) commented Nov 2, 2022

@imRishN Can you have a look at this test? Do we need 15 nodes for this test, or could we scale it back to something like 12, or three nodes per AZ? This is pretty demanding for an integration test, and I'm wondering whether that's what is causing these intermittent timeout failures.
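
For reference, a rough sketch of what scaling the setup back might look like, assuming the test builds its cluster with internalCluster().startNodes(...) and node.attr.zone settings the way other awareness ITs do; the exact settings in the real test (including the load-awareness settings, omitted here) may differ:

```java
// Hypothetical sketch only, inside a test method of a class extending OpenSearchIntegTestCase.
// Assumes: import java.util.List; import org.opensearch.common.settings.Settings;
Settings commonSettings = Settings.builder()
    .put("cluster.routing.allocation.awareness.attributes", "zone")
    .put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
    .build();

int nodesPerZone = 4; // e.g. 12 nodes total instead of the current 15
for (String zone : List.of("a", "b", "c")) {
    // Start nodesPerZone nodes tagged with the given zone attribute.
    internalCluster().startNodes(
        nodesPerZone,
        Settings.builder().put(commonSettings).put("node.attr.zone", zone).build()
    );
}
```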

dblock (Member) commented Nov 8, 2022

#5143

dblock (Member) commented Nov 18, 2022

#5283

imRishN (Member) commented Nov 19, 2022

I'll take a look at this test and try to fix it.

dblock changed the title from "[CI] Test Failure org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness" to "[BUG] org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness flaky" on Dec 1, 2022
Bukhtawar (Collaborator) commented

@imRishN let's take a look and get back.

andrross (Member) commented

Another failure here: #6333 (comment)

ashking94 (Member) commented

Another failure: #8758 (comment)

shwetathareja (Member) commented

Another failure here - #10777 (comment)


imRishN (Member) commented Jan 7, 2024

Investigated this issue. OpenSearch follows an essentially greedy approach when allocating shards: it does not compute an optimal allocation across all shards that need to be assigned, but places them one at a time, with the allocation deciders (filters and rules) constraining which nodes each shard may go to.

The unassigned shard that fails this test comes from exactly that behaviour: the only node that still has space for the shard conflicts with the awareness allocation decider, so the allocator is stuck waiting for space elsewhere and the shard stays unassigned. This is more likely to happen in this particular test because it creates a 15-node cluster with over 120 shards, which raises the probability of ending up in such a state; a smaller cluster with fewer shards would be much less likely to hit it.
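
To make the failure mode concrete, here is a toy illustration in plain Java (not OpenSearch code): a greedy, one-shard-at-a-time placement under a zone-awareness rule ("copies of the same shard must land in different zones") can strand a shard even though a feasible assignment exists.

```java
import java.util.*;

public class GreedyAllocationDemo {
    record Node(String name, String zone, int capacity) {}
    record Shard(String id, String copy) {}   // copy: "p" (primary) or "r" (replica)

    public static void main(String[] args) {
        // Node iteration order stands in for whatever node the real allocator happens to pick first.
        List<Node> nodes = List.of(
            new Node("n3", "b", 1),
            new Node("n1", "a", 1),
            new Node("n2", "a", 1));

        // Shards are placed greedily in this order; s2 takes the only zone-b node first.
        List<Shard> shards = List.of(
            new Shard("s2", "p"),
            new Shard("s1", "p"),
            new Shard("s1", "r"));

        Map<String, Integer> used = new HashMap<>();          // node -> shards placed
        Map<String, Set<String>> zonesOfShard = new HashMap<>(); // shard id -> zones already holding a copy

        for (Shard s : shards) {
            Node chosen = null;
            for (Node n : nodes) {
                boolean hasSpace = used.getOrDefault(n.name(), 0) < n.capacity();
                boolean zoneOk = !zonesOfShard.getOrDefault(s.id(), Set.of()).contains(n.zone());
                if (hasSpace && zoneOk) { chosen = n; break; }
            }
            if (chosen == null) {
                System.out.println(s + " -> UNASSIGNED (only node with space violates the awareness rule)");
            } else {
                used.merge(chosen.name(), 1, Integer::sum);
                zonesOfShard.computeIfAbsent(s.id(), k -> new HashSet<>()).add(chosen.zone());
                System.out.println(s + " -> " + chosen.name());
            }
        }
        // Prints: s2[p] -> n3, s1[p] -> n1, s1[r] -> UNASSIGNED,
        // even though (s2 -> n2, s1[p] -> n1, s1[r] -> n3) would satisfy every constraint.
    }
}
```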

This also seems to be a known issue, judging by the open issues in Elasticsearch/OpenSearch.

Also added the same in #7401.

imRishN (Member) commented Jan 8, 2024

Closing the issue; the test has been muted and will be fixed and re-enabled once the allocator fix lands.

imRishN closed this as completed Jan 8, 2024
peternied (Member) commented

This issue isn't fixed; the test is only disabled. If we were to delete the test I'd be happy to close this issue; however, I suspect that we want to fix the underlying issue. If there is another issue tracking the underlying problem, please link it here.

peternied reopened this Jan 8, 2024
imRishN (Member) commented Jan 8, 2024

@peternied, the merged PR that mutes the test links it to the actual underlying issue. The PR muting the test is #11767, and it links the test to the underlying issue, #5908.

Feel free to close this issue if that suffices; otherwise we can keep it open.
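
For anyone landing here later: muting a test in the OpenSearch suite is typically done with the Lucene test framework's @AwaitsFix annotation pointing at the tracking issue. A sketch of what the muted test likely looks like (see #11767 for the actual merged change; the import path depends on the Lucene version in use):

```java
// Sketch only; the exact annotation placement is in PR #11767.
// import org.apache.lucene.tests.util.LuceneTestCase.AwaitsFix;  // Lucene 9.x path
@AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/5908")
public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws Exception {
    // ... existing test body, unchanged ...
}
```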

imRishN closed this as completed Jan 8, 2024
imRishN reopened this Jan 8, 2024