[BUG] org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness flaky #3603

reta · 2022-06-16T14:05:24Z

Describe the bug
New flaky test after #3563 got merged:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness" -Dtests.seed=CD3B9289D31206B8 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl -Dtests.timezone=Asia/Katmandu -Druntime.java=17

org.opensearch.cluster.allocation.AwarenessAllocationIT > testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness FAILED
    java.lang.AssertionError: unexpected
        at org.opensearch.test.InternalTestCluster.removeExclusions(InternalTestCluster.java:1912)
        at org.opensearch.test.InternalTestCluster.stopNodesAndClients(InternalTestCluster.java:1777)
        at org.opensearch.test.InternalTestCluster.stopNodesAndClient(InternalTestCluster.java:1764)
        at org.opensearch.test.InternalTestCluster.stopRandomNode(InternalTestCluster.java:1672)
        at org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness(AwarenessAllocationIT.java:425)

        Caused by:
        java.util.concurrent.ExecutionException: MasterNotDiscoveredException[null]
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:286)
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:273)
            at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:104)
            at org.opensearch.test.InternalTestCluster.removeExclusions(InternalTestCluster.java:1910)
            ... 4 more

            Caused by:
            MasterNotDiscoveredException[null]
                at app//org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:282)
                at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
                at app//org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
                at app//org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697)
                at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739)
                at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
                at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
                at [email protected]/java.lang.Thread.run(Thread.java:833)

    MasterNotDiscoveredException[null]
        at app//org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:282)
        at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
        at app//org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
        at app//org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697)
        at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at [email protected]/java.lang.Thread.run(Thread.java:833)

To Reproduce
Steps to reproduce the behavior:

/gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness" -Dtests.seed=CD3B9289D31206B8 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl -Dtests.timezone=Asia/Katmandu -Druntime.java=17

Expected behavior
On CI, the test is flaky, see please https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6061.log for example.,

Plugins
Standard.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

CI

Additional context
Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

dreamer-89 · 2022-09-07T21:49:33Z

Seed for repro

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness" -Dtests.seed=6191C1DB4D571AC -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=lv -Dtests.timezone=Asia/Anadyr -Druntime.java=17

nknize · 2022-11-02T15:13:38Z

@imRishN Can you have a look at this test? Do we need 15 nodes for this test or could we scale it back to something like 12? three nodes per AZ? This is pretty demanding for an integration test and I'm wondering if that's what's causing these intermittent timeout failures?

dblock · 2022-11-08T16:41:14Z

#5143

dblock · 2022-11-18T22:23:30Z

#5283

imRishN · 2022-11-19T19:03:54Z

I'll take a look to this test and try to fix it

Bukhtawar · 2023-02-07T08:50:19Z

@imRishN lets take a look and get back

andrross · 2023-02-23T22:55:00Z

Another failure here: #6333 (comment)

ashking94 · 2023-08-04T10:40:40Z

Another failure : #8758 (comment)

shwetathareja · 2023-10-23T07:21:48Z

Another failure here - #10777 (comment)

peternied · 2024-01-02T22:09:14Z

Impacted PR Fix issue when calling Delete PIT endpoint and no PITs exist #11711, see test report.

imRishN · 2024-01-07T14:38:39Z

Investigated this issue. OpenSearch in a way follows greedy approach while allocating shards and doesn't compute the optimal allocation for all the shards that needs to be allocated. This approach based on certain filters and rules tries to control nodes where shards are assigned.

The unassigned shards causing test failure is due to the same above reason where a node where the shard was supposed to be assigned created a conflict with the awareness allocation decider. Hence, it is stuck in a state, waiting for space to allocate the unassigned shard because it cannot assign it to the only node with space. This also seemed more likely to happen in this particular test case because it is creating a 15 nodes cluster and over 120 shards which increases the probability of landing up in such a case. A smaller cluster with lesser shards would be less likely to land up in such a case.

This also seem to be a known issue after scrolling open issues in ElasticSearch/OpenSearch

Also added same in #7401

imRishN · 2024-01-08T07:06:03Z

Closing the issue, test has been muted and will be fixed and enabled back after the fix in allocator

peternied · 2024-01-08T14:31:12Z

This issue isn't fixed - the test is disabled. If we were to delete the test I'd be happy to close this issue; however, I suspect that we want to fix the underlying issue. If there is another issue tracking the scroll issue please link it here.

imRishN · 2024-01-08T16:30:48Z

@peternied, the merged PR which is muting the test links it to actual underlying issue. This is the merged PR muting the test - #11767. And the PR links the test to actual issue - #5908.

Feel free to close the issue if this suffices, else we can keep this open

reta added bug Something isn't working untriaged labels Jun 16, 2022

reta mentioned this issue Jun 16, 2022

Update to Gradle 7.5 #3594

Merged

5 tasks

kotwanikunal added >test-failure Test failure from CI, local build, etc. flaky-test Random test failure that succeeds on second run and removed untriaged labels Jun 16, 2022

tlfeng mentioned this issue Jun 22, 2022

[Backport 2.x] [BUG] opensearch crashes on closed client connection before search reply #3645

Merged

imRishN mentioned this issue Jun 22, 2022

Fixing flaky test testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness #3646

Merged

5 tasks

reta mentioned this issue Aug 17, 2022

Add changes for Create PIT and Delete PIT rest layer and rest high level client #4064

Merged

5 tasks

This was referenced Sep 2, 2022

[Bug]: gradle check failing with java heap OutOfMemoryError #4328

Merged

Add bwcVersion 1.3.6 to 2.x #4452

Merged

reta mentioned this issue Oct 24, 2022

[Backport] [2.x] Add missing no jdk distributions (#4722) #4884

Merged

kotwanikunal mentioned this issue Oct 27, 2022

Add remote shards allocator for searchable snapshots #4870

Merged

6 tasks

VachaShah mentioned this issue Oct 30, 2022

Fix dependencies #4963

Merged

6 tasks

nknize mentioned this issue Nov 2, 2022

[Test] Add IAE test for deprecated edgeNGram analyzer name #5040

Merged

dblock mentioned this issue Nov 8, 2022

Fix: raise error on malformed CSV. #5143

Merged

Poojita-Raj mentioned this issue Nov 8, 2022

[Backport 2.x] [Segment Replication] Fix for AlreadyClosedException for engine #5116

Merged

Poojita-Raj mentioned this issue Nov 15, 2022

[Meta] Fix random test failures #1715

Closed

37 tasks

dblock assigned imRishN Nov 21, 2022

dblock changed the title ~~[CI] Test Failure org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness~~ [BUG] org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness flaky Dec 1, 2022

kartg mentioned this issue Jan 4, 2023

Enhance searchable snapshots to enable a read-only view of older snapshots #5429

Merged

6 tasks

navneet1v mentioned this issue Jan 10, 2023

Add GeoTile and GeoHash Grid aggregations on GeoShapes. #5589

Merged

6 tasks

mch2 mentioned this issue Jan 18, 2023

Muting TranslogTransferManagerTests on Windows. #5924

Merged

6 tasks

BhumikaSaini-Amazon mentioned this issue Jul 21, 2023

[Remote Segment Store] Add Lucene major version to UploadedSegmentMetadata (#8088) #8820

Merged

6 tasks

ashking94 mentioned this issue Aug 4, 2023

Fix local segments stats update in RemoteStoreRefreshListener #8758

Merged

6 tasks

kotwanikunal mentioned this issue Aug 15, 2023

Mute remote store + segRep flaky tests that frequently block checks. #9366

Merged

6 tasks

sohami mentioned this issue Sep 2, 2023

Add average query concurrency metric for concurrent segment search #9670

Merged

6 tasks

amkhar mentioned this issue Oct 18, 2023

Add cluster state stats #10670

Merged

7 tasks

reta mentioned this issue Oct 19, 2023

[Streaming Indexing] Introduce new experimental HTTP transport based on Netty 4 and Project Reactor (Reactor Netty) #9672

Merged

6 tasks

shwetathareja mentioned this issue Oct 23, 2023

Fix remote cluster restore for data stream #10777

Merged

7 tasks

reta mentioned this issue Oct 30, 2023

Bump org.apache.logging.log4j:log4j-core from 2.21.0 to 2.21.1 in /buildSrc/src/testKit/thirdPartyAudit/sample_jars #11000

Merged

7 tasks

ticheng-aws mentioned this issue Nov 9, 2023

Fix slice collectors to leaves association with post filter #11134

Merged

8 tasks

neetikasinghal mentioned this issue Nov 10, 2023

update index random function to fix the bogus document deletion #11142

Merged

8 tasks

Poojita-Raj mentioned this issue Nov 17, 2023

[Backport 2.x] Reduce amount of docs ingested for SegRep ITs #11237

Merged

sohami mentioned this issue Nov 19, 2023

Add indexRandomForConcurrentSearch to tests #11259

Merged

8 tasks

reta mentioned this issue Nov 27, 2023

[Backport 2.x] Properly encapsulate SearchRequestOperationsListener related APIs as package protected (internal) #11345

Merged

This was referenced Dec 6, 2023

[AUTOCUT] Gradle Check Failure on push to 2.x #11486

Closed

[AUTOCUT] Gradle Check Failure on push to main #11519

Closed

imRishN mentioned this issue Jan 5, 2024

Mute flaky testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness #11767

Merged

1 task

bowenlan-amzn mentioned this issue Jan 6, 2024

Apply the fast filter optimization to composite aggregation #11505

Merged

8 tasks

imRishN closed this as completed Jan 8, 2024

peternied reopened this Jan 8, 2024

github-actions bot added the untriaged label Jan 8, 2024

imRishN closed this as completed Jan 8, 2024

imRishN reopened this Jan 8, 2024

peternied removed the untriaged label Jan 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness flaky #3603

[BUG] org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness flaky #3603

reta commented Jun 16, 2022

dreamer-89 commented Sep 7, 2022 •

edited

Loading

nknize commented Nov 2, 2022 •

edited

Loading

dblock commented Nov 8, 2022

dblock commented Nov 18, 2022

imRishN commented Nov 19, 2022

Bukhtawar commented Feb 7, 2023

andrross commented Feb 23, 2023

ashking94 commented Aug 4, 2023

shwetathareja commented Oct 23, 2023

peternied commented Jan 2, 2024

imRishN commented Jan 7, 2024

imRishN commented Jan 8, 2024

peternied commented Jan 8, 2024

imRishN commented Jan 8, 2024 •

edited

Loading

[BUG] org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness flaky #3603

[BUG] org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness flaky #3603

Comments

reta commented Jun 16, 2022

dreamer-89 commented Sep 7, 2022 • edited Loading

nknize commented Nov 2, 2022 • edited Loading

dblock commented Nov 8, 2022

dblock commented Nov 18, 2022

imRishN commented Nov 19, 2022

Bukhtawar commented Feb 7, 2023

andrross commented Feb 23, 2023

ashking94 commented Aug 4, 2023

shwetathareja commented Oct 23, 2023

peternied commented Jan 2, 2024

imRishN commented Jan 7, 2024

imRishN commented Jan 8, 2024

peternied commented Jan 8, 2024

imRishN commented Jan 8, 2024 • edited Loading

dreamer-89 commented Sep 7, 2022 •

edited

Loading

nknize commented Nov 2, 2022 •

edited

Loading

imRishN commented Jan 8, 2024 •

edited

Loading