[CI] GatewayMetaStatePersistedStateTests testDataOnlyNodePersistence failing #87952

Closed
DaveCTurner opened this issue Jun 23, 2022 · 6 comments
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Meta label for distributed team >test-failure Triaged test failures from CI

Comments

@DaveCTurner
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/ek3czwqfjdds6/tests/:server:test/org.elasticsearch.gateway.GatewayMetaStatePersistedStateTests/testDataOnlyNodePersistence

Reproduction line:
./gradlew ':server:test' --tests "org.elasticsearch.gateway.GatewayMetaStatePersistedStateTests.testDataOnlyNodePersistence" -Dtests.seed=D933DB43D05D85D5 -Dtests.locale=ro-RO -Dtests.timezone=Asia/Chita -Druntime.java=17

Applicable branches:
master

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.gateway.GatewayMetaStatePersistedStateTests&tests.test=testDataOnlyNodePersistence

Failure excerpt:

java.lang.AssertionError: (No message provided)

  at __randomizedtesting.SeedInfo.seed([D933DB43D05D85D5:6C1D5DA6402AD2C7]:0)
  at org.junit.Assert.fail(Assert.java:86)
  at org.junit.Assert.assertTrue(Assert.java:41)
  at org.junit.Assert.assertTrue(Assert.java:52)
  at org.elasticsearch.gateway.GatewayMetaStatePersistedStateTests.lambda$testDataOnlyNodePersistence$5(GatewayMetaStatePersistedStateTests.java:437)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1098)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1071)
  at org.elasticsearch.gateway.GatewayMetaStatePersistedStateTests.testDataOnlyNodePersistence(GatewayMetaStatePersistedStateTests.java:437)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)

@DaveCTurner DaveCTurner added :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI labels Jun 23, 2022
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Jun 23, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@pxsalehi
Member

I spent some time digging into this, unfortunately without much to show for it.

All of the past three failures of this test are on assertTrue(gateway.allPendingAsyncStatesWritten()) (L437, L509). From what I understood, the first one (L437) should really be asserting on a clean setup (by that I mean no existing or pending state): at that stage no state seems to be loaded from disk, and setting some random coordination metadata triggers this async state write for the first time. Is that the case @DaveCTurner? Do you have any hints or gut feelings about this that I could follow up on?

@DaveCTurner
Contributor Author

Hmm, nothing jumps out at me. We're already doing these assertions in an assertBusy(), which will retry for up to 10s. Are we stuck, or is this just not waiting long enough? In other words, if you give that assertBusy() a longer timeout, does it fix things? My only other idea would be to set logger.org.elasticsearch.gateway: TRACE and see if that gives any further clues.
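
For anyone unfamiliar with the pattern, here is a minimal, self-contained sketch of a retry-until-timeout assertion in the spirit of assertBusy(). It is not the actual ESTestCase implementation: the class name BusyAssert, the backoff values, and the simulated check are all assumptions made for illustration. It shows why a longer timeout only helps when the condition eventually becomes true; a check that is genuinely stuck still fails once the budget runs out.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a retrying assertion helper; not the real ESTestCase code. */
final class BusyAssert {

    private BusyAssert() {}

    /**
     * Runs the check repeatedly until it stops throwing AssertionError or the
     * time budget is used up, sleeping a little longer between attempts.
     */
    static void assertBusy(Runnable check, long maxWait, TimeUnit unit) throws InterruptedException {
        final long deadlineNanos = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1; // assumed starting backoff
        while (true) {
            try {
                check.run();
                return; // the check passed
            } catch (AssertionError e) {
                long remainingNanos = deadlineNanos - System.nanoTime();
                if (remainingNanos <= 0) {
                    throw e; // out of time: surface the last failure
                }
                long remainingMillis = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
                Thread.sleep(Math.min(sleepMillis, remainingMillis));
                sleepMillis = Math.min(sleepMillis * 2, 500); // assumed backoff cap
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger attempts = new AtomicInteger();
        // Simulates an async write that only becomes visible on the third poll.
        assertBusy(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new AssertionError("pending writes not finished yet");
            }
        }, 30, TimeUnit.SECONDS);
        System.out.println("passed after " + attempts.get() + " attempts");
    }
}
```

Running main prints "passed after 3 attempts". With a check that never passes, the helper rethrows the last AssertionError when the deadline expires, which is the "stuck" case that TRACE logging would help diagnose.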

@pxsalehi
Member

pxsalehi commented Jul 12, 2022

I've increased the timeout, and enabled TRACE logging, in case it happens with the new timeout too. (#88477)

@astefan
Contributor

astefan commented Aug 17, 2022

I don't think this will help, but today there was a similar failure (I'd say it's the same from the looks of it) on 8.3: https://gradle-enterprise.elastic.co/s/amiwyjfcfp74o

@pxsalehi
Member

pxsalehi commented Sep 13, 2022

@DaveCTurner It's been two months since the timeout was increased and there haven't been any new failures. The failure mentioned above was on a branch w/o the new timeout. Should we close this, or would you prefer to keep it open?

@pxsalehi pxsalehi closed this as completed Oct 5, 2022