
[CI] SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress failure on master #46508

Closed

dakrone opened this issue Sep 9, 2019 · 7 comments · Fixed by #46573, #47405 or #47488

Labels: :Data Management/ILM+SLM Index and Snapshot lifecycle management, >test-failure Triaged test failures from CI


dakrone commented Sep 9, 2019

From https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/583/console & https://gradle-enterprise.elastic.co/s/6x67ha6426acy/console-log

  2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:ilm:test' --tests "org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress" -Dtests.seed=7BA427BA999CD99D -Dtests.security.manager=true -Dtests.locale=fr-GP -Dtests.timezone=America/Edmonton -Dcompiler.java=12 -Druntime.java=11
  2> java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([7BA427BA999CD99D:67E39A043E8F736]:0)
        at org.junit.Assert.fail(Assert.java:86)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert.assertNotNull(Assert.java:712)
        at org.junit.Assert.assertNotNull(Assert.java:722)
        at org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.lambda$testRetentionWhileSnapshotInProgress$2(SLMSnapshotBlockingIntegTests.java:153)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:866)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:840)
        at org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress(SLMSnapshotBlockingIntegTests.java:146)

Likely from this exception when trying to kick off the second snapshot:

  1> [2019-09-09T13:42:03,563][INFO ][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] --> waiting for snapshot snap-qdwsdayhtfuymbsj7vi2yw to be completed, got: STARTED
  1> [2019-09-09T13:42:03,821][INFO ][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] --> waiting for snapshot snap-qdwsdayhtfuymbsj7vi2yw to be completed, got: SUCCESS
  1> [2019-09-09T13:42:03,821][INFO ][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] --> blocking nodes from completing snapshot
  1> [2019-09-09T13:42:03,822][INFO ][o.e.x.s.SnapshotLifecycleTask] [node_s0] snapshot lifecycle policy [slm-policy] issuing create snapshot [snap-frash4insd-kptw8sm1rew]
  1> [2019-09-09T13:42:03,824][INFO ][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] --> checking for in progress snapshot...
  1> [2019-09-09T13:42:03,826][INFO ][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] --> checking for in progress snapshot...
  1> [2019-09-09T13:42:03,828][WARN ][o.e.s.SnapshotsService   ] [node_s0] [slm-repo][snap-frash4insd-kptw8sm1rew] failed to create snapshot
  1> org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [slm-repo:snap-frash4insd-kptw8sm1rew]  a snapshot is already running
  1> 	at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:301) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:834) [?:?]

My hunch is that the first snapshot has a "SUCCESS" status, but is still present in the cluster state. We should ensure it's no longer present in the cluster state before issuing the second execute policy request.
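
A minimal sketch of the kind of wait this suggests, using the usual `ESIntegTestCase`/`assertBusy` helpers (illustrative only, not the exact change from the linked PRs): block until no snapshot remains in the cluster state before executing the policy again.

```java
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.SnapshotsInProgress;
import org.elasticsearch.test.ESIntegTestCase;

public class WaitForSnapshotsClearedSketch extends ESIntegTestCase {

    // Block until the cluster state no longer contains any in-progress snapshot,
    // so the next execute-policy request cannot hit ConcurrentSnapshotExecutionException.
    private void waitForNoRunningSnapshots() throws Exception {
        assertBusy(() -> {
            ClusterState state = client().admin().cluster().prepareState().get().getState();
            SnapshotsInProgress snapshots = state.custom(SnapshotsInProgress.TYPE);
            assertTrue("expected no snapshots left in the cluster state",
                snapshots == null || snapshots.entries().isEmpty());
        });
    }
}
```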

@dakrone dakrone added >test-failure Triaged test failures from CI :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Sep 9, 2019
@dakrone dakrone self-assigned this Sep 9, 2019
@elasticmachine

Pinging @elastic/es-core-features

alpar-t added a commit that referenced this issue Sep 10, 2019
dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 10, 2019
This commit adds a wait/check for all running snapshots to be cleared
before taking another snapshot. The previous snapshot was successful but
had not yet been cleared from the cluster state, so the second snapshot
failed due to a `ConcurrentSnapshotException`.

Resolves elastic#46508
dakrone added a commit that referenced this issue Sep 11, 2019
This commit adds a wait/check for all running snapshots to be cleared
before taking another snapshot. The previous snapshot was successful but
had not yet been cleared from the cluster state, so the second snapshot
failed due to a `ConcurrentSnapshotException`.

Resolves #46508
dakrone added a commit that referenced this issue Sep 11, 2019
This commit adds a wait/check for all running snapshots to be cleared
before taking another snapshot. The previous snapshot was successful but
had not yet been cleared from the cluster state, so the second snapshot
failed due to a `ConcurrentSnapshotException`.

Resolves #46508

dakrone commented Sep 12, 2019

This is no longer failing with the previous exception, but has been failing with the following errors:

  1> [2019-09-12T21:22:19,569][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_s0] [.slm-history-1-000001] creating index, cause [api], templates [random_index_template], shards [9]/[0], mappings []
  1> [2019-09-12T21:22:19,629][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] adding template [.slm-history] for index patterns [.slm-history-1*]
  1> [2019-09-12T21:22:19,658][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] removing template [random_index_template]
  1> [2019-09-12T21:22:19,721][INFO ][o.e.r.RepositoriesService] [node_s0] delete repository [slm-repo]
  1> [2019-09-12T21:22:19,902][INFO ][o.e.c.r.a.AllocationService] [node_s0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.slm-history-1-000001][8]]]).
  1> [2019-09-12T21:22:19,931][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] [.slm-history-1-000001/qzAUgC5zRhWfDfscNWGLpQ] create_mapping [_doc]
  1> [2019-09-12T21:22:24,738][INFO ][o.e.x.s.SLMSnapshotBlockingIntegTests] [testRetentionWhileSnapshotInProgress] after test
  2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:ilm:test' --tests "org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress" -Dtests.seed=750626CC760E9398 -Dtests.security.manager=true -Dtests.locale=sbp-TZ -Dtests.timezone=Africa/Malabo -Dcompiler.java=12 -Druntime.java=11

  2> java.lang.AssertionError: Shard [.slm-history-1-000001][3] is still locked after 5 sec waiting

And

org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests > testRetentionWhileSnapshotInProgress FAILED
    ConcurrentSnapshotExecutionException[[slm-repo:snap-aslsnwpbthkrgysq9ptn7w/H1qm1VwyTmSE_bhTsw4DlQ] cannot delete - another snapshot is currently being deleted]
REPRODUCE WITH: ./gradlew ':x-pack:plugin:ilm:test' --tests "org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress" -Dtests.seed=3A63B02CFA471059 -Dtests.security.manager=true -Dbuild.snapshot=false -Dtests.jvm.argline="-Dbuild.snapshot=false" -Dtests.locale=es-PA -Dtests.timezone=America/Indiana/Winamac -Dcompiler.java=12 -Druntime.java=11

However, these failures don't seem to have stack traces attached. I'm re-opening this and will continue investigating (I have been unable to reproduce either of these locally).

@dakrone dakrone reopened this Sep 12, 2019

gwbrown commented Sep 12, 2019

There are stack traces available via build stats and the Gradle build scans:

java.lang.AssertionError: Shard [.slm-history-1-000001][3] is still locked after 5 sec waiting
	at __randomizedtesting.SeedInfo.seed([750626CC760E9398:8DC38D6AC7ABD33]:0)
	at org.junit.Assert.fail(Assert.java:88)
	at org.elasticsearch.test.InternalTestCluster.assertAfterTest(InternalTestCluster.java:2288)
	at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:553)
	at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:1971)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:834)

and

ConcurrentSnapshotExecutionException[[slm-repo:snap-aslsnwpbthkrgysq9ptn7w/H1qm1VwyTmSE_bhTsw4DlQ] cannot delete - another snapshot is currently being deleted]
	at __randomizedtesting.SeedInfo.seed([3A63B02CFA471059:47B9AE3620333EF2]:0)
	at org.elasticsearch.snapshots.SnapshotsService$7.execute(SnapshotsService.java:1169)
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

Neither of the stack traces looks particularly useful. Interestingly, ActivateWatchTests.testDeactivateAndActivate failed a few times with the "shard is still locked" error for the .watcher-history index before it was muted, and prior to that, several unrelated tests (security, non-SLM snapshots, a couple of others) failed with the same exception. I don't see any issues opened for those failures, though.

@original-brownbear

Just in case it helps, this line reproduces it for me locally 100% of the time on master:

./gradlew ':x-pack:plugin:ilm:test' --tests "org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests" -Dtests.seed=587F913DE32BF2B6 -Dtests.security.manager=true -Dtests.locale=fr-CH -Dtests.timezone=America/Anchorage -Dcompiler.java=12 -Druntime.java=11


matriv commented Sep 20, 2019

One more failure in 7.x: https://gradle-enterprise.elastic.co/s/7yo2x4zsaetea/console-log?task=:x-pack:plugin:ilm:test

Doesn't reproduce for me locally with:

./gradlew ':x-pack:plugin:ilm:test' --tests "org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress" \
  -Dtests.seed=6C7257CBDC1564D4 \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-TG \
  -Dtests.timezone=EST \
  -Dcompiler.java=12 \
  -Druntime.java=13


dakrone commented Sep 23, 2019

I got some more failures with the increased logging level, so I've muted this test again until I can investigate and fix it.

dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 30, 2019
This commit adds a check to the `SnapshotHistoryStore.putAsync` method
that checks to see whether SLM is enabled prior to indexing the history
document.

Resolves failures from elastic#46508 that were caused by a retention job being
kicked off, the test finishing and deleting all indices (including the
`.slm-history-*` index), and then retention completing and indexing
another history document. This caused the `.slm-history-*` index to be
re-created and the test to fail due to the shard lock check in
`InternalTestCluster.assertAfterTest` that checks all shards being
unlocked.
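
A minimal sketch of the guard that commit describes, with hypothetical names (the real check lives in `SnapshotHistoryStore.putAsync`):

```java
import org.elasticsearch.common.settings.Settings;

public class SlmHistoryGuardSketch {

    // Hypothetical setting key standing in for the real "is SLM history enabled" check.
    static final String SLM_HISTORY_ENABLED_KEY = "slm.history_index_enabled";

    // Only index an SLM history document when history recording is enabled, so a
    // late retention run cannot recreate the .slm-history-* index after the
    // internal test cluster has already deleted it.
    static boolean shouldRecordHistory(Settings settings) {
        return settings.getAsBoolean(SLM_HISTORY_ENABLED_KEY, true);
    }
}
```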
dakrone added a commit to dakrone/elasticsearch that referenced this issue Oct 1, 2019
This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.

This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.

This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.

Resolves elastic#46508
Relates to elastic#43663
dakrone added a commit that referenced this issue Oct 2, 2019
* Add API to execute SLM retention on-demand

This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.

This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.

This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.

Resolves #46508
Relates to #43663
dakrone added a commit that referenced this issue Oct 2, 2019
* Add API to execute SLM retention on-demand (#47405)

This is a backport of #47405

This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.

This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.

This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.

Resolves #46508
Relates to #43663
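
A hedged sketch of invoking the new endpoint with the low-level REST client; the host, port, and class name are illustrative, and only the `/_slm/_execute_retention` path comes from the commits above.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ExecuteSlmRetentionExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Kick off SLM retention on demand; the call returns immediately and
            // retention runs asynchronously on the cluster.
            Response response = client.performRequest(new Request("POST", "/_slm/_execute_retention"));
            System.out.println(response.getStatusLine());
        }
    }
}
```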

original-brownbear commented Oct 3, 2019

There's still a possible failure here, but it's a core snapshot bug/inconsistency. Reopening this; I'll close it via a PR shortly.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 3, 2019
If we fail to read the global metadata in a snapshot
we would throw `SnapshotMissingException` but wouldn't
do so for the index metadata.
This is breaking SLM tests at a low rate because they
use `SnapshotMissingException` thrown from snapshot status APIs
to wait for a snapshot being gone.
Also, we should be consistent here in general and not leak the
`NoSuchFileException` to the transport layer for index meta.

Closes elastic#46508
original-brownbear added a commit that referenced this issue Oct 3, 2019
If we fail to read the global metadata in a snapshot
we would throw `SnapshotMissingException` but wouldn't
do so for the index metadata.
This is breaking SLM tests at a low rate because they
use `SnapshotMissingException` thrown from snapshot status APIs
to wait for a snapshot being gone.
Also, we should be consistent here in general and not leak the
`NoSuchFileException` to the transport layer for index meta.

Closes #46508
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 3, 2019
If we fail to read the global metadata in a snapshot
we would throw `SnapshotMissingException` but wouldn't
do so for the index metadata.
This is breaking SLM tests at a low rate because they
use `SnapshotMissingException` thrown from snapshot status APIs
to wait for a snapshot being gone.
Also, we should be consistent here in general and not leak the
`NoSuchFileException` to the transport layer for index meta.

Closes elastic#46508
original-brownbear added a commit that referenced this issue Oct 3, 2019
If we fail to read the global metadata in a snapshot
we would throw `SnapshotMissingException` but wouldn't
do so for the index metadata.
This is breaking SLM tests at a low rate because they
use `SnapshotMissingException` thrown from snapshot status APIs
to wait for a snapshot being gone.
Also, we should be consistent here in general and not leak the
`NoSuchFileException` to the transport layer for index meta.

Closes #46508
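
For reference, a minimal sketch (repository and helper names are illustrative) of the test-side wait pattern these commits describe: poll the snapshot status API until it throws `SnapshotMissingException`, which only works reliably if the repository throws that exception for missing index metadata as well as missing global metadata.

```java
import org.elasticsearch.snapshots.SnapshotMissingException;
import org.elasticsearch.test.ESIntegTestCase;

public class WaitForSnapshotGoneSketch extends ESIntegTestCase {

    // Wait until the snapshot status API reports the snapshot as missing,
    // i.e. until the snapshot has been fully deleted from the repository.
    private void waitForSnapshotGone(String repository, String snapshot) throws Exception {
        assertBusy(() -> expectThrows(SnapshotMissingException.class, () ->
            client().admin().cluster()
                .prepareSnapshotStatus(repository)
                .setSnapshots(snapshot)
                .get()));
    }
}
```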