
[CI][ML] MlDistributedFailureIT.testLoseDedicatedMasterNode randomly fails on feature-jindex-master branch #36760

Closed
dimitris-athanasiou opened this issue Dec 18, 2018 · 4 comments
Labels: :ml Machine learning · >test-failure Triaged test failures from CI · v7.0.0-beta1

Comments

@dimitris-athanasiou
Contributor

This test has been observed to fail occasionally on the feature-jindex-master branch. I have not yet managed to reproduce it locally. I will shortly be muting the test, as we need a green build to merge the branch into master. However, I am raising this issue to ensure we get to the bottom of the failure.

Link to failure (one of them): https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/2556/console

Reproduce with:

./gradlew :x-pack:plugin:ml:internalClusterTest -Dtests.seed=D2A618A38265651F -Dtests.class=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT -Dtests.method="testLoseDedicatedMasterNode" -Dtests.security.manager=true -Dtests.locale=be-BY -Dtests.timezone=Asia/Katmandu -Dcompiler.java=11 -Druntime.java=8

Failure:

> Throwable #1: java.lang.AssertionError
   > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$run$15(MlDistributedFailureIT.java:298)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:847)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:821)
   > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.run(MlDistributedFailureIT.java:292)
   > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testLoseDedicatedMasterNode(MlDistributedFailureIT.java:88)
   > 	at java.lang.Thread.run(Thread.java:748)
   > 	Suppressed: java.lang.AssertionError
   > 		at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$run$15(MlDistributedFailureIT.java:298)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	[12 further identical "Suppressed: java.lang.AssertionError" entries omitted]
   > Throwable #2: java.lang.RuntimeException: Had to resort to force-stopping datafeed, something went wrong?
   > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:296)
   > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.cleanupWorkaround(BaseMlIntegTestCase.java:209)
   > 	at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.util.concurrent.ExecutionException: ElasticsearchStatusException[Cannot stop datafeed [data_feed_id] because the datafeed does not have an assigned node. Use force stop to stop the datafeed]
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
   > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:284)
   > 	... 36 more
   > Caused by: ElasticsearchStatusException[Cannot stop datafeed [data_feed_id] because the datafeed does not have an assigned node. Use force stop to stop the datafeed]
   > 	at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:50)
   > 	at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.normalStopDatafeed(TransportStopDatafeedAction.java:147)
   > 	at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.lambda$doExecute$0(TransportStopDatafeedAction.java:130)
   > 	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
   > 	at org.elasticsearch.xpack.ml.datafeed.persistence.DatafeedConfigProvider.lambd
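
The cleanup's fallback, per the error message above, is a force stop. For reference, a sketch of the corresponding stop-datafeed call (assuming the 6.x-era `_xpack` REST path and the `data_feed_id` from the log):

```
POST _xpack/ml/datafeeds/data_feed_id/_stop
{
  "force": true
}
```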
@dimitris-athanasiou added the >test-failure (Triaged test failures from CI), v7.0.0, and :ml (Machine learning) labels on Dec 18, 2018
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@tvernum
Contributor

tvernum commented Dec 19, 2018

This just failed on 6.x as well, with a matching stack trace. It doesn't reproduce locally.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/696/consoleFull

./gradlew :x-pack:plugin:ml:internalClusterTest -Dtests.seed=CD15640088E44B91 -Dtests.class=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT -Dtests.method="testLoseDedicatedMasterNode" -Dtests.security.manager=true -Dtests.locale=de-CH -Dtests.timezone=Poland -Dcompiler.java=11 -Druntime.java=8
01:24:58   1> [2018-12-19T02:23:49,064][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testLoseDedicatedMasterNode] after test
01:24:58 ERROR   24.7s J1 | MlDistributedFailureIT.testLoseDedicatedMasterNode <<< FAILURES!
01:24:58    > Throwable #1: java.lang.AssertionError
01:24:58    > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$run$15(MlDistributedFailureIT.java:298)
01:24:58    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
01:24:58    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:822)
01:24:58    > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.run(MlDistributedFailureIT.java:292)
01:24:58    > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testLoseDedicatedMasterNode(MlDistributedFailureIT.java:88)
01:24:58    > 	at java.lang.Thread.run(Thread.java:748)
01:24:58    > 	Suppressed: java.lang.AssertionError
01:24:58    > 		at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$run$15(MlDistributedFailureIT.java:298)
01:24:58    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
01:24:58    > 		... 40 more
[12 further identical "Suppressed: java.lang.AssertionError" entries omitted]
01:24:58    > Throwable #2: java.lang.RuntimeException: Had to resort to force-stopping datafeed, something went wrong?
01:24:58    > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:295)
01:24:58    > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.cleanupWorkaround(BaseMlIntegTestCase.java:208)
01:24:58    > 	at java.lang.Thread.run(Thread.java:748)
01:24:58    > Caused by: java.util.concurrent.ExecutionException: ElasticsearchStatusException[Cannot stop datafeed [data_feed_id] because the datafeed does not have an assigned node. Use force stop to stop the datafeed]
01:24:58    > 	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:265)
01:24:58    > 	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
01:24:58    > 	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:94)
01:24:58    > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:283)
01:24:58    > 	... 36 more
01:24:58    > Caused by: ElasticsearchStatusException[Cannot stop datafeed [data_feed_id] because the datafeed does not have an assigned node. Use force stop to stop the datafeed]
01:24:58    > 	at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:50)
01:24:58    > 	at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.normalStopDatafeed(TransportStopDatafeedAction.java:153)
01:24:58    > 	at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.lambda$doExecute$0(TransportStopDatafeedAction.java:136)
01:24:58    > 	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
01:24:58    > 	at org.elasticsearch.xpack.ml.datafeed.DatafeedConfigReader.lambda$expandDatafeedIds$1(DatafeedConfigReader.java:85)
01:24:58    > 	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
01:24:58    > 	at org.elasticsearch.xpack.ml.datafeed.persistence.DatafeedConfigProvider.lambda$expandDatafeedIdsWithoutMissingCheck$4(DatafeedConfigProvider.java:422)
01:24:58    > 	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
01:24:58    > 	at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
01:24:58    > 	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85)
01:24:58    > 	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81)
01:24:58    > 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onResponse(AbstractSearchAsyncAction.java:313)
01:24:58    > 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onResponse(AbstractSearchAsyncAction.java:50)
01:24:58    > 	at org.elasticsearch.action.search.FetchSearchPhase$3.run(FetchSearchPhase.java:213)
01:24:58    > 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:160)
01:24:58    > 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:153)
01:24:58    > 	at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:119)
01:24:58    > 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:160)
01:24:58    > 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:153)
01:24:58    > 	at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:206)
01:24:58    > 	at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:104)
01:24:58    > 	at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:110)
01:24:58    > 	at org.elasticsearch.action.search.FetchSearchPhase.access$000(FetchSearchPhase.java:44)
01:24:58    > 	at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:86)
01:24:58    > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759)
01:24:58    > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
01:24:58    > 	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
01:24:58    > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
01:24:58    > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
01:24:58    > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
01:24:58    > 	... 1 more

@dimitris-athanasiou self-assigned this on Dec 19, 2018
@dimitris-athanasiou
Contributor Author

I have made progress digging into this and I now have a theory, though I won't share it until I'm a bit more certain :-) I'm glad it also happened on 6.x, as I couldn't explain why it hadn't. Note that it does reproduce locally if run many times: approximately 1 in 100 runs.
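
For a ~1-in-100 failure, one way to hunt it locally is simply to loop the test. A bash sketch using the Gradle invocation from the reproduce line above, minus the fixed seed so each run randomizes:

```bash
# Loop the flaky test until it fails; at ~1/100 failure rate, 200 runs
# gives a good chance of hitting it. (Sketch; adjust task/flags as needed.)
for i in $(seq 1 200); do
  echo "run $i"
  ./gradlew :x-pack:plugin:ml:internalClusterTest \
    -Dtests.class=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT \
    -Dtests.method="testLoseDedicatedMasterNode" || { echo "failed on run $i"; break; }
done
```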

@dimitris-athanasiou
Contributor Author

I have also muted this in 6.x.
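
(For context, muting in the Elasticsearch test suite is conventionally done with Lucene's `@AwaitsFix` annotation on the test method; a sketch of what the muted test likely looks like — the actual muting commit may differ:)

```java
// Conventional Elasticsearch test muting (sketch; actual commit may differ).
// @AwaitsFix is org.apache.lucene.util.LuceneTestCase.AwaitsFix.
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/36760")
public void testLoseDedicatedMasterNode() throws Exception {
    // ... existing test body unchanged ...
}
```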

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Dec 19, 2018
... MlDistributedFailureIT.testLoseDedicatedMasterNode.

An intermittent failure has been observed in
`MlDistributedFailureIT.testLoseDedicatedMasterNode`.
The test launches a cluster comprising a dedicated master node
and a combined data and ML node. It creates a job and a datafeed and starts them.
It then shuts down and restarts the master node. Finally, the test asserts
that the two tasks have been reassigned within 10s.

The intermittent failure occurs when the assertion that the tasks have been
reassigned fails. Investigating the failure revealed that the `assertBusy`
that performs the assertion times out. Furthermore, it appears that the
job task is not reassigned because the memory tracking info is stale.

Memory tracking info is refreshed asynchronously when a job reassignment
is attempted. Reassignment is attempted either in response to a relevant
cluster state change or periodically. The periodic interval is controlled by the cluster
setting `cluster.persistent_tasks.allocation.recheck_interval`, which defaults to 30s.

What seems to be happening in this test is that if all cluster state changes after the
master node is restarted come through before the async memory info refresh completes,
then it may take up to 30s before reassignment of the job is attempted, so the `assertBusy`
times out.

This commit changes the test to set `cluster.persistent_tasks.allocation.recheck_interval`
to 1s. If the above theory is correct, this should eradicate those failures.

Closes elastic#36760
dimitris-athanasiou added a commit that referenced this issue Dec 20, 2018
…36845)

... MlDistributedFailureIT.testLoseDedicatedMasterNode.

[Commit message identical to the one above, except for the final paragraph:]

This commit changes the test to reduce the periodic check that reassigns persistent
tasks to `200ms`. If the above theory is correct, this should eradicate those failures.

Closes #36760
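
To make the fix concrete, a minimal sketch (assumed wiring; the actual commit may apply the setting differently, e.g. via a dynamic cluster settings update) of lowering the recheck interval inside the internal-cluster test class:

```java
import org.elasticsearch.common.settings.Settings;

// Inside MlDistributedFailureIT (sketch; assumed wiring):
@Override
protected Settings nodeSettings(int nodeOrdinal) {
    return Settings.builder()
        .put(super.nodeSettings(nodeOrdinal))
        // Re-check persistent task assignments every 200ms instead of the 30s
        // default, so a reassignment skipped due to stale memory info is retried
        // well within the test's 10s assertBusy window.
        .put("cluster.persistent_tasks.allocation.recheck_interval", "200ms")
        .build();
}
```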
dimitris-athanasiou added a commit that referenced this issue Dec 20, 2018
…36845)

[identical commit message omitted]

dimitris-athanasiou added a commit that referenced this issue Dec 20, 2018
…36845)

[identical commit message omitted]