
[ML] Reduce persistent tasks periodic reassignment interval in ... #36845

Conversation

dimitris-athanasiou
Contributor

... MlDistributedFailureIT.testLoseDedicatedMasterNode.

An intermittent failure has been observed in
`MlDistributedFailureIT.testLoseDedicatedMasterNode`.
The test launches a cluster comprising a dedicated master node
and a combined data and ML node. It creates a job and a datafeed and starts them.
It then shuts down and restarts the master node. Finally, the test asserts
that the two tasks have been reassigned within 10s.
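
For illustration, here is a minimal sketch (not the test's literal code) of the kind of assertion involved: poll the cluster state until every persistent task reports an assignment, failing if that does not happen within `assertBusy`'s default 10-second wait. The exact helpers and task lookups in the real test differ.

```java
import org.elasticsearch.persistent.PersistentTasksCustomMetaData;
import org.elasticsearch.persistent.PersistentTasksCustomMetaData.PersistentTask;

// Inside an internal cluster test method: wait up to 10s (assertBusy's default)
// for all persistent tasks (job and datafeed) to be assigned to a node again.
assertBusy(() -> {
    PersistentTasksCustomMetaData tasks = client().admin().cluster().prepareState().get()
        .getState().metaData().custom(PersistentTasksCustomMetaData.TYPE);
    assertNotNull(tasks);
    for (PersistentTask<?> task : tasks.tasks()) {
        assertTrue("task " + task.getId() + " is not assigned", task.isAssigned());
    }
});
```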

The intermittent failure occurs when the assertions that the tasks have been
reassigned fail. Investigating the failure revealed that the `assertBusy`
that performs those assertions times out. Furthermore, it appears that the
job task is not reassigned because the memory tracking info is stale.

Memory tracking info is refreshed asynchronously whenever an attempt is made
to reassign a job. Task reassignment is attempted either in response to a relevant
cluster state change or periodically. The periodic interval is controlled by a cluster
setting called `cluster.persistent_tasks.allocation.recheck_interval`, which defaults to 30s.
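
As a hedged sketch only (the exact call site in the test differs), the interval could be lowered on a running cluster through the dynamic cluster settings API, assuming the setting's validation accepts the value:

```java
import org.elasticsearch.common.settings.Settings;

// Lower the periodic persistent-task reassignment recheck from its 30s default.
// Note: a value this low is only accepted once the setting's minimum allows it
// (see the discussion of the minimum valid value further down).
client().admin().cluster().prepareUpdateSettings()
    .setPersistentSettings(Settings.builder()
        .put("cluster.persistent_tasks.allocation.recheck_interval", "1s"))
    .get();
```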

What seems to be happening in this test is that, if all cluster state changes after the
master node is restarted come through before the async memory info refresh completes,
the job might wait up to 30s before a reassignment is attempted, and so the `assertBusy`
times out.

This commit changes the test to set `cluster.persistent_tasks.allocation.recheck_interval`
to 1s. If the above theory is correct, this should eradicate those failures.

Closes #36760

dimitris-athanasiou added the >test-failure (Triaged test failures from CI), :ml (Machine learning), v7.0.0, v6.6.0, and v6.7.0 labels on Dec 19, 2018
@elasticmachine
Collaborator

Pinging @elastic/ml-core

Contributor

droberts195 left a comment


LGTM

@bleskes
Contributor

bleskes commented Dec 19, 2018

I would really like to avoid changing production defaults due to test concerns. Can't we set the interval in the tests?

@dimitris-athanasiou
Contributor Author

dimitris-athanasiou commented Dec 19, 2018

@bleskes This isn't changing the default value of the setting. It reduces the minimum valid value of the setting to 1s. We could also increase the timeout of the `assertBusy` to over 10 seconds. However, I don't see any downside to allowing the setting to go as low as 1s if necessary, and it lets us avoid slow tests.
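
For context, the shape of such a change is roughly the following. This is an illustrative sketch only; the real declaration in `PersistentTasksClusterService` may differ in its exact name, arguments, and properties:

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.TimeValue;

// A dynamic time setting with a 30s default whose minimum valid value is 1s.
// Lowering only the minimum leaves the production default untouched.
public static final Setting<TimeValue> RECHECK_INTERVAL_SETTING =
    Setting.timeSetting("cluster.persistent_tasks.allocation.recheck_interval",
        TimeValue.timeValueSeconds(30), TimeValue.timeValueSeconds(1),
        Setting.Property.Dynamic, Setting.Property.NodeScope);
```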

@droberts195
Contributor

Since this is an internal cluster test, we have access to the server objects in the same JVM, so I changed the approach to set the interval directly on the appropriate `PersistentTasksClusterService` object rather than using settings. I reverted the change to the minimum setting value.
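
A rough sketch of that approach, assuming for illustration a setter named `setRecheckInterval` on `PersistentTasksClusterService` (the method name is an assumption, not taken from the source); the merged commit uses a 200ms interval:

```java
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.persistent.PersistentTasksClusterService;

// Internal cluster tests run the server objects in the same JVM, so the test can
// fetch the PersistentTasksClusterService on the current master node and shorten
// its recheck interval directly, without going through setting validation.
PersistentTasksClusterService persistentTasksClusterService = internalCluster().getInstance(
    PersistentTasksClusterService.class, internalCluster().getMasterName());
persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(200));
```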

dimitris-athanasiou merged commit 08bcd83 into elastic:master on Dec 20, 2018
dimitris-athanasiou deleted the fix-lose-dedicated-master-node-test branch on December 20, 2018 at 12:53
dimitris-athanasiou added a commit that referenced this pull request Dec 20, 2018
…36845)

... MlDistributedFailureIT.testLoseDedicatedMasterNode.

An intermittent failure has been observed in
`MlDistributedFailureIT.testLoseDedicatedMasterNode`.
The test launches a cluster comprising a dedicated master node
and a combined data and ML node. It creates a job and a datafeed and starts them.
It then shuts down and restarts the master node. Finally, the test asserts
that the two tasks have been reassigned within 10s.

The intermittent failure occurs when the assertions that the tasks have been
reassigned fail. Investigating the failure revealed that the `assertBusy`
that performs those assertions times out. Furthermore, it appears that the
job task is not reassigned because the memory tracking info is stale.

Memory tracking info is refreshed asynchronously whenever an attempt is made
to reassign a job. Task reassignment is attempted either in response to a relevant
cluster state change or periodically. The periodic interval is controlled by a cluster
setting called `cluster.persistent_tasks.allocation.recheck_interval`, which defaults to 30s.

What seems to be happening in this test is that, if all cluster state changes after the
master node is restarted come through before the async memory info refresh completes,
the job might wait up to 30s before a reassignment is attempted, and so the `assertBusy`
times out.

This commit changes the test to reduce the interval of the periodic check that reassigns
persistent tasks to `200ms`. If the above theory is correct, this should eradicate those failures.

Closes #36760
dimitris-athanasiou added a commit that referenced this pull request Dec 20, 2018
…36845)

jimczi added the v7.0.0-beta1 label and removed the v7.0.0 label on Feb 7, 2019