
[ML] Reduce persistent tasks periodic reassignment interval in ... #36845

Conversation

dimitris-athanasiou
Contributor

... MlDistributedFailureIT.testLoseDedicatedMasterNode.

An intermittent failure has been observed in
`MlDistributedFailureIT.testLoseDedicatedMasterNode`.
The test launches a cluster comprising a dedicated master node
and a combined data and ML node. It creates a job and a datafeed and starts them.
It then shuts down and restarts the master node. Finally, the test asserts
that the two tasks have been reassigned within 10s.
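
For illustration, here is a minimal sketch (not the test's literal code) of the kind of assertion involved: poll the cluster state until every persistent task reports an assignment, failing if that does not happen within `assertBusy`'s default 10-second wait. The exact helpers and task lookups in the real test differ.

```java
import org.elasticsearch.persistent.PersistentTasksCustomMetaData;
import org.elasticsearch.persistent.PersistentTasksCustomMetaData.PersistentTask;

// Inside an internal cluster test method: wait up to 10s (assertBusy's default)
// for all persistent tasks (job and datafeed) to be assigned to a node again.
assertBusy(() -> {
    PersistentTasksCustomMetaData tasks = client().admin().cluster().prepareState().get()
        .getState().metaData().custom(PersistentTasksCustomMetaData.TYPE);
    assertNotNull(tasks);
    for (PersistentTask<?> task : tasks.tasks()) {
        assertTrue("task " + task.getId() + " is not assigned", task.isAssigned());
    }
});
```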

The intermittent failure occurs when the assertions that the tasks have been
reassigned fail. Investigating the failure revealed that the `assertBusy`
that performs those assertions times out. Furthermore, it appears that the
job task is not reassigned because the memory tracking info is stale.

Memory tracking info is refreshed asynchronously whenever an attempt is made
to reassign a job. Task reassignment is attempted either in response to a relevant
cluster state change or periodically. The periodic interval is controlled by a cluster
setting called `cluster.persistent_tasks.allocation.recheck_interval`, which defaults to 30s.
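
As a hedged sketch only (the exact call site in the test differs), the interval could be lowered on a running cluster through the dynamic cluster settings API, assuming the setting's validation accepts the value:

```java
import org.elasticsearch.common.settings.Settings;

// Lower the periodic persistent-task reassignment recheck from its 30s default.
// Note: a value this low is only accepted once the setting's minimum allows it
// (see the discussion of the minimum valid value further down).
client().admin().cluster().prepareUpdateSettings()
    .setPersistentSettings(Settings.builder()
        .put("cluster.persistent_tasks.allocation.recheck_interval", "1s"))
    .get();
```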

What seems to be happening in this test is that, if all cluster state changes after the
master node is restarted come through before the async memory info refresh completes,
the job might wait up to 30s before a reassignment is attempted, and so the `assertBusy`
times out.

This commit changes the test to set `cluster.persistent_tasks.allocation.recheck_interval`
to 1s. If the above theory is correct, this should eradicate those failures.

Closes #36760

dimitris-athanasiou added the >test-failure (Triaged test failures from CI), :ml (Machine learning), v7.0.0, v6.6.0, and v6.7.0 labels on Dec 19, 2018
@elasticmachine
Collaborator

Pinging @elastic/ml-core

Contributor

droberts195 left a comment


LGTM

@bleskes
Contributor

bleskes commented Dec 19, 2018

I would really like to avoid changing production defaults due to test concerns. Can't we set the interval in the tests?

@dimitris-athanasiou
Contributor Author

dimitris-athanasiou commented Dec 19, 2018

@bleskes This isn't changing the default value of the setting. It reduces the minimum valid value of the setting to 1s. We could also increase the timeout of the `assertBusy` to over 10 seconds. However, I don't see any downside to allowing the setting to go as low as 1s if necessary, and it lets us avoid slow tests.
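
For context, the shape of such a change is roughly the following. This is an illustrative sketch only; the real declaration in `PersistentTasksClusterService` may differ in its exact name, arguments, and properties:

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.TimeValue;

// A dynamic time setting with a 30s default whose minimum valid value is 1s.
// Lowering only the minimum leaves the production default untouched.
public static final Setting<TimeValue> RECHECK_INTERVAL_SETTING =
    Setting.timeSetting("cluster.persistent_tasks.allocation.recheck_interval",
        TimeValue.timeValueSeconds(30), TimeValue.timeValueSeconds(1),
        Setting.Property.Dynamic, Setting.Property.NodeScope);
```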

@droberts195
Contributor

Since this is an internal cluster test, we have access to the server objects in the same JVM, so I changed the approach to set the interval directly on the appropriate `PersistentTasksClusterService` object rather than using settings. I reverted the change to the minimum setting value.
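
A rough sketch of that approach, assuming for illustration a setter named `setRecheckInterval` on `PersistentTasksClusterService` (the method name is an assumption, not taken from the source); the merged commit uses a 200ms interval:

```java
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.persistent.PersistentTasksClusterService;

// Internal cluster tests run the server objects in the same JVM, so the test can
// fetch the PersistentTasksClusterService on the current master node and shorten
// its recheck interval directly, without going through setting validation.
PersistentTasksClusterService persistentTasksClusterService = internalCluster().getInstance(
    PersistentTasksClusterService.class, internalCluster().getMasterName());
persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(200));
```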

dimitris-athanasiou merged commit 08bcd83 into elastic:master on Dec 20, 2018
dimitris-athanasiou deleted the fix-lose-dedicated-master-node-test branch on December 20, 2018 at 12:53
dimitris-athanasiou added a commit that referenced this pull request Dec 20, 2018
…36845)

... MlDistributedFailureIT.testLoseDedicatedMasterNode.

An intermittent failure has been observed in
`MlDistributedFailureIT.testLoseDedicatedMasterNode`.
The test launches a cluster comprising a dedicated master node
and a combined data and ML node. It creates a job and a datafeed and starts them.
It then shuts down and restarts the master node. Finally, the test asserts
that the two tasks have been reassigned within 10s.

The intermittent failure occurs when the assertions that the tasks have been
reassigned fail. Investigating the failure revealed that the `assertBusy`
that performs those assertions times out. Furthermore, it appears that the
job task is not reassigned because the memory tracking info is stale.

Memory tracking info is refreshed asynchronously whenever an attempt is made
to reassign a job. Task reassignment is attempted either in response to a relevant
cluster state change or periodically. The periodic interval is controlled by a cluster
setting called `cluster.persistent_tasks.allocation.recheck_interval`, which defaults to 30s.

What seems to be happening in this test is that, if all cluster state changes after the
master node is restarted come through before the async memory info refresh completes,
the job might wait up to 30s before a reassignment is attempted, and so the `assertBusy`
times out.

This commit changes the test to reduce the interval of the periodic check that reassigns
persistent tasks to `200ms`. If the above theory is correct, this should eradicate those failures.

Closes #36760
dimitris-athanasiou added a commit that referenced this pull request Dec 20, 2018
…36845)

jimczi added the v7.0.0-beta1 label and removed the v7.0.0 label on Feb 7, 2019