Ensure no Watches are running after Watcher is stopped. #43888

Closed
wants to merge 7 commits

Conversation

jakelandis
Contributor

@jakelandis jakelandis commented Jul 2, 2019

Watcher keeps track of which watches are currently running, keyed by watch name/id.
If a watch is currently running, Watcher will not run the same watch again; the attempt
results in the message "Watch is already queued in thread pool" and the state "not_executed_already_queued".

When Watcher is stopped, it rejects any new watch executions but allows the currently
running watches to run to completion. Waiting for the currently running watches to
complete is done asynchronously to the stopping of Watcher. This means Watcher will
report as fully stopped while a background thread is still waiting for all of the
watches to finish before it removes them from its list of currently running watches.

The integration tests start and stop Watcher between each test, with the goal of ensuring
a clean state between tests. However, since Watcher can report "yes, I am stopped" while
there are still running watches, the tests may bleed over into each other, especially on
slow machines. This can result in errors related to "Watch is already queued in thread pool"
and the state "not_executed_already_queued", and is very difficult to reproduce. It may
also change the most recent Watcher history document in an unpredictable way.

This commit changes the wait for running watches on stop/pause from an async wait back to a
sync wait, as it worked prior to #30118. This helps ensure that the stop is much more
predictable for the testing scenario, such that after Watcher is fully stopped, no watches
are running. It should have little impact, if any, on production code, since Watcher isn't
stopped/paused very often and the behavior on stop/pause is the same; the wait just runs on
the calling thread instead of a generic thread.
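The difference can be illustrated with a small sketch (simplified, hypothetical names; this is not the actual ExecutionService code). The only thing that changes is which thread performs the wait for in-flight executions:

```java
import java.util.concurrent.ExecutorService;

// Sketch only; the names are stand-ins for Watcher's internals.
interface InFlightExecutions {
    // Block until every currently running watch has completed,
    // rejecting any new executions from this point on.
    void sealAndAwaitCompletion();
}

class StopWatcherSketch {
    // Behavior prior to this change: the wait is forked, so stop() returns
    // while watches may still be running in the background.
    static void stopAsync(ExecutorService genericExecutor, InFlightExecutions executions) {
        genericExecutor.execute(executions::sealAndAwaitCompletion);
    }

    // Behavior restored by this change: the wait happens on the calling thread,
    // so once stop() returns no watches are still running.
    static void stopSync(InFlightExecutions executions) {
        executions.sealAndAwaitCompletion();
    }
}
```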

Related: #42409

@jakelandis jakelandis added :Data Management/Watcher >test Issues or PRs that are addressing/adding tests v7.3.0 v8.0.0 labels Jul 2, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@jakelandis jakelandis marked this pull request as ready for review July 2, 2019 17:04
@jakelandis jakelandis requested a review from spinscale July 2, 2019 17:06
Member

@martijnvg martijnvg left a comment


I left a question, but this looks good otherwise.

@@ -106,7 +105,7 @@
     private final WatchExecutor executor;
     private final ExecutorService genericExecutor;

-    private AtomicReference<CurrentExecutions> currentExecutions = new AtomicReference<>();
+    private CurrentExecutions currentExecutions;
Member


Can CurrentExecutions remain inside an AtomicReference?
It is read elsewhere without acquiring a lock, and as far as I understand this change is about making sure that clearExecutions() happens in a synchronous manner, which should be possible while keeping the AtomicReference?

(also the currentExecutions field can then be made final)
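A sketch of that suggestion (simplified, hypothetical stand-ins for the real classes): keep the field final and swap the value atomically, so readers that do not hold a lock always see a fully constructed instance, never null.

```java
import java.util.concurrent.atomic.AtomicReference;

class ExecutionServiceSketch {
    static final class CurrentExecutions { /* tracks in-flight watch executions */ }

    // Final field; only the referenced value is swapped.
    private final AtomicReference<CurrentExecutions> currentExecutions =
            new AtomicReference<>(new CurrentExecutions());

    CurrentExecutions current() {
        return currentExecutions.get();
    }

    void clearExecutions() {
        // Replacing the bookkeeping is a single atomic step; concurrent readers
        // observe either the old instance or the new one.
        currentExecutions.set(new CurrentExecutions());
    }
}
```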

Contributor


Yes, I think this either needs to be volatile or go back to being an AtomicReference.

Contributor Author


Thanks, nice catch. I had also messed up the order of "sealing" the concurrent executions. The change here is now a single line that removes the fork.

@jakelandis
Contributor Author

@elasticmachine update branch

elasticmachine and others added 5 commits July 2, 2019 15:02
This reverts commit 9d18274.
@jakelandis
Contributor Author

test failures appear relevant... looking into it.

@jakelandis
Contributor Author

It appears that this can result in holding up the cluster state applier thread for too long. Closing this PR; I will open a new one that takes into account the concurrent executions, in addition to the stopped state returned by watcher stats, to block until Watcher is fully stopped.
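A rough sketch of that follow-up idea (hypothetical helper, not the actual test code), assuming the per-node watcher stats expose a watcher_state value and a current_watches list when the current_watches metric is requested: only treat Watcher as fully stopped once the state is "stopped" and no watch is still executing.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class AwaitWatcherStopped {
    static void awaitFullyStopped(Supplier<Map<String, Object>> nodeStats, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            Map<String, Object> stats = nodeStats.get();
            boolean stopped = "stopped".equals(stats.get("watcher_state"));
            List<?> current = (List<?>) stats.getOrDefault("current_watches", List.of());
            if (stopped && current.isEmpty()) {
                return; // state reports stopped AND nothing is still running
            }
            Thread.sleep(100); // poll until the deadline
        }
        throw new AssertionError("Watcher did not fully stop within the timeout");
    }
}
```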
