Ensure no Watches are running after Watcher is stopped. #43888

Closed
wants to merge 7 commits

Conversation

jakelandis
Contributor

@jakelandis jakelandis commented Jul 2, 2019

Watcher keeps track of which watches are currently running, keyed by watch name/id.
If a watch is currently running, Watcher will not run the same watch again; the attempt
results in the message "Watch is already queued in thread pool" and the state "not_executed_already_queued".

When Watcher is stopped, it rejects any new watch executions but allows the currently
running watches to run to completion. Waiting for the currently running watches to
complete is done asynchronously to the stopping of Watcher. This means Watcher will
report as fully stopped while a background thread is still waiting for all of the
watches to finish before it removes them from its list of currently running watches.

The integration tests start and stop Watcher between each test, with the goal of ensuring
a clean state between tests. However, since Watcher can report "yes, I am stopped" while
there are still running watches, the tests may bleed over into each other, especially on
slow machines. This can result in errors related to "Watch is already queued in thread pool"
and the state "not_executed_already_queued", and is very difficult to reproduce. It may
also change the most recent Watcher history document in an unpredictable way.

This commit changes the wait for running watches on stop/pause from an async wait back to a
sync wait, as it worked prior to #30118. This helps ensure that the stop is much more
predictable for the testing scenario, such that after Watcher is fully stopped, no watches
are running. It should have little impact, if any, on production code, since Watcher isn't
stopped/paused very often and the behavior on stop/pause is the same; the wait just runs on
the calling thread instead of a generic thread.
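The difference can be illustrated with a small sketch (simplified, hypothetical names; this is not the actual ExecutionService code). The only thing that changes is which thread performs the wait for in-flight executions:

```java
import java.util.concurrent.ExecutorService;

// Sketch only; the names are stand-ins for Watcher's internals.
interface InFlightExecutions {
    // Block until every currently running watch has completed,
    // rejecting any new executions from this point on.
    void sealAndAwaitCompletion();
}

class StopWatcherSketch {
    // Behavior prior to this change: the wait is forked, so stop() returns
    // while watches may still be running in the background.
    static void stopAsync(ExecutorService genericExecutor, InFlightExecutions executions) {
        genericExecutor.execute(executions::sealAndAwaitCompletion);
    }

    // Behavior restored by this change: the wait happens on the calling thread,
    // so once stop() returns no watches are still running.
    static void stopSync(InFlightExecutions executions) {
        executions.sealAndAwaitCompletion();
    }
}
```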

Related: #42409

@jakelandis jakelandis added :Data Management/Watcher >test Issues or PRs that are addressing/adding tests v7.3.0 v8.0.0 labels Jul 2, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@jakelandis jakelandis marked this pull request as ready for review July 2, 2019 17:04
@jakelandis jakelandis requested a review from spinscale July 2, 2019 17:06
Member

@martijnvg martijnvg left a comment


I left a question, but this looks good otherwise.

@@ -106,7 +105,7 @@
     private final WatchExecutor executor;
     private final ExecutorService genericExecutor;

-    private AtomicReference<CurrentExecutions> currentExecutions = new AtomicReference<>();
+    private CurrentExecutions currentExecutions;
Member


Can CurrentExecutions remain inside an AtomicReference?
It is read elsewhere without acquiring a lock, and as far as I understand this change is about making sure that clearExecutions() happens in a synchronous manner, which should be possible while keeping the AtomicReference?

(also the currentExecutions field can then be made final)
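A sketch of that suggestion (simplified, hypothetical stand-ins for the real classes): keep the field final and swap the value atomically, so readers that do not hold a lock always see a fully constructed instance, never null.

```java
import java.util.concurrent.atomic.AtomicReference;

class ExecutionServiceSketch {
    static final class CurrentExecutions { /* tracks in-flight watch executions */ }

    // Final field; only the referenced value is swapped.
    private final AtomicReference<CurrentExecutions> currentExecutions =
            new AtomicReference<>(new CurrentExecutions());

    CurrentExecutions current() {
        return currentExecutions.get();
    }

    void clearExecutions() {
        // Replacing the bookkeeping is a single atomic step; concurrent readers
        // observe either the old instance or the new one.
        currentExecutions.set(new CurrentExecutions());
    }
}
```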

Contributor


Yes, I think this either needs to be volatile or go back to being an AtomicReference.

Contributor Author


Thanks, nice catch. I had also messed up the order of "sealing" the concurrent executions. The change here is now a single line that removes the fork.

@jakelandis
Contributor Author

@elasticmachine update branch

elasticmachine and others added 5 commits July 2, 2019 15:02
This reverts commit 9d18274.
@jakelandis
Contributor Author

test failures appear relevant... looking into it.

@jakelandis
Contributor Author

It appears that this can result in holding up the cluster state applier thread for too long. Closing this PR; I will open a new one that takes into account the concurrent executions, in addition to the stopped state returned by watcher stats, to block until Watcher is fully stopped.
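A rough sketch of that follow-up idea (hypothetical helper, not the actual test code), assuming the per-node watcher stats expose a watcher_state value and a current_watches list when the current_watches metric is requested: only treat Watcher as fully stopped once the state is "stopped" and no watch is still executing.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class AwaitWatcherStopped {
    static void awaitFullyStopped(Supplier<Map<String, Object>> nodeStats, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            Map<String, Object> stats = nodeStats.get();
            boolean stopped = "stopped".equals(stats.get("watcher_state"));
            List<?> current = (List<?>) stats.getOrDefault("current_watches", List.of());
            if (stopped && current.isEmpty()) {
                return; // state reports stopped AND nothing is still running
            }
            Thread.sleep(100); // poll until the deadline
        }
        throw new AssertionError("Watcher did not fully stop within the timeout");
    }
}
```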
