
Watcher add stopped listener #43939

Merged: 20 commits, Aug 16, 2019

Conversation

@jakelandis (Contributor) commented Jul 3, 2019

When Watcher is stopped while there are still outstanding watches running,
Watcher will report itself as stopped. In normal cases, this is not problematic.

However, for integration tests Watcher is started and stopped between
each test to help ensure a clean slate for each test. The tests block
only on the stopped state and make an implicit assumption that all watches are
finished if Watcher is stopped. This is an incorrect assumption, since
stopped really means "I will not accept any more watches". This can lead to
unpredictable behavior in the tests, such as message: "Watch is already queued
in thread pool" and state: "not_executed_already_queued".
It can also change the .watcher-history if watches linger between tests.

This commit changes the semantics of manually stopping Watcher to mean:
"I will not accept any more watches AND all running watches are complete".
There is now an intermediate "Stopping" state and a callback that allows the
transition to a "Stopped" state once all watches have completed.

Additionally, since this impacts how long the tests will block waiting for a
"Stopped" state, the timeout has been increased.

Related: #42409

@jakelandis (Contributor Author)

@elasticmachine update branch

@elasticmachine (Collaborator)

Pinging @elastic/es-core-features

@jakelandis marked this pull request as ready for review July 4, 2019 00:59
@jakelandis (Contributor Author)

I may need to adjust the timeout or tests based on a few runs through CI.

@spinscale (Contributor) left a comment

left a few nitpicks, but the racy issue should be solved IMO

@spinscale (Contributor)

Something that dawned on me this morning when thinking about this change: there is no guarantee about the order of execution when tasks are passed to the generic executor, so passing in stop/start/stop/start may end up running in a different order and the last call could become stop, which in turn may mean a different status on different nodes (especially in rapid start/stop succession after test runs). A potential solution here might be to pass the cluster state version on those checks, so that executions with an earlier cluster state version than the latest will be dismissed and will not change the state.
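
A sketch of that version-guard idea; the class VersionGuard and method applyIfNewest are hypothetical names for how the suggestion could look, not code from this PR:

import java.util.concurrent.atomic.AtomicLong;

class VersionGuard {
    private final AtomicLong lastSeenVersion = new AtomicLong(-1);

    // Run the transition only if it was computed from the newest cluster
    // state seen so far; stale tasks reordered by the generic executor
    // are dismissed without changing the state.
    void applyIfNewest(long clusterStateVersion, Runnable transition) {
        long seen = lastSeenVersion.get();
        while (clusterStateVersion > seen) {
            if (lastSeenVersion.compareAndSet(seen, clusterStateVersion)) {
                transition.run();
                return;
            }
            seen = lastSeenVersion.get();
        }
        // clusterStateVersion <= seen: a newer cluster state already won.
    }
}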

@jakelandis (Contributor Author)

> (quoting @spinscale's comment above)

I'm not sure I follow. https://github.com/elastic/elasticsearch/pull/43939/files#diff-5831c85834676ac07259e13086bf1a95R108 only allows transitions from STOPPING -> STOPPED, and the lines above only allow transitions from STARTED -> STOPPING. It is possible for STOPPED to never be reached (if Watcher was restarted while waiting for the async cleanup to complete). The tests should not allow this.
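
To illustrate that parenthetical, reusing the hypothetical WatcherState enum from the sketch in the description (not the PR's actual code): a restart that lands between the manual stop and the cleanup callback makes the final compareAndSet a no-op, so STOPPED is never reached:

import java.util.concurrent.atomic.AtomicReference;

class RestartDuringStopping {
    public static void main(String[] args) {
        AtomicReference<WatcherState> state = new AtomicReference<>(WatcherState.STARTED);
        state.compareAndSet(WatcherState.STARTED, WatcherState.STOPPING);  // manual stop via the API
        state.compareAndSet(WatcherState.STOPPING, WatcherState.STARTING); // restart while watches drain
        // The async cleanup callback now fails its compareAndSet and changes nothing:
        boolean reachedStopped = state.compareAndSet(WatcherState.STOPPING, WatcherState.STOPPED);
        System.out.println("reached STOPPED: " + reachedStopped); // prints false
    }
}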

@jakelandis (Contributor Author)

@spinscale @martijnvg - pending clarification of #43939 (comment), this should be ready for another pass.

@martijnvg (Member) left a comment

I left a few questions, mainly for my own education.

    // if this is not a data node, we need to start it ourselves possibly
    if (event.state().nodes().getLocalNode().isDataNode() == false &&
-        isWatcherStoppedManually == false && this.state.get() == WatcherState.STOPPED) {
+        isWatcherStoppedManually == false && isStoppedOrStopping) {
        this.state.set(WatcherState.STARTING);
@martijnvg (Member):
Shouldn't we only be able to set the state to STARTING if the current state is STOPPED?

@jakelandis (Contributor Author):
This is to stay passive with the old behavior. If we kept this to only STOPPED with this change, it would mean you could not restart Watcher while any watches are currently in flight.

The pre-existing design allows you to "stop" and restart it immediately and re-process the same watch; any in-flight watches will finish up. If I didn't allow STOPPING here, you would have to wait for all in-flight watches to finish up, and if one were stuck (for whatever reason) it could cause the inability to ever reach a fully STOPPED state.

    // if this is not a data node, we need to start it ourselves possibly
    if (event.state().nodes().getLocalNode().isDataNode() == false &&
-        isWatcherStoppedManually == false && this.state.get() == WatcherState.STOPPED) {
+        isWatcherStoppedManually == false && isStoppedOrStopping) {
        this.state.set(WatcherState.STARTING);
        watcherService.start(event.state(), () -> this.state.set(WatcherState.STARTED));
@martijnvg (Member):
Should we guard against going into STARTED state from a state other than STARTING?
(this.state.compareAndSet(WatcherState.STARTING, WatcherState.STARTED);)

@jakelandis (Contributor Author):
I am trying to avoid changes to the START* states/behavior. This change should only impact the STOPPED state for a manual (e.g. via the API) request to stop.

@jakelandis (Contributor Author) commented Jul 10, 2019

@elasticmachine update branch

jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Aug 23, 2019
As of elastic#43939 Watcher tests now correctly block until all Watch executions
kicked off by that test are finished. Previously we allowed tests to finish with
outstanding watch executions. It was known that this would increase the
time needed to finish a test. However, running the tests on CI can be slow
and on at least 1 occasion it took 60s to actually finish.

This PR simply increases the max allowable timeout for Watcher tests
to clean up after themselves.
jakelandis added a commit that referenced this pull request Aug 26, 2019
As of #43939 Watcher tests now correctly block until all Watch executions
kicked off by that test are finished. Previously we allowed tests to finish with
outstanding watch executions. It was known that this would increase the
time needed to finish a test. However, running the tests on CI can be slow
and on at least 1 occasion it took 60s to actually finish.

This PR simply increases the max allowable timeout for Watcher tests
to clean up after themselves.
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
This test is believed to be fixed by elastic#43939
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
These tests are believed to be fixed by elastic#43939

closes elastic#45582 and elastic#43975

These tests are currently only muted in master.

Jul 23 - muted in master (Test transform scripts are updated on execution)
Aug 14 - muted in master (Test condition scripts are updated on execution)
Aug 15 - last recorded failure (from either test on any branch)
Aug 16 - elastic#43939 committed to master
Aug 22 - elastic#43939 backported to 7.x
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
This test is believed to be fixed by elastic#43939

closes elastic#43889

--------

This test is currently only muted in master.

Jul 10 - muted in master
Aug 16 - elastic#43939 committed to master
Aug 22 - elastic#43939 backported to 7.x
Aug 23 - last recorded failure (on any branch)
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
This test is believed to be fixed by elastic#43939

closes elastic#43988
jakelandis added a commit that referenced this pull request Oct 7, 2019
This test is believed to be fixed by #43939

Closes #45585
jakelandis added a commit that referenced this pull request Oct 7, 2019
These tests are believed to be fixed by #43939

closes #45582 and #43975
jakelandis added a commit that referenced this pull request Oct 7, 2019
This test is believed to be fixed by #43939

closes #43889
jakelandis added a commit that referenced this pull request Oct 7, 2019
This test is believed to be fixed by #43939

closes #43988
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
This test is believed to be fixed by elastic#43939

closes elastic#43988
jakelandis added a commit that referenced this pull request Oct 8, 2019
This test is believed to be fixed by #43939

closes #43988
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 11, 2019
This test is believed to be fixed by elastic#43939

closes elastic#40178

--------

Note - this test was run for 24+ hours on CI hardware with no failures.
jakelandis added a commit that referenced this pull request Oct 14, 2019
This test is believed to be fixed by #43939

closes #40178
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 14, 2019
jakelandis added a commit that referenced this pull request Oct 14, 2019
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 18, 2019
This test is believed to be fixed by elastic#43939

closes elastic#33185