
Watcher add stopped listener #43939

Merged: 20 commits, Aug 16, 2019

Conversation

@jakelandis (Contributor) commented Jul 3, 2019

When Watcher is stopped while there are still outstanding watches running,
Watcher will report itself as stopped. In normal cases, this is not problematic.

However, for integration tests Watcher is started and stopped between
each test to help ensure a clean slate for each test. The tests block
only on the stopped state and make an implicit assumption that all watches are
finished if Watcher is stopped. This is an incorrect assumption, since
stopped really means "I will not accept any more watches". This can lead to
unpredictable behavior in the tests, such as message: "Watch is already queued
in thread pool" and state: "not_executed_already_queued".
It can also change the .watcher-history if watches linger between tests.

This commit changes the semantics of manually stopping Watcher to mean:
"I will not accept any more watches AND all running watches are complete".
There is now an intermediate "Stopping" state and a callback that allows the
transition to a "Stopped" state once all watches have completed.

Additionally, since this impacts how long the tests will block waiting for a
"Stopped" state, the timeout has been increased.

Related: #42409

@jakelandis (Contributor Author)

@elasticmachine update branch

@elasticmachine (Collaborator)

Pinging @elastic/es-core-features

@jakelandis marked this pull request as ready for review July 4, 2019 00:59
@jakelandis (Contributor Author)

I may need to adjust the timeout or tests based on a few runs through CI.

@spinscale (Contributor) left a comment

left a few nitpicks, but the racy issue should be solved IMO

@spinscale (Contributor)

Something that dawned on me this morning when thinking about this change: there is no guarantee about the order of execution when tasks are passed to the generic executor, so passing in stop/start/stop/start may end up running in a different order and the last call could become stop, which in turn may mean a different status on different nodes (especially in rapid start/stop succession after test runs). A potential solution here might be to pass the cluster state version on those checks, so that executions with an earlier cluster state version than the latest will be dismissed and will not change the state.
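
A sketch of that version-guard idea; the class VersionGuard and method applyIfNewest are hypothetical names for how the suggestion could look, not code from this PR:

import java.util.concurrent.atomic.AtomicLong;

class VersionGuard {
    private final AtomicLong lastSeenVersion = new AtomicLong(-1);

    // Run the transition only if it was computed from the newest cluster
    // state seen so far; stale tasks reordered by the generic executor
    // are dismissed without changing the state.
    void applyIfNewest(long clusterStateVersion, Runnable transition) {
        long seen = lastSeenVersion.get();
        while (clusterStateVersion > seen) {
            if (lastSeenVersion.compareAndSet(seen, clusterStateVersion)) {
                transition.run();
                return;
            }
            seen = lastSeenVersion.get();
        }
        // clusterStateVersion <= seen: a newer cluster state already won.
    }
}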

@jakelandis (Contributor Author)

> (quoting @spinscale's comment above)

I'm not sure I follow. https://github.com/elastic/elasticsearch/pull/43939/files#diff-5831c85834676ac07259e13086bf1a95R108 only allows transitions from STOPPING -> STOPPED, and the lines above only allow transitions from STARTED -> STOPPING. It is possible for STOPPED to never be reached (if Watcher was restarted while waiting for the async cleanup to complete). The tests should not allow this.
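
To illustrate that parenthetical, reusing the hypothetical WatcherState enum from the sketch in the description (not the PR's actual code): a restart that lands between the manual stop and the cleanup callback makes the final compareAndSet a no-op, so STOPPED is never reached:

import java.util.concurrent.atomic.AtomicReference;

class RestartDuringStopping {
    public static void main(String[] args) {
        AtomicReference<WatcherState> state = new AtomicReference<>(WatcherState.STARTED);
        state.compareAndSet(WatcherState.STARTED, WatcherState.STOPPING);  // manual stop via the API
        state.compareAndSet(WatcherState.STOPPING, WatcherState.STARTING); // restart while watches drain
        // The async cleanup callback now fails its compareAndSet and changes nothing:
        boolean reachedStopped = state.compareAndSet(WatcherState.STOPPING, WatcherState.STOPPED);
        System.out.println("reached STOPPED: " + reachedStopped); // prints false
    }
}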

@jakelandis (Contributor Author)

@spinscale @martijnvg - pending clarification of #43939 (comment), this should be ready for another pass.

@martijnvg (Member) left a comment

I left a few questions, mainly for my own education.

    // if this is not a data node, we need to start it ourselves possibly
    if (event.state().nodes().getLocalNode().isDataNode() == false &&
-        isWatcherStoppedManually == false && this.state.get() == WatcherState.STOPPED) {
+        isWatcherStoppedManually == false && isStoppedOrStopping) {
        this.state.set(WatcherState.STARTING);
@martijnvg (Member):
Shouldn't we only be able to set the state to STARTING if the current state is STOPPED?

@jakelandis (Contributor Author):
This is to stay passive with the old behavior. If we kept this to only STOPPED with this change, it would mean you could not restart Watcher while any watches are currently in flight.

The pre-existing design allows you to "stop" and restart it immediately and re-process the same watch; any in-flight watches will finish up. If I didn't allow STOPPING here, you would have to wait for all in-flight watches to finish up, and if one were stuck (for whatever reason) it could cause the inability to ever reach a fully STOPPED state.

    // if this is not a data node, we need to start it ourselves possibly
    if (event.state().nodes().getLocalNode().isDataNode() == false &&
-        isWatcherStoppedManually == false && this.state.get() == WatcherState.STOPPED) {
+        isWatcherStoppedManually == false && isStoppedOrStopping) {
        this.state.set(WatcherState.STARTING);
        watcherService.start(event.state(), () -> this.state.set(WatcherState.STARTED));
@martijnvg (Member):
Should we guard against going into STARTED state from a state other than STARTING?
(this.state.compareAndSet(WatcherState.STARTING, WatcherState.STARTED);)

@jakelandis (Contributor Author):
I am trying to avoid changes to the START* states/behavior. This change should only impact the STOPPED state for a manual (e.g. via the API) request to stop.

@jakelandis (Contributor Author) commented Jul 10, 2019

@elasticmachine update branch

jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Aug 23, 2019
As of elastic#43939 Watcher tests now correctly block until all Watch executions
kicked off by that test are finished. Previously we allowed tests to finish with
outstanding watch executions. It was known that this would increase the
time needed to finish a test. However, running the tests on CI can be slow
and on at least 1 occasion it took 60s to actually finish.

This PR simply increases the max allowable timeout for Watcher tests
to clean up after themselves.
jakelandis added a commit that referenced this pull request Aug 26, 2019
As of #43939 Watcher tests now correctly block until all Watch executions
kicked off by that test are finished. Previously we allowed tests to finish with
outstanding watch executions. It was known that this would increase the
time needed to finish a test. However, running the tests on CI can be slow
and on at least 1 occasion it took 60s to actually finish.

This PR simply increases the max allowable timeout for Watcher tests
to clean up after themselves.
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
This test is believed to be fixed by elastic#43939
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
These tests are believed to be fixed by elastic#43939

closes elastic#45582 and elastic#43975

These tests are currently only muted in master.

Jul 23 - muted in master (Test transform scripts are updated on execution)
Aug 14 - muted in master (Test condition scripts are updated on execution)
Aug 15 - last recorded failure (from either test on any branch)
Aug 16 - elastic#43939 committed to master
Aug 22 - elastic#43939 backported to 7.x
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
This test is believed to be fixed by elastic#43939

closes elastic#43889

--------

This test is currently only muted in master.

Jul 10 - muted in master
Aug 16 - elastic#43939 committed to master
Aug 22 - elastic#43939 backported to 7.x
Aug 23 - last recorded failure (on any branch)
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
This test is believed to be fixed by elastic#43939

closes elastic#43988
jakelandis added a commit that referenced this pull request Oct 7, 2019
This test is believed to be fixed by #43939

Closes #45585
jakelandis added a commit that referenced this pull request Oct 7, 2019
These tests are believed to be fixed by #43939

closes #45582 and #43975
jakelandis added a commit that referenced this pull request Oct 7, 2019
This test is believed to be fixed by #43939

closes #43889
jakelandis added a commit that referenced this pull request Oct 7, 2019
This test is believed to be fixed by #43939

closes #43988
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 7, 2019
This test is believed to be fixed by elastic#43939

closes elastic#43988
jakelandis added a commit that referenced this pull request Oct 8, 2019
This test is believed to be fixed by #43939

closes #43988
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 11, 2019
This test is believed to be fixed by elastic#43939

closes elastic#40178

--------

Note - this test was run for 24+ hours on CI hardware with no failures.
jakelandis added a commit that referenced this pull request Oct 14, 2019
This test is believed to be fixed by #43939

closes #40178
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 14, 2019
jakelandis added a commit that referenced this pull request Oct 14, 2019
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Oct 18, 2019
This test is believed to be fixed by elastic#43939

closes elastic#33185