
Simplify and Fix Synchronization in InternalTestCluster #39168

Conversation

original-brownbear (Member) commented Feb 20, 2019

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
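
For illustration, here is a minimal sketch of the last three bullets (simplifying a stream usage, hoisting a stateless `Predicate` into a constant, and synchronizing a public method that mutates shared state). The class and method names below are simplified stand-ins, not the actual InternalTestCluster code:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

class ClusterSketch {
    // A stateless predicate can be a shared constant instead of a new lambda per call.
    private static final Predicate<NodeAndClient> DATA_NODE_PREDICATE = NodeAndClient::isDataNode;

    private final Map<String, NodeAndClient> nodes = new TreeMap<>();

    // Read path reusing the constant predicate in a simple stream pipeline.
    public synchronized long numDataNodes() {
        return nodes.values().stream().filter(DATA_NODE_PREDICATE).count();
    }

    // Public entry point that mutates shared state: guarded by the cluster monitor
    // so external callers cannot race against internal updates.
    public synchronized void stopNode(String name) {
        NodeAndClient removed = nodes.remove(name);
        if (removed != null) {
            removed.close();
        }
    }

    // Minimal stand-in for the real helper class.
    static class NodeAndClient {
        boolean isDataNode() { return true; }
        void close() {}
    }
}
```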
original-brownbear added the >test (Issues or PRs that are addressing/adding tests), :Delivery/Build (Build or test infrastructure), v8.0.0, and v7.2.0 labels on Feb 20, 2019
elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-infra

original-brownbear marked this pull request as ready for review on February 20, 2019 09:58
original-brownbear (Member Author) commented Feb 20, 2019

@DaveCTurner I found a bunch more possible races here than the one that hit #39118. Not sure if you want to review this or whether I should find someone else? :)

DaveCTurner requested review from henningandersen and removed the review request for DaveCTurner on February 20, 2019 11:05
henningandersen (Contributor) left a comment

Thanks for looking into this, @original-brownbear.

I am inclined to think that making nodes immutable and creating a new map when adding/removing nodes is the better approach. getClients() is especially problematic, but so is the way we currently rely on the InternalTestCluster monitor to ensure nodes is not changed while we read from it. Keeping nodes as an immutable map makes it easy and cheap to take consistent snapshots of the map, and it also makes it obvious that anything accessing nodes will never block.
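
A rough sketch of this copy-on-write idea (hypothetical field and method names, with Object standing in for NodeAndClient; not the code the PR ends up with):

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

class NodesHolder {
    // Readers only ever see a fully built, immutable snapshot, so reads
    // never block and never observe a half-applied mutation.
    private volatile NavigableMap<String, Object> nodes =
        Collections.unmodifiableNavigableMap(new TreeMap<>());

    // Writers copy, mutate the copy, then publish it under the monitor.
    synchronized void addNode(String name, Object nodeAndClient) {
        NavigableMap<String, Object> copy = new TreeMap<>(nodes);
        copy.put(name, nodeAndClient);
        nodes = Collections.unmodifiableNavigableMap(copy);
    }

    // Any reader gets a consistent snapshot without taking the monitor.
    NavigableMap<String, Object> snapshot() {
        return nodes;
    }
}
```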

```diff
@@ -2245,7 +2240,7 @@ public void clearDisruptionScheme() {
         clearDisruptionScheme(true);
     }
 
-    public void clearDisruptionScheme(boolean ensureHealthyCluster) {
+    public synchronized void clearDisruptionScheme(boolean ensureHealthyCluster) {
```
henningandersen (Contributor) commented:

This method can wait for a healthy cluster. I think having the entire method synchronized while waiting for the cluster to become healthy could potentially lead to deadlocks?

original-brownbear (Member Author) replied:

Good question... I couldn't find a possible deadlock in the implementations of our disruptions from a quick look over them.
My thinking would be this:
If we don't synchronize here, we allow manipulating the cluster while we "wait for healthy", which could lead to some pretty hard-to-debug issues. Also, we really don't want to manipulate anything about the cluster while this method is in progress.
=> If we create some unforeseen deadlock here, I'd rather try to fix the implementation of the disruption to prevent the deadlock than allow concurrent modification of the cluster while we clear the disruption.

henningandersen (Contributor) replied:

If there is any chance of manipulations during this phase, I would rather guard against manipulating the cluster explicitly, by adding an intermediate closing (or stopped) state.

Do we not risk something like what you described in #39118? If the disruption prevented the result from returning, the callback could be called at this time. If that in turn calls any of the synchronized methods, it could potentially deadlock if we have to create a new connection while becoming healthy.

At a minimum I think we should add a comment explaining why the synchronized is there.

original-brownbear (Member Author) replied:

I added a comment for now. But I'm starting to think we're attacking this from the wrong angle to some degree. It seems like methods like this one (and a few others we have now discussed) are currently only called from the main JUnit thread. So instead of worrying endlessly about how we synchronize things like clearing the disruption while closing, why not just assert that we're on the main JUnit thread and simply not allow manipulating the cluster from elsewhere? We don't currently seem to manipulate the cluster from other threads, and I don't see a good reason to start doing so either (and if someone needs that kind of thing down the line, they're free to add it as needed).
IMO, that would make calls to e.g. InternalTestCluster#restartRandomDataNode(org.elasticsearch.test.InternalTestCluster.RestartCallback) a lot easier to follow and debug.
WDYT?
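
A hypothetical sketch of what such an assertion could look like (this is not what the PR implements; the class and method names are made up for illustration):

```java
// Records the thread that created the cluster (the main JUnit thread) and
// asserts that cluster-manipulating methods are only ever called from it.
// Note: assert statements only fire when the JVM runs with -ea, which is
// the usual setup for test runs.
class MainThreadGuard {
    private final Thread testThread = Thread.currentThread();

    // Would be called at the top of cluster-manipulating methods such as a restart.
    void assertOnTestThread() {
        assert Thread.currentThread() == testThread
            : "cluster must only be manipulated from the main test thread, but was called from "
                + Thread.currentThread().getName();
    }
}
```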

henningandersen (Contributor) replied:

I also did not find any specific places where we deliberately manipulate the cluster on other threads (though I have not done an exhaustive search). However, it is not obvious that calling, for instance, client() could be invalid on another thread (if the client, or even the node, is lazily created, that implicitly manipulates the cluster). Also, I wonder whether disruptive restart tests could be good to add, and whether they would be harder to add if all changes have to happen on the main thread. I think the code is now much clearer with this PR, and I would prefer to leave it with synchronized in place.

original-brownbear (Member Author) commented Feb 20, 2019

@henningandersen thanks for the thorough review! Moved to an immutable nodes map now, as you suggested, and answered the points that aren't automatically addressed by that. => This should be good for another review :)

henningandersen (Contributor) left a comment

getClients still has an issue (also prior to this PR): when you call next(), it also calls NodeAndClient.client(), which lazily builds the client. I think this mandates solving that part of the problem with an explicit monitor in NodeAndClient (does it have to be the InternalTestCluster monitor to be safe with respect to closing?).

In turn, I think this would remove the need for several of the client accessor methods to be synchronized (including smartClient).

Also (nit), getClients() does not need to be synchronized.
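
A rough sketch of guarding the lazy client creation with a shared monitor, so that iterating clients cannot race with closing (field and method names here are hypothetical; the real NodeAndClient differs):

```java
// Lazily builds a client at most once, and never after close(), by serializing
// on the same monitor object that close() uses (e.g. the test cluster instance).
class NodeAndClientSketch {
    private final Object clusterMonitor; // e.g. the InternalTestCluster instance
    private Object client;               // lazily created; Object stands in for the real client type
    private boolean closed;

    NodeAndClientSketch(Object clusterMonitor) {
        this.clusterMonitor = clusterMonitor;
    }

    Object client() {
        synchronized (clusterMonitor) {
            if (closed) {
                throw new IllegalStateException("node has been closed");
            }
            if (client == null) {
                client = buildClient();
            }
            return client;
        }
    }

    void close() {
        synchronized (clusterMonitor) {
            closed = true;
            client = null;
        }
    }

    private Object buildClient() {
        return new Object(); // placeholder for building the real client
    }
}
```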


original-brownbear (Member Author) commented:

> getClients still has an issue (also prior to this PR) in that when you call next() it also calls NodeAndClient.client(), which lazily builds the client. I think this mandates solving that part of the problem using an explicit monitor (does it have to be the InternalTestCluster monitor to be safe wrt closing?) in NodeAndClient?

I think for now let's sync on the InternalTestCluster, yeah, though we could look into a different way of syncing the close here next, I guess.

> In turn I think this would remove the need for several of the client accessor methods to be synchronized (including smartClient).

🎉 true :)

original-brownbear (Member Author) commented:

> If that in turn calls any of the synchronized methods it could potentially deadlock if we have to create a new connection while becoming healthy?

Not really, I think, because we're still on the same thread. The ServiceDisruptionScheme#stopDisrupting would have to create that new connection/client on a separate thread. Currently we don't do that, and I'm not sure we should start fixing this kind of thing. It will always be questionable to manipulate the cluster concurrently (even if we get it safe concurrency-wise, the tests just become very hard to interpret).
I'll add a comment for now :)
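
A tiny, generic Java example of the reentrancy point being made here (nothing Elasticsearch-specific): a synchronized method calling back into another synchronized method on the same object, from the same thread, re-acquires the monitor without blocking.

```java
// Generic illustration of monitor reentrancy: the callback runs on the thread
// that already holds the monitor, so the nested synchronized call cannot deadlock.
class ReentrancyDemo {
    synchronized void clearDisruption(Runnable callback) {
        callback.run(); // still on the same thread, monitor already held
    }

    synchronized void touchCluster() {
        System.out.println("re-entered monitor on " + Thread.currentThread().getName());
    }

    public static void main(String[] args) {
        ReentrancyDemo demo = new ReentrancyDemo();
        demo.clearDisruption(demo::touchCluster); // completes without blocking
    }
}
```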

original-brownbear (Member Author) commented:

@henningandersen Thanks for all the finds! All points addressed again I think :)

henningandersen (Contributor) left a comment

Thanks @original-brownbear, I left just a few comments; otherwise this is looking good.

original-brownbear (Member Author) commented:

@henningandersen thanks! All addressed, I think :)

henningandersen (Contributor) left a comment

LGTM.

Thanks @original-brownbear

original-brownbear (Member Author) commented:

@henningandersen thanks for the great review :)

original-brownbear merged commit 3a5d4dc into elastic:master Feb 21, 2019
original-brownbear deleted the less-sync-internal-test-cluster branch February 21, 2019 13:20
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 21, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit that referenced this pull request Feb 21, 2019

* Simplify and Fix Synchronization in InternalTestCluster (#39168)

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
jasontedor added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 22, 2019
* elastic/master:
  Ensure index commit released when testing timeouts (elastic#39273)
  Avoid using TimeWarp in TransformIntegrationTests. (elastic#39277)
  Fixed missed stopping of SchedulerEngine (elastic#39193)
  [CI] Mute CcrRetentionLeaseIT.testRetentionLeaseIsRenewedDuringRecovery (elastic#39269)
  Muting AutoFollowIT.testAutoFollowManyIndices (elastic#39264)
  Clarify the use of sleep in CCR test
  Fix testCannotShrinkLeaderIndex (elastic#38529)
  Fix CCR tests that manipulate transport requests
  Align generated release notes with doc standards (elastic#39234)
  Mute test (elastic#39248)
  ReadOnlyEngine should update translog recovery state information (elastic#39238)
  Wrap accounting breaker check in assertBusy (elastic#39211)
  Simplify and Fix Synchronization in InternalTestCluster (elastic#39168)
  [Tests] Make testEngineGCDeletesSetting deterministic (elastic#38942)
  Extend nextDoc to delegate to the wrapped doc-value iterator for date_nanos (elastic#39176)
  Change ShardFollowTask to reuse common serialization logic (elastic#39094)
  Replace superfluous usage of Counter with Supplier (elastic#39048)
  Disable bwc tests for elastic#39094
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
ywelsch (Contributor) commented Mar 13, 2019

Should this be backported to 7.0 (and possibly earlier) to avoid test failures?

original-brownbear (Member Author) commented:

@ywelsch yeah, definitely! (Sorry for forgetting that.) I'll backport to 7.0 and will look into how tricky it is to get this into 6.7 (I remember that wasn't totally straightforward when I last checked).

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 13, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit that referenced this pull request Mar 14, 2019

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
original-brownbear (Member Author) commented:

Backported to 7.0 in #40013.

original-brownbear (Member Author) commented:

@ywelsch backporting this to 6.7 is somewhat complex, and I think it is not necessary, since 6.7 uses the blocking mock networking, which isn't hit by the issue(s) that motivated this PR in the first place.

Labels
:Delivery/Build (Build or test infrastructure), Team:Delivery (Meta label for Delivery team), >test (Issues or PRs that are addressing/adding tests), v7.0.0-rc1, v7.2.0, v8.0.0-alpha1