
Simplify and Fix Synchronization in InternalTestCluster #39168

Conversation

original-brownbear (Member) commented Feb 20, 2019

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
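
For illustration, here is a minimal sketch of the last three bullets (simplifying a stream usage, hoisting a stateless `Predicate` into a constant, and synchronizing a public method that mutates shared state). The class and method names below are simplified stand-ins, not the actual InternalTestCluster code:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

class ClusterSketch {
    // A stateless predicate can be a shared constant instead of a new lambda per call.
    private static final Predicate<NodeAndClient> DATA_NODE_PREDICATE = NodeAndClient::isDataNode;

    private final Map<String, NodeAndClient> nodes = new TreeMap<>();

    // Read path reusing the constant predicate in a simple stream pipeline.
    public synchronized long numDataNodes() {
        return nodes.values().stream().filter(DATA_NODE_PREDICATE).count();
    }

    // Public entry point that mutates shared state: guarded by the cluster monitor
    // so external callers cannot race against internal updates.
    public synchronized void stopNode(String name) {
        NodeAndClient removed = nodes.remove(name);
        if (removed != null) {
            removed.close();
        }
    }

    // Minimal stand-in for the real helper class.
    static class NodeAndClient {
        boolean isDataNode() { return true; }
        void close() {}
    }
}
```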
original-brownbear added the >test (Issues or PRs that are addressing/adding tests), :Delivery/Build (Build or test infrastructure), v8.0.0, and v7.2.0 labels on Feb 20, 2019
elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-infra

original-brownbear marked this pull request as ready for review on February 20, 2019 09:58
original-brownbear (Member Author) commented Feb 20, 2019

@DaveCTurner I found a bunch more possible races here than the one that hit #39118. Not sure if you want to review this or whether I should find someone else? :)

DaveCTurner requested review from henningandersen and removed the review request for DaveCTurner on February 20, 2019 11:05
henningandersen (Contributor) left a comment

Thanks for looking into this, @original-brownbear.

I am inclined to think that making nodes immutable and creating a new map when adding/removing nodes is the better approach. getClients() is especially problematic, but so is the way we currently rely on the InternalTestCluster monitor to ensure nodes is not changed while we read from it. Keeping nodes as an immutable map makes it easy and cheap to take consistent snapshots of the map, and it also makes it obvious that anything accessing nodes will never block.
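
A rough sketch of this copy-on-write idea (hypothetical field and method names, with Object standing in for NodeAndClient; not the code the PR ends up with):

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

class NodesHolder {
    // Readers only ever see a fully built, immutable snapshot, so reads
    // never block and never observe a half-applied mutation.
    private volatile NavigableMap<String, Object> nodes =
        Collections.unmodifiableNavigableMap(new TreeMap<>());

    // Writers copy, mutate the copy, then publish it under the monitor.
    synchronized void addNode(String name, Object nodeAndClient) {
        NavigableMap<String, Object> copy = new TreeMap<>(nodes);
        copy.put(name, nodeAndClient);
        nodes = Collections.unmodifiableNavigableMap(copy);
    }

    // Any reader gets a consistent snapshot without taking the monitor.
    NavigableMap<String, Object> snapshot() {
        return nodes;
    }
}
```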

```diff
@@ -2245,7 +2240,7 @@ public void clearDisruptionScheme() {
         clearDisruptionScheme(true);
     }
 
-    public void clearDisruptionScheme(boolean ensureHealthyCluster) {
+    public synchronized void clearDisruptionScheme(boolean ensureHealthyCluster) {
```
henningandersen (Contributor) commented:

This method can wait for a healthy cluster. I think having the entire method synchronized while waiting for the cluster to become healthy could potentially lead to deadlocks?

original-brownbear (Member Author) replied:

Good question... I couldn't find a possible deadlock in the implementations of our disruptions from a quick look over them.
My thinking would be this:
If we don't synchronize here, we allow manipulating the cluster while we "wait for healthy", which could lead to some pretty hard-to-debug issues. Also, we really don't want to manipulate anything about the cluster while this method is in progress.
=> If we create some unforeseen deadlock here, I'd rather try to fix the implementation of the disruption to prevent the deadlock than allow concurrent modification of the cluster while we clear the disruption.

henningandersen (Contributor) replied:

If there is any chance of manipulations during this phase, I would rather guard against manipulating the cluster explicitly, by adding an intermediate closing (or stopped) state.

Do we not risk something like what you described in #39118? If the disruption prevented the result from returning, the callback could be called at this time. If that in turn calls any of the synchronized methods, it could potentially deadlock if we have to create a new connection while becoming healthy.

At a minimum I think we should add a comment explaining why the synchronized is there.

original-brownbear (Member Author) replied:

I added a comment for now. But I'm starting to think we're attacking this from the wrong angle to some degree. It seems like methods like this one (and a few others we have now discussed) are currently only called from the main JUnit thread. So instead of worrying endlessly about how we synchronize things like clearing the disruption while closing, why not just assert that we're on the main JUnit thread and simply not allow manipulating the cluster from elsewhere? We don't currently seem to manipulate the cluster from other threads, and I don't see a good reason to start doing so either (and if someone needs that kind of thing down the line, they're free to add it as needed).
IMO, that would make calls to e.g. InternalTestCluster#restartRandomDataNode(org.elasticsearch.test.InternalTestCluster.RestartCallback) a lot easier to follow and debug.
WDYT?
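
A hypothetical sketch of what such an assertion could look like (this is not what the PR implements; the class and method names are made up for illustration):

```java
// Records the thread that created the cluster (the main JUnit thread) and
// asserts that cluster-manipulating methods are only ever called from it.
// Note: assert statements only fire when the JVM runs with -ea, which is
// the usual setup for test runs.
class MainThreadGuard {
    private final Thread testThread = Thread.currentThread();

    // Would be called at the top of cluster-manipulating methods such as a restart.
    void assertOnTestThread() {
        assert Thread.currentThread() == testThread
            : "cluster must only be manipulated from the main test thread, but was called from "
                + Thread.currentThread().getName();
    }
}
```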

henningandersen (Contributor) replied:

I also did not find any specific places where we deliberately manipulate the cluster on other threads (though I have not done an exhaustive search). However, it is not obvious that calling, for instance, client() could be invalid on another thread (if the client, or even the node, is lazily created, that implicitly manipulates the cluster). Also, I wonder whether disruptive restart tests could be good to add, and whether they would be harder to add if all changes have to happen on the main thread. I think the code is now much clearer with this PR, and I would prefer to leave it with synchronized in place.

original-brownbear (Member Author) commented Feb 20, 2019

@henningandersen thanks for the thorough review! Moved to an immutable nodes map now, as you suggested, and answered the points that aren't automatically addressed by that. => This should be good for another review :)

henningandersen (Contributor) left a comment

getClients still has an issue (also prior to this PR): when you call next(), it also calls NodeAndClient.client(), which lazily builds the client. I think this mandates solving that part of the problem with an explicit monitor in NodeAndClient (does it have to be the InternalTestCluster monitor to be safe with respect to closing?).

In turn, I think this would remove the need for several of the client accessor methods to be synchronized (including smartClient).

Also (nit), getClients() does not need to be synchronized.
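
A rough sketch of guarding the lazy client creation with a shared monitor, so that iterating clients cannot race with closing (field and method names here are hypothetical; the real NodeAndClient differs):

```java
// Lazily builds a client at most once, and never after close(), by serializing
// on the same monitor object that close() uses (e.g. the test cluster instance).
class NodeAndClientSketch {
    private final Object clusterMonitor; // e.g. the InternalTestCluster instance
    private Object client;               // lazily created; Object stands in for the real client type
    private boolean closed;

    NodeAndClientSketch(Object clusterMonitor) {
        this.clusterMonitor = clusterMonitor;
    }

    Object client() {
        synchronized (clusterMonitor) {
            if (closed) {
                throw new IllegalStateException("node has been closed");
            }
            if (client == null) {
                client = buildClient();
            }
            return client;
        }
    }

    void close() {
        synchronized (clusterMonitor) {
            closed = true;
            client = null;
        }
    }

    private Object buildClient() {
        return new Object(); // placeholder for building the real client
    }
}
```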


original-brownbear (Member Author) commented:

> getClients still has an issue (also prior to this PR) in that when you call next() it also calls NodeAndClient.client(), which lazily builds the client. I think this mandates solving that part of the problem using an explicit monitor (does it have to be the InternalTestCluster monitor to be safe wrt closing?) in NodeAndClient?

I think for now let's sync on the InternalTestCluster, yeah, though we could look into a different way of syncing the close here next, I guess.

> In turn I think this would remove the need for several of the client accessor methods to be synchronized (including smartClient).

🎉 true :)

original-brownbear (Member Author) commented:

> If that in turn calls any of the synchronized methods it could potentially deadlock if we have to create a new connection while becoming healthy?

Not really, I think, because we're still on the same thread. The ServiceDisruptionScheme#stopDisrupting would have to create that new connection/client on a separate thread. Currently we don't do that, and I'm not sure we should start fixing this kind of thing. It will always be questionable to manipulate the cluster concurrently (even if we get it safe concurrency-wise, the tests just become very hard to interpret).
I'll add a comment for now :)
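
A tiny, generic Java example of the reentrancy point being made here (nothing Elasticsearch-specific): a synchronized method calling back into another synchronized method on the same object, from the same thread, re-acquires the monitor without blocking.

```java
// Generic illustration of monitor reentrancy: the callback runs on the thread
// that already holds the monitor, so the nested synchronized call cannot deadlock.
class ReentrancyDemo {
    synchronized void clearDisruption(Runnable callback) {
        callback.run(); // still on the same thread, monitor already held
    }

    synchronized void touchCluster() {
        System.out.println("re-entered monitor on " + Thread.currentThread().getName());
    }

    public static void main(String[] args) {
        ReentrancyDemo demo = new ReentrancyDemo();
        demo.clearDisruption(demo::touchCluster); // completes without blocking
    }
}
```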

original-brownbear (Member Author) commented:

@henningandersen Thanks for all the finds! All points addressed again I think :)

henningandersen (Contributor) left a comment

Thanks @original-brownbear, I left just a few comments; otherwise this is looking good.

original-brownbear (Member Author) commented:

@henningandersen thanks! All addressed, I think :)

henningandersen (Contributor) left a comment

LGTM.

Thanks @original-brownbear

original-brownbear (Member Author) commented:

@henningandersen thanks for the great review :)

original-brownbear merged commit 3a5d4dc into elastic:master Feb 21, 2019
original-brownbear deleted the less-sync-internal-test-cluster branch February 21, 2019 13:20
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 21, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit that referenced this pull request Feb 21, 2019

* Simplify and Fix Synchronization in InternalTestCluster (#39168)

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
jasontedor added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 22, 2019
* elastic/master:
  Ensure index commit released when testing timeouts (elastic#39273)
  Avoid using TimeWarp in TransformIntegrationTests. (elastic#39277)
  Fixed missed stopping of SchedulerEngine (elastic#39193)
  [CI] Mute CcrRetentionLeaseIT.testRetentionLeaseIsRenewedDuringRecovery (elastic#39269)
  Muting AutoFollowIT.testAutoFollowManyIndices (elastic#39264)
  Clarify the use of sleep in CCR test
  Fix testCannotShrinkLeaderIndex (elastic#38529)
  Fix CCR tests that manipulate transport requests
  Align generated release notes with doc standards (elastic#39234)
  Mute test (elastic#39248)
  ReadOnlyEngine should update translog recovery state information (elastic#39238)
  Wrap accounting breaker check in assertBusy (elastic#39211)
  Simplify and Fix Synchronization in InternalTestCluster (elastic#39168)
  [Tests] Make testEngineGCDeletesSetting deterministic (elastic#38942)
  Extend nextDoc to delegate to the wrapped doc-value iterator for date_nanos (elastic#39176)
  Change ShardFollowTask to reuse common serialization logic (elastic#39094)
  Replace superfluous usage of Counter with Supplier (elastic#39048)
  Disable bwc tests for elastic#39094
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
ywelsch (Contributor) commented Mar 13, 2019

Should this be backported to 7.0 (and possibly earlier) to avoid test failures?

original-brownbear (Member Author) commented:

@ywelsch yeah, definitely! (Sorry for forgetting that.) I'll backport to 7.0 and will look into how tricky it is to get this into 6.7 (I remember that wasn't totally straightforward when I last checked).

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 13, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit that referenced this pull request Mar 14, 2019

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
original-brownbear (Member Author) commented:

Backported to 7.0 in #40013.

original-brownbear (Member Author) commented:

@ywelsch backporting this to 6.7 is somewhat complex, and I think it is not necessary, since 6.7 uses the blocking mock networking, which isn't hit by the issue(s) that motivated this PR in the first place.

Labels
:Delivery/Build (Build or test infrastructure), Team:Delivery (Meta label for Delivery team), >test (Issues or PRs that are addressing/adding tests), v7.0.0-rc1, v7.2.0, v8.0.0-alpha1