Closed shard should never open new engine #47186

dnhatn · 2019-09-26T20:32:00Z

We should not open new engines if a shard is closed. We broke this assumption in #45263, where we stopped verifying the shard state before creating an engine and now verify it only before swapping the engine reference. As a result, we can fail to snapshot the store metadata or run checkIndex on a closed shard if some IndexWriter still holds the index lock.

Closes #47060
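
For readers following along, the invariant described above can be sketched roughly as follows. This is a minimal illustration, not the actual IndexShard code: `SketchShard`, `SketchEngine`, and their methods are hypothetical stand-ins, and the only assumption taken from this PR is that a single `engineMutex` guards both the state check and the engine reference.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for an engine; not the Elasticsearch Engine class.
final class SketchEngine implements AutoCloseable {
    private volatile boolean closed;

    @Override
    public void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

// Hypothetical stand-in for a shard; not the Elasticsearch IndexShard class.
final class SketchShard {
    enum State { STARTED, CLOSED }

    private final Object engineMutex = new Object();
    private final AtomicReference<SketchEngine> currentEngine = new AtomicReference<>();
    private volatile State state = State.STARTED;

    /** Creates and publishes a new engine unless the shard is already closed. */
    SketchEngine createNewEngine() {
        synchronized (engineMutex) {
            // Check the state before constructing the engine,
            // not only before swapping the reference.
            if (state == State.CLOSED) {
                throw new IllegalStateException("shard is closed");
            }
            SketchEngine engine = new SketchEngine();
            SketchEngine previous = currentEngine.getAndSet(engine);
            if (previous != null) {
                previous.close();
            }
            return engine;
        }
    }

    /** Marks the shard closed and closes the current engine under the same mutex. */
    void close() {
        synchronized (engineMutex) {
            state = State.CLOSED;
            SketchEngine engine = currentEngine.getAndSet(null);
            if (engine != null) {
                engine.close();
            }
        }
    }

    SketchEngine engine() {
        return currentEngine.get();
    }
}

final class ClosedShardDemo {
    public static void main(String[] args) {
        SketchShard shard = new SketchShard();
        shard.createNewEngine();
        shard.close();
        try {
            shard.createNewEngine();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the state check and the engine construction happen under the same mutex as `close()`, a closed shard in this sketch can never end up constructing a new engine.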

elasticmachine · 2019-09-26T20:32:02Z

Pinging @elastic/es-distributed

henningandersen

Thanks for looking into this @dnhatn. I believe I know what you want to achieve, but I also think this might go back to blocking the cluster state applier thread (more details in comments)?

dnhatn · 2019-10-02T03:17:14Z

@henningandersen Thank you for your thoughtful review.

> I think this means that we now risk waiting for an engine warmer during close.

Great catch. I spent some time on this, but I did not come up with a solution unless we introduce a "start" method to Engine. Do you have any suggestions for this?

henningandersen · 2019-10-04T12:21:19Z

@dnhatn I also lean towards the "start" method that runs outside the mutex. The other options I can think of are:

- Move the locking to IndexService to prevent snapshot'ing from hitting this (and fix checkIndex to also avoid it).
- Make close async, i.e., it reports back through a listener when done.
- Keep close, but fix the caller to run in another thread and async-notify.

Unfortunately, using the generic thread pool for the last approach might lead to issues if it can be called during shutdown?

I think I prefer to add a method to do the warmup, seems simpler overall. The benefit of doing close async though is that it avoids waiting for the rest of the IO happening during InternalEngine constructor. Not sure if this has any benefits?

dnhatn · 2019-10-09T15:00:26Z

> I think I prefer to add a method to do the warmup, seems simpler overall.

+1. Thank you for your suggestion. I will work on this solution.

dnhatn · 2019-10-10T01:43:34Z

@henningandersen I worked on a change that moves the engine warming out of the constructor. It is pretty straightforward. However, it does not eliminate the blocking issue. Closing an engine acquires the writeLock, which can be blocked by an engine warming as it holds the readLock (via refresh). We can fix the refresh, but indexing and flushing can cause the same problem. I will reach out to discuss this with you.
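
The blocking described here can be reproduced in isolation with a plain `ReentrantReadWriteLock`. The sketch below is only an illustration under assumptions: `LockBlockingSketch` and `closeSucceedsWhileWarming` are hypothetical names, the "warmer" thread stands in for a refresh that holds the readLock, and the calling thread stands in for a close that needs the writeLock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class LockBlockingSketch {

    /**
     * Returns true if a "close" could take the write lock within timeoutMillis
     * while another thread (the "warmer") is still holding the read lock.
     */
    static boolean closeSucceedsWhileWarming(long timeoutMillis) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch warmingStarted = new CountDownLatch(1);
        CountDownLatch finishWarming = new CountDownLatch(1);

        // Stand-in for a refresh that warms a reader while holding the read lock.
        Thread warmer = new Thread(() -> {
            lock.readLock().lock();
            try {
                warmingStarted.countDown();
                finishWarming.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        warmer.start();

        try {
            warmingStarted.await();
            // Stand-in for close(): it needs the write lock, which cannot be
            // granted while any other thread holds the read lock.
            boolean acquired = lock.writeLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.writeLock().unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Let the warmer finish so the sketch always terminates.
            finishWarming.countDown();
            try {
                warmer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("close acquired writeLock while warming: "
            + closeSucceedsWhileWarming(200));
    }
}
```

In this sketch the timed `tryLock` gives up because the read lock is still held; a real close that calls `lock()` without a timeout would instead wait for as long as the warming takes.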

Today, we hold the engine readLock while refreshing. Although this choice simplifies the correctness reasoning, it can block IndexShard from closing if warming an external reader takes time. The current implementation of refresh does not need to hold readLock as ReferenceManager can handle errors correctly if the engine is closed midway. This PR is a prerequisite for solving #47186.

With this change, we won't warm up searchers until we externally refresh an engine. We explicitly refresh before allowing reading from a shard (i.e., move to post_recovery state) and during resetting. This guarantees that we have warmed up the engine before exposing the external searcher. Another prerequisite for #47186.
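
As a rough illustration of the second prerequisite, the sketch below defers warm-up from the constructor to the first external refresh. Everything here is hypothetical (`LazyWarmEngineSketch`, `refreshExternal`, `recoverAndExpose`) and only mirrors the ordering described above: construct without warming, then refresh explicitly before the searcher is exposed.

```java
// Hypothetical engine whose constructor does no warm-up work.
final class LazyWarmEngineSketch {
    private final Runnable warmer;
    private volatile boolean externalSearcherReady;

    LazyWarmEngineSketch(Runnable warmer) {
        // Only remember the warmer; nothing is warmed during construction.
        this.warmer = warmer;
    }

    /** External refresh: this is where warm-up runs in the sketch. */
    synchronized void refreshExternal() {
        warmer.run();
        externalSearcherReady = true;
    }

    boolean isExternalSearcherReady() {
        return externalSearcherReady;
    }
}

final class ShardRecoverySketch {
    /** Builds an engine and refreshes it explicitly before it is exposed for reads. */
    static LazyWarmEngineSketch recoverAndExpose(Runnable warmer) {
        LazyWarmEngineSketch engine = new LazyWarmEngineSketch(warmer);
        // Explicit refresh before moving to the (assumed) post_recovery step.
        engine.refreshExternal();
        return engine;
    }

    public static void main(String[] args) {
        LazyWarmEngineSketch engine = recoverAndExpose(() -> System.out.println("warming"));
        System.out.println("ready: " + engine.isExternalSearcherReady());
    }
}
```

The point of this ordering is that the potentially slow warm-up no longer runs inside the constructor, so it does not have to happen while the engine is being created.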

dnhatn · 2019-11-01T16:01:30Z

I have two tests in a56d9ff that can reliably reproduce the test failure reported in #47060. However, neither of them works with the latest change as we now hold engineMutex while closing a shard. I am not sure if I can come up with a useful test for this change. Any suggestions would be great.

@henningandersen This is ready again. Can you please take another look? Thank you.

henningandersen

LGTM. Thanks for the large effort on this, @dnhatn.

dnhatn · 2019-11-07T20:39:33Z

@henningandersen Thank you so much for your review and discussion on this.

Use only engineMutex to protect engine reference

18f3f52

dnhatn added >bug, :Distributed/Distributed, v8.0.0, v7.5.0, v7.4.1 labels Sep 26, 2019

dnhatn requested review from ywelsch and henningandersen September 26, 2019 20:32

dnhatn changed the title ~~Use only engineMutex to protect engine reference~~ Closed shard should never open new engine Sep 26, 2019

henningandersen reviewed Sep 30, 2019

tomcallahan added v7.4.2 and removed v7.4.1 labels Oct 22, 2019

dnhatn mentioned this pull request Oct 23, 2019

Refresh should not acquire readLock #48414

Merged

Merge branch 'master' into engine-mutex

3151b30

dnhatn mentioned this pull request Oct 28, 2019

Do not warm up searcher in engine constructor #48605

Merged

polyfractal added v7.4.3 and removed v7.4.2 labels Oct 31, 2019

Merge branch 'master' into engine-mutex

816b9ec

dnhatn added the v7.6.0 label Nov 1, 2019

dnhatn added 2 commits November 1, 2019 11:57

add tests

a56d9ff

Revert "add tests"

56739d5

This reverts commit a56d9ff.

dnhatn requested a review from henningandersen November 1, 2019 16:01

henningandersen approved these changes Nov 7, 2019

dnhatn added 2 commits November 7, 2019 14:35

Merge branch 'master' into engine-mutex

b1befbb

add assertions

54f052b

dnhatn merged commit d029e18 into elastic:master Nov 7, 2019

dnhatn deleted the engine-mutex branch November 7, 2019 20:39

dnhatn added backport pending and removed v7.4.3 labels Nov 7, 2019

dnhatn removed the backport pending label Nov 9, 2019

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
