-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ALLOC: Fail Stale Primary Alloc. Req. without Data #37226
ALLOC: Fail Stale Primary Alloc. Req. without Data #37226
Conversation
* Get indices shard store status before enqueuing the reallocation state update task to prevent tasks that would fail because a node does not hold a stale copy of the shard on a best effort basis * Closes elastic#37098
Pinging @elastic/es-distributed |
@@ -135,7 +134,7 @@ public void testFailedRecoveryOnAllocateStalePrimaryRequiresAnotherAllocateStale | |||
assertThat(shardRouting.unassignedInfo().getReason(), equalTo(UnassignedInfo.Reason.ALLOCATION_FAILED)); | |||
}); | |||
|
|||
try(Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) { | |||
try (Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for these two noisy cleanups that snuck into this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should back them out for the sake of future git annotate
users :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I left a small number of small comments.
I would rather avoid the // code comments
that you added - they don't really say anything that isn't already clear to me from the code, and I always worry about comments like this that fall out of sync later.
for (AllocationCommand command : request.getCommands().commands()) { | ||
if (command instanceof AllocateStalePrimaryAllocationCommand) { | ||
if (stalePrimaryAllocations == null) { | ||
stalePrimaryAllocations = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can afford to make this HashMap
eagerly and avoid the noise in the loop.
IllegalArgumentException.class, | ||
() -> client().admin().cluster().prepareReroute().add(new AllocateStalePrimaryAllocationCommand("test", 0, | ||
dataNodeWithNoShardCopy, true)).get()); | ||
assertThat(iae.getMessage(), equalTo("No data for shard [0] of index [test] found on node [" + dataNodeWithNoShardCopy + ']')); | ||
|
||
logger.info("--> wait until shard is failed and becomes unassigned again"); | ||
assertBusy(() -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we no longer want this to be an assertBusy()
- we expect that the previous reroute didn't do anything, so it should still be in this state from beforehand.
@@ -135,7 +134,7 @@ public void testFailedRecoveryOnAllocateStalePrimaryRequiresAnotherAllocateStale | |||
assertThat(shardRouting.unassignedInfo().getReason(), equalTo(UnassignedInfo.Reason.ALLOCATION_FAILED)); | |||
}); | |||
|
|||
try(Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) { | |||
try (Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should back them out for the sake of future git annotate
users :)
@DaveCTurner all points addressed, thanks for taking a look! |
Jenkins run gradle build tests 1 |
Jenkins run Gradle build tests 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation looks good, I asked for a couple more tests.
final String index = entry.getKey(); | ||
final ImmutableOpenIntMap<List<IndicesShardStoresResponse.StoreStatus>> indexStatus = status.get(index); | ||
if (indexStatus == null) { | ||
e = ExceptionsHelper.useOrSuppress(e, new IndexNotFoundException(index)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a test that hits this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually no ... my bad this one is dead code. The logic in the index shard store status request already checks the index exists.
I added an assertion for this now and a test that makes sure it's not tripped when the index doesn't exist :)
indexStatus.get(command.shardId()); | ||
if (shardStatus == null) { | ||
e = ExceptionsHelper.useOrSuppress(e, new IllegalArgumentException( | ||
"No data for shard [" + command.shardId() + "] of index [" + index + "] found on any node") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a test that hits this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in ea98277 :)
@DaveCTurner thanks for taking a look => tests added and implementation simplified a little as well in ea98277 :) |
@DaveCTurner ping :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@DaveCTurner thanks! |
* Get indices shard store status before enqueuing the reallocation state update task to prevent tasks that would fail because a node does not hold a stale copy of the shard on a best effort basis * Closes #37098
tasks that would fail because a node does not hold a stale copy of the shard on a best effort basis