Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] Fix NPE in ReplicaShardAllocator (#13993) #14385

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Refactoring Grok.validatePatternBank by using an iterative approach ([#14206](https://github.com/opensearch-project/OpenSearch/pull/14206))
- Update help output for _cat ([#14722](https://github.com/opensearch-project/OpenSearch/pull/14722))
- Fix bulk upsert ignores the default_pipeline and final_pipeline when auto-created index matches the index template ([#12891](https://github.com/opensearch-project/OpenSearch/pull/12891))
- Fix NPE in ReplicaShardAllocator ([#14385](https://github.com/opensearch-project/OpenSearch/pull/14385))

### Security

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -100,7 +100,10 @@ protected Runnable cancelExistingRecoveryForBetterMatch(
Metadata metadata = allocation.metadata();
RoutingNodes routingNodes = allocation.routingNodes();
ShardRouting primaryShard = allocation.routingNodes().activePrimary(shard.shardId());
assert primaryShard != null : "the replica shard can be allocated on at least one node, so there must be an active primary";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaniilRoman do we know why this assertion had to be removed. @dblock In general I am not of fan of removing assertions, they exist in the system for a reason, we shouldn't let our guards down.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bukhtawar the assertion has been replaced by an explicit null check, returning null instead of throwing an exception. Either way the code will not proceed past this point. There is a previous return null at line 98, so it seems null is already an expected return value.

Copy link
Collaborator

@Bukhtawar Bukhtawar Oct 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbwiddis If I am understanding this correct there are two things

  1. the method cancelExistingRecoveryForBetterMatch is expected to return a Runnable to denote some work to cancel existing recovery. A null as a method return type denotes no work is supposed to be done for cancelling.
  2. the invariant that says for a given shard(replica) it's corresponding primary should be non-null, since the code path is supposed to kick in for an initialising replica copy which can only be initialising once it's corresponding primary has been assigned see the decider

So there are two different things that exist and equating them IMO is incorrect. Removing the assertion essentially says at this point of execution a primary shard can be null when the system was designed ensuring that it shouldn't?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are two different things that exist and equating them IMO is incorrect. Removing the assertion essentially says at this point of execution a primary shard can be null when the system was designed ensuring that it shouldn't?

I haven't dug into the logic, but from the face of it, the previous situation is that the assertion causes an exception, and the exception isn't handled.

I've no objection to reverting this change if you think the exception should be caught and addressed elsewhere.

Copy link
Collaborator

@Bukhtawar Bukhtawar Oct 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't dug into the logic, but from the face of it, the previous situation is that the assertion causes an exception, and the exception isn't handled.

Unfortunately assertions don't run outside tests and the stack trace isn't an assertion tripping. Even if an assertion is failing the right thing to do is investigate why it is failing rather than burying the problem, by removing the assertion(invariant) that was added

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#10254 didn't "fix" or modify the behavior of this method, I had just refactored ...

So I see looking again... sorry. The first time I was just looking for a line number. But now I'm really confused. In order for an NPE on line 108 on null primaryShard it would have had to pass the assertions on 106 and 107.

assert primaryShard != null : "the replica shard can be allocated on at least one node, so there must be an active primary";
assert primaryShard.currentNodeId() != null;
final DiscoveryNode primaryNode = allocation.nodes().get(primaryShard.currentNodeId());

Is this just undefined/unpredictable behavior during JVM shutdown (node drop)?

In any case, seems a revert of this PR is the right action at this point.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bukhtawar that's not the case with the assert keyword as it's used here. It throws an AssertionError.
As an Error it's not normally caught by exception catch blocks unless you're specifically looking for it, and by the traditional interpretation of Error means that it's an unrecoverable situation and the entire JVM should be shut down.

Sorry I don't think this is the correct understanding of how assert works. AssertionErrors are only configured for tests using a -ea JVM flag and in this case I don't think an AssertionError was triggered. Please feel free to read up more on this here

In any case, seems a revert of this PR is the right action at this point

Thanks for your understanding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, and thanks for educating me.

So wouldn't an appropriate fix for the NPE then to both put the assert back and null check to avoid the exception? That seems odd but I've never really used assertions like this so at this point I really don't know the best approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So wouldn't an appropriate fix for the NPE then to both put the assert back and null check to avoid the exception?

@dbwiddis The appropriate fix is to figure out why the invariant is being violated and fix that bug. i.e. if no primary exists, then there can be no recovery and this method should not be called. The previous behavior of having the code fail with an NPE is preferable to ignoring a situation that is not intended to be possible.

Slightly off topic, but in general I really dislike using the assert keyword. Instead of assert primaryShard != null : "message" I almost always prefer something like Objects.requireNonNull(primaryShard, "message") so that it deterministically fails in all situations. Only if the code is super performance critical and if the assertion check is computationally intensive do I think the assert keyword makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The appropriate fix is to figure out why the invariant is being violated and fix that bug.

@andrross In general I agree.

In reality, I read the issue reporting this bug and it indicates it's happening during a node drop. Which means the JVM is shutting down for <insert reason here>. Which means undefined/unpredictable behavior.

TLDR: I'm not sure there is a bug, and the symptom may be a result of a JVM being shut down causing a node drop as a result of some totally unrelated thing.

There's not enough information in the issue to infer more.

if (primaryShard == null) {
logger.trace("{}: no active primary shard found or allocated, letting actual allocation figure it out", shard);
return null;
}
DaniilRoman marked this conversation as resolved.
Show resolved Hide resolved
assert primaryShard.currentNodeId() != null;
final DiscoveryNode primaryNode = allocation.nodes().get(primaryShard.currentNodeId());

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -644,6 +644,25 @@ public void testDoNotCancelForBrokenNode() {
assertThat(allocation.routingNodes().shardsWithState(ShardRoutingState.UNASSIGNED), empty());
}

public void testDoNotCancelForInactivePrimaryNode() {
RoutingAllocation allocation = oneInactivePrimaryOnNode1And1ReplicaRecovering(yesAllocationDeciders(), null);
testBatchAllocator.addData(
node1,
null,
"MATCH",
null,
new StoreFileMetadata("file1", 10, "MATCH_CHECKSUM", MIN_SUPPORTED_LUCENE_VERSION)
).addData(node2, randomSyncId(), null, new StoreFileMetadata("file1", 10, "MATCH_CHECKSUM", MIN_SUPPORTED_LUCENE_VERSION));

testBatchAllocator.processExistingRecoveries(
allocation,
Collections.singletonList(new ArrayList<>(allocation.routingNodes().shardsWithState(ShardRoutingState.INITIALIZING)))
);

assertThat(allocation.routingNodesChanged(), equalTo(false));
assertThat(allocation.routingNodes().shardsWithState(ShardRoutingState.UNASSIGNED), empty());
}

public void testAllocateUnassignedBatchThrottlingAllocationDeciderIsHonoured() throws InterruptedException {
ClusterSettings clusterSettings = new ClusterSettings(Settings.EMPTY, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS);
AllocationDeciders allocationDeciders = randomAllocationDeciders(
Expand Down Expand Up @@ -872,6 +891,41 @@ private RoutingAllocation onePrimaryOnNode1And1ReplicaRecovering(AllocationDecid
);
}

private RoutingAllocation oneInactivePrimaryOnNode1And1ReplicaRecovering(AllocationDeciders deciders, UnassignedInfo unassignedInfo) {
ShardRouting primaryShard = TestShardRouting.newShardRouting(shardId, node1.getId(), true, ShardRoutingState.INITIALIZING);
RoutingTable routingTable = RoutingTable.builder()
.add(
IndexRoutingTable.builder(shardId.getIndex())
.addIndexShard(
new IndexShardRoutingTable.Builder(shardId).addShard(primaryShard)
.addShard(
TestShardRouting.newShardRouting(
shardId,
node2.getId(),
null,
false,
ShardRoutingState.INITIALIZING,
unassignedInfo
)
)
.build()
)
)
.build();
ClusterState state = ClusterState.builder(org.opensearch.cluster.ClusterName.CLUSTER_NAME_SETTING.getDefault(Settings.EMPTY))
.routingTable(routingTable)
.nodes(DiscoveryNodes.builder().add(node1).add(node2))
.build();
return new RoutingAllocation(
deciders,
new RoutingNodes(state, false),
state,
ClusterInfo.EMPTY,
SnapshotShardSizeInfo.EMPTY,
System.nanoTime()
);
}

private RoutingAllocation onePrimaryOnNode1And1ReplicaRecovering(AllocationDeciders deciders) {
return onePrimaryOnNode1And1ReplicaRecovering(deciders, new UnassignedInfo(UnassignedInfo.Reason.CLUSTER_RECOVERED, null));
}
Expand Down
Loading