
Do not cancel ongoing recovery for noop copy on broken node #48265

Merged
dnhatn merged 14 commits into elastic:master from dnhatn:ignore-failed-node on Nov 1, 2019

Conversation

dnhatn
Member

@dnhatn dnhatn commented Oct 19, 2019

Today the replica allocator can repeatedly cancel an ongoing recovery for a copy on a broken node if that copy can perform a noop recovery. This loop can happen endlessly (see testDoNotInfinitelyWaitForMapping). We should detect and avoid canceling in this situation.

Closes #47974

@dnhatn dnhatn added >enhancement :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.5.0 v7.6.0 labels Oct 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Allocation)

Contributor

@DaveCTurner DaveCTurner left a comment

My concern with using a simple boolean is that it will suppress no-op recoveries to other nodes that might succeed. I think we should track which node failed the no-op allocation.

What would happen if instead we kept track of all the nodes on which this shard failed during initialization (whether no-op or otherwise) and ignored them all in the ReplicaShardAllocator? I am thinking particularly of #18417 - with a list of past failures we could also prefer to avoid those nodes in the BalancedShardsAllocator.
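
To make the trade-off concrete, here is a minimal sketch (hypothetical names, not the code in this PR) of the difference between a single boolean and a per-node set: the boolean suppresses no-op recoveries everywhere once any node has failed, while the set only skips the nodes that actually failed this shard.

```java
import java.util.Set;

final class NoopRecoverySketch {
    // Boolean approach: once any node has failed this shard, no-op recovery is
    // suppressed for every node, including healthy ones with a matching copy.
    static boolean allowNoopRecoveryWithBoolean(boolean anyNodeFailed) {
        return anyNodeFailed == false;
    }

    // Per-node approach: only nodes that have actually failed this shard are skipped,
    // so a healthy node holding a matching copy can still take a no-op recovery.
    static boolean allowNoopRecoveryWithFailedNodes(String nodeId, Set<String> failedNodeIds) {
        return failedNodeIds.contains(nodeId) == false;
    }
}
```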

@dnhatn
Member Author

dnhatn commented Oct 21, 2019

@DaveCTurner Thanks for looking. We would make better decisions with a list of failed nodes, but we need to make sure that the list is bounded. I used a boolean to avoid adding more load to the cluster state.

@DaveCTurner
Contributor

DaveCTurner commented Oct 21, 2019

I think a set of nodes (or node IDs) would not be too much to add to the cluster state, and it would naturally be bounded: we would only add to it on an ALLOCATION_FAILED, which only happens a few times thanks to the MaxRetryAllocationDecider. The set should of course be cleared in resetFailedAllocationCounter() (as a member of UnassignedInfo it also naturally goes away when a shard moves to STARTED).
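
As a rough, self-contained sketch of that lifecycle (simplified, hypothetical stand-in for the UnassignedInfo bookkeeping, not the actual change): entries are only added on an ALLOCATION_FAILED, the MaxRetryAllocationDecider stops retries after index.allocation.max_retries attempts (5 by default), and the set is dropped when the failed-allocation counter is reset.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class UnassignedInfoSketch {
    // Mirrors the index.allocation.max_retries setting checked by MaxRetryAllocationDecider
    // (default 5), which is what keeps the set bounded in practice.
    private static final int MAX_RETRIES = 5;

    private int failedAllocations = 0;
    private final Set<String> failedNodeIds = new LinkedHashSet<>();

    /** Record an ALLOCATION_FAILED on the given node; this is the only way the set grows. */
    void onAllocationFailed(String nodeId) {
        failedAllocations++;
        failedNodeIds.add(nodeId);
    }

    /** Once retries are exhausted the decider stops assigning, so no further entries are added. */
    boolean canRetry() {
        return failedAllocations < MAX_RETRIES;
    }

    /** Mirrors resetFailedAllocationCounter(): clearing the counter also clears the node set. */
    void resetFailedAllocationCounter() {
        failedAllocations = 0;
        failedNodeIds.clear();
    }
}
```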

@dnhatn
Member Author

dnhatn commented Oct 21, 2019

@DaveCTurner Good point. I will apply your feedback. Thank you.

@dnhatn
Member Author

dnhatn commented Oct 21, 2019

@DaveCTurner It's ready again. Can you please take another look? Thank you.

Contributor

@DaveCTurner DaveCTurner left a comment

Sorry for the delayed review @dnhatn, I thought I submitted this a few days ago but apparently not.

Contributor

@henningandersen henningandersen left a comment

Thanks for working on this @dnhatn, I left a few comments to consider.

Contributor

@DaveCTurner DaveCTurner left a comment

I left a couple of responses to threads in an earlier review and duplicated them here.

@dnhatn
Member Author

dnhatn commented Oct 30, 2019

@DaveCTurner I've addressed your feedback. Would you mind taking another look? Thank you.

@gwbrown
Contributor

gwbrown commented Oct 30, 2019

@dnhatn The test failure in ci/2 here is a known issue (#48531), feel free to ignore it (it's unmuted to collect logs for troubleshooting purposes).

@dnhatn
Member Author

dnhatn commented Oct 31, 2019

Thanks @gwbrown.

@elasticmachine run elasticsearch-ci/2

Contributor

@DaveCTurner DaveCTurner left a comment

Thanks @dnhatn, this LGTM.

@dnhatn
Member Author

dnhatn commented Nov 1, 2019

@DaveCTurner @henningandersen @original-brownbear Thanks for reviewing :).

@dnhatn dnhatn merged commit 36ee74f into elastic:master Nov 1, 2019
@dnhatn dnhatn deleted the ignore-failed-node branch November 1, 2019 13:23
dnhatn added a commit that referenced this pull request Nov 1, 2019
dnhatn added a commit that referenced this pull request Nov 9, 2019
This change fixes a poisonous situation where an ongoing recovery was
canceled because a better copy was found on a node that the cluster had
previously tried allocating the shard to but failed. The solution is to
keep track of the set of nodes on which allocation of the shard has failed,
so that we can avoid canceling the current recovery for a copy on those failed nodes.

Closes #47974
dnhatn added a commit that referenced this pull request Nov 9, 2019
dnhatn added a commit that referenced this pull request Nov 9, 2019
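
The commit message above summarizes the final behaviour. A rough, simplified illustration of the cancellation check (hypothetical names, not the actual ReplicaShardAllocator code) might look like this:

```java
import java.util.Set;

final class CancelRecoverySketch {
    /**
     * Only cancel an ongoing recovery in favour of another copy when that copy can do a
     * faster (e.g. no-op) recovery AND it is not on a node this shard has already failed on.
     */
    static boolean shouldCancelOngoingRecovery(String candidateNodeId,
                                               boolean candidateHasMatchingCopy,
                                               Set<String> failedNodeIds) {
        if (candidateHasMatchingCopy == false) {
            return false; // no better copy to switch to
        }
        // Cancelling for a copy on a previously failed node is what produced the
        // endless cancel/retry loop described in #47974.
        return failedNodeIds.contains(candidateNodeId) == false;
    }
}
```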
Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement v7.5.0 v7.6.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

testDoNotInfinitelyWaitForMapping fails
7 participants