
Constraints to de-prioritize nodes from becoming shard allocation targets #43350

Closed

vigyasharma opened this issue Jun 18, 2019 · 5 comments

Labels: :Distributed/Allocation, feedback_needed

vigyasharma (Contributor) commented Jun 18, 2019

The Problem

In clusters where each node already holds a reasonable number of shards, when a node is replaced (or a small number of nodes is added), all shards of newly created indexes get allocated on the new, empty node(s). This puts a lot of stress on the single (or few) new node(s), deteriorating overall cluster performance.

This happens because the index-balance factor in the weight function is unable to offset the shard-balance factor when the other nodes are already full of shards. As a result, the new node always has the minimum weight and gets picked as the allocation target, until it holds approximately the mean number of shards per node for the cluster.

Further, index-balance only kicks in once the node is relatively full compared to the other nodes, at which point it moves out the recently allocated shards. So for the same shards, which are newly created and usually receiving indexing traffic, we end up doing the allocation work twice.
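For context, the balancer's node weight is roughly a weighted sum of a shard-balance term and an index-balance term. The snippet below is a simplified sketch, not the exact Elasticsearch source; theta0 and theta1 correspond to the cluster.routing.allocation.balance.shard and cluster.routing.allocation.balance.index settings.

// Simplified sketch of the balancer weight. A brand-new empty node makes both
// terms strongly negative, so it keeps winning the minimum-weight comparison.
static float nodeWeight(float theta0, float theta1,
                        int nodeShards, float avgShardsPerNode,
                        int nodeIndexShards, float avgIndexShardsPerNode) {
    float shardBalance = theta0 * (nodeShards - avgShardsPerNode);
    float indexBalance = theta1 * (nodeIndexShards - avgIndexShardsPerNode);
    return shardBalance + indexBalance;
}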

Current Workaround

A workaround is to set the total-shards-per-node index/cluster setting, which limits the number of shards of an index on a single node.

This, however, is a hard limit enforced by the allocation deciders. If it is breached on all nodes, the shards remain unassigned, causing yellow/red clusters. Configuring this setting requires careful calculation around the number of nodes, and it must be revised when the cluster is scaled down.

Proposal

We propose an allocation constraint mechanism that de-prioritizes nodes from being picked for allocation if they breach certain constraints. Whenever an allocation constraint is breached for a shard (or index) on a node, we add a high positive constant to the node's weight. The increased weight makes the node less preferable for allocating that shard. Unlike deciders, however, this is not a hard filter: if no other nodes are eligible to accept shards (say, due to deciders like disk watermarks), the shard can still be allocated on a node with breached constraints.
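A minimal sketch of what such a constraint hook could look like; AllocationConstraint, Balancer, and Node are hypothetical names standing in for the balancer's internal models, not existing Elasticsearch APIs.

// Hypothetical interface: each rule answers a yes/no question for a (node, index)
// pair. Breaching it never blocks allocation outright, it only adds a large
// penalty to that node's weight.
interface AllocationConstraint {
    // true if allocating one more shard of `index` on `node` would breach the rule
    boolean isBreached(Balancer balancer, Node node, String index);
}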

[Diagram: constraint-based weights as a step function]

Index Shards Per Node Constraint

This constraint controls the number of shards of an index on a single node. indexShardsPerNodeConstraint is breached if the number of shards of the index allocated on a node exceeds the average number of shards per node for that index.

// Shards of this index the node would hold if the shard under consideration lands on it.
long expIndexShardsOnNode = node.numShards(index) + numAdditionalShards;
// Average shards of this index per node, rounded up.
long allowedShardsPerNode = Math.round(Math.ceil(balancer.avgShardsPerNode(index)));
boolean shardPerNodeLimitBreached = (expIndexShardsOnNode - allowedShardsPerNode) > 0;

When indexShardsPerNodeConstraint is breached, shards of the newly created index get assigned to other nodes, preventing indexing hot spots. After unassigned-shard allocation, rebalancing fills up the node with shards of other indexes, without moving out the shards that were already allocated. And since the constraint does not prevent allocation, breached constraints never leave shards unassigned.

The allocator flow now becomes:

deciders filter out ineligible nodes -> constraints de-prioritize certain nodes by increasing their weight -> node selection as the allocation target.
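Sketched as code, the node-selection step could look like the following; decidersAllow, baseWeight, and CONSTRAINT_PENALTY are assumed helpers and values for illustration, not existing APIs.

import java.util.List;

// Hypothetical flow: deciders remain a hard filter, constraints only re-weight.
static final float CONSTRAINT_PENALTY = 1_000_000f;

static Node pickAllocationTarget(Balancer balancer, List<Node> nodes, String index,
                                 List<AllocationConstraint> constraints) {
    Node best = null;
    float bestWeight = Float.MAX_VALUE;
    for (Node node : nodes) {
        if (!decidersAllow(node, index)) {      // hard filter: ineligible nodes are skipped
            continue;
        }
        long breached = constraints.stream()
                .filter(c -> c.isBreached(balancer, node, index))
                .count();
        float weight = baseWeight(balancer, node, index) + breached * CONSTRAINT_PENALTY;
        if (weight < bestWeight) {              // lowest weight wins
            bestWeight = weight;
            best = node;
        }
    }
    // If every eligible node breaches constraints, the least-breaching one is still chosen.
    return best;
}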

Extension

This framework can be extended with other rules by modeling them as boolean constraints. One possible example is primary shard density on a node, as required by issue #41543. A constraint requiring the number of primaries on a node to be <= the average number of primaries per node could prevent scenarios where all primaries land on a few nodes and all replicas on the others.
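As an illustration, such a rule could be expressed against the hypothetical interface sketched above; numPrimaries(...) and avgPrimariesPerNode(...) are assumed helpers on the balancer's node model, not existing Elasticsearch methods.

// Hypothetical constraint for the scenario in #41543: keep primaries per node near the mean.
class PrimaryShardsPerNodeConstraint implements AllocationConstraint {
    @Override
    public boolean isBreached(Balancer balancer, Node node, String index) {
        long expPrimariesOnNode = node.numPrimaries(index) + 1;
        long allowedPrimariesPerNode = (long) Math.ceil(balancer.avgPrimariesPerNode(index));
        return expPrimariesOnNode > allowedPrimariesPerNode;
    }
}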

Comparing Multiple Constraints

For every constraint breached, we add a high positive constant to the weight function. This is a step-function approach: all nodes on a lower step (lower weight) are considered more eligible than those on a higher step. The high positive constant ensures that a node's weight cannot climb to the next step unless it breaches another constraint.

Since the constant is added for every breached constraint, i.e. c * HIGH_CONSTANT, nodes with one constraint breached are preferred over nodes with two constraints breached, and so on. Among nodes with the same number of breached constraints, the comparison simply falls back to the first part of the weight function, which is based on shard count.
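A quick numeric illustration of the step-function ordering, with made-up base weights and the hypothetical penalty constant from the sketch above:

// Base weights differ only by a few shards, so the penalty term dominates.
float nodeA = 3f;                    // 0 breaches ->         3.0 (preferred)
float nodeB = 1f + 1 * 1_000_000f;   // 1 breach   -> 1_000_001.0
float nodeC = 0f + 2 * 1_000_000f;   // 2 breaches -> 2_000_000.0
// nodeA < nodeB < nodeC: fewer breaches always wins; ties between nodes on the
// same step fall back to the shard-count part of the weight.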

We could potentially keep different weights for different constraints with some sort of ranking among them. But this can quickly become a maintenance and extension nightmare. Adding new constraints will require going through every other constraint weight and deciding on the right weight and place in priority.

Next Steps

If this idea makes sense and seems like a useful add to the Elasticsearch community, we can raise a PR for these changes.

This change is targeted to solve problems listed in #17213, #37638, #41543, #12279, #29437

dnhatn added the :Distributed/Allocation label on Jun 18, 2019
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

DaveCTurner (Contributor) commented

Thanks for the suggestion @vigyasharma. You've clearly put some effort into this idea. We discussed this recently as a team and drew two conclusions:

  1. We do not want to weaken the hard *.routing.allocation.total_shards_per_node limits. These limits are for users who know that each node would fail with more than a certain number of shards, so ignoring these limits would result in a cascade of failures.

  2. You have not provided enough surrounding context to judge its other merits.

To expand on this second point, it sounds like you might be misusing the total_shards_per_node settings, and the problem you describe might be addressed by setting them appropriately. If you are setting total_shards_per_node: N but in fact your nodes can tolerate more than N shards then you should be setting this limit higher. Also if you are heavily indexing into an index (necessitating the total_shards_per_node setting) but then stopping the heavy indexing later then you should also be relaxing or even removing the total_shards_per_node setting on this index to reflect the change in resource constraints. It is normal for the resource constraints on an index to change over time as its access pattern changes, and adjusting settings like total_shards_per_node to reflect this will allow the balancer to do its job properly. This would be the case if, for instance, you are using time-based indices to collect logs.

Would you please describe your cluster and its configuration in significantly more detail? Also, would you prepare a failing test case (e.g. an ESAllocationTestCase) that reflects the configuration of your cluster and that your proposal would fix?

vigyasharma (Contributor, Author) commented

Thanks @DaveCTurner for discussing this with the team, and apologies for the delay in response (I was on vacation).

I don't think this should weaken the total_shards_per_node limits. That decider remains untouched for advanced customers to use.

Let me provide more context on where this change helps.

The problem with *.routing.allocation.total_shards_per_node is that you need full context on all indexes and the overall cluster workload to reason about shards-per-node limits. Several large organizations have different teams owning different sets of indices in a shared cluster. The cluster itself may be managed by a central admin team, while index creation and settings are owned by each index-owner team.

In such scenarios, the overall cluster workload may drop because one team scaled down its traffic while the other teams remain unaware of the change. This makes revising the setting at the index level non-trivial and a cluster-management overhead.

Most users add this setting at the index level to avoid collocating shards on one node during node replacements and new node additions (i.e. the workaround to the problem addressed in this issue). If the shard balancer can handle this implicitly, it removes the cognitive and management overhead of setting and revising these limits.

Admins can then scale down their fleet when the net workload is reduced, without impacting individual indexes.

e.g., say an index has 10 shards and the cluster has 6 nodes:

To be resilient to 2 node failures with max index spread: 
    index.routing.allocation.total_shards_per_node = 3

But if the admin reduces the cluster to 4 nodes, the setting must be revised to 5
Similarly, the cluster can't be scaled down to 3 nodes without making it yellow.
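The arithmetic behind those numbers, as a sketch (the method name is illustrative):

// With S total shards of the index, N nodes, and F node failures to tolerate,
// the remaining N - F nodes must be able to hold all S shards.
static int requiredTotalShardsPerNode(int totalShards, int nodes, int failuresTolerated) {
    return (int) Math.ceil((double) totalShards / (nodes - failuresTolerated));
}

// requiredTotalShardsPerNode(10, 6, 2) == 3   -> the setting of 3 above
// requiredTotalShardsPerNode(10, 4, 2) == 5   -> must be revised to 5 after scaling to 4 nodes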

DaveCTurner (Contributor) commented

Welcome back @vigyasharma, hope you had a good break.

The concerns you describe are broadly addressed by the plan for finer-grained constraints laid out in #17213 (comment) and (as I commented there in reply to you a few months back) predicting the constraints dynamically is out of scope for now. Let's walk before we start to try and run, and let's try and keep this conversation focussed on this particular issue.

Just a reminder that we're waiting for more details about your proposal:

Would you please describe your cluster and its configuration in significantly more detail? Also, would you prepare a failing test case (e.g. an ESAllocationTestCase) that reflects the configuration of your cluster and that your proposal would fix?

DaveCTurner (Contributor) commented

Closing this as no further details were provided.
