
Shard Allocation Activity Balancing #12279

Closed
pickypg opened this issue Jul 15, 2015 · 8 comments
Labels
discuss :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes)

Comments

@pickypg
Member

pickypg commented Jul 15, 2015

Now that we have a sense of recovery priority (#11787), it may make sense to use that priority for allocation weight when all things are equal.

Problem

The use case I have in mind is a full cluster restart: you can end up with lopsided primaries even with allocation disabled.

Recovery Hotspot

The normal distribution can be observed along the top (minus the lopsided primary distribution). This was on 1.6.0 following a full cluster restart with allocation disabled; allocation was only re-enabled after every node was up. This particular recovery led to three very real problems:

  1. All of the indexing was hot spotting both nodes.
  2. All of the replicas were appearing on both nodes.
  3. All merging was effectively limited to both nodes.

Hotspotted Machines

This was without searching, but, being today's index, it would also receive the brunt of the search load. Obviously that would have a very negative impact on these two nodes, while the rest of the cluster would simply shrug it off.

Solution

The concept of primary balancing has been discussed (and removed), but this type of hot spotting is clearly a non-trivial problem. It's easy to spot it, but it's not easy to prevent it.

Given that the cluster maintains index writers for shards that are sized based on their activity level, and that we have sync_ids, segment counts, and index readability, we should be able to come up with some estimate of activity to balance against. New shards should be assumed to be as active as the most active shards.

Ideally we can come up with a way to "guess" activity with that. With or without it, we could either use the index.priority or some index.activity as a separate mechanism to allow the user to control it. A readonly index could still get the brunt of the requests. The nice thing about a separate setting is that it could be curated over time separate from priority.

From there, we need to modify the allocator equation to weight significantly based on activity, to avoid the picture above in normal circumstances. If we go purely by a number, then we can only do this for advanced use cases, because all normal indices would have an equal, defaulted activity value.
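To make this concrete, here is a minimal sketch of what such a modified weight function might look like. It is loosely modeled on the balanced allocator's existing per-node weight (a shard-count term plus a per-index term); the `activity` term, its inputs, and all factor values are assumptions from this proposal, not an existing Elasticsearch API.

```python
# Sketch: a balanced-allocator-style weight function extended with a
# hypothetical activity term. Nothing here is real Elasticsearch code;
# the activity factor is the proposal under discussion.

def node_weight(node_shards, avg_shards,
                node_index_shards, avg_index_shards,
                node_activity, avg_activity,
                shard_balance=0.45, index_balance=0.55, activity_balance=2.0):
    total = shard_balance + index_balance + activity_balance
    theta_shard = shard_balance / total
    theta_index = index_balance / total
    theta_activity = activity_balance / total
    return (theta_shard * (node_shards - avg_shards)
            + theta_index * (node_index_shards - avg_index_shards)
            + theta_activity * (node_activity - avg_activity))

# A node hosting more than its share of active shards gets a higher
# weight, so the allocator would prefer to place new shards elsewhere.
w_hot = node_weight(10, 10, 3, 2, 8.0, 4.0)
w_cold = node_weight(10, 10, 1, 2, 2.0, 4.0)
```

With a large `activity_balance` factor, activity dominates the weight even when raw shard counts are equal, which is exactly the failure mode in the screenshot above.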

Workaround

Manually rerouting shards can help to prevent this in current situations when you unluckily come across it. It's only really a problem once those shards become large and therefore movement is expensive.
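For reference, the manual workaround uses the cluster reroute API's `move` command. A sketch of the request body (index and node names are placeholders):

```python
import json

# Build a cluster reroute request that moves one shard off a hot node.
# The "move" command shape matches the /_cluster/reroute API; the index
# and node names below are illustrative placeholders.
def move_shard_command(index, shard, from_node, to_node):
    return {
        "commands": [
            {"move": {"index": index, "shard": shard,
                      "from_node": from_node, "to_node": to_node}}
        ]
    }

body = json.dumps(move_shard_command("logs-2015.07.15", 0,
                                     "hot-node-1", "cold-node-3"))
# POST this body to /_cluster/reroute to trigger the relocation.
```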

@clintongormley

> The concept of primary balancing has been discussed (and removed), but this type of hot spotting is clearly a non-trivial problem. It's easy to spot it, but it's not easy to prevent it.

The idea that having 5 primaries on one node and 5 replicas on the other node is a bad situation is (for the most part) incorrect. Primaries and replicas do the same amount of work, both at index time and at search time. I say "for the most part" because there are two exceptions:

  • updates have a get-modify-put process, and the get and modify steps happen only on the primary today. Heavy scripts can make primaries a hotspot. There is an open issue for making the get & modify steps distributed as well, which should solve this issue (Distributed update API #8369)
  • with shadow replicas, only the primary writes, while the shadow replicas just handle search - clearly this can result in primaries being a hotspot

That said, having most of the shards of your most active index sitting on only two nodes can lead to load imbalance. The cluster.routing.allocation.balance.index setting tries to force shards from the same node apart, but of course this is only one of the factors taken into account (see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-cluster.html).
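That setting is dynamic, so it can be raised without a restart. A sketch of the settings-update body (the value 0.75 is illustrative, not a recommendation):

```python
import json

# Raising cluster.routing.allocation.balance.index makes the allocator
# weigh per-index spread more heavily relative to the total shard count.
# The value here is purely illustrative.
settings_update = {
    "transient": {
        "cluster.routing.allocation.balance.index": 0.75
    }
}
body = json.dumps(settings_update)
# PUT this body to /_cluster/settings to apply it without a restart.
```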

Also, the cost of the hotspot needs to be weighed against the cost of copying GBs of data around your cluster in a shard shuffle. Finding the sweet spot for these values is HARD, and so we've left it up to the operator to make these decisions.

> Ideally we can come up with a way to "guess" activity with that. With or without it, we could either use the index.priority or some index.activity as a separate mechanism to allow the user to control it. A readonly index could still get the brunt of the requests. The nice thing about a separate setting is that it could be curated over time separate from priority.

I agree that something more automated would be of value, and perhaps the index.priority setting is a nice way of doing that, eg the higher the priority, the more impact the cluster.routing.allocation.balance.index setting has.

I'd like us to make improvement here, but it is a hard problem. Looking forward to hearing what ideas others have.

@pickypg
Member Author

pickypg commented Aug 17, 2015

I definitely agree that, in general, the difference between a replica and a primary shard is practically none (minus the fact that the primary must also wait for a response). In the above screenshot, there are zero updates taking place and there are no shadow replicas in play.

> Also, the cost of the hotspot needs to be weighed against the cost of copying GBs of data around your cluster in a shard shuffle. Finding the sweet spot for these values is HARD, and so we've left it up to the operator to make these decisions.

I completely agree, especially about it being a hard problem. Unfortunately, in the above picture, it was doing the recovery onto the same two nodes with all of the indexing activity, thus doubling the load on them rather than spreading it around the cluster.

Annoyingly, the cluster figures out that it should rebalance after it eventually recovers the replicas, but the damage is generally done at that point (it also adds even more load with rebalancing).

Having spent some time away from this issue, I think that we should look at index.activity with some defaulted value, then use the standard priority details (index.priority > index.creation_date > index.name (descending)) to break ties between equivalent activity values.
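The proposed ordering can be sketched as a sort key: a hypothetical `index.activity` value first, then the existing recovery-priority tie-breakers, all descending. The `activity` field is the assumption being proposed here; the rest mirrors the ordering from #11787.

```python
# Sketch of the proposed ordering: balance primarily on a hypothetical
# index.activity value, breaking ties with the existing recovery-priority
# chain (index.priority, then index.creation_date, then index.name,
# all descending).

def allocation_sort_key(index):
    return (-index["activity"],
            -index["priority"],
            -index["creation_date"],
            # invert characters so the name also sorts descending
            [-ord(c) for c in index["name"]])

indices = [
    {"name": "logs-2015.07.14", "activity": 5, "priority": 1, "creation_date": 100},
    {"name": "logs-2015.07.15", "activity": 5, "priority": 1, "creation_date": 200},
    {"name": "archive",         "activity": 0, "priority": 0, "creation_date": 50},
]
ordered = sorted(indices, key=allocation_sort_key)
# Most active indices first; among equals, the newest index wins.
```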

This allows for new indices to be created with high index.activity values that can be curated as time marches forward.

@vvcephei

vvcephei commented Oct 7, 2015

This looks like the most relevant discussion to ask my related question on.

I totally agree with @clintongormley that building a bunch of smart heuristics into the balanced allocator is high risk. Like you said, it's hard to get right in general, and it's expensive to shuffle data around the cluster.

I think this is a good opportunity for exploring the issue with plugins. Assuming I understand the code, anyone can provide their own Allocator. The problem is that currently, Allocators are only invoked when the cluster changes (routing table changes, cluster settings updates, etc.).

Allocating based on dynamic properties like index activity, search activity, or whatever else requires the allocator to be invoked on an interval. I can simulate this by updating the settings periodically, but I wonder if anyone has better advice on how to get the cluster to rebalance periodically.
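The periodic-trigger workaround John describes could be sketched as a simple loop. `touch_cluster` below is a placeholder, not a real client call; it stands in for whatever request re-runs the allocators (for example, a settings update or a POST to /_cluster/reroute).

```python
import time

# Sketch of the workaround: periodically nudge the cluster so the
# allocators run again. touch_cluster is a placeholder callable, not a
# real Elasticsearch client API.
def rebalance_loop(touch_cluster, interval_seconds, iterations):
    for _ in range(iterations):
        touch_cluster()
        time.sleep(interval_seconds)

calls = []
rebalance_loop(lambda: calls.append("reroute"),
               interval_seconds=0, iterations=3)
# calls now holds one entry per triggered reroute.
```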

Thanks,
-John

@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 14, 2018
@DaveCTurner
Contributor

This is an interesting idea, but since its opening we have not seen enough feedback that it is something we should work on. We prefer to close this issue as a clear indication that we are not going to work on this at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed please feel free to leave feedback (including +1s).

@tdoman

tdoman commented Apr 10, 2018

@DaveCTurner +1
At the very least, we need some way of distributing the primaries. I can simulate a redistribution of primaries in a 3-node cluster with 2 replicas by setting the replica count to 1, waiting for the redistribution, and then setting it back to 2. Giving the deployer the option to make that decision manually would be very helpful, especially for the scenarios (i.e. updates) where we know hot spots can be produced. We could get a bit of a hybrid between search and index optimization.
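The replica-bounce trick above amounts to two index settings updates. A sketch of the request bodies (the index name is whatever index is skewed; each body is PUT to `/<index>/_settings`):

```python
import json

# Sketch of the replica-bounce workaround: drop the replica count to 1,
# wait for the cluster to settle (green), then restore it to 2.
def replica_count_body(count):
    return json.dumps({"index": {"number_of_replicas": count}})

shrink = replica_count_body(1)   # step 1: drop to one replica
restore = replica_count_body(2)  # step 2, after green: back to two
```

Note this is lossy in the interim: while the replica count is reduced, the index has less redundancy, which is part of why a first-class option would be preferable.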

@Bukhtawar
Contributor

cluster.routing.allocation.type is static. If there were multiple ClusterPlugin implementations providing different allocation strategies, and an option to switch allocators dynamically, would we see a problem? The idea is to make cluster.routing.allocation.type dynamic, to allow customers to rescue their clusters by selecting a different allocation strategy if they see a skew.

@DaveCTurner
Contributor

We are discussing removing ClusterPlugin in #39464. Switching dynamically to a wholly different allocator sounds pretty drastic, and liable to trigger a lot of shard relocations. What would this achieve that cannot be achieved by adjusting the dynamic settings of one or more of the allocation deciders?
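The alternative suggested here, tuning the existing deciders' dynamic settings, looks like any other cluster settings update. Both setting names below are real dynamic settings; the values are only illustrative:

```python
import json

# Sketch: adjust dynamic allocation settings rather than swapping the
# whole allocator. Values are illustrative, not recommendations.
settings_update = {
    "transient": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 4,
        "cluster.routing.allocation.balance.threshold": 1.5
    }
}
body = json.dumps(settings_update)
# PUT to /_cluster/settings to change rebalancing behaviour on the fly.
```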

@Leaf-Lin
Contributor

+1
