
Shard Allocation Activity Balancing #12279

Closed
pickypg opened this issue Jul 15, 2015 · 8 comments
Labels
discuss :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes)

Comments

@pickypg
Member

pickypg commented Jul 15, 2015

Now that we have a sense of recovery priority (#11787), it may make sense to use that priority for allocation weight when all things are equal.

Problem

The use case I have in mind is a full cluster restart: you can end up with lopsided primaries even with allocation disabled.

Recovery Hotspot

The normal distribution can be observed along the top (minus the lopsided primary distribution). This was on 1.6.0 following a full cluster restart with allocation disabled; allocation was only re-enabled after every node was up. This particular recovery led to three very real problems:

  1. All of the indexing was hot spotting both nodes.
  2. All of the replicas were appearing on both nodes.
  3. All merging was effectively limited to both nodes.

Hotspotted Machines

This was without searching, but, being today's index, it would also receive the brunt of the search load. Obviously that would have a very negative impact on these two nodes, while the rest of the cluster would simply shrug it off.

Solution

The concept of primary balancing has been discussed (and removed), but this type of hot spotting is clearly a non-trivial problem. It's easy to spot it, but it's not easy to prevent it.

Given that the cluster maintains index writers for shards that are sized based on their activity level, and that we have sync_ids, segment counts, and index readability, we should be able to come up with some estimate of activity to balance against. New shards should be assumed to be as active as the most active shards.

Ideally we can come up with a way to "guess" activity with that. With or without it, we could either use the index.priority or some index.activity as a separate mechanism to allow the user to control it. A readonly index could still get the brunt of the requests. The nice thing about a separate setting is that it could be curated over time separate from priority.

From there, we need to modify the allocator equation to weight significantly based on activity, to avoid the picture above in normal circumstances. If we go purely by a number, then we can only do this for advanced use cases, because all normal indices would have an equal, defaulted activity value.
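To make this concrete, here is a minimal sketch of what such a modified weight function might look like. It is loosely modeled on the balanced allocator's existing per-node weight (a shard-count term plus a per-index term); the `activity` term, its inputs, and all factor values are assumptions from this proposal, not an existing Elasticsearch API.

```python
# Sketch: a balanced-allocator-style weight function extended with a
# hypothetical activity term. Nothing here is real Elasticsearch code;
# the activity factor is the proposal under discussion.

def node_weight(node_shards, avg_shards,
                node_index_shards, avg_index_shards,
                node_activity, avg_activity,
                shard_balance=0.45, index_balance=0.55, activity_balance=2.0):
    total = shard_balance + index_balance + activity_balance
    theta_shard = shard_balance / total
    theta_index = index_balance / total
    theta_activity = activity_balance / total
    return (theta_shard * (node_shards - avg_shards)
            + theta_index * (node_index_shards - avg_index_shards)
            + theta_activity * (node_activity - avg_activity))

# A node hosting more than its share of active shards gets a higher
# weight, so the allocator would prefer to place new shards elsewhere.
w_hot = node_weight(10, 10, 3, 2, 8.0, 4.0)
w_cold = node_weight(10, 10, 1, 2, 2.0, 4.0)
```

With a large `activity_balance` factor, activity dominates the weight even when raw shard counts are equal, which is exactly the failure mode in the screenshot above.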

Workaround

Manually rerouting shards can help to prevent this in current situations when you unluckily come across it. It's only really a problem once those shards become large and therefore movement is expensive.
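For reference, the manual workaround uses the cluster reroute API's `move` command. A sketch of the request body (index and node names are placeholders):

```python
import json

# Build a cluster reroute request that moves one shard off a hot node.
# The "move" command shape matches the /_cluster/reroute API; the index
# and node names below are illustrative placeholders.
def move_shard_command(index, shard, from_node, to_node):
    return {
        "commands": [
            {"move": {"index": index, "shard": shard,
                      "from_node": from_node, "to_node": to_node}}
        ]
    }

body = json.dumps(move_shard_command("logs-2015.07.15", 0,
                                     "hot-node-1", "cold-node-3"))
# POST this body to /_cluster/reroute to trigger the relocation.
```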

@clintongormley

> The concept of primary balancing has been discussed (and removed), but this type of hot spotting is clearly a non-trivial problem. It's easy to spot it, but it's not easy to prevent it.

The idea that having 5 primaries on one node and 5 replicas on the other node is a bad situation is (for the most part) incorrect. Primaries and replicas do the same amount of work, both at index time and at search time. I say "for the most part" because there are two exceptions:

  • updates have a get-modify-put process, and the get and modify steps happen only on the primary today. Heavy scripts can make primaries a hotspot. There is an open issue for making the get & modify steps distributed as well, which should solve this issue (Distributed update API #8369)
  • with shadow replicas, only the primary writes, while the shadow replicas just handle search - clearly this can result in primaries being a hotspot

That said, having most of the shards of your most active index sitting on only two nodes can lead to load imbalance. The cluster.routing.allocation.balance.index setting tries to force shards from the same node apart, but of course this is only one of the factors taken into account (see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-cluster.html).
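That setting is dynamic, so it can be raised without a restart. A sketch of the settings-update body (the value 0.75 is illustrative, not a recommendation):

```python
import json

# Raising cluster.routing.allocation.balance.index makes the allocator
# weigh per-index spread more heavily relative to the total shard count.
# The value here is purely illustrative.
settings_update = {
    "transient": {
        "cluster.routing.allocation.balance.index": 0.75
    }
}
body = json.dumps(settings_update)
# PUT this body to /_cluster/settings to apply it without a restart.
```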

Also, the cost of the hotspot needs to be weighed against the cost of copying GBs of data around your cluster in a shard shuffle. Finding the sweet spot for these values is HARD, and so we've left it up to the operator to make these decisions.

> Ideally we can come up with a way to "guess" activity with that. With or without it, we could either use the index.priority or some index.activity as a separate mechanism to allow the user to control it. A readonly index could still get the brunt of the requests. The nice thing about a separate setting is that it could be curated over time separate from priority.

I agree that something more automated would be of value, and perhaps the index.priority setting is a nice way of doing that, eg the higher the priority, the more impact the cluster.routing.allocation.balance.index setting has.

I'd like us to make improvement here, but it is a hard problem. Looking forward to hearing what ideas others have.

@pickypg
Member Author

pickypg commented Aug 17, 2015

I definitely agree that, in general, the difference between a replica and a primary shard is practically none (minus the fact that the primary must also wait for a response). In the above screenshot, there are zero updates taking place and there are no shadow replicas in play.

> Also, the cost of the hotspot needs to be weighed against the cost of copying GBs of data around your cluster in a shard shuffle. Finding the sweet spot for these values is HARD, and so we've left it up to the operator to make these decisions.

I completely agree, especially about it being a hard problem. Unfortunately, in the above picture, it was doing the recovery onto the same two nodes with all of the indexing activity, thus doubling the load on them rather than spreading it around the cluster.

Annoyingly, the cluster figures out that it should rebalance after it eventually recovers the replicas, but the damage is generally done at that point (it also adds even more load with rebalancing).

Having spent some time away from this issue, I think that we should look at index.activity with some defaulted value, then use the standard priority details (index.priority > index.creation_date > index.name (descending)) to break ties between equivalent activity values.
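The proposed ordering can be sketched as a sort key: a hypothetical `index.activity` value first, then the existing recovery-priority tie-breakers, all descending. The `activity` field is the assumption being proposed here; the rest mirrors the ordering from #11787.

```python
# Sketch of the proposed ordering: balance primarily on a hypothetical
# index.activity value, breaking ties with the existing recovery-priority
# chain (index.priority, then index.creation_date, then index.name,
# all descending).

def allocation_sort_key(index):
    return (-index["activity"],
            -index["priority"],
            -index["creation_date"],
            # invert characters so the name also sorts descending
            [-ord(c) for c in index["name"]])

indices = [
    {"name": "logs-2015.07.14", "activity": 5, "priority": 1, "creation_date": 100},
    {"name": "logs-2015.07.15", "activity": 5, "priority": 1, "creation_date": 200},
    {"name": "archive",         "activity": 0, "priority": 0, "creation_date": 50},
]
ordered = sorted(indices, key=allocation_sort_key)
# Most active indices first; among equals, the newest index wins.
```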

This allows for new indices to be created with high index.activity values that can be curated as time marches forward.

@vvcephei

vvcephei commented Oct 7, 2015

This looks like the most relevant discussion to ask my related question on.

I totally agree with @clintongormley that building a bunch of smart heuristics into the balanced allocator is high risk. Like you said, it's hard to get right in general, and it's expensive to shuffle data around the cluster.

I think this is a good opportunity for exploring the issue with plugins. Assuming I understand the code, anyone can provide their own Allocator. The problem is that currently, Allocators are only invoked when the cluster changes (routing table changes, cluster settings updates, etc.).

Allocating based on dynamic properties like index activity, search activity, or whatever else requires the allocator to be invoked on an interval. I can simulate this by updating the settings periodically, but I wonder if anyone has better advice on how to get the cluster to rebalance periodically.
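The periodic-trigger workaround John describes could be sketched as a simple loop. `touch_cluster` below is a placeholder, not a real client call; it stands in for whatever request re-runs the allocators (for example, a settings update or a POST to /_cluster/reroute).

```python
import time

# Sketch of the workaround: periodically nudge the cluster so the
# allocators run again. touch_cluster is a placeholder callable, not a
# real Elasticsearch client API.
def rebalance_loop(touch_cluster, interval_seconds, iterations):
    for _ in range(iterations):
        touch_cluster()
        time.sleep(interval_seconds)

calls = []
rebalance_loop(lambda: calls.append("reroute"),
               interval_seconds=0, iterations=3)
# calls now holds one entry per triggered reroute.
```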

Thanks,
-John

@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 14, 2018
@DaveCTurner
Contributor

This is an interesting idea, but since its opening we have not seen enough feedback that it is something we should work on. We prefer to close this issue as a clear indication that we are not going to work on this at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed please feel free to leave feedback (including +1s).

@tdoman

tdoman commented Apr 10, 2018

@DaveCTurner +1
At the very least, we need some way of distributing the primaries. I can simulate a redistribution of primaries in a 3-node cluster with 2 replicas by setting the replica count to 1, waiting for the redistribution, and then setting it back to 2. Giving the deployer the option to make that decision manually would be very helpful, especially for the scenarios (i.e. updates) where we know hot spots can be produced. We could get a bit of a hybrid between search and index optimization.
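The replica-bounce trick above amounts to two index settings updates. A sketch of the request bodies (the index name is whatever index is skewed; each body is PUT to `/<index>/_settings`):

```python
import json

# Sketch of the replica-bounce workaround: drop the replica count to 1,
# wait for the cluster to settle (green), then restore it to 2.
def replica_count_body(count):
    return json.dumps({"index": {"number_of_replicas": count}})

shrink = replica_count_body(1)   # step 1: drop to one replica
restore = replica_count_body(2)  # step 2, after green: back to two
```

Note this is lossy in the interim: while the replica count is reduced, the index has less redundancy, which is part of why a first-class option would be preferable.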

@Bukhtawar
Contributor

cluster.routing.allocation.type is static. If there were multiple ClusterPlugin implementations providing different allocation strategies, and an option to switch allocators dynamically, would we see a problem? The idea is to make cluster.routing.allocation.type dynamic, to allow customers to rescue their clusters by selecting a different allocation strategy if they see a skew.

@DaveCTurner
Contributor

We are discussing removing ClusterPlugin in #39464. Switching dynamically to a wholly different allocator sounds pretty drastic, and liable to trigger a lot of shard relocations. What would this achieve that cannot be achieved by adjusting the dynamic settings of one or more of the allocation deciders?
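The alternative suggested here, tuning the existing deciders' dynamic settings, looks like any other cluster settings update. Both setting names below are real dynamic settings; the values are only illustrative:

```python
import json

# Sketch: adjust dynamic allocation settings rather than swapping the
# whole allocator. Values are illustrative, not recommendations.
settings_update = {
    "transient": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 4,
        "cluster.routing.allocation.balance.threshold": 1.5
    }
}
body = json.dumps(settings_update)
# PUT to /_cluster/settings to change rebalancing behaviour on the fly.
```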

@Leaf-Lin
Contributor

+1
