Shard Allocation Activity Balancing #12279
The idea that having 5 primaries on one node and 5 replicas on the other node is a bad situation is (for the most part) incorrect. Primaries and replicas do the same amount of work, both at index time and at search time. I say "for the most part" because there are two exceptions:
That said, having most of the shards of your most active index sitting on only two nodes can lead to load imbalance. Also, the cost of the hotspot needs to be weighed against the cost of copying GBs of data around your cluster in a shard shuffle. Finding the sweet spot for these values is HARD, so we've left these decisions up to the operator.
I agree that something more automated would be of value, and I'd like us to make improvements here, but it is a hard problem. Looking forward to hearing what ideas others have.
I definitely agree that, in general, the difference between a replica and a primary shard is practically nil (minus the fact that the primary must also wait for a response). In the above screenshot, there are zero updates taking place and no shadow replicas in play.
I completely agree, especially about it being a hard problem. Unfortunately, in the above picture, it was doing the recovery onto the same two nodes with all of the indexing activity, thus doubling the load on them rather than spreading it around the cluster. Annoyingly, the cluster figures out that it should rebalance after it eventually recovers the replicas, but the damage is generally done at that point (and rebalancing adds even more load). Having spent some time away from this issue, I think that we should look at `index.priority`. This allows for new indices to be created with high priority.
This looks like the most relevant discussion to ask my related question on. I totally agree with @clintongormley that building a bunch of smart heuristics into the balanced allocator is high risk. Like you said, it's hard to get right in general, and it's expensive to shuffle data around the cluster.

I think this is a good opportunity for exploring the issue with plugins. Assuming I understand the code, anyone can provide their own Allocator. The problem is that currently, Allocators are only invoked when the cluster changes (routing table changes, cluster settings updates, etc.). Allocating based on dynamic properties like index activity, search activity, or whatever else requires the allocator to be invoked on an interval.

I can simulate this by updating the settings periodically, but I wonder if anyone has better advice on how to get the cluster to rebalance periodically. Thanks.
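On the "invoked on an interval" point, here is a minimal plain-Python sketch of the idea, outside of Elasticsearch itself. The `trigger_reroute` callback and the interval are assumptions for illustration; in practice the callback might POST an empty body to `/_cluster/reroute` (which forces an allocation round) or touch a dynamic cluster setting.

```python
import threading

def start_periodic_rebalance(trigger_reroute, interval_s=300.0):
    """Invoke `trigger_reroute` immediately, then again every `interval_s`
    seconds on a daemon timer. `trigger_reroute` is a hypothetical
    callback; it stands in for whatever actually nudges the cluster
    (e.g. an empty POST to /_cluster/reroute)."""
    def tick():
        trigger_reroute()
        timer = threading.Timer(interval_s, tick)
        timer.daemon = True  # don't keep the process alive just for this
        timer.start()
    tick()
```

This is only a scheduling shim; it does not solve the harder question of what the allocator should do when it wakes up.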
This is an interesting idea, but since its opening we have not seen enough feedback that it is something we should work on. We prefer to close this issue as a clear indication that we are not going to work on this at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed, please feel free to leave feedback (including +1s).
@DaveCTurner +1
We are discussing removing
+1 |
Now that we have a sense of recovery priority (#11787), it may make sense to use that priority as an allocation weight when all other things are equal.
Problem
The use case that I am thinking of is a full cluster restart: even with allocation disabled, you can end up with lopsided primaries.
The normal distribution can be observed along the top (minus the lopsided primary distribution). This was on 1.6.0 following a full cluster restart with allocation disabled; it was only re-enabled after every node was up. This particular recovery led to three very real problems:
This was without searching, but, being today's index, it would also receive the brunt of the search load. Obviously that would have a very negative impact on these two nodes, which the rest of the cluster would simply shrug off.
Solution
The concept of primary balancing has been discussed (and removed), but this type of hotspotting is clearly a non-trivial problem. It's easy to spot, but not easy to prevent.
Given that the cluster maintains index writers for shards that are sized based on their activity level, and that we have sync_ids, segment counts, and index readability, we should be able to come up with some estimate of activity to balance against. New shards should be assumed to be as active as the most active shards.
Ideally we can come up with a way to "guess" activity from that. With or without it, we could use either `index.priority` or some new `index.activity` setting as a separate mechanism to allow the user to control it. A read-only index could still receive the brunt of the requests. The nice thing about a separate setting is that it could be curated over time separately from priority.

From there, we need to modify the allocator equation to weight significantly based on activity, to avoid getting the picture above in normal circumstances. If we go purely based on a number, then we can only do this for advanced use cases, because all normal indices would have an equal, defaulted activity value.
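To make the proposal concrete, here is a hedged sketch of what folding an activity term into a per-node balancing weight might look like. This is not the actual `BalancedShardsAllocator` code; `activity_factor` and the activity estimates are invented for illustration, and the real allocator weighs shards per index as well.

```python
def node_weight(node_shards, avg_shards, node_activity, avg_activity,
                shard_factor=0.45, activity_factor=0.45):
    """Lower weight = better candidate for the next shard.
    Each term measures how far a node is above/below the cluster average."""
    shard_term = shard_factor * (node_shards - avg_shards)
    activity_term = activity_factor * (node_activity - avg_activity)
    return shard_term + activity_term

def pick_node(nodes):
    """nodes: dict of node name -> (shard_count, activity_estimate)."""
    avg_shards = sum(s for s, _ in nodes.values()) / len(nodes)
    avg_act = sum(a for _, a in nodes.values()) / len(nodes)
    return min(nodes, key=lambda n: node_weight(nodes[n][0], avg_shards,
                                                nodes[n][1], avg_act))

nodes = {
    "node-1": (5, 0.9),  # many shards, hot
    "node-2": (5, 0.1),  # many shards, idle
    "node-3": (4, 0.1),  # fewest shards, idle
}
print(pick_node(nodes))  # prints node-3: both terms favour it
```

With a defaulted, equal activity value the activity term cancels out and this degenerates to plain shard-count balancing, which is the "advanced use cases only" caveat above.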
Workaround
Manually rerouting shards can help prevent this when you unluckily run into it. It is only really a problem once those shards become large, which makes movement expensive.
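For reference, the manual-reroute workaround uses the cluster reroute API's `move` command. The command shape below matches the documented `_cluster/reroute` request body, but the index and node names are placeholders; this helper just builds the JSON.

```python
import json

def move_command(index, shard, from_node, to_node):
    """Build the body for POST /_cluster/reroute with a single `move`
    command (placeholder index/node names; shard is the shard number)."""
    return {"commands": [
        {"move": {"index": index, "shard": shard,
                  "from_node": from_node, "to_node": to_node}}
    ]}

# Example: move shard 0 of today's index off a hot node.
body = move_command("logs-2015.07.16", 0, "hot-node-1", "idle-node-3")
print(json.dumps(body))
```

The resulting body would be POSTed to `/_cluster/reroute`; repeating this for a few shards of the hot index spreads the load before the shards grow large.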