Elastic is imbalanced and probably needs manual rebalancing #3366
@mlissner I've reviewed this issue and created a test ES cluster on AWS, closely following our current production settings but at a smaller scale, with only three nodes, to make an imbalanced-node scenario easy to reproduce.

I tried various approaches to reproduce the imbalance. One method was deleting a node and letting the auto-scaler create a new one. This did not work if the cluster was previously balanced, since the new node tended to come back balanced as well. The method that worked was reducing the cluster to three nodes and then deleting two of them almost simultaneously. When the new nodes were regenerated, they were balanced in terms of shard count, but the node that had remained active ended up concentrating most of the primary shards, resulting in an imbalance.

The problem seems to occur when the cluster loses the majority of its nodes: it tries to keep functioning by concentrating all primary shards on the remaining nodes. Once the other nodes come back, however, the primary shards remain imbalanced.

I tried tweaking some settings according to the documentation described in:

Such as increasing the write load factor to equalize the total write load across nodes:

However, this didn't work once the cluster was already imbalanced. This appears to be a known issue, as described in:

So it requires manual intervention to balance the cluster using the reroute API, as mentioned in: elastic/elasticsearch#41543 (comment)

The trick is to cancel primary shards as needed to achieve a balanced cluster. Using Kibana or cURL, we can call the reroute API:
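A sketch of what such a reroute request looks like. This is the standard `_cluster/reroute` cancel command; the index name, shard number, and node name below are placeholders, not values from our cluster:

```shell
# Cancel the allocation of a primary shard copy. "allow_primary": true is
# required when targeting a primary; Elasticsearch then promotes the in-sync
# replica on another node and allocates a new replica elsewhere.
curl -X POST "localhost:9200/_cluster/reroute?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "commands": [
      {
        "cancel": {
          "index": "my-index",
          "shard": 0,
          "node": "node-1",
          "allow_primary": true
        }
      }
    ]
  }'
```

This is an API request fragment that requires a live cluster; run it once per primary shard you want to swap with its replica.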
We need to change the primary shards as needed.

For example, I initially had an imbalanced cluster as shown:

After running the reroute command many times, I ended up with the following result: a balanced cluster with 10 primary shards and 10 replicas on each node.

In this case, I canceled the following primary shards on:

These primary shards were canceled (swapped with their replicas), while the replicas of those shards were promoted to primary. The general idea is to cancel primary shards based on where their replicas live, so that you end up with a balanced cluster. After this, the workload should be evenly spread across all the nodes.
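The "cancel primaries whose replicas sit on underloaded nodes" idea can be sketched in a few lines of Python. This is a minimal illustration, not our tooling: the shard list mimics parsed `GET /_cat/shards` output, and the index and node names are made up.

```python
from collections import Counter

# Hypothetical snapshot of `GET /_cat/shards`: (index, shard, prirep, node).
# "p" = primary, "r" = replica. node-1 is hoarding primaries.
shards = [
    ("recap", 0, "p", "node-1"), ("recap", 0, "r", "node-2"),
    ("recap", 1, "p", "node-1"), ("recap", 1, "r", "node-3"),
    ("recap", 2, "p", "node-1"), ("recap", 2, "r", "node-2"),
    ("recap", 3, "p", "node-2"), ("recap", 3, "r", "node-3"),
]

def primaries_per_node(shards):
    """Count primary shards per node."""
    return Counter(node for _, _, prirep, node in shards if prirep == "p")

def suggest_cancellations(shards):
    """Suggest (index, shard, node) primaries to cancel on overloaded nodes,
    picking ones whose replica lives on an underloaded node, since cancelling
    a primary promotes that replica."""
    counts = primaries_per_node(shards)
    target = sum(counts.values()) / len({node for *_, node in shards})
    replica_of = {(i, s): n for i, s, pr, n in shards if pr == "r"}
    suggestions = []
    for index, shard, prirep, node in shards:
        if prirep == "p" and counts[node] > target:
            replica_node = replica_of.get((index, shard))
            if replica_node is not None and counts[replica_node] < target:
                suggestions.append((index, shard, node))
                counts[node] -= 1          # primary leaves this node
                counts[replica_node] += 1  # replica there gets promoted
    return suggestions
```

Each suggested tuple would become one `cancel` command in a reroute request; rerunning the analysis after each pass converges on an even primary count per node.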
After the work we did a few days ago, Elastic has not rebalanced the shards even though it's supposed to:
In theory, the primary and replica shards should be evenly divided among the nodes so that queries hit them evenly. We suspect this is one reason one of our nodes is constantly pegged at 100% CPU utilization (though it's surprising that it's not two of them, based on the picture above).

Elastic is supposed to rebalance itself automatically, but isn't for some reason, so we probably need to intervene. A couple of thoughts:
https://opster.com/guides/elasticsearch/glossary/elasticsearch-rebalance/
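To understand why rebalancing hasn't kicked in, the first step is usually to inspect shard placement and the effective allocation settings. These are standard Elasticsearch cat/settings endpoints; the host is a placeholder:

```shell
# Shard counts and disk usage per node
curl -s "localhost:9200/_cat/allocation?v"

# Primary vs. replica placement, sorted by node
curl -s "localhost:9200/_cat/shards?v&s=node"

# Effective routing/rebalance settings, defaults included
# (e.g. cluster.routing.rebalance.enable, allocation thresholds)
curl -s "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.routing"
```

Note that the built-in rebalancer only evens out total shard counts per node; it does not distinguish primaries from replicas, which is consistent with what we're seeing.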
I just want to understand why it hasn't happened already...