
Elastic is imbalanced and probably needs manual rebalancing #3366

Closed
mlissner opened this issue Nov 9, 2023 · 2 comments

Comments

@mlissner
Member

mlissner commented Nov 9, 2023

After the work we did a few days ago, Elastic has not rebalanced the shards even though it's supposed to:

[image: current shard allocation across the nodes]

In theory, the primary and replica shards should be evenly divided among the nodes so that they're evenly queried. We suspect this is one reason that one of our nodes is constantly pegged at 100% CPU utilization (though it's surprising that it's not two of them, based on the picture above).

Elastic is supposed to automatically rebalance itself, but isn't for some reason, so we probably need to intervene. A couple thoughts:

  1. We've been aggressively indexing since creating this imbalance in the nodes. Is it possible Elastic only rebalances when nodes are quiescent?
  2. One way to do this is probably to delete the replicas and then add them back (a rough sketch of that is below this list). I bet if we did that, Elastic would rebalance quickly.
  3. Is it possible this isn't a huge issue? Replicas are used for queries too, right? If so, maybe this situation is OK.
  4. I tried sharing the above image with Bing's GPT thing, and it wasn't much help, but it did point to this article that makes rebalancing look pretty simple:

https://opster.com/guides/elasticsearch/glossary/elasticsearch-rebalance/
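For option 2, a rough sketch of what dropping the replicas might look like, assuming the index is named recap and normally has one replica per shard (untested here, just for illustration):

PUT /recap/_settings

{
   "index":{
      "number_of_replicas":0
   }
}

And then, once the primaries have spread out, restore them:

PUT /recap/_settings

{
   "index":{
      "number_of_replicas":1
   }
}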

I just want to understand why it hasn't happened already....

@albertisfu
Contributor

@mlissner I've reviewed this issue and created a testing ES cluster on AWS, closely following the current settings we have in production, but on a smaller scale and with only three nodes, to easily reproduce an imbalanced node scenario.

I tried various approaches to reproduce the imbalanced node issue. One method was deleting a node and allowing the auto-scaler to create a new one. This did not work when the cluster had been balanced beforehand, since the shards just ended up balanced again once the new node joined.

The method that worked involved reducing the cluster size to three nodes and then deleting two of them almost simultaneously. When the replacement nodes came up, the total number of shards per node was balanced, but the node that had stayed up ended up holding most of the primary shards, resulting in an imbalance.

The problem seems to occur when the cluster loses the majority of its nodes. It tries to keep functioning by concentrating all primary shards on the remaining nodes. However, once the other nodes come back, the primary shards remain imbalanced.

I tried tweaking some settings according to the "Shard balancing heuristics settings" documentation, such as increasing the write load factor (cluster.routing.allocation.balance.write_load) to equalize the total write load across nodes.

However, this didn’t work once the cluster was already imbalanced.
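For reference, that kind of tweak is applied through the cluster settings API, roughly like this (the value is just an example, not a recommendation):

PUT /_cluster/settings

{
   "persistent":{
      "cluster.routing.allocation.balance.write_load":15.0
   }
}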

This appears to be a known issue, as described in:
elastic/elasticsearch#41543

So it requires manual intervention to balance the cluster using the reroute API, as mentioned in: elastic/elasticsearch#41543 (comment)

The trick is to cancel primary shards as needed to achieve a balanced cluster. Using Kibana or curl, we can use the reroute API:
POST /_cluster/reroute

{
   "commands":[
      {
         "cancel":{
            "index":"recap",
            "shard":26,
            "node":"elastic-cluster-es-master-data-nodes-v6-1",
            "allow_primary":true
         }
      }
   ]
}

We need to change the shard number and the node where the primary is currently allocated, applying the command as many times as needed to reallocate the shards.
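To see which primary shards are on which node (and where their replicas live), the _cat shards API is handy; for example, something like:

GET /_cat/shards/recap?v&s=shard,prirep&h=index,shard,prirep,state,node

The prirep column shows p for primaries and r for replicas, which makes it easy to count how many primaries each node holds before deciding which ones to cancel.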

For example, I initially had an imbalanced cluster as shown:

[Screenshot 2023-11-13 at 18:54: imbalanced shard allocation before rerouting]

After running the reroute many times, I ended up with the following result:

[Screenshot 2023-11-13 at 21:16: balanced shard allocation after rerouting]

A balanced cluster with 10 primary shards and 10 replicas on each node.

In this case, I canceled the following primary shards on elastic-cluster-es-master-data-nodes-v6-1: 3, 5, 8, 9, 10, 15, 16, 22, 26, 28.

When a primary shard is canceled this way, its replica on another node is promoted to primary and a fresh replica is rebuilt in its place. The general idea is to cancel primary shards based on where their replicas are, so that you end up with a balanced cluster.

So, after this, the workload should be evenly spread across all the nodes.
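Once the reroutes settle, we can confirm the result with the same _cat shards call, or get a quick per-node summary (shard counts and disk usage) with:

GET /_cat/allocation?v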

@mlissner
Member Author

That worked beautifully and only took a few seconds. Very nice. Here's the new allocation:

[image: new shard allocation after rerouting]

Hooray!
