
Primary shard balancing #41543

Closed
aeftef opened this issue Apr 25, 2019 · 17 comments
Labels
:Distributed/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes) · high hanging fruit · Team:Distributed (Meta label for distributed team)

Comments

@aeftef commented Apr 25, 2019

Primary shard balancing

There are use cases where we need a mechanism to balance primary shards across all the nodes, so that the number of primaries is uniformly distributed.

Problem description

There are situations where primary shards end up unevenly distributed across the nodes. For instance, during a rolling restart the last node to be restarted won't host any primary shards, because the other nodes assume the primary role for its shards while it is down.

This issue has popped up on other occasions, and the usual answer has been that the primary/replica role is not an issue because the workload a primary or a replica assumes is similar. But there are important use cases where this does not apply.

For instance, in an indexing-heavy scenario where indexing must be implemented as a scripted upsert, the execution of the upsert logic falls on the primary shards, while the replicas just have to index the result.
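
As a concrete illustration, a scripted upsert looks roughly like the request below (the index name, field and script are made up for this sketch; scripted_upsert and the _update endpoint are standard Elasticsearch APIs, though the exact URL form depends on the ES version). The script runs on the primary shard, and the replicas only index the resulting document:

POST /my-index/_update/1
{
  "scripted_upsert": true,
  "script": {
    "lang": "painless",
    "source": "ctx._source.counter = (ctx._source.counter == null ? 0 : ctx._source.counter) + params.increment",
    "params": { "increment": 1 }
  },
  "upsert": {}
}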

In these cases, unbalanced primaries put a bigger workload on the nodes hosting them. This can even overload the cluster, because the bottleneck becomes the capacity of the nodes hosting the primaries rather than the combined capacity of all the cluster's nodes.

Related thread in the official forum.

Workarounds

There are currently some workarounds for this situation, but they are not efficient:

  1. Once the cluster's primaries are unbalanced, we could use the Cluster reroute API to try to balance them, swapping a replica with a primary in a single "reroute transaction". To do this, we first need more nodes in the cluster than copies of each shard, because a shard cannot be rerouted to a node that already holds a copy of that shard.

As an example, consider a simplified scenario with 3 nodes and 2 shards, each with 2 replicas (3 copies of each shard), where rerouting is not possible:
Node 0: Shard 0 (primary), Shard 1 (primary)
Node 1: Shard 0 (replica), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (replica)
Rerouting is not possible here: we cannot move shard 0 (primary) from Node 0 to Node 1, because Node 1 already holds a copy of shard 0 (a replica), and the same applies to every other possible move.

Compare this with a scenario of 3 nodes and 3 shards, each with 1 replica (2 copies of each shard), where rerouting is possible:
Node 0: Shard 0 (primary), Shard 1 (primary)
Node 1: Shard 0 (replica), Shard 2 (primary)
Node 2: Shard 1 (replica), Shard 2 (replica)

But even when rerouting is possible, it means that shard data has to be moved from one node to another (I/O and network cost).
Also, there is no automated way to detect which primaries are unbalanced and which shards can be swapped, and to execute that rerouting in small chunks so as not to overload node resources. Implementing a utility or script that does this is feasible, though (see possible solutions). A sketch of such a reroute swap is shown after this list.

  2. Simply "throw a bag of hardware" at the problem: provision enough hardware to support an unbalanced scenario. We can then limit the number of shards per node (https://www.elastic.co/guide/en/elasticsearch/reference/6.7/allocation-total-shards.html), counting both primaries and replicas, so that they are spread between the nodes (see the settings sketch after this list). This approach is unnecessarily expensive, and impractical at certain scale levels.
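
To illustrate workaround 1 with the second scenario above (where rerouting is possible), a single "reroute transaction" can swap shard 0's primary on Node 0 with shard 2's replica on Node 2 by sending two move commands in one Cluster reroute call. This is only a hedged sketch: the index name and the node names ("node-0", "node-2") are placeholders for the real ones, each move is accepted only if the destination node holds no copy of that shard, and the shard data is physically relocated (the I/O and network cost mentioned above). Since a relocated primary keeps its primary role, each node ends up hosting exactly one primary:

POST /_cluster/reroute
{
  "commands": [
    { "move": { "index": "my-index", "shard": 0, "from_node": "node-0", "to_node": "node-2" } },
    { "move": { "index": "my-index", "shard": 2, "from_node": "node-2", "to_node": "node-0" } }
  ]
}

For workaround 2, the shards-per-node limit linked above is a dynamic index-level setting; a minimal sketch with a made-up index name and an arbitrary limit (the documentation warns that a limit set too low can leave shards unassigned):

PUT /my-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}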

Possible solutions? (all of them imply new features that would need to be implemented)

So, let's consider possible solutions (bear in mind that I don't know the Elasticsearch internals):

  1. Enhance the Cluster reroute API so that you can "reroute" the role of a shard: say we reroute the primary role from one node to another node that hosts a replica of the same shard; no data is moved between the nodes, but the replica is elected as primary and the former primary becomes a replica.
    If the reroute API offered this functionality, it would be possible to develop a script that detects primary shard imbalance and reassigns primary roles accordingly (a purely hypothetical sketch follows this list).
  2. Modify the cluster shard allocation protocol so that primaries are automatically balanced when shards are assigned to nodes. This could be active by default, or optional (configured via some new cluster setting under cluster.routing...).
  3. Any other ideas?
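
To make solution 1 concrete, such an enhancement might look like a new reroute command along the lines of the sketch below. To be clear: this command does not exist in Elasticsearch; the command name and its parameters are invented purely to illustrate swapping primary/replica roles without relocating any data.

POST /_cluster/reroute
// "swap_primary" is hypothetical (not a real Elasticsearch command): it would demote the primary
// copy of shard 0 on from_node to a replica and promote the replica on to_node to primary,
// without moving shard data between the nodes.
{
  "commands": [
    { "swap_primary": { "index": "my-index", "shard": 0, "from_node": "node-0", "to_node": "node-1" } }
  ]
}
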
@polyfractal added the :Distributed/Allocation label on Apr 25, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@aewhite commented Oct 13, 2019

For what it's worth, https://github.com/datarank/tempest does this. Disclaimer: I am one of the authors of that project. I am hoping to port it to more recent versions of ES at some point.

@DaveCTurner (Contributor)

The trickiest part of this feature is that today there is no way to demote a primary back to being a replica. The existing mechanism requires us to fail the primary, promote a specific replica, and then perform a recovery to replace the failed primary as a replica. Until the recovery is complete the shard is running with reduced redundancy, and I don't think it's acceptable for the balancer to detract from resilience like this. Therefore, I'm labelling this as high-hanging fruit.

I've studied Tempest a little and I don't think anything in that plugin changes this. I also think it does not balance the primaries out in situations like the one described in the OP:

Node 0: Shard 0 (primary), Shard 1 (primary)
Node 1: Shard 0 (replica), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (replica)

At least, there doesn't seem to be a test showing that it does and I couldn't find a mechanism that would have helped here after a small amount of looking. I could be wrong, I don't know the codebase (or Kotlin) very well, so a link to some code or docs to help us understand Tempest's approach would be greatly appreciated.

@hallh commented Oct 18, 2019

Here's a real-life scenario.

Take this index:

  • 30 shards.
  • 4 replicas.
  • Running on 9 data nodes using spot instances on AWS.
  • Using 3 different instance types and 3 AZs to minimise the risk of spot wipeout.
  • Using both scripted upserts and custom routing.

Some spot instance types are more vulnerable than others, so over time we slowly see primaries moving over to the long-lived instances as the short-lived ones are being reclaimed by AWS.

With this setup we're "at risk" of experiencing both the skew in shard sizes from the custom routing (what Tempest tries to solve) as well as the skew in indexing load (what this issue is addressing).

The shard sizes we can deal with on the application side, but the primaries bundling up is a bigger problem. Given the number of replicas, it's hard to do any meaningful custom balancing. Taking a primary out of rotation by moving it would choose a new primary at random, so trying to balance this way could simply result in an endless cycle of shuffling. Never mind the performance impact and potentially the cross-AZ network costs too.

We could try the "bag of hardware" solution, but we'd need to add a new instance type to the spot pool for each additional instance to mitigate spot wipeout, and we'd consequently need to increase the replica count as well. Storage costs would increase, and it might not even solve the problem. We could also move to regular instances, but that's not great either for obvious reasons.

If, at the very least, we could choose which shard was primary, we could solve this problem with a custom solution (without the cost overhead). Having native support for balancing primaries would obviously solve this too.

@hallh commented Nov 2, 2019

Found out that ES doesn't promote a replica when moving the primary - it'll literally just move the primary. Duh.. Anyway, knowing that, it was much simpler to balance the primaries, so for anyone finding this and having the same issue, I wrote this tool to solve it: https://github.com/hallh/elasticsearch-primary-balancer.

I've run this on our production cluster a few times already, and balancing primaries did solve the issue we had with sudden spikes of write rejections under otherwise consistent load.

@aeftef (Author) commented Nov 4, 2019

> Found out that ES doesn't promote a replica when moving the primary - it'll literally just move the primary. Duh.. Anyway, knowing that, it was much simpler to balance the primaries, so for anyone finding this and having the same issue, I wrote this tool to solve it: https://github.com/hallh/elasticsearch-primary-balancer.
>
> I've run this on our production cluster a few times already, and balancing primaries did solve the issue we had with sudden spikes of write rejections under otherwise consistent load.

Amazing work, hallh.

But it's funny, because I also realized that a primary and a replica could be swapped via the cluster reroute API as a way to balance primaries (inefficient, but effective).

What's more, I also developed such a tool to evaluate the cluster state, generate a migration plan and execute it (I'm talking 4 days ago... so our work overlapped :-( )

So now we have two primary balancer tools...
I will try to publish mine too as soon as I consider it production ready, but yes, these last days it has been rebalancing a production cluster with no issues, so if I manage to invest some more time it could be ready soon.
Also expect feedback, as we have some common points where we can improve the tools.

Now, as a long-term solution, we will try to push for an Elasticsearch capability to move the primary role from one shard copy to another. It may be slow, but being backed by paid support may help.
Once we have this functionality it would be trivial to modify the rebalancing tools to just "move the primary roles" instead of swapping the shards along with their data (that would be fast and efficient).

@aeftef (Author) commented Nov 15, 2019

Posting some suggestions for elasticsearch-primary-balancer here.

As an update, I will post my rebalancing tool too, sooner rather than later. I've been busy supporting cluster routing allocation awareness, as some of my clusters are using it, and that further limits the ways you can try to balance shard roles.

@jffree commented Nov 18, 2019

I also encountered the same problem. When the primary shards are all on the same node, the load on this node can be very large.

Shard allocation:

xxxx-36404-2018.11         1 r STARTED 11660265      2gb 10.234.242.30 node1
xxxx-36404-2018.11         1 p STARTED 11660265      2gb 10.234.242.31 node2
xxxx-36404-2018.11         0 r STARTED 11662979      2gb 10.234.242.30 node1
xxxx-36404-2018.11         0 p STARTED 11662979      2gb 10.234.242.31 node2
xxxx-36411-2019.06         1 r STARTED 28357804    6.6gb 10.234.242.30 node1
xxxx-36411-2019.06         1 p STARTED 28357804    6.6gb 10.234.242.31 node2
xxxx-36411-2019.06         0 r STARTED 28359462    6.4gb 10.234.242.30 node1
xxxx-36411-2019.06         0 p STARTED 28359462    6.4gb 10.234.242.31 node2
xxxx-36409-2018.12         1 r STARTED 21875206    4.6gb 10.234.242.30 node1
xxxx-36409-2018.12         1 p STARTED 21875206    4.6gb 10.234.242.31 node2
xxxx-36409-2018.12         0 r STARTED 21872417    4.6gb 10.234.242.30 node1
xxxx-36409-2018.12         0 p STARTED 21872417    4.6gb 10.234.242.31 node2
xxxx-36401-2018.11         1 r STARTED 37610364    6.9gb 10.234.242.30 node1
xxxx-36401-2018.11         1 p STARTED 37610364    6.9gb 10.234.242.31 node2
xxxx-36401-2018.11         0 r STARTED 37619043    6.9gb 10.234.242.30 node1
xxxx-36401-2018.11         0 p STARTED 37619043    6.9gb 10.234.242.31 node2
xxxx-36404-2019.01         1 r STARTED 12161838    2.9gb 10.234.242.30 node1
xxxx-36404-2019.01         1 p STARTED 12161838    2.9gb 10.234.242.31 node2
xxxx-36404-2019.01         0 r STARTED 12161757    2.9gb 10.234.242.30 node1
xxxx-36404-2019.01         0 p STARTED 12161757    2.9gb 10.234.242.31 node2
xxxx-36408-2018.10         1 r STARTED 58700020   10.1gb 10.234.242.30 node1
xxxx-36408-2018.10         1 p STARTED 58700020   10.1gb 10.234.242.31 node2
xxxx-36408-2018.10         0 r STARTED 58692558   10.1gb 10.234.242.30 node1
xxxx-36408-2018.10         0 p STARTED 58692558   10.1gb 10.234.242.31 node2
xxxx-36405-2019.11         1 r STARTED 12921855    3.1gb 10.234.242.30 node1
xxxx-36405-2019.11         1 p STARTED 12921855    3.2gb 10.234.242.31 node2
xxxx-36405-2019.11         0 r STARTED 12924625    3.1gb 10.234.242.30 node1
xxxx-36405-2019.11         0 p STARTED 12924625    3.1gb 10.234.242.31 node2
...

Task distribution is as follows:

action                                          task_id                          parent_task_id                   type      start_time    timestamp running_time ip            node
indices:data/read/scroll                        GDS6EwMwRAavA9NACh4nXg:37417628  -                                transport 1574045549545 10:52:29  10.6s        10.234.242.30 node1
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999023 GDS6EwMwRAavA9NACh4nXg:37417628  netty     1574045553089 10:52:33  7.2s         10.234.242.31 node2
indices:data/read/scroll                        GDS6EwMwRAavA9NACh4nXg:37417629  -                                transport 1574045550097 10:52:30  10.1s        10.234.242.30 node1
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999049 GDS6EwMwRAavA9NACh4nXg:37417629  netty     1574045554929 10:52:34  5.3s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999058 GDS6EwMwRAavA9NACh4nXg:37417629  netty     1574045555700 10:52:35  4.6s         10.234.242.31 node2
indices:data/read/scroll                        GDS6EwMwRAavA9NACh4nXg:37417662  -                                transport 1574045553203 10:52:33  7s           10.234.242.30 node1
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999235 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045557817 10:52:37  2.4s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999236 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045557819 10:52:37  2.4s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999237 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045557842 10:52:37  2.4s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999238 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045557868 10:52:37  2.4s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999239 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045557899 10:52:37  2.4s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999240 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045557933 10:52:37  2.3s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999248 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045558112 10:52:38  2.2s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999249 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045558223 10:52:38  2s           10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999251 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045558264 10:52:38  2s           10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999252 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045558268 10:52:38  2s           10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999253 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045558294 10:52:38  2s           10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999254 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045558351 10:52:38  1.9s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999255 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045558398 10:52:38  1.9s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999256 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045558408 10:52:38  1.9s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999264 GDS6EwMwRAavA9NACh4nXg:37417662  netty     1574045558515 10:52:38  1.8s         10.234.242.31 node2
indices:data/read/scroll                        GDS6EwMwRAavA9NACh4nXg:37417765  -                                transport 1574045558454 10:52:38  1.7s         10.234.242.30 node1
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999344 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045558696 10:52:38  1.6s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999345 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045558731 10:52:38  1.5s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999346 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045558758 10:52:38  1.5s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999348 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045558823 10:52:38  1.4s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999349 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045558845 10:52:38  1.4s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999351 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045558893 10:52:38  1.4s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999353 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045558948 10:52:38  1.3s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999352 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045558948 10:52:38  1.3s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999354 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045558961 10:52:38  1.3s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999356 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559141 10:52:39  1.1s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999357 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559151 10:52:39  1.1s         10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999360 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559479 10:52:39  836.6ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999362 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559485 10:52:39  829.9ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999363 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559528 10:52:39  786.8ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999364 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559535 10:52:39  780.3ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999365 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559573 10:52:39  741.8ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999366 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559693 10:52:39  622.2ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999368 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559727 10:52:39  588.1ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999369 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559763 10:52:39  551.8ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999370 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559765 10:52:39  550ms        10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999371 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559785 10:52:39  530.4ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id/scroll] PP7k8Oo9RIqMcRboQS3n9w:119999372 GDS6EwMwRAavA9NACh4nXg:37417765  netty     1574045559793 10:52:39  522.3ms      10.234.242.31 node2
indices:data/read/search                        PP7k8Oo9RIqMcRboQS3n9w:119999406 -                                transport 1574045559857 10:52:39  457.8ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id]        PP7k8Oo9RIqMcRboQS3n9w:119999569 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct    1574045559944 10:52:39  371ms        10.234.242.31 node2
indices:data/read/search[phase/fetch/id]        PP7k8Oo9RIqMcRboQS3n9w:119999573 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct    1574045559998 10:52:39  316.9ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id]        PP7k8Oo9RIqMcRboQS3n9w:119999574 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct    1574045560005 10:52:40  310.2ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id]        PP7k8Oo9RIqMcRboQS3n9w:119999581 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct    1574045560075 10:52:40  240.7ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id]        PP7k8Oo9RIqMcRboQS3n9w:119999582 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct    1574045560155 10:52:40  159.8ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id]        PP7k8Oo9RIqMcRboQS3n9w:119999584 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct    1574045560172 10:52:40  143.1ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id]        PP7k8Oo9RIqMcRboQS3n9w:119999586 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct    1574045560211 10:52:40  103.8ms      10.234.242.31 node2
indices:data/read/search[phase/fetch/id]        PP7k8Oo9RIqMcRboQS3n9w:119999588 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct    1574045560273 10:52:40  41.9ms       10.234.242.31 node2
indices:data/read/search[phase/fetch/id]        PP7k8Oo9RIqMcRboQS3n9w:119999589 PP7k8Oo9RIqMcRboQS3n9w:119999406 direct    1574045560296 10:52:40  19.1ms       10.234.242.31 node2
indices:data/read/search                        PP7k8Oo9RIqMcRboQS3n9w:119999559 -                                transport 1574045559942 10:52:39  372.7ms      10.234.242.31 node2
indices:data/read/scroll                        PP7k8Oo9RIqMcRboQS3n9w:119999570 -                                transport 1574045559980 10:52:39  335.2ms      10.234.242.31 node2
indices:data/read/search                        PP7k8Oo9RIqMcRboQS3n9w:119999572 -                                transport 1574045559990 10:52:39  325.2ms      10.234.242.31 node2
indices:data/read/scroll                        PP7k8Oo9RIqMcRboQS3n9w:119999583 -                                transport 1574045560158 10:52:40  157.1ms      10.234.242.31 node2
cluster:monitor/tasks/lists                     GDS6EwMwRAavA9NACh4nXg:37417903  -                                transport 1574045560189 10:52:40  43.7ms       10.234.242.30 node1
cluster:monitor/tasks/lists[n]                  GDS6EwMwRAavA9NACh4nXg:37417904  GDS6EwMwRAavA9NACh4nXg:37417903  direct    1574045560221 10:52:40  12.5ms       10.234.242.30 node1
cluster:monitor/tasks/lists[n]                  PP7k8Oo9RIqMcRboQS3n9w:119999587 GDS6EwMwRAavA9NACh4nXg:37417903  netty     1574045560239 10:52:40  75.7ms       10.234.242.31 node2

Any good solutions?

@gf53520 commented Aug 10, 2020

+1, we also found this case in our production cluster.

@jugrajsingh commented Sep 8, 2020

+1, we would like to have this functionality.
Some nodes are highly overloaded due to this very case.

And if a node in the cluster gets rebooted it loses its primary shards, which makes the cluster even more unbalanced, and the primaries are not rebalanced after the node is re-added.

@bitonp commented Nov 15, 2020

+1 .. same issue here. 2-data-node cluster, primaries evenly balanced. One node died... all primaries ended up on the live node (good). Bring the dead node back up, and we have all primaries on one node and all replicas on the other. It won't rebalance the primaries. This means all writes go to one node... plus aggregation reads all come from one node. There is no split of usage across the nodes.
Weirdly, I am sure that old versions of ES did this (not sure when it stopped), but we really do need to have that ability put in.

Scratch that... found the way.
The 'cancel' command on the primary cancels the current primary (on node 2), promotes the replica (on node 1) to primary, and the new primary then copies its data back to node 2 as a replica... cool...

POST /_cluster/reroute
{
  "commands": [
    {
      "cancel": {
        "index": "my-index",
        "shard": 3,
        "node": "my-data-2",
        "allow_primary": true
      }
    }
  ]
}
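
To identify which copy of each shard is currently the primary before (and after) issuing a cancel like the one above, the cat shards API helps; a minimal sketch with a placeholder index name (the prirep column shows p for primary, r for replica):

GET /_cat/shards/my-index?v&h=index,shard,prirep,state,node

Note that cancelling an active primary this way briefly leaves the shard with reduced redundancy while the demoted copy recovers as a replica, which is the resilience concern raised by DaveCTurner above.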

@Ahnfelt commented Jan 25, 2023

Is the solution @bitonp gave above the official recommendation for rebalancing primary shards?

@coudenysj

The biggest problem we see is that not all shards have the same size. Because the rebalancing logic of ES only looks at the number of shards on a node, this causes a lot of problems when nodes are running out of disk space.

I'm doing some "manual" rebalancing, but it seems that this is something ES could do a lot better.

I find it strange that more people aren't running into this issue.

PS: https://discuss.elastic.co/t/shard-allocation-based-on-shard-size/257817/12

@karlney commented May 11, 2023

> Is the solution @bitonp gave above the official recommendation for rebalancing primary shards?

We have been using it and it saves us money even though it feels like a hack. Details in https://underthehood.meltwater.com/blog/2023/05/11/promoting-replica-shards-to-primary-in-elasticsearch-and-how-it-saves-us-12k-during-rolling-restarts/

@bitonp commented May 13, 2023

> Is the solution @bitonp gave above the official recommendation for rebalancing primary shards?

@Ahnfelt Not the 'official' solution, but it works well and allows ES to 'do its thing' without affecting the rest of the system. It simply mimics a node outage, which ES can cope with as long as there are no other complicating factors in the scenario.

I noticed that some people are putting all the primaries on a single node. This is sub-optimal (and 'relational' in concept). By spreading primaries and replicas across the nodes, in the case of a node outage it is quick to promote replicas to primaries to fill the gaps.
This also splits writes and reads across the nodes (not all going to the same machine), reducing the load on any given node and making use of system resources effectively.

@skijash commented Mar 8, 2024

The cluster.routing.allocation.balance.write_load setting mentioned under "Shard rebalancing heuristics" in v8.12 seems like it could be useful for this kind of balancing. Has anyone tried it?
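
In case anyone wants to experiment with it, that heuristic weight is a dynamic cluster setting, so it can be adjusted at runtime; a minimal sketch assuming Elasticsearch 8.6 or later, where the setting was introduced (the value below is arbitrary - a higher weight makes the balancer weigh estimated write load more heavily relative to shard count):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.write_load": 15.0
  }
}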

@DaveCTurner (Contributor)

Since 8.6 Elasticsearch takes account of write load when making balancing decisions, and we have also been doing other things which separate indexing and search workloads more completely. Given these recent changes and the technical obstacles to genuinely balancing primary shards as suggested here (particularly the need for graceful demotion back to a replica) it's really rather unlikely we'll implement this feature in the foreseeable future. Therefore I'm closing this issue to indicate that it is not something we expect to pursue.

@DaveCTurner closed this as not planned on May 30, 2024