
Some combination of total_shards_per_node and allow_rebalance blocks allocation of unassigned shards with DesiredBalanceAllocator #108594

Closed
mrkm4ntr opened this issue May 14, 2024 · 2 comments

@mrkm4ntr
Contributor

Elasticsearch Version

8.12.1

Installed Plugins

No response

Java Version

bundled

OS Version

Linux 5.15.133+

Problem Description

Certain combinations of total_shards_per_node and allow_rebalance (e.g. total_shards_per_node = 2 and allow_rebalance = indices_all_active) block allocation of unassigned shards with the DesiredBalanceAllocator. Some replica shards remain unassigned even though there is room to allocate them, and the cluster allocation explain API even reports that the unassigned shards can be placed into that room. This is a really confusing situation.
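
For concreteness, a sketch of the settings combination described above (the index name item-all is the one from this report; note that indices_all_active is also the default value of allow_rebalance):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.allow_rebalance": "indices_all_active"
  }
}

PUT item-all/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}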

The response of the cluster allocation explain API (truncated):

{
  "index": "item-all",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "REPLICA_ADDED",
    "at": "2024-05-05T13:07:16.494Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "yes",
  "allocation_explanation": "Elasticsearch can allocate the shard.",
  "target_node": {
    ...
  }
}
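
For reference, the output above comes from a request along these lines (index and shard values are the ones from this report):

GET _cluster/allocation/explain
{
  "index": "item-all",
  "shard": 0,
  "primary": false
}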

The response of GET _internal/desired_balance reports unassigned_shards as 0 even though there actually are unassigned shards:

{
  "stats": {
    ...
    "unassigned_shards": 0,
    ...
  }
}

I found that this is due to a difference in the order of relocating shards and assigning unassigned shards between desired-balance computation and actual allocation. While computing the desired balance, the order is: relocate, then assign unassigned. But during actual allocation (reconciliation), the order is: assign unassigned, then relocate (balance).

Details

Here is the log. total_shards_per_node = 2, node Ox3uTG_uTX6RPFToXcNk5g holds 2 shards (9 and 20), and 1 replica of shard 0 is unassigned.

While computing the desired balance, shard 20 was relocated from Ox3uTG_uTX6RPFToXcNk5g to another node.

T0509 05:22:58.000889 1 [elasticsearch-bench8-es-masters-z-b-1] [item-all][20] marked shard as started (routing: [item-all][20], node[gszxwYrMRUik4tvLE9ONTA], relocating [Ox3uTG_uTX6RPFToXcNk5g], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=m_lHqXcaQRe05AXDoOgqmA, rId=K5b7AIUXTI-QZakqyeDJZg], failed_attempts[0], expected_shard_size[0])

Then shard 0 was allocated to node Ox3uTG_uTX6RPFToXcNk5g, because shard 20 had been relocated away and there was room.

T0509 05:22:58.000895 1 [elasticsearch-bench8-es-masters-z-b-1] [item-all][0] marked shard as started (routing: [item-all][0], node[Ox3uTG_uTX6RPFToXcNk5g], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=UUfVDOQJRB2GuZE2GhShVw], unassigned_info[[reason=REPLICA_ADDED], at[2024-05-09T04:44:18.973Z], delayed=false, allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[0])

Then the delegate allocator assigned shard 0 to node Ox3uTG_uTX6RPFToXcNk5g.

T0509 05:22:58.000929 1 [elasticsearch-bench8-es-masters-z-b-1] Assigned shard [[item-all][0], node[Ox3uTG_uTX6RPFToXcNk5g], [R], s[STARTED], a[id=UUfVDOQJRB2GuZE2GhShVw], failed_attempts[0], expected_shard_size[0]] to node [Ox3uTG_uTX6RPFToXcNk5g]

Here is the computed desired balance: the primary and all replicas of shard 0 are assigned to nodes.

T0509 05:22:59.000043 1 [elasticsearch-bench8-es-masters-z-b-1] Desired balance updated: ... [item-all][0]=ShardAssignment[nodeIds=[NkF1_R8nR_Kiym-LxTVMIA, EmeN_FSPSaGca_-zq858LA, Ox3uTG_uTX6RPFToXcNk5g], total=3, unassigned=0, ignored=0]

But during reconciliation, ShardsLimitAllocationDecider returns NO because shard 20 has not actually been relocated yet.

T0509 05:22:59.000051 1 [elasticsearch-bench8-es-masters-z-b-1] Reconciler#allocateUnassigned
...
T0509 05:22:59.000051 1 [elasticsearch-bench8-es-masters-z-b-1] Can not allocate [[item-all][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2024-05-09T04:44:18.973Z], delayed=false, allocation_status[no_attempt]], failed_attempts[0]] on node [{elasticsearch-bench8-es-item-all-1}{Ox3uTG_uTX6RPFToXcNk5g}{7Skx_KKdTyWmVjXlNTEWJQ}{elasticsearch-bench8-es-item-all-1}{10.34.1.230}{10.34.1.230:9300}{d}{8.12.1}{7000099-8500010}{transform.config_version=10.0.0, xpack.installed=true, k8s_node_name=gke-citadel-2g-dev-t-d-mercari-eaas-t-f669dce3-k79p, k8s_pod_name=elasticsearch-bench8-es-item-all-1, group=item-all, ml.config_version=12.0.0}]. [ShardsLimitAllocationDecider]: NO()
...
D0509 05:22:59.000051 1 [elasticsearch-bench8-es-masters-z-b-1] Couldn't assign shard [[item-all][0]] to [Ox3uTG_uTX6RPFToXcNk5g]: NO()
...
D0509 05:22:59.000051 1 [elasticsearch-bench8-es-masters-z-b-1] No eligible node found to assign shard [[item-all][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2024-05-09T04:44:18.973Z], delayed=false, allocation_status[no_attempt]], failed_attempts[0]]

And the relocation itself was blocked by ClusterRebalanceAllocationDecider because there were still unassigned shards.

T0509 05:22:59.000058 1 [elasticsearch-bench8-es-masters-z-b-1] Can not rebalance. [ClusterRebalanceAllocationDecider]: NO(the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active])

As a result, the unassigned shards will never be assigned: the replica cannot be allocated until shard 20 relocates, and shard 20 cannot relocate while the cluster has unassigned shards, so allocation is deadlocked.

Steps to Reproduce

  1. Set total_shards_per_node = 2 on an index that has many shards (24 in our case).
  2. Increase the number of nodes.
  3. Increase the number of replicas as soon as possible.
  4. Some of the new replica shards remain unassigned (see the sketch after this list).
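
A minimal sketch of these steps as API calls, assuming a fresh index named item-all (adding nodes happens out of band, e.g. by scaling the cluster):

PUT item-all
{
  "settings": {
    "index.number_of_shards": 24,
    "index.number_of_replicas": 1,
    "index.routing.allocation.total_shards_per_node": 2
  }
}

# ... add data nodes, then immediately:

PUT item-all/_settings
{
  "index.number_of_replicas": 2
}

# some new replicas stay UNASSIGNED; check with:
GET _cat/shards/item-all?v&h=index,shard,prirep,state,node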

Logs (if relevant)

No response

@mrkm4ntr added the >bug and needs:triage labels May 14, 2024
@DaveCTurner
Contributor

I believe #98710 would address the confusing output of the allocation explain API (ES cannot in fact allocate the shards to their desired nodes). However, total_shards_per_node ultimately leads to unassigned shards sometimes, as mentioned in its docs. See also #12273.

Since this is a known issue and is tracked elsewhere, I'm going to close this as a duplicate. It's a valid observation; we just don't need another issue to track it.

@DaveCTurner closed this as not planned May 14, 2024
@mrkm4ntr
Contributor Author

mrkm4ntr commented May 14, 2024

@DaveCTurner We used total_shards_per_node for several years before switching to the DesiredBalanceAllocator and never faced unassigned shards. As my explanation above shows, the cause is clearly the DesiredBalanceAllocator. If you don't plan to fix this soon, I'd at least like you to add a note to the docs saying that total_shards_per_node requires allow_rebalance to be indices_primaries_active or always.
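
For anyone else hitting this, a sketch of the workaround implied above, i.e. relaxing the rebalance constraint so reconciliation can make progress (note this is a cluster-wide setting, so weigh the tradeoff):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.allow_rebalance": "indices_primaries_active"
  }
}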
