
total_shards_per_node may lead to unassigned shards #9248

Closed
imriz opened this issue Jan 12, 2015 · 17 comments
Labels: >bug, :Distributed Indexing/Distributed, stalled

Comments

imriz commented Jan 12, 2015

Hi,
I have 3 nodes (version 1.4.1), and my index settings are as follows:
"index.routing.allocation.total_shards_per_node": 2
"index.number_of_shards": 3
"index.number_of_replicas": "1"

Sometimes, the shards get allocated as follows (p denotes primary, r denotes replica):
Node1: 0p,1r
Node2: 0r,1p
Node3: 2p

This leaves 2r unassigned.
If I raise total_shards_per_node to 3, the cluster will start recovering the unassigned shard (2r).
If I lower total_shards_per_node to 2 after the recovery has finished, it will reallocate the shards correctly.
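
Concretely, that workaround is just two dynamic index settings updates (my-index below is a placeholder for the affected index):

PUT /my-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 3
}

Once the unassigned replica has recovered:

PUT /my-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}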

It seems I am not the only one seeing this:
https://groups.google.com/forum/#!topic/elasticsearch/GZnamxaaj0g

clintongormley (Contributor) commented:

Hi @imriz

I tried this on 1.4.1, and it seems to be working correctly:

DELETE _all 

PUT t
{
  "settings": {
    "index.routing.allocation.total_shards_per_node": 2,
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}

GET _cat/shards

Returns:

t 0 r STARTED 0  79b 192.168.2.183 Paige Guthrie         
t 0 p STARTED 0 115b 192.168.2.183 Sangre                
t 1 p STARTED 0 115b 192.168.2.183 Paige Guthrie         
t 1 r STARTED 0  79b 192.168.2.183 Herbert Edgar Wyndham 
t 2 r STARTED 0  79b 192.168.2.183 Sangre                
t 2 p STARTED 0 115b 192.168.2.183 Herbert Edgar Wyndham 

Could you provide a recreation of the problem?

imriz (Author) commented Jan 13, 2015

So far, I haven't been able to reproduce this at will, but I can gather whatever debugging information is needed the next time it occurs.


imriz (Author) commented Jan 19, 2015

Hi @clintongormley,
This happened again:
logstash-app-2015.01.19 0 p STARTED 498246 259mb 10.84.30.188 elasticsearch003
logstash-app-2015.01.19 0 r STARTED 498413 256.5mb 10.84.30.184 elasticsearch002
logstash-app-2015.01.19 1 r STARTED 499403 248.1mb 10.84.30.188 elasticsearch003
logstash-app-2015.01.19 1 p STARTED 499298 236.8mb 10.84.30.184 elasticsearch002
logstash-app-2015.01.19 2 p STARTED 499272 245.5mb 10.84.30.186 elasticsearch001
logstash-app-2015.01.19 2 r UNASSIGNED

The only relevant logs are:
[2015-01-19 04:18:39,824][INFO ][cluster.metadata ] [elasticsearch001] [logstash-app-2015.01.19] creating index, cause [auto(bulk api)], shards [3]/[1], mappings [default]
[2015-01-19 04:18:40,204][INFO ][cluster.metadata ] [elasticsearch001] [logstash-app-2015.01.19] update_mapping logs
[2015-01-19 04:19:39,521][INFO ][cluster.metadata ] [elasticsearch001] [logstash-app-2015.01.19] update_mapping logs

clintongormley (Contributor) commented:

Hi @imriz

So these are definitely per-index settings that you are setting? Nothing in the per-node config files or cluster settings? Any other allocation-related settings that we should know about?

imriz (Author) commented Jan 19, 2015

Hi @clintongormley

The only thing remotely related is:
cluster:
  routing:
    allocation:
      disk:
        threshold_enabled: true
        watermark:
          high: 0.99
          low: 0.97

But we are nowhere near these thresholds, nor do I see how they could produce the allocation topology I got here.

I should emphasize that it looks completely random: an index created a few minutes later can end up with a perfectly sensible topology.
Moreover, if I increase total_shards_per_node to 3 and then reduce it, the cluster rebalances back to normal.

clintongormley (Contributor) commented:

Hi @imriz

A colleague has encountered this before and explained to me what is happening:

  • On elasticsearch003, it has assigned p0 and r1
  • On elasticsearch002, it has assigned p1 and r0
  • On elasticsearch001, it has assigned p2, but it can't assign r2, because it never puts copies of the same shard on the same node.

The only way out of this arrangement is to move one of the other shards to a different node, but the allocator isn't going to do that until something else triggers the move, e.g. a full disk, a node failure, etc.

While it goes to some trouble to randomize allocation so that this should rarely happen, it is possible (as you've seen) with hard limits like total_shards_per_node to reach situations where allocation can't proceed.
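
If you'd rather not touch total_shards_per_node, you can also trigger that move by hand with the cluster reroute API. A sketch, using the shard layout from your _cat/shards output above (node names as reported there):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "logstash-app-2015.01.19",
        "shard": 0,
        "from_node": "elasticsearch003",
        "to_node": "elasticsearch001"
      }
    }
  ]
}

Moving p0 off elasticsearch003 leaves that node with only r1, so it can then accept r2 without exceeding the limit of 2 shards per node.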

imriz (Author) commented Jan 21, 2015

Hi @clintongormley

Yep, that is what I described in my initial post :)

Just one small note: if total_shards_per_node is set to a higher value, the nodes will happily allocate the same shard to the same node.

I think the best approach would be a configuration flag to ensure this can never happen (the same shard on the same node), since the whole reason I've had to set total_shards_per_node to 2 is to prevent a node from recovering redundant shards when another node fails (which is very problematic when you don't have a lot of free disk space).

clintongormley (Contributor) commented:

Just one small note: if total_shards_per_node is set to a higher value, the nodes will happily allocate the same shard to the same node.

This is never the case. I think what happens when you increase total_shards_per_node is that you end up with 3 (different) shards on a single node, and then the allocator rebalances by moving one of the shards to a different node. But having two copies of the same shard on a single node will never be allowed.

the whole reason I've had to set total_shards_per_node to 2 is to prevent a node from recovering redundant shards when another node fails (which is very problematic when you don't have a lot of free disk space).

As above, this never happens.

imriz (Author) commented Jan 21, 2015

This is never the case. I think what happens when you increase total_shards_per_node is that you end up with 3 (different) shards on a single node, and then the allocator rebalances by moving one of the shards to a different node. But having two copies of the same shard on a single node will never be allowed.

You are, of course, correct.
Anyway, that leaves us with the original issue, where the allocator doesn't compute the optimal allocation scheme, which results in unassigned shards.

As above, this never happens.

Well, not the same shards, but indeed redundant shards.
In a cluster where storage is limited (for example, in ELK setups, where storage is sized to support a specific X-day sliding-window retention of logs), and where the cluster is designed to survive only one node failure (which I believe is a legitimate use case), one might be OK with not recovering replicas after a single node failure. If I set total_shards_per_node to a higher value, I risk eating up all the free disk space on each node failure, leading to an inability to index new logs.

bleskes (Contributor) commented Jan 21, 2015

@imriz did you try looking at disk-based allocation as a means to avoid assigning shards that would cause a node to run out of disk space? Where possible, it will give you the flexibility you need while avoiding filling up the disk. More info is here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html#disk . Note that you will have to be on 1.4.1 or higher for it to include the size of relocating shards (and prevent them): see #7785, which was backported in 4e5264c.
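
For reference, a minimal sketch of enabling those thresholds dynamically via the cluster settings API (the watermark values here are only illustrative, not recommendations):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}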

imriz (Author) commented Jan 22, 2015

@bleskes I am aware of the disk-based allocation thresholds, but they don't give me the functionality I need: I want to utilize the disk space as much as I can, I just don't want the nodes to recover any shards from another, failing node (that is, I only need to sustain one node failure).
Setting total_shards_per_node to 2 in combination with number_of_replicas set to 1 gives me exactly that, with the exception of these annoying unassigned shards every now and then.
This is definitely not a showstopper, but it would be nice to get the allocator improved so it always computes the optimal allocation scheme :)

bleskes (Contributor) commented Feb 13, 2015

@imriz sorry for the late response. I hear you, and you are right. The disk threshold allocator will give you some more flexibility here, if you have the space for it. Since we can't try all shard allocation combinations (think thousands of shards, hundreds of nodes), we have to use an iterative algorithm, and that might not find a global optimum. If you have any suggestions on how to improve this, I'll be very happy to discuss them.

bleskes added the stalled label and removed the discuss label on Feb 13, 2015

imriz (Author) commented Feb 13, 2015

@bleskes Maybe something like CRUSH maps could be of use?

bleskes (Contributor) commented Feb 13, 2015

@imriz thanks for the tip. Looks interesting, though it's not clear at first glance how removing the centralized nature (which is very good for other reasons) would help it deal with local minima.

imriz (Author) commented Feb 14, 2015

@bleskes CRUSH lets you define allocation rules, which could prevent the issue at hand. Twitter's libcrunch is a good place to look.


aochsner (Contributor) commented Apr 9, 2015

It's been a long time since I tried this approach (I was the author of the linked user group post). But when I talked to someone at ElasticON, they said it sounded like a bug and suggested creating an issue. I was about to do that and came across this one. Glad I'm not the only one.

I haven't gone through trying to recreate this on 1.4.4; we are currently on 1.4.2. We are running a 3-node cluster, 3 primaries, 1 replica, with daily rolling indexes. There are probably about 12 indexes per day. Of those, 2 are very large/hot. We really care that those are distributed evenly (we are on spinning disks and pretty constrained with memory, so anything we can do to balance load across the cluster, the better). In fact, we'd prefer everything to be distributed evenly. I've tried playing with cluster.routing.allocation.balance.* (see the sketch below), but that didn't seem to help. I meant to go back and see whether it got better, but I also saw issue #9023, which sounds like I might want to upgrade to >= 1.4.4 first. Anyway, that's all the background that led us to trying this setting.
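
For what it's worth, the balance knobs I was tweaking were roughly these (a sketch via the cluster settings API; the values are illustrative, not a recommendation):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.balance.shard": 0.45,
    "cluster.routing.allocation.balance.index": 0.55,
    "cluster.routing.allocation.balance.threshold": 1.0
  }
}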

Seems the issue is clear now? Or is there anything I can do to help recreate it?

clintongormley (Contributor) commented:

Closing this as a duplicate of #12273

lcawl added the :Distributed Indexing/Distributed label and removed the :Allocation label on Feb 13, 2018