
total_shards_per_node may lead to unassigned shards #9248

Closed
imriz opened this issue Jan 12, 2015 · 17 comments
Labels: >bug, :Distributed Indexing/Distributed, stalled

Comments

imriz commented Jan 12, 2015

Hi,
I have 3 nodes (version 1.4.1), and my index settings are as follows:
"index.routing.allocation.total_shards_per_node": 2
"index.number_of_shards": 3
"index.number_of_replicas": "1"

Sometimes, the shards get allocated as follows (p denotes primary, r denotes replica):
Node1: 0p,1r
Node2: 0r,1p
Node3: 2p

This leaves 2r unassigned.
If I raise total_shards_per_node to 3, the cluster will start recovering the unassigned shard (2r).
If I lower total_shards_per_node to 2 after the recovery has finished, it will reallocate the shards correctly.
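
Concretely, that workaround is just two dynamic index settings updates (my-index below is a placeholder for the affected index):

PUT /my-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 3
}

Once the unassigned replica has recovered:

PUT /my-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}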

It seems I am not the only one seeing this:
https://groups.google.com/forum/#!topic/elasticsearch/GZnamxaaj0g

clintongormley (Contributor) commented:

Hi @imriz

I tried this on 1.4.1, and it seems to be working correctly:

DELETE _all 

PUT t
{
  "settings": {
    "index.routing.allocation.total_shards_per_node": 2,
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}

GET _cat/shards

Returns:

t 0 r STARTED 0  79b 192.168.2.183 Paige Guthrie         
t 0 p STARTED 0 115b 192.168.2.183 Sangre                
t 1 p STARTED 0 115b 192.168.2.183 Paige Guthrie         
t 1 r STARTED 0  79b 192.168.2.183 Herbert Edgar Wyndham 
t 2 r STARTED 0  79b 192.168.2.183 Sangre                
t 2 p STARTED 0 115b 192.168.2.183 Herbert Edgar Wyndham 

Could you provide a recreation of the problem?

imriz (Author) commented Jan 13, 2015

So far, I haven't been able to reproduce this at will, but I can gather whatever debugging information is needed the next time it occurs.


imriz (Author) commented Jan 19, 2015

Hi @clintongormley,
This happened again:
logstash-app-2015.01.19 0 p STARTED 498246 259mb 10.84.30.188 elasticsearch003
logstash-app-2015.01.19 0 r STARTED 498413 256.5mb 10.84.30.184 elasticsearch002
logstash-app-2015.01.19 1 r STARTED 499403 248.1mb 10.84.30.188 elasticsearch003
logstash-app-2015.01.19 1 p STARTED 499298 236.8mb 10.84.30.184 elasticsearch002
logstash-app-2015.01.19 2 p STARTED 499272 245.5mb 10.84.30.186 elasticsearch001
logstash-app-2015.01.19 2 r UNASSIGNED

The only relevant logs are:
[2015-01-19 04:18:39,824][INFO ][cluster.metadata ] [elasticsearch001] [logstash-app-2015.01.19] creating index, cause [auto(bulk api)], shards [3]/[1], mappings [default]
[2015-01-19 04:18:40,204][INFO ][cluster.metadata ] [elasticsearch001] [logstash-app-2015.01.19] update_mapping logs
[2015-01-19 04:19:39,521][INFO ][cluster.metadata ] [elasticsearch001] [logstash-app-2015.01.19] update_mapping logs

clintongormley (Contributor) commented:

Hi @imriz

So these are definitely per-index settings that you are setting? Nothing in the per-node config files or cluster settings? Any other allocation-related settings that we should know about?

imriz (Author) commented Jan 19, 2015

Hi @clintongormley

The only thing remotely related is:
cluster:
  routing:
    allocation:
      disk:
        threshold_enabled: true
        watermark:
          high: 0.99
          low: 0.97

But we are nowhere near these thresholds, nor do I see how they could produce the allocation topology I got here.

I should emphasize that it looks completely random: an index created a few minutes later can end up with a perfectly sensible topology.
Moreover, if I increase total_shards_per_node to 3 and then reduce it, the cluster rebalances back to normal.

clintongormley (Contributor) commented:

Hi @imriz

A colleague has encountered this before and explained to me what is happening:

  • On elasticsearch003, it has assigned p0 and r1
  • On elasticsearch002, it has assigned p1 and r0
  • On elasticsearch001, it has assigned p2, but it can't assign r2, because it never puts copies of the same shard on the same node.

The only way out of this arrangement is to move one of the other shards to a different node, but the allocator isn't going to do that until something else triggers the move, e.g. a full disk, a node failure, etc.

While it goes to some trouble to randomize allocation so that this should rarely happen, it is possible (as you've seen) with hard limits like total_shards_per_node to reach situations where allocation can't proceed.
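
If you'd rather not touch total_shards_per_node, you can also trigger that move by hand with the cluster reroute API. A sketch, using the shard layout from your _cat/shards output above (node names as reported there):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "logstash-app-2015.01.19",
        "shard": 0,
        "from_node": "elasticsearch003",
        "to_node": "elasticsearch001"
      }
    }
  ]
}

Moving p0 off elasticsearch003 leaves that node with only r1, so it can then accept r2 without exceeding the limit of 2 shards per node.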

imriz (Author) commented Jan 21, 2015

Hi @clintongormley

Yep, that is what I described in my initial post :)

Just one small note: if total_shards_per_node is set to a higher value, the nodes will happily allocate the same shard to the same node.

I think the best approach would be a configuration flag to ensure this can never happen (the same shard on the same node), since the whole reason I've had to set total_shards_per_node to 2 is to prevent a node from recovering redundant shards when another node fails (which is very problematic when you don't have a lot of free disk space).

clintongormley (Contributor) commented:

Just one small note: if total_shards_per_node is set to a higher value, the nodes will happily allocate the same shard to the same node.

This is never the case. I think what happens when you increase total_shards_per_node is that you end up with 3 (different) shards on a single node, and then the allocator rebalances by moving one of the shards to a different node. But having two copies of the same shard on a single node will never be allowed.

the whole reason I've had to set total_shards_per_node to 2 is to prevent a node from recovering redundant shards when another node fails (which is very problematic when you don't have a lot of free disk space).

As above, this never happens.

imriz (Author) commented Jan 21, 2015

This is never the case. I think what happens when you increase total_shards_per_node is that you end up with 3 (different) shards on a single node, and then the allocator rebalances by moving one of the shards to a different node. But having two copies of the same shard on a single node will never be allowed.

You are, of course, correct.
Anyway, that leaves us with the original issue, where the allocator doesn't compute the optimal allocation scheme, which results in unassigned shards.

As above, this never happens.

Well, not the same shards, but indeed redundant shards.
In a cluster where storage is limited (for example, in ELK setups, where storage is sized to support a specific X-day sliding-window retention of logs), and where the cluster is designed to survive only one node failure (which I believe is a legitimate use case), one might be OK with not recovering replicas after a single node failure. If I set total_shards_per_node to a higher value, I risk eating up all the free disk space on each node failure, leading to an inability to index new logs.

bleskes (Contributor) commented Jan 21, 2015

@imriz did you try looking at disk-based allocation as a means to avoid assigning shards that would cause a node to run out of disk space? Where possible, it will give you the flexibility you need while avoiding filling up the disk. More info is here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html#disk . Note that you will have to be on 1.4.1 or higher for it to include the size of relocating shards (and prevent them): see #7785, which was backported in 4e5264c.
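
For reference, a minimal sketch of enabling those thresholds dynamically via the cluster settings API (the watermark values here are only illustrative, not recommendations):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}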

imriz (Author) commented Jan 22, 2015

@bleskes I am aware of the disk-based allocation thresholds, but they don't give me the functionality I need: I want to utilize the disk space as much as I can, I just don't want the nodes to recover any shards from another, failing node (that is, I only need to sustain one node failure).
Setting total_shards_per_node to 2 in combination with number_of_replicas set to 1 gives me exactly that, with the exception of these annoying unassigned shards every now and then.
This is definitely not a showstopper, but it would be nice to get the allocator improved so it always computes the optimal allocation scheme :)

bleskes (Contributor) commented Feb 13, 2015

@imriz sorry for the late response. I hear you, and you are right. The disk threshold allocator will give you some more flexibility here, if you have the space for it. Since we can't try all shard allocation combinations (think thousands of shards, hundreds of nodes), we have to use an iterative algorithm, and that might not find a global optimum. If you have any suggestions on how to improve this, I'll be very happy to discuss them.

bleskes added the stalled label and removed the discuss label on Feb 13, 2015

imriz (Author) commented Feb 13, 2015

@bleskes Maybe something like CRUSH maps could be of use?

bleskes (Contributor) commented Feb 13, 2015

@imriz thanks for the tip. Looks interesting, though it's not clear at first glance how removing the centralized nature (which is very good for other reasons) would help it deal with local minima.

imriz (Author) commented Feb 14, 2015

@bleskes CRUSH lets you define allocation rules, which could prevent the issue at hand. Twitter's libcrunch is a good place to look.


aochsner (Contributor) commented Apr 9, 2015

It's been a long time since I tried this approach (I was the author of the linked user group post). But when I talked to someone at ElasticON, they said it sounded like a bug and suggested creating an issue. I was about to do that and came across this one. Glad I'm not the only one.

I haven't gone through trying to recreate this on 1.4.4; we are currently on 1.4.2. We are running a 3-node cluster, 3 primaries, 1 replica, with daily rolling indexes. There are probably about 12 indexes per day. Of those, 2 are very large/hot. We really care that those are distributed evenly (we are on spinning disks and pretty constrained with memory, so anything we can do to balance load across the cluster, the better). In fact, we'd prefer everything to be distributed evenly. I've tried playing with cluster.routing.allocation.balance.* (see the sketch below), but that didn't seem to help. I meant to go back and see whether it got better, but I also saw issue #9023, which sounds like I might want to upgrade to >= 1.4.4 first. Anyway, that's all the background that led us to trying this setting.
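
For what it's worth, the balance knobs I was tweaking were roughly these (a sketch via the cluster settings API; the values are illustrative, not a recommendation):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.balance.shard": 0.45,
    "cluster.routing.allocation.balance.index": 0.55,
    "cluster.routing.allocation.balance.threshold": 1.0
  }
}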

Seems the issue is clear now? Or is there anything I can do to help recreate it?

clintongormley (Contributor) commented:

Closing this as a duplicate of #12273

lcawl added the :Distributed Indexing/Distributed label and removed the :Allocation label on Feb 13, 2018