
Method to remove decommissioned datacenter from catalog #5881

Closed
robn opened this issue May 22, 2019 · 7 comments
Labels
type/enhancement Proposed improvement or new feature

Comments

@robn

robn commented May 22, 2019

Feature Description

Some method to remove a decommissioned datacenter from the catalog. I suggest something like consul operator raft remove-peer for the WAN pool.

Use Case(s)

When a datacenter is decommissioned, all Consul nodes are shut down, so its server pool is effectively destroyed. In other datacenters, those nodes still appear in the WAN pool in the "left" state, waiting for the reconnect_timeout_wan time period to pass after which they would be cleaned up.

Since those nodes still "exist", the datacenter is still listed in the catalog. Any tool that wants to operate across datacenters (as many of our internal tools do) pulls this list and then tries to execute an operation in each datacenter. The operation for the decommissioned datacenter fails with 500 Internal Server Error: No path to datacenter, which is correct but not helpful, because you can't programmatically distinguish a legitimately unavailable datacenter from a genuine quorum loss or something deeper (network damage).

When I decommissioned a datacenter last week, I looked for an equivalent of consul operator raft remove-peer for the WAN pool, but didn't find one. Something like that is probably all that's needed, since this is such a rare situation and automatically handling it is likely complicated.
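To make the failure mode concrete, here is a rough sketch (mine, not from the issue) of the cross-datacenter tooling pattern described above: enumerate datacenters from the catalog, query each one, and classify the "No path to datacenter" error separately. The `/v1/catalog/datacenters` and `/v1/catalog/services` endpoints are real Consul HTTP API paths; the agent address, function names, and error-handling policy are illustrative assumptions.

```python
import json
import urllib.error
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumed local agent address


def is_unreachable_dc(status, body):
    """Heuristically detect the error this issue describes: Consul
    returns a 500 containing "No path to datacenter" when no servers
    for the target DC are reachable over the WAN pool."""
    return status == 500 and "No path to datacenter" in body


def services_in(dc):
    """Return the service map for `dc`, or None if the DC is unreachable."""
    url = f"{CONSUL}/v1/catalog/services?dc={dc}"
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if is_unreachable_dc(err.code, err.read().decode()):
            # Still ambiguous: a decommissioned DC looks the same as a
            # genuine quorum or network loss -- which is exactly the
            # problem reported in this issue.
            return None
        raise
```

The point of the sketch is that the string match in `is_unreachable_dc` is the only signal available to tooling, and it cannot tell "deliberately gone" from "broken".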

@pierresouchay
Contributor

@robn fully agree on this.
I recently hit a similar issue when a DC was not accessible: all tools "discovering" DCs with /v1/catalog/datacenters had trouble discovering services, and we had various outages.

Having a way to de-provision/disconnect a remote DC would be a great additional feature.

@banks
Member

banks commented Aug 16, 2019

We already have work in progress on force-reaping Serf members, which I think is exactly the same issue as this one, just for the WAN pool rather than the LAN.

There is an internal RFC being worked on currently that should address that.

This isn't a dupe, as the use case is different, but it will hopefully be fixed by the same change as #2981.

@schristoff
Contributor

This should be resolved with #6582 🤞

@hanshasselberg
Member

I think this issue should be fixed since #6420 was merged. Left servers no longer appear, and thus inaccessible datacenters are no longer in the catalog.

@hanshasselberg
Member

Ok, I tested it on master and this is what happened:

$ consul members -wan
Node    Address          Status  Type    Build  Protocol  DC   Segment
s1.dc1  127.0.0.1:8701   alive   server  1.6.1  2         dc1  <all>
s1.dc2  127.0.0.1:8702   alive   server  1.6.1  2         dc2  <all>
s2.dc1  127.0.0.1:40000  alive   server  1.6.1  2         dc1  <all>
s2.dc2  127.0.0.1:40107  alive   server  1.6.1  2         dc2  <all>
s3.dc1  127.0.0.1:40001  alive   server  1.6.1  2         dc1  <all>
s3.dc2  127.0.0.1:40108  alive   server  1.6.1  2         dc2  <all>
$ CONSUL_HTTP_ADDR=127.0.0.1:8501 consul leave
Graceful leave complete
$ CONSUL_HTTP_ADDR=127.0.0.1:30107 consul leave
Graceful leave complete
$ CONSUL_HTTP_ADDR=127.0.0.1:30108 consul leave
Graceful leave complete
$ consul members -wan
Node    Address          Status  Type    Build  Protocol  DC   Segment
s1.dc1  127.0.0.1:8701   alive   server  1.6.1  2         dc1  <all>
s1.dc2  127.0.0.1:8702   left    server  1.6.1  2         dc2  <all>
s2.dc1  127.0.0.1:40000  alive   server  1.6.1  2         dc1  <all>
s2.dc2  127.0.0.1:40107  left    server  1.6.1  2         dc2  <all>
s3.dc1  127.0.0.1:40001  alive   server  1.6.1  2         dc1  <all>
s3.dc2  127.0.0.1:40108  left    server  1.6.1  2         dc2  <all>
$ curl http://localhost:8500/v1/catalog/datacenters
["dc1"]

Before #6420, dc2 would still be in the catalog list, but now it no longer is. Does that fix your issue?
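For completeness, the check in the transcript above can be automated with a small polling helper. This is a hypothetical sketch (the function names are mine, not Consul's): the fetch function is injected so the logic is testable without a live agent, but in practice it would wrap GET /v1/catalog/datacenters, as in the curl call above.

```python
import time


def wait_for_dc_removal(fetch_datacenters, dc, attempts=10, delay=1.0):
    """Poll the catalog's datacenter list until `dc` disappears.

    `fetch_datacenters` is a zero-argument callable returning the list
    of datacenter names (e.g. the parsed JSON body of
    GET /v1/catalog/datacenters). Returns True once `dc` is gone,
    False if it is still listed after `attempts` polls.
    """
    for _ in range(attempts):
        if dc not in fetch_datacenters():
            return True
        time.sleep(delay)
    return False
```

After a graceful `consul leave` on every server in the decommissioned DC, `wait_for_dc_removal(..., "dc2")` should return True on a post-#6420 build, matching the `["dc1"]` result shown above.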

@robn
Author

robn commented Nov 19, 2019

Based on your output, yes, that will definitely take care of it.

I do not intend to be decommissioning another datacentre within the next five years, if ever, but I am glad to know I will not run into this again! 😉

Thanks!

@robn robn closed this as completed Nov 19, 2019
@ghost

ghost commented Jan 25, 2020

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one 👍.

@ghost ghost locked and limited conversation to collaborators Jan 25, 2020