
Method to remove decommissioned datacenter from catalog #5881

Closed
robn opened this issue May 22, 2019 · 7 comments
Labels
type/enhancement Proposed improvement or new feature

Comments

@robn

robn commented May 22, 2019

Feature Description

Some method to remove a decommissioned datacenter from the catalog. I suggest something like consul operator raft remove-peer for the WAN pool.

Use Case(s)

When a datacenter is decommissioned, all Consul nodes are shut down, so its server pool is effectively destroyed. In other datacenters, those nodes still appear in the WAN pool in the "left" state, waiting for the reconnect_timeout_wan time period to pass after which they would be cleaned up.

Since those nodes still "exist", the datacenter is still listed in the catalog. Any tool that wants to operate across datacenters (as many of our internal tools do) pulls this list and then tries to execute an operation in each datacenter. The operation for the decommissioned datacenter fails with 500 Internal Server Error: No path to datacenter, which is correct but not helpful, because you can't programmatically distinguish a legitimately unavailable datacenter from a genuine quorum loss or something deeper (network damage).

When I decommissioned a datacenter last week, I looked for an equivalent of consul operator raft remove-peer for the WAN pool, but didn't find one. Something like that is probably all that's needed, since this is such a rare situation and automatically handling it is likely complicated.
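To make the failure mode concrete, here is a rough sketch (mine, not from the issue) of the cross-datacenter tooling pattern described above: enumerate datacenters from the catalog, query each one, and classify the "No path to datacenter" error separately. The `/v1/catalog/datacenters` and `/v1/catalog/services` endpoints are real Consul HTTP API paths; the agent address, function names, and error-handling policy are illustrative assumptions.

```python
import json
import urllib.error
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumed local agent address


def is_unreachable_dc(status, body):
    """Heuristically detect the error this issue describes: Consul
    returns a 500 containing "No path to datacenter" when no servers
    for the target DC are reachable over the WAN pool."""
    return status == 500 and "No path to datacenter" in body


def services_in(dc):
    """Return the service map for `dc`, or None if the DC is unreachable."""
    url = f"{CONSUL}/v1/catalog/services?dc={dc}"
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if is_unreachable_dc(err.code, err.read().decode()):
            # Still ambiguous: a decommissioned DC looks the same as a
            # genuine quorum or network loss -- which is exactly the
            # problem reported in this issue.
            return None
        raise
```

The point of the sketch is that the string match in `is_unreachable_dc` is the only signal available to tooling, and it cannot tell "deliberately gone" from "broken".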

@pierresouchay
Contributor

@robn fully agree on this.
I recently hit a similar issue when a DC was not accessible: all tools "discovering" DCs with /v1/catalog/datacenters had trouble discovering services, and we had various outages.

Having a way to de-provision/disconnect a remote DC would be a great additional feature.

@banks
Member

banks commented Aug 16, 2019

We already have work in progress on force-reaping Serf members, which I think is exactly the same issue as this one, just for the WAN pool rather than the LAN.

There is an internal RFC being worked on currently that should address that.

This isn't a dupe, as the use case is different, but it will hopefully be fixed by the same change as #2981.

@schristoff
Contributor

This should be resolved with #6582 🤞

@hanshasselberg
Member

I think this issue should be fixed since #6420 was merged. Left servers no longer appear, and thus inaccessible datacenters are no longer in the catalog.

@hanshasselberg
Member

Ok, I tested it on master and this is what happened:

$ consul members -wan
Node    Address          Status  Type    Build  Protocol  DC   Segment
s1.dc1  127.0.0.1:8701   alive   server  1.6.1  2         dc1  <all>
s1.dc2  127.0.0.1:8702   alive   server  1.6.1  2         dc2  <all>
s2.dc1  127.0.0.1:40000  alive   server  1.6.1  2         dc1  <all>
s2.dc2  127.0.0.1:40107  alive   server  1.6.1  2         dc2  <all>
s3.dc1  127.0.0.1:40001  alive   server  1.6.1  2         dc1  <all>
s3.dc2  127.0.0.1:40108  alive   server  1.6.1  2         dc2  <all>
$ CONSUL_HTTP_ADDR=127.0.0.1:8501 consul leave
Graceful leave complete
$ CONSUL_HTTP_ADDR=127.0.0.1:30107 consul leave
Graceful leave complete
$ CONSUL_HTTP_ADDR=127.0.0.1:30108 consul leave
Graceful leave complete
$ consul members -wan
Node    Address          Status  Type    Build  Protocol  DC   Segment
s1.dc1  127.0.0.1:8701   alive   server  1.6.1  2         dc1  <all>
s1.dc2  127.0.0.1:8702   left    server  1.6.1  2         dc2  <all>
s2.dc1  127.0.0.1:40000  alive   server  1.6.1  2         dc1  <all>
s2.dc2  127.0.0.1:40107  left    server  1.6.1  2         dc2  <all>
s3.dc1  127.0.0.1:40001  alive   server  1.6.1  2         dc1  <all>
s3.dc2  127.0.0.1:40108  left    server  1.6.1  2         dc2  <all>
$ curl http://localhost:8500/v1/catalog/datacenters
["dc1"]

Before #6420, dc2 would still be in the catalog list, but now it no longer is. Does that fix your issue?
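For completeness, the check in the transcript above can be automated with a small polling helper. This is a hypothetical sketch (the function names are mine, not Consul's): the fetch function is injected so the logic is testable without a live agent, but in practice it would wrap GET /v1/catalog/datacenters, as in the curl call above.

```python
import time


def wait_for_dc_removal(fetch_datacenters, dc, attempts=10, delay=1.0):
    """Poll the catalog's datacenter list until `dc` disappears.

    `fetch_datacenters` is a zero-argument callable returning the list
    of datacenter names (e.g. the parsed JSON body of
    GET /v1/catalog/datacenters). Returns True once `dc` is gone,
    False if it is still listed after `attempts` polls.
    """
    for _ in range(attempts):
        if dc not in fetch_datacenters():
            return True
        time.sleep(delay)
    return False
```

After a graceful `consul leave` on every server in the decommissioned DC, `wait_for_dc_removal(..., "dc2")` should return True on a post-#6420 build, matching the `["dc1"]` result shown above.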

@robn
Author

robn commented Nov 19, 2019

Based on your output, yes, that will definitely take care of it.

I do not intend to be decommissioning another datacentre within the next five years, if ever, but I am glad to know I will not run into this again! 😉

Thanks!

@robn robn closed this as completed Nov 19, 2019
@ghost

ghost commented Jan 25, 2020

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one 👍.

@ghost ghost locked and limited conversation to collaborators Jan 25, 2020