mesh-federated WAN clusters do not reconnect on a secondary mesh-gateway outage #10132
Comments
Fixes an issue described in #10132, where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted. This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable.
Hi @dekimsey 👋🏻 Thanks for your thorough report, it made reproducing this a breeze! This is fixed in #11522, which will hopefully make it into our next patch releases. In answer to your questions:
In the case described in your reproduction steps, when the new gateway comes up and registers itself, the leader in the secondary DC will send the new gateway's address, etc. to the primary DC as part of its anti-entropy process. At this point, the primary DC's gateway will reconfigure its proxy with the new address (this was the broken part). Alternatively, you may want to run multiple gateway instances and avoid cycling them at the same time, to avoid downtime.
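Running multiple gateway instances, as suggested above, amounts to registering the mesh gateway service on more than one node. A minimal sketch using Consul's `consul connect envoy` command (the addresses are illustrative placeholders, not values from this thread):

```shell
# Run an additional mesh gateway instance on a second node so that one
# instance keeps serving while the other is cycled. Addresses below are
# placeholders; substitute your node's LAN and WAN-reachable addresses.
consul connect envoy -gateway=mesh -register \
  -service "mesh-gateway" \
  -address "10.0.0.11:8443" \
  -wan-address "203.0.113.11:8443"
```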
If all gateways in the primary DC are unavailable, servers in the secondary DC will fall back to the primary gateway addresses specified in their static configuration. In any case, for the purpose of re-bootstrapping, ingress traffic to the primary DC's gateways should be allowed from any WAN federated DC's servers.
Once the secondary DC fails to replicate its local federation state to the primary three times it will automatically start the process of re-bootstrapping. There isn't a command to manually trigger this.
Generally, no. With the caveat that re-bootstrapping depends on the statically-configured primary gateway addresses in the agent configuration. Hope that helps! Let us know if you encounter any more problems with this.
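For reference, the statically-configured primary gateway addresses mentioned above can be supplied when starting a secondary DC's server agent. A sketch assuming the `-primary-gateway` flag (repeatable for multiple addresses; all values are placeholders):

```shell
# Secondary-DC server agent pointing at the primary DC's mesh gateways.
# The -primary-gateway flag may be repeated; addresses are placeholders.
consul agent -server \
  -datacenter "dc2" \
  -primary-gateway "203.0.113.10:8443" \
  -primary-gateway "203.0.113.11:8443" \
  -data-dir /opt/consul
```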
Thanks @boxofrad, that is helpful and good to know! I'm very glad to hear the report was helpful. I struggled mightily at the time trying to suss out what was going on!
The fix was backported in #11532 and #11534.
FYI, I believe this issue is resolved. I haven't seen it since the fixed version. IIRC we were able to use the on-disk server configuration to change permanent IPs over to load balancers.
Thanks @dekimsey, will go ahead and close the issue.
Overview of the Issue
If the secondary datacenter's mesh-gateway service experiences a full outage (including IP reassignment), the WAN loses its connectivity until the primary's mesh-gateway's Consul service is restarted.
Reproduction Steps
Steps to reproduce this issue, e.g.:
1a. In my environment our gateways are running on separate Consul nodes; this may or may not be relevant.
In an attempt to fix:
In theory, we should have a partially connected WAN at this stage (at least that's how step 1 worked during the initial configuration). However, observe that the primary's Consul gateway still has the old IPs for all the secondary servers. Essentially, the initial registration didn't clear/reset/poke the Consul data on the mesh-gateway.
Gateway now has all the new secondary gateway's IPs.
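While reproducing this, one way to check which servers each side currently sees over the WAN is Consul's WAN member listing (exact output depends on the cluster):

```shell
# List WAN serf pool members as seen from this server agent. Entries for
# the remote DC's servers stuck in a failed/left state indicate the
# federation link is broken.
consul members -wan
```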
Questions:
Consul info for both Client and Server
Client info
Server info
Operating system and Environment details
CentOS 7 systems. Primary datacenter on premise running on VMs. Secondary running in AWS, servers in EC2 and mesh-gateways on ECS*.
Log Fragments
Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist of the log instead of posting it in the issue. Use
-log-level=TRACE
on the client and server to capture the maximum log detail.

Secondary datacenters report a stream of being unable to forward traffic (small sample, but it's all the same thing):
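Trace-level output can also be streamed from an already-running agent without restarting it, using Consul's `monitor` subcommand:

```shell
# Stream trace-level logs from the local running agent; no restart needed.
consul monitor -log-level=trace
```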
Examining envoy's clusters shows only the old secondary gateway IPs listed and none of the new ones:
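The cluster membership referenced above is typically pulled from Envoy's admin interface. A sketch assuming the gateway's admin endpoint is bound to the default localhost:19000 (adjust if the proxy was started with a different admin bind):

```shell
# Dump Envoy's current upstream cluster membership from the admin endpoint
# and filter for the mesh gateway clusters. Port 19000 is assumed here.
curl -s http://localhost:19000/clusters | grep -i "mesh-gateway"
```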
After restarting the Consul service on the mesh-gateway node, the new IPs are now listed and the WAN works.