
Mesh-federated WAN clusters do not reconnect on a secondary mesh-gateway outage #10132

Closed

dekimsey opened this issue Apr 27, 2021 · 4 comments
Labels: theme/connect, theme/mesh-gw, type/bug

Comments

dekimsey (Collaborator)
Overview of the Issue

If the secondary datacenter's mesh-gateway service experiences a full outage (including IP reassignment), the WAN loses its connectivity until the Consul service on the primary's mesh-gateway is restarted.

Reproduction Steps

Steps to reproduce this issue:

  1. Create two clusters, a primary and a secondary, with WAN federation working (see the config sketch after this list).
    1a. In my environment our gateways run on separate Consul nodes; this may or may not be relevant.
  2. Take down all secondary mesh-gateways and recreate them with new IPs.
  3. Observe that the WAN is unable to re-establish its connections, even though the secondary servers can still reach the primary's gateway directly without issue.
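
For context, the setup in step 1 corresponds roughly to the following server configuration (a minimal sketch assuming WAN federation over mesh gateways; the datacenter names and the gateway address are placeholders, not values from this environment):

    # Sketch only; all names and addresses are placeholders.

    # Primary DC servers:
    datacenter         = "dc1"
    primary_datacenter = "dc1"
    server             = true
    connect {
      enable_mesh_gateway_wan_federation = true
    }

    # Secondary DC servers:
    datacenter         = "dc2"
    primary_datacenter = "dc1"
    server             = true
    primary_gateways   = ["203.0.113.10:8443"]  # a primary mesh-gateway address
    connect {
      enable_mesh_gateway_wan_federation = true
    }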

In an attempt to fix it:

  1. Restart all secondary servers in an attempt to re-kickstart the bootstrap.
    In theory, we should have a partially connected WAN at this stage (at least, that's how step 1 worked during the initial configuration). However, the primary's Consul gateway still has the old IPs for all the secondary servers. Essentially, the new registration didn't clear/reset/poke the Consul data on the mesh-gateway.
  2. Restart the Consul agent on the mesh-gateway's node.
    The gateway now has all of the new secondary gateways' IPs.

Questions:

  1. What is the process for recovery when a secondary gateway experiences an outage?
  2. What is the process for recovery when the primary gateways experience an outage?
  3. Are there any operator commands that might be used on either the primary or secondary to force a re-bootstrap of the federated servers?
  4. Is a stable IP for the gateways a requirement (i.e., a load-balancer)? What happens when it changes? ("Stable" is just a word reality doesn't seem to care much for!)

Consul info for both Client and Server

Client info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 1
        services = 1
build:
        prerelease =
        revision = 3c1c2267
        version = 1.9.5
consul:
        acl = enabled
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 2753
        max_procs = 2
        os = linux
        version = go1.15.8
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 834
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 39575
        members = 322
        query_queue = 0
        query_time = 1
Server info
agent:
        check_monitors = 0
        check_ttls = 1
        checks = 2
        services = 2
build:
        prerelease =
        revision = 3c1c2267
        version = 1.9.5
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 2
        leader = false
        leader_addr = 10.70.255.185:8300
        server = true
raft:
        applied_index = 253196564
        commit_index = 253196564
        fsm_pending = 0
        last_contact = 37.349547ms
        last_log_index = 253196564
        last_log_term = 3077
        last_snapshot_index = 253194858
        last_snapshot_term = 3077
        latest_configuration = [{Suffrage:Voter ID:4f34087c-506b-ea61-62c2-f3cbabdfb790 Address:10.70.255.191:8300} {Suffrage:Voter ID:4555d830-c164-2326-a9a6-670ce975461e Address:10.70.255.200:8300} {Suffrage:Voter ID:4325030d-273d-3387-1e32-494d48af8522 Address:10.70.255.185:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 3077
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 1740
        max_procs = 2
        os = linux
        version = go1.15.8
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 834
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 39575
        members = 322
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 11719
        members = 6
        query_queue = 0
        query_time = 1

Operating system and Environment details

CentOS 7 systems. Primary datacenter on-premise, running on VMs. Secondary running in AWS, with servers in EC2 and mesh-gateways on ECS*.

  • Thus a high likelihood of IP replacement. There are no load-balancers in play here.

Log Fragments


The secondary datacenter's servers report a stream of failures to forward traffic (a small sample, but it's all the same thing):

[ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to ${secondary_server_ip}:8302: read tcp ${localserverip}:33164->${localgatewayip}:8443: read: connection reset by peer

Examining Envoy's clusters shows only the old secondary gateway IPs and none of the new ones:

$ curl -s localhost:19000/clusters | grep ${old_secondary_gateway_ip}

After restarting the Consul service on the mesh-gateway node, the new IPs are now listed and the WAN works.

@jsosulska added the theme/connect, theme/mesh-gw, and type/bug labels on May 3, 2021
boxofrad added a commit that referenced this issue Nov 8, 2021
Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances), the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
boxofrad (Contributor) commented Nov 9, 2021

Hi @dekimsey 👋🏻

Thanks for your thorough report; it made reproducing this a breeze! This is fixed in #11522, which will hopefully make it into our next patch releases.

In answer to your questions:

What is the process for recovery when a secondary gateway experiences an outage?

In the case described in your reproduction steps, when the new gateway comes up and registers itself, the leader in the secondary DC will send the new gateway's address, etc. to the primary DC as part of its anti-entropy process. At this point, the primary DC's gateway will reconfigure its proxy with the new address (this was the broken part).

Alternatively, you may want to run multiple gateway instances and avoid cycling them all at the same time, to prevent downtime.

What is the process for recovery when the primary gateways experience an outage?

If all gateways in the primary DC are unavailable, servers in the secondary DC will fall back to the addresses specified in their primary_gateways config option (re-bootstrapping). It's desirable, then, to either keep the primary gateway IPs stable, avoid cycling all of the gateways at the same time, or use a DNS name or go-discover string.
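
For example, the fallback could look like either of these (a sketch; both values are placeholders, and the go-discover form assumes the primary gateways carry a matching cloud tag):

    # Secondary DC servers: fallback addresses used when re-bootstrapping.

    # A stable DNS name in front of the primary gateways:
    primary_gateways = ["primary-mesh-gateway.example.com:8443"]

    # Or a cloud auto-join (go-discover) string:
    primary_gateways = ["provider=aws region=us-east-1 tag_key=consul-mesh-gateway tag_value=primary"]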

In any case, for the purpose of re-bootstrapping, ingress traffic to the primary DC's gateways should be allowed from any WAN federated DC's servers.

Are there any operator commands that might be used on either the primary or secondary to force a re-bootstrap of the federated servers?

Once the secondary DC fails to replicate its local federation state to the primary three times, it will automatically start re-bootstrapping. There isn't a command to trigger this manually.

Is a stable IP for the gateways a requirement (i.e., a load-balancer)? What happens when it changes? ("Stable" is just a word reality doesn't seem to care much for!)

Generally, no, with the caveat that re-bootstrapping depends on the statically configured primary gateway addresses in primary_gateways.

Hope that helps! Let us know if you encounter any more problems with this.

dekimsey (Collaborator, Author) commented Nov 9, 2021

Thanks @boxofrad, that is helpful and good to know! I'm very glad to hear the report was helpful. I struggled mightily at the time trying to suss out what was going on!

dekimsey (Collaborator, Author) commented Feb 6, 2023

FYI, I believe this issue is resolved.

I haven't seen it since the fixed version. IIRC, we were able to use the on-disk server configuration to switch the permanent IPs over to load-balancers.

david-yu (Contributor) commented Feb 6, 2023

Thanks, will go ahead and close the issue @dekimsey.
