
Mesh-federated WAN clusters do not reconnect on a secondary mesh-gateway outage #10132

Closed

dekimsey opened this issue Apr 27, 2021 · 4 comments
Labels: theme/connect, theme/mesh-gw, type/bug

Comments

dekimsey (Collaborator)
Overview of the Issue

If the secondary datacenter's mesh-gateway service experiences a full outage (including IP reassignment), the WAN loses its connectivity until the Consul service on the primary's mesh-gateway is restarted.

Reproduction Steps

Steps to reproduce this issue:

  1. Create two clusters, a primary and a secondary, with WAN federation working (see the config sketch after this list).
    1a. In my environment our gateways run on separate Consul nodes; this may or may not be relevant.
  2. Take down all secondary mesh-gateways and recreate them with new IPs.
  3. Observe that the WAN is unable to re-establish its connections, even though the secondary servers can still reach the primary's gateway directly without issue.
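
For context, the setup in step 1 corresponds roughly to the following server configuration (a minimal sketch assuming WAN federation over mesh gateways; the datacenter names and the gateway address are placeholders, not values from this environment):

    # Sketch only; all names and addresses are placeholders.

    # Primary DC servers:
    datacenter         = "dc1"
    primary_datacenter = "dc1"
    server             = true
    connect {
      enable_mesh_gateway_wan_federation = true
    }

    # Secondary DC servers:
    datacenter         = "dc2"
    primary_datacenter = "dc1"
    server             = true
    primary_gateways   = ["203.0.113.10:8443"]  # a primary mesh-gateway address
    connect {
      enable_mesh_gateway_wan_federation = true
    }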

In an attempt to fix it:

  1. Restart all secondary servers in an attempt to re-kickstart the bootstrap.
    In theory, we should have a partially connected WAN at this stage (at least, that's how step 1 worked during the initial configuration). However, the primary's Consul gateway still has the old IPs for all the secondary servers. Essentially, the new registration didn't clear/reset/poke the Consul data on the mesh-gateway.
  2. Restart the Consul agent on the mesh-gateway's node.
    The gateway now has all of the new secondary gateways' IPs.

Questions:

  1. What is the process for recovery when a secondary gateway experiences an outage?
  2. What is the process for recovery when the primary gateways experience an outage?
  3. Are there any operator commands that might be used on either the primary or secondary to force a re-bootstrap of the federated servers?
  4. Is a stable IP for the gateways a requirement (i.e., a load-balancer)? What happens when it changes? ("Stable" is just a word reality doesn't seem to care much for!)

Consul info for both Client and Server

Client info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 1
        services = 1
build:
        prerelease =
        revision = 3c1c2267
        version = 1.9.5
consul:
        acl = enabled
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 2753
        max_procs = 2
        os = linux
        version = go1.15.8
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 834
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 39575
        members = 322
        query_queue = 0
        query_time = 1
Server info
agent:
        check_monitors = 0
        check_ttls = 1
        checks = 2
        services = 2
build:
        prerelease =
        revision = 3c1c2267
        version = 1.9.5
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 2
        leader = false
        leader_addr = 10.70.255.185:8300
        server = true
raft:
        applied_index = 253196564
        commit_index = 253196564
        fsm_pending = 0
        last_contact = 37.349547ms
        last_log_index = 253196564
        last_log_term = 3077
        last_snapshot_index = 253194858
        last_snapshot_term = 3077
        latest_configuration = [{Suffrage:Voter ID:4f34087c-506b-ea61-62c2-f3cbabdfb790 Address:10.70.255.191:8300} {Suffrage:Voter ID:4555d830-c164-2326-a9a6-670ce975461e Address:10.70.255.200:8300} {Suffrage:Voter ID:4325030d-273d-3387-1e32-494d48af8522 Address:10.70.255.185:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 3077
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 1740
        max_procs = 2
        os = linux
        version = go1.15.8
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 834
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 39575
        members = 322
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 11719
        members = 6
        query_queue = 0
        query_time = 1

Operating system and Environment details

CentOS 7 systems. Primary datacenter on-premise, running on VMs. Secondary running in AWS, with servers in EC2 and mesh-gateways on ECS*.

  • Thus a high likelihood of IP replacement. There are no load-balancers in play here.

Log Fragments


The secondary datacenter's servers report a stream of failures to forward traffic (a small sample, but it's all the same thing):

[ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to ${secondary_server_ip}:8302: read tcp ${localserverip}:33164->${localgatewayip}:8443: read: connection reset by peer

Examining Envoy's clusters shows only the old secondary gateway IPs and none of the new ones:

$ curl -s localhost:19000/clusters | grep ${old_secondary_gateway_ip}

After restarting the Consul service on the mesh-gateway node, the new IPs are now listed and the WAN works.

@jsosulska added the theme/connect, theme/mesh-gw, and type/bug labels on May 3, 2021
boxofrad added a commit that referenced this issue Nov 8, 2021
Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances), the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
boxofrad (Contributor) commented Nov 9, 2021

Hi @dekimsey 👋🏻

Thanks for your thorough report; it made reproducing this a breeze! This is fixed in #11522, which will hopefully make it into our next patch releases.

In answer to your questions:

What is the process for recovery when a secondary gateway experiences an outage?

In the case described in your reproduction steps, when the new gateway comes up and registers itself, the leader in the secondary DC will send the new gateway's address, etc. to the primary DC as part of its anti-entropy process. At this point, the primary DC's gateway will reconfigure its proxy with the new address (this was the broken part).

Alternatively, you may want to run multiple gateway instances and avoid cycling them all at the same time, to prevent downtime.

What is the process for recovery when the primary gateways experience an outage?

If all gateways in the primary DC are unavailable, servers in the secondary DC will fall back to the addresses specified in their primary_gateways config option (re-bootstrapping). It's desirable, then, to either keep the primary gateway IPs stable, avoid cycling all of the gateways at the same time, or use a DNS name or go-discover string.
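
For example, the fallback could look like either of these (a sketch; both values are placeholders, and the go-discover form assumes the primary gateways carry a matching cloud tag):

    # Secondary DC servers: fallback addresses used when re-bootstrapping.

    # A stable DNS name in front of the primary gateways:
    primary_gateways = ["primary-mesh-gateway.example.com:8443"]

    # Or a cloud auto-join (go-discover) string:
    primary_gateways = ["provider=aws region=us-east-1 tag_key=consul-mesh-gateway tag_value=primary"]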

In any case, for the purpose of re-bootstrapping, ingress traffic to the primary DC's gateways should be allowed from any WAN federated DC's servers.

Are there any operator commands that might be used on either the primary or secondary to force a re-bootstrap of the federated servers?

Once the secondary DC fails to replicate its local federation state to the primary three times, it will automatically start re-bootstrapping. There isn't a command to trigger this manually.

Is a stable IP for the gateways a requirement (i.e., a load-balancer)? What happens when it changes? ("Stable" is just a word reality doesn't seem to care much for!)

Generally, no, with the caveat that re-bootstrapping depends on the statically configured primary gateway addresses in primary_gateways.

Hope that helps! Let us know if you encounter any more problems with this.

dekimsey (Collaborator, Author) commented Nov 9, 2021

Thanks @boxofrad, that is helpful and good to know! I'm very glad to hear the report was helpful. I struggled mightily at the time trying to suss out what was going on!

dekimsey (Collaborator, Author) commented Feb 6, 2023

FYI, I believe this issue is resolved.

I haven't seen it since the fixed version. IIRC, we were able to use the on-disk server configuration to switch the permanent IPs over to load-balancers.

david-yu (Contributor) commented Feb 6, 2023

Thanks, will go ahead and close the issue @dekimsey.
