Egress IP across multi subnets #4385

Closed
robbo10 opened this issue Nov 8, 2022 · 11 comments
Labels
area/transit/egress: Issues or PRs related to Egress (SNAT for traffic egressing the cluster).
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@robbo10

robbo10 commented Nov 8, 2022

We have Antrea running on EKS, but when trying to use the EgressIP feature we are limited to having all the Nodes on the same subnet.

For availability purposes, we have Nodes in two or three subnets across different AZs per cluster.

We would like to use EgressIP in HA mode, which requires Antrea to support Nodes spread across multiple subnets within a cluster.

Thanks for all the work done on the project thus far! Everyone has been super helpful :)

robbo10 added the kind/feature label Nov 8, 2022
jianjuns added the area/transit/egress label Nov 10, 2022
@jianjuns
Contributor

Hi @robbo10, I'd like to understand your requirement better. The Egress feature does not really require all Nodes to be on the same subnet. Maybe you meant that you want a single Egress IP to be able to fail over from one Node subnet to another? That is indeed not supported.

@tnqn

@robbo10
Author

robbo10 commented Nov 11, 2022

@jianjuns - On our EKS clusters we use 3 subnets for the workers, 1 subnet per AZ. We ran into issues trying to get EgressIP working. The Antrea configuration was correct, but whenever we validated traffic that should have carried the SNAT IP, the source IP we saw was always that of the Node the Egress was assigned to rather than the EgressIP.

We had a troubleshooting session with @tnqn last week, and he was able to confirm that our issue is due to the workers not being in the same subnet, so the ARP does not succeed. @tnqn, does that sound correct? I'm sure there is some extra technical detail missing :)
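Editor's sketch of the constraint described above: an Egress IP can only be hosted by a Node whose subnet contains it. The node names and CIDRs are hypothetical, and the helper `nodes_sharing_subnet` is illustrative only; it is not part of Antrea.

```python
import ipaddress


def nodes_sharing_subnet(egress_ip, node_subnets):
    """Return the names of Nodes whose subnet contains egress_ip."""
    ip = ipaddress.ip_address(egress_ip)
    return [
        name
        for name, cidr in node_subnets.items()
        if ip in ipaddress.ip_network(cidr)
    ]


# Hypothetical layout: one worker subnet per AZ, as in the report above.
node_subnets = {
    "node-az1": "10.0.1.0/24",
    "node-az2": "10.0.2.0/24",
    "node-az3": "10.0.3.0/24",
}

# Only the Node in the first subnet could answer ARP for this address.
eligible = nodes_sharing_subnet("10.0.1.200", node_subnets)
print(eligible)
```

With this layout, an Egress IP taken from the first subnet has a single eligible Node, which is why a node selector spanning all three AZs does not give a usable failover target.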

@tnqn
Member

tnqn commented Nov 11, 2022

@robbo10 The Egress IP should be in the same subnet as the Node's IP, so if the node selector selects all Nodes, they need to be in the same subnet. Reminded by @jianjuns's comment, I wonder if you could limit the node selector to one AZ only and use Egress IPs from the subnet of that AZ. You could even have 3 externalIPPools, each of which selects only Nodes of one AZ and contains IPs in that AZ's subnet.
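Editor's sketch of the per-AZ pool layout suggested above, built as plain Python dictionaries that could be serialized to manifests. Everything concrete here is an assumption: the zone names, CIDRs, and pool names are hypothetical, the `apiVersion` may differ between Antrea releases, and `topology.kubernetes.io/zone` is assumed to be set on the Nodes.

```python
import ipaddress

# Well-known Kubernetes zone label, assumed to be present on the Nodes.
ZONE_LABEL = "topology.kubernetes.io/zone"


def external_ip_pool(name, zone, subnet, start, end):
    """Build an ExternalIPPool manifest limited to the Nodes of one zone."""
    net = ipaddress.ip_network(subnet)
    for addr in (start, end):
        # Refuse ranges that fall outside the zone's Node subnet.
        if ipaddress.ip_address(addr) not in net:
            raise ValueError(f"{addr} is outside {subnet}")
    return {
        "apiVersion": "crd.antrea.io/v1alpha2",  # assumption: check your release
        "kind": "ExternalIPPool",
        "metadata": {"name": name},
        "spec": {
            "ipRanges": [{"start": start, "end": end}],
            "nodeSelector": {"matchLabels": {ZONE_LABEL: zone}},
        },
    }


# Hypothetical zones: (Node subnet, first Egress IP, last Egress IP).
zones = {
    "us-east-1a": ("10.0.1.0/24", "10.0.1.200", "10.0.1.210"),
    "us-east-1b": ("10.0.2.0/24", "10.0.2.200", "10.0.2.210"),
    "us-east-1c": ("10.0.3.0/24", "10.0.3.200", "10.0.3.210"),
}

pools = [
    external_ip_pool(f"egress-pool-{zone}", zone, subnet, start, end)
    for zone, (subnet, start, end) in zones.items()
]
```

Each pool only hands out addresses that Nodes in its own zone can host, so an Egress bound to one pool stays within a single subnet.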

@robbo10
Author

robbo10 commented Nov 11, 2022

@tnqn - For HA purposes, we need to ensure that we don't have downtime for any products. In the worst case, if AZ1 went down, all namespaces whose Egress is tied to Nodes in that subnet would have an outage. Is it possible, as you mention, to have 3 externalIPPools and assign 3 Egress IPs to each namespace, one from each externalIPPool, so that if all Nodes within an AZ were to fail we would not bring down a bunch of applications?

Would that be a sensible way to ensure we have HA?

Thanks for the support :)

@tnqn
Member

tnqn commented Nov 15, 2022

I think I understand the requirement now, and I wonder whether two backup Egress IPs are really needed. The scenario is that one AZ goes down entirely, so would one backup be enough? If yes, I'm considering a secondaryEgressIP field (and a corresponding secondaryExternalIPPool field), which would take over the Egress traffic when all of the primary EgressIP's Nodes are unavailable. It may be helpful for static Egress as well, since it adds HA there too by tolerating the outage of one Egress Node. But I haven't thought through what it means from the implementation's perspective. I would first like to hear whether the use case and the API change make sense. @robbo10 @jianjuns @antoninbas
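Editor's sketch of the takeover rule proposed above, under loud assumptions: the secondaryEgressIP field is only an idea in this thread and is not an existing Antrea feature, and the function `pick_egress`, the node names, and the addresses are all hypothetical.

```python
def pick_egress(primary, secondary, healthy_nodes):
    """Choose the (egress_ip, node) pair that should carry Egress traffic.

    primary and secondary are (egress_ip, candidate_nodes) tuples. The
    secondary is used only when no primary candidate Node is healthy.
    Returns None when neither IP has a healthy candidate.
    """
    for egress_ip, candidates in (primary, secondary):
        live = sorted(set(candidates) & set(healthy_nodes))
        if live:
            # Deterministic choice for the sketch: first Node by name.
            return egress_ip, live[0]
    return None


# Hypothetical setup: the primary IP lives in AZ1, the secondary in AZ2.
primary = ("10.0.1.200", {"node-az1-a", "node-az1-b"})
secondary = ("10.0.2.200", {"node-az2-a"})

normal = pick_egress(primary, secondary, {"node-az1-a", "node-az1-b", "node-az2-a"})
az1_down = pick_egress(primary, secondary, {"node-az2-a"})
print(normal, az1_down)
```

The sketch only captures the ordering (primary first, secondary as a fallback). How a real implementation would detect Node health or move the address is left open, as it is in the comment above.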

@jianjuns
Contributor

I also feel that more than one Egress IP is the only way on AWS, where a subnet is confined to a single AZ.

@robbo10
Author

robbo10 commented Nov 17, 2022

@tnqn - With the approach you outlined above, would Nodes in an EKS cluster be able to sit in two AZs, so that the workers could be split across 2 subnets and EgressIP would still work?

Thanks

@robbo10
Author

robbo10 commented Nov 23, 2022

@tnqn - Just to clarify: as things stand, we can't assign multiple Egress IPs to a namespace to solve the multi-AZ subnet problem?

Thanks

@tnqn
Member

tnqn commented Nov 23, 2022

> @tnqn - With the approach you outlined above, would Nodes in an EKS cluster be able to sit in two AZs, so that the workers could be split across 2 subnets and EgressIP would still work?

Yes, but it's not supported yet. It's just an idea for how to resolve the problem, and the implementation needs more evaluation.

> @tnqn - Just to clarify: as things stand, we can't assign multiple Egress IPs to a namespace to solve the multi-AZ subnet problem?

It's not supported yet. As the Egress API shows, only a single Egress IP and a single ExternalIPPool can be specified.
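Editor's sketch of the shape this implies for an Egress resource: one optional egressIP and one externalIPPool per object. The resource, namespace, and pool names are hypothetical, the `apiVersion` may differ between Antrea releases, and selecting a namespace by the `kubernetes.io/metadata.name` label is an assumption about how the cluster labels namespaces.

```python
def egress_for_namespace(name, namespace, pool, egress_ip=None):
    """Build an Egress manifest that applies to every Pod in one namespace."""
    spec = {
        "appliedTo": {
            "namespaceSelector": {
                "matchLabels": {"kubernetes.io/metadata.name": namespace}
            }
        },
        # Exactly one pool can be referenced per Egress.
        "externalIPPool": pool,
    }
    if egress_ip is not None:
        # Exactly one Egress IP can be set; when it is omitted, the
        # address is expected to be allocated from the pool.
        spec["egressIP"] = egress_ip
    return {
        "apiVersion": "crd.antrea.io/v1alpha2",  # assumption: check your release
        "kind": "Egress",
        "metadata": {"name": name},
        "spec": spec,
    }


# Hypothetical names: a "payments" namespace pinned to the pool for one AZ.
egress = egress_for_namespace("egress-payments", "payments", "egress-pool-us-east-1a")
print(egress["spec"])
```

Because both fields are scalars, a namespace bound this way depends on the Nodes of a single subnet, which is the gap this issue asks to close.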

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions bot added the lifecycle/stale label Feb 22, 2023
antoninbas removed the lifecycle/stale label Feb 24, 2023
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions bot added the lifecycle/stale label May 26, 2023
github-actions bot closed this as not planned Aug 25, 2023

4 participants