Bypass Windows host network when forwarding Pod egress traffic in noencap mode #2157
Comments
But you still need to learn the gateway MAC, right? The Nodes can be in different subnets. I would prefer to start with the gateway MAC only, if that works in GKE (I know it works for NSX-T), and decide later whether to support more cases. I assume GKE and most clouds do distributed routing, so routing won't add much overhead compared to switching. |
It seems that the gateway MAC solution assumes that the router / gateway always knows how to route all Pod IP addresses, which is the case in GKE but is not always true. I sometimes deploy Antrea on Nodes in the same subnet and I enable noEncap mode. My default gateway doesn't know about the Pod IPs.
Could we do the same-subnet improvement first by learning the Nodes' MAC addresses, and keep the current implementation for destination Nodes in a different subnet? |
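A minimal sketch of the same-subnet check that this split implies, assuming the agent knows its own Node address in CIDR form and each peer's Node IP. The helper name `sameSubnet` and the sample addresses are hypothetical and are not taken from Antrea's code.

```go
package main

import (
	"fmt"
	"net"
)

// sameSubnet reports whether peerIP falls inside the subnet of the local
// Node address given in CIDR form (e.g. "10.0.0.5/24").
// Hypothetical helper for illustration only.
func sameSubnet(localCIDR, peerIP string) (bool, error) {
	_, subnet, err := net.ParseCIDR(localCIDR)
	if err != nil {
		return false, err
	}
	ip := net.ParseIP(peerIP)
	if ip == nil {
		return false, fmt.Errorf("invalid peer IP %q", peerIP)
	}
	return subnet.Contains(ip), nil
}

func main() {
	local := "10.0.0.5/24" // assumed local Node address and prefix length
	for _, peer := range []string{"10.0.0.9", "10.0.1.9"} {
		direct, err := sameSubnet(local, peer)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if direct {
			fmt.Printf("%s: same subnet, learn the MAC and forward via uplink\n", peer)
		} else {
			fmt.Printf("%s: different subnet, keep the current path\n", peer)
		}
	}
}
```

The same check works for IPv6 addresses, since `net.ParseCIDR` and `IPNet.Contains` handle both families.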
Yes, as @antoninbas pointed out, the routes must be configured on the router, either manually or by a cloud controller; otherwise basic traffic wouldn't work. If we assume the Nodes are in the same subnet for small and medium-sized clusters, wouldn't it be too complicated to ask users to configure the router manually or install a cloud controller just to get basic functionality? Indeed, we need to learn the gateway MAC too to optimize the different-subnet case. I think that would be a little more complicated than learning the Nodes' MACs, as the router might be in active-standby mode, so the implementation may need to update the MAC dynamically. Direct routing places fewer requirements on the topology: it doesn't need a cloud controller to be installed, doesn't need distributed routing to be efficient, and doesn't even need a router to be deployed. Does it make sense to do it for the same-subnet case, even in the long term? |
I understand now that you want to optimize for the single-subnet case first. The problem is that many clusters span subnets. Do we know what the case is for GKE? I was surprised to learn that in GKE the router cannot forward Pod traffic. Do you know the reason? For example, does the router drop packets from the same subnet (due to an RPF check), or does the cloud controller not add routes for the same subnet at all? |
BTW, on Linux there are socket APIs to resolve ARP and even get notified of ARP entry changes (I used these APIs before for an overlay implementation), so it is not very hard to resolve a MAC with ARP. But I do not know what support Windows has. |
@jianjuns I'm not sure whether GKE automatically installs a cloud controller with the route controller enabled. I think @antoninbas means that the gateway MAC works in GKE. @antoninbas could you confirm? GKE places all Nodes of a cluster into a single subnet: https://cloud.google.com/kubernetes-engine/docs/how-to/routes-based-cluster#cluster_sizing. I think that should apply to most small and medium-sized clusters.
Yes, I agree it's not hard to resolve a MAC via ARP. By getting notified of ARP entry changes, do you mean listening for ARP broadcasts or watching the local IP neighbor cache? Is the latter reliable? What if the Node never talks to the gateway, in which case it will ignore the ARP broadcast? We will also need to handle IPv6. |
I see. Seems I misunderstood. OK, good to know that for GKE we can assume a single subnet (but for TKG I do not feel we can assume that). On Linux (and ESX too), there is a socket ioctl API to resolve an IP's MAC via ARP (the TCP/IP stack handles the ARP exchange), and there is also an API to get a callback when an ARP entry changes (e.g. after it is refreshed by GARP). If there is no good way to resolve a MAC with ARP, I am fine with discovering the Node MAC as a quick solution for GKE. |
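As a rough illustration of "watching the local IP neighbor cache" on Linux, the sketch below parses the text of `/proc/net/arp` and returns the MAC for a given IPv4 address. This is a simpler technique than the ioctl and callback APIs mentioned above, not a substitute for them: it only reads entries the kernel has already resolved, it does not trigger ARP resolution, and it does not cover IPv6 neighbors. The helper name `lookupMAC` and the gateway IP are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
	"strings"
)

// atfCom is the ATF_COM flag: the neighbor entry is complete (MAC resolved).
const atfCom = 0x2

// lookupMAC scans the text of a /proc/net/arp-style table and returns the
// hardware address for ip if a complete entry exists.
// Hypothetical helper for illustration only.
func lookupMAC(table, ip string) (net.HardwareAddr, bool) {
	for i, line := range strings.Split(table, "\n") {
		if i == 0 {
			continue // skip the column header line
		}
		f := strings.Fields(line)
		// Columns: IP address, HW type, Flags, HW address, Mask, Device.
		if len(f) < 6 || f[0] != ip {
			continue
		}
		flags, err := strconv.ParseUint(f[2], 0, 32)
		if err != nil || flags&atfCom == 0 {
			continue // entry is incomplete, MAC not resolved yet
		}
		mac, err := net.ParseMAC(f[3])
		if err != nil {
			continue
		}
		return mac, true
	}
	return nil, false
}

func main() {
	data, err := os.ReadFile("/proc/net/arp")
	if err != nil {
		fmt.Println("cannot read neighbor cache:", err)
		return
	}
	gateway := "192.168.1.1" // hypothetical gateway IP
	if mac, ok := lookupMAC(string(data), gateway); ok {
		fmt.Printf("%s is at %s\n", gateway, mac)
	} else {
		fmt.Printf("no complete entry for %s\n", gateway)
	}
}
```

A missing entry here matches the concern raised earlier in the thread: if the Node has never talked to the gateway, the cache has nothing to report.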
@jianjuns thanks for sharing the method to resolve MAC. |
Yes, that's what I meant. I didn't try, but the gateway MAC is likely to work in GKE in my opinion. But not as a generic solution for all network infras. |
In GKE, a cluster is created inside a VPC subnet; it never spans subnets. Also, correct me if I am wrong, but direct routing mode only works within the same subnet; otherwise you need some kind of tunnel, right? ARP/NDP also doesn't work in GCE. I think that, in general, broadcast/multicast protocols are not supported in GCE (or other clouds?). When an ARP request is sent from an instance, I guess the host intercepts it and replies to it itself. No real ARP packets are sent on the wire. |
Thanks for confirming this. There are two ways to make it work across subnets:
In all current traffic modes, the Pod network within a Node is an L2 network, so no broadcast/multicast traffic from Pods is sent on the wire. Pods use the virtual gateway (antrea-gw0) on the Node as their gateway. The ARP/NDP traffic discussed in this issue is for OVS (acting as the gateway in the proposal) to discover the MAC addresses of other Nodes and the router. From the perspective of the underlying network, this is essentially the same as Nodes discovering each other's MAC addresses for their own communication: the underlying network would only see Node A asking for the MAC address of Node N or the router. |
Describe what you are trying to solve
To forward Pod egress traffic in noencap mode, Antrea requires the Windows host network to route the packets. The current traffic path is as below:
Because the uplink interface is attached to the OVS bridge, it's possible to output the packets to the uplink interface the first time they enter the OVS bridge. The path would then be as below:
With the shorter path, Windows Pod-to-Pod bandwidth improved from ~1.5Gbps to ~2.5Gbps in a testbed. It also makes the TX and RX paths symmetric, avoiding the risks of asymmetric paths, such as breaking stateful firewalls.
To achieve this, antrea-agent needs to know the MAC addresses of the next hops when installing OpenFlow rules that make the egress traffic bypass the host network. There are two ways to do it:
1. When an agent starts, it reports the MAC address of its primary interface (the one that has its Node IP) via an annotation on the Node, such as "node.antrea.io/mac-address". When NodeRouteController installs the route for a remote Node, it reads the remote Node's MAC address from that annotation and, if the annotation is present, installs an L3 flow that forwards packets directly to the uplink interface.
2. When NodeRouteController installs the route for a remote Node, it sends an ARP/NDP query to get the remote Node's MAC address and installs an L3 flow based on the response.
In both approaches, if the agent fails to get the peer Node's MAC address, it should fall back to the previous path, i.e. forwarding to antrea-gw0 to leverage host network routing.
I lean toward the first approach: Antrea needs to get the Node object anyway, and it can handle MAC discovery failures and updates with less overhead. The agent just needs to react to Node events instead of proactively sending ARP queries, which could add traffic overhead to the network. It also works for both IPv4 and IPv6.
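The annotation-based lookup and its fallback could be sketched roughly as follows. This is only an illustration under stated assumptions: the annotation key is the example name from the proposal, Node annotations are modeled as a plain `map[string]string` rather than a Kubernetes Node object, and `nextHop` and `resolveNextHop` are hypothetical names that do not come from Antrea's code.

```go
package main

import (
	"fmt"
	"net"
)

// macAnnotationKey is the example annotation name from the proposal.
const macAnnotationKey = "node.antrea.io/mac-address"

// nextHop describes how to forward traffic destined for a remote Node's Pods.
type nextHop struct {
	viaUplink bool             // true: output to the uplink directly
	dstMAC    net.HardwareAddr // set only when viaUplink is true
}

// resolveNextHop picks the direct uplink path when the peer Node advertises
// a valid 6-byte MAC, and otherwise falls back to the existing path through
// antrea-gw0 (the zero value of nextHop).
// Hypothetical helper for illustration only.
func resolveNextHop(annotations map[string]string) nextHop {
	raw, ok := annotations[macAnnotationKey]
	if !ok {
		return nextHop{} // annotation missing: fall back
	}
	mac, err := net.ParseMAC(raw)
	if err != nil || len(mac) != 6 {
		return nextHop{} // malformed or non-Ethernet address: fall back
	}
	return nextHop{viaUplink: true, dstMAC: mac}
}

func main() {
	peers := []struct {
		name        string
		annotations map[string]string
	}{
		{"node-a", map[string]string{macAnnotationKey: "aa:bb:cc:dd:ee:ff"}},
		{"node-b", map[string]string{}}, // agent has not reported a MAC yet
	}
	for _, p := range peers {
		hop := resolveNextHop(p.annotations)
		if hop.viaUplink {
			fmt.Printf("%s: forward via uplink, dst MAC %s\n", p.name, hop.dstMAC)
		} else {
			fmt.Printf("%s: fall back to antrea-gw0\n", p.name)
		}
	}
}
```

Because the decision depends only on the Node's annotations, re-running it on each Node update event would pick up a MAC that appears or changes later, without any probing on the wire.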
Alternative solutions that you considered
Another way is to use the default gateway's MAC address as the destination MAC on all agents, which avoids learning the Nodes' MAC addresses. However, it doesn't give the same improvement as direct routing from Node to Node. In the same testbed, the bandwidth was ~1.1Gbps (with a Linux server acting as the default gateway).