Bypass Windows host network when forwarding Pod egress traffic in noencap mode #2157

tnqn · 2021-05-07T13:02:25Z

Describe what you are trying to solve
To forward Pod egress traffic in noencap mode, Antrea requires the Windows host network to route the packets. The current traffic path is as below:

Pod network interface -> OVS bridge -> antrea-gw0 -> (host network forwarding) br-int interface -> OVS bridge -> uplink interface

As the uplink interface is attached to the OVS bridge, it's possible to output the packets to the uplink interface when they enter the OVS bridge the first time. The path will then be as below:

Pod network interface -> OVS bridge -> uplink interface

With the shorter path, Windows Pod-to-Pod bandwidth improved from ~1.5Gbps to ~2.5Gbps in a testbed. Besides, it makes the TX and RX paths symmetric, avoiding the risks of asymmetric paths, such as breaking stateful firewalls.

To achieve it, antrea-agent needs to know the MAC addresses of the next hops when installing the OpenFlow rules that make the egress traffic bypass the host network. There are two ways to do it:

1. Learn the MAC addresses via the control plane:
   When an agent starts, it reports the MAC address of its primary interface (the one that has its Node IP) via a Node annotation such as "node.antrea.io/mac-address". When installing a route for a remote Node, NodeRouteController gets the remote Node's MAC address from that annotation and, if the annotation is present, installs an L3 flow that forwards the packet to the uplink interface directly.
2. Learn the MAC addresses via the data plane:
   When installing a route for a remote Node, NodeRouteController sends an ARP/NDP query to get the remote Node's MAC address and installs an L3 flow based on the response.

In both cases, if the agent fails to get the peer Node's MAC address, it should fall back to the previous path, i.e. forwarding to antrea-gw0 to leverage host network routing.
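
A minimal sketch of the control-plane option with its fallback, written in Python for brevity (antrea-agent itself is written in Go). The annotation key is the one suggested above; `peer_mac`, `choose_next_hop`, and the returned dictionary layout are hypothetical names for illustration, not Antrea's actual code.

```python
import re
from typing import Optional

MAC_ANNOTATION = "node.antrea.io/mac-address"  # key proposed in this issue
_MAC_RE = re.compile(r"^([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$")


def peer_mac(annotations: dict) -> Optional[str]:
    """Return the peer Node's MAC from its annotations, or None if absent or malformed."""
    mac = annotations.get(MAC_ANNOTATION, "")
    return mac.lower() if _MAC_RE.match(mac) else None


def choose_next_hop(annotations: dict) -> dict:
    """Pick the output port for traffic destined to a remote Node's Pods (hypothetical helper)."""
    mac = peer_mac(annotations)
    if mac is not None:
        # Direct path: use the peer's MAC as destination and output to the uplink.
        return {"out_port": "uplink", "dst_mac": mac}
    # Fallback: hand the packet to the host network via antrea-gw0.
    return {"out_port": "antrea-gw0", "dst_mac": None}
```

A Node without the annotation, or with a malformed value, simply keeps the existing antrea-gw0 path, which is the fallback behavior described above.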

I lean toward the first way: Antrea needs to get the Node object anyway, and it can handle MAC discovery failures and updates with less overhead. The agent just needs to react to Node events instead of proactively sending ARP queries, which could add traffic overhead to the network. It also works for both IPv4 and IPv6.

Alternative solutions that you considered
Another way is to use the default gateway's MAC address as the destination MAC for all agents, which avoids learning the Nodes' MAC addresses. However, it doesn't give the same improvement as direct routing from Node to Node. In the same testbed, the bandwidth was ~1.1Gbps (with a Linux server acting as the default gateway).


jianjuns · 2021-05-07T18:28:29Z

But you still need to learn the gateway MAC, right? The Nodes can be in different subnets. I prefer to start from the gateway MAC only, if that works in GKE (I know it works for NSX-T), and we can decide later to support more cases.

I assume GKE and most clouds do distributed routing, so routing won't add much overhead compared to switching.

antoninbas · 2021-05-07T21:05:12Z

It seems that the gateway MAC solution assumes that the router / gateway always knows how to route all Pod IP addresses, which is the case in GKE but is not always true. I sometimes deploy Antrea on Nodes in the same subnet and I enable noEncap mode. My default gateway doesn't know about the Pod IPs.

> But you still need to learn gateway MAC

Can we do the same-subnet improvement first by learning the MAC addresses, and keep the current implementation for destination Nodes in a different subnet?

tnqn · 2021-05-10T04:20:17Z

Yes, as @antoninbas pointed out, the routes must be configured on the router either manually or by a cloud controller, otherwise basic traffic wouldn't work. If we assume the Nodes are in the same subnet for small and medium-size clusters, wouldn't it be too complicated to ask users to configure the router manually or install a cloud controller just to get basic functionality?

Indeed, we need to learn the gateway MAC too in order to optimize the different-subnets case. I think it would be a little more complicated than learning a Node's MAC, as the router might be in active-standby mode, so the implementation may need to update the MAC dynamically.

Direct routing has fewer requirements on the topology: it doesn't need a cloud controller to be installed, doesn't need distributed routing to be efficient, and doesn't even need a router to be deployed. Does it make sense to do it for the same-subnet case even in the long term?

jianjuns · 2021-05-10T04:49:38Z

I now get that you want to optimize for the single-subnet case first. But the problem is that many clusters span multiple subnets. Do we know what the case is for GKE?

I was surprised to learn that in GKE the router cannot forward Pod traffic. Do you know the reason? For example, does the router drop packets from the same subnet (due to an RPF check), or does the cloud controller not add routes for the same subnet at all?

jianjuns · 2021-05-10T04:53:49Z

BTW, on Linux there are socket APIs to resolve ARP and even to get notified of ARP entry changes (I used these APIs before for an overlay implementation), so it is not very hard to resolve a MAC with ARP. But I do not know what support Windows has.

tnqn · 2021-05-10T05:43:32Z

@jianjuns I'm not sure whether GKE automatically installs a cloud controller with the route controller enabled. I think @antoninbas means the gateway MAC works in GKE. @antoninbas could you confirm?

GKE places all Nodes of a cluster into a single subnet: https://cloud.google.com/kubernetes-engine/docs/how-to/routes-based-cluster#cluster_sizing. I think that should apply to most small and medium-size clusters.

> Node IP addresses are taken from the primary range of the cluster subnet. Your cluster subnet must be large enough to hold the total number of nodes in your cluster.
> For example, if you plan to create a 900-node cluster, the cluster subnet must be at least a /22 in size. A /22 range has 2^10 = 1024 addresses. Subtract the 4 reserved IP addresses, and you get 1020, which is sufficient for the 900 nodes.
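
The arithmetic in the quoted passage can be checked with a short Python snippet. The CIDR `10.0.0.0/22` is an arbitrary example and `usable_node_ips` is a hypothetical helper; the count of 4 reserved addresses is taken from the quote.

```python
import ipaddress

RESERVED_IPS = 4  # reserved addresses per subnet, as stated in the quoted passage


def usable_node_ips(cidr: str) -> int:
    """Number of addresses left for Nodes in the given subnet (hypothetical helper)."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - RESERVED_IPS


# A /22 leaves 32 - 22 = 10 host bits, i.e. 2**10 addresses before the reservation.
print(usable_node_ips("10.0.0.0/22"))
```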

Yes, I agree it's not hard to resolve a MAC via ARP. By getting notified of ARP entry changes, do you mean listening for ARP broadcasts or watching the local IP neighbor cache? Is the latter reliable? What if the Node never talks to the gateway, in which case it will ignore the ARP broadcast? And we will need to handle IPv6 too.

jianjuns · 2021-05-10T05:56:43Z

I see. Seems I misunderstood.

Ok, good to know for GKE we can assume a single subnet (but for TKG I do not feel we can assume that).

On Linux (and ESX too), there is a socket ioctl API to resolve an IP's MAC via ARP (the TCP/IP stack will handle ARP), and there is also an API to get a callback when an ARP entry changes (e.g. after it is refreshed by GARP).
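
This is not the ioctl API mentioned above. As a simpler illustration of reading what the Linux stack has already resolved, here is a Python sketch that parses text in the format of `/proc/net/arp`. The function name and the sample table are hypothetical, and the approach only reads existing entries: it neither triggers ARP resolution nor delivers change notifications.

```python
from typing import Dict

ATF_COM = 0x2  # flag bit set when the entry holds a resolved hardware address


def parse_arp_table(text: str) -> Dict[str, str]:
    """Map IP -> MAC for completed entries in /proc/net/arp-formatted text."""
    entries = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or truncated lines
        ip, _hw_type, flags, mac = fields[:4]
        if int(flags, 16) & ATF_COM:
            entries[ip] = mac
    return entries


# Made-up sample: one resolved entry and one incomplete entry.
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.1      0x1         0x2         aa:bb:cc:dd:ee:01     *        eth0
192.168.1.7      0x1         0x0         00:00:00:00:00:00     *        eth0
"""
```

On a Linux host, the same function could be fed the contents of `/proc/net/arp`; a peer that has never been contacted will not appear there, which is the reliability concern raised earlier in the thread.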

If there is no good way to resolve the MAC with ARP, I am fine with discovering Node MACs as a quick solution for GKE.

tnqn · 2021-05-10T13:14:43Z

@jianjuns thanks for sharing the method to resolve the MAC.
@lzhecheng please check whether the Windows platform has a similar API. We need to have this for the different-subnet case soon.

antoninbas · 2021-05-10T22:30:39Z

> I think @antoninbas means gateway MAC works in GKE. @antoninbas could you confirm?

Yes, that's what I meant. I didn't try it, but in my opinion the gateway MAC is likely to work in GKE, though not as a generic solution for all network infrastructures.

anfernee · 2021-05-18T06:30:53Z

In GKE, a cluster is created inside a VPC subnet. It never spans subnets. Also, correct me if I am wrong: direct routing mode only works within the same subnet, otherwise you will need some tunnel, right?

ARP/NDP also doesn't work in GCE. I think in general, broadcast/multicast protocols are not supported in GCE (or other clouds?). When an ARP request is sent from an instance, I guess the host intercepts it and replies to it itself. No real ARP packets are sent on the wire.

tnqn · 2021-05-18T07:02:28Z

> In GKE, a cluster is created inside a VPC subnet. It's never across subnets. Also correct me if I am wrong, direct routing mode only works within the same subnet, otherwise you will need some tunnel, right?

Thanks for confirming this. There are two ways to make it work across subnets:

1. Setting the traffic mode to hybrid, in which case it works as you said, i.e. tunnels will be created between Nodes that are in different subnets.
2. Setting the traffic mode to noencap, in which case it's assumed that the router in the underlying network knows how to route Pod traffic. The routing path is: Pod A1 -> Node A -> Router -> Node B -> Pod B1. The required routes on the router can be programmed by a cloud controller that enables RouteController.
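
The two options above can be sketched as a small decision function, in Python for brevity. `forwarding_action`, its parameters, and the returned strings are hypothetical names for illustration; only the mode names `hybrid` and `noencap` come from the discussion.

```python
import ipaddress


def forwarding_action(local_ip: str, peer_ip: str, prefix_len: int, traffic_mode: str) -> str:
    """Describe how Pod traffic reaches a peer Node (hypothetical helper)."""
    local_net = ipaddress.ip_interface(f"{local_ip}/{prefix_len}").network
    if ipaddress.ip_address(peer_ip) in local_net:
        # Same subnet: the peer Node is reachable at L2, no tunnel or router needed.
        return "direct"
    if traffic_mode == "hybrid":
        # Different subnet in hybrid mode: encapsulate between the two Nodes.
        return "tunnel"
    if traffic_mode == "noencap":
        # Different subnet in noencap mode: rely on the underlay router's routes.
        return "via-router"
    raise ValueError(f"unsupported traffic mode: {traffic_mode}")
```

For example, with a local Node at 10.0.0.5/24, a peer at 10.0.1.9 is reached through a tunnel in hybrid mode and through the router in noencap mode, while a peer at 10.0.0.9 is reached directly in both.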

> ARP/NDP also doesn't work in GCE. I think in general, broadcast/multicast protocols are not supported in GCE (or other clouds?). When an ARP request is sent from an instance, I guess the host intercepts it and replies to it itself. No real ARP packets are sent on the wire.

In all current traffic modes, the Pod network in a Node is an L2 network, so no broadcast/multicast traffic from Pods will be sent on the wire. Pods use the virtual gateway (antrea-gw0) in the Node as their gateway. The ARP/NDP traffic discussed in this issue is for OVS (acting as the gateway in the proposal) to discover the MAC addresses of other Nodes and of the router. From the perspective of the underlying network, that is essentially the same as Nodes discovering each other's MAC addresses for their own communication, e.g. the underlying network would only see Node A asking for Node N's or the router's MAC address.

tnqn · 2021-06-12T03:21:06Z

Resolved by #2161 and #2160, closing.

tnqn added the kind/design label May 7, 2021

tnqn mentioned this issue May 10, 2021

Update Node's MAC address to the Node's annotation for direct routing #2161

Merged

antoninbas assigned tnqn and lzhecheng May 19, 2021

tnqn closed this as completed Jun 12, 2021

tnqn mentioned this issue Nov 5, 2021

Windows conformance test failed consistently #2981

Closed
