Support DSR mode for Service's external addresses #5202

Merged
merged 1 commit into from
Jul 20, 2023

@tnqn tnqn commented Jul 4, 2023

This commit adds support for DSR (Direct Server Return) mode for Service's external addresses, including LoadBalancerIPs and ExternalIPs. A configuration option, antreaProxy.defaultLoadBalancerMode, is added to determine how external traffic is processed by default when it is load balanced across Nodes. It has two options: nat (default) and dsr. In NAT mode, external traffic is SNAT'd when it is load balanced across Nodes to ensure a symmetric path. In DSR mode, external traffic is never SNAT'd, and backend Pods running on Nodes other than the ingress Node can reply to clients directly, bypassing the ingress Node.

Additionally, a Service's load balancer mode can be overridden by annotating it with service.antrea.io/load-balancer-mode. A feature gate, LoadBalancerModeDSR, is added to control whether DSR mode may be used.
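
The precedence between the feature gate, the default mode, and the per-Service annotation can be sketched as below. This is only an illustration under stated assumptions: `effectiveMode` and its parameters are hypothetical names rather than Antrea's actual code, the annotation is assumed to accept the same `nat`/`dsr` values as the configuration option, and the fallback behaviour for a disabled gate or an unrecognized value is a guess.

```go
package main

import "fmt"

// The annotation key and the two mode values come from the PR description.
const loadBalancerModeAnnotation = "service.antrea.io/load-balancer-mode"

type LoadBalancerMode string

const (
	ModeNAT LoadBalancerMode = "nat"
	ModeDSR LoadBalancerMode = "dsr"
)

// effectiveMode is a hypothetical helper, not Antrea's implementation.
// dsrGateEnabled stands for the LoadBalancerModeDSR feature gate and
// defaultMode for antreaProxy.defaultLoadBalancerMode.
func effectiveMode(dsrGateEnabled bool, defaultMode LoadBalancerMode, annotations map[string]string) LoadBalancerMode {
	if !dsrGateEnabled {
		// Assumption: with the gate disabled, DSR is never used.
		return ModeNAT
	}
	mode := defaultMode
	if v, ok := annotations[loadBalancerModeAnnotation]; ok {
		switch LoadBalancerMode(v) {
		case ModeNAT, ModeDSR:
			// The per-Service annotation overrides the default.
			mode = LoadBalancerMode(v)
		default:
			// Assumption: unrecognized values keep the default mode.
		}
	}
	return mode
}

func main() {
	annotated := map[string]string{loadBalancerModeAnnotation: "dsr"}
	fmt.Println(effectiveMode(true, ModeNAT, annotated))
	fmt.Println(effectiveMode(false, ModeNAT, annotated))
	fmt.Println(effectiveMode(true, ModeDSR, nil))
}
```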

When a Service's LoadBalancerMode is DSR, the following changes will be applied to the OpenFlow flows and groups:

  1. ClusterGroup will be used by traffic working in DSR mode on ingress Node.
     • If a local Endpoint is selected, the traffic will just be handled normally, as DSR is not applicable in this case.
     • If a remote Endpoint is selected, the traffic will be sent to the backend Node that hosts the Endpoint without being NAT'd. The eventual Endpoint will be determined on the backend Node and may be different from the one selected here.
  2. LocalGroup will be used by traffic working in DSR mode on backend Node. In this way, each Endpoint has the same chance to be selected eventually.
  3. Traffic working in DSR mode on ingress Node will be marked and treated specially, e.g. bypassing SNAT.
  4. Learned flow will be created for each connection to ensure consistent load balance decision for a connection of DSR mode.

Learned flow is necessary because connections of DSR mode will remain invalid on ingress Node as it can only see requests and not responses. And OVS doesn't provide ct_state and ct_label for invalid connections. Thus, we can't store the load balance decision of the connection to ct_state or ct_label. To ensure consistent load balancing decision for packets of a connection, we use "learn" action to generate a learned flow when processing the first packet of a connection, and rely on the learned flow to process subsequent packets of the same connection.
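
The idea behind the learned flow can be modelled in a few lines of Go. This is a sketch of the concept only, not of the OVS "learn" action or Antrea's flows: `connKey`, `ingressLB`, and `forward` are hypothetical names, round-robin stands in for the real group bucket selection, and flow timeouts are not modelled.

```go
package main

import "fmt"

// connKey identifies a connection; it stands in for the match fields
// that a learned flow would carry.
type connKey struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
	proto            uint8
}

// ingressLB models the ingress Node. The learned map plays the role of
// the learned flows: one entry per connection, written on the first packet.
type ingressLB struct {
	endpoints []string
	next      int
	learned   map[connKey]string
}

func newIngressLB(endpoints []string) *ingressLB {
	return &ingressLB{endpoints: endpoints, learned: make(map[connKey]string)}
}

// forward returns the Endpoint chosen for a packet. Conntrack cannot be
// used to remember the decision in DSR mode, so the first packet's choice
// is recorded here and reused by every later packet of the connection.
func (lb *ingressLB) forward(k connKey) string {
	if ep, ok := lb.learned[k]; ok {
		return ep // subsequent packet: hit the "learned flow"
	}
	ep := lb.endpoints[lb.next%len(lb.endpoints)] // first packet: select
	lb.next++
	lb.learned[k] = ep // emulate the "learn" action
	return ep
}

func main() {
	lb := newIngressLB([]string{"ep-a", "ep-b"})
	c1 := connKey{srcIP: "10.0.0.1", dstIP: "1.1.2.1", srcPort: 40000, dstPort: 8080, proto: 6}
	c2 := connKey{srcIP: "10.0.0.2", dstIP: "1.1.2.1", srcPort: 40001, dstPort: 8080, proto: 6}
	for _, k := range []connKey{c1, c2, c1, c2, c1} {
		fmt.Printf("%s:%d -> %s\n", k.srcIP, k.srcPort, lb.forward(k))
	}
}
```

Every packet of the same connection maps to the Endpoint picked for its first packet, which is the property the learned flow is meant to provide.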

DSR mode usually means lower latency, higher output bandwidth, and preserved client IP. However, due to the use of learned flows, creating new connections may be slightly slower than in NAT mode; this may be improved in the future. The benchmark of the current implementation is as below:

Test               NAT       DSR       delta
TCP_CRR            1105.69   1007.82   -8.86%
TCP_RR             6802.55   9054.44   +33.1%

This feature is currently supported only for Linux Nodes, encap mode, and IPv4 clusters. Support for Windows and IPv6 can be added in the future.

Closes #5025

Doc will be added via a new PR soon.

@tnqn tnqn added area/proxy Issues or PRs related to proxy functions in Antrea action/release-note Indicates a PR that should be included in release notes. labels Jul 4, 2023
@tnqn tnqn added this to the Antrea v1.13 release milestone Jul 4, 2023
@tnqn tnqn force-pushed the support-dsr branch 7 times, most recently from 462a3bf to d256d83 Compare July 6, 2023 17:00
@tnqn tnqn force-pushed the support-dsr branch 6 times, most recently from 1d972ed to c30cdd7 Compare July 10, 2023 15:57
@tnqn tnqn marked this pull request as ready for review July 10, 2023 16:00
@@ -702,10 +714,11 @@ func (f *featurePodConnectivity) conntrackFlows() []binding.Flow {
MatchProtocol(ipProtocol).
MatchCTStateNew(false).
MatchCTStateTrk(true).
MatchCTMark(NotServiceCTMark).
Member Author

This and the following two changes are not closely related to DSR, but they make priorityHigh available for a special flow that allows some invalid connections to pass this table and check whether they match a DSR connection. Before the changes, the flows in ConntrackState are:

table=ConntrackState, priority=210,ct_state=+inv+trk,ip actions=drop
table=ConntrackState, priority=200,ct_state=-new+trk,ct_mark=0x10/0x10,ip actions=load:0x1->NXM_NX_REG0[9],resubmit(,AntreaPolicyEgressRule)
table=ConntrackState, priority=190,ct_state=-new+trk,ip actions=resubmit(,AntreaPolicyEgressRule)
table=ConntrackState, priority=0 actions=resubmit(,PreRoutingClassifier)

After the changes:

table=ConntrackState, priority=200,ct_state=+inv+trk,ip actions=drop
table=ConntrackState, priority=190,ct_state=-new+trk,ct_mark=0x10/0x10,ip actions=load:0x1->NXM_NX_REG0[9],resubmit(,AntreaPolicyEgressRule)
table=ConntrackState, priority=190,ct_state=-new+trk,ct_mark=0/0x10,ip actions=resubmit(,AntreaPolicyEgressRule)
table=ConntrackState, priority=0 actions=resubmit(,PreRoutingClassifier)
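
Judging from the flow dumps above, the two priority-190 flows can share a priority because their masked ct_mark matches (`0x10/0x10` and `0/0x10`) can never both be true for the same packet. The Go sketch below illustrates that value/mask semantics; the function names are made up for illustration and are not part of Antrea or OVS.

```go
package main

import "fmt"

// serviceBit is the bit tested by the flows above (ct_mark=.../0x10).
const serviceBit uint32 = 0x10

// maskedMatch reports whether value matches want under mask, following
// the usual "value/mask" convention: only the masked bits are compared.
func maskedMatch(value, want, mask uint32) bool {
	return value&mask == want&mask
}

// matchesServiceFlow mirrors ct_mark=0x10/0x10.
func matchesServiceFlow(ctMark uint32) bool {
	return maskedMatch(ctMark, serviceBit, serviceBit)
}

// matchesNonServiceFlow mirrors ct_mark=0/0x10.
func matchesNonServiceFlow(ctMark uint32) bool {
	return maskedMatch(ctMark, 0, serviceBit)
}

func main() {
	for _, m := range []uint32{0x0, 0x10, 0x13, 0x21} {
		fmt.Printf("ct_mark=%#x service=%v nonService=%v\n",
			m, matchesServiceFlow(m), matchesNonServiceFlow(m))
	}
}
```

For any ct_mark value exactly one of the two predicates holds, so the relative order of the two flows does not matter.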

tnqn commented Jul 10, 2023

/test-all
/test-ipv6-all
/test-ipv6-only-all

Comment on lines 144 to 146
# -- Determines how external traffic's processed when it's load balanced across nodes. It must be one of "nat" or
# "dsr".
loadBalanceMode: "nat"
Contributor

s/Determines how external traffic's processed/Determines how external traffic is processed

I also feel like loadBalancerMode may be a better name than loadBalanceMode? But I see that you have been using loadBalanceMode consistently everywhere.

Member Author

done, also changed to loadBalancerMode

@@ -141,6 +141,9 @@ antreaProxy:
# will only handle Services without the "service.kubernetes.io/service-proxy-name"
# label, but ignore Services with the label no matter what is the value.
serviceProxyName: ""
# -- Determines how external traffic's processed when it's load balanced across nodes. It must be one of "nat" or
# "dsr".
loadBalanceMode: "nat"
Contributor

for some reason, I thought we were going to have the ability to enable DSR on a per-Service basis.
is that not feasible, or do you think it is not interesting to have that granularity?

Member Author
@tnqn tnqn Jul 11, 2023

I planned it, but only implemented global config for simplicity. Since this is also your preference, I added per-Service support in the latest patch.

@tnqn tnqn force-pushed the support-dsr branch 5 times, most recently from a9b44af to 84e1ac5 Compare July 11, 2023 17:56
@tnqn tnqn changed the title Support DSR mode for Service's external IPs Support DSR mode for Service's external addresses Jul 13, 2023
@tnqn tnqn force-pushed the support-dsr branch 3 times, most recently from 33d6cbf to 893e859 Compare July 14, 2023 14:47
Contributor
@antoninbas antoninbas left a comment

small comments, otherwise lgtm

if clientNetns != "" {
cmd = fmt.Sprintf("ip netns exec %s %s", clientNetns, cmd)
}
stdout, stderr, err = data.RunCommandFromPod(data.testNamespace, clientPod, "toolbox", []string{"sh", "-c", cmd})
Contributor

Is the sh shell actually needed here?

Member Author

It's needed. Otherwise the whole command would be regarded as an executable and kubelet would try to find it:

Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "3ef22b5120aaef0fe232e01430ef127e2035e4a64a3e51d9705331c29338eeee": OCI runtime exec failed: exec failed: unable to start container process: exec: "ip netns exec ext-ehw34 curl --connect-timeout 1 --retry 5 --retry-connrefused http://1.1.2.1:8080/clientip": stat ip netns exec ext-ehw34 curl --connect-timeout 1 --retry 5 --retry-connrefused http://1.1.2.1:8080/clientip: no such file or directory: unknown

Contributor

That's surprising to me. Even when the command is split and passed as a slice? []string{"ip", "netns", ...}

Member Author

Splitting the command into a slice could work. Both forms are commonly used to set a command; []string{"sh", "-c", cmd} makes it easier to construct complex commands, and the formatting is friendlier in some cases.
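
The two ways of building the exec arguments discussed in this thread can be compared side by side. The sketch below uses only the Go standard library; `shellArgs` and `splitArgs` are hypothetical helper names, and the splitting variant assumes a command without quoting, pipes, or redirection.

```go
package main

import (
	"fmt"
	"strings"
)

// shellArgs wraps the whole command line in "sh -c", as the test code
// above does. The shell inside the container parses the string.
func shellArgs(netns, cmd string) []string {
	if netns != "" {
		cmd = fmt.Sprintf("ip netns exec %s %s", netns, cmd)
	}
	return []string{"sh", "-c", cmd}
}

// splitArgs builds the argv slice directly, so no shell is needed.
// strings.Fields splits on whitespace only, which is why this is safe
// only for simple commands.
func splitArgs(netns, cmd string) []string {
	args := strings.Fields(cmd)
	if netns != "" {
		args = append([]string{"ip", "netns", "exec", netns}, args...)
	}
	return args
}

func main() {
	cmd := "curl --connect-timeout 1 http://1.1.2.1:8080/clientip"
	fmt.Printf("%q\n", shellArgs("ext-ehw34", cmd))
	fmt.Printf("%q\n", splitArgs("ext-ehw34", cmd))
}
```

Passing the unsplit string as a single slice element without `sh -c` is what produces the "no such file or directory" error quoted earlier, because the runtime then treats the whole line as the executable name.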

Contributor
@hongliangl hongliangl left a comment

Not finished yet, will continue reviewing.

Action().GotoStage(stageValidation).
Done())
}
// If the packet is from gateway but its source IP is not the gateway IP, it's considered external sourced traffic.
Contributor

Do we need to consider other IPs on the host here?

tnqn commented Jul 17, 2023

/test-all
/test-ipv6-all
/test-ipv6-only-all

Contributor
@antoninbas antoninbas left a comment

LGTM

tnqn commented Jul 19, 2023

/test-all
/test-ipv6-all
/test-ipv6-only-all

tnqn commented Jul 19, 2023

/test-windows-e2e

tnqn commented Jul 19, 2023

/test-windows-containerd-e2e

antoninbas previously approved these changes Jul 19, 2023
tnqn commented Jul 20, 2023

@hongliangl do you have other comments?

@XinShuYang
Contributor

/test-windows-containerd-e2e

tnqn commented Jul 20, 2023

/test-windows-containerd-e2e
/test-windows-containerd-conformance

Contributor
@hongliangl hongliangl left a comment

LGTM, just a few nits.

@@ -2443,6 +2554,25 @@ func (f *featureService) endpointDNATFlow(endpointIP net.IP, endpointPort uint16
Done()
}

// dsrServiceNoDNATFlow generates the flow which prevents traffic in DSR mode from being DNATed on the ingress Node.
func (f *featureService) dsrServiceNoDNATFlow() []binding.Flow {
Contributor

Suggested change
func (f *featureService) dsrServiceNoDNATFlow() []binding.Flow {
func (f *featureService) dsrServiceNoDNATFlows() []binding.Flow {

Member Author

done

@@ -2443,6 +2554,25 @@ func (f *featureService) endpointDNATFlow(endpointIP net.IP, endpointPort uint16
Done()
}

// dsrServiceNoDNATFlow generates the flow which prevents traffic in DSR mode from being DNATed on the ingress Node.
Contributor

Suggested change
// dsrServiceNoDNATFlow generates the flow which prevents traffic in DSR mode from being DNATed on the ingress Node.
// dsrServiceNoDNATFlows generates the flows which prevent traffic in DSR mode from being DNATed on the ingress Node.

Member Author

done

@@ -1363,18 +1411,43 @@ func (f *featurePodConnectivity) l3FwdFlowToGateway() []binding.Flow {
}

// l3FwdFlowToRemoteViaTun generates the flow to match the packets destined for remote Pods via tunnel.
Contributor

Suggested change
// l3FwdFlowToRemoteViaTun generates the flow to match the packets destined for remote Pods via tunnel.
// l3FwdFlowToRemoteViaTun generates the flows to match the packets destined for remote Pods via tunnel.

Member Author

fixed

@@ -1363,18 +1411,43 @@ func (f *featurePodConnectivity) l3FwdFlowToGateway() []binding.Flow {
}

// l3FwdFlowToRemoteViaTun generates the flow to match the packets destined for remote Pods via tunnel.
func (f *featurePodConnectivity) l3FwdFlowToRemoteViaTun(localGatewayMAC net.HardwareAddr, peerSubnet net.IPNet, tunnelPeer net.IP) binding.Flow {
func (f *featurePodConnectivity) l3FwdFlowToRemoteViaTun(localGatewayMAC net.HardwareAddr, peerSubnet net.IPNet, tunnelPeer net.IP) []binding.Flow {
Contributor

Maybe rename it to l3FwdFlowsToRemoteViaTun

Member Author

fixed

if isShortCircuiting {
// For short-circuiting flow, an extra match condition matching packet from local Pod CIDR is added.
flowBuilder = ServiceLBTable.ofTable.BuildFlow(priorityHigh).
func (f *featureService) serviceLBFlow(config *types.ServiceConfig) []binding.Flow {
Contributor

Suggested change
func (f *featureService) serviceLBFlow(config *types.ServiceConfig) []binding.Flow {
func (f *featureService) serviceLBFlows(config *types.ServiceConfig) []binding.Flow {

Member Author

fixed

Comment on lines +1435 to +1436
// If DSR is enabled, packets accessing a DSR Service will not be DNATed on the ingress Node, but EndpointIPField
// holds the selected backend Pod IP, we match it and DSRServiceRegMark to send these packets to corresponding Nodes.
Contributor

Will the selected Endpoint IP be used on the remote Node? If not, maybe we could add a comment explaining that the Endpoint IP is selected only to decide the remote Node.

Member Author

How DSR works end to end is explained in the commit message and the comment of the loadBalancerMode variable in proxier.go. l3FwdFlowsToRemoteViaTun is only a portion of the whole, and I just want it to focus on L3Forwarding. I feel it would be repetitive to explain it here.

For your question, the Endpoint IP selected on the ingress Node will not be used on the backend Node. See the explanation:

When a Service's LoadBalancerMode is DSR, the following changes will be applied to the OpenFlow flows and groups:

  1. ClusterGroup will be used by traffic working in DSR mode on ingress Node.
  2. LocalGroup will be used by traffic working in DSR mode on backend Node.
  3. Traffic working in DSR mode on ingress Node will be marked and treated specially, e.g. bypassing SNAT.

Contributor

Got that. Regarding "ClusterGroup will be used by traffic working in DSR mode on ingress Node": if a remote Endpoint is selected, only the remote Node is determined, by removing the suffix of the Endpoint IP, and the packet is sent to the remote Node via the tunnel. On the remote Node, the final Endpoint is selected from a local group. Is that right? If so, how about adding more details to explain how the remote Endpoint is selected?

Member Author

What details do you mean by "adding more details to explain how the remote Endpoint is selected"? LocalGroup will be used on the backend Node, so only Endpoints on the remote Node will be selected if the traffic is sent to it, and the bucket selection works just like any other group selection.

Contributor

I must have misunderstood something. I remembered that you had updated the way of bucket selection. My question is:

  • On ingress Node, if a remote Endpoint A is selected in cluster group, then we can get the CIDR of the Node that holds Endpoint A.
  • On the Node that holds Endpoint A, Endpoint A* is selected in local group.
  • Will Endpoint A and A* be the same?

Member Author

I remembered that you have updated the way of bucket selection.

No, bucket selection method update is removed due to its impact on session affinity and internal traffic.

Will Endpoint A and A* be the same?

No, we didn't encode Endpoint A into overlay packet. And it wouldn't require local group to be used on backend Node if they are the same. We DO another selection on backend Node and only pick an Endpoint from local ones.
If you think about it, each Endpoint gets the same chance to be selected regardless of its location. And there is no need to ensure the Endpoint selected on the ingress Node to be the same one selected by backend Node.

I have updated the comment and description to further explain, hope it helps.
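
The "same chance" argument can be checked with a little arithmetic. The sketch below is an illustrative model, not Antrea code: it assumes uniform bucket selection in both groups and no session affinity, and `endpointProbabilities` is a hypothetical name. An Endpoint on a Node that hosts n of the N Endpoints is reached with probability (n/N) * (1/n) = 1/N. For Endpoints local to the ingress Node there is no second selection, and the single cluster-wide pick gives the same 1/N.

```go
package main

import "fmt"

// endpointProbabilities returns, for each Node, the probability that one
// particular Endpoint on that Node ends up serving a new connection under
// the two-stage selection described above.
func endpointProbabilities(endpointsPerNode map[string]int) map[string]float64 {
	total := 0
	for _, n := range endpointsPerNode {
		total += n
	}
	probs := make(map[string]float64, len(endpointsPerNode))
	for node, n := range endpointsPerNode {
		if n == 0 {
			continue // a Node without Endpoints is never selected
		}
		// Stage 1, ClusterGroup on the ingress Node: picking an Endpoint
		// uniformly picks this Node with probability n/total.
		pNode := float64(n) / float64(total)
		// Stage 2, LocalGroup on the backend Node: uniform pick among
		// the n local Endpoints.
		probs[node] = pNode * (1 / float64(n))
	}
	return probs
}

func main() {
	layout := map[string]int{"node-a": 1, "node-b": 3}
	probs := endpointProbabilities(layout)
	for _, node := range []string{"node-a", "node-b"} {
		fmt.Printf("%s: %d endpoint(s), %.4f each\n", node, layout[node], probs[node])
	}
}
```

This is also why Endpoint A and Endpoint A* in the question above do not need to be the same: the second selection restores uniformity without carrying the first decision in the overlay packet.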

Contributor

The updated comment is very helpful for understanding how the remote Endpoint is selected in DSR mode.

This commit adds support for DSR mode for Service's external addresses,
including LoadBalancerIPs and ExternalIPs. A configuration option,
`antreaProxy.defaultLoadBalancerMode` is added to determine how external
traffic is processed when it's load balanced across Nodes by default.
It has two options: `nat` (default) and `dsr`. In NAT mode, external
traffic is SNAT'd when it's load balanced across Nodes to ensure
symmetric path. In DSR mode, external traffic is never SNAT'd and
backend Pods running on Nodes that are not the ingress Node can reply to
clients directly, bypassing the ingress Node.

Additionally, a Service's load balancer mode can be overridden by
annotating it with `service.antrea.io/load-balancer-mode`. A feature
gate, `LoadBalancerModeDSR` is added to control whether it's allowed to
use DSR mode.

When a Service's LoadBalancerMode is DSR, the following changes will be
applied to the OpenFlow flows and groups:

1. ClusterGroup will be used by traffic working in DSR mode on ingress
Node.
  * If a local Endpoint is selected, it will just be handled normally as
    DSR is not applicable in this case.
  * If a remote Endpoint is selected, it will be sent to the backend
    Node that hosts the Endpoint without being NAT'd, the eventual
    Endpoint will be determined on the backend Node and may be different
    from the one selected here.
2. LocalGroup will be used by traffic working in DSR mode on backend
Node. In this way, each Endpoint has the same chance to be selected
eventually.
3. Traffic working in DSR mode on ingress Node will be marked and
treated specially, e.g. bypassing SNAT.
4. Learned flow will be created for each connection to ensure consistent
load balance decision for a connection of DSR mode.

Learned flow is necessary because connections of DSR mode will remain
invalid on ingress Node as it can only see requests and not responses.
And OVS doesn't provide ct_state and ct_label for invalid connections.
Thus, we can't store the load balance decision of the connection to
ct_state or ct_label. To ensure consistent load balancing decision for
packets of a connection, we use "learn" action to generate a learned
flow when processing the first packet of a connection, and rely on the
learned flow to process subsequent packets of the same connection.

DSR mode usually means lower latency, higher output bandwidth, and
preserved client IP. However, due to the use of learned flow, creating
new connections may be slightly slower than NAT mode, this may be
improved in the future. The benchmark of the current implementation is
as below:

```
Test               NAT       DSR       delta
TCP_CRR            1105.69   1007.82   -8.86%
TCP_RR             6802.55   9054.44   +33.1%
```

This feature is currently only supported for Linux Nodes, `encap` mode,
and IPv4 cluster. The support for Windows and IPv6 can be added in the
future.

Signed-off-by: Quan Tian <[email protected]>
Contributor
@hongliangl hongliangl left a comment

LGTM

tnqn commented Jul 20, 2023

/test-windows-containerd-e2e
/test-windows-containerd-conformance

tnqn commented Jul 20, 2023

/skip-all which has succeeded before updating some comments

@tnqn tnqn merged commit 8261b12 into antrea-io:main Jul 20, 2023
@tnqn tnqn deleted the support-dsr branch July 20, 2023 08:51
Development

Successfully merging this pull request may close these issues.

Support DSR mode for LoadBalancerIPs with AntreaProxy