-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support DSR mode for Service's external addresses #5202
Conversation
462a3bf
to
d256d83
Compare
1d972ed
to
c30cdd7
Compare
@@ -702,10 +714,11 @@ func (f *featurePodConnectivity) conntrackFlows() []binding.Flow { | |||
MatchProtocol(ipProtocol). | |||
MatchCTStateNew(false). | |||
MatchCTStateTrk(true). | |||
MatchCTMark(NotServiceCTMark). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the following two changes are not very related to DSR but can make priorityHigh available for a specical flow to allow some invalid connections to pass this table and check if they can match a DSR connection. Before the changes, the flows in ConntrackState are:
table=ConntrackState, priority=210,ct_state=+inv+trk,ip actions=drop
table=ConntrackState, priority=200,ct_state=-new+trk,ct_mark=0x10/0x10,ip actions=load:0x1->NXM_NX_REG0[9],resubmit(,AntreaPolicyEgressRule)
table=ConntrackState, priority=190,ct_state=-new+trk,ip actions=resubmit(,AntreaPolicyEgressRule)
table=ConntrackState, priority=0 actions=resubmit(,PreRoutingClassifier)
After the changes:
table=ConntrackState, priority=200,ct_state=+inv+trk,ip actions=drop
table=ConntrackState, priority=190,ct_state=-new+trk,ct_mark=0x10/0x10,ip actions=load:0x1->NXM_NX_REG0[9],resubmit(,AntreaPolicyEgressRule)
table=ConntrackState, priority=190,ct_state=-new+trk,ct_mark=0/0x10,ip actions=resubmit(,AntreaPolicyEgressRule)
table=ConntrackState, priority=0 actions=resubmit(,PreRoutingClassifier)
/test-all |
build/charts/antrea/values.yaml
Outdated
# -- Determines how external traffic's processed when it's load balanced across nodes. It must be one of "nat" or | ||
# "dsr". | ||
loadBalanceMode: "nat" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Determines how external traffic's processed/Determines how external traffic is processed
I also feel like loadBalancerMode
may be a better name than loadBalanceMode
? But I see that you have been using loadBalanceMode
consistently everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, also changed to loadBalancerMode
build/charts/antrea/values.yaml
Outdated
@@ -141,6 +141,9 @@ antreaProxy: | |||
# will only handle Services without the "service.kubernetes.io/service-proxy-name" | |||
# label, but ignore Services with the label no matter what is the value. | |||
serviceProxyName: "" | |||
# -- Determines how external traffic's processed when it's load balanced across nodes. It must be one of "nat" or | |||
# "dsr". | |||
loadBalanceMode: "nat" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for some reason, I thought we were going to have the ability to enable DSR on a per-Service basis.
is that not feasible, or do you think it is not interesting to have that granularity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I planned it, but only implemented global config for simplicy. Since this is also your preference, I added per-Service support in the latest patch.
a9b44af
to
84e1ac5
Compare
33d6cbf
to
893e859
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments, otherwise lgtm
if clientNetns != "" { | ||
cmd = fmt.Sprintf("ip netns exec %s %s", clientNetns, cmd) | ||
} | ||
stdout, stderr, err = data.RunCommandFromPod(data.testNamespace, clientPod, "toolbox", []string{"sh", "-c", cmd}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the sh
shell actually needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's needed. Otherwise the whole command would be regarded as an executable and kubelet would try to find it:
Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "3ef22b5120aaef0fe232e01430ef127e2035e4a64a3e51d9705331c29338eeee": OCI runtime exec failed: exec failed: unable to start container process: exec: "ip netns exec ext-ehw34 curl --connect-timeout 1 --retry 5 --retry-connrefused http://1.1.2.1:8080/clientip": stat ip netns exec ext-ehw34 curl --connect-timeout 1 --retry 5 --retry-connrefused http://1.1.2.1:8080/clientip: no such file or directory: unknown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's surprising to me. Even when the command is split and passed as a slice? []string{"ip", "netns", ...}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Splitting the command into a slice could work. Both of them are commonly used to set command, []string{"sh", "-c", cmd}
is easier to construct complex command and the formatting is more friendly in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not finished yet, will continue reviewing.
Action().GotoStage(stageValidation). | ||
Done()) | ||
} | ||
// If the packet is from gateway but its source IP is not the gateway IP, it's considered external sourced traffic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to consider about other IPs on the host here?
/test-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-all |
/test-windows-e2e |
/test-windows-containerd-e2e |
@hongliangl do you have other comments? |
/test-windows-containerd-e2e |
/test-windows-containerd-e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a few nits.
pkg/agent/openflow/pipeline.go
Outdated
@@ -2443,6 +2554,25 @@ func (f *featureService) endpointDNATFlow(endpointIP net.IP, endpointPort uint16 | |||
Done() | |||
} | |||
|
|||
// dsrServiceNoDNATFlow generates the flow which prevents traffic in DSR mode from being DNATed on the ingress Node. | |||
func (f *featureService) dsrServiceNoDNATFlow() []binding.Flow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (f *featureService) dsrServiceNoDNATFlow() []binding.Flow { | |
func (f *featureService) dsrServiceNoDNATFlows() []binding.Flow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/agent/openflow/pipeline.go
Outdated
@@ -2443,6 +2554,25 @@ func (f *featureService) endpointDNATFlow(endpointIP net.IP, endpointPort uint16 | |||
Done() | |||
} | |||
|
|||
// dsrServiceNoDNATFlow generates the flow which prevents traffic in DSR mode from being DNATed on the ingress Node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// dsrServiceNoDNATFlow generates the flow which prevents traffic in DSR mode from being DNATed on the ingress Node. | |
// dsrServiceNoDNATFlows generates the flows which prevent traffic in DSR mode from being DNATed on the ingress Node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/agent/openflow/pipeline.go
Outdated
@@ -1363,18 +1411,43 @@ func (f *featurePodConnectivity) l3FwdFlowToGateway() []binding.Flow { | |||
} | |||
|
|||
// l3FwdFlowToRemoteViaTun generates the flow to match the packets destined for remote Pods via tunnel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// l3FwdFlowToRemoteViaTun generates the flow to match the packets destined for remote Pods via tunnel. | |
// l3FwdFlowToRemoteViaTun generates the flows to match the packets destined for remote Pods via tunnel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/agent/openflow/pipeline.go
Outdated
@@ -1363,18 +1411,43 @@ func (f *featurePodConnectivity) l3FwdFlowToGateway() []binding.Flow { | |||
} | |||
|
|||
// l3FwdFlowToRemoteViaTun generates the flow to match the packets destined for remote Pods via tunnel. | |||
func (f *featurePodConnectivity) l3FwdFlowToRemoteViaTun(localGatewayMAC net.HardwareAddr, peerSubnet net.IPNet, tunnelPeer net.IP) binding.Flow { | |||
func (f *featurePodConnectivity) l3FwdFlowToRemoteViaTun(localGatewayMAC net.HardwareAddr, peerSubnet net.IPNet, tunnelPeer net.IP) []binding.Flow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename it to l3FwdFlowsToRemoteViaTun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/agent/openflow/pipeline.go
Outdated
if isShortCircuiting { | ||
// For short-circuiting flow, an extra match condition matching packet from local Pod CIDR is added. | ||
flowBuilder = ServiceLBTable.ofTable.BuildFlow(priorityHigh). | ||
func (f *featureService) serviceLBFlow(config *types.ServiceConfig) []binding.Flow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (f *featureService) serviceLBFlow(config *types.ServiceConfig) []binding.Flow { | |
func (f *featureService) serviceLBFlows(config *types.ServiceConfig) []binding.Flow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
// If DSR is enabled, packets accessing a DSR Service will not be DNATed on the ingress Node, but EndpointIPField | ||
// holds the selected backend Pod IP, we match it and DSRServiceRegMark to send these packets to corresponding Nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be selected Endpoint IP used in remote Node? If it is not used in remote Node, maybe we could add some comments to explain that we only select the EndpointIP to decide the remote Node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How DSR works end to end is explained in the commit message and the comment of the loadBalancerMode variable in proxier.go. l3FwdFlowsToRemoteViaTun is only a portion of the whole, and I just want to make it focus on L3Forwarding. I feel it would repeated to explain it here.
For your question, the EndpointIP selected in ingress Node will not be used in backend Node. See the explaination:
When a Service's LoadBalancerMode is DSR, the following changes will be applied to the OpenFlow flows and groups:
- ClusterGroup will be used by traffic working in DSR mode on ingress Node.
- LocalGroup will be used by traffic working in DSR mode on backend Node.
- Traffic working in DSR mode on ingress Node will be marked and treated specially, e.g. bypassing SNAT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got that. For ClusterGroup will be used by traffic working in DSR mode on ingress Node.
, if a remote Endpoint is selected, only the remote Node is decided by removing the suffix of the Endpoint IP, and the packet will be sent to remote Node via tunnel. In remote Node, the Endpoint will be selected finally in a local group. Is that right? If so, how about adding more details to explain that how the remote Endpoint is selected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what details do you mean by "adding more details to explain that how the remote Endpoint is selected?"? LocalGroup will be used on backend Node, so only Endpoints on the remote Node wil be selected if the traffic is sent to it, and the bucket selection is just like all other group selection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should misunderstood something. I remembered that you have updated the way of bucket selection. My question is that:
- On ingress Node, if a remote Endpoint A is selected in cluster group, then we can get the CIDR of the Node that holds Endpoint A.
- On the Node that holds Endpoint A, Endpoint A* is selected in local group.
- Will Endpoint A and A* be the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remembered that you have updated the way of bucket selection.
No, bucket selection method update is removed due to its impact on session affinity and internal traffic.
Will Endpoint A and A* be the same?
No, we didn't encode Endpoint A into overlay packet. And it wouldn't require local group to be used on backend Node if they are the same. We DO another selection on backend Node and only pick an Endpoint from local ones.
If you think about it, each Endpoint gets the same chance to be selected regardless of its location. And there is no need to ensure the Endpoint selected on the ingress Node to be the same one selected by backend Node.
I have updated the comment and description to further explain, hope it helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated comment is much helpful to understand that how the remote Endpoint is selected in DSR mode.
daa4e07
to
11bb15a
Compare
This commit adds support for DSR mode for Service's external addresses, including LoadBalancerIPs and ExternalIPs. A configuration option, `antreaProxy.defaultLoadBalancerMode` is added to determine how external traffic is processed when it's load balanced across Nodes by default. It has two options: `nat` (default) and `dsr`. In NAT mode, external traffic is SNAT'd when it's load balanced across Nodes to ensure symmetric path. In DSR mode, external traffic is never SNAT'd and backend Pods running on Nodes that are not the ingress Node can reply to clients directly, bypassing the ingress Node. Additionally, a Service's load balancer mode can be overridden by annotating it with `service.antrea.io/load-balancer-mode`. A feature gate, `LoadBalancerModeDSR` is added to control whether it's allowed to use DSR mode. When a Service's LoadBalancerMode is DSR, the following changes will be applied to the OpenFlow flows and groups: 1. ClusterGroup will be used by traffic working in DSR mode on ingress Node. * If a local Endpoint is selected, it will just be handled normally as DSR is not applicable in this case. * If a remote Endpoint is selected, it will be sent to the backend Node that hosts the Endpoint without being NAT'd, the eventual Endpoint will be determined on the backend Node and may be different from the one selected here. 2. LocalGroup will be used by traffic working in DSR mode on backend Node. In this way, each Endpoint has the same chance to be selected eventually. 3. Traffic working in DSR mode on ingress Node will be marked and treated specially, e.g. bypassing SNAT. 4. Learned flow will be created for each connection to ensure consistent load balance decision for a connection of DSR mode. Learned flow is necessary because connections of DSR mode will remain invalid on ingress Node as it can only see requests and not responses. And OVS doesn't provide ct_state and ct_label for invalid connections. Thus, we can't store the load balance decision of the connection to ct_state or ct_label. To ensure consistent load balancing decision for packets of a connection, we use "learn" action to generate a learned flow when processing the first packet of a connection, and rely on the learned flow to process subsequent packets of the same connection. DSR mode usually means lower latency, higher output bandwidth, and preserved client IP. However, due to the use of learned flow, creating new connections may be slightly slower than NAT mode, this may be improved in the future. The benchmark of the current implementation is as below: ``` Test NAT DSR delta TCP_CRR 1105.69 1007.82 -8.86% TCP_RR 6802.55 9054.44 +33.1% ``` This feature is currently only supported for Linux Nodes, `encap` mode, and IPv4 cluster. The support for Windows and IPv6 can be added in the future. Signed-off-by: Quan Tian <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-windows-containerd-e2e |
/skip-all which has succeeded before updating some comments |
This commit adds support for DSR mode for Service's external addresses, including LoadBalancerIPs and ExternalIPs. A configuration option,
antreaProxy.defaultLoadBalancerMode
is added to determine how external traffic is processed when it's load balanced across Nodes by default. It has two options:nat
(default) anddsr
. In NAT mode, external traffic is SNAT'd when it's load balanced across Nodes to ensure symmetric path. In DSR mode, external traffic is never SNAT'd and backend Pods running on Nodes that are not the ingress Node can reply to clients directly, bypassing the ingress Node.Additionally, a Service's load balancer mode can be overridden by annotating it with
service.antrea.io/load-balancer-mode
. A feature gate,LoadBalancerModeDSR
is added to control whether it's allowed to use DSR mode.When a Service's LoadBalancerMode is DSR, the following changes will be applied to the OpenFlow flows and groups:
Learned flow is necessary because connections of DSR mode will remain invalid on ingress Node as it can only see requests and not responses. And OVS doesn't provide ct_state and ct_label for invalid connections. Thus, we can't store the load balance decision of the connection to ct_state or ct_label. To ensure consistent load balancing decision for packets of a connection, we use "learn" action to generate a learned flow when processing the first packet of a connection, and rely on the learned flow to process subsequent packets of the same connection.
DSR mode usually means lower latency, higher output bandwidth, and preserved client IP. However, due to the use of learned flow, creating new connections may be slightly slower than NAT mode, this may be improved in the future. The benchmark of the current implementation is as below:
This feature is currently only supported for Linux Nodes,
encap
mode, and IPv4 cluster. The support for Windows and IPv6 can be added in the future.Closes #5025
Doc will be added via a new PR soon.