Antrea Proxy NodePort Service Support #1463

Closed
weiqiangt opened this issue Oct 30, 2020 · 19 comments
Labels
area/component/agent Issues or PRs related to the agent component kind/design Categorizes issue or PR as related to design. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

weiqiangt commented Oct 30, 2020

Google Docs
Describe what you are trying to solve
This draft is only for Linux Nodes; we still need to design the solution for Windows Nodes.
We have already implemented ClusterIP Service support in Antrea, but Kube-Proxy is still needed for NodePort Services. Since Kube-Proxy cannot run only for NodePort Services, its ClusterIP Service processing wastes CPU cycles and memory. Once we implement NodePort support in Antrea Proxy and remove Kube-Proxy from the cluster, that overhead goes away. Furthermore, the traffic for watching Service resources should also decrease, lowering the pressure on the APIServer.

Describe the solution you have in mind
For both ClusterIP and NodePort Services, traffic accessing them should always be DNATed to a Pod Endpoint, so we can reuse the ClusterIP Endpoint selection flows in OVS. To achieve this, traffic going to the host must be redirected to OVS correctly.

Describe how your solution impacts user flows
Once we implement this feature, we can in theory remove the Kube-Proxy deployment, although we first need to consider how to start Antrea without Kube-Proxy.

Describe the main design/architecture of your solution
From our prior experiments, IPTables performance degrades significantly when there are too many rules, so we should keep the number of IPTables rules as small as possible. By using IPSet, we can match the traffic that needs to be redirected with a constant number of IPTables rules, and the matching complexity is O(1) since we use a hash-type set. For each valid NodePort Service, the set contains one entry per Node address.
Whether traffic comes from a remote host or from the current host, once its destination matches an entry in the IPSet we need to forward it to OVS. We do this by DNATing the packets to the link-local address 169.254.169.110 so that they are forwarded to OVS; an IP route rule is required to make this forwarding actually happen. Traffic may be sent from 127.0.0.1, in which case we need to masquerade it so that the destination knows where to reply. In the POSTROUTING chain of the nat table, we masquerade packets that come from 127.0.0.1 and go to antrea-gw0.
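A minimal shell sketch of this plumbing; the set names come from the iptables rules below, while the Node address and NodePort (192.168.77.100, TCP/30001) are only illustrative:

# Hash-type sets matched by the ANTREA-NODEPORT chain below.
ipset create ANTREA-NODEPORT-LOCAL hash:ip,port
ipset create ANTREA-NODEPORT-CLUSTER hash:ip,port
# One entry per (Node address, NodePort) pair; the values here are illustrative.
ipset add ANTREA-NODEPORT-CLUSTER 192.168.77.100,tcp:30001
# Host route so that packets DNATed to the link-local virtual IP are forwarded to OVS.
ip route add 169.254.169.110 dev antrea-gw0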

IPTables

There are two options for a NodePort Service's externalTrafficPolicy: Cluster (the default) and Local. Cluster obscures the client source IP and may cause a second hop to another Node, but should have good overall load-spreading. Local preserves the client source IP and avoids a second hop for LoadBalancer and NodePort type Services, but risks potentially imbalanced traffic spreading. The Local approach preserves the original source IP address; if there are no local Endpoints, packets sent to the Node are dropped, so you can rely on the correct source IP in any packet processing rules you might apply to a packet that makes it through to the Endpoint.
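For reference, this is how a user would select the policy on a Service; the Service name and ports here are illustrative and not part of this design:

kubectl create service nodeport demo --tcp=80:80 --node-port=30001
kubectl patch service demo -p '{"spec":{"externalTrafficPolicy":"Local"}}'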
The IPTables implementation, taking externalTrafficPolicy into account, would look like the following.

Chain OUTPUT (policy ACCEPT)
target           prot opt  source     destination
ANTREA-NODEPORT  all  --   0.0.0.0/0  0.0.0.0/0            /* Antrea: jump to Antrea NodePort Service rules */
 
Chain PREROUTING (policy ACCEPT)
target           prot opt source      destination
ANTREA-NODEPORT  all  --  0.0.0.0/0   0.0.0.0/0            /* Antrea: jump to Antrea NodePort Service rules */
 
Chain ANTREA-NODEPORT (2 references)
target     prot opt source         destination
MARK       all  --  0.0.0.0/0      0.0.0.0/0            match-set ANTREA-NODEPORT-LOCAL dst,dst MARK set 0xf0
MARK       all  --  0.0.0.0/0      0.0.0.0/0            match-set ANTREA-NODEPORT-CLUSTER dst,dst MARK set 0xf1
DNAT       all  --  0.0.0.0/0      0.0.0.0/0            mark match 0xf0 to:169.254.169.110
DNAT       all  --  0.0.0.0/0      0.0.0.0/0            mark match 0xf1 to:169.254.169.110
 
Chain POSTROUTING (policy ACCEPT)
target                prot opt source     destination
ANTREA-NODEPORT-MASQ  all  --  0.0.0.0/0  0.0.0.0/0    /* Antrea: jump to Antrea NodePort Service MASQ rules */
ANTREA-POSTROUTING    all  --  0.0.0.0/0  0.0.0.0/0    /* Antrea: jump to Antrea postrouting rules */
 
Chain ANTREA-NODEPORT-MASQ (1 references)
target     prot opt source               destination
MASQ       all  --  127.0.0.1            169.254.169.110
MASQ       all  --  0.0.0.0/0            169.254.169.110      mark match 0xf1

There are two typical traffic paths of NodePort services.

Patterns

For the Local policy, we only need to care about the first case; for the Cluster policy, we care about both cases.
For the Local policy, the detailed traffic path of our implementation looks like:

ExternalTrafficPolicyLocal

For the two-hop cases, the traffic path looks like below.

ExternalTrafficPolicyCluster

Based on the cases above, we need the following flows:

The flow that sends NodePort packets coming from the gateway back to the gateway with ServiceCTMark
In the current OVS pipeline, packets from a Pod to external addresses are tracked with CT_MARK 0x20, while we do DNAT for Endpoint selection with CT_MARK 0x21. As a result, the second packet of a connection coming from an external client would not be tracked correctly, so we need the following flow to handle this issue.

c.pipeline[conntrackCommitTable].BuildFlow(priorityHigh).
    MatchProtocol(binding.ProtocolIP).
    MatchCTMark(serviceCTMark).
    MatchCTStateNew(true).
    MatchCTStateTrk(true).
    MatchRegRange(int(marksReg), markTrafficFromGateway, binding.Range{0, 15}).
    Action().GotoTable(L2ForwardingOutTable).
    Done()

Virtual IP ARP responder

[]binding.Flow{
    c.pipeline[spoofGuardTable].BuildFlow(priorityNormal).MatchProtocol(binding.ProtocolARP).
        MatchInPort(gatewayOFPort).
        MatchARPTpa(NodePortVirtualIP).
        MatchARPSpa(nodeIP).
        Action().GotoTable(arpResponderTable).
        Cookie(c.cookieAllocator.Request(cookie.Service).Raw()).
        Done(),
    c.pipeline[arpResponderTable].BuildFlow(priorityNormal).MatchProtocol(binding.ProtocolARP).
        MatchARPOp(1).
        MatchARPTpa(NodePortVirtualIP).
        Action().Move(binding.NxmFieldSrcMAC, binding.NxmFieldDstMAC).
        Action().SetSrcMAC(globalVirtualMAC).
        Action().LoadARPOperation(2).
        Action().Move(binding.NxmFieldARPSha, binding.NxmFieldARPTha).
        Action().SetARPSha(globalVirtualMAC).
        Action().Move(binding.NxmFieldARPSpa, binding.NxmFieldARPTpa).
        Action().SetARPSpa(NodePortVirtualIP).
        Action().OutputInPort().
        Cookie(c.cookieAllocator.Request(cookie.Service).Raw()).
        Done(),
}

Alternative solutions that you considered
For the host traffic forwarding part, we could use alternatives like eBPF or IPVS, but for now I cannot see any significant disadvantage to using IPTables.

Test plan
We can verify this feature by using e2e and conformance tests.

Additional context
Since we use an IPSet to match NodePort Services, the time complexity of packet matching is O(1). The time complexity of OVS flow matching is also O(1), so the performance should be decent. Moreover, since the number of IPTables rules will drop significantly once we remove Kube-Proxy, the connection set-up delay should decrease too. Based on this analysis, the implementation should match or improve upon the current performance.
As we can see, traffic from a Pod to a NodePort Service goes through a complex path. But since NodePort Services are designed for out-of-cluster access, Pod-to-NodePort traffic is neither a common nor a best-practice use case. To keep the implementation clear and efficient for the real use cases, IMO this implementation is reasonable.

weiqiangt added the kind/design and area/component/agent labels Oct 30, 2020
weiqiangt linked a pull request Nov 2, 2020 that will close this issue
antoninbas mentioned this issue Jan 8, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Feb 10, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Feb 18, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Feb 18, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Feb 18, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Feb 18, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Feb 24, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Feb 24, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Mar 2, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Mar 4, 2021
- Implement ClusterIP and Loadbalancer Services support
- Add NodePort support for Antrea Proxy on Linux

Resolves antrea-io#1463.

Signed-off-by: Weiqiang Tang <[email protected]>
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Mar 5, 2021
weiqiangt added a commit to weiqiangt/antrea that referenced this issue Mar 8, 2021
hongliangl pushed a commit to hongliangl/antrea that referenced this issue Apr 6, 2021
hongliangl pushed a commit to hongliangl/antrea that referenced this issue Apr 25, 2021
github-actions bot commented May 1, 2021

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

github-actions bot added the lifecycle/stale label May 1, 2021
hongliangl pushed a commit to hongliangl/antrea that referenced this issue May 31, 2021
@hongliangl

Replace KubeProxy NodePort with Linux TC

Scenarios

For simplicity, assume that we have an interface eth0 whose two IP addresses can be used for NodePort. There are three scenarios:

  • Accessing from remote with eth0's IP addresses and NodePort protocol/port. If externalTrafficPolicy is Local, source IP address will not be masqueraded.
  • Accessing from localhost with eth0's IP addresses and NodePort protocol/port. If externalTrafficPolicy is Local, source IP address will be masqueraded.
  • Accessing from localhost with 127.0.0.1/::1 and NodePort protocol/port.

Traffic Redirect


NodePort Traffic From Remote

Here we assume that the gateway of Antrea is antrea-gw0.

The key idea is:

  • Redirecting matched request NodePort traffic (from remote hosts) with filters at eth0's ingress to antrea-gw0's egress.
  • Redirecting matched response NodePort traffic (from pods) with filters at antrea-gw0's ingress to eth0's egress.
Request Traffic

For NodePort traffic from remote hosts, filters will be created and attached to eth0's ingress. Assume that eth0's IP addresses are 192.168.2.1 and 172.16.2.1, and the NodePort protocol/port is TCP/57001.

  • A filter that matches destination IP 192.168.2.1 and destination port TCP/57001 will be created and attached to eth0's ingress.
  • A filter that matches destination IP 172.16.2.1 and destination port TCP/57001 will be created and attached to eth0's ingress.

The commands are below:

tc filter add dev eth0 parent ffff:0 prio 104 \
	protocol ip chain 0 handle 0x6dea9 \
	flower ip_proto tcp dst_ip 192.168.2.1 dst_port 57001 \
	action mirred egress redirect dev antrea-gw0
tc filter add dev eth0 parent ffff:0 prio 104 \
	protocol ip chain 0 handle 0x106dea9 \
	flower ip_proto tcp dst_ip 172.16.2.1 dst_port 57001 \
	action mirred egress redirect dev antrea-gw0

What the filters created by the above commands do:

  • At eth0's ingress, traffic matching the destination IP and destination protocol/port will be redirected to antrea-gw0's egress.

Note that the handle IDs 0x6dea9 and 0x106dea9 are not auto-generated or random values; they are calculated from several parameters. The handle ID is needed to delete a specific filter, since a specific filter cannot be deleted just by replacing the keyword add with del in the command. Handle ID calculation is explained in the appendix.
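For example, deleting the first filter above would look roughly like this (a sketch; it assumes the filter was added with the priority and handle shown above):

tc filter del dev eth0 parent ffff:0 protocol ip prio 104 handle 0x6dea9 flower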

Warning: this is not the best filter design! If an interface has many IP addresses (say, 10), then 10 commands will be executed and 10 filters will be created and attached to the interface's ingress, which is neither graceful nor efficient. If possible, a better hierarchical filter design would be the following:

  • Create an init filter in the default chain (chain 0) that matches the destination IP address against every IP address of the interface. The matched traffic is sent to a target chain.
  • The target chain contains filters matching the NodePort protocol/port. Once a NodePort is created, a filter matching its protocol/port is created and appended to the target chain.

Unfortunately, there are some strange issues with this hierarchical filter design (a sketch of the idea is shown below for reference).
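Purely for illustration, the hierarchical design described above could look like the following; the chain number 100 is arbitrary and the addresses/port are the ones assumed earlier:

# Chain 0: one filter per interface IP address, jumping to a per-interface target chain.
tc filter add dev eth0 parent ffff:0 prio 104 protocol ip chain 0 \
	flower dst_ip 192.168.2.1 action goto chain 100
tc filter add dev eth0 parent ffff:0 prio 104 protocol ip chain 0 \
	flower dst_ip 172.16.2.1 action goto chain 100
# Target chain: one filter per NodePort protocol/port, added when the Service is created.
tc filter add dev eth0 parent ffff:0 prio 104 protocol ip chain 100 \
	flower ip_proto tcp dst_port 57001 action mirred egress redirect dev antrea-gw0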

Response Traffic

Response NodePort traffic should be redirected back to the interface that its request traffic came from, and the request traffic can come from different interfaces. Here the hierarchical filter design can work on interface antrea-gw0.

An init filter matching the network protocol (IPv4/IPv6) and the source IP address will be created for every interface that has an available NodePort IP address when AntreaProxy is initialized. The following commands will be executed:

tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
	protocol ip flower \
	src_ip 192.168.2.1 \
	action goto chain 259
tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
	protocol ip flower \
	src_ip 172.16.2.1 \
	action goto chain 259

Note that the chain number 259 is also calculated from several parameters; this will be explained in the appendix.

When a NodePort Service is created, a filter will be created and attached to chain 259. The following command will be executed:

tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
	chain 259 handle 0x6dea9 protocol ip flower \
	ip_proto tcp src_port 57001 \
	action mirred egress redirect dev eth0

What the filter created by the above command does:

  • At antrea-gw0's ingress, traffic matching the source IP address and source TCP port will be redirected to eth0's egress.

Note that the handle ID above (0x6dea9) is also not an auto-generated or random value; it is calculated from several parameters.

NodePort Traffic From Localhost

The key idea is:

  • Redirecting matched request NodePort traffic (from localhost) with filters at lo's egress to antrea-gw0's egress.
  • Redirecting matched response NodePort traffic (from pods) with filters at antrea-gw0's ingress to lo's ingress.
Request Traffic

For NodePort traffic from localhost, filters will be created and attached to lo's egress. The following endpoints (not Pod Endpoints) should be available:

  • 127.0.0.1:57001
  • 192.168.2.1:57001
  • 172.16.2.1:57001

For every endpoint, a filter will be created by the following commands:

# For endpoint 127.0.0.1:57001
tc filter add dev lo parent a:0 prio 4 \
	protocol ip chain 0 handle 0x1396dea9 flower \
	ip_proto tcp dst_ip 127.0.0.1 dst_port 57001 \
	action skbmod set smac 12:34:56:78:9a:bc pipe \
	action mirred egress redirect dev antrea-gw0

# For endpoint 192.168.2.1:57001
tc filter add dev lo parent a:0 prio 4 \
	protocol ip chain 0 handle 0x1396dea9 flower \
	ip_proto tcp dst_ip 192.168.2.1 dst_port 57001 \
	action skbmod set smac 12:34:56:78:9a:bc pipe \
	action mirred egress redirect dev antrea-gw0

# For endpoint 172.16.2.1:57001
tc filter add dev lo parent a:0 prio 4 \
	protocol ip chain 0 handle 0x1396dea9 flower \
	ip_proto tcp dst_ip 172.16.2.1 dst_port 57001 \
	action skbmod set smac 12:34:56:78:9a:bc pipe \
	action mirred egress redirect dev antrea-gw0

What the filters created by the above commands do:

  • At lo's egress, the source MAC address of traffic matching the destination IP address and destination TCP port will first be changed to antrea-gw0's MAC address (assume that antrea-gw0's MAC address is 12:34:56:78:9a:bc).
  • After the action above, the traffic will be redirected to antrea-gw0's egress.

Note that the handle ID above (0x1396dea9) is also not an auto-generated or random value; it is calculated from several parameters.

Response Traffic

Response NodePort traffic should be redirected back to the interface that its request traffic came from, and the request traffic can come from different interfaces. Here the hierarchical filter design can work on interface antrea-gw0.

The following endpoints' response traffic should be redirected to lo:

  • 127.0.0.1:57001
  • 192.168.2.1:57001
  • 172.16.2.1:57001

For every endpoint, a filter will be created by the following commands:

# For endpoint 127.0.0.1:57001's response traffic, the destination IP is not needed.
tc filter add dev antrea-gw0 parent ffff:0 prio 4 \
	protocol ip flower \
	src_ip 127.0.0.1 \
	action goto chain 257

# For endpoint 192.168.2.1:57001's response traffic, both source and destination IP are needed.
tc filter add dev antrea-gw0 parent ffff:0 prio 4 \
	protocol ip flower \
	src_ip 192.168.2.1 dst_ip 192.168.2.1 \
	action goto chain 257

# For endpoint 172.16.2.1:57001's response traffic, both source and destination IP are needed.
tc filter add dev antrea-gw0 parent ffff:0 prio 4 \
	protocol ip flower \
	src_ip 172.16.2.1 dst_ip 172.16.2.1 \
	action goto chain 257

Hint: these filters' priority is 4 (according to the Linux TC documentation, a filter with a smaller priority value has higher priority), so they are evaluated before the priority 104 filters above.

When a NodePort is created, a filter will be created and attached to chain 257. The following command will be executed.

tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
	chain 257 handle 0x6dea9 protocol ip flower \
	ip_proto tcp src_port 57001 \
	action skbmod set smac 00:00:00:00:00:00 set dmac 00:00:00:00:00:00 pipe \
	action mirred ingress redirect dev lo

What the filter created by the above command does:

  • At antrea-gw0's ingress, the source and destination MAC addresses of traffic matching the source IP address and source TCP port will first be changed to the all-zero MAC address.
  • After the action above, the traffic will be redirected to lo's ingress.

Note that the handle ID above (0x6dea9) is also not an auto-generated or random value; it is calculated from several parameters.

OVS Pipeline

serviceSNATTable

This is a new table between serviceHairpinTable and conntrackTable.

If externalTrafficPolicy is Cluster, the source IP will be masqueraded to antrea-gw0's IP address. The following flows will be appended to table serviceSNATTable.

cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=127.0.0.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 127.0.0.1:57001.
cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=192.168.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 192.168.2.1:57001.
cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=172.16.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 172.16.2.1:57001.
cookie=0x1000000000000, table=29, priority=0 actions=resubmit(,30) // Default flow

If externalTrafficPolicy is Local, the source IP of traffic from localhost will be masqueraded to antrea-gw0's IP address, while the source IP of traffic from remote hosts will be preserved. The following flows will be appended to table serviceSNATTable.

cookie=0x1040000000000, table=29, priority=210,tcp,reg0=0x1/0xffff,nw_src=127.0.0.1,nw_dst=127.0.0.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1)) // For endpoint 127.0.0.1:57001 accessed from localhost.
cookie=0x1040000000000, table=29, priority=210,tcp,reg0=0x1/0xffff,nw_src=192.168.2.1,nw_dst=192.168.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1)) // For endpoint 192.168.2.1:57001 accessed from localhost.
cookie=0x1040000000000, table=29, priority=210,tcp,reg0=0x1/0xffff,nw_src=172.16.2.1,nw_dst=172.16.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1)) // For endpoint 172.16.2.1:57001 accessed from localhost.
cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=192.168.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 192.168.2.1:57001 accessed from remote.
cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=172.16.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 172.16.2.1:57001 accessed from remote.
cookie=0x1000000000000, table=29, priority=0 actions=resubmit(,30) // Default flow

serviceLBTable

For every NodePort IP address, a flow will be created and appended to table serviceLBTable:

 cookie=0x1040000000000, table=41, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=127.0.0.1,tp_dst=57001 actions=load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[19],group:3
 cookie=0x1040000000000, table=41, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=192.168.2.1,tp_dst=57001 actions=load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[19],group:3
 cookie=0x1040000000000, table=41, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=172.16.2.1,tp_dst=57001 actions=load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[19],group:3

If externalTrafficPolicy is Local, the group only has local endpoints.
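For illustration only, a select group with two local Endpoints might look roughly like the entry below; the Endpoint IPs/port (10.10.0.4, 10.10.0.5, port 80) and the exact register layout are assumptions based on how AntreaProxy encodes Endpoints in group buckets, not output from this implementation:

group_id=3,type=select,bucket=bucket_id:0,weight:100,actions=load:0xa0a0004->NXM_NX_REG3[],load:0x50->NXM_NX_REG4[0..15],resubmit(,42),bucket=bucket_id:1,weight:100,actions=load:0xa0a0005->NXM_NX_REG3[],load:0x50->NXM_NX_REG4[0..15],resubmit(,42)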

conntrackCommitTable

Add two flows to match Service traffic from antrea-gw0.

 cookie=0x1000000000000, table=105, priority=210,ct_state=+trk,ct_mark=0x21,ip,reg0=0x1/0xffff actions=resubmit(,106)
 cookie=0x1000000000000, table=105, priority=210,ct_state=+trk,ct_mark=0x21,ipv6,reg0=0x1/0xffff actions=resubmit(,106)

If the flows above are not added to table conntrackCommitTable, the Service traffic will be matched by the flow below. If matched, the mark 0x21 will be replaced with mark 0x20, which leads to an unwanted result.

cookie=0x1000000000000, table=105, priority=200,ct_state=+new+trk,ip,reg0=0x1/0xffff actions=ct(commit,table=106,zone=65520,exec(load:0x20->NXM_NX_CT_MARK[]))

serviceConntrackTable

Do SNAT/unSNAT for Service traffic. Other traffic will be matched as this is another ct zone.

cookie=0x1000000000000, table=106, priority=200,ip actions=ct(table=107,zone=65521,nat)
cookie=0x1000000000000, table=106, priority=200,ipv6 actions=ct(table=107,zone=65511,nat)
cookie=0x1000000000000, table=106, priority=0 actions=resubmit(,107)

serviceDstMacRewriteTable

For NodePort traffic from remote hosts, the destination MAC address must be rewritten. Note that these flows have no effect on NodePort traffic from localhost, as that traffic's destination MAC address will be rewritten by Linux TC.

cookie=0x1040000000000, table=107, priority=200,tcp,nw_src=127.0.0.1,tp_src=57001 actions=move:NXM_NX_CT_LABEL[0..47]->NXM_OF_ETH_DST[],resubmit(,108)
cookie=0x1040000000000, table=107, priority=200,tcp,nw_src=172.16.2.1,tp_src=57001 actions=move:NXM_NX_CT_LABEL[0..47]->NXM_OF_ETH_DST[],resubmit(,108)
cookie=0x1040000000000, table=107, priority=200,tcp,nw_src=192.168.2.1,tp_src=57001 actions=move:NXM_NX_CT_LABEL[0..47]->NXM_OF_ETH_DST[],resubmit(,108)

Appendix

TC handle ID generation (NodePort Traffic From Remote / Request Traffic)

The handle IDs 0x6dea9 and 0x106dea9 are calculated from:

  • IP address index. An interface may have multiple IP addresses; when the IP addresses of an interface are retrieved as a slice, the IP address index is the slice index.
  • L3 protocol: IPv4 (0x0), IPv6 (0x29).
  • L4 protocol: TCP (0x6), UDP (0x11), SCTP (0x84).
  • Port number: the NodePort port number.

The handle ID is:

(IP address index << 24) | ((L3 protocol & 0xf) << 20) | ((L4 protocol & 0xf) << 16) | NodePort port number

When the destination port is 57001 and the IP address index is 1, all possible handle IDs are:

| L3 Protocol | L4 Protocol | Handle ID |
| --- | --- | --- |
| IPv4 (0x0) | TCP (0x6) | 0x106dea9 |
| IPv4 (0x0) | UDP (0x11) | 0x101dea9 |
| IPv4 (0x0) | SCTP (0x84) | 0x104dea9 |
| IPv6 (0x29) | TCP (0x6) | 0x196dea9 |
| IPv6 (0x29) | UDP (0x11) | 0x191dea9 |
| IPv6 (0x29) | SCTP (0x84) | 0x194dea9 |
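As a sanity check, here is a small shell sketch of the formula above; the function name is illustrative:

# handle_id <ip-index> <l3-proto> <l4-proto> <port>: compute the TC filter handle ID.
handle_id() {
	printf '0x%x\n' $(( ($1 << 24) | (($2 & 0xf) << 20) | (($3 & 0xf) << 16) | $4 ))
}
handle_id 1 0x0  0x6 57001   # IPv4/TCP  -> 0x106dea9
handle_id 1 0x29 0x6 57001   # IPv6/TCP  -> 0x196dea9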

@hongliangl

Hello @jianjuns @tnqn @antoninbas @wenyingd @lzhecheng, this is the design for NodePort support with Linux TC. Please share your suggestions. Thanks.

github-actions bot removed the lifecycle/stale label Jun 16, 2021
hongliangl self-assigned this Jun 21, 2021
hongliangl commented Jun 23, 2021

Hello @jianjuns @tnqn, PR #2239 has implemented 11 of the NodePort cases in the following table. Do you think all NodePort cases are necessary?

| No. | From Network | NodePort ExternalTrafficPolicy | Endpoint Network | SNAT | AntreaProxy Support | KubeProxy Support | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Remote | Cluster | Pod CIDR | Yes | ✔️ | ✔️ | |
| 2 | Remote | Local | Pod CIDR | No | ✔️ | ✔️ | |
| 3 | Remote | Cluster | Host | Yes | ✔️ | ✔️ | |
| 4 | Remote | Local | Host | No | | ✔️ | |
| 5 | Localhost | Cluster | Pod CIDR | Yes | ✔️ | ✔️ | |
| 6 | Localhost | Local | Pod CIDR | Yes | ✔️ | ✔️ | |
| 7 | Localhost | Cluster | Host | Yes | ✔️ | ✔️ | |
| 8 | Localhost | Local | Host | Yes | ✔️ | ✔️ | |
| 9 | Pod | Cluster | Pod CIDR | Yes | ✔️ | ✔️ | |
| 10 | Pod | Local | Pod CIDR | No | ✔️ | ✔️ | |
| 11 | Pod | Cluster | Host | Yes | ✔️ | ✔️ | |
| 12 | Pod | Local | Host | No | ✔️ | ✔️ | |

@jianjuns

@hongliangl: what does Endpoint Network = Service CIDR mean? Do you mean Service access through the ClusterIP, not the NodePort (but I thought you were asking about NodePort)?

@hongliangl

Sorry, I misunderstood something. It should be Pod CIDR, not Service CIDR.

@jianjuns

> Sorry, I misunderstood something. It should be Pod CIDR, not Service CIDR.

So that means the Service endpoints are Pods; while "Host" means the Endpoints are Node IPs?

@hongliangl

Yes, Host means that the Pod's network is the K8s Node network.

@jianjuns

@hongliangl : two questions.

  1. In your examples of TC rules for response packets, there is no case for externalTrafficPolicy = Local? In the rules, source IPs are matched, but I think in the "Local" case we do not have fixed source IPs?

  2. For request packets, does kube-proxy require the destination IP to be one of the Node IPs? Could we match any destination IP?

hongliangl commented Jun 30, 2021

Hi @jianjuns,

> 1. In your examples of TC rules for response packets, there is no case for externalTrafficPolicy = Local? In the rules, source IPs are matched, but I think in the "Local" case we do not have fixed source IPs?

On general interfaces (like eth0), TC rules match NodePort destination IPs (the externalTrafficPolicy = Local case preserves the request source IPs). On the Antrea gateway (antrea-gw0), TC rules match NodePort source IPs (destination IPs can be fixed or not).

> 2. For request packets, does kube-proxy require the destination IP to be one of the Node IPs? Could we match any destination IP?

KubeProxy can match any IP if the NodePortAddresses option is not set. We can also match any destination IP if AntreaProxy's NodePortAddresses option (this option is also implemented in AntreaProxy) is not set.

@jianjuns

> On the Antrea gateway (antrea-gw0), TC rules match NodePort source IPs (destination IPs can be fixed or not).

The example rule you gave is like:

tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
	protocol ip flower \
	src_ip 192.168.2.1 \
	action goto chain 259

So, it matches the source IP to be the Node IP. But in my understanding, if externalTrafficPolicy = Local, the source IP of the response packets should not be the Node IP, but the server Pod IP?

> We can also match any destination IP if AntreaProxy's NodePortAddresses option (this option is also implemented in AntreaProxy) is not set.

Does your PR already do that?

@hongliangl

> So, it matches the source IP to be the Node IP. But in my understanding, if externalTrafficPolicy = Local, the source IP of the response packets should not be the Node IP, but the server Pod IP?

I don't think the source IP of response NodePort traffic can be the serving Pod's IP. If it were, the source/destination IP addresses of the request and response traffic would be asymmetric.

> Does your PR already do that?

Yes.

@jianjuns

Could you just explain the end-to-end forwarding path of an "externalTrafficPolicy = Local" request and its response? Like when it is SNAT'd / DNAT'd and to what IP?

I feel the original iptables design description by Weiqiang is much easier to follow. I do not need the diagrams, but at least describe the end-to-end path in text.

@hongliangl

> Could you just explain the end-to-end forwarding path of an "externalTrafficPolicy = Local" request and its response? Like when it is SNAT'd / DNAT'd and to what IP?
>
> I feel the original iptables design description by Weiqiang is much easier to follow. I do not need the diagrams, but at least describe the end-to-end path in text.

An example for externalTrafficPolicy = Local. Assume that the NodePort IP is 192.168.77.100, the port is 30001, and the Endpoint is 10.10.0.4:80.

Request traffic:

  • Remote client: nw_src: 192.168.77.1 nw_dst: 192.168.77.100 tp_src: 12345 tp_dst: 30001
  • AntreaProxy gateway: nw_src: 192.168.77.1 nw_dst: 192.168.77.100 tp_src: 12345 tp_dst: 30001
  • AntreaProxy table 42(DNAT): nw_src: 192.168.77.1 nw_dst: 10.10.0.4 tp_src: 12345 tp_dst: 80
  • Endpoint Pod: nw_src: 192.168.77.1 nw_dst: 10.10.0.4 tp_src: 12345 tp_dst: 80

Response traffic:

  • Endpoint Pod: nw_src: 10.10.0.4 nw_dst: 192.168.77.1 tp_src: 80 tp_dst: 12345
  • AntreaProxy table 30(unDNAT): nw_src: 192.168.77.100 nw_dst: 192.168.77.1 tp_src: 30001 tp_dst: 12345
  • AntreaProxy gateway: nw_src: 192.168.77.100 nw_dst: 192.168.77.1 tp_src: 30001 tp_dst: 12345
  • Remote client: nw_src: 192.168.77.100 nw_dst: 192.168.77.1 tp_src: 30001 tp_dst: 12345

jianjuns commented Jun 30, 2021

Thanks! It is much clearer to me now!

So for this rule:

tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
	protocol ip flower \
	src_ip 192.168.2.1 \
	action goto chain 259

If NodePortAddresses is not specified, what will it look like? Can we no longer match src_ip to be a Node IP?

hongliangl commented Jun 30, 2021

> If NodePortAddresses is not specified, what will it look like? Can we no longer match src_ip to be a Node IP?

If NodePortAddresses is not specified, AntreaProxy will use the IPv4 and IPv6 addresses (excluding link-local) of all local interfaces (excluding the Antrea gateway, which can never be accessed by an external client) as NodePort IPs.

For example, if a K8s Node has interfaces eth0 (IPs: 192.168.1.1, 192.168.2.1; ifindex: 2) and eth1 (IP: 172.16.1.1; ifindex: 3), the TC rules below will be created on the Antrea gateway's ingress qdisc:

tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
	protocol ip flower \
	src_ip 192.168.1.1 \
	action goto chain 258
tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
	protocol ip flower \
	src_ip 192.168.2.1 \
	action goto chain 258

tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
	protocol ip flower \
	src_ip 172.16.1.1 \
	action goto chain 269

@jianjuns

@hongliangl: a question on your case 4. I find it hard to say it is not a valid case. Do you think we are able to support it with SNAT?

And I am trying to understand the challenge without SNAT. I guess the problem is redirecting the return traffic to OVS. But with TC flower, could we use conntrack to mark the connection after the request is forwarded out of the OVS bridge (through gw0)?

@lzhecheng

@hongliangl should this issue be closed? Maybe @jianjuns's question has already been solved?

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions bot added the lifecycle/stale label Feb 25, 2022