Antrea Proxy NodePort Service Support #1463
This issue is stale because it has been open for 180 days with no activity. Remove the stale label or comment, or this will be closed in 180 days.
Replace KubeProxy NodePort with Linux TC
Scenarios
For simplicity, here we assume that we have an interface eth0 whose two IP addresses can be used for NodePort. There are three scenarios:
Traffic Redirect
NodePort Traffic From Remote
The key idea is:
Request Traffic
For NodePort traffic from remote, filters will be created and attached to eth0's ingress. Assume that eth0's IP addresses are 192.168.2.1 and 172.16.2.1.
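Note that attaching filters at parent ffff:0 assumes an ingress qdisc already exists on the device; a minimal setup sketch using standard tc commands (device names taken from the examples in this design):
# Attach ingress qdiscs so that filters can be added at parent ffff:0.
tc qdisc add dev eth0 ingress
tc qdisc add dev antrea-gw0 ingress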
The commands are below:
tc filter add dev eth0 parent ffff:0 prio 104 \
protocol ipv4 chain 0 handle 0x6dea9 \
flower ip_proto tcp dst_ip 192.168.2.1 dst_port 57001 \
action mirred egress redirect dev antrea-gw0
tc filter add dev eth0 parent ffff:0 prio 104 \
protocol ipv4 chain 0 handle 0x106dea9 \
flower ip_proto tcp dst_ip 172.16.2.1 dst_port 57001 \
action mirred egress redirect dev antrea-gw0
What the filters created by the above commands do:
Warning: this is not the best filter design! If an interface has many IP addresses (say, 10), then 10 commands will be executed and 10 filters will be created and attached to the interface's ingress. This is neither graceful nor efficient. If possible, a better hierarchical filter design is the following:
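A minimal sketch of one possible hierarchical layout on eth0's ingress (the chain number 100 is an arbitrary choice for illustration, not taken from the design above):
# One filter per NodePort IP address that only jumps to a shared chain.
tc filter add dev eth0 parent ffff:0 prio 104 protocol ipv4 flower dst_ip 192.168.2.1 action goto chain 100
tc filter add dev eth0 parent ffff:0 prio 104 protocol ipv4 flower dst_ip 172.16.2.1 action goto chain 100
# One filter per NodePort in the shared chain, independent of how many IP addresses the interface has.
tc filter add dev eth0 parent ffff:0 prio 104 chain 100 protocol ipv4 flower ip_proto tcp dst_port 57001 \
action mirred egress redirect dev antrea-gw0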
Response Traffic
For response NodePort traffic, it should be redirected to the interface that its request traffic came from. The request traffic can come from different interfaces. Here the hierarchical filter design can work on interface antrea-gw0. An init filter matching the network protocol (IPv4/IPv6) and source IP address will be created for every interface that has an available NodePort IP address:
tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
protocol ipv4 flower \
src_ip 192.168.2.1 \
action goto chain 259
tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
protocol ipv4 flower \
src_ip 172.16.2.1 \
action goto chain 259
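To inspect the filters and chains installed on a device, standard tc show commands can be used, for example:
tc filter show dev eth0 ingress
tc filter show dev antrea-gw0 ingress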
When a NodePort Service is created, a filter will be created and attached to chain 259. The following command will be executed:
tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
chain 259 handle 0x6dea9 protocol ip flower \
ip_proto tcp src_port 57001 \
action mirred egress redirect dev eth0
What the filter created by the above command does:
NodePort Traffic From Localhost
The key idea is:
Request Traffic
For NodePort traffic from localhost, filters will be created and attached to lo's egress. The following endpoints (not Pod Endpoints) should be available.
For every endpoint, a filter will be created. These filters will be created by the following commands:
// For endpoint 127.0.0.1:57001
tc filter add dev lo parent a:0 prio 4 \
protocol ipv4 chain 0 handle 0x1396dea9 flower \
ip_proto tcp dst_ip 127.0.0.1 dst_port 57001 \
action skbmod set smac 12:34:56:78:9a:bc pipe \
action mirred egress redirect dev antrea-gw0
// For endpoint 192.168.2.1:57001
tc filter add dev lo parent a:0 prio 4 \
protocol ipv4 chain 0 handle 0x1396dea9 flower \
ip_proto tcp dst_ip 192.168.2.1 dst_port 57001 \
action skbmod set smac 12:34:56:78:9a:bc pipe \
action mirred egress redirect dev antrea-gw0
// For endpoint 172.16.2.1:57001
tc filter add dev lo parent a:0 prio 4 \
protocol ipv4 chain 0 handle 0x1396dea9 flower \
ip_proto tcp dst_ip 172.16.2.1 dst_port 57001 \
action skbmod set smac 12:34:56:78:9a:bc pipe \
action mirred egress redirect dev antrea-gw0
What the filters created by the above commands do:
Response Traffic
For response NodePort traffic, it should be redirected to the interface that its request traffic came from. The request traffic can come from different interfaces. Here the hierarchical filter design can work on interface antrea-gw0. The following endpoints' response traffic should be redirected to lo.
For every endpoint, a filter will be created. These filters will be created by the following commands:
// For endpoint 127.0.0.1:57001's response traffic, the destination IP is not needed.
tc filter add dev antrea-gw0 parent ffff:0 prio 4 \
protocol ipv4 flower \
src_ip 127.0.0.1 \
action goto chain 257
// For endpoint 192.168.2.1:57001's response traffic, both the source and destination IPs are needed.
tc filter add dev antrea-gw0 parent ffff:0 prio 4 \
protocol ipv4 flower \
src_ip 192.168.2.1 dst_ip 192.168.2.1 action goto chain 257
// For endpoint 172.16.2.1:57001's response traffic, both the source and destination IPs are needed.
tc filter add dev antrea-gw0 parent ffff:0 prio 4 \
protocol ipv4 flower \
src_ip 172.16.2.1 dst_ip 172.16.2.1 \
action goto chain 257
When a NodePort Service is created, a filter will be created and attached to chain 257. The following command will be executed:
tc filter add dev antrea-gw0 parent ffff:0 prio 104 \
chain 257 handle 0x6dea9 protocol ip flower \
ip_proto tcp src_port 57001 \
action skbmod set smac 00:00:00:00:00:00 set dmac 00:00:00:00:00:00 pipe \
action mirred egress redirect dev lo
What the filter created by the above command does:
OVS Pipeline
serviceSNATTable
This is a new table (table 29), placed before table 30 in the pipeline. One variant of the flows:
cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=127.0.0.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 127.0.0.1:57001.
cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=192.168.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 192.168.2.1:57001.
cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=172.16.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 172.16.2.1:57001.
cookie=0x1000000000000, table=29, priority=0 actions=resubmit(,30) // Default flow
Another variant, which distinguishes NodePort traffic accessed from localhost (priority 210) from traffic accessed from remote (priority 200):
cookie=0x1040000000000, table=29, priority=210,tcp,reg0=0x1/0xffff,nw_src=127.0.0.1,nw_dst=127.0.0.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1)) // For endpoint 127.0.0.1:57001 accessed from localhost.
cookie=0x1040000000000, table=29, priority=210,tcp,reg0=0x1/0xffff,nw_src=192.168.2.1,nw_dst=192.168.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1)) // For endpoint 192.168.2.1:57001 accessed from localhost.
cookie=0x1040000000000, table=29, priority=210,tcp,reg0=0x1/0xffff,nw_src=172.16.2.1,nw_dst=172.16.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1)) // For endpoint 172.16.2.1:57001 accessed from localhost.
cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=192.168.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 192.168.2.1:57001 accessed from remote.
cookie=0x1040000000000, table=29, priority=200,tcp,reg0=0x1/0xffff,nw_dst=172.16.2.1,tp_dst=57001 actions=ct(commit,table=30,zone=65521,nat(src=10.10.0.1),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[0..47])) // For endpoint 172.16.2.1:57001 accessed from remote.
cookie=0x1000000000000, table=29, priority=0 actions=resubmit(,30) // Default flow
serviceLBTable
For every NodePort IP address, a flow will be created and appended to table 41:
cookie=0x1040000000000, table=41, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=127.0.0.1,tp_dst=57001 actions=load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[19],group:3
cookie=0x1040000000000, table=41, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=192.168.2.1,tp_dst=57001 actions=load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[19],group:3
cookie=0x1040000000000, table=41, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=172.16.2.1,tp_dst=57001 actions=load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[19],group:3
conntrackCommitTable
Add two flows to match Service traffic from the gateway:
cookie=0x1000000000000, table=105, priority=210,ct_state=+trk,ct_mark=0x21,ip,reg0=0x1/0xffff actions=resubmit(,106)
cookie=0x1000000000000, table=105, priority=210,ct_state=+trk,ct_mark=0x21,ipv6,reg0=0x1/0xffff actions=resubmit(,106)
If the flows above are not added to table 105, the traffic would instead be matched by the existing flow below and committed with CT_MARK 0x20:
cookie=0x1000000000000, table=105, priority=200,ct_state=+new+trk,ip,reg0=0x1/0xffff actions=ct(commit,table=106,zone=65520,exec(load:0x20->NXM_NX_CT_MARK[]))
serviceConntrackTable
Do SNAT/unSNAT for Service traffic. Other traffic will also be matched, as this is a separate ct zone.
cookie=0x1000000000000, table=106, priority=200,ip actions=ct(table=107,zone=65521,nat)
cookie=0x1000000000000, table=106, priority=200,ipv6 actions=ct(table=107,zone=65511,nat)
cookie=0x1000000000000, table=106, priority=0 actions=resubmit(,107)
serviceDstMacRewriteTable
For NodePort traffic from remote, the destination MAC address must be rewritten. Note that these flows have no effect on NodePort traffic from localhost, as that traffic's destination MAC address will be rewritten by Linux TC.
cookie=0x1040000000000, table=107, priority=200,tcp,nw_src=127.0.0.1,tp_src=57001 actions=move:NXM_NX_CT_LABEL[0..47]->NXM_OF_ETH_DST[],resubmit(,108)
cookie=0x1040000000000, table=107, priority=200,tcp,nw_src=172.16.2.1,tp_src=57001 actions=move:NXM_NX_CT_LABEL[0..47]->NXM_OF_ETH_DST[],resubmit(,108)
cookie=0x1040000000000, table=107, priority=200,tcp,nw_src=192.168.2.1,tp_src=57001 actions=move:NXM_NX_CT_LABEL[0..47]->NXM_OF_ETH_DST[],resubmit(,108)
Appendix
TC handle ID generation (NodePort Traffic From Remote / Request Traffic)
The handle ID is:
When destination TCP port is 57001 and IP address index is 1, all possible handle IDs are:
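One layout consistent with the example handle IDs 0x6dea9 and 0x106dea9 above (an assumption, not stated in the source) is (ip_address_index << 24) | (ip_proto << 16) | dst_port. A quick shell check:
# TCP is IP protocol 6; destination port 57001 is 0xdea9.
printf '0x%x\n' $(( (0 << 24) | (6 << 16) | 57001 ))   # 0x6dea9 (IP address index 0)
printf '0x%x\n' $(( (1 << 24) | (6 << 16) | 57001 ))   # 0x106dea9 (IP address index 1)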
Hello @jianjuns @tnqn @antoninbas @wenyingd @lzhecheng, this is the design of NodePort with Linux TC. Please give your suggestions. Thanks.
Hello @jianjuns @tnqn, PR #2239 has implemented 11 cases of NodePort as shown in the following table. Do you think all NodePort cases are necessary?
@hongliangl: what does Endpoint Network = Service CIDR mean? Do you mean Service access through the ClusterIP, not the NodePort (but I thought you were asking about NodePort)?
Sorry, I misunderstood something. It should be
So that means the Service Endpoints are Pods, while "Host" means the Endpoints are Node IPs?
Yes,
@hongliangl: two questions.
Hi @jianjuns,
On general interfaces (like eth0), TC rules match NodePort destination IPs (the case for externalTrafficPolicy = Local preserves request source IPs). On the Antrea gateway (antrea-gw0), TC rules match NodePort source IPs (destination IPs can be non-fixed or fixed).
KubeProxy can match any IPs if option
The example rule you gave is like:
So it matches the source IP to be the Node IP. But in my understanding, if externalTrafficPolicy = Local, the source IP of the response packets should not be the Node IP, but the server Pod IP?
Does your PR already do that?
I don't think the source IP can be the Service Pod IP for response NodePort traffic. If it were, the source/destination IP addresses of the request and response traffic would be asymmetric.
Yes.
Could you just explain the end-to-end forwarding path of an "externalTrafficPolicy = Local" request and its response? Like when it is SNAT'd/DNAT'd and to what IP? I feel the original iptables design descriptions by Weiqiang are much easier to follow. I do not need the diagrams, but at least describe the end-to-end path in text.
An example for externalTrafficPolicy = Local. Assume that the NodePort IP is 192.168.77.100 and the port is 30001. The Endpoint is 10.10.0.4:80. Request traffic:
Response traffic:
Thanks! It is much clearer to me now! So for this rule:
If NodePortAddresses is not specified, what will it look like? Can we no longer match src_ip to be a Node IP?
If NodePortAddresses is not specified, then, for example, for a K8s Node with interfaces eth0 (IPs: 192.168.1.1, 192.168.2.1; ifindex: 2) and eth1 (IP: 172.16.1.1; ifindex: 3), the TC rules below will be created on the Antrea gateway's ingress qdisc:
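Following the pattern of the earlier antrea-gw0 filters, a sketch of what such rules could look like (using the ifindex as the chain number is an assumption for illustration only):
# eth0 (ifindex 2): both of its IP addresses jump to the same chain.
tc filter add dev antrea-gw0 parent ffff:0 prio 104 protocol ipv4 flower src_ip 192.168.1.1 action goto chain 2
tc filter add dev antrea-gw0 parent ffff:0 prio 104 protocol ipv4 flower src_ip 192.168.2.1 action goto chain 2
# eth1 (ifindex 3).
tc filter add dev antrea-gw0 parent ffff:0 prio 104 protocol ipv4 flower src_ip 172.16.1.1 action goto chain 3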
@hongliangl: a question on your case 4. I find it hard to say it is not a valid case. Do you think we are able to support it with SNAT? And I am trying to understand the challenge without SNAT. I guess the problem is to redirect return traffic to OVS. But with TC flower, could we use conntrack to mark the connection after the request is forwarded out of the OVS bridge (through gw0)?
@hongliangl, should this issue be closed? Maybe @jianjuns's question has already been solved?
This issue is stale because it has been open for 90 days with no activity. Remove the stale label or comment, or this will be closed in 90 days.
Describe what you are trying to solve
This draft is only for Linux Nodes; we still need to design the solution for Windows Nodes.
We have already implemented ClusterIP Service support in Antrea, but Kube-Proxy is still needed to support NodePort Services. Since Kube-Proxy cannot run only for NodePort Services, its ClusterIP Service calculations waste a lot of CPU cycles and memory. Once we implement NodePort Service support in Antrea Proxy and remove Kube-Proxy from the cluster, that overhead is eliminated. Furthermore, the traffic for watching Service resources should also decrease, and the pressure on the APIServer should be lower.
Describe the solution you have in mind
For both ClusterIP and NodePort Services, traffic accessing them should always be DNATed to a Pod Endpoint. Thus, we can reuse the ClusterIP Endpoint selection flows in OVS. To achieve this, traffic going to the host must be redirected to OVS correctly.
Describe how your solution impacts user flows
Once we implement this feature, we can in theory remove Kube-Proxy deployments, although we first need to consider how to start Antrea without Kube-Proxy.
Describe the main design/architecture of your solution
From our prior experiments, the performance of IPTables degrades significantly if there are too many rules, so we should keep the number of IPTables rules as small as possible. By using IPSet, we can use a constant number of IPTables rules to match the traffic that we need to redirect; the matching complexity will be O(1) since we can use a set with a hash type. For each valid NodePort Service, there should be several entries in the IPSet, according to the Node addresses.
No matter whether traffic comes from a remote host or the current host, once its destination matches an entry in the IPSet we need to forward it to OVS. By DNATing it to the link-local address 169.254.169.110, we make the packets forwardable to OVS. To make the forwarding actually happen, we need an IP route rule. Traffic may be sent from 127.0.0.1, and then we need to masquerade it to ensure the destination knows where to reply. In the POSTROUTING chain of the nat table, we masquerade packets if they are sent from 127.0.0.1 and going to antrea-gw0.
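A minimal sketch of the mechanism described above, with a hypothetical set name ANTREA-NODEPORT (the actual set, chain, and rule layout in Antrea may differ):
# One hash ipset keyed on (destination IP, protocol:port); one entry per Node IP and NodePort.
ipset create ANTREA-NODEPORT hash:ip,port
ipset add ANTREA-NODEPORT 192.168.77.100,tcp:30001
# A constant number of IPTables rules: DNAT matching traffic to the link-local virtual IP.
iptables -t nat -A PREROUTING -m set --match-set ANTREA-NODEPORT dst,dst -j DNAT --to-destination 169.254.169.110
iptables -t nat -A OUTPUT -m set --match-set ANTREA-NODEPORT dst,dst -j DNAT --to-destination 169.254.169.110
# Route the virtual IP to the Antrea gateway so the DNATed packets enter OVS.
ip route add 169.254.169.110 dev antrea-gw0
# Masquerade traffic sent from 127.0.0.1 to antrea-gw0 so the Endpoint knows where to reply.
# (Forwarding packets with a 127.0.0.1 source also requires net.ipv4.conf.antrea-gw0.route_localnet=1.)
iptables -t nat -A POSTROUTING -s 127.0.0.1 -o antrea-gw0 -j MASQUERADE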
There are two options for a NodePort Service: Cluster (default) and Local. Cluster obscures the client source IP and may cause a second hop to another Node, but should have good overall load-spreading. Local preserves the client source IP and avoids a second hop for LoadBalancer and NodePort type Services, but risks potentially imbalanced traffic spreading. This approach preserves the original source IP address. If there are no local Endpoints, packets sent to the Node are dropped, so you can rely on the correct source IP in any packet processing rules you might apply to a packet that makes it through to the Endpoint.
The IPTables implementation according to externalTrafficPolicy would look like the following.
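A rough sketch of how the two policies could differ, using hypothetical set names and a kube-proxy-style mark-then-masquerade pattern (not the actual Antrea rules):
# Separate sets for Services with externalTrafficPolicy=Cluster and =Local.
ipset create ANTREA-NODEPORT-CLUSTER hash:ip,port
ipset create ANTREA-NODEPORT-LOCAL hash:ip,port
# Cluster-policy traffic is marked so it can be SNATed after routing; Local-policy traffic is not.
iptables -t nat -A PREROUTING -m set --match-set ANTREA-NODEPORT-CLUSTER dst,dst -j MARK --set-xmark 0x4000/0x4000
# Both policies are DNATed to the virtual IP so that OVS performs the Endpoint selection.
iptables -t nat -A PREROUTING -m set --match-set ANTREA-NODEPORT-CLUSTER dst,dst -j DNAT --to-destination 169.254.169.110
iptables -t nat -A PREROUTING -m set --match-set ANTREA-NODEPORT-LOCAL dst,dst -j DNAT --to-destination 169.254.169.110
# Only marked (Cluster-policy) traffic is masqueraded; Local-policy traffic keeps the client source IP.
iptables -t nat -A POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE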
There are two typical traffic paths for NodePort Services.
For the Local policy, we only need to care about the first case. For the Cluster policy, we care about both cases.
For the Local policy, the detailed traffic path of our implementation should look like this:
For the two-hop cases, the traffic path looks like the following.
Based on the cases above, we need the following flows:
The flow that makes NodePort packets coming from the gateway go back to the gateway with ServiceCTMark
In the current OVS pipeline, packets from a Pod to external addresses are tracked with CT_MARK 0x20. Since we do DNAT for Endpoint selection with CT_MARK 0x21, the second packet of a connection from external will not be correctly tracked. Thus, we need the following flow to handle this issue.
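Flows of this shape appear in the conntrackCommitTable section of the TC design comment above; they are quoted here for reference:
cookie=0x1000000000000, table=105, priority=210,ct_state=+trk,ct_mark=0x21,ip,reg0=0x1/0xffff actions=resubmit(,106)
cookie=0x1000000000000, table=105, priority=210,ct_state=+trk,ct_mark=0x21,ipv6,reg0=0x1/0xffff actions=resubmit(,106)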
Virtual IP ARP responder
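A sketch of a typical OVS ARP responder flow for the virtual IP (the table number and MAC address are placeholders, not taken from this design):
table=20, priority=200,arp,arp_tpa=169.254.169.110,arp_op=1 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],set_field:aa:bb:cc:dd:ee:ff->eth_src,load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],set_field:aa:bb:cc:dd:ee:ff->arp_sha,move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],set_field:169.254.169.110->arp_spa,IN_PORT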
Alternative solutions that you considered
For the host traffic forwarding part, we could use alternatives like eBPF or IPVS. But for now, I cannot see any significant disadvantage to using IPTables.
Test plan
We can verify this feature by using e2e and conformance tests.
Additional context
Since we use IPSET to match NodePort Services, the time complexity of packet matching should be O(1). As the time complexity of OVS flow matching is also O(1), the performance should be decent. Moreover, since the number of IPTables rules will drop significantly once we remove Kube-Proxy, the connection set-up delay should decrease too. According to this analysis, we can expect the implementation to improve or at least keep the performance compared to the current implementation.
As we can see, traffic from a Pod to a NodePort Service will go through a complex path. But as NodePort Services are designed for out-of-cluster access, Pod-to-NodePort should not be a common or best-practice use case. To keep the implementation clear and efficient for the real use cases, IMO, this implementation is reasonable.