Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NodePort, LoadBalancer and ClusterIP from k8s Node support for AntreaProxy on Linux (implemented by Linux TC) #2239

Closed
wants to merge 4 commits into from

Conversation

hongliangl
Copy link
Contributor

@hongliangl hongliangl commented Jun 4, 2021

This PR implements:

  • The connection request of NodePort whose client is from remote or
    localhost.
  • The connection request of LoadBalancer whose client is from remote
    or localhost.
  • The connection request of ClusterIP whose client is from localhost.

For NodePort support, on each interface whose IP addresses can be
NodePort IP addresses, Linux TC is used to redirect the request packets
to Antrea gateway. For response packets, on interface Antrea gateway,
Linux TC is used to redirect the packets back to the interface where
the requests packets are from.

For LoadBalancer support, when client is from remote hosts, on default
route output interface, Linux TC is used to redirect the request
packets to Antrea gateway. For response packets, on interface Antrea
gateway,Linux TC is used to redirect the packets back to the default
route output interface. When client is from localhost, the request
packets are routed to Antrea gateway.

For ClusterIP support, the request packets are routed to Antrea gateway.

To support the Service traffic of above cases, there are main changes of
OVS pipeline.

  • Add table serviceConntrackCommitTable 106 to perform SNAT for
    Service traffic.
  • Modify table hairpinSNATTable ID from 106 to 108.
  • Modify table serviceHairpinTable ID from 29 to 23.
  • Add table serviceConntrackTable 24 to transform SNATed connnections.
  • Add table serviceClassifierTable 35 to classify the Service traffic.
  • Add table serviceDstMacRewriteTable 75 to rewrite destination MAC
    address of response Service packets.

The below cases are supported on:

  • IPv4, encap
  • IPv6, encap
  • IPv4, noEncap
No. Service Client ExternalTrafficPolicy Endpoint Network SNAT AntreaProxy Support KubeProxy Support
1 NodePort Remote Cluster Pod CIDR Yes ✔️ ✔️
2 NodePort Remote Local Pod CIDR No ✔️ ✔️
3 NodePort Remote Cluster Host Yes ✔️ ✔️
4 NodePort Remote Local Host No ✔️ ✔️
5 NodePort Localhost Cluster Pod CIDR Yes ✔️ ✔️
6 NodePort Localhost Local Pod CIDR Yes ✔️ ✔️
7 NodePort Localhost Cluster Host Yes ✔️ ✔️
8 NodePort Localhost Local Host Yes ✔️ ✔️
9 NodePort Localhost(127.0.0.1,::1) Cluster Pod CIDR Yes ✔️ ✔️
10 NodePort Localhost(127.0.0.1,::1) Local Pod CIDR Yes ✔️ ✔️
11 NodePort Localhost(127.0.0.1,::1) Cluster Host Yes ✔️ ✔️
12 NodePort Localhost(127.0.0.1,::1) Local Host Yes ✔️ ✔️
13 NodePort Pod (local node) Cluster Pod CIDR Yes ✔️ ✔️
14 NodePort Pod (local node) Local Pod CIDR No ✔️ ✔️
15 NodePort Pod (local node) Cluster Host Yes ✔️ ✔️
16 NodePort Pod (local node) Local Host No ✔️ ✔️
17 NodePort Pod (remote node) Cluster Pod CIDR Yes ✔️ ✔️
18 NodePort Pod (remote node) Local Pod CIDR No ✔️ ✔️
19 NodePort Pod (remote node) Cluster Host Yes ✔️ ✔️
20 NodePort Pod (remote node) Local Host No ✔️ ✔️
21 LoadBalancer Remote Cluster Pod CIDR Yes ✔️ ✔️
22 LoadBalancer Remote Local Pod CIDR No ✔️ ✔️
23 LoadBalancer Remote Cluster Host Yes ✔️ ✔️
24 LoadBalancer Remote Local Host No ✔️ ✔️
25 LoadBalancer Localhost Cluster Pod CIDR Yes ✔️ ✔️
26 LoadBalancer Localhost Local Pod CIDR Yes ✔️ ✔️
27 LoadBalancer Localhost Cluster Host Yes ✔️ ✔️
28 LoadBalancer Localhost Local Host Yes ✔️ ✔️
29 LoadBalancer Pod (local node) Cluster Pod CIDR Yes ✔️ ✔️
30 LoadBalancer Pod (local node) Local Pod CIDR No ✔️ ✔️
31 LoadBalancer Pod (local node) Cluster Host Yes ✔️ ✔️
32 LoadBalancer Pod (local node) Local Host No ✔️ ✔️
33 LoadBalancer Pod (remote node) Cluster Pod CIDR Yes ✔️ ✔️
34 LoadBalancer Pod (remote node) Local Pod CIDR No ✔️ ✔️
35 LoadBalancer Pod (remote node) Cluster Host Yes ✔️ ✔️
36 LoadBalancer Pod (remote node) Local Host No ✔️ ✔️
37 ClusterIP Localhost \ Pod CIDR No ✔️ ✔️
38 ClusterIP Localhost(hairpin) \ Host Yes ✔️ ✔️
39 ClusterIP Pod \ Pod CIDR No ✔️ ✔️
40 ClusterIP Pod(hairpin) \ Pod CIDR Yes ✔️ ✔️
41 ClusterIP Pod \ Host No ✔️ ✔️

Signed-off-by: Hongliang Liu [email protected]

@codecov-commenter
Copy link

codecov-commenter commented Jun 4, 2021

Codecov Report

Merging #2239 (267b6fb) into main (95a167a) will decrease coverage by 1.76%.
The diff coverage is 14.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2239      +/-   ##
==========================================
- Coverage   59.67%   57.91%   -1.77%     
==========================================
  Files         281      284       +3     
  Lines       22207    23407    +1200     
==========================================
+ Hits        13253    13555     +302     
- Misses       7546     8398     +852     
- Partials     1408     1454      +46     
Flag Coverage Δ
kind-e2e-tests 45.20% <12.57%> (-1.47%) ⬇️
unit-tests 41.21% <5.83%> (-1.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/agent.go 50.53% <ø> (+0.31%) ⬆️
pkg/agent/config/node_config.go 100.00% <ø> (ø)
pkg/agent/util/net.go 19.80% <0.00%> (-20.20%) ⬇️
pkg/ovs/openflow/ofctrl_action.go 40.61% <0.00%> (-9.81%) ⬇️
pkg/agent/util/tc/tc_linux.go 2.99% <2.99%> (ø)
pkg/agent/route/route_linux.go 26.32% <3.12%> (-18.10%) ⬇️
pkg/agent/proxy/types/conjcounter.go 6.45% <6.45%> (ø)
pkg/agent/util/tc/types/handlegenerator.go 17.24% <17.24%> (ø)
pkg/agent/openflow/client.go 55.77% <24.39%> (-2.82%) ⬇️
pkg/agent/openflow/pipeline.go 63.38% <26.22%> (-8.30%) ⬇️
... and 52 more

Copy link
Contributor

@jianjuns jianjuns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an updated design doc to learn the new design with TC?

build/yamls/antrea-aks.yml Outdated Show resolved Hide resolved
pkg/agent/agent.go Outdated Show resolved Hide resolved
build/yamls/antrea-aks.yml Outdated Show resolved Hide resolved
pkg/agent/proxy/proxier.go Outdated Show resolved Hide resolved
pkg/agent/proxy/proxier.go Outdated Show resolved Hide resolved
pkg/agent/openflow/client.go Outdated Show resolved Hide resolved
pkg/agent/proxy/proxier_windows.go Outdated Show resolved Hide resolved
pkg/agent/route/route_linux.go Outdated Show resolved Hide resolved
pkg/agent/route/route_linux.go Outdated Show resolved Hide resolved
pkg/agent/route/route_linux.go Outdated Show resolved Hide resolved
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still reviewing..

build/yamls/base/conf/antrea-agent.conf Outdated Show resolved Hide resolved
cmd/antrea-agent/agent.go Outdated Show resolved Hide resolved
cmd/antrea-agent/options.go Outdated Show resolved Hide resolved
cmd/antrea-agent/options.go Outdated Show resolved Hide resolved
cmd/antrea-agent/options.go Outdated Show resolved Hide resolved
pkg/agent/openflow/client.go Outdated Show resolved Hide resolved
pkg/agent/openflow/client.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
@hongliangl hongliangl requested a review from jianjuns June 30, 2021 02:56
@hongliangl hongliangl force-pushed the feature-antrea-proxy branch 2 times, most recently from 0300af5 to 9894344 Compare June 30, 2021 03:36
@hongliangl hongliangl changed the title NodePort support for Antrea Proxy on Linux NodePort and ClusterIP from node support for Antrea Proxy on Linux Jul 1, 2021
@hongliangl hongliangl force-pushed the feature-antrea-proxy branch 2 times, most recently from 06cedcc to 3e46e1e Compare July 1, 2021 04:33
@hongliangl hongliangl changed the title NodePort and ClusterIP from node support for Antrea Proxy on Linux NodePort, LoadBalancer and ClusterIP from k8s Node support for AntreaProxy on Linux Jul 22, 2021
@hongliangl hongliangl force-pushed the feature-antrea-proxy branch 2 times, most recently from 24b55d6 to ef7afc9 Compare July 23, 2021 02:41
build/yamls/base/conf/antrea-agent.conf Outdated Show resolved Hide resolved
build/yamls/base/conf/antrea-agent.conf Outdated Show resolved Hide resolved
@@ -10,6 +10,10 @@ featureGates:
# this flag will not take effect.
# EndpointSlice: false

# Enable full Service support in AntreaProxy in antrea-agent. All type of Services can be
# accessed from outside the cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about: All types of Service traffic will be handled by AntreaProxy?

Maybe list what more are supported compared to disabled.

build/yamls/base/conf/antrea-agent.conf Outdated Show resolved Hide resolved
build/yamls/base/conf/antrea-agent.conf Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
"antrea.io/antrea/pkg/agent/util"
)

func getAvailableNodePortIPs(nodePortIPsFromConfig []string, gateway string) (map[int][]net.IP, map[int][]net.IP, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move it under pkg/agent/util ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. @antoninbas suggested the func could be here, so I moved it here.

pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
Done(),
)
} else {
// If externalTrafficPolicy of Service is Local, clients from localhost requires SNAT.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to confirm if this switch is hit only in NodePort Service, and the source and destination are the same Node address? If yes, please add comments, otherwise svcIP as source might cause misunderstanding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for remind, I almost forgot this.

pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
DeleteLearned().
MatchNetworkSrcAsDst(protocol).
SetLearnedSrcMACAsDstMAC().
LoadReg(int(PortCacheReg), config.HostGatewayOFPort, ofPortRegRange).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the 3 flows in this function has the same learn action, is it possible to use a generic function for the learn action. Then it should be easier to change if needed in future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

pkg/agent/proxy/proxier.go Outdated Show resolved Hide resolved
if p.proxyFullEnabled {
// Install ClusterIP route on k8s node so that ClusterIP can be accessed on k8s node. When thWhen the
// Service is removed, the route does not need to be removed, because the route target IP block is calculated
// every time a new ClusterIP is created. When there are enough ClusterIPs, the Service CIDR can be finally
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean Antrea will re-calculate the route target IP block when new ClusterIP added? How shall I understand "when there are enough ClusterIPs"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this explanation can be better:

// Install ClusterIP route on Node so that ClusterIP can be accessed on Node. Every time a new ClusterIP
// is created, the routing target IP block will be recalculated for expansion to be able to route the new
// created ClusterIP. Deleting a ClusterIP will not shrink the target routing IP block. The Service CIDR
// can be finally calculated after creating enough ClusterIPs.

if err := p.ofClient.InstallServiceFlows(groupID, svcIP, svcPort, protocol, affinityTimeout); err != nil {
return err
// Note that, NodePort Service is not supported on Windows currently.
func (p *proxier) installLoadBalancerService(groupID binding.GroupIDType, loadBalancerIPStrings []string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the implementations for LoadBalancerService are the same on Linux and Windows, maybe move the function to proxier.go?

Copy link
Contributor Author

@hongliangl hongliangl Aug 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok, Windows has one more function, Linux will just hit a ni function.

pkg/agent/route/route_linux.go Outdated Show resolved Hide resolved
@@ -169,3 +169,102 @@ func GetIPWithFamily(ips []net.IP, addrFamily uint8) (net.IP, error) {
return nil, errors.New("no IP found with IPv4 AddressFamily")
}
}

// GetAllNodeIPs gets all Node IP addresses (not including IPv6 link local address).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean IPv4 link local address is included?

Copy link
Contributor Author

@hongliangl hongliangl Aug 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IPv6 link local is special, its prefix is fe80::/64. It can be only routed. I think it should be as NodePort IP. For IPv4 link local, you mean 169.254.x.x?

@hongliangl hongliangl force-pushed the feature-antrea-proxy branch 2 times, most recently from 915260a to 68016f0 Compare August 2, 2021 06:27
@hongliangl
Copy link
Contributor Author

/test-all

@@ -536,7 +549,16 @@ func (c *client) InstallEndpointFlows(protocol binding.Protocol, endpoints []pro
portVal := portToUint16(endpointPort)
cacheKey := generateEndpointFlowCacheKey(endpoint.IP(), endpointPort, protocol)
flows = append(flows, c.endpointDNATFlow(endpointIP, portVal, protocol))
if endpoint.GetIsLocal() {

// If Endpoint network is host network, don't add flow to hairpinSNATFlow table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we need to SNAT hairpin traffic for host network pod? Is it becasue there is another SNAT flow that will do this? If so, can we unify the flows?

Copy link
Contributor Author

@hongliangl hongliangl Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flow will never be hit. Another flow is used to do SNAT for hairpin Service traffic from gateway with a different ct_zone. Existing hairpin SNAT for Service from pod is stateless(just modify source IP without ct commit), I think we don't need unify these two flows.

pkg/agent/openflow/client.go Outdated Show resolved Hide resolved
pkg/agent/openflow/client.go Outdated Show resolved Hide resolved
hardTimeout := uint16(0)
// If the SessionAffinity of Service is set, the learned flow should be set with hard timeout. Assumed that the SessionAffinity
// is 300s, if the learned flow is set with idle timeout 60s, the learned flow may be deleted when idle timeout is reached,
// then the response traffic cannot be returned correctly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why this is specific to the case when sessionAffinity is set. Even if it's not set, isn't the problem there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the previous design, I forget to remove this.

pkg/agent/openflow/client.go Outdated Show resolved Hide resolved
// gateway, so set servicePktMark to let the response packet go back to Antrea gateway.
// Note that, the priority of this flow is lower than above two flows. The purpose is to avoid
// the packet of ClusterIP from Antrea gateway hits the flow.
serviceConnectionTrackCommitTable.BuildFlow(priorityNormal).MatchProtocol(proto).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this can work, it makes the implementation a bit complicated with policy routing involved, so I'm thinking whether this is necessary, want to discuss it here:

  1. Even it's always expected to preserve client IP for ClusterIP traffic, we still have to SNAT the traffic when it's hairpin case, as well as kube-proxy.
  2. Is it a real case that a Service is NodePort type and its Endpoints are host network Pod? If a Pod is host network, a Node cannot hold more than one Pod for this Service so the Pods are unlikely managed by a regular controller like Deployment, ReplicationController, the most common case is DaemonSet Pod running some system service like antrea-agent. If the Pods are running in the host network, there is no point to expose them as a NodePort service which just introduces another hop and occupies another port. For example, the Kubernetes service's own Endpoints are the kube-apiserver Pods but the Service's itself is ClusterIP type and typically a loadbalancer will use the Node IP + kube-apiserver port as the backends directly, not via a NodePort.

Given the above reasons, I feel at least for the first version, we could just SNAT the traffic to be simpler. Then multiple cases can be unified as external Endpoints, and we can have single SNAT flow for all of them. Since this is in very early stage, we can still use this policy routing approach when users have real need of preserving client IP for NodePort services with host network Pods as backends. What do you think? @jianjuns @antoninbas @hongliangl @wenyingd

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the solution is to use TC to match the pkt mark and redirect traffic - so TC does not work? I would agree policy routing sounds complicated.

I agree your 2) is a valid point. Are we able to check? If it does complicate the implementation, I would second to start with SNAT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the solution is to use TC to match the pkt mark and redirect traffic - so TC does not work? I would agree policy routing sounds complicated.

The implementation of this case has there steps:
Step 1 Mark the first request packet of the connection with pkt mark within OVS pipeline.
Step 2 Mark the first request packet of the connection with ct mark by matching above pkt mark in iptables mangle table.
Step 3 Mark the response packets of the connection with fw mark by matching above ct mark.
Step 4 Packets with above fw mark will be routed to Antrea gateway.

If using TC,
Step 1 is still step 1.
Step 2 cannot be implemented by TC, I didn't found any doc about how to match a pkt mark and make a ct mark.
Step 3 can be implemented by TC. https://man7.org/linux/man-pages/man8/tc-connmark.8.html
Step 4 can be also implemented by TC. https://man7.org/linux/man-pages/man8/tc-fw.8.html

I don't think the implementation with TC is more simple that that with policy routing. In addition, we may need to add more conditions to match the related packets. This design also adds more overhead. For step 4, TC rule will be installed at default interface's egress(after routing decision), then matched packets will be redirected to Antrea gateway. If using policy routing, before routing decision, the output interface is decided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want not many ways (e.g. TC for request and policy routing for response) to redirect/route the traffic which will be hard to understand/troubleshoot/maintain. I am kindof leaning towards what Quan suggested to do SNAT first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, if we just do SNAT, should we add some comments about the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely we should comment the behavior, and briefly describe why we choose to do SNAT.

// Case 4: LoadBalancer, client from remote, Endpoint is on host network, perform SNAT with IP pf serviceGWHairpinIPv4 / serviceGWHairpinIPv6.
// Case 5: LoadBalancer, client from localhost, Endpoint is on pod CIDR. As it is routed via Antrea gateway, its source IP is
// Antrea gateway's IP, SNAT is not needed.
// Case 6: LoadBalancer, client from localhost, Endpoint is on host network, perform SNAT with IP pf serviceGWHairpinIPv4 / serviceGWHairpinIPv6.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think service only determines whether SNAT is required and which IP to SNAT is related to endpoint only, so the comment should be placed to endpoint flow.
Here it could just say NodePort/LoadBalancer traffic should be SNATed.

if svcInfo.NodeLocalExternal() {
groupIDLocal, _ := p.groupCounter.Get(svcPortName, false)
if err := p.ofClient.UninstallServiceGroup(groupIDLocal); err != nil {
klog.Errorf("Failed to remove flows of Service %v: %v", svcPortName, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

structured logging

func (p *proxier) installNodePortService(groupID binding.GroupIDType, svcPort uint16, protocol binding.Protocol, affinityTimeout uint16, nodeLocalExternal bool) error {
for _, nodeIPs := range p.nodePortIPMap {
for _, nodeIP := range nodeIPs {
if err := p.ofClient.InstallServiceFlows(groupID, nodeIP, svcPort, protocol, affinityTimeout); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even today, installing all service/endpoints flows takes long time as typically a single flow takes 2ms. I'm a little concerned that the number of the IPs would impact the time and memory. Basically if it originally takes 5s to install all service flows, 10 local IPs will make it 50s, and the memory usage in agent and ovs-vswitchd for service flow will have same impact. Some unusual cases will make this even worse, if we have some external IPs assigned to the Node, or kube-proxy ipvs is enabled at the same time.
Do we have a way to reduce the flows' redundancy?

Copy link
Contributor Author

@hongliangl hongliangl Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about using a conjunctive flow to reduce the redundancy.

  1. Every NodePort IP is clause 1, conj_id is N,
  2. NodePort Port is clause 2, conj_id is N, load another register X, the register stores group number.
  3. A flow matching conj_id N, and a action of the flow is group: NXM_NX_REGX[].

Unfortunately, above design cannot work. Below explanation is from https://man7.org/linux/man-pages/man7/ovs-actions.7.html

OTHER ACTIONS         top
   The conjunction action
       Syntax:
              conjunction(id, k/n)

       This action allows for sophisticated ``conjunctive match’’ flows.
       Refer to ``Conjunctive Match Fields’’ in ovs-fields(7) for
       details.

       A flow that has one or more conjunction actions may not have any
       other actions except for note actions.

       Conformance:

       Open vSwitch 2.4 introduced the conjunction action and conj_id
       field. They are Open vSwitch extensions to OpenFlow.

@wenyingd , do you have any good ideas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tnqn I have tried to use conjunction to reduce flow entries, but I failed. The design I tried:

  • We need three conjun_id to distinguish NodePort traffic
    • externalTrafficPolicy is Cluster
      • condition 1: nw_dst (all NodePort IPs)
      • condition 2: nodePort port
      • actions: set SNAT reg, generate learn flow.
    • externalTrafficPolicy is Local, client is from remote
      • condition 1: nw_dst (all NodePort IPs)
      • condition 2: nodePort port
      • actions: generate learn flow.
    • externalTrafficPolicy is Local, client is from localhost
      • condition 1: nw_dst, nw_src (all NodePort IPs)
      • condition 2: nodePort port
      • actions: set SNAT reg, generate learn flow.

As we can see, the first two cases have the same condition 1, I have no idea how to do. The flow can be added with different priority, but the lower priority one will never be hit.

For Table 41, I have tried to unify the group action, but it seems that action of group cannot be set by a register.

pkg/agent/proxy/types/groupcounter.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
@@ -336,6 +352,9 @@ var (
// Endpoint, still needs to select an Endpoint, or if an Endpoint has already
// been selected and the selection decision needs to be learned.
serviceLearnRegRange = binding.Range{16, 18}
// serviceSNATMarkRange takes a 1-bit range of register serviceSnatReg to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add these comments here or maybe in tunnelClassifierFlow() if that is the place all SNAT flows are added?

// gateway, so set servicePktMark to let the response packet go back to Antrea gateway.
// Note that, the priority of this flow is lower than above two flows. The purpose is to avoid
// the packet of ClusterIP from Antrea gateway hits the flow.
serviceConnectionTrackCommitTable.BuildFlow(priorityNormal).MatchProtocol(proto).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the solution is to use TC to match the pkt mark and redirect traffic - so TC does not work? I would agree policy routing sounds complicated.

I agree your 2) is a valid point. Are we able to check? If it does complicate the implementation, I would second to start with SNAT.

pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
build/yamls/base/conf/antrea-agent.conf Outdated Show resolved Hide resolved
@@ -5,6 +5,10 @@ featureGates:
# Service traffic.
# AntreaProxy: true

# Enable full Service support in AntreaProxy in antrea-agent. All type of Services can be
# accessed from outside the cluster.
# AntreaProxyFull: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not have a good sense a Feature should be always enabled finally. But do you think we can use a Feature Gate to isolate the change, and change it to a config option later?

pkg/agent/openflow/client.go Outdated Show resolved Hide resolved
pkg/agent/openflow/client.go Outdated Show resolved Hide resolved
…Proxy on Linux

This PR implements:
 - The connection request of NodePort whose client is from remote or
localhost.
 - The connection request of LoadBalancer whose client is from remote
or localhost.
 - The connection request of ClusterIP whose client is from localhost.

For NodePort support, on each interface whose IP addresses can be
NodePort IP addresses, Linux TC is used to redirect the request packets
to Antrea gateway. For response packets, on interface Antrea gateway,
Linux TC is used to redirect the packets back to the interface where
the requests packets are from.

For LoadBalancer support, when client is from remote hosts, on default
route output interface, Linux TC is used to redirect the request
packets to Antrea gateway. For response packets, on interface Antrea
gateway,Linux TC is used to redirect the packets back to the default
route output interface. When client is from localhost, the request
packets are routed to Antrea gateway.

For ClusterIP support, the request packets are routed to Antrea gateway.

To support the Service traffic of above cases, there are main changes of
OVS pipeline.
- Add table serviceConntrackCommitTable 106 to perform SNAT for
Service traffic.
- Modify table hairpinSNATTable ID from 106 to 108.
- Modify table serviceHairpinTable ID from 29 to 23.
- Add table serviceConntrackTable 24 to transform SNATed connnections.
- Add table serviceClassifierTable 35 to classify the Service traffic.
- Add table serviceDstMacRewriteTable 75 to rewrite destination MAC
address of response Service packets.

Signed-off-by: Hongliang Liu <[email protected]>
@hongliangl
Copy link
Contributor Author

/test-all

@@ -51,6 +52,24 @@ type Interface interface {
// DeleteSNATRule should delete rule to SNAT outgoing traffic with the mark.
DeleteSNATRule(mark uint32) error

// InitService should add the basic TC configuration on Linux.
InitService(nodePortIPMap map[int][]net.IP, isIPv6 bool) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name InitService is too generic. If it is for NodePort only, call it InitNodePortRoutes; if it also handles other types, call it InitServiceProxyRoutes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This func will do:

  • Remove stale LoadBalancer/ClusterIP route entries
  • Add Linux basic filters for NodePort/LoadBalancer
  • Add route entry for routing packets whose destination IP is serviceGWHairpinIP to Antrea gateway.

Maybe call it InitServiceProxyConfig?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InitServiceProxyConfig sounds good to me.

@@ -51,6 +52,24 @@ type Interface interface {
// DeleteSNATRule should delete rule to SNAT outgoing traffic with the mark.
DeleteSNATRule(mark uint32) error

// InitService should add the basic TC configuration on Linux.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know existing funcs go this way, but I suggest to change "should add" to "adds". The same for other funcs you added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reminding.

pkg/agent/route/interfaces.go Outdated Show resolved Hide resolved
DeleteNodePort(nodePortIPMap map[int][]net.IP, port uint16, protocol binding.Protocol) error

// AddClusterIPRoute should add route on k8s node for Service ClusterIP.
AddClusterIPRoute(svcIP net.IP, isIPv6 bool) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with other funcs, I suggest either add "Routes" to other func names (I know they are TC operations, but that needs not be reflected in the RouteClient interface), or remove "Route" from this func name too.

Probably Route -> Routes, as I assume it might not be a single route entry?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there is not a func to remove the routes for a ClusterIP Service?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with other funcs, I suggest either add "Routes" to other func names (I know they are TC operations, but that needs not be reflected in the RouteClient interface), or remove "Route" from this func name too.
Probably Route -> Routes, as I assume it might not be a single route entry?

I don't know if adding the suffix "Routes" to func name will cause misunderstood.

  • For funcs AddNodePort/DeleteNodePort, no route entry will be added/deleted, only TC rules.
  • For funcs AddLoadBalancer/DeleteLoadBalancer, route entries and TC rules are both added/deleted.
  • For func AddClusterIPRoute, only route entry will be added or updated.

I don't know if "Route" is an appropriate suffix for all the above funcs.

Why there is not a func to remove the routes for a ClusterIP Service?

We don't a func to remove the route for a ClusterIP Service. There is alway one route entry for all ClusterIP Services.

  • If there is no route entry when a ClusterIP is added, create a route entry for it.
  • If current route entry can route the ClusterIP, there is nothing to do.
  • If current route cannot route the ClusterIP, replace the current route with a new route entry , and the new route entry 's destination IP block can route the ClusterIP IP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how if the ClusterIP CIDR is changed and a ClusterIP is no more valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's no matter. What we wanted is to calculate the IP block of ClusterIP finally as possible, and this IP block is the destination of the only one routing entry for ClusterIP. Adding a ClusterIP may help enlarge the destination IP block(ClusterIP block) of routing entry.
Delete a ClusterIP is ok, as we want just only one routing entry which is used to route ClusterIP traffic. If we add a ClusterIP, then we create a routing entry for it, and when it is deleted, we delete the routing entry, then there will be lots of routing entries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Then if we assume CIDR can only be extended we should be ok.

Could you add some comments before the AddClusterRouteIP() implementation in route_linux.go to explain what it does? Actually good to add comments before other RouteClient funcs you added too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll add some comments.

@@ -67,6 +78,9 @@ var (
// IPTablesSyncInterval is exported so that sync interval can be configured for running integration test with
// smaller values. It is meant to be used internally by Run.
IPTablesSyncInterval = 60 * time.Second

serviceGWHairpinIPv4 = net.ParseIP("169.254.169.253")
Copy link
Contributor

@jianjuns jianjuns Aug 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely should not define the constants multiple times. Please define them in a single file, and import from others.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about pkg/agent/config? If in the package, we should add a new file which is only to contain this two variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/agent/config sounds good to me. Maybe adding to node_config.go is fine too, if we want not to add a new file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think node_config.go is okay.

@@ -431,6 +455,24 @@ func (c *Client) restoreIptablesData(podCIDR *net.IPNet, podIPSet string, snatMa
"-j", iptables.MasqueradeTarget, "--random-fully",
}...)

if c.proxyFull {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question - why we do not put this in InitService(), or why not put InitService() code here? I think the former way is better to avoid passing/checking proxyFull in route and other code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we put it to InitService, I think we need to write a lot of duplicated code about iptables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I got the point. Do not have a better idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some comments to explain what rules are restored here when AntreaProxyFull is enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

writeLine(iptablesData, []string{
"-A", antreaPostRoutingChain,
"-m", "comment", "--comment", `"Antrea: masquerade Service host network Endpoint traffic"`,
"-s", serviceGWHairpinIPv4.String(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually feel bad route code needs to understand AntreaProxy logics and things like Service hairpin IP. Ideally we should isolate the logics, to avoid many components need to understand a feature.

Maybe we can say Route Client and OpenflowClient are special and need have feature specific code. But do you have some ideas? Also add @tnqn and @wenyingd to see what they think.

@@ -431,6 +455,24 @@ func (c *Client) restoreIptablesData(podCIDR *net.IPNet, podIPSet string, snatMa
"-j", iptables.MasqueradeTarget, "--random-fully",
}...)

if c.proxyFull {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some comments to explain what rules are restored here when AntreaProxyFull is enabled?

pkg/agent/route/route_linux.go Show resolved Hide resolved
Signed-off-by: Hongliang Liu <[email protected]>
@hongliangl hongliangl changed the title NodePort, LoadBalancer and ClusterIP from k8s Node support for AntreaProxy on Linux NodePort, LoadBalancer and ClusterIP from k8s Node support for AntreaProxy on Linux (implemented by Linux TC) Aug 16, 2021
@hongliangl hongliangl closed this Aug 19, 2021
@hongliangl hongliangl deleted the feature-antrea-proxy branch April 28, 2022 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants