Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NodePort/LoadBalancer issue when proxyAll is enabled #3295

Merged
merged 1 commit into from
Feb 15, 2022

Conversation

hongliangl
Copy link
Contributor

@hongliangl hongliangl commented Feb 9, 2022

Fix #3301

When proxyAll is enabled, create a NodePort/LoadBalancer Service whose
externalTrafficPolicy is Cluster, then only an OVS group with all
Endpoints will be installed. If change externalTrafficPolicy of the
Service from Cluster to Local, an OVS group with only local Endpoints
should be also installed since externalTrafficPolicy is Local, but it
is not. This patch fixes the issue that OVS group with only local
Endpoints is not installed when externalTrafficPolicy of Service is
changed from Cluster to Local.

Signed-off-by: Hongliang Liu [email protected]

@codecov-commenter
Copy link

codecov-commenter commented Feb 9, 2022

Codecov Report

Merging #3295 (c479ecd) into main (70fac83) will decrease coverage by 6.03%.
The diff coverage is 45.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3295      +/-   ##
==========================================
- Coverage   59.08%   53.05%   -6.04%     
==========================================
  Files         331      462     +131     
  Lines       28444    54663   +26219     
==========================================
+ Hits        16806    29000   +12194     
- Misses       9804    23200   +13396     
- Partials     1834     2463     +629     
Flag Coverage Δ
e2e-tests 52.51% <4.16%> (?)
integration-tests 33.98% <ø> (?)
kind-e2e-tests 48.02% <61.11%> (+1.28%) ⬆️
unit-tests 41.82% <55.55%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/proxy/proxier.go 59.92% <45.83%> (-0.58%) ⬇️
pkg/agent/cniserver/pod_configuration_linux.go 26.31% <0.00%> (-40.36%) ⬇️
pkg/controller/ipam/antrea_ipam_controller.go 48.71% <0.00%> (-31.57%) ⬇️
pkg/controller/networkpolicy/endpoint_querier.go 61.46% <0.00%> (-29.97%) ⬇️
pkg/agent/nodeportlocal/npl_agent_init.go 57.14% <0.00%> (-27.48%) ⬇️
pkg/controller/egress/controller.go 61.11% <0.00%> (-27.34%) ⬇️
.../registry/networkpolicy/clustergroupmember/rest.go 64.28% <0.00%> (-23.95%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam.go 55.55% <0.00%> (-23.62%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam_controller.go 56.30% <0.00%> (-23.46%) ⬇️
pkg/agent/util/ethtool/ethtool_linux.go 46.66% <0.00%> (-23.34%) ⬇️
... and 425 more

var localEndpointUpdateList []k8sproxy.Endpoint
// If externalTrafficPolicy of the previous Service is Cluster, a group which only has local Endpoints
// should be installed.
if pSvcInfo != nil && !pSvcInfo.NodeLocalExternal() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be 3 issues:

  1. If a Service's policy is local from the begining, the above code won't create a local group, adding code here will lead to inconsistent behavior.
  2. Code is duplicated with the block under needUpdateEndpoints.
  3. There is no cleanup when policy is changed from local to cluster.

I think we could just set needUpdateEndpoints to true if policy changes, just like when SessionAffinityType changes, then the existing code can take care of installing the local group.
For cleanup, it should uninstall the local group when NodeLocalExternal is false and local group id is found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the idea that just setting needUpdateEndpoints to true if policy changes, and this will be easier to understand. We also need a cleanup when policy is from Local to Cluster.

@tnqn
Copy link
Member

tnqn commented Feb 9, 2022

Please file an issue with steps to reproduce the problem and link the PR to it.

@hongliangl
Copy link
Contributor Author

/test-all-features-conformance
/test-conformance
/test-e2e
/test-flexible-ipam-e2e
/test-ipv6-conformance
/test-ipv6-e2e
/test-ipv6-networkpolicy
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-only-networkpolicy
/test-multicluster-e2e
/test-networkpolicy
/test-windows-conformance
/test-windows-e2e
/test-windows-networkpolicy
/test-windows-proxyall-e2e
/test-integration

@hongliangl hongliangl requested a review from tnqn February 10, 2022 01:07
@hongliangl
Copy link
Contributor Author

Please file an issue with steps to reproduce the problem and link the PR to it.

Added.

@hongliangl
Copy link
Contributor Author

/test-e2e

@hongliangl
Copy link
Contributor Author

/test-networkpolicy

@hongliangl
Copy link
Contributor Author

/test-windows-e2e

@hongliangl
Copy link
Contributor Author

/test-integration

@@ -370,7 +370,7 @@ func (p *proxier) installServices() {
pSvcInfo = installedSvcPort.(*types.ServiceInfo)
needRemoval = serviceIdentityChanged(svcInfo, pSvcInfo) || (svcInfo.SessionAffinityType() != pSvcInfo.SessionAffinityType())
needUpdateService = needRemoval || (svcInfo.StickyMaxAgeSeconds() != pSvcInfo.StickyMaxAgeSeconds())
needUpdateEndpoints = pSvcInfo.SessionAffinityType() != svcInfo.SessionAffinityType()
needUpdateEndpoints = pSvcInfo.SessionAffinityType() != svcInfo.SessionAffinityType() || pSvcInfo.NodeLocalExternal() != svcInfo.NodeLocalInternal()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
needUpdateEndpoints = pSvcInfo.SessionAffinityType() != svcInfo.SessionAffinityType() || pSvcInfo.NodeLocalExternal() != svcInfo.NodeLocalInternal()
needUpdateEndpoints = pSvcInfo.SessionAffinityType() != svcInfo.SessionAffinityType() || pSvcInfo.NodeLocalExternal() != svcInfo.NodeLocalExternal()

Could you add an e2e test to verify it?

@@ -170,15 +170,15 @@ func (p *proxier) removeStaleServices() {
if svcInfo.NodeLocalExternal() {
groupIDLocal, _ := p.groupCounter.Get(svcPortName, true)
if err := p.ofClient.UninstallServiceGroup(groupIDLocal); err != nil {
klog.ErrorS(err, "Failed to remove flows of Service", "Service", svcPortName)
klog.ErrorS(err, "Failed to remove Group for Service with only local Endpoints", "Service", svcPortName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.ErrorS(err, "Failed to remove Group for Service with only local Endpoints", "Service", svcPortName)
klog.ErrorS(err, "Failed to remove Group of local Endpoints for Service", "Service", svcPortName)

continue
}
p.groupCounter.Recycle(svcPortName, true)
}
// Remove Service group which has all Endpoints.
groupID, _ := p.groupCounter.Get(svcPortName, false)
if err := p.ofClient.UninstallServiceGroup(groupID); err != nil {
klog.ErrorS(err, "Failed to remove flows of Service", "Service", svcPortName)
klog.ErrorS(err, "Failed to remove Group of Service with all Endpoints", "Service", svcPortName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.ErrorS(err, "Failed to remove Group of Service with all Endpoints", "Service", svcPortName)
klog.ErrorS(err, "Failed to remove Group of all Endpoints for Service ", "Service", svcPortName)

continue
// Uninstall the group with only local Endpoints when Service externalTrafficPolicy is changed from Local
// to Cluster.
} else if !svcInfo.NodeLocalInternal() && pSvcInfo != nil && pSvcInfo.NodeLocalExternal() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if !svcInfo.NodeLocalInternal() && pSvcInfo != nil && pSvcInfo.NodeLocalExternal() {
} else if !svcInfo.NodeLocalExternal() && pSvcInfo != nil && pSvcInfo.NodeLocalExternal() {

But doesn't "else if" already mean it?

pkg/agent/proxy/proxier.go Show resolved Hide resolved
@hongliangl
Copy link
Contributor Author

/test-all-features-conformance
/test-conformance
/test-e2e
/test-flexible-ipam-e2e
/test-ipv6-conformance
/test-ipv6-e2e
/test-ipv6-networkpolicy
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-only-networkpolicy
/test-multicluster-e2e
/test-networkpolicy
/test-windows-conformance
/test-windows-e2e
/test-windows-networkpolicy
/test-windows-proxyall-e2e
/test-integration

continue
}

} else if !svcInfo.NodeLocalExternal() && pSvcInfo != nil && pSvcInfo.NodeLocalExternal() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment in #3295 (comment).
And you didn't choose the latter approach in https://github.com/antrea-io/antrea/pull/3295/files#r803823027, what if there is a transient error when processing the service in this round during which the policy is changed from local to cluster, could the code take care of cleaning this group next round? will pSvcInfo be local or cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. BTW, I have verified that calling UninstallServiceGroup to uninstall a group which doesn't exist on OVS returns nil. I thought of this will get an error.

time.Sleep(3 * time.Second)

for idx, node := range nodes {
agentName, err := data.getAntreaPodOnNode(node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the err below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@hongliangl
Copy link
Contributor Author

/test-all-features-conformance
/test-conformance
/test-e2e
/test-flexible-ipam-e2e
/test-ipv6-conformance
/test-ipv6-e2e
/test-ipv6-networkpolicy
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-only-networkpolicy
/test-multicluster-e2e
/test-networkpolicy
/test-windows-conformance
/test-windows-e2e
/test-windows-networkpolicy
/test-windows-proxyall-e2e
/test-integration

@@ -624,8 +629,7 @@ func testClusterIPRemoval(t *testing.T, svcIP net.IP, epIP net.IP, isIPv6 bool)
mockRouteClient.EXPECT().AddClusterIPRoute(svcIP).Times(1)
mockOFClient.EXPECT().UninstallServiceFlows(svcIP, uint16(svcPort), bindingProtocol).Times(1)
mockOFClient.EXPECT().UninstallEndpointFlows(bindingProtocol, gomock.Any()).Times(1)
mockOFClient.EXPECT().UninstallServiceGroup(groupID).Times(1)

mockOFClient.EXPECT().UninstallServiceGroup(gomock.Any()).AnyTimes()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be more accurate? It should be called twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. Updated.

@@ -1528,6 +1528,20 @@ func (data *TestData) createAgnhostNodePortService(serviceName string, affinity,
return data.createService(serviceName, testNamespace, 8080, 8080, map[string]string{"app": "agnhost"}, affinity, nodeLocalExternal, corev1.ServiceTypeNodePort, ipFamily)
}

func (data *TestData) updateAgnhostNodePortServiceExternalTrafficPolicy(serviceName string, nodeLocalExternal bool) (*corev1.Service, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the method could just be updateServiceExternalTrafficPolicy to be generic? otherwise we have to define other methods updateNginxNodePortServiceExternalTrafficPolicy, updateAgnhostLoadBalancerServiceExternalTrafficPolicy in future following this style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated.

}

// Function of checking the number of occurrences of the target IP in given OVS group output.
checkOutputGroup := func(groupOutput string, podIP *net.IP, expectedCount int) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it doesn't check the traffic and returned client ip directly but check the OVS groups? I think it's more end to end to do the former and can find issues that cannot detected by the latter check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Updated.

When proxyAll is enabled, create a NodePort/LoadBalancer Service whose
externalTrafficPolicy is Cluster, then only an OVS group with all
Endpoints will be installed. If change externalTrafficPolicy of the
Service from Cluster to Local, an OVS group with only local Endpoints
should be also installed since externalTrafficPolicy is Local, but it
is not. This patch fixes the issue that OVS group with only local
Endpoints is not installed when externalTrafficPolicy of Service is
changed from Cluster to Local.

Signed-off-by: Hongliang Liu <[email protected]>
@hongliangl
Copy link
Contributor Author

/test-e2e
/test-integration

@hongliangl hongliangl requested a review from tnqn February 14, 2022 05:27
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one question

@@ -146,6 +146,14 @@ func probeClientIPFromPod(data *TestData, pod string, baseUrl string) (string, e
return host, err
}

func reverseStrs(strs []string) []string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why it needs to reverse the urls?

Copy link
Contributor Author

@hongliangl hongliangl Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little confused to read.

For example, there are two Nodes, and assumed that A is 192.168.1.1; B is 192.168.1.2. portStr is the port number of a NodePort Service.
The test URLs are generated by be following code:

nodeIPs := []string{controlPlaneNodeIPv4(), workerNodeIPv4(1)}
...
var urls []string
for _, nodeIP := range nodeIPs {
    urls = append(urls, net.JoinHostPort(nodeIP, portStr))
}

To make Node A connect to Node B's NodePort and Node B connect to Node A's NodePort in a for loop, the slice should be reversed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation.

@hongliangl
Copy link
Contributor Author

/test-all-features-conformance
/test-conformance
/test-flexible-ipam-e2e
/test-ipv6-conformance
/test-ipv6-e2e
/test-ipv6-networkpolicy
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-only-networkpolicy
/test-multicluster-e2e
/test-networkpolicy
/test-windows-conformance
/test-windows-e2e
/test-windows-networkpolicy
/test-windows-proxyall-e2e

@hongliangl
Copy link
Contributor Author

/test-e2e
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-only-networkpolicy
/test-windows-conformance
/test-windows-e2e

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Feb 15, 2022

/test-e2e

@tnqn tnqn added action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Feb 15, 2022
@tnqn tnqn merged commit ff87705 into antrea-io:main Feb 15, 2022
@tnqn
Copy link
Member

tnqn commented Feb 15, 2022

@hongliangl Can you backport this back to previous versions if applicable?

bangqipropel pushed a commit to bangqipropel/antrea that referenced this pull request Mar 2, 2022
)

When proxyAll is enabled, create a NodePort/LoadBalancer Service whose
externalTrafficPolicy is Cluster, then only an OVS group with all
Endpoints will be installed. If change externalTrafficPolicy of the
Service from Cluster to Local, an OVS group with only local Endpoints
should be also installed since externalTrafficPolicy is Local, but it
is not. This patch fixes the issue that OVS group with only local
Endpoints is not installed when externalTrafficPolicy of Service is
changed from Cluster to Local.

Signed-off-by: Hongliang Liu <[email protected]>
@hongliangl hongliangl deleted the fix-proxyall-issue branch April 28, 2022 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NodePort/LoadBalancer bug of AntreaProxy when proxyAll is enabled
3 participants