TestProxyLoadBalancerModeDSR test failed occasionally #6704

Open
tnqn opened this issue Sep 30, 2024 · 0 comments · May be fixed by #6702
Labels
area/test/e2e Issues or PRs related to Antrea specific end-to-end testing.

tnqn commented Sep 30, 2024

Describe the bug

@antoninbas reported that TestProxyLoadBalancerModeDSR/IPv4,withSessionAffinity has been failing occasionally in Kind CI:
https://github.com/antrea-io/antrea/actions/runs/10943236467
https://github.com/antrea-io/antrea/actions/runs/11018658226

The error is typically the following:

=== RUN   TestProxyLoadBalancerModeDSR
2024/09/19 15:31:32 Applying Antrea YAML
2024/09/19 15:31:34 Waiting for all Antrea DaemonSet Pods
2024/09/19 15:31:35 Checking CoreDNS deployment
    fixtures.go:286: Creating 'testproxyloadbalancermodedsr-3i55riyi' K8s Namespace
=== RUN   TestProxyLoadBalancerModeDSR/IPv4,withSessionAffinity
    proxy_test.go:1182: 
        	Error Trace:	/home/runner/work/antrea/antrea/test/e2e/proxy_test.go:1182
        	            				/home/runner/work/antrea/antrea/test/e2e/proxy_test.go:1203
        	Error:      	Not equal: 
        	            	expected: "1.1.1.1"
        	            	actual  : "10.244.0.1"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-1.1.1.1
        	            	+10.244.0.1
        	Test:       	TestProxyLoadBalancerModeDSR/IPv4,withSessionAffinity
        	Messages:   	Client IP should be preserved with DSR mode
    proxy_test.go:1187: Request #0 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1187: Request #1 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1187: Request #2 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1187: Request #3 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1187: Request #4 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1187: Request #5 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1187: Request #6 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1187: Request #7 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1187: Request #8 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1187: Request #9 from external-client-jjk3ndrv got hostname: agnhost-1
    proxy_test.go:1182: 
        	Error Trace:	/home/runner/work/antrea/antrea/test/e2e/proxy_test.go:1182
        	            				/home/runner/work/antrea/antrea/test/e2e/proxy_test.go:1204
        	Error:      	Not equal: 
        	            	expected: "10.244.0.58"
        	            	actual  : "10.244.0.1"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-10.244.0.58
        	            	+10.244.0.1
        	Test:       	TestProxyLoadBalancerModeDSR/IPv4,withSessionAffinity
        	Messages:   	Client IP should be preserved with DSR mode
    proxy_test.go:1187: Request #0 from internal-client got hostname: agnhost-1
    proxy_test.go:1187: Request #1 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #2 from internal-client got hostname: agnhost-1
    proxy_test.go:1187: Request #3 from internal-client got hostname: agnhost-1
    proxy_test.go:1187: Request #4 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #5 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #6 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #7 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #8 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #9 from internal-client got hostname: agnhost-3
    proxy_test.go:1197: 
        	Error Trace:	/home/runner/work/antrea/antrea/test/e2e/proxy_test.go:1197
        	            				/home/runner/work/antrea/antrea/test/e2e/proxy_test.go:1204
        	Error:      	"map[agnhost-1:{} agnhost-3:{}]" should have 1 item(s), but has 2
        	Test:       	TestProxyLoadBalancerModeDSR/IPv4,withSessionAffinity
        	Messages:   	Hostnames should be the same when session affinity is enabled
    proxy_test.go:1217: 
        	Error Trace:	/home/runner/work/antrea/antrea/test/e2e/proxy_test.go:1217
        	Error:      	Should not be: "1.1.1.1"
        	Test:       	TestProxyLoadBalancerModeDSR/IPv4,withSessionAffinity
        	Messages:   	Client IP should not be preserved with NAT mode
    fixtures.go:531: Deleting Pod 'external-client-jjk3ndrv'
=== RUN   TestProxyLoadBalancerModeDSR/IPv4,withoutSessionAffinity
    proxy_test.go:1187: Request #0 from external-client-ub9ixulw got hostname: agnhost-1
    proxy_test.go:1187: Request #1 from external-client-ub9ixulw got hostname: agnhost-2
    proxy_test.go:1187: Request #2 from external-client-ub9ixulw got hostname: agnhost-0
    proxy_test.go:1187: Request #3 from external-client-ub9ixulw got hostname: agnhost-2
    proxy_test.go:1187: Request #4 from external-client-ub9ixulw got hostname: agnhost-1
    proxy_test.go:1187: Request #5 from external-client-ub9ixulw got hostname: agnhost-3
    proxy_test.go:1187: Request #6 from external-client-ub9ixulw got hostname: agnhost-2
    proxy_test.go:1187: Request #7 from external-client-ub9ixulw got hostname: agnhost-2
    proxy_test.go:1187: Request #8 from external-client-ub9ixulw got hostname: agnhost-3
    proxy_test.go:1187: Request #9 from external-client-ub9ixulw got hostname: agnhost-0
    proxy_test.go:1187: Request #0 from internal-client got hostname: agnhost-1
    proxy_test.go:1187: Request #1 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #2 from internal-client got hostname: agnhost-2
    proxy_test.go:1187: Request #3 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #4 from internal-client got hostname: agnhost-0
    proxy_test.go:1187: Request #5 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #6 from internal-client got hostname: agnhost-0
    proxy_test.go:1187: Request #7 from internal-client got hostname: agnhost-3
    proxy_test.go:1187: Request #8 from internal-client got hostname: agnhost-2
    proxy_test.go:1187: Request #9 from internal-client got hostname: agnhost-1
    fixtures.go:531: Deleting Pod 'external-client-ub9ixulw'

Analysis

Looking at the antrea-agent log, I suspect the cause is that Antrea's proxy runner was throttled by its rate limiter while kube-proxy's runner wasn't. The following may be what happened:

  • 15:31:40.209: It received the Service creation; the LB IP was not set yet.
  • 15:31:40.230: The 1st sync finished.
I0919 15:31:40.209279      13 config.go:242] Calling handler.OnServiceAdd
I0919 15:31:40.230760      13 proxier.go:1000] syncProxyRules took 21.428631ms
I0919 15:31:40.230777      13 runner.go:220] antrea-agent-proxy: ran, next possible in 1s, periodic in 30s
  • 15:31:40.240: It received the EndpointSlice creation. The 2nd sync started immediately because the burst is 2.
  • 15:31:40.244: During the 2nd sync it received the Service update, which had the LB IP set. This scheduled the 3rd sync in 999ms, around 15:31:41.251.
  • 15:31:40.252: The 2nd sync finished.
I0919 15:31:40.240839      13 config.go:333] "Calling handler.OnEndpointSliceAdd" endpointSlice="testproxyloadbalancermodedsr-3i55riyi/svc-dsr-b7mql"
I0919 15:31:40.244894      13 config.go:259] Calling handler.OnServiceUpdate
I0919 15:31:40.252834      13 proxier.go:1000] syncProxyRules took 11.562958ms
I0919 15:31:40.252903      13 runner.go:220] antrea-agent-proxy: ran, next possible in 1s, periodic in 30s
I0919 15:31:40.252914      13 runner.go:229] antrea-agent-proxy: 15.7µs since last run, possible in 999.9843ms, scheduled in 29.9999843s
I0919 15:31:40.252923      13 runner.go:236] antrea-agent-proxy: throttled, scheduling run in 999.9843ms
  • 15:31:41.277: During the 3rd sync, it added the LB IP to the ipset. Before that, requests were processed by kube-proxy's iptables rules, which should account for the error "Client IP should be preserved with DSR mode".
  • 15:31:41.278: The 3rd sync finished.
I0919 15:31:41.277980      13 route_linux.go:1924] "Added external IP to ipset" IPSet="ANTREA-EXTERNAL-IP" IP="1.1.2.1"
I0919 15:31:41.278022      13 proxier.go:1000] syncProxyRules took 24.607467ms
I0919 15:31:41.278036      13 runner.go:220] antrea-agent-proxy: ran, next possible in 1s, periodic in 30s
  • 15:31:41.727: It received the Service update, which changed the LoadBalancerMode to NAT mode and scheduled the 4th sync in 549ms, around 15:31:42.277.
I0919 15:31:41.727486      13 config.go:259] Calling handler.OnServiceUpdate
I0919 15:31:41.728362      13 runner.go:229] antrea-agent-proxy: 450.324741ms since last run, possible in 549.675259ms, scheduled in 29.549675259s
I0919 15:31:41.728376      13 runner.go:236] antrea-agent-proxy: throttled, scheduling run in 549.675259ms
  • 15:31:41.796: It received the Service deletion. This change would also be handled by the 4th sync, so the Service was removed directly without ever behaving in NAT mode, which should account for the error "Client IP should not be preserved with NAT mode".
I0919 15:31:41.796053      13 config.go:369] "Calling handler.OnEndpointSliceDelete" endpointSlice="testproxyloadbalancermodedsr-3i55riyi/svc-dsr-b7mql"
I0919 15:31:41.796082      13 proxier.go:1089] "Processing EndpointSlice DELETE event" EndpointSlice="testproxyloadbalancermodedsr-3i55riyi/svc-dsr-b7mql"
I0919 15:31:41.796219      13 runner.go:229] antrea-agent-proxy: 518.181092ms since last run, possible in 481.818908ms, scheduled in 29.481818908s
I0919 15:31:41.796273      13 runner.go:236] antrea-agent-proxy: throttled, scheduling run in 481.818908ms
  • 15:31:42.289: The 4th sync deleted the LB IP.
I0919 15:31:42.289073      13 route_linux.go:1960] "Deleted route for external IP" IP="1.1.2.1"
I0919 15:31:42.290635      13 route_linux.go:1966] "Deleted external IP from ipset" IPSet="ANTREA-EXTERNAL-IP" IP="1.1.2.1"
I0919 15:31:42.301581      13 proxier.go:1000] syncProxyRules took 22.762799ms
I0919 15:31:42.301593      13 runner.go:220] antrea-agent-proxy: ran, next possible in 1s, periodic in 30s

The code works as intended: the rate limiter prevents the proxy runner from executing too frequently in response to each event. To avoid test flakiness, we could add a 1s delay to wait for the Service to be fully realized. The sudden appearance and disappearance of the issue may be caused by performance fluctuations of the GitHub runners. If both EndpointSliceAdd and ServiceUpdate events arrived before the 1st sync finished, they would be handled together by the 2nd sync, and the test would succeed.

@tnqn tnqn added kind/bug Categorizes issue or PR as related to a bug. area/test/e2e Issues or PRs related to Antrea specific end-to-end testing. and removed kind/bug Categorizes issue or PR as related to a bug. labels Sep 30, 2024
@tnqn tnqn self-assigned this Sep 30, 2024
@tnqn tnqn linked a pull request Sep 30, 2024 that will close this issue