
Cannot access ClusterIP service if the endpoint is on local Node when both AntreaProxy and Egress are enabled #2330

Closed
tnqn opened this issue Jun 29, 2021 · 8 comments · Fixed by #2332
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@tnqn
Member

tnqn commented Jun 29, 2021

Describe the bug
While debugging the test failure in #2306, I found that "TestClusterIP" failed consistently in the mixed Windows+Linux e2e test but never in the Linux-only e2e test, even though the failing case in the former was between two Linux Pods.

=== CONT  TestClusterIP
    service_test.go:59: 
        	Error Trace:	service_test.go:59
        	            				service_test.go:81
        	Error:      	Received unexpected error:
        	            	nc stdout: <>, stderr: <nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	>, err: <command terminated with exit code 1>
        	Test:       	TestClusterIP
        	Messages:   	Pod client-on-same-node should be able to connect 10.99.89.26:80, but was not able to connect
=== CONT  TestClusterIP/ClusterIP/Linux_Pod_on_same_Node_can_access_the_Service
    testing.go:1103: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
=== CONT  TestClusterIP
    fixtures.go:220: Exporting test logs to '/var/lib/jenkins/workspace/antrea-windows-e2e-for-pull-request/antrea-test-logs/TestClusterIP/beforeTeardown.Jun29-09-18-08'
    fixtures.go:324: Error when exporting kubelet logs: error when running journalctl on Node 'a-ms-0005-0', is it available? Error: <nil>
    fixtures.go:345: Deleting 'antrea-test' K8s Namespace
--- FAIL: TestClusterIP (74.15s)
    --- FAIL: TestClusterIP/ClusterIP (0.00s)
        --- PASS: TestClusterIP/ClusterIP/Same_Linux_Node_can_access_the_Service (0.75s)
        --- PASS: TestClusterIP/ClusterIP/Different_Linux_Node_can_access_the_Service (0.90s)
        --- PASS: TestClusterIP/ClusterIP/Windows_host_can_access_the_Service (5.51s)
        --- PASS: TestClusterIP/ClusterIP/Linux_Pod_on_different_Node_can_access_the_Service (18.12s)
        --- FAIL: TestClusterIP/ClusterIP/Linux_Pod_on_same_Node_can_access_the_Service (36.10s)

The failure was not related to #2306; it was first caught in that PR because the Windows CI was down for a few hours when #2318 (the PR that added the test case) was merged, so #2318 was never tested on the Windows testbed. However, the issue was not introduced by #2318 either; that PR merely added the test case that can catch the problem.

The real issue is that a flow added only when the Egress feature is enabled causes packets DNATed by the AntreaProxy flows not to be delivered to the local destination Pod.

table=70, n_packets=20, n_bytes=3544, priority=210,ct_state=+rpl+trk,ct_mark=0x20,ip actions=mod_dl_dst:ca:84:03:18:e8:ea,resubmit(,80)
table=70, n_packets=22, n_bytes=2476, priority=200,ip,reg0=0x2/0xffff,nw_dst=192.168.248.0/24 actions=resubmit(,80)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x2/0xffff,nw_dst=10.176.26.186 actions=resubmit(,80)
table=70, n_packets=0, n_bytes=0, priority=200,ct_mark=0x20,ip,reg0=0x2/0xffff actions=resubmit(,80)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.1 actions=mod_dl_dst:ca:84:03:18:e8:ea,resubmit(,80)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.37 actions=mod_dl_src:ca:84:03:18:e8:ea,mod_dl_dst:ca:bf:6a:fe:af:38,resubmit(,72)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.36 actions=mod_dl_src:ca:84:03:18:e8:ea,mod_dl_dst:0a:d1:10:14:6a:c7,resubmit(,72)

When a local Pod accesses another local Pod via a ClusterIP, the packet is DNATed by the AntreaProxy flows and is supposed to hit a flow in table 70 that rewrites its destination MAC to the target Pod's MAC. But because of the second flow above, the packet jumps to the next table directly with its destination MAC unchanged (i.e. still antrea-gw0's MAC). The packet is then output to antrea-gw0, and the host network routes it back to OVS, corrupting the connection's conntrack state.

2021-06-29T15:19:05.959Z|00002|dpif(handler2)|WARN|system@ovs-system: execute ct(commit,zone=65520,mark=0x20/0xffffffff),recirc(0xd) failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=ca:84:03:18:e8:ea,dl_dst=0a:d1:10:14:6a:c7,nw_src=192.168.248.37,nw_dst=192.168.248.36,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=57692,tp_dst=80,tcp_flags=syn tcp_csum:8233
 with metadata skb_priority(0),skb_mark(0),ct_state(0x21),ct_zone(0xfff0),ct_tuple4(src=192.168.248.37,dst=192.168.248.36,proto=6,tp_src=57692,tp_dst=80),in_port(2) mtu 0

The code that adds the second flow above:

l3FwdTable.BuildFlow(priorityNormal).
	MatchProtocol(ipProto).
	MatchRegRange(int(marksReg), markTrafficFromLocal, binding.Range{0, 15}).
	MatchDstIPNet(localSubnet).
	Action().GotoTable(nextTable).
	Cookie(c.cookieAllocator.Request(category).Raw()).
	Done(),

It was only caught by the Windows e2e job because antrea-agent was not re-deployed there between TestEgress and TestClusterIP (many tests were skipped), so TestClusterIP ran with the Egress feature enabled.

To Reproduce

  1. Enable AntreaProxy and Egress
  2. Deploy two Pods on a Node and expose one Pod as a ClusterIP service
  3. Access the ClusterIP from another Pod

Expected
The access should succeed.

Actual behavior
The access failed.

Versions:
Please provide the following information:

  • Antrea version (Docker image tag): v1.0.0~v1.1.0
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jun 29, 2021
@tnqn
Member Author

tnqn commented Jun 29, 2021

@jianjuns is the above flow necessary to support Egress? Why does it need to bypass the normal L3 flows for local Pods?

@tnqn
Member Author

tnqn commented Jun 29, 2021

I'm not sure whether this is also an issue on Windows: since the flow is installed unconditionally there, would it prevent local Pods from accessing each other via a ClusterIP?

@jianjuns
Contributor

@jianjuns is the above flow necessary to support Egress? why does it need to bypass normal L3 flows for local Pods?

The flow is added to bypass:

table=70, priority=190,ip,reg0=0x2/0xffff actions=goto_table:71

Do you have any better idea? The only alternative I can think of is an extra table for MAC rewriting.

@jianjuns
Contributor

jianjuns commented Jun 29, 2021

How about changing the priorities of the MAC-rewrite flows and/or the SNAT-skipping (195?) flow? It is a little strange to send the local Service traffic to the TTL table (or do you think such packets should get a TTL decrement?).

@wenyingd
Contributor

wenyingd commented Jun 30, 2021

This flow exists only on Windows, because we want to perform SNAT on packets sent to external addresses. Since we can't predict the destinations of external traffic, we have to give local traffic a higher priority than external traffic. However, this flow and the MAC-rewrite flow currently have the same priority, so it can prevent the MAC-rewrite action from being applied to packets whose source and destination are on the same Node. I would prefer to lower this flow's priority a bit (below 200) while keeping it higher than the SNAT flow. Or should we give the MAC-rewrite flow a higher priority?

table=70, n_packets=22, n_bytes=2476, priority=200,ip,reg0=0x2/0xffff,nw_dst=192.168.248.0/24 actions=resubmit(,80)

@tnqn
Member Author

tnqn commented Jun 30, 2021

Thanks for your input @jianjuns @wenyingd.

@jianjuns I think a normal LB/router decrements the TTL when forwarding traffic; should we keep the same behavior? I believe the TTL is decremented in the kube-proxy case too.

@wenyingd the flow exists on Linux too when the Egress feature is enabled, since that feature requires flows to SNAT Pod-to-external traffic.

For the solution that gives the MAC-rewrite flow a higher priority than the SNAT-skipping flow: it's doable, but it would introduce a fourth priority, while we normally use only the low, normal, and high priorities. I'm thinking of a solution that keeps them at the same priority and makes the SNAT-skipping flow apply only to traffic without the macRewrite mark. The flows would look like this:

// For traffic to a local Pod that doesn't require MAC rewriting (it must be the L2 forwarding case), skip SNAT and TTL.
table=70, priority=200,ip,reg0=0/0x80000,nw_dst=192.168.0.0/24 actions=resubmit(,80)
// For traffic to a local Pod that requires MAC rewriting (the L3 forwarding case: the packet comes from the uplink/tunnel or was DNATed locally), skip SNAT.
table=70, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.0.35 actions=mod_dl_src:0e:6d:42:66:92:46,mod_dl_dst:d2:7b:cd:ce:ad:a9,resubmit(,72)
table=70, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.0.36 actions=mod_dl_src:0e:6d:42:66:92:46,mod_dl_dst:be:67:0f:33:59:32,resubmit(,72)

With the above flows, all traffic to local Pods is handled at priority 200 using the same match fields, regardless of where the traffic comes from. I feel it's easier to understand; what do you think?
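The proposal can be sanity-checked mechanically. Below is a minimal Go sketch (a hypothetical helper, not Antrea code) that models each reg0 match as a value/mask pair: the current pair of matches (reg0=0x2/0xffff and reg0=0x80000/0x80000) can both fit the same packet, while the proposed pair (reg0=0/0x80000 vs reg0=0x80000/0x80000) is provably disjoint, so at most one of the priority-200 flows can match any given packet:

```go
package main

import "fmt"

// regMatch models an OpenFlow register match as value/mask, e.g. reg0=0x2/0xffff.
type regMatch struct {
	value, mask uint32
}

// overlap reports whether some register value can satisfy both matches, i.e.
// the two matches agree on every bit that both masks care about.
func overlap(a, b regMatch) bool {
	common := a.mask & b.mask
	return a.value&common == b.value&common
}

func main() {
	fromLocal := regMatch{0x2, 0xffff}       // reg0=0x2/0xffff (traffic from a local Pod)
	macRewrite := regMatch{0x80000, 0x80000} // reg0=0x80000/0x80000 (macRewrite bit set)
	noRewrite := regMatch{0x0, 0x80000}      // reg0=0/0x80000 (macRewrite bit clear)

	// Current flows: a DNATed local packet (reg0=0x80002) matches both.
	fmt.Println(overlap(fromLocal, macRewrite)) // true
	// Proposed flows: the matches are disjoint, no packet can hit both.
	fmt.Println(overlap(noRewrite, macRewrite)) // false
}
```

Because the SNAT-skipping flow and the MAC-rewrite flows now disagree on the macRewrite bit, their relative order within priority 200 no longer matters.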

@jianjuns
Contributor

But why did you remove this flow?

table=70, n_packets=0, n_bytes=0, priority=200,ct_mark=0x20,ip,reg0=0x2/0xffff actions=resubmit(,80)

@tnqn
Member Author

tnqn commented Jul 2, 2021

I tested on Windows and confirmed the issue indeed applies there even when the Egress feature is not enabled (the feature cannot be enabled on Windows currently).

But the access didn't always fail. It depends on which flow is matched first: even when flows have the same priority, one of them will be evaluated first, shadowing the other. For example, with the ordering below there is no problem. That may explain some flaky tests on Windows. @wenyingd @lzhecheng

table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.37 actions=mod_dl_src:ca:84:03:18:e8:ea,mod_dl_dst:ca:bf:6a:fe:af:38,resubmit(,72)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.36 actions=mod_dl_src:ca:84:03:18:e8:ea,mod_dl_dst:0a:d1:10:14:6a:c7,resubmit(,72)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x2/0xffff,nw_dst=192.168.248.0/24 actions=resubmit(,80)

@tnqn tnqn closed this as completed in #2332 Jul 5, 2021