
Cannot access ClusterIP service if the endpoint is on local Node when both AntreaProxy and Egress are enabled #2330

Closed
tnqn opened this issue Jun 29, 2021 · 8 comments · Fixed by #2332
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@tnqn
Member

tnqn commented Jun 29, 2021

Describe the bug
While debugging the test failure in #2306, I found that "TestClusterIP" failed consistently in the mixed Windows+Linux e2e test but never in the Linux-only e2e test, even though the failing case in the former was between two Linux Pods.

=== CONT  TestClusterIP
    service_test.go:59: 
        	Error Trace:	service_test.go:59
        	            				service_test.go:81
        	Error:      	Received unexpected error:
        	            	nc stdout: <>, stderr: <nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	>, err: <command terminated with exit code 1>
        	Test:       	TestClusterIP
        	Messages:   	Pod client-on-same-node should be able to connect 10.99.89.26:80, but was not able to connect
=== CONT  TestClusterIP/ClusterIP/Linux_Pod_on_same_Node_can_access_the_Service
    testing.go:1103: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
=== CONT  TestClusterIP
    fixtures.go:220: Exporting test logs to '/var/lib/jenkins/workspace/antrea-windows-e2e-for-pull-request/antrea-test-logs/TestClusterIP/beforeTeardown.Jun29-09-18-08'
    fixtures.go:324: Error when exporting kubelet logs: error when running journalctl on Node 'a-ms-0005-0', is it available? Error: <nil>
    fixtures.go:345: Deleting 'antrea-test' K8s Namespace
--- FAIL: TestClusterIP (74.15s)
    --- FAIL: TestClusterIP/ClusterIP (0.00s)
        --- PASS: TestClusterIP/ClusterIP/Same_Linux_Node_can_access_the_Service (0.75s)
        --- PASS: TestClusterIP/ClusterIP/Different_Linux_Node_can_access_the_Service (0.90s)
        --- PASS: TestClusterIP/ClusterIP/Windows_host_can_access_the_Service (5.51s)
        --- PASS: TestClusterIP/ClusterIP/Linux_Pod_on_different_Node_can_access_the_Service (18.12s)
        --- FAIL: TestClusterIP/ClusterIP/Linux_Pod_on_same_Node_can_access_the_Service (36.10s)

The failure was not related to #2306; it was first caught in that PR because the Windows CI was down for a few hours when #2318 (the PR that added the test case) was merged, so #2318 was never tested on the Windows testbed. However, the issue was not introduced by #2318 either; that PR merely added the test case that can catch the problem.

The real issue is that a flow added only when the Egress feature is enabled causes packets DNATed by the AntreaProxy flows not to be delivered to the local destination Pod.

table=70, n_packets=20, n_bytes=3544, priority=210,ct_state=+rpl+trk,ct_mark=0x20,ip actions=mod_dl_dst:ca:84:03:18:e8:ea,resubmit(,80)
table=70, n_packets=22, n_bytes=2476, priority=200,ip,reg0=0x2/0xffff,nw_dst=192.168.248.0/24 actions=resubmit(,80)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x2/0xffff,nw_dst=10.176.26.186 actions=resubmit(,80)
table=70, n_packets=0, n_bytes=0, priority=200,ct_mark=0x20,ip,reg0=0x2/0xffff actions=resubmit(,80)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.1 actions=mod_dl_dst:ca:84:03:18:e8:ea,resubmit(,80)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.37 actions=mod_dl_src:ca:84:03:18:e8:ea,mod_dl_dst:ca:bf:6a:fe:af:38,resubmit(,72)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.36 actions=mod_dl_src:ca:84:03:18:e8:ea,mod_dl_dst:0a:d1:10:14:6a:c7,resubmit(,72)

When a local Pod accesses another local Pod via a ClusterIP, the packet is DNATed by the AntreaProxy flows and is supposed to hit a flow in table 70 that rewrites its destination MAC to the target Pod's MAC. But because of the second flow above, the packet jumps to the next table directly with its destination MAC unchanged (i.e. still antrea-gw0's MAC). The packet is then output to antrea-gw0, and the host network routes it back to OVS, corrupting the connection's conntrack state.

2021-06-29T15:19:05.959Z|00002|dpif(handler2)|WARN|system@ovs-system: execute ct(commit,zone=65520,mark=0x20/0xffffffff),recirc(0xd) failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=ca:84:03:18:e8:ea,dl_dst=0a:d1:10:14:6a:c7,nw_src=192.168.248.37,nw_dst=192.168.248.36,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=57692,tp_dst=80,tcp_flags=syn tcp_csum:8233
 with metadata skb_priority(0),skb_mark(0),ct_state(0x21),ct_zone(0xfff0),ct_tuple4(src=192.168.248.37,dst=192.168.248.36,proto=6,tp_src=57692,tp_dst=80),in_port(2) mtu 0

The code that adds the second flow above:

l3FwdTable.BuildFlow(priorityNormal).
	MatchProtocol(ipProto).
	MatchRegRange(int(marksReg), markTrafficFromLocal, binding.Range{0, 15}).
	MatchDstIPNet(localSubnet).
	Action().GotoTable(nextTable).
	Cookie(c.cookieAllocator.Request(category).Raw()).
	Done(),

It was only caught by the Windows e2e job because antrea-agent was not re-deployed there between TestEgress and TestClusterIP (many tests were skipped), so TestClusterIP ran with the Egress feature enabled.

To Reproduce

  1. Enable AntreaProxy and Egress
  2. Deploy two Pods on a Node and expose one Pod as a ClusterIP service
  3. Access the ClusterIP from another Pod

Expected
The access should succeed.

Actual behavior
The access failed.

Versions:
Please provide the following information:

  • Antrea version (Docker image tag): v1.0.0~v1.1.0
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jun 29, 2021
@tnqn
Member Author

tnqn commented Jun 29, 2021

@jianjuns is the above flow necessary to support Egress? Why does it need to bypass the normal L3 flows for local Pods?

@tnqn
Member Author

tnqn commented Jun 29, 2021

I'm not sure whether this is also an issue on Windows: since the flow is installed unconditionally there, would it prevent local Pods from accessing each other via a ClusterIP?

@jianjuns
Contributor

@jianjuns is the above flow necessary to support Egress? why does it need to bypass normal L3 flows for local Pods?

The flow is added to bypass:

table=70, priority=190,ip,reg0=0x2/0xffff actions=goto_table:71

Do you have any better idea? The only alternative I can think of is an extra table for MAC rewriting.

@jianjuns
Contributor

jianjuns commented Jun 29, 2021

How about changing the priorities of the MAC-rewrite flows and/or the SNAT-skipping (195?) flow? It is a little strange to send the local Service traffic to the TTL table (or do you think such packets should get a TTL decrement?).

@wenyingd
Contributor

wenyingd commented Jun 30, 2021

This flow exists only on Windows, because we want to perform SNAT on packets sent to external addresses. Since we can't predict the destinations of external traffic, we have to give local traffic a higher priority than external traffic. However, this flow and the MAC-rewrite flow currently have the same priority, so it can prevent the MAC-rewrite action from being applied to packets whose source and destination are on the same Node. I would prefer to lower this flow's priority a bit (below 200) while keeping it higher than the SNAT flow. Or should we give the MAC-rewrite flow a higher priority?

table=70, n_packets=22, n_bytes=2476, priority=200,ip,reg0=0x2/0xffff,nw_dst=192.168.248.0/24 actions=resubmit(,80)

@tnqn
Member Author

tnqn commented Jun 30, 2021

Thanks for your input @jianjuns @wenyingd.

@jianjuns I think a normal LB/router decrements the TTL when forwarding traffic; should we keep the same behavior? I believe the TTL is decremented in the kube-proxy case too.

@wenyingd the flow exists on Linux too when the Egress feature is enabled, since that feature requires flows to SNAT Pod-to-external traffic.

For the solution that gives the MAC-rewrite flow a higher priority than the SNAT-skipping flow: it's doable, but it would introduce a fourth priority, while we normally use only the low, normal, and high priorities. I'm thinking of a solution that keeps them at the same priority and makes the SNAT-skipping flow apply only to traffic without the macRewrite mark. The flows would look like this:

// For traffic to a local Pod that doesn't require MAC rewriting (it must be the L2 forwarding case), skip SNAT and TTL.
table=70, priority=200,ip,reg0=0/0x80000,nw_dst=192.168.0.0/24 actions=resubmit(,80)
// For traffic to a local Pod that requires MAC rewriting (the L3 forwarding case: the packet comes from the uplink/tunnel or was DNATed locally), skip SNAT.
table=70, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.0.35 actions=mod_dl_src:0e:6d:42:66:92:46,mod_dl_dst:d2:7b:cd:ce:ad:a9,resubmit(,72)
table=70, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.0.36 actions=mod_dl_src:0e:6d:42:66:92:46,mod_dl_dst:be:67:0f:33:59:32,resubmit(,72)

With the above flows, all traffic to local Pods is handled at priority 200 using the same match fields, regardless of where the traffic comes from. I feel it's easier to understand; what do you think?
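The proposal can be sanity-checked mechanically. Below is a minimal Go sketch (a hypothetical helper, not Antrea code) that models each reg0 match as a value/mask pair: the current pair of matches (reg0=0x2/0xffff and reg0=0x80000/0x80000) can both fit the same packet, while the proposed pair (reg0=0/0x80000 vs reg0=0x80000/0x80000) is provably disjoint, so at most one of the priority-200 flows can match any given packet:

```go
package main

import "fmt"

// regMatch models an OpenFlow register match as value/mask, e.g. reg0=0x2/0xffff.
type regMatch struct {
	value, mask uint32
}

// overlap reports whether some register value can satisfy both matches, i.e.
// the two matches agree on every bit that both masks care about.
func overlap(a, b regMatch) bool {
	common := a.mask & b.mask
	return a.value&common == b.value&common
}

func main() {
	fromLocal := regMatch{0x2, 0xffff}       // reg0=0x2/0xffff (traffic from a local Pod)
	macRewrite := regMatch{0x80000, 0x80000} // reg0=0x80000/0x80000 (macRewrite bit set)
	noRewrite := regMatch{0x0, 0x80000}      // reg0=0/0x80000 (macRewrite bit clear)

	// Current flows: a DNATed local packet (reg0=0x80002) matches both.
	fmt.Println(overlap(fromLocal, macRewrite)) // true
	// Proposed flows: the matches are disjoint, no packet can hit both.
	fmt.Println(overlap(noRewrite, macRewrite)) // false
}
```

Because the SNAT-skipping flow and the MAC-rewrite flows now disagree on the macRewrite bit, their relative order within priority 200 no longer matters.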

@jianjuns
Contributor

But why did you remove this flow?

table=70, n_packets=0, n_bytes=0, priority=200,ct_mark=0x20,ip,reg0=0x2/0xffff actions=resubmit(,80)

@tnqn
Member Author

tnqn commented Jul 2, 2021

I tested on Windows and confirmed the issue indeed applies there even when the Egress feature is not enabled (the feature cannot be enabled on Windows currently).

But the access didn't always fail. It depends on which flow is matched first: even when flows have the same priority, one of them will be evaluated first, shadowing the other. For example, with the ordering below there is no problem. That may explain some flaky tests on Windows. @wenyingd @lzhecheng

table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.37 actions=mod_dl_src:ca:84:03:18:e8:ea,mod_dl_dst:ca:bf:6a:fe:af:38,resubmit(,72)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.248.36 actions=mod_dl_src:ca:84:03:18:e8:ea,mod_dl_dst:0a:d1:10:14:6a:c7,resubmit(,72)
table=70, n_packets=0, n_bytes=0, priority=200,ip,reg0=0x2/0xffff,nw_dst=192.168.248.0/24 actions=resubmit(,80)

@tnqn tnqn closed this as completed in #2332 Jul 5, 2021