[Flaky e2e test] TestConnectivity/testOVSRestartSameNode #6338

antoninbas · 2024-05-16T22:00:23Z

While working on #6090, we realized that TestConnectivity/testOVSRestartSameNode was failing frequently.

Originally, we thought that the failure was caused by a change in the PR, but I was able to reproduce the failure on the main branch (at the time), using the normal Antrea images (i.e., not coverage-enabled).
The following can be observed in the test logs when the test fails:

=== RUN   TestConnectivity
2024/05/16 14:55:24 Waiting for all Antrea DaemonSet Pods
2024/05/16 14:55:25 Checking CoreDNS deployment
    fixtures.go:280: Creating 'testconnectivity-rx7x4lf3' K8s Namespace
=== RUN   TestConnectivity/testOVSRestartSameNode
    connectivity_test.go:338: Creating two toolbox test Pods on 'kind-worker'
    fixtures.go:579: Creating a test Pod 'test-pod-dv06q1uw' and waiting for IP
    fixtures.go:579: Creating a test Pod 'test-pod-frgii0i8' and waiting for IP
    connectivity_test.go:379: Restarting antrea-agent on Node 'kind-worker'
    connectivity_test.go:360: Arping loss rate: 4.000000%
    connectivity_test.go:367: ARPING 10.10.1.14
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=0 time=3.972 msec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=1 time=539.269 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=2 time=312.156 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=3 time=198.099 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=4 time=292.146 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=5 time=420.210 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=6 time=251.125 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=7 time=393.197 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=8 time=410.205 usec
        Timeout
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=9 time=3.484 msec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=10 time=278.139 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=11 time=639.319 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=12 time=188.094 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=13 time=87.044 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=14 time=352.176 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=15 time=95.047 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=16 time=63.032 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=17 time=373.186 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=18 time=333.167 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=19 time=99.049 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=20 time=214.107 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=21 time=298.149 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=22 time=699.350 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=23 time=271.135 usec

        --- 10.10.1.14 statistics ---
        25 packets transmitted, 24 packets received,   4%!u(MISSING)nanswered (0 extra)
        rtt min/avg/max/std-dev = 0.063/0.594/3.972/0.960 ms
    connectivity_test.go:385: Arping test failed: arping loss rate is 4.000000%
    fixtures.go:531: Deleting Pod 'test-pod-frgii0i8'
    fixtures.go:531: Deleting Pod 'test-pod-dv06q1uw'

As a result, and to avoid blocking the PR, the test case has been temporarily disabled:

antrea/test/e2e/connectivity_test.go

Lines 60 to 65 in 8d9c455

    
           t.Run("testOVSRestartSameNode", func(t *testing.T) { 
        
           	skipIfNotIPv4Cluster(t) 
        
           	skipIfHasWindowsNodes(t) 
        
           	t.Skip("Skipping test for now as it fails consistently") 
        
           	testOVSRestartSameNode(t, data, data.testNamespace) 
        
           })

The failure now needs to be investigated so that the test case can be restored.

The failure above seems to indicate that the OVS restart indeed leads to a disruption of the datapath. I believe that the start / stop delay in the old coverage-enabled images was somehow hiding this issue (and of course ww only run e2e tests with coverage enabled...). See:

antrea/build/charts/antrea/templates/agent/daemonset.yaml

Line 134 in 33e6da2

    
           args: ["-c", "sleep 2; antrea-agent-coverage -test.run=TestBincoverRunMain -test.coverprofile=antrea-agent.cov.out -args-file=/agent-arg-file; while true; do sleep 5 & wait $!; done"]

To reproduce

Create a Kind cluster: kind create cluster --config ci/kind/config-2nodes.yml
Install the latest Antrea: kubectl apply -f build/yamls/antrea.yml
Run the test: go test -timeout=10m -v -count=10 -run=TestConnectivity/testOVSRestartSameNode antrea.io/antrea/test/e2e -provider=kind -coverage=false -deploy-antrea=false. This command will run the test case 10 times, and I have observed a failure rate of around 25%.

The text was updated successfully, but these errors were encountered:

antoninbas · 2024-05-16T23:45:13Z

@tnqn I wanted your opinion on this. I found this old comment of yours (#625 (comment)):

Besides, I found another good reason of using "flow-restore-wait", previously datapath flows would be cleaned once ovs-vswitchd was started, so existing connections especially cross-node ones could still have some downtime before antrea-agent restores the flows.

This doesn't exactly match what I have observed while troubleshooting this test. I have observed that when we remove flow-restore-wait, datapath flows are flushed. If I'm monitoring continuously (dumping the flows every second), I can see the following:

=============
Thu May 16 23:33:50 UTC 2024
recirc_id(0),in_port(3),eth(src=5a:b8:c9:64:ce:a9,dst=5a:4f:1e:de:2c:8c),eth_type(0x0806),arp(sip=10.10.1.44,tip=10.10.1.1,op=1/0xff,sha=5a:b8:c9:64:ce:a9), packets:0, bytes:0, used:never, actions:2
recirc_id(0),tunnel(tun_id=0x0,src=172.18.0.2,dst=172.18.0.3,flags(-df-csum+key)),in_port(1),eth(),eth_type(0x0800),ipv4(dst=10.10.1.32/255.255.255.224,frag=no), packets:17, bytes:1666, used:0.948s, actions:ct(zone=65520,nat),recirc(0x2)
recirc_id(0x2),tunnel(tun_id=0x0,src=172.18.0.2,dst=172.18.0.3,flags(-df-csum+key)),in_port(1),skb_mark(0/0x80000000),ct_state(-new+est+rpl-inv+trk-snat),ct_mark(0x3/0x7f),eth(src=56:6f:1f:69:e6:87,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(dst=10.10.1.44,proto=1,ttl=63,frag=no), packets:17, bytes:1666, used:0.948s, actions:set(eth(src=5a:4f:1e:de:2c:8c,dst=5a:b8:c9:64:ce:a9)),set(ipv4(ttl=62)),3
recirc_id(0),in_port(2),eth(src=5a:4f:1e:de:2c:8c,dst=5a:b8:c9:64:ce:a9),eth_type(0x0806),arp(sip=10.10.1.1,tip=10.10.1.44,op=2/0xff,sha=5a:4f:1e:de:2c:8c), packets:0, bytes:0, used:never, actions:3
recirc_id(0x1),in_port(3),skb_mark(0/0x80000000),ct_state(-new+est-rpl-inv+trk-snat),ct_mark(0x3/0x7f),eth(src=5a:b8:c9:64:ce:a9,dst=5a:4f:1e:de:2c:8c),eth_type(0x0800),ipv4(dst=10.10.0.0/255.255.255.0,proto=1,tos=0/0x3,ttl=64,frag=no), packets:17, bytes:1666, used:0.948s, actions:set(tunnel(tun_id=0x0,dst=172.18.0.2,ttl=64,tp_dst=6081,flags(df|key))),set(eth(src=5a:4f:1e:de:2c:8c,dst=aa:bb:cc:dd:ee:ff)),set(ipv4(ttl=63)),1
recirc_id(0),in_port(3),eth(src=5a:b8:c9:64:ce:a9),eth_type(0x0800),ipv4(src=10.10.1.44,dst=10.10.0.0/255.255.255.0,frag=no), packets:17, bytes:1666, used:0.948s, actions:ct(zone=65520,nat),recirc(0x1)
=============
Thu May 16 23:33:51 UTC 2024
=============
Thu May 16 23:33:52 UTC 2024
recirc_id(0),in_port(3),eth(src=5a:b8:c9:64:ce:a9),eth_type(0x0800),ipv4(src=10.10.1.44,dst=10.10.0.0/255.255.255.0,frag=no), packets:0, bytes:0, used:never, actions:ct(zone=65520,nat),recirc(0x1)
recirc_id(0),tunnel(tun_id=0x0,src=172.18.0.2,dst=172.18.0.3,flags(-df-csum+key)),in_port(1),eth(),eth_type(0x0800),ipv4(dst=10.10.1.32/255.255.255.224,frag=no), packets:0, bytes:0, used:never, actions:ct(zone=65520,nat),recirc(0x2)
recirc_id(0x2),tunnel(tun_id=0x0,src=172.18.0.2,dst=172.18.0.3,flags(-df-csum+key)),in_port(1),skb_mark(0/0x80000000),ct_state(-new+est+rpl-inv+trk-snat),ct_mark(0x3/0x7f),eth(src=56:6f:1f:69:e6:87,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(dst=10.10.1.44,proto=1,ttl=63,frag=no), packets:0, bytes:0, used:never, actions:set(eth(src=5a:4f:1e:de:2c:8c,dst=5a:b8:c9:64:ce:a9)),set(ipv4(ttl=62)),3
recirc_id(0x1),in_port(3),skb_mark(0/0x80000000),ct_state(-new+est-rpl-inv+trk-snat),ct_mark(0x3/0x7f),eth(src=5a:b8:c9:64:ce:a9,dst=5a:4f:1e:de:2c:8c),eth_type(0x0800),ipv4(dst=10.10.0.0/255.255.255.0,proto=1,tos=0/0x3,ttl=64,frag=no), packets:0, bytes:0, used:never, actions:set(tunnel(tun_id=0x0,dst=172.18.0.2,ttl=64,tp_dst=6081,flags(df|key))),set(eth(src=5a:4f:1e:de:2c:8c,dst=aa:bb:cc:dd:ee:ff)),set(ipv4(ttl=63)),1
=============

Notice that the second dump shows no datapath flows, and that in the dump immediately after that, counters have been reset. When I match the timestamp to the timestamps in the Agent logs, it matches Agent initialization and the Cleaned up flow-restore-wait config log message (of course, this is not a very granular measurement).

I believe that this explains why the test is failing.

I looked at the ovs-vswitch code, and the observation seems consistent with the code:
https://github.com/openvswitch/ovs/blob/3833506db0de7a9c7e72b82323bc1c355d2c03b3/ofproto/ofproto-dpif.c#L374-L389

This code is called by the bridge_run function (via some other functions).

What do you think? I wanted to make sure that I am not missing something.

I am not sure what the best way to fix the test is based on this observation. If I change the test to tolerate one packet loss event, then I believe it will pass consistently.

tnqn · 2024-05-17T03:14:12Z

@tnqn I wanted your opinion on this. I found this old comment of yours (#625 (comment)):

Besides, I found another good reason of using "flow-restore-wait", previously datapath flows would be cleaned once ovs-vswitchd was started, so existing connections especially cross-node ones could still have some downtime before antrea-agent restores the flows.

This doesn't exactly match what I have observed while troubleshooting this test. I have observed that when we remove flow-restore-wait, datapath flows are flushed. If I'm monitoring continuously (dumping the flows every second), I can see the following:

I think the observation doesn't conflict with the previous comment, which was about preventing datapath flows from being flushed when ovs-vswitchd is started at which time no userspace flows is installed. My previous assumption is, after installing userspace flows, it should be graceful to flush datapath flows as new packet should consult the userspace flows and be forwarded in desired way.

To be more specific, there are several time points: stopping OVS, starting OVS, installing userspace flows, resetting flow-restore-wait (flushing datapath flows). We introduced "flow-restore-wait" to avoid disruption between starting OVS and installing userspace flows. However, I didn't expect the disruption after resetting flow-restore-wait. Maybe we could check why the packet didn't get forwarded based on userspace flows?

antoninbas · 2024-05-17T04:30:16Z

@tnqn You were right. The datapath flows being flushed is not what is causing the traffic interruption. After more investigation, it happens because by the time we remove the flow-restore-wait configuration, Pod flows have not been installed yet by the CNIServer reconciliation process. There is no synchronization between this reconciliation process and the removal of flow-restore-wait. The issue is exacerbated by #5777, because it makes reconciliation of Pod flows asynchronous and delays it until all "known" NetworkPolicies have been realized. Based on logs (I had to increase verbosity), there is a > 200ms delay between removal of flow-restore-wait and installation of Pod forwarding flows:

I0517 04:05:03.986256       1 server.go:751] "Starting reconciliation for CNI server"
# this is when the reconcile function returns:
I0517 04:05:04.005935       1 server.go:766] "Completed reconciliation for CNI server"
I0517 04:05:04.009654       1 agent.go:630] Cleaning up flow-restore-wait config
I0517 04:05:04.012567       1 agent.go:643] Cleaned up flow-restore-wait config
...
I0517 04:05:04.262696       1 pod_configuration.go:455] "Syncing Pod interface" Pod="default/antrea-toolbox-szqp7" iface="antrea-t-814b3b"

This matches the traffic interruption which I have observed, which is around 300ms on average:

# ping every 100ms
64 bytes from 10.10.0.7: icmp_seq=280 ttl=62 time=0.770 ms
64 bytes from 10.10.0.7: icmp_seq=281 ttl=62 time=0.390 ms
64 bytes from 10.10.0.7: icmp_seq=282 ttl=62 time=0.422 ms
64 bytes from 10.10.0.7: icmp_seq=283 ttl=62 time=0.251 ms
64 bytes from 10.10.0.7: icmp_seq=284 ttl=62 time=0.273 ms
# 400ms gap!
64 bytes from 10.10.0.7: icmp_seq=288 ttl=62 time=2.22 ms
64 bytes from 10.10.0.7: icmp_seq=289 ttl=62 time=0.663 ms
64 bytes from 10.10.0.7: icmp_seq=290 ttl=62 time=0.364 ms

Prior to #5777, the issue may not even have been observable / reproducible (except maybe with a large number of local Pods, if CNIServer reconciliation was taking a while?).

I will see if we can add some synchronization, so that we do avoid removing the flag until CNIServer "reconciliation" is complete.

Until a set of "essential" flows has been installed. At the moment, we include NetworkPolicy flows (using podNetworkWait as the signal), Pod forwarding flows (reconciled by the CNIServer), and Node routing flows (installed by the NodeRouteController). This set can be extended in the future if desired. We leverage the wrapper around sync.WaitGroup which was introduced previously in antrea-io#5777. It simplifies unit testing, and we can achieve some symmetry with podNetworkWait. We can also start leveraging this new wait group (flowRestoreCompleteWait) as the signal to delete flows from previous rounds. However, at the moment this is incomplete, as we don't wait for all controllers to signal that they have installed initial flows. Because the NodeRouteController does not have an initial "reconcile" operation (like the CNIServer) to install flows for the initial Node list, we instead rely on a different mechanims provided by upstream K8s for controllers. When registering event handlers, we can request for the ADD handler to include a boolean flag indicating whether the object is part of the initial list retrieved by the informer. Using this mechanism, we can reliably signal through flowRestoreCompleteWait when this initial list of Nodes has been synced at least once. Fixes antrea-io#6338 Signed-off-by: Antonin Bas <[email protected]>

Until a set of "essential" flows has been installed. At the moment, we include NetworkPolicy flows (using podNetworkWait as the signal), Pod forwarding flows (reconciled by the CNIServer), and Node routing flows (installed by the NodeRouteController). This set can be extended in the future if desired. We leverage the wrapper around sync.WaitGroup which was introduced previously in antrea-io#5777. It simplifies unit testing, and we can achieve some symmetry with podNetworkWait. We can also start leveraging this new wait group (flowRestoreCompleteWait) as the signal to delete flows from previous rounds. However, at the moment this is incomplete, as we don't wait for all controllers to signal that they have installed initial flows. Because the NodeRouteController does not have an initial "reconcile" operation (like the CNIServer) to install flows for the initial Node list, we instead rely on a different mechanims provided by upstream K8s for controllers. When registering event handlers, we can request for the ADD handler to include a boolean flag indicating whether the object is part of the initial list retrieved by the informer. Using this mechanism, we can reliably signal through flowRestoreCompleteWait when this initial list of Nodes has been synced at least once. This change is possible because of antrea-io#6361, which removed the dependency on the proxy (kube-proxy or AntreaProxy) to access the Antrea Controller. Prior to antrea-io#6361, there would have been a circular dependency in the case where kube-proxy was removed: flow-restore-wait will not be removed until the Pod network is "ready", which will not happen until the NetworkPolicy controller has started its watchers, and that depends on antrea Service reachability which depends on flow-restore-wait being removed. Fixes antrea-io#6338 Signed-off-by: Antonin Bas <[email protected]>

Until a set of "essential" flows has been installed. At the moment, we include NetworkPolicy flows (using podNetworkWait as the signal), Pod forwarding flows (reconciled by the CNIServer), and Node routing flows (installed by the NodeRouteController). This set can be extended in the future if desired. We leverage the wrapper around sync.WaitGroup which was introduced previously in #5777. It simplifies unit testing, and we can achieve some symmetry with podNetworkWait. We can also start leveraging this new wait group (flowRestoreCompleteWait) as the signal to delete flows from previous rounds. However, at the moment this is incomplete, as we don't wait for all controllers to signal that they have installed initial flows. Because the NodeRouteController does not have an initial "reconcile" operation (like the CNIServer) to install flows for the initial Node list, we instead rely on a different mechanims provided by upstream K8s for controllers. When registering event handlers, we can request for the ADD handler to include a boolean flag indicating whether the object is part of the initial list retrieved by the informer. Using this mechanism, we can reliably signal through flowRestoreCompleteWait when this initial list of Nodes has been synced at least once. This change is possible because of #6361, which removed the dependency on the proxy (kube-proxy or AntreaProxy) to access the Antrea Controller. Prior to #6361, there would have been a circular dependency in the case where kube-proxy was removed: flow-restore-wait will not be removed until the Pod network is "ready", which will not happen until the NetworkPolicy controller has started its watchers, and that depends on antrea Service reachability which depends on flow-restore-wait being removed. Fixes #6338 Signed-off-by: Antonin Bas <[email protected]>

antoninbas added area/test/e2e Issues or PRs related to Antrea specific end-to-end testing. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels May 16, 2024

antoninbas self-assigned this May 16, 2024

antoninbas added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 17, 2024

antoninbas added this to the Antrea v2.1 release milestone May 17, 2024

antoninbas mentioned this issue May 18, 2024

Delay removal of flow-restore-wait #6342

Merged

antoninbas closed this as completed in #6342 Jun 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Flaky e2e test] TestConnectivity/testOVSRestartSameNode #6338

[Flaky e2e test] TestConnectivity/testOVSRestartSameNode #6338

antoninbas commented May 16, 2024

antoninbas commented May 16, 2024

tnqn commented May 17, 2024

antoninbas commented May 17, 2024

[Flaky e2e test] TestConnectivity/testOVSRestartSameNode #6338

[Flaky e2e test] TestConnectivity/testOVSRestartSameNode #6338

Comments

antoninbas commented May 16, 2024

To reproduce

antoninbas commented May 16, 2024

tnqn commented May 17, 2024

antoninbas commented May 17, 2024