[Flaky e2e test] TestConnectivity/testOVSRestartSameNode #6338

Closed · antoninbas opened this issue May 16, 2024 · 3 comments · Fixed by #6342
Assignee: antoninbas
Labels: area/test/e2e, kind/bug, kind/failing-test, priority/important-soon

antoninbas (Contributor) commented:

While working on #6090, we realized that TestConnectivity/testOVSRestartSameNode was failing frequently.

Originally, we thought that the failure was caused by a change in the PR, but I was able to reproduce the failure on the main branch (at the time), using the normal Antrea images (i.e., not coverage-enabled).
The following can be observed in the test logs when the test fails:

=== RUN   TestConnectivity
2024/05/16 14:55:24 Waiting for all Antrea DaemonSet Pods
2024/05/16 14:55:25 Checking CoreDNS deployment
    fixtures.go:280: Creating 'testconnectivity-rx7x4lf3' K8s Namespace
=== RUN   TestConnectivity/testOVSRestartSameNode
    connectivity_test.go:338: Creating two toolbox test Pods on 'kind-worker'
    fixtures.go:579: Creating a test Pod 'test-pod-dv06q1uw' and waiting for IP
    fixtures.go:579: Creating a test Pod 'test-pod-frgii0i8' and waiting for IP
    connectivity_test.go:379: Restarting antrea-agent on Node 'kind-worker'
    connectivity_test.go:360: Arping loss rate: 4.000000%
    connectivity_test.go:367: ARPING 10.10.1.14
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=0 time=3.972 msec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=1 time=539.269 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=2 time=312.156 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=3 time=198.099 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=4 time=292.146 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=5 time=420.210 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=6 time=251.125 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=7 time=393.197 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=8 time=410.205 usec
        Timeout
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=9 time=3.484 msec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=10 time=278.139 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=11 time=639.319 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=12 time=188.094 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=13 time=87.044 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=14 time=352.176 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=15 time=95.047 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=16 time=63.032 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=17 time=373.186 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=18 time=333.167 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=19 time=99.049 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=20 time=214.107 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=21 time=298.149 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=22 time=699.350 usec
        42 bytes from 8e:9e:2f:e6:ea:4a (10.10.1.14): index=23 time=271.135 usec

        --- 10.10.1.14 statistics ---
        25 packets transmitted, 24 packets received,   4%!u(MISSING)nanswered (0 extra)
        rtt min/avg/max/std-dev = 0.063/0.594/3.972/0.960 ms
    connectivity_test.go:385: Arping test failed: arping loss rate is 4.000000%
    fixtures.go:531: Deleting Pod 'test-pod-frgii0i8'
    fixtures.go:531: Deleting Pod 'test-pod-dv06q1uw'

As a result, and to avoid blocking the PR, the test case has been temporarily disabled:

t.Run("testOVSRestartSameNode", func(t *testing.T) {
skipIfNotIPv4Cluster(t)
skipIfHasWindowsNodes(t)
t.Skip("Skipping test for now as it fails consistently")
testOVSRestartSameNode(t, data, data.testNamespace)
})

The failure now needs to be investigated so that the test case can be restored.

The failure above seems to indicate that the OVS restart indeed leads to a disruption of the datapath. I believe that the start/stop delay in the old coverage-enabled images was somehow hiding this issue (and of course we only run e2e tests with coverage enabled...). See:

args: ["-c", "sleep 2; antrea-agent-coverage -test.run=TestBincoverRunMain -test.coverprofile=antrea-agent.cov.out -args-file=/agent-arg-file; while true; do sleep 5 & wait $!; done"]

To reproduce

  1. Create a Kind cluster: kind create cluster --config ci/kind/config-2nodes.yml
  2. Install the latest Antrea: kubectl apply -f build/yamls/antrea.yml
  3. Run the test: go test -timeout=10m -v -count=10 -run=TestConnectivity/testOVSRestartSameNode antrea.io/antrea/test/e2e -provider=kind -coverage=false -deploy-antrea=false. This command will run the test case 10 times, and I have observed a failure rate of around 25%.
antoninbas added the area/test/e2e and kind/failing-test labels May 16, 2024
antoninbas self-assigned this May 16, 2024
antoninbas (Contributor, Author) commented:

@tnqn I wanted your opinion on this. I found this old comment of yours (#625 (comment)):

Besides, I found another good reason for using "flow-restore-wait": previously, datapath flows would be cleaned once ovs-vswitchd was started, so existing connections, especially cross-node ones, could still have some downtime before antrea-agent restored the flows.

This doesn't exactly match what I have observed while troubleshooting this test. I have observed that when we remove flow-restore-wait, datapath flows are flushed. If I'm monitoring continuously (dumping the flows every second), I can see the following:

=============
Thu May 16 23:33:50 UTC 2024
recirc_id(0),in_port(3),eth(src=5a:b8:c9:64:ce:a9,dst=5a:4f:1e:de:2c:8c),eth_type(0x0806),arp(sip=10.10.1.44,tip=10.10.1.1,op=1/0xff,sha=5a:b8:c9:64:ce:a9), packets:0, bytes:0, used:never, actions:2
recirc_id(0),tunnel(tun_id=0x0,src=172.18.0.2,dst=172.18.0.3,flags(-df-csum+key)),in_port(1),eth(),eth_type(0x0800),ipv4(dst=10.10.1.32/255.255.255.224,frag=no), packets:17, bytes:1666, used:0.948s, actions:ct(zone=65520,nat),recirc(0x2)
recirc_id(0x2),tunnel(tun_id=0x0,src=172.18.0.2,dst=172.18.0.3,flags(-df-csum+key)),in_port(1),skb_mark(0/0x80000000),ct_state(-new+est+rpl-inv+trk-snat),ct_mark(0x3/0x7f),eth(src=56:6f:1f:69:e6:87,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(dst=10.10.1.44,proto=1,ttl=63,frag=no), packets:17, bytes:1666, used:0.948s, actions:set(eth(src=5a:4f:1e:de:2c:8c,dst=5a:b8:c9:64:ce:a9)),set(ipv4(ttl=62)),3
recirc_id(0),in_port(2),eth(src=5a:4f:1e:de:2c:8c,dst=5a:b8:c9:64:ce:a9),eth_type(0x0806),arp(sip=10.10.1.1,tip=10.10.1.44,op=2/0xff,sha=5a:4f:1e:de:2c:8c), packets:0, bytes:0, used:never, actions:3
recirc_id(0x1),in_port(3),skb_mark(0/0x80000000),ct_state(-new+est-rpl-inv+trk-snat),ct_mark(0x3/0x7f),eth(src=5a:b8:c9:64:ce:a9,dst=5a:4f:1e:de:2c:8c),eth_type(0x0800),ipv4(dst=10.10.0.0/255.255.255.0,proto=1,tos=0/0x3,ttl=64,frag=no), packets:17, bytes:1666, used:0.948s, actions:set(tunnel(tun_id=0x0,dst=172.18.0.2,ttl=64,tp_dst=6081,flags(df|key))),set(eth(src=5a:4f:1e:de:2c:8c,dst=aa:bb:cc:dd:ee:ff)),set(ipv4(ttl=63)),1
recirc_id(0),in_port(3),eth(src=5a:b8:c9:64:ce:a9),eth_type(0x0800),ipv4(src=10.10.1.44,dst=10.10.0.0/255.255.255.0,frag=no), packets:17, bytes:1666, used:0.948s, actions:ct(zone=65520,nat),recirc(0x1)
=============
Thu May 16 23:33:51 UTC 2024
=============
Thu May 16 23:33:52 UTC 2024
recirc_id(0),in_port(3),eth(src=5a:b8:c9:64:ce:a9),eth_type(0x0800),ipv4(src=10.10.1.44,dst=10.10.0.0/255.255.255.0,frag=no), packets:0, bytes:0, used:never, actions:ct(zone=65520,nat),recirc(0x1)
recirc_id(0),tunnel(tun_id=0x0,src=172.18.0.2,dst=172.18.0.3,flags(-df-csum+key)),in_port(1),eth(),eth_type(0x0800),ipv4(dst=10.10.1.32/255.255.255.224,frag=no), packets:0, bytes:0, used:never, actions:ct(zone=65520,nat),recirc(0x2)
recirc_id(0x2),tunnel(tun_id=0x0,src=172.18.0.2,dst=172.18.0.3,flags(-df-csum+key)),in_port(1),skb_mark(0/0x80000000),ct_state(-new+est+rpl-inv+trk-snat),ct_mark(0x3/0x7f),eth(src=56:6f:1f:69:e6:87,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(dst=10.10.1.44,proto=1,ttl=63,frag=no), packets:0, bytes:0, used:never, actions:set(eth(src=5a:4f:1e:de:2c:8c,dst=5a:b8:c9:64:ce:a9)),set(ipv4(ttl=62)),3
recirc_id(0x1),in_port(3),skb_mark(0/0x80000000),ct_state(-new+est-rpl-inv+trk-snat),ct_mark(0x3/0x7f),eth(src=5a:b8:c9:64:ce:a9,dst=5a:4f:1e:de:2c:8c),eth_type(0x0800),ipv4(dst=10.10.0.0/255.255.255.0,proto=1,tos=0/0x3,ttl=64,frag=no), packets:0, bytes:0, used:never, actions:set(tunnel(tun_id=0x0,dst=172.18.0.2,ttl=64,tp_dst=6081,flags(df|key))),set(eth(src=5a:4f:1e:de:2c:8c,dst=aa:bb:cc:dd:ee:ff)),set(ipv4(ttl=63)),1
=============

Notice that the second dump shows no datapath flows, and that in the dump immediately after it, the counters have been reset. When I match these timestamps against the Agent logs, they line up with Agent initialization and the "Cleaned up flow-restore-wait config" log message (of course, this is not a very granular measurement).

I believe that this explains why the test is failing.

I looked at the ovs-vswitchd code, and the observation seems consistent with it:
https://github.com/openvswitch/ovs/blob/3833506db0de7a9c7e72b82323bc1c355d2c03b3/ofproto/ofproto-dpif.c#L374-L389

This code is called by the bridge_run function (via some other functions).

What do you think? I wanted to make sure that I am not missing something.

Based on this observation, I am not sure what the best way to fix the test is. If I change the test to tolerate one lost packet, I believe it will pass consistently.
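
For illustration, a minimal sketch of such a tolerance check (the helper name and signature are hypothetical, not the actual Antrea test code):

package e2e

import "testing"

// checkArpingLoss is a hypothetical helper: instead of requiring a 0% loss
// rate, it tolerates a single unanswered probe, which is what the brief
// datapath disruption during the OVS restart appears to cause.
func checkArpingLoss(t *testing.T, sent, received int) {
    t.Helper()
    const maxLost = 1
    if lost := sent - received; lost > maxLost {
        lossRate := 100 * float64(lost) / float64(sent)
        t.Errorf("Arping loss rate is %f%% (%d/%d probes lost), tolerated at most %d", lossRate, lost, sent, maxLost)
    }
}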

tnqn (Member) commented May 17, 2024:

@tnqn I wanted your opinion on this. I found this old comment of yours (#625 (comment)):

Besides, I found another good reason for using "flow-restore-wait": previously, datapath flows would be cleaned once ovs-vswitchd was started, so existing connections, especially cross-node ones, could still have some downtime before antrea-agent restored the flows.

This doesn't exactly match what I have observed while troubleshooting this test. I have observed that when we remove flow-restore-wait, datapath flows are flushed. If I'm monitoring continuously (dumping the flows every second), I can see the following:

I think the observation doesn't conflict with the previous comment, which was about preventing datapath flows from being flushed when ovs-vswitchd is started, at which time no userspace flows are installed yet. My previous assumption was that, after the userspace flows are installed, it should be graceful to flush the datapath flows, as new packets should consult the userspace flows and be forwarded in the desired way.

To be more specific, there are several points in time: stopping OVS, starting OVS, installing userspace flows, and resetting flow-restore-wait (which flushes the datapath flows). We introduced "flow-restore-wait" to avoid disruption between starting OVS and installing the userspace flows. However, I didn't expect a disruption after resetting flow-restore-wait. Maybe we could check why the packets were not forwarded based on the userspace flows?

antoninbas (Contributor, Author) commented:

@tnqn You were right. The flushing of the datapath flows is not what is causing the traffic interruption. After more investigation, it happens because, by the time we remove the flow-restore-wait configuration, Pod flows have not yet been installed by the CNIServer reconciliation process. There is no synchronization between this reconciliation process and the removal of flow-restore-wait. The issue is exacerbated by #5777, because it makes reconciliation of Pod flows asynchronous and delays it until all "known" NetworkPolicies have been realized. Based on logs (I had to increase verbosity), there is a > 200ms delay between the removal of flow-restore-wait and the installation of Pod forwarding flows:

I0517 04:05:03.986256       1 server.go:751] "Starting reconciliation for CNI server"
# this is when the reconcile function returns:
I0517 04:05:04.005935       1 server.go:766] "Completed reconciliation for CNI server"
I0517 04:05:04.009654       1 agent.go:630] Cleaning up flow-restore-wait config
I0517 04:05:04.012567       1 agent.go:643] Cleaned up flow-restore-wait config
...
I0517 04:05:04.262696       1 pod_configuration.go:455] "Syncing Pod interface" Pod="default/antrea-toolbox-szqp7" iface="antrea-t-814b3b"

This matches the traffic interruption I have observed, which is around 300ms on average:

# ping every 100ms
64 bytes from 10.10.0.7: icmp_seq=280 ttl=62 time=0.770 ms
64 bytes from 10.10.0.7: icmp_seq=281 ttl=62 time=0.390 ms
64 bytes from 10.10.0.7: icmp_seq=282 ttl=62 time=0.422 ms
64 bytes from 10.10.0.7: icmp_seq=283 ttl=62 time=0.251 ms
64 bytes from 10.10.0.7: icmp_seq=284 ttl=62 time=0.273 ms
# 400ms gap!
64 bytes from 10.10.0.7: icmp_seq=288 ttl=62 time=2.22 ms
64 bytes from 10.10.0.7: icmp_seq=289 ttl=62 time=0.663 ms
64 bytes from 10.10.0.7: icmp_seq=290 ttl=62 time=0.364 ms

Prior to #5777, the issue may not even have been observable / reproducible (except maybe with a large number of local Pods, if CNIServer reconciliation was taking a while?).

I will see if we can add some synchronization, so that we avoid removing the flag until CNIServer reconciliation is complete.
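
As a rough sketch of the kind of synchronization I have in mind (using a plain sync.WaitGroup here for simplicity; the actual change would likely reuse the wait-group wrapper introduced in #5777, and all names below are illustrative, not Antrea's real identifiers):

package agent

import "sync"

// Illustrative only: do not clear flow-restore-wait until the CNI server's
// initial reconciliation has installed Pod forwarding flows.
type initializer struct {
    flowRestoreCompleteWait sync.WaitGroup
}

func (i *initializer) run() {
    i.flowRestoreCompleteWait.Add(1)
    go func() {
        defer i.flowRestoreCompleteWait.Done()
        i.reconcileCNIServer() // install Pod forwarding flows for existing Pods
    }()

    // Clearing flow-restore-wait causes ovs-vswitchd to flush the old datapath
    // flows, so wait until the initial userspace flows are in place.
    i.flowRestoreCompleteWait.Wait()
    i.removeFlowRestoreWait()
}

func (i *initializer) reconcileCNIServer()    {} // placeholder
func (i *initializer) removeFlowRestoreWait() {} // placeholder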

antoninbas added the kind/bug and priority/important-soon labels May 17, 2024
antoninbas added this to the Antrea v2.1 release milestone May 17, 2024
antoninbas added a commit that referenced this issue Jun 3, 2024
Until a set of "essential" flows has been installed. At the moment, we
include NetworkPolicy flows (using podNetworkWait as the signal), Pod
forwarding flows (reconciled by the CNIServer), and Node routing flows
(installed by the NodeRouteController). This set can be extended in the
future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced
previously in #5777. It simplifies unit testing, and we can achieve some
symmetry with podNetworkWait.

We can also start leveraging this new wait group
(flowRestoreCompleteWait) as the signal to delete flows from previous
rounds. However, at the moment this is incomplete, as we don't wait for
all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile"
operation (like the CNIServer) to install flows for the initial Node
list, we instead rely on a different mechanism provided by upstream K8s
for controllers. When registering event handlers, we can request for the
ADD handler to include a boolean flag indicating whether the object is
part of the initial list retrieved by the informer. Using this
mechanism, we can reliably signal through flowRestoreCompleteWait when
this initial list of Nodes has been synced at least once.

This change is possible because of #6361, which removed the dependency
on the proxy (kube-proxy or AntreaProxy) to access the Antrea
Controller. Prior to #6361, there would have been a circular dependency
in the case where kube-proxy was removed: flow-restore-wait will not be
removed until the Pod network is "ready", which will not happen until
the NetworkPolicy controller has started its watchers, and that depends
on antrea Service reachability which depends on flow-restore-wait being
removed.

Fixes #6338

Signed-off-by: Antonin Bas <[email protected]>
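
For reference, the informer mechanism mentioned in the commit message above is client-go's detailed ADD handler, which passes an isInInitialList flag (available since client-go v1.27). A rough sketch of how it can be used to release a wait group once the initial Node list has been processed; names such as installNodeFlows, enqueueNode, and flowRestoreCompleteWait are illustrative, not Antrea's exact code:

package noderoute

import (
    "sync"

    corev1 "k8s.io/api/core/v1"
    coreinformers "k8s.io/client-go/informers/core/v1"
    "k8s.io/client-go/tools/cache"
)

// registerNodeHandler installs routing flows for each Node and releases
// flowRestoreCompleteWait once the informer's initial Node list has been
// delivered to the handler.
func registerNodeHandler(nodes coreinformers.NodeInformer, flowRestoreCompleteWait *sync.WaitGroup) error {
    flowRestoreCompleteWait.Add(1)
    reg, err := nodes.Informer().AddEventHandler(cache.ResourceEventHandlerDetailedFuncs{
        AddFunc: func(obj interface{}, isInInitialList bool) {
            node := obj.(*corev1.Node)
            if isInInitialList {
                // Node from the initial list: install its flows synchronously,
                // so that reg.HasSynced() implies these flows exist.
                installNodeFlows(node)
                return
            }
            // Nodes added later can go through the regular work queue.
            enqueueNode(node)
        },
    })
    if err != nil {
        return err
    }
    go func() {
        // reg.HasSynced() only returns true once the handler has been called
        // for every object in the informer's initial list.
        cache.WaitForCacheSync(make(chan struct{}), reg.HasSynced)
        flowRestoreCompleteWait.Done()
    }()
    return nil
}

func installNodeFlows(node *corev1.Node) {} // placeholder
func enqueueNode(node *corev1.Node)      {} // placeholder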