[VM Agent] Bugfix: antrea-agent failed to delete ExternalNode #5191

Merged
tnqn merged 1 commit into antrea-io:main from issue_5111 on Jul 10, 2023

Contributor

@wenyingd wenyingd commented Jul 3, 2023

The issue was seen on a RHEL 8.4 VM on Azure, which uses dhclient to manage the network interface. The root cause is that antrea-agent fails to move the IPs/routes from the host internal interface back to the uplink after the ExternalNode is deleted, because the IPs/routes it adds are deleted by, or conflict with, the dhclient configuration. On each subsequent retry of the ExternalNode deletion, antrea-agent then blocks at the precheck for the existence of the host internal interface, which always returns true because the uplink's name has already been recovered.

Fix: #5111
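
The failure sequence described above can be modelled with a small, self-contained simulation. This is only a sketch under stated assumptions, not the agent's actual code: `fakeHost`, `removeInterface`, `newHost`, and the `withGuard` flag are hypothetical names, the `~` suffix for the bridged uplink is assumed for illustration, and only `hostInterfaceExists` and `uplinkIfName` are taken from the diff in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeHost is a hypothetical in-memory stand-in for the VM's network stack.
type fakeHost struct {
	ifaces       map[string]bool // interface names present on the host
	ovsPorts     map[string]bool // internal ports still attached to OVS
	dhclientRace bool            // when true, moving IPs/routes back fails once
}

func (h *fakeHost) hostInterfaceExists(name string) bool { return h.ifaces[name] }

var errStillExists = errors.New("host internal interface still exists")

// removeInterface models one attempt at cleaning up after ExternalNode deletion.
// withGuard toggles the early return that this PR introduces.
func removeInterface(h *fakeHost, ifName string, withGuard bool) error {
	uplinkIfName := ifName + "~" // assumed name of the uplink while bridged
	if withGuard && !h.hostInterfaceExists(uplinkIfName) {
		// An earlier attempt already renamed the uplink back: nothing to recover.
		return nil
	}
	// Step 1: remove the host internal port from OVS (a no-op on retry).
	if h.ovsPorts[ifName] {
		delete(h.ovsPorts, ifName)
		delete(h.ifaces, ifName)
	}
	// Step 2: precheck that the host internal interface is gone.
	if h.hostInterfaceExists(ifName) {
		return errStillExists
	}
	// Step 3: rename the uplink back to its original name.
	delete(h.ifaces, uplinkIfName)
	h.ifaces[ifName] = true
	// Step 4: move IPs/routes back; fails once if dhclient interferes.
	if h.dhclientRace {
		h.dhclientRace = false
		return errors.New("failed to move IPs/routes back to the uplink")
	}
	return nil
}

// newHost returns a host with a bridged uplink and a pending dhclient conflict.
func newHost() *fakeHost {
	return &fakeHost{
		ifaces:       map[string]bool{"eth0": true, "eth0~": true},
		ovsPorts:     map[string]bool{"eth0": true},
		dhclientRace: true,
	}
}

func main() {
	for _, withGuard := range []bool{false, true} {
		h := newHost()
		first := removeInterface(h, "eth0", withGuard)
		retry := removeInterface(h, "eth0", withGuard)
		fmt.Printf("guard=%v first=%v retry=%v\n", withGuard, first, retry)
	}
}
```

In this model the first attempt fails in both runs. Without the guard, every retry stops at the precheck because `eth0` now refers to the renamed uplink; with the guard, the retry returns immediately.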

Contributor Author

wenyingd commented Jul 3, 2023

/test-vm-e2e

// name is recovered. So the ips and routes in "adapterConfig" are actually read from the uplink and no need to
// move the configurations back. The issue was seen on VM with RHEL 8.4 on azure cloud.
if !hostInterfaceExists(uplinkIfName) {
	klog.InfoS("Uplink is not existing on the host, return")
Contributor

nit: Add the uplinkIfName in the log message?

Contributor Author

updated.

Contributor

@Anandkumar26 Anandkumar26 left a comment

/LGTM

Contributor Author

wenyingd commented Jul 4, 2023

/test-vm-e2e

// try after the error is returned, at this time the host internal interface is already deleted, and the uplink's
// name is recovered. So the ips and routes in "adapterConfig" are actually read from the uplink and no need to
// move the configurations back. The issue was seen on VM with RHEL 8.4 on azure cloud.
if !hostInterfaceExists(uplinkIfName) {
Member

Should this run at the beginning of the function? If the check passes, running the other code is pointless.

Contributor Author

updated.

// name is recovered. So the ips and routes in "adapterConfig" are actually read from the uplink and no need to
// move the configurations back. The issue was seen on VM with RHEL 8.4 on azure cloud.
if !hostInterfaceExists(uplinkIfName) {
	klog.InfoS("Uplink is not existing on the host, return", "uplinkIfName", uplinkIfName)
Member

Suggested change
- klog.InfoS("Uplink is not existing on the host, return", "uplinkIfName", uplinkIfName)
+ klog.InfoS("The interface with uplink name did not exist on the host, skipping its recovery", "uplinkIfName", uplinkIfName)

to avoid ambiguity (it's not that the uplink doesn't exist) and to fix the grammar.
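
The review thread settles on a message that states what was skipped, with the interface name attached as a key/value pair rather than formatted into the text. A minimal sketch of that logging style follows. It uses Go's standard `log/slog` package as a stand-in for klog so that it runs without extra dependencies; `logSkippedRecovery` and `hasKeyValue` are hypothetical helpers, and `eth0~` is an assumed interface name.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// logSkippedRecovery emits one structured record and returns its rendered text.
// The message is fixed; the variable part travels as a key/value pair.
func logSkippedRecovery(uplinkIfName string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewTextHandler(&buf, nil))
	logger.Info("The interface with uplink name did not exist on the host, skipping its recovery",
		"uplinkIfName", uplinkIfName)
	return buf.String()
}

// hasKeyValue reports whether the rendered record carries the given key and value.
func hasKeyValue(record, key, value string) bool {
	return strings.Contains(record, key+"=") && strings.Contains(record, value)
}

func main() {
	record := logSkippedRecovery("eth0~")
	fmt.Print(record)
	fmt.Println("has uplinkIfName:", hasKeyValue(record, "uplinkIfName", "eth0~"))
}
```

Keeping the name in a separate field means the message text stays constant across hosts, which makes the log line easier to search for and to filter by interface.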

Contributor Author

updated.

The issue was seen on a RHEL 8.4 VM on Azure, which uses dhclient to manage
the network interface. The root cause is that antrea-agent fails to move the
IPs/routes from the host internal interface back to the uplink after the
ExternalNode is deleted, because the IPs/routes it adds are deleted by, or
conflict with, the dhclient configuration. On each subsequent retry of the
ExternalNode deletion, antrea-agent then blocks at the precheck for the
existence of the host internal interface, which always returns true because
the uplink's name has already been recovered.

Signed-off-by: wenyingd <[email protected]>
Contributor Author

wenyingd commented Jul 4, 2023

/test-vm-e2e

Member

@tnqn tnqn left a comment

LGTM

Contributor

ceclinux commented Jul 4, 2023

/test-vm-e2e

// This is for issue #5111 (https://github.com/antrea-io/antrea/issues/5111), which may happen if an error occurs
// when moving the configuration back from host internal interface to uplink. This logic is run in the second
// try after the error is returned, at this time the host internal interface is already deleted, and the uplink's
// name is recovered. So the ips and routes in "adapterConfig" are actually read from the uplink and no need to
Contributor

I may be missing something here. If the error occurs, the uplink's name is recovered, but the IPs and routes can still be wrong, right? Is it the case that our current logic cannot handle this and correct it later?

Contributor Author

The case you mention cannot be fixed in this change, and it can also break the agent without this change.

The motivation for this change is issue #5111, in which configuring the IP addresses/routes failed because dhclient was modifying the interface at the same time. antrea-agent may then fail on the first try and, on retry, still block at the check for the existence of the host internal interface (Antrea expects the interface not to exist because it was removed from OVS and the uplink has not been renamed yet), when in fact the uplink has already been renamed. As for the IP addresses/routes, they are actually configured back by dhclient.

Contributor

Thanks for the background details. It is clear to me now.

Member

tnqn commented Jul 10, 2023

/skip-conformance
/skip-networkpolicy

@tnqn tnqn merged commit 9ffb0a2 into antrea-io:main Jul 10, 2023
43 checks passed
@wenyingd wenyingd deleted the issue_5111 branch April 3, 2024 03:29
Successfully merging this pull request may close these issues.

On Azure rhel8.4 VM antrea-agent goes into a state where it cannot manage ExternalNode.
5 participants