Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: Pod or gateway use a different MAC address #4428

Merged
merged 1 commit into from
Dec 6, 2022

Conversation

wenyingd
Copy link
Contributor

@wenyingd wenyingd commented Dec 1, 2022

This issue is found on Ubuntu 22.04: the network interface for antrea-gw0
is different from the one we used in OpenFlow rules.

The reason is systemd-udev has modified the interface's MAC after it watches
a new one is created. So if Antrea Agent reads interface's information before
systemd-udev's modification, Antrea Agent would uses an incorrect value to
install OpenFlow rules.

To resolve the issue,

  1. Agent generates a static MAC for antrea-gw0 or the interface used by Pod
  2. Agent uses the generated MAC to create OVS internal port or veth pair

To implement the logic, some code is copied from containernetworking/plugins/ip/link_linux
latest versions to path thirdparty, this is to avoid unexpected issues introduced
when bumping up the dependent libraries.

Fix #4426

Signed-off-by: wenyingd [email protected]

@codecov
Copy link

codecov bot commented Dec 1, 2022

Codecov Report

Merging #4428 (cd1cb3d) into main (1bbfc93) will increase coverage by 0.05%.
The diff coverage is 39.20%.

❗ Current head cd1cb3d differs from pull request most recent head 1322dab. Consider uploading reports for the commit 1322dab to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4428      +/-   ##
==========================================
+ Coverage   65.35%   65.40%   +0.05%     
==========================================
  Files         400      400              
  Lines       56849    57513     +664     
==========================================
+ Hits        37152    37619     +467     
- Misses      17005    17187     +182     
- Partials     2692     2707      +15     
Flag Coverage Δ *Carryforward flag
e2e-tests 37.85% <30.43%> (?)
integration-tests 34.63% <45.83%> (ø) Carriedforward from 1bbfc93
kind-e2e-tests 47.34% <12.98%> (ø) Carriedforward from 1bbfc93
unit-tests 49.78% <26.66%> (ø) Carriedforward from 1bbfc93

*This pull request uses carry forward flags. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/agent_windows.go 0.60% <0.00%> (ø)
pkg/controller/networkpolicy/validate.go 41.89% <15.00%> (-8.57%) ⬇️
...ter/controllers/multicluster/gateway_controller.go 69.85% <28.57%> (ø)
pkg/agent/agent.go 65.68% <33.33%> (+12.40%) ⬆️
pkg/ovs/ovsconfig/ovs_client.go 68.49% <43.75%> (+3.21%) ⬆️
...kg/controller/networkpolicy/antreanetworkpolicy.go 89.14% <50.00%> (-1.77%) ⬇️
...g/agent/cniserver/interface_configuration_linux.go 27.97% <100.00%> (+0.54%) ⬆️
pkg/agent/interfacestore/types.go 100.00% <100.00%> (ø)
pkg/agent/util/net.go 56.39% <100.00%> (+1.25%) ⬆️
...g/controller/networkpolicy/clusternetworkpolicy.go 71.36% <100.00%> (-2.78%) ⬇️
... and 17 more

@wenyingd wenyingd force-pushed the preassignedmac branch 2 times, most recently from 1e5b931 to 9928a12 Compare December 1, 2022 08:11
@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 1, 2022

/test-all
/test-windows-all
/test-flexible-ipam-e2e

@wenyingd wenyingd requested a review from tnqn December 1, 2022 09:57
@wenyingd wenyingd changed the title Bugfix: Pod or gateway use a different MAC from the expectation Bugfix: Pod or gateway use a different MAC address Dec 1, 2022
@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 1, 2022

/test-windows-e2e
/test-flexible-ipam-e2e

Comment on lines 628 to 649
func mockUtilFunctions() {
setLinkUp = func(name string) (net.HardwareAddr, int, error) {
return util.GenerateRandomMAC(true), 10, nil
}
configureLinkAddresses = func(idx int, ipNets []*net.IPNet) error {
return nil
}
}

func restoreUtilFunctions() {
setLinkUp = util.SetLinkUp
configureLinkAddresses = util.ConfigureLinkAddresses
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It‘s a little obscure to know what the functions do when calling them in tests. To be clear and extensiable, please consider the following code:

Suggested change
func mockUtilFunctions() {
setLinkUp = func(name string) (net.HardwareAddr, int, error) {
return util.GenerateRandomMAC(true), 10, nil
}
configureLinkAddresses = func(idx int, ipNets []*net.IPNet) error {
return nil
}
}
func restoreUtilFunctions() {
setLinkUp = util.SetLinkUp
configureLinkAddresses = util.ConfigureLinkAddresses
}
func mockSetLinkUp(returnIndex int, returnErr error) func() {
originalSetLinkUp := setLinkUp
setLinkUp = func(name string) (int, error) {
return returnIndex, returnErr
}
return func() {
setLinkUp = originalSetLinkUp
}
}
func mockConfigureLinkAddresses(returnErr error) func() {
originalConfigureLinkAddresses := configureLinkAddresses
configureLinkAddresses = func(idx int, ipNets []*net.IPNet) error {
return returnErr
}
return func() {
configureLinkAddresses = originalConfigureLinkAddresses
}
}
// linux.go
func mockSetInterfaceMTU(returnErr error) func() {
return func() {}
}
// windows.go
func mockSetInterfaceMTU(returnErr error) func() {
originalSetInterfaceMTU := setInterfaceMTU
setInterfaceMTU = func(ifaceName string, mtu int) error {
return returnErr
}
return func() {
setInterfaceMTU = originalSetInterfaceMTU
}
}
func TestXXX() {
defer mockSetLinkUp(10, nil)()
defer mockConfigureLinkAddresses(nil)()
defer mockSetInterfaceMTU(nil)()
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated accordingly.

pkg/agent/agent_windows.go Outdated Show resolved Hide resolved
@@ -401,13 +401,20 @@ func GenerateUplinkInterfaceName(name string) string {
return name + bridgedUplinkSuffix
}

func GenerateRandomMAC() net.HardwareAddr {
func GenerateRandomMAC(global bool) net.HardwareAddr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All virtual devices should set local assignment bit:

  1. All MAC addresses including veth, OVS internal port, OVS datapath ID set it;
  2. linux kernel implementation does the same for all virtual devices: https://elixir.bootlin.com/linux/v4.15.18/source/include/linux/etherdevice.h#L227

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, I hit some failures in the tests when creating network interfaces with a pre-allocated MAC if I only set the local bit, and I thought it was related with the local bit, so I add this global parameter. Since Linux kernel uses this logic, I would update accordingly.

pkg/agent/util/net.go Outdated Show resolved Hide resolved
pkg/ovs/ovsconfig/ovs_client.go Outdated Show resolved Hide resolved
var gwLinkIdx int
var err error
// Host link might not be queried at once after creating OVS internal port; retry max 5 times with 1s
// delay each time to ensure the link is ready.
for retry := 0; retry < maxRetryForHostLink; retry++ {
gwMAC, gwLinkIdx, err = util.SetLinkUp(i.hostGateway)
_, gwLinkIdx, err = setLinkUp(i.hostGateway)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first argument could be removed as it's no longer used by any caller

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the first return value in function setLinkUp

@wenyingd wenyingd force-pushed the preassignedmac branch 3 times, most recently from 32d2270 to 4456386 Compare December 2, 2022 04:10
@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 2, 2022

@tnqn The patch coverage is lower than the requirement after a change in function net.SetLinkUp. Since the function is about calling a third-party library functions, and no if-else branches in the logic, shall we add unit test cases in a future PR?

tnqn
tnqn previously approved these changes Dec 2, 2022
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Dec 2, 2022

@tnqn The patch coverage is lower than the requirement after a change in function net.SetLinkUp. Since the function is about calling a third-party library functions, and no if-else branches in the logic, shall we add unit test cases in a future PR?

It's fine to ignore it.

@tnqn
Copy link
Member

tnqn commented Dec 2, 2022

/test-all

@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 2, 2022

/test-windows-all

@XinShuYang
Copy link
Contributor

/test-windows-e2e

@XinShuYang
Copy link
Contributor

/test-windows-e2e

@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 2, 2022

@tnqn, I just found that "mac_in_use" on Windows OVS is always "00:00:00:00:00:00", so this change may introduce a failure on Windows. So we still need to read the MAC from network interface on Windows.

@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 2, 2022

In the latest change, I revert Windows implementation with the original logic, and use the pre-allocated MAC only in Linux.

@tnqn Would you review it again?

@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 2, 2022

/test-all
/test-windows-all

@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 2, 2022

/test-e2e
/test-networkpolicy

@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 5, 2022

/test-e2e

@tnqn tnqn added this to the Antrea v1.10 release milestone Dec 5, 2022
@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 5, 2022

Both Windows and Linux tests are passed after the latest change.


i.nodeConfig.GatewayConfig = &config.GatewayConfig{Name: i.hostGateway, MAC: gwMAC, OFPort: uint32(gatewayIface.OFPort)}
gatewayIface.MAC = gwMAC
i.configureGatewayMAC(gatewayIface, gwMAC)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function also assigns gatewayConfig to i.nodeConfig which cannot be told from the function name.

Besides, I wonder if it could happen on linux that the ovs db doesn't have mac_in_use set when antrea-agent reads it if the field is loaded asynchonously. And to make newly created antrea-gw0 and existing antrea-gw0 consistent, I feel we could unify the code across OSes as following:

  1. Read "mac" from ovsdb, instead of "mac_in_use"
  2. Always assign "mac" from gatewayIface to "i.nodeConfig.GatewayConfig" regardless of platforms
  3. For existing antrea-gw0, "mac" in gatewayIface would be empty, use mac got from "setLinkUp" to update gatewayIface and update ovsdb to reflect it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Windows, the MAC field configured in Interface is not the actual value used in network adapter. So we can not use OVSDB as the single source for both Linux and Windows.

@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 5, 2022

/test-all
/test-windows-all

func (i *Initializer) configureGatewayMAC(gatewayIface *interfacestore.InterfaceConfig, gwMAC net.HardwareAddr) {
// Use the pre-assigned MAC in GatewayConfig to ensure the MAC address used in OpenFlow rules is consistent
// with the value configured in the network interface.
i.nodeConfig.GatewayConfig.MAC = gatewayIface.MAC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could the situation I described in https://github.com/antrea-io/antrea/pull/4428/files#r1039171906 happen?

the ovs db doesn't have mac_in_use set when antrea-agent reads it if the field is loaded asynchonously

If we are not sure or it's not guaranteed mac_in_use must already be set, should it use gwMAC if gatewayIface.MAC is empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your concern is in the restart/upgrade case that antera-gw0 is already created, and Agent tries to load GatewayInterfaceConfig from OVSDB. According to my observation, OVS datapath writes the actual MAC of the nework interface into OVSDB "mac_in_use" field at once when it finds the corresponding network interface is created on the host, and then it updates the value when it finds the MAC is changed. And the data is "persistent" in OVSDB just like "mac" field, as long as it is not changed. For restart/upgrade case, I don't think the "mac_in_use" will be changed. So it should not be empty if we assume OVS is working well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source of "mac_in_use" field is the MAC address configured on the corresponding network interface, for an existing interface, its value should not be empty. Even for antrea-agent Pod restart case, the data is already in ovsdb conf file ( persistent on hard drive ), the field should not be empty as long as gw0 is not a newly created interface in restart case ( new interface should use the preconfigured value directly not read from OVSDB).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find mac_in_use persistent in ovsdb conf file. So I assume once OVS restarts, ovs-vswitchd will read it and update ovsdb. I'm not very sure if it's guaranteed it will first set the data then agent can connect to ovsdb, I guess there is no guarantee. The code path I can find is:
main() -> bridge_run() -> run_status_update() -> iface_refresh_netdev_status() -> ovsrec_interface_set_mac_in_use()
If ovs-switchd can make update call to ovsdb, why can't antrea-agent make get call to it before or after it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sence. Updated to read MAC address from "mac" field in OVS Interface after restart where Agent writes the allocated value, and re-configure the field in both OVSDB and gateway InterfaceConfig with the real value read from the network interface if they are is not set ( happened in upgrade case).

@wenyingd
Copy link
Contributor Author

wenyingd commented Dec 6, 2022

/test-networkpolicy

@wenyingd wenyingd force-pushed the preassignedmac branch 2 times, most recently from 892e019 to b145a01 Compare December 6, 2022 04:37
…tion

This issue is found on Ubuntu 22.04: the network interface for antrea-gw0
is different from the one we used in OpenFlow rules.

The reason is systemd-udev has modified the interface's MAC after it watches
a new one is created. So if Antrea Agent reads interface's information before
systemd-udev's modification, Antrea Agent would uses an incorrect value to
install OpenFlow rules.

To resolve the issue,
  1. Agent generates a static MAC for antrea-gw0 or the interface used by Pod
  2. Agent uses the generated MAC to create OVS internal port or veth pair

To implement the logic, some code is copied from containernetworking/plugins/ip/link_linux
latest versions to path thirdparty, this is to avoid unexpected issues introduced
when bumping up the dependent libraries.

Signed-off-by: wenyingd <[email protected]>
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Dec 6, 2022

/test-all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Virtual network device's MAC address changes after it's created with systemd v242 and later
3 participants