Bugfix: Pod or gateway use a different MAC address #4428
Codecov Report
@@            Coverage Diff             @@
##             main    #4428      +/-   ##
==========================================
+ Coverage   65.35%   65.40%   +0.05%
==========================================
  Files         400      400
  Lines       56849    57513     +664
==========================================
+ Hits        37152    37619     +467
- Misses      17005    17187     +182
- Partials     2692     2707      +15
1e5b931 to 9928a12
/test-all
/test-windows-e2e
pkg/agent/agent_test.go
Outdated
func mockUtilFunctions() {
	setLinkUp = func(name string) (net.HardwareAddr, int, error) {
		return util.GenerateRandomMAC(true), 10, nil
	}
	configureLinkAddresses = func(idx int, ipNets []*net.IPNet) error {
		return nil
	}
}

func restoreUtilFunctions() {
	setLinkUp = util.SetLinkUp
	configureLinkAddresses = util.ConfigureLinkAddresses
}
It's a little hard to tell what these functions do when they are called in tests. To make this clearer and more extensible, please consider the following code:
// mockSetLinkUp replaces setLinkUp with a stub and returns a function
// that restores the original implementation.
func mockSetLinkUp(returnIndex int, returnErr error) func() {
	originalSetLinkUp := setLinkUp
	setLinkUp = func(name string) (int, error) {
		return returnIndex, returnErr
	}
	return func() {
		setLinkUp = originalSetLinkUp
	}
}

// mockConfigureLinkAddresses replaces configureLinkAddresses in the same way.
func mockConfigureLinkAddresses(returnErr error) func() {
	originalConfigureLinkAddresses := configureLinkAddresses
	configureLinkAddresses = func(idx int, ipNets []*net.IPNet) error {
		return returnErr
	}
	return func() {
		configureLinkAddresses = originalConfigureLinkAddresses
	}
}

// linux.go
func mockSetInterfaceMTU(returnErr error) func() {
	return func() {}
}

// windows.go
func mockSetInterfaceMTU(returnErr error) func() {
	originalSetInterfaceMTU := setInterfaceMTU
	setInterfaceMTU = func(ifaceName string, mtu int) error {
		return returnErr
	}
	return func() {
		setInterfaceMTU = originalSetInterfaceMTU
	}
}

func TestXXX(t *testing.T) {
	defer mockSetLinkUp(10, nil)()
	defer mockConfigureLinkAddresses(nil)()
	defer mockSetInterfaceMTU(nil)()
	...
}
updated accordingly.
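For readers following the thread, the swap-and-restore pattern suggested above can be reduced to a self-contained sketch. It is not the Antrea code: realSetLinkUp and errNoSuchLink are hypothetical stand-ins for util.SetLinkUp and its failure mode, and only the shape of the mock helper is taken from the suggestion.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSuchLink is a hypothetical error used by the stand-in implementation.
var errNoSuchLink = errors.New("no such link")

// realSetLinkUp stands in for util.SetLinkUp (assumption: the real function
// brings the host link up and returns its index).
func realSetLinkUp(name string) (int, error) {
	return 0, errNoSuchLink
}

// setLinkUp is the package-level function variable that production code
// calls and that tests replace.
var setLinkUp = realSetLinkUp

// mockSetLinkUp installs a stub and returns a function that restores the
// previous value, so a test can write `defer mockSetLinkUp(10, nil)()`.
func mockSetLinkUp(returnIndex int, returnErr error) func() {
	original := setLinkUp
	setLinkUp = func(name string) (int, error) {
		return returnIndex, returnErr
	}
	return func() { setLinkUp = original }
}

func main() {
	restore := mockSetLinkUp(10, nil)
	idx, err := setLinkUp("antrea-gw0")
	fmt.Println(idx, err) // result comes from the stub

	restore()
	_, err = setLinkUp("antrea-gw0")
	fmt.Println(err) // result comes from the stand-in implementation again
}
```

Returning the restore function keeps each mock next to its cleanup, which is why a single `defer` line per dependency is enough in a test.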
pkg/agent/util/net.go
Outdated
@@ -401,13 +401,20 @@ func GenerateUplinkInterfaceName(name string) string {
 	return name + bridgedUplinkSuffix
 }
 
-func GenerateRandomMAC() net.HardwareAddr {
+func GenerateRandomMAC(global bool) net.HardwareAddr {
All virtual devices should set the locally administered bit:
- All MAC addresses, including those for veth, OVS internal port, and OVS datapath ID, set it;
- The Linux kernel implementation does the same for all virtual devices: https://elixir.bootlin.com/linux/v4.15.18/source/include/linux/etherdevice.h#L227
Originally, I hit some test failures when creating network interfaces with a pre-allocated MAC that had only the local bit set. I thought the failures were related to the local bit, so I added this global parameter. Since the Linux kernel uses this logic, I will update accordingly.
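As a point of reference for this exchange, the kernel helper linked above fills the address with random bytes, clears the multicast bit, and sets the locally administered bit. A minimal Go sketch of the same bit handling follows; generateRandomMAC is a hypothetical name and this is not the Antrea implementation.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"net"
)

// generateRandomMAC returns a random unicast, locally administered MAC.
// Hypothetical helper that mirrors the bit handling of the kernel's
// random-address routine referenced in the review comment.
func generateRandomMAC() (net.HardwareAddr, error) {
	mac := make(net.HardwareAddr, 6)
	if _, err := rand.Read(mac); err != nil {
		return nil, err
	}
	mac[0] &= 0xfe // clear the multicast (group) bit
	mac[0] |= 0x02 // set the locally administered bit
	return mac, nil
}

func main() {
	mac, err := generateRandomMAC()
	if err != nil {
		fmt.Println("failed to generate MAC:", err)
		return
	}
	fmt.Println(mac)
}
```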
pkg/agent/agent.go
Outdated
	var gwLinkIdx int
	var err error
	// Host link might not be queried at once after creating OVS internal port; retry max 5 times with 1s
	// delay each time to ensure the link is ready.
	for retry := 0; retry < maxRetryForHostLink; retry++ {
		gwMAC, gwLinkIdx, err = util.SetLinkUp(i.hostGateway)
		_, gwLinkIdx, err = setLinkUp(i.hostGateway)
I think the first return value could be removed, as it's no longer used by any caller.
Removed the first return value in function setLinkUp
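The retry loop in the snippet under review, with the two-value signature agreed on here, can be sketched in isolation as follows. This is an assumption-laden illustration, not the agent code: waitForLink, errLinkNotReady, and the explicit maxRetry and delay parameters are hypothetical, and the sketch retries on any error, whereas the real loop may distinguish error types.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errLinkNotReady is a hypothetical error for a link the host cannot see yet.
var errLinkNotReady = errors.New("link not ready")

// waitForLink calls setLinkUp up to maxRetry times, sleeping for delay
// between attempts, and returns the link index from the first success.
// maxRetry is assumed to be positive.
func waitForLink(setLinkUp func(name string) (int, error), name string, maxRetry int, delay time.Duration) (int, error) {
	var err error
	for retry := 0; retry < maxRetry; retry++ {
		var idx int
		idx, err = setLinkUp(name)
		if err == nil {
			return idx, nil
		}
		// Do not sleep after the final attempt.
		if retry < maxRetry-1 {
			time.Sleep(delay)
		}
	}
	return 0, fmt.Errorf("link %s not ready after %d attempts: %w", name, maxRetry, err)
}

func main() {
	attempts := 0
	// Simulated link that becomes visible on the third query.
	flaky := func(name string) (int, error) {
		attempts++
		if attempts < 3 {
			return 0, errLinkNotReady
		}
		return 10, nil
	}
	idx, err := waitForLink(flaky, "antrea-gw0", 5, 10*time.Millisecond)
	fmt.Println(idx, err, attempts)
}
```

Passing the function as a parameter is only a convenience for the sketch; the PR achieves the same testability with a package-level function variable.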
32d2270 to 4456386
@tnqn The patch coverage is lower than the requirement after a change in function
LGTM
It's fine to ignore it.
/test-all
/test-windows-all
/test-windows-e2e
/test-windows-e2e
@tnqn, I just found that "mac_in_use" on Windows OVS is always "00:00:00:00:00:00", so this change may introduce a failure on Windows. We therefore still need to read the MAC from the network interface on Windows.
4456386 to 3405c4e
In the latest change, I reverted the Windows implementation to the original logic, and use the pre-allocated MAC only on Linux. @tnqn Would you review it again?
/test-all
/test-e2e
/test-e2e
Both Windows and Linux tests passed after the latest change.
pkg/agent/agent.go
Outdated
	i.nodeConfig.GatewayConfig = &config.GatewayConfig{Name: i.hostGateway, MAC: gwMAC, OFPort: uint32(gatewayIface.OFPort)}
	gatewayIface.MAC = gwMAC
	i.configureGatewayMAC(gatewayIface, gwMAC)
The function also assigns gatewayConfig to i.nodeConfig, which cannot be told from the function name.
Besides, I wonder whether it could happen on Linux that the OVS DB doesn't have mac_in_use set when antrea-agent reads it, if the field is loaded asynchronously. And to make a newly created antrea-gw0 and an existing antrea-gw0 consistent, I feel we could unify the code across OSes as follows:
- Read "mac" from ovsdb, instead of "mac_in_use"
- Always assign "mac" from gatewayIface to "i.nodeConfig.GatewayConfig" regardless of platform
- For an existing antrea-gw0, "mac" in gatewayIface would be empty; use the MAC obtained from "setLinkUp" to update gatewayIface, and update ovsdb to reflect it.
On Windows, the MAC field configured in Interface is not the actual value used by the network adapter, so we cannot use OVSDB as the single source for both Linux and Windows.
3405c4e to b520e48
/test-all
pkg/agent/agent_linux.go
Outdated
func (i *Initializer) configureGatewayMAC(gatewayIface *interfacestore.InterfaceConfig, gwMAC net.HardwareAddr) {
	// Use the pre-assigned MAC in GatewayConfig to ensure the MAC address used in OpenFlow rules is consistent
	// with the value configured in the network interface.
	i.nodeConfig.GatewayConfig.MAC = gatewayIface.MAC
Could the situation I described in https://github.com/antrea-io/antrea/pull/4428/files#r1039171906 happen?
"the ovs db doesn't have mac_in_use set when antrea-agent reads it if the field is loaded asynchronously"
If we are not sure, or it's not guaranteed that mac_in_use is already set, should it use gwMAC if gatewayIface.MAC is empty?
I think your concern is the restart/upgrade case, in which antrea-gw0 is already created and the Agent tries to load GatewayInterfaceConfig from OVSDB. According to my observation, the OVS datapath writes the actual MAC of the network interface into the OVSDB "mac_in_use" field as soon as it finds that the corresponding network interface has been created on the host, and it updates the value when it finds that the MAC has changed. The data is "persistent" in OVSDB, just like the "mac" field, as long as it is not changed. For the restart/upgrade case, I don't think "mac_in_use" will change, so it should not be empty if we assume OVS is working well?
The source of the "mac_in_use" field is the MAC address configured on the corresponding network interface, so for an existing interface its value should not be empty. Even in the antrea-agent Pod restart case, the data is already in the ovsdb conf file (persistent on the hard drive), so the field should not be empty as long as gw0 is not a newly created interface (a new interface should use the preconfigured value directly rather than read it from OVSDB).
I don't find that mac_in_use is persisted in the ovsdb conf file. So I assume that once OVS restarts, ovs-vswitchd will read it and update ovsdb. I'm not sure whether it's guaranteed that it sets the data before the agent can connect to ovsdb; I guess there is no guarantee. The code path I can find is:
main() -> bridge_run() -> run_status_update() -> iface_refresh_netdev_status() -> ovsrec_interface_set_mac_in_use()
If ovs-vswitchd can make an update call to ovsdb, why can't antrea-agent make a get call to it before or after that?
Makes sense. Updated to read the MAC address after a restart from the "mac" field in the OVS Interface, where the Agent writes the allocated value, and to re-configure the field in both OVSDB and the gateway InterfaceConfig with the real value read from the network interface if it is not set (which happens in the upgrade case).
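The decision described in this last reply can be summarized in a small sketch. It only models the choice between the two sources of the MAC; resolveGatewayMAC and its parameters are hypothetical names, and the actual OVSDB read and write calls are left out.

```go
package main

import (
	"fmt"
	"net"
)

// resolveGatewayMAC picks the gateway MAC to use in OpenFlow rules.
// storedMAC is the value of the "mac" field loaded from the OVS Interface
// record; linkMAC is the value read from the network interface on the host.
// needsUpdate reports whether the stored record should be rewritten with
// the returned MAC (the upgrade case, where "mac" was never set).
func resolveGatewayMAC(storedMAC, linkMAC net.HardwareAddr) (mac net.HardwareAddr, needsUpdate bool) {
	if len(storedMAC) > 0 {
		// The Agent allocated this MAC earlier and wrote it to OVSDB.
		return storedMAC, false
	}
	// Nothing stored: fall back to the MAC actually configured on the link.
	return linkMAC, true
}

func main() {
	linkMAC, err := net.ParseMAC("02:00:00:00:00:01")
	if err != nil {
		fmt.Println("invalid MAC:", err)
		return
	}
	mac, needsUpdate := resolveGatewayMAC(nil, linkMAC)
	fmt.Println(mac, needsUpdate)
}
```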
/test-networkpolicy
892e019 to b145a01
b145a01 to 02bcf0f
02bcf0f to 2a81fbc
…tion
This issue is found on Ubuntu 22.04: the network interface for antrea-gw0 is different from the one we used in OpenFlow rules. The reason is systemd-udev has modified the interface's MAC after it watches a new one is created. So if Antrea Agent reads interface's information before systemd-udev's modification, Antrea Agent would use an incorrect value to install OpenFlow rules.
To resolve the issue,
1. Agent generates a static MAC for antrea-gw0 or the interface used by Pod
2. Agent uses the generated MAC to create OVS internal port or veth pair
To implement the logic, some code is copied from containernetworking/plugins/ip/link_linux latest versions to path thirdparty, this is to avoid unexpected issues introduced when bumping up the dependent libraries.
Signed-off-by: wenyingd <[email protected]>
2a81fbc to 1322dab
LGTM
/test-all
This issue was found on Ubuntu 22.04: the MAC address of the network interface for antrea-gw0
is different from the one used in OpenFlow rules.
The reason is that systemd-udev modifies the interface's MAC after it sees that
a new interface has been created. So if Antrea Agent reads the interface's information
before systemd-udev's modification, it uses an incorrect value to
install OpenFlow rules.
To resolve the issue,
1. Agent generates a static MAC for antrea-gw0 or the interface used by Pod
2. Agent uses the generated MAC to create OVS internal port or veth pair
To implement the logic, some code is copied from the latest versions of
containernetworking/plugins/ip/link_linux to the path thirdparty; this avoids
unexpected issues introduced when bumping up the dependent libraries.
Fix #4426
Signed-off-by: wenyingd [email protected]