Verify the status of required routes gateway periodically #2091
d1f2b60
to
5f687f9
Compare
pkg/agent/route/route_linux.go
Outdated
if err := c.syncRoutes(); err != nil {
	klog.Errorf("Failed to sync routes: %v", err)
}
if err := c.syncGwIp(); err != nil {
Please capitalize "IP", here and in the other places in this patch.
pkg/agent/route/route_linux.go
Outdated
for _, route := range routes {
	exist := false
	for i := range routeList {
		if routeEqual(route, &routeList[i]) {
The nested loop is O(n^2), which may be bad in a cluster of thousands of nodes, especially when there are many other unknown routes configured on the node.
Could you optimize it to O(n) by using a map?
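A minimal sketch of the map-based lookup suggested above. The `route` type and `missingRoutes` function are hypothetical stand-ins, not code from this patch; `route` reduces `netlink.Route` to two comparable fields so the example runs without the netlink library. It assumes at most one installed route per destination, which may not hold if the node has routes in several tables.

```go
package main

import "fmt"

// route is a hypothetical stand-in for netlink.Route, reduced to the
// fields needed for this illustration.
type route struct {
	Dst string // destination CIDR, e.g. "10.10.1.0/24"
	Gw  string // next-hop gateway IP
}

// missingRoutes returns the desired routes that have no identical entry
// among the installed routes. Installed routes are indexed by destination
// once, so each desired route costs a single map lookup instead of a scan.
func missingRoutes(desired, installed []route) []route {
	byDst := make(map[string]route, len(installed))
	for _, r := range installed {
		byDst[r.Dst] = r
	}
	var missing []route
	for _, d := range desired {
		if r, ok := byDst[d.Dst]; ok && r == d {
			continue // already installed and identical
		}
		missing = append(missing, d)
	}
	return missing
}

func main() {
	installed := []route{{"10.10.1.0/24", "10.10.0.1"}, {"192.168.0.0/16", "192.168.0.1"}}
	desired := []route{{"10.10.1.0/24", "10.10.0.1"}, {"10.10.2.0/24", "10.10.0.1"}}
	fmt.Println(missingRoutes(desired, installed))
}
```

With n desired routes and m installed routes, this does O(n + m) work rather than O(n * m).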
pkg/agent/route/route_linux.go
Outdated
	return nil
}

func (c *Client) syncGwIp() error {
It's not very maintainable for the gateway IP to be configured in one place but resynced here, and this module doesn't seem like the proper one to manage the IP. Could you move this to the place where the gateway IP is first configured, by starting a goroutine there?
test/e2e/route_util.go
Outdated
@@ -0,0 +1,98 @@
package e2e
Please add the license header.
pkg/agent/agent.go
Outdated
@@ -222,6 +225,31 @@ func (i *Initializer) Initialize() error {
		return err
	}

	// Periodically check whether IP configuration of the gateway is correct.
	// Terminated when stopCh is closed.
	go wait.Until(func() {
I think NetworkPolicyOnly mode shouldn't call this. Why not just start a goroutine that calls util.ConfigureLinkAddresses after it's first called?
pkg/agent/route/route_linux.go
Outdated
		}
		return true
	})
	for _, route := range routes {
Could this handling just be done inside the c.nodeRoutes.Range loop? It doesn't seem necessary to construct a slice first.
pkg/agent/route/route_linux.go
Outdated
	return nil
}

// func (c *Client) syncGwIp() error {
Please remove this.
Codecov Report
@@ Coverage Diff @@
## main #2091 +/- ##
==========================================
+ Coverage 61.05% 61.07% +0.01%
==========================================
Files 270 270
Lines 20366 20410 +44
==========================================
+ Hits 12435 12465 +30
- Misses 6636 6642 +6
- Partials 1295 1303 +8
pkg/agent/agent.go
Outdated
@@ -845,6 +848,15 @@ func (i *Initializer) allocateGatewayAddresses(localSubnets []*net.IPNet, gatewa
	if err := util.ConfigureLinkAddresses(i.nodeConfig.GatewayConfig.LinkIndex, gwIPs); err != nil {
		return err
	}
	// Periodically check whether IP configuration of the gateway is correct.
	// Terminated when stopCh is closed.
	if !i.networkConfig.TrafficEncapMode.IsNetworkPolicyOnly() {
By the time execution gets here, this check has already been done. And since L848 already configures the link addresses, it is logically fine to just start a goroutine for the periodic work without any check.
pkg/agent/route/route_linux.go
Outdated
routeMap := make(map[string]*netlink.Route)
for i := range routeList {
	r := &routeList[i]
	if r == nil || r.Dst == nil {
It's impossible to be nil given L168.
pkg/agent/route/route_linux.go
Outdated
	}
	continue
}
if !routeEqual(route, r) {
To reduce redundancy:
r, ok := routeMap[route.Dst.String()]
if ok && routeEqual(route, r) {
	continue
}
if err := netlink.RouteAdd(route); err != nil {
	klog.Errorf("Failed to add route to the gateway: %v", err)
	return false
}
}

func (c *Client) syncRoutes() error {
	routeList, err := netlink.RouteList(nil, netlink.FAMILY_ALL)
There might be a race condition between syncRoutes and DeleteRoutes: if DeleteRoutes has deleted the actual routes and is about to delete the desired routes, syncRoutes may see that there are desired routes but no actual routes and add the routes back. The routes will then be stale.
test/e2e/basic_test.go
Outdated
if err != nil {
	t.Fatalf("Failed to detect gateway interface name from ConfigMap: %v", err)
}
What is this doing?
test/e2e/basic_test.go
Outdated
func testSyncRoutes(t *testing.T, data *TestData, isIPv6 bool) {
	encapMode, err := data.GetEncapMode()
	if err != nil {
		t.Fatalf(" failed to get encap mode, err %v", err)
- t.Fatalf(" failed to get encap mode, err %v", err)
+ t.Fatalf("Failed to get encap mode, err %v", err)
test/e2e/basic_test.go
Outdated
if err := wait.Poll(30*time.Second, 3*time.Minute, func() (bool, error) {
	newRoutes, err := getGatewayRoutes(data, isIPv6)
I am not a big fan of including a test that long, as we are trying to reduce the runtime of the e2e test suite.
Any chance we can rely on an integration test instead, like we did for iptables rules reconciliation: #1751?
Codecov Report
@@ Coverage Diff @@
## main #2091 +/- ##
==========================================
+ Coverage 61.05% 61.23% +0.17%
==========================================
Files 270 269 -1
Lines 20366 20451 +85
==========================================
+ Hits 12435 12523 +88
+ Misses 6636 6633 -3
Partials 1295 1295
@@ -578,13 +620,13 @@ func (c *Client) DeleteRoutes(podCIDR *net.IPNet) error {

	routes, exists := c.nodeRoutes.Load(podCIDRStr)
	if exists {
		c.nodeRoutes.Delete(podCIDRStr)
This would introduce a new issue: the routes won't be removed if the first attempt fails.
I think it can add the remaining routes back to the cache when it fails, so it can retry removing them later.
pkg/agent/route/route_linux.go
Outdated
if ok && routeEqual(route, r) {
	continue
}
if err := netlink.RouteAdd(route); err != nil {
It should use RouteReplace, otherwise it will always fail when there is already a route that doesn't match.
@@ -65,7 +65,7 @@ func TestGetPortFields(t *testing.T) {

// TestParseFlow tests if a flow can be parsed correctly.
func TestParseFlow(t *testing.T) {
-	tcs := []struct {
+	tcs := []*struct {
These updates don't seem related.
91165d5
to
e7aeaba
Compare
test/integration/agent/route_test.go
Outdated
nhCIDRIP := ip.NextIP(peerCIDR.IP)
assert.NoError(t, routeClient.AddRoutes(peerCIDR, tc.nodeName, tc.peerIP, nhCIDRIP), "adding routes failed")

if !tc.mode.NeedsEncapToPeer(tc.peerIP, nodeConfig.NodeIPAddr) && tc.mode.NeedsRoutingToPeer(tc.peerIP, nodeConfig.NodeIPAddr) {
Isn't the first condition redundant? And it doesn't seem necessary to add a case that will be skipped. But maybe the test could record whatever the original route is (a valid route or nil) and verify that, after removing the route and running a sync loop, it gets the same route back, instead of checking that the route is not empty.
The first condition is redundant. But the second is needed to avoid an error when executing "ip route del" on a nonexistent route, since in that case the route will not be added, according to https://github.com/vmware-tanzu/antrea/blob/3335e734071894f70a9ab33c734799f56c049689/pkg/agent/route/route_linux.go#L541
Then why is the case even added to TestSyncRoute if it's skipped anyway? I think it should either remove the 3rd case and the mode check, or be more generic: only call "ip route del" if "expOutput" is not empty, so that the test works for all cases and doesn't have to check the mode.
assert.NoError(t, err, "error executing ip route command: %s", listCmd)
if len(expOutput) > 0 {
	delCmd := fmt.Sprintf("ip route del %s", expOutput)
	_, err = exec.Command("bash", "-c", delCmd).Output()
	assert.NoError(t, err, "error executing ip route command: %s", delCmd)
}
…periodically

Add checks to the routeClient. The required routes will be added back if they were deleted unexpectedly. Add IP configuration check of the gateway to the agent. An integration test is added to verify that the route will be added back correctly.

Fixes antrea-io#627
/test-all
LGTM
Add checks to the routeClient. The required routes will be added back if they were
deleted unexpectedly. Add IP configuration check of the gateway to the agent.
An integration test is added to verify that the route will be added back correctly.
Fixes #627