Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify the status of required routes gateway periodically #2091

Merged
merged 1 commit into from
Apr 28, 2021

Conversation

hty690
Copy link
Contributor

@hty690 hty690 commented Apr 14, 2021

Add checks to the routeClient. The required routes will be added back if they were
deleted unexpectedly. Add IP configuration check of the gateway to the agent.
An integration test is added to verify that the route will be added back correctly.

Fixes #627

@hty690 hty690 force-pushed the sync_routes branch 2 times, most recently from d1f2b60 to 5f687f9 Compare April 14, 2021 05:29
if err := c.syncRoutes(); err != nil {
klog.Errorf("Failed to sync routes: %v", err)
}
if err := c.syncGwIp(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use upper cases for "IP", so do other places in this patch

for _, route := range routes {
exist := false
for i := range routeList {
if routeEqual(route, &routeList[i]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nested loop is O(n^2), which may be bad in a cluster of thousands of nodes, especially when there are many other unknown routes configured on the node.
could you optimize it to O(n) by using a map?

return nil
}

func (c *Client) syncGwIp() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not very maintainable that the gw IP is configured in other place but resynced here and the module doesn't sound proper to manage the IP. Could you move it to the place where the gwIP was first configured by starting a goroutine there?

@@ -0,0 +1,98 @@
package e2e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add license header

cmd/antrea-agent/agent.go Show resolved Hide resolved
@@ -222,6 +225,31 @@ func (i *Initializer) Initialize() error {
return err
}

// Periodically check whether IP configuration of the gateway is correct.
// Terminated when stopCh is closed.
go wait.Until(func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think networkpolicyonly mode shouldn't call this, why don't just start a goroutine calling util.ConfigureLinkAddresses after it's first called?

}
return true
})
for _, route := range routes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this handling just be in the loop of c.nodeRoutes.Range? It doesn't seem necessary to construct a slice first

return nil
}

// func (c *Client) syncGwIp() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this

@codecov-io
Copy link

codecov-io commented Apr 15, 2021

Codecov Report

Merging #2091 (808637f) into main (b88f39e) will increase coverage by 0.01%.
The diff coverage is 67.64%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2091      +/-   ##
==========================================
+ Coverage   61.05%   61.07%   +0.01%     
==========================================
  Files         270      270              
  Lines       20366    20410      +44     
==========================================
+ Hits        12435    12465      +30     
- Misses       6636     6642       +6     
- Partials     1295     1303       +8     
Flag Coverage Δ
kind-e2e-tests 51.87% <67.64%> (+0.16%) ⬆️
unit-tests 41.36% <0.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/util/net_linux.go 33.66% <ø> (ø)
pkg/agent/agent.go 47.86% <60.00%> (+0.14%) ⬆️
pkg/agent/route/route_linux.go 44.35% <68.96%> (+1.91%) ⬆️
pkg/ipfix/ipfix_process.go 81.81% <0.00%> (-18.19%) ⬇️
pkg/apiserver/handlers/endpoint/handler.go 58.82% <0.00%> (-11.77%) ⬇️
pkg/apiserver/certificate/certificate.go 69.86% <0.00%> (-6.85%) ⬇️
pkg/agent/flowexporter/exporter/exporter.go 69.45% <0.00%> (-1.61%) ⬇️
...ntroller/networkpolicy/networkpolicy_controller.go 70.00% <0.00%> (-0.13%) ⬇️
pkg/agent/controller/networkpolicy/cache.go 86.98% <0.00%> (+1.69%) ⬆️
pkg/controller/networkpolicy/status_controller.go 87.09% <0.00%> (+1.93%) ⬆️
... and 3 more

@@ -845,6 +848,15 @@ func (i *Initializer) allocateGatewayAddresses(localSubnets []*net.IPNet, gatewa
if err := util.ConfigureLinkAddresses(i.nodeConfig.GatewayConfig.LinkIndex, gwIPs); err != nil {
return err
}
// Periodically check whether IP configuration of the gateway is correct.
// Terminated when stopCh is closed.
if !i.networkConfig.TrafficEncapMode.IsNetworkPolicyOnly() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it gets here, this check has been done. And since L848 already configures link addresses, logically it's ok to just start a goroutine to do the periodical work without any check.

routeMap := make(map[string]*netlink.Route)
for i := range routeList {
r := &routeList[i]
if r == nil || r.Dst == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's impossible to be nil given L168.

}
continue
}
if !routeEqual(route, r) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to reduce redundancy:

r, ok := routeMap[route.Dst.String()]
if ok && routeEqual(route, r) {
      continue
}
if err := netlink.RouteAdd(route); err != nil {
    klog.Errorf("Failed to add route to the gateway: %v", err)
    return false
}

}

func (c *Client) syncRoutes() error {
routeList, err := netlink.RouteList(nil, netlink.FAMILY_ALL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be race condition between syncRoutes and DeleteRoutes:
if DeleteRoutes has deleted actual routes and is about to delete the desired routes, syncRoutes may see that there are desired rules but no actual routes and add the routes back. Then the routes will going to be stale.

pkg/agent/agent.go Outdated Show resolved Hide resolved
Comment on lines 507 to 509
if err != nil {
t.Fatalf("Failed to detect gateway interface name from ConfigMap: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this doing?

func testSyncRoutes(t *testing.T, data *TestData, isIPv6 bool) {
encapMode, err := data.GetEncapMode()
if err != nil {
t.Fatalf(" failed to get encap mode, err %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Fatalf(" failed to get encap mode, err %v", err)
t.Fatalf("Failed to get encap mode, err %v", err)

Comment on lines 551 to 552
if err := wait.Poll(30*time.Second, 3*time.Minute, func() (bool, error) {
newRoutes, err := getGatewayRoutes(data, isIPv6)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a big fan of including a test that long as we are trying to reduce the runtime of the e2e test suite
Any chance we can rely on an integration test instead, like we did for iptables rules reconciliation: #1751?

@codecov-commenter
Copy link

codecov-commenter commented Apr 21, 2021

Codecov Report

Merging #2091 (e7aeaba) into main (b88f39e) will increase coverage by 0.17%.
The diff coverage is 58.06%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2091      +/-   ##
==========================================
+ Coverage   61.05%   61.23%   +0.17%     
==========================================
  Files         270      269       -1     
  Lines       20366    20451      +85     
==========================================
+ Hits        12435    12523      +88     
+ Misses       6636     6633       -3     
  Partials     1295     1295              
Flag Coverage Δ
kind-e2e-tests 52.07% <58.06%> (+0.36%) ⬆️
unit-tests 41.42% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/util/net_linux.go 33.66% <ø> (ø)
pkg/agent/agent.go 47.74% <50.00%> (+0.02%) ⬆️
pkg/agent/route/route_linux.go 43.63% <59.25%> (+1.18%) ⬆️
pkg/apiserver/certificate/certificate.go 69.86% <0.00%> (-6.85%) ⬇️
pkg/antctl/raw/traceflow/command.go 23.82% <0.00%> (-1.92%) ⬇️
...ntroller/networkpolicy/networkpolicy_controller.go 69.35% <0.00%> (-0.78%) ⬇️
...agent/controller/traceflow/traceflow_controller.go 72.95% <0.00%> (-0.33%) ⬇️
pkg/agent/flowexporter/connections/connections.go 76.05% <0.00%> (-0.17%) ⬇️
pkg/antctl/antctl.go 100.00% <0.00%> (ø)
pkg/ipfix/ipfix_process.go 100.00% <0.00%> (ø)
... and 14 more

@@ -578,13 +620,13 @@ func (c *Client) DeleteRoutes(podCIDR *net.IPNet) error {

routes, exists := c.nodeRoutes.Load(podCIDRStr)
if exists {
c.nodeRoutes.Delete(podCIDRStr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would introduce new issue that the routes won't be removed if the first attempt fails.
I think it can add the remaining routes back to the cache when it fails, so it can retry to remove them later.

if ok && routeEqual(route, r) {
continue
}
if err := netlink.RouteAdd(route); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should use RouteReplace, otherwise it will always fail when there is already a route that doesn't match

@@ -65,7 +65,7 @@ func TestGetPortFields(t *testing.T) {

// TestParseFlow tests if a flow can be parsed correctly.
func TestParseFlow(t *testing.T) {
tcs := []struct {
tcs := []*struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this updates don't seem related

@hty690 hty690 force-pushed the sync_routes branch 2 times, most recently from 91165d5 to e7aeaba Compare April 22, 2021 01:45
pkg/agent/agent.go Outdated Show resolved Hide resolved
nhCIDRIP := ip.NextIP(peerCIDR.IP)
assert.NoError(t, routeClient.AddRoutes(peerCIDR, tc.nodeName, tc.peerIP, nhCIDRIP), "adding routes failed")

if !tc.mode.NeedsEncapToPeer(tc.peerIP, nodeConfig.NodeIPAddr) && tc.mode.NeedsRoutingToPeer(tc.peerIP, nodeConfig.NodeIPAddr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the first condition redundant? And it doesn't seem necessary to add a case that will be skipped. But maybe it could check whatever the original route is (a valid route or nil), after removing the route and a sync loop, it can get the same route, instead of checking the route is not empty

Copy link
Contributor Author

@hty690 hty690 Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first condition is redundant. But the second is to avoid error when executing "ip route del" on nonexistent route since in that case the route will not be added according to https://github.com/vmware-tanzu/antrea/blob/3335e734071894f70a9ab33c734799f56c049689/pkg/agent/route/route_linux.go#L541

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why the case is even added to TestSyncRoute if it's skipped anyway? I think it should either remove the 3rd case and the mode check, or be more generic: only call "ip route del" if "expOutput" is not empty, so that the test works for all cases and don't have to check mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

		assert.NoError(t, err, "error executing ip route command: %s", listCmd)

                if len(expOutput) > 0 {
		        delCmd := fmt.Sprintf("ip route del %s", expOutput)
		        _, err = exec.Command("bash", "-c", delCmd).Output()
		        assert.NoError(t, err, "error executing ip route command: %s", delCmd)
                }

test/e2e/route_util.go Outdated Show resolved Hide resolved
…periodically

Add checks to the routeClient. The required routes will be added back if they were
deleted unexpectedly. Add IP configuration check of the gateway to the agent.
An integration test is added to verify that the route will be added back correctly.

Fixes antrea-io#627
@tnqn
Copy link
Member

tnqn commented Apr 27, 2021

/test-all

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn tnqn merged commit 7355d27 into antrea-io:main Apr 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Periodically verify that all required routes are present in the Node
5 participants