-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bugs when Controller reconnecting to Switch #31
Conversation
ofctrl/fgraphSwitch.go
Outdated
@@ -54,13 +54,6 @@ func (self *OFSwitch) initFgraph() error { | |||
normalLookup.outputType = "normal" | |||
normalLookup.portNo = openflow13.P_NORMAL | |||
self.normalLookup = normalLookup | |||
|
|||
// Clear all existing flood lists | |||
groupMod := openflow13.NewGroupMod() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No flooding is not configured by default, so this cleanup operation is meaningless. A separate patch will be used to clean up the other legacy code.
ofctrl/ofSwitch.go
Outdated
self.changeStatus(true) | ||
|
||
// Send new feature request | ||
self.Send(openflow13.NewFeaturesRequest()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FeatureRequest is sent in the connection when Controller receives "Hello" message from Switch, so this request is not needed.
ofctrl/ofSwitch.go
Outdated
// Send new feature request | ||
self.Send(openflow13.NewFeaturesRequest()) | ||
|
||
self.Send(openflow13.NewEchoRequest()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Echo request is sent periodically in function switchConnected
in another goroutine.
s := NewSwitch(stream, m.DPID, c.app, reConnChan, c.id) | ||
if err := s.switchConnected(); err != nil { | ||
log.Errorf("failed to initialize OpenFlow switch %s: %v", m.DPID, err) | ||
s.switchDisconnected(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reconnection event is sent to Controller when this function returns, so do not need to send this event inside switchDisconnected
1dfe5d5
to
46e26e2
Compare
46e26e2
to
e322bf6
Compare
// Clear all existing flood lists | ||
groupMod := openflow15.NewGroupMod() | ||
groupMod.GroupId = openflow15.OFPG_ALL | ||
groupMod.Command = openflow15.OFPGC_DELETE | ||
groupMod.Type = openflow15.GT_ALL | ||
return self.Send(groupMod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to remove this? I remember agent relies on this or similar code to clean up previous installed groups so reinstalling groups won't conflict? @antoninbas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct. And there is a comment in the Antrea Agent code about this: https://github.com/antrea-io/antrea/blob/454df5dd383eac93aedab4df6f5b0e4b478e46db/pkg/agent/agent.go#L465-L470
I don't know if things have changed with Openflow 1.5. If this code is not needed any more, the comment in the Antrea Agent code should be updated as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, this is quite easy to test. You just need to kill the antrea-agent process / container and let it restart. The idea is that the Agent should restart, without OVS restarting. So you don't want to delete the antrea-agent Pod. If there is any error during Agent initialization after the restart, then we either need this code or more complex logic in the Antrea Agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the group modification message is used, since it is used to delete the groups with type "all" only. I could have a try with Antrea Agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see errors in Antrea logs in agent restart case. But I think a better way is to remove the existing groups before replaying groups in the reconnections in Antrea Agent code. So I add this part in another change in antrea repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the groups are always deleted in case of reconnection, and there is nothing new here?
I checked the existing code in antrea, and I didn't find related logic to delete the existing groups in re-connect case. Before the flexible pipeine logic, I don't remember we have the logic to replay groups.
But it opens up the following questions:
- is it the same for flows? When I added the cookie mechanism to delete stale flows 2+ years ago, I believe it was to solve an actual problem
Flow is different. There is a special field "cookie" in every flow, and we use several bits in the cookie to identify round number. Every time antrea is re-connected to OVS, we update the round number, then replay the flows with new cookie ID (including the new round number), and then we delete the flows with filter of the last round. Since the flows are override with the new cookie ID if they are replayed, the deletion operation only removes the flows not needed. But the cookie is not existing in group modification message.
- what about meters? When we added meters for packet-in rate limiting, we had to add code to the Antrea agent to delete meters in case of restart to avoid the
OFPMMFC_METER_EXISTS
error: Packet-in rate limiting with OF Meter antrea#2215 (comment). Are meters treated differently from flows / groups, that seems strange.
I didn't see meter replay operations in antrea existing code.
- we actually don't want flows / groups to be deleted on antrea-agent restart, to minimize traffic disruption as it can take some time to restore all flows. Can this behavior be modified by changing the fail-mode of the OVS bridge from standalone to secure?
In my understanding, standalone mode means the openflow entries are deleted once the OF controller is disconnected, and secure mode means the existing flows are deleted afterOVS is re-connected to OF Controller. The difference is if the existing flows are kept during the period when OVS is disconnected from Controller. I am not sure if the logic is also suitable on the groups and meters. But anyway, the flows are empty when antrea agent is reconnected to OVS, and this is a different behavior from the current logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wenyingd I'm reading the Openflow spec and I think that the group type is ignored for a
OFPGC_DELETE
operation. So setting it to ALL should not matter. This code should be deleting all groups of all types.Did you test without this code, with a restart of the antrea-agent container?
Let me try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if the operation to delete existing groups is needed, I don't think is a good practice to leave this behavior in ofnet. Shall we move it to antrea? Otherwise, we have to always remember that ofnet has this pre-configuration. @antoninbas @tnqn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wenyingd I'm reading the Openflow spec and I think that the group type is ignored for a
OFPGC_DELETE
operation. So setting it to ALL should not matter. This code should be deleting all groups of all types.
Did you test without this code, with a restart of the antrea-agent container?Let me try.
@antoninbas You are correct, if I remove this code ( deleteing all groups in agent restart only case), I would got this error when replaying groups:
Received OpenFlow1.5 error: OFPGMFC_GROUP_EXISTS on message OFPT_GROUP_MOD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving to agent makes sense to me. Just wanted to make sure everything works as expected.
self.changeStatus(true) | ||
|
||
// Send new feature request | ||
self.Send(openflow15.NewFeaturesRequest()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it removing a useless request the lib sent before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, NewFeaturesRequest is sent in Controller (ofctrl.go) when receiving "hello" from Switch.
Update: I removed this duplicated SwitchFeatures request, and sent the SwitchConfig and Controller setting message in function switchConnected
, since this function is called when Controller received a SwitchFeature reply message. The change removes a duplicated message, but other configurations are the same.
// Send new feature request | ||
self.Send(openflow15.NewFeaturesRequest()) | ||
|
||
self.Send(openflow15.NewEchoRequest()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it expected to remove the first echo request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is another goroutine to send echo request periodically.
838a654
to
3efb2b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1. Maintain a variable for Switch connection, and close the previous one when reconnecting to Switch. This can ensure only one unix socket is used between Controller and Switch. 2. Add timeout when sending message to Switch or waiting for message from Switch, to avoid Controller entering blocking state in a long time. 3. Remove a duplicated SwitchFeatures request, and send SwitchConfig and Controller setting message when switch is connected. 4. Resolve OpenFlow version misconfiguration bugs in Hello message, and PortModification message. Signed-off-by: wenyingd <[email protected]>
3efb2b0
to
cb45146
Compare
I revert the change of removing group deletion in this patch, and move the call into function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
one when reconnecting to Switch. This can ensure only one unix
socket is used between Controller and Switch.
from Switch, to avoid Controller entering blocking state in a long time.
Signed-off-by: wenyingd [email protected]