Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bugs when Controller reconnecting to Switch #31

Merged
merged 1 commit into from
Aug 17, 2022

Conversation

wenyingd
Copy link

  1. Maintain a variable for Switch connection, and close the previous
    one when reconnecting to Switch. This can ensure only one unix
    socket is used between Controller and Switch.
  2. Add timeout when sending message to Switch or waiting for message
    from Switch, to avoid Controller entering blocking state in a long time.

Signed-off-by: wenyingd [email protected]

@wenyingd wenyingd requested a review from tnqn July 22, 2022 14:02
@@ -54,13 +54,6 @@ func (self *OFSwitch) initFgraph() error {
normalLookup.outputType = "normal"
normalLookup.portNo = openflow13.P_NORMAL
self.normalLookup = normalLookup

// Clear all existing flood lists
groupMod := openflow13.NewGroupMod()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No flooding is not configured by default, so this cleanup operation is meaningless. A separate patch will be used to clean up the other legacy code.

self.changeStatus(true)

// Send new feature request
self.Send(openflow13.NewFeaturesRequest())
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FeatureRequest is sent in the connection when Controller receives "Hello" message from Switch, so this request is not needed.

// Send new feature request
self.Send(openflow13.NewFeaturesRequest())

self.Send(openflow13.NewEchoRequest())
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Echo request is sent periodically in function switchConnected in another goroutine.

s := NewSwitch(stream, m.DPID, c.app, reConnChan, c.id)
if err := s.switchConnected(); err != nil {
log.Errorf("failed to initialize OpenFlow switch %s: %v", m.DPID, err)
s.switchDisconnected(false)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconnection event is sent to Controller when this function returns, so do not need to send this event inside switchDisconnected

Comment on lines -58 to -63
// Clear all existing flood lists
groupMod := openflow15.NewGroupMod()
groupMod.GroupId = openflow15.OFPG_ALL
groupMod.Command = openflow15.OFPGC_DELETE
groupMod.Type = openflow15.GT_ALL
return self.Send(groupMod)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to remove this? I remember agent relies on this or similar code to clean up previous installed groups so reinstalling groups won't conflict? @antoninbas

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct. And there is a comment in the Antrea Agent code about this: https://github.com/antrea-io/antrea/blob/454df5dd383eac93aedab4df6f5b0e4b478e46db/pkg/agent/agent.go#L465-L470

I don't know if things have changed with Openflow 1.5. If this code is not needed any more, the comment in the Antrea Agent code should be updated as well.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, this is quite easy to test. You just need to kill the antrea-agent process / container and let it restart. The idea is that the Agent should restart, without OVS restarting. So you don't want to delete the antrea-agent Pod. If there is any error during Agent initialization after the restart, then we either need this code or more complex logic in the Antrea Agent.

Copy link
Author

@wenyingd wenyingd Aug 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the group modification message is used, since it is used to delete the groups with type "all" only. I could have a try with Antrea Agent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see errors in Antrea logs in agent restart case. But I think a better way is to remove the existing groups before replaying groups in the reconnections in Antrea Agent code. So I add this part in another change in antrea repo.

Copy link
Author

@wenyingd wenyingd Aug 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the groups are always deleted in case of reconnection, and there is nothing new here?

I checked the existing code in antrea, and I didn't find related logic to delete the existing groups in re-connect case. Before the flexible pipeine logic, I don't remember we have the logic to replay groups.

But it opens up the following questions:

  1. is it the same for flows? When I added the cookie mechanism to delete stale flows 2+ years ago, I believe it was to solve an actual problem

Flow is different. There is a special field "cookie" in every flow, and we use several bits in the cookie to identify round number. Every time antrea is re-connected to OVS, we update the round number, then replay the flows with new cookie ID (including the new round number), and then we delete the flows with filter of the last round. Since the flows are override with the new cookie ID if they are replayed, the deletion operation only removes the flows not needed. But the cookie is not existing in group modification message.

  1. what about meters? When we added meters for packet-in rate limiting, we had to add code to the Antrea agent to delete meters in case of restart to avoid the OFPMMFC_METER_EXISTS error: Packet-in rate limiting with OF Meter antrea#2215 (comment). Are meters treated differently from flows / groups, that seems strange.

I didn't see meter replay operations in antrea existing code.

  1. we actually don't want flows / groups to be deleted on antrea-agent restart, to minimize traffic disruption as it can take some time to restore all flows. Can this behavior be modified by changing the fail-mode of the OVS bridge from standalone to secure?

In my understanding, standalone mode means the openflow entries are deleted once the OF controller is disconnected, and secure mode means the existing flows are deleted afterOVS is re-connected to OF Controller. The difference is if the existing flows are kept during the period when OVS is disconnected from Controller. I am not sure if the logic is also suitable on the groups and meters. But anyway, the flows are empty when antrea agent is reconnected to OVS, and this is a different behavior from the current logic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenyingd I'm reading the Openflow spec and I think that the group type is ignored for a OFPGC_DELETE operation. So setting it to ALL should not matter. This code should be deleting all groups of all types.

Did you test without this code, with a restart of the antrea-agent container?

Let me try.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the operation to delete existing groups is needed, I don't think is a good practice to leave this behavior in ofnet. Shall we move it to antrea? Otherwise, we have to always remember that ofnet has this pre-configuration. @antoninbas @tnqn

Copy link
Author

@wenyingd wenyingd Aug 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenyingd I'm reading the Openflow spec and I think that the group type is ignored for a OFPGC_DELETE operation. So setting it to ALL should not matter. This code should be deleting all groups of all types.
Did you test without this code, with a restart of the antrea-agent container?

Let me try.

@antoninbas You are correct, if I remove this code ( deleteing all groups in agent restart only case), I would got this error when replaying groups:

Received OpenFlow1.5 error: OFPGMFC_GROUP_EXISTS on message OFPT_GROUP_MOD

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving to agent makes sense to me. Just wanted to make sure everything works as expected.

ofctrl/ofSwitch.go Outdated Show resolved Hide resolved
ofctrl/ofSwitch.go Outdated Show resolved Hide resolved
ofctrl/ofSwitch.go Outdated Show resolved Hide resolved
ofctrl/ofSwitch.go Show resolved Hide resolved
ofctrl/ofTlvMap.go Outdated Show resolved Hide resolved
ofctrl/ofctrl.go Outdated Show resolved Hide resolved
self.changeStatus(true)

// Send new feature request
self.Send(openflow15.NewFeaturesRequest())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it removing a useless request the lib sent before?

Copy link
Author

@wenyingd wenyingd Aug 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, NewFeaturesRequest is sent in Controller (ofctrl.go) when receiving "hello" from Switch.

Update: I removed this duplicated SwitchFeatures request, and sent the SwitchConfig and Controller setting message in function switchConnected, since this function is called when Controller received a SwitchFeature reply message. The change removes a duplicated message, but other configurations are the same.

// Send new feature request
self.Send(openflow15.NewFeaturesRequest())

self.Send(openflow15.NewEchoRequest())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected to remove the first echo request?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is another goroutine to send echo request periodically.

@wenyingd wenyingd force-pushed the reconnect_issue branch 4 times, most recently from 838a654 to 3efb2b0 Compare August 9, 2022 07:14
@wenyingd wenyingd requested a review from tnqn August 9, 2022 07:17
tnqn
tnqn previously approved these changes Aug 16, 2022
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Aug 16, 2022

1. Maintain a variable for Switch connection, and close the previous
   one when reconnecting to Switch. This can ensure only one unix
   socket is used between Controller and Switch.
2. Add timeout when sending message to Switch or waiting for message
   from Switch, to avoid Controller entering blocking state in a long
   time.
3. Remove a duplicated SwitchFeatures request, and send SwitchConfig
   and Controller setting message when switch is connected.
4. Resolve OpenFlow version misconfiguration bugs in Hello message,
   and PortModification message.

Signed-off-by: wenyingd <[email protected]>
@wenyingd
Copy link
Author

wenyingd commented Aug 17, 2022

Same comment as https://github.com/antrea-io/antrea/pull/4091/files#r946697624

I revert the change of removing group deletion in this patch, and move the call into function switchConnected, so that we can trigger a reconnect if this cleanup is failed since it is the initial logic for a new connection. And a separate patch would be used to optimize this part.

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants