Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NetworkPolicy span calculation #5554

Merged
merged 1 commit into from
Oct 13, 2023
Merged

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Oct 7, 2023

A NetworkPolicy's span is calculated in internalNetworkPolicyWorker, based on the span of the AppliedToGroups it refers to, while the span of AppliedToGroup is calculated in appliedToGroupWorker which runs in parallel with internalNetworkPolicyWorker. It could happen that the calcuated span is out of date if AppliedToGroups' span is updated after internalNetworkPolicyWorker calculates a NetworkPolicy's span, and the NetworkPolicy wouldn't be enqueued for another sync if it's not committed to the storage yet.

On the other hand, if we commit the NetworkPolicy to the storage before calculating the NetworkPolicy's span, it would have to use a stale span first and might need to update the NetworkPolicy twice and generate two update events in one sync.

To fix the issue without generating extra events, we introduce a separate subscription mechanism that allows subscribing to update of AppliedToGroup for NetworkPolicy. With the subscription, we can still calculate the NetworkPolicy's span first, then commit it to the storage. If any of the subscribed AppliedToGroups are updated, the NetworkPolicy will be notified and resynced.

Fixes #5553

@tnqn tnqn added area/network-policy Issues or PRs related to network policies. action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Oct 7, 2023
@tnqn
Copy link
Member Author

tnqn commented Oct 7, 2023

/test-all
/test-ipv6-all
/test-ipv6-only-all
/test-windows-all

@tnqn tnqn marked this pull request as draft October 9, 2023 16:25
@antoninbas
Copy link
Contributor

@tnqn you marked this as as draft, so you no longer need an immediate review?

@tnqn
Copy link
Member Author

tnqn commented Oct 10, 2023

@tnqn you marked this as as draft, so you no longer need an immediate review?

Yes, I'm considering another fix, I will mark it as ready after I finalize the solution.

@tnqn tnqn force-pushed the fix-networkpolicy-span branch 2 times, most recently from 138860d to cabc71d Compare October 11, 2023 07:25
@tnqn
Copy link
Member Author

tnqn commented Oct 11, 2023

/test-all
/test-ipv6-all
/test-ipv6-only-all

@tnqn tnqn marked this pull request as ready for review October 11, 2023 09:46
@tnqn
Copy link
Member Author

tnqn commented Oct 11, 2023

@antoninbas it's ready for review now.

antoninbas
antoninbas previously approved these changes Oct 11, 2023
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@jianjuns jianjuns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question - can we add a flag to a NP object in storage to indicate the span is calculated or not (and store the object before span calculation)?


import "sync"

// subscriber notifies multiple subscribers about any events that happen to the object they have subscribed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object -> objects

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

import "sync"

// subscriber notifies multiple subscribers about any events that happen to the object they have subscribed.
type subscriber struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call it something like subscriptionService?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it's more common in Go to name objects with an -er suffix, but I understand subscriber is somewhat confusing. I have renamed it to notifier, it should make sense to subscribe to a notifier and send a message via a notifier, what do you think?

@Dyanngg
Copy link
Contributor

Dyanngg commented Oct 11, 2023

Change LGTM. I'm assuming by the definition of the issue that this only happens for new internalNPs? For internalNPs update events this won't happen as they are already in store

@tnqn
Copy link
Member Author

tnqn commented Oct 12, 2023

Question - can we add a flag to a NP object in storage to indicate the span is calculated or not (and store the object before span calculation)?

I considered adding a flag but think it is more complicated and risky than decoupling notification mechanism from the storage:

  1. The issue doesn't just happen when processing a new NP, it could also happen when an existing NP is changed to use another AppliedToGroup. It commits an intermediate state of a NP to the storage, how to handle the object for the storage's readers, i.e. the watchers, should it be sent to a node based on the span or not? If yes, a NP may be sent to a Node it shouldn't do because the span may be outdated. If no, a NP may be deleted from a Node when it happens to reconnect the controller.
  2. Currently the span calculation of NP doesn't need the internalNetworkPolicyMutex, which is used to ensure NetworkPolicy's indices and AppliedToGroups/AddressGroups are updated atomically. If we store the object first, then calculate the span, then store it again, we have to acquire the mutex twice, or include a CPU-intense process in the critical section.

A NetworkPolicy's span is calculated in internalNetworkPolicyWorker,
based on the span of the AppliedToGroups it refers to, while the span of
AppliedToGroup is calculated in appliedToGroupWorker which runs in
parallel with internalNetworkPolicyWorker. It could happen that the
calcuated span is out of date if AppliedToGroups' span is updated after
internalNetworkPolicyWorker calculates a NetworkPolicy's span, and the
NetworkPolicy wouldn't be enqueued for another sync if it's not
committed to the storage yet.

On the other hand, if we commit the NetworkPolicy to the storage before
calculating the NetworkPolicy's span, it would have to use a stale span
first and might need to update the NetworkPolicy twice and generate two
update events in one sync.

To fix the issue without generating extra events, we introduce a
separate subscription mechanism that allows subscribing to update of
AppliedToGroup for NetworkPolicy. With the subscription, we can still
calculate the NetworkPolicy's span first, then commit it to the storage.
If any of the subscribed AppliedToGroups are updated, the NetworkPolicy
will be notified and resynced.

Signed-off-by: Quan Tian <[email protected]>
@tnqn
Copy link
Member Author

tnqn commented Oct 12, 2023

Change LGTM. I'm assuming by the definition of the issue that this only happens for new internalNPs? For internalNPs update events this won't happen as they are already in store

It could also happen if a NP is updated to use a new AppliedToGroup.

@jianjuns
Copy link
Contributor

Your points make sense. @tnqn

@tnqn
Copy link
Member Author

tnqn commented Oct 13, 2023

/test-all

@tnqn tnqn merged commit c2115fd into antrea-io:main Oct 13, 2023
49 of 57 checks passed
@tnqn tnqn deleted the fix-networkpolicy-span branch October 13, 2023 04:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/network-policy Issues or PRs related to network policies. kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NetworkPolicy's span may be outdated in some cases
4 participants