Support NetworkPolicy Status for AntreaPolicy #1442
Conversation
Thanks for your PR. The following commands are available:
Codecov Report
@@            Coverage Diff             @@
##           master    #1442      +/-   ##
==========================================
+ Coverage   62.50%   63.24%   +0.73%
==========================================
  Files         167      170       +3
  Lines       13969    14250     +281
==========================================
+ Hits         8732     9012     +280
+ Misses       4325     4300      -25
- Partials      912      938      +26
Flags with carried forward coverage won't be shown.
Force-pushed from 3309116 to 81a7cbe
Force-pushed from e9220e0 to 4c6a829
/test-all
/test-all
@@ -115,6 +115,11 @@ func run(o *Options) error {
		appliedToGroupStore,
		networkPolicyStore)

	var networkPolicyStatusController *networkpolicy.StatusController
	if features.DefaultFeatureGate.Enabled(features.AntreaPolicy) {
		networkPolicyStatusController = networkpolicy.NewStatusController(crdClient, networkPolicyStore, cnpInformer, anpInformer)
I know we discussed this before and decided to reuse the AntreaPolicy gate. But do you think realization status will introduce much overhead or not?
I don't think it has expensive CPU overhead. On each Node, say with 500 Policies applied to it, it's just 500*k int comparisons. On antrea-controller, say with 10,000 Policies in total and 100 NodeStatuses per Policy, it's 1,000,000 simple comparisons and int increments even after restarting the controller; I feel that can be done in a few seconds, or maybe less. I could write a benchmark for the syncHandler later.
I'm not very sure about the network overhead between agent and controller: though each message is small, it needs timely updates, and I'm not sure how the apiserver performs in this situation. I will run a scale test and paste the result here.
The network overhead between controller and kube-apiserver should be small, as we only update when the status changes.
I have added benchmark tests for the syncHandlers. On the controller side, syncing a policy that spans 1000 Nodes takes only 70us (networking calls not counted), so even 10,000 policies with such a span take only 700ms of calculation. On the agent side, syncing a policy that has 100 rules takes 47us (networking calls not counted), so 500 policies take only 24ms of calculation. Does that eliminate your concern about overhead?
These numbers don't look bad. What do you think about memory?
BTW, I assume you will capture your benchmark numbers somewhere for later reference?
For antrea-controller, it adds the map below. Each policy status is about (20+8) bytes, so 10,000 policies with 100 Nodes per policy may take 28MB; counting the length of the keys and the hash table structures, maybe around 80MB?
// statuses is a nested map that keeps the realization statuses reported by antrea-agents.
// The outer map's keys are the NetworkPolicy keys. The inner map's keys are the Node names. The inner map's values
// are statuses reported by each Node for a NetworkPolicy.
statuses map[string]map[string]*controlplane.NetworkPolicyNodeStatus
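For illustration, the per-policy work over this map is just one integer comparison per Node. Below is a minimal self-contained sketch, assuming NetworkPolicyNodeStatus carries the realized generation; statusAggregator and countRealizedNodes are hypothetical names, not the actual implementation.
package main

import (
	"fmt"
	"sync"
)

// nodeStatus is a local stand-in for controlplane.NetworkPolicyNodeStatus,
// assumed to carry the Node name and the realized generation.
type nodeStatus struct {
	NodeName   string
	Generation int64
}

// statusAggregator is a hypothetical stand-in for the part of StatusController
// that owns the nested statuses map described above.
type statusAggregator struct {
	sync.RWMutex
	statuses map[string]map[string]*nodeStatus // policy key -> Node name -> status
}

// countRealizedNodes counts the Nodes whose reported generation matches the
// policy's desired generation: one int comparison per Node, which is why
// 10,000 policies with 100 Nodes each stays cheap.
func (a *statusAggregator) countRealizedNodes(key string, desiredGeneration int64) (realized, total int) {
	a.RLock()
	defer a.RUnlock()
	for _, s := range a.statuses[key] {
		if s.Generation == desiredGeneration {
			realized++
		}
	}
	return realized, len(a.statuses[key])
}

func main() {
	a := &statusAggregator{statuses: map[string]map[string]*nodeStatus{
		"ns/policy-a": {
			"node-1": {NodeName: "node-1", Generation: 2},
			"node-2": {NodeName: "node-2", Generation: 1},
		},
	}}
	realized, total := a.countRealizedNodes("ns/policy-a", 2)
	fmt.Printf("%d/%d Nodes realized\n", realized, total) // prints "1/2 Nodes realized"
}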
For antrea-agent, it adds a storage of the struct below, each instance of which takes 16+36 bytes. 500 policies with 10 rules per policy may take 260KB; the whole storage should use less than 1MB.
type realizedRule struct {
ruleID string
policyID types.UID
}
The benchmark numbers are in the comments of their tests.
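For reference, the syncHandler benchmarks mentioned above would typically follow the standard Go testing.B pattern, roughly like this sketch; newTestController and addPolicySpanningNodes are hypothetical helpers, not the actual test code.
package networkpolicy

import "testing"

// BenchmarkSyncHandler sketches how a syncHandler benchmark could be
// structured; the hypothetical helpers set up a controller with a policy
// spanning 1000 Nodes, matching the scenario discussed above.
func BenchmarkSyncHandler(b *testing.B) {
	c := newTestController()
	key := addPolicySpanningNodes(c, 1000)
	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if err := c.syncHandler(key); err != nil {
			b.Fatal(err)
		}
	}
}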
c.ruleGenerationsLock.RLock()
defer c.ruleGenerationsLock.RUnlock()
for _, rule := range rules {
	ruleGeneration, exists := c.ruleGenerations[rule.ID]
Question: could we save the rule generation in ruleCache, so we wouldn't need a separate ruleGenerations map?
And could we maintain a counter of rules that match the expected generation, so we wouldn't need to check all rules here?
For the first question, we already save the rule generation in ruleCache, but it's the desired one; we need a place to store the realized one. Or do you mean adding another field to the rule struct for the realized generation?
For the second question, a rule can be realized multiple times if its associated groups are changed, so we cannot just increase the counter when a rule is realized; we need to know whether it has been realized before. Maybe we could use a set to maintain the rules that match the expectation? Combining with your first question, maybe the struct could be:
type policyStatus struct {
	expectedGeneration int64
	expectedRuleNum    int
	realizedRuleSet    sets.String
}
policyStatusMap map[string]policyStatus
So when calculating the status to report, it just checks expectedRuleNum == len(realizedRuleSet). What do you think?
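A minimal self-contained sketch of that bookkeeping, assuming a hypothetical markRuleRealized helper (sets.String is from k8s.io/apimachinery/pkg/util/sets):
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

type policyStatus struct {
	expectedGeneration int64
	expectedRuleNum    int
	realizedRuleSet    sets.String
}

// markRuleRealized records a realized rule; using a set makes re-realization
// of the same rule (e.g. after a group update) idempotent, so nothing is
// counted twice.
func (s *policyStatus) markRuleRealized(ruleID string) {
	s.realizedRuleSet.Insert(ruleID)
}

// realized reports whether every expected rule has been realized.
func (s *policyStatus) realized() bool {
	return s.expectedRuleNum == len(s.realizedRuleSet)
}

func main() {
	s := &policyStatus{expectedGeneration: 3, expectedRuleNum: 2, realizedRuleSet: sets.NewString()}
	s.markRuleRealized("rule-a")
	s.markRuleRealized("rule-a") // duplicate realization: still one entry
	fmt.Println(s.realized())    // false
	s.markRuleRealized("rule-b")
	fmt.Println(s.realized())    // true
}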
Yes, I meant a separate field for the realized generation in the rule struct.
a rule can be realized multiple times if its associated groups are changed, so we cannot just increase the counter when a rule is realized; we need to know whether it has been realized before.
If we save the realized generation in the rule struct, we can check the current value of the field to decide whether or not to increment the counter?
I realized the rule actually shouldn't be generation-related, because rules are immutable; otherwise changing rule B could cause rule A to be updated unnecessarily (if the policy generation were counted into the hash value). When we make a change to an existing rule, it's treated as a new rule being added and an old rule being deleted. The policy should be considered realized when all of its current rules are realized and all of its removed rules are uninstalled. Changed the implementation in the latest patch.
if np.SourceRef.Type == controlplane.K8sNetworkPolicy {
	continue
}
c.queue.Add(np.Name)
Is it true we only care about span changes for an update?
No, we also care about generation change, in which case the observedGeneration and phase should be updated.
Could you add a comment to indicate what changes we care about here?
Force-pushed from b2e17a1 to c7887d7
Force-pushed from 8d707dd to fe2f724
for _, r := range actualRules {
	actualRuleSet.Insert(r.(*realizedRule).ruleID)
}
if !desiredRuleSet.Equal(actualRuleSet) {
Is it faster than looking up desiredRuleSet for every actualRule in line 182?
Great suggestion. The improvement is as below:
Before:
47754 ns/op 15320 B/op 23 allocs/op
After:
37734 ns/op 10088 B/op 15 allocs/op
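Presumably the optimized version skips building actualRuleSet entirely and checks membership directly, along these lines (a sketch with assumed names, reusing the realizedRule struct shown earlier, not the exact patch):
// allRulesRealized sketches the optimized check: with unique rule IDs, equal
// lengths plus membership of every actual rule in desiredRuleSet implies the
// two sets are equal, without allocating a second set.
func allRulesRealized(actualRules []interface{}, desiredRuleSet sets.String) bool {
	if len(actualRules) != desiredRuleSet.Len() {
		return false // some rules missing, or stale rules not yet uninstalled
	}
	for _, r := range actualRules {
		if !desiredRuleSet.Has(r.(*realizedRule).ruleID) {
			return false
		}
	}
	return true
}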
func (c *StatusController) updateCNP(old, cur interface{}) {
	curCNP := cur.(*secv1alpha1.ClusterNetworkPolicy)
	oldCNP := old.(*secv1alpha1.ClusterNetworkPolicy)
	if oldCNP.Status == curCNP.Status {
Why watch CNP/ANP changes and check their Status here? Is it for the case some other clients change the Status field?
It's not designed for the case of other clients changing the status field, but it can handle it.
The main consideration is that we need a source of truth for the status so that we can avoid making dummy CRD update calls.
There are two options. The first is to get it right before updating, which is clearly not worthwhile and doesn't reduce the number of API calls.
The other option is the CRD Lister we already have for the networkpolicycontroller. However, the Lister is not always in sync with the K8s API. For example, if a policy's CurrentNodesRealized changes from 2 -> 3 -> 2 quickly: in the first round we want to update it to 3, find it's 2 in the Lister, and call the update API; in the second round we want to update it to 2, find it's also 2 in the Lister (quite possible, because there is some delay before the first update is reflected in the Lister), and skip the update API call, wrongly leaving the status in the API at 3.
Since the Lister is our source of truth for the status, if it changes we need to resync the status. This is also explained in the comments at L82-L85.
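In other words, the sync path compares the freshly computed status with the copy in the Lister and only calls the update API on a difference. A minimal self-contained sketch of that flow (cnpStatus and its field set are assumptions inferred from this thread, not the actual types):
package main

import "fmt"

// cnpStatus is a local stand-in for the CRD status; the fields follow the
// ones mentioned in this conversation, the exact set is an assumption.
type cnpStatus struct {
	observedGeneration   int64
	currentNodesRealized int
	desiredNodesRealized int
}

// syncStatus sketches the flow: the Lister copy is the source of truth for
// what kube-apiserver currently holds, so the update call is made only when
// the desired status differs from it.
func syncStatus(listerCopy, desired cnpStatus, update func(cnpStatus)) {
	if listerCopy == desired {
		return // avoid a dummy CRD update call
	}
	update(desired)
}

func main() {
	lister := cnpStatus{observedGeneration: 2, currentNodesRealized: 2, desiredNodesRealized: 3}
	desired := cnpStatus{observedGeneration: 2, currentNodesRealized: 3, desiredNodesRealized: 3}
	syncStatus(lister, desired, func(s cnpStatus) { fmt.Printf("updating status to %+v\n", s) })
}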
Could we compare with the internal state instead of CRD Status to know if an update is needed?
By internal state, do you mean a "desired" value of the CR's status? We don't have separate storage for that and use the Lister as the source of truth for the data in the apiserver. I guess we would need to maintain the internal state at the end of syncHandler and use it in the update eventhandler to prevent unnecessary tasks from being enqueued?
I'm not sure it could solve the above problem: in the second round of the above example, if the controller hasn't yet received the CR update event from 2 to 3, the syncHandler that wants to update the CR status from 3 to 2 won't make the update API call, because the CR status in the Lister is already 2. If the update event arrives before it updates the internal state from 3 to 2, the eventhandler will also skip enqueuing, because it thinks the latest CR status, 3, matches the internal state, 3. Or would updating the internal state before making the API call prevent this problem?
I think the point is not to prevent tasks from being enqueued, but to prevent unnecessary K8s API calls; we should check the desired status in syncHandler before updating the CR status.
But I think you still mean you want to cover the case where the status is updated by other clients, so you want to watch CR status changes here?
I think the problem is that we are not able to attach information to the event (like whether it is a status change, a span change, or a generation change), so we probably do a lot of unnecessary computation.
But anyway, would you add some comments to explain the reasoning for handling status changes (in my mind, if we don't need to consider other clients, we can check the last desired state in syncHandler)?
I think my point about the internal status vs the CR status is that we already have the latter. The former would be somewhat redundant with it and cannot handle the restart case or the case of other clients' updates.
I just do not like the fact that every CR status update will trigger a new event -> status recomputation. In a sense we double the computation?
You are correct that it needs to compute the status another time after a successful update. This is an inefficiency shared by most K8s controllers.
I could add storage to maintain the expected status of the CR and use it in the eventHandler, only enqueuing the policy when the new status doesn't match the expected status. What do you think?
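A sketch of that idea (names assumed; cnpStatus as in the earlier sketch, sync from the standard library): remember the status last written by this controller and only enqueue when the observed status diverges from it.
// expectedStatuses sketches the proposed optimization: remember the status
// this controller last wrote per policy, and have the update eventhandler
// enqueue a policy only when the observed status doesn't match it.
type expectedStatuses struct {
	sync.Mutex
	m map[string]cnpStatus // policy key -> last written status
}

func (e *expectedStatuses) shouldEnqueue(key string, observed cnpStatus) bool {
	e.Lock()
	defer e.Unlock()
	expected, ok := e.m[key]
	// Enqueue on unknown keys (e.g. after a controller restart) or on any
	// status written by another client.
	return !ok || observed != expected
}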
How about we merge the current implementation first, then think about optimization?
I just hope you can document these design considerations for us (probably mainly for me, as you know the code well) to track and understand later.
Sure, added more comments and a TODO at L82-L88 to follow up.
Force-pushed from 896f58e to 2aefdb2
/test-all
Force-pushed from 72d445b to 8d21baa
/test-all
Just two nits.
/test-all
This patch does the following to support NetworkPolicy Status:
Add "status" field to Antrea ClusterNetworkPolicy and Antrea NetworkPolicy CRD.
Add subresource API "status" to controlplane NetworkPolicy API.
Each antrea-agent reports controlplane NetworkPolicies' realization status on its own Node to antrea-controller.
antrea-controller calculates the aggregated status and syncs it with kube-apiserver.
Closes #1257
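Based on the fields discussed in this thread (phase, observedGeneration, currentNodesRealized/desiredNodesRealized), the added status could look roughly like the sketch below; the exact committed type may differ.
// NetworkPolicyStatus sketches the "status" field added to the Antrea
// ClusterNetworkPolicy and NetworkPolicy CRDs; the field set is inferred from
// this conversation and is not necessarily the exact committed type.
type NetworkPolicyStatus struct {
	// Phase is the realization phase of the policy, e.g. Realizing or Realized.
	Phase string `json:"phase"`
	// ObservedGeneration is the generation observed by antrea-controller.
	ObservedGeneration int64 `json:"observedGeneration"`
	// CurrentNodesRealized is the number of Nodes that have realized the policy.
	CurrentNodesRealized int32 `json:"currentNodesRealized"`
	// DesiredNodesRealized is the number of Nodes that should realize the policy.
	DesiredNodesRealized int32 `json:"desiredNodesRealized"`
}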