
Fix issues #73

Merged
merged 3 commits into from Feb 11, 2022
Conversation

@irinamihai (Collaborator) commented Jan 26, 2022

Description:

  • BZ#2042609: the policy was still copied and never deleted if all clusters were already compliant

    Fix: skip policies that all the clusters in the upgrade are already compliant with. Include 2 new fields in the UOCR status:
    managedPoliciesCompliantBeforeUpgrade (lists the policies that are compliant when the upgrade starts) and managedPoliciesForUpgrade (the policies that will be copied and enforced as part of the upgrade)
    Fix: if the upgrade starts with all the policies compliant, it moves directly to UpgradeCompleted

  • BZ#2042502: canaries are ignored if they are not included in the clusters list in the UOCR

    Fix: add a check in validateCR and return an error if a canary cluster is not included in the list of clusters

  • BZ#2042517: the upgrade pods crashed after applying a UOCR without a remediationStrategy spec

    Fix: make remediationStrategy and maxConcurrency required fields. Leave the timeout default at 240 for now

  • BZ#2042601: the upgrade started when a blocking CR had failed silently

    Fix: check the conditions of the blocking UOCR and consider the CR not completed if the conditions are nil (see the sketch after this list)

  • BZ#2040833: UOCR validation should not fail due to maxConcurrency being larger than the cluster count

    Fix: maxConcurrency is automatically adjusted to min(#clusters, 100, Spec.RemediationStrategy.maxConcurrency) if the
    initially requested maxConcurrency is larger than that minimum

  • add KUTTL tests for: skipping compliant policies, blocking mechanisms, an upgrade that starts with all the policies compliant, adjusting the maximum concurrency
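
Sketch for the blocking-CR fix above (BZ#2042601): a blocking UOCR with nil conditions is treated as not completed. This is only an illustration; the helper name and import path are assumed, the Ready/UpgradeCompleted condition names follow the status payload used elsewhere in this PR, and the status conditions are assumed to be standard metav1.Condition values.

import (
	"k8s.io/apimachinery/pkg/api/meta"

	// Import path assumed for the ClusterGroupUpgrade API package.
	ranv1alpha1 "github.com/openshift-kni/cluster-group-upgrades-operator/api/v1alpha1"
)

// isBlockingCRCompleted treats a blocking ClusterGroupUpgrade with nil
// conditions as not completed instead of letting the dependent upgrade start.
func isBlockingCRCompleted(blockingCR *ranv1alpha1.ClusterGroupUpgrade) bool {
	if blockingCR.Status.Conditions == nil {
		return false // no conditions yet: the blocking upgrade has not finished
	}
	cond := meta.FindStatusCondition(blockingCR.Status.Conditions, "Ready")
	return cond != nil && cond.Reason == "UpgradeCompleted"
}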

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 26, 2022
@@ -92,7 +92,13 @@ spec:
timeout:
default: 240
type: integer
required:
- maxConcurrency
Collaborator

Why is this required?

IMHO, it should default to 1.

Collaborator Author

It was agreed in the latest discussion that we shouldn't assume what the user wants, especially since they can start an upgrade already enabled.


+1

@irinamihai (Collaborator Author)

/retest

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 27, 2022
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 27, 2022
r.Log.Info("[Reconcile]", "CR", clusterGroupUpgrade.Name)

- err = r.validateCR(ctx, clusterGroupUpgrade)
+ reconcile, err := r.validateCR(ctx, clusterGroupUpgrade)
Contributor

This is not urgent, but consider having a new status type, e.g. "ClusterGroupUpgradeValidationFailed", with a proper reason displayed, so users have more visibility to debug and track the status; otherwise, the user has to check the log to know what happened if validation failed.
Also, currently the validation happens every time the reconciliation is triggered. Thinking about whether validation is needed only before the upgrade is enabled or during the whole lifecycle of the upgrade. It comes down to the question: do we expect the user to update the UOCR after the upgrade is enabled?

Collaborator Author

Yes, I agree we need better handling of error scenarios. I'll add this to the list of future enhancements.
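
A rough sketch of the suggested condition, inside the controllers package (the condition type/reason names and the helper are illustrative, not what this PR merges, and it assumes the reconciler embeds a controller-runtime client; it uses "k8s.io/apimachinery/pkg/api/meta" and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"):

// setValidationFailedCondition surfaces a validation failure in the CR status
// so the user does not have to read the controller log to find out what failed.
func (r *ClusterGroupUpgradeReconciler) setValidationFailedCondition(
	ctx context.Context, cgu *ranv1alpha1.ClusterGroupUpgrade, validationErr error) error {
	meta.SetStatusCondition(&cgu.Status.Conditions, metav1.Condition{
		Type:    "Ready",
		Status:  metav1.ConditionFalse,
		Reason:  "ClusterGroupUpgradeValidationFailed",
		Message: validationErr.Error(),
	})
	return r.Status().Update(ctx, cgu)
}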

@@ -1448,16 +1522,43 @@ func (r *ClusterGroupUpgradeReconciler) validateCR(ctx context.Context, clusterG
}, foundManagedCluster)

if err != nil {
- return fmt.Errorf("Cluster %s is not a ManagedCluster", cluster)
+ return reconcile, fmt.Errorf("Cluster %s is not a ManagedCluster", cluster)
Contributor

Should we check the Ready status of the ManagedCluster as well, since policies only work when the cluster is ready?

Collaborator Author

Yes, this would be good to have, but I would add it to the same list of enhancements as there is also the question of what status/action we want to take if such checks fail.
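
For the record, the check itself could be as small as the sketch below, assuming the ManagedCluster is read through the open-cluster-management.io cluster v1 API (the helper name is made up):

import (
	"k8s.io/apimachinery/pkg/api/meta"
	clusterv1 "open-cluster-management.io/api/cluster/v1"
)

// isManagedClusterAvailable reports whether the cluster's agent is reachable;
// policies can only be enforced on clusters in this state.
func isManagedClusterAvailable(mc *clusterv1.ManagedCluster) bool {
	return meta.IsStatusConditionTrue(mc.Status.Conditions, clusterv1.ManagedClusterConditionAvailable)
}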

clusterGroupUpgrade.Spec.RemediationStrategy.MaxConcurrency = newMaxConcurrency
err = r.Client.Update(ctx, clusterGroupUpgrade)
if err != nil {
r.Log.Info("Error updating Cluster Group Upgrade")
Contributor

Should this be r.Log.Error?

Collaborator Author

I'm returning the error below, so this info log can just be deleted.

// Contains the managed policies (and the namespaces) that have NonCompliant clusters
// that require updating.
ManagedPoliciesForUpgrade []ManagedPolicyForUpgrade `json:"managedPoliciesForUpgrade,omitempty"`
ManagedPoliciesCompliantBeforeUpgrade []string `json:"managedPoliciesCompliantBeforeUpgrade,omitempty"`
Contributor

Looks like ManagedPoliciesCompliantBeforeUpgrade is just for the record? In my opinion, it would be clearer to keep a single managedPolicies entry and have all the information there: name, namespace, and compliance status.

Collaborator Author

Yes, I think it would be a good idea. I will add it to the list as changing it now would mean even more restructuring.

@irinamihai (Collaborator Author)

/retest

@irinamihai irinamihai force-pushed the BZs_4.10 branch 2 times, most recently from 2a827b7 to 703fe81 on February 2, 2022 00:30
@irinamihai (Collaborator Author)

/retest-required

1 similar comment
@irinamihai (Collaborator Author)

/retest-required

@imiller0 (Contributor) left a comment

Lots of really good work. A couple of questions, but overall it looks good.

@@ -0,0 +1,38 @@
apiVersion: policy.open-cluster-management.io/v1
Contributor

Is this test code? If so it would be helpful to have it under a testdata directory.

curl -k -s -X PATCH -H "Accept: application/json, */*" \
-H "Content-Type: application/merge-patch+json" \
http://localhost:8001/apis/ran.openshift.io/v1alpha1/namespaces/$1/clustergroupupgrades/$2/status \
--data '{"status": {"conditions":[{"lastTransitionTime": "2021-12-15T18:55:59Z", "message": "All the clusters in the CR are compliant", "reason": "UpgradeCompleted", "status": "False", "type": "Ready"}]}}'
Contributor

Minor: add newlines at the end of the files to keep GitHub from flagging them.

@@ -59,7 +59,7 @@ func (r *ClusterGroupUpgradeReconciler) takeActionsAfterCompletion(
}

// Cleanup resources
- if *actionsAfterCompletion.DeleteObjects {
+ if actionsAfterCompletion.DeleteObjects == nil || *actionsAfterCompletion.DeleteObjects {
Contributor

When possible it is better to handle default values as the CR is being read in so that you don't have to handle all the cases throughout. This can be addressed in a later cleanup if needed.
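
A sketch of that approach, applied once right after the CR is fetched (the exact spec field path for the deleteObjects flag is assumed here):

// setDefaults normalizes optional fields as the CR is read in so the rest of
// the reconciler can dereference pointers without repeating nil checks.
func setDefaults(cgu *ranv1alpha1.ClusterGroupUpgrade) {
	if cgu.Spec.Actions.AfterCompletion.DeleteObjects == nil {
		deleteObjects := true // assumed default: clean up the copied objects on completion
		cgu.Spec.Actions.AfterCompletion.DeleteObjects = &deleteObjects
	}
}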

utils.MaxNumberOfClustersForUpgrade)

if newMaxConcurrency != clusterGroupUpgrade.Spec.RemediationStrategy.MaxConcurrency {
clusterGroupUpgrade.Spec.RemediationStrategy.MaxConcurrency = newMaxConcurrency
Contributor

In general, we shouldn't modify the value set by the user. If the CGU is part of a GitOps flow, this update will cause the CR to no longer match what is in Git, and we will be stuck in a battle of updates here against the GitOps tools restoring the value. One option would be to compute the "effective" maxConcurrency and simply track and use that as an internal variable (it can be reflected in the status for visibility).
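
Something along these lines, i.e. a derived value instead of a spec update (the function name is made up; utils.MaxNumberOfClustersForUpgrade is the 100-cluster cap used in this PR):

// calculateEffectiveMaxConcurrency returns min(#clusters, cap, requested)
// without touching the user-supplied spec; the result can be reported in the
// status for visibility.
func calculateEffectiveMaxConcurrency(cgu *ranv1alpha1.ClusterGroupUpgrade, clusterCount int) int {
	effective := clusterCount
	if effective > utils.MaxNumberOfClustersForUpgrade {
		effective = utils.MaxNumberOfClustersForUpgrade
	}
	if requested := cgu.Spec.RemediationStrategy.MaxConcurrency; requested > 0 && requested < effective {
		effective = requested
	}
	return effective
}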

cguSpec := ranv1alpha1.ClusterGroupUpgradeSpec{
- Enable: true, // default
+ Enable: &enable,
Contributor

Why is this switched to a pointer?

Collaborator Author

There are known issues in controller-runtime: projectcapsule/capsule#342, kubernetes-sigs/kubebuilder#2109.
When we try to update the CR, non-pointer boolean values are overwritten with the default value. Angie had the same issue with deleteObjectsOnCompletion and changed it to a pointer because of that. I changed this one to avoid the same issue in the future and also to be consistent.
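
In short, the pointer lets the API machinery distinguish "not set" from "explicitly false", so an Update does not silently reset the field. A trimmed illustration (field set and defaults omitted):

type ClusterGroupUpgradeSpec struct {
	// Enable is a pointer so that an unset value survives CR updates instead
	// of being overwritten by the schema default.
	Enable *bool `json:"enable,omitempty"`
	// ... remaining fields elided
}

// isEnabled is a nil-safe accessor for the pointer field.
func isEnabled(spec ClusterGroupUpgradeSpec) bool {
	return spec.Enable != nil && *spec.Enable
}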

Comment on lines +11 to +15
for index, crtPolicy := range clusterGroupUpgrade.Status.ManagedPoliciesForUpgrade {
	if index == policyIndex {
		return &crtPolicy
	}
}
Contributor

Is this loop to handle indexes out of range?
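
If bounds are the only concern, a direct index check is simpler (note that the range loop returns a pointer to a copy of the element, while indexing aliases the slice entry). A possible alternative, with names taken from this PR's status fields:

// getPolicyByIndex returns nil when the index is out of range.
func getPolicyByIndex(cgu *ranv1alpha1.ClusterGroupUpgrade, policyIndex int) *ranv1alpha1.ManagedPolicyForUpgrade {
	if policyIndex < 0 || policyIndex >= len(cgu.Status.ManagedPoliciesForUpgrade) {
		return nil
	}
	return &cgu.Status.ManagedPoliciesForUpgrade[policyIndex]
}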

@@ -416,7 +421,7 @@ func (r *ClusterGroupUpgradeReconciler) addClustersToPlacementRule(
continue
}

- policyName := clusterGroupUpgrade.Name + "-" + clusterGroupUpgrade.Spec.ManagedPolicies[managedPolicyIndex]
+ policyName := clusterGroupUpgrade.Name + "-" + clusterGroupUpgrade.Status.ManagedPoliciesForUpgrade[managedPolicyIndex].Name
Contributor

If the user creates two ClusterGroupUpgrade CRs with the same name but in different namespaces, and they remediate the same policy (but maybe for two different clusters), they will collide on this policyName.

Collaborator Author

Even if two copied policies for two different CGUs had the same name, they would be created in each CGU's own namespace, so we wouldn't have a conflict.

@irinamihai irinamihai force-pushed the BZs_4.10 branch 7 times, most recently from 8aea548 to ccb9df7 on February 9, 2022 16:59

@imiller0 (Contributor) left a comment

I verified successful deployment of a cluster using full ZTP plus a custom build of this operator including these patches. Worked perfectly.
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2022
@openshift-ci openshift-ci bot commented Feb 11, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: imiller0, irinamihai, nishant-parekh

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [imiller0,irinamihai,nishant-parekh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 04dccf2 into openshift-kni:main Feb 11, 2022
@irinamihai irinamihai deleted the BZs_4.10 branch April 19, 2022 20:21