Fix issues #73
Conversation
@@ -92,7 +92,13 @@ spec:
                timeout:
                  default: 240
                  type: integer
              required:
              - maxConcurrency
Why is this required?
IMHO, it should default to 1.
It was agreed in the latest discussion that we shouldn't assume what the user wants, especially since they can start an upgrade already enabled.
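For reference, the two options map to different kubebuilder markers on the API type. A minimal sketch (the type and field layout here are illustrative, not the exact PR code):

```go
// RemediationStrategySpec is a sketch; type and field names are assumptions.
type RemediationStrategySpec struct {
	// Option taken in this PR: the field is required, so validation rejects
	// a CR that omits it and the controller never has to guess a value.
	// +kubebuilder:validation:Required
	MaxConcurrency int `json:"maxConcurrency"`

	// Alternative raised above: keep the field optional and default it to 1.
	// // +kubebuilder:default=1
	// MaxConcurrency int `json:"maxConcurrency,omitempty"`
}
```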
+1
/retest
	r.Log.Info("[Reconcile]", "CR", clusterGroupUpgrade.Name)

-	err = r.validateCR(ctx, clusterGroupUpgrade)
+	reconcile, err := r.validateCR(ctx, clusterGroupUpgrade)
This is not urgent, but consider adding a new status condition type, e.g. "ClusterGroupUpgradeValidationFailed", with a proper reason, to give users more visibility for debugging and tracking the status; otherwise the user has to check the logs to find out what happened if validation failed.
Also, the validation currently runs every time reconciliation is triggered. I'm wondering whether validation is needed only before the upgrade is enabled or during the whole lifecycle of the upgrade. It comes down to whether we expect the user to update the UOCR after the upgrade is enabled.
Yes, I agree we need better handling of error scenarios. I'll add this to the list of future enhancements.
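As an illustration of the suggested condition, a minimal sketch using `meta.SetStatusCondition`; the condition type and reason strings are the reviewer's suggestion, not an API this PR adds, and the helper name is hypothetical:

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setValidationFailedCondition records the validation error in the CR status
// so users do not have to read the controller logs to see what went wrong.
func setValidationFailedCondition(conditions *[]metav1.Condition, validationErr error) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "ClusterGroupUpgradeValidationFailed", // suggested name, not final
		Status:  metav1.ConditionTrue,
		Reason:  "ValidationFailed",
		Message: validationErr.Error(),
	})
}
```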
@@ -1448,16 +1522,43 @@ func (r *ClusterGroupUpgradeReconciler) validateCR(ctx context.Context, clusterG
		}, foundManagedCluster)

		if err != nil {
-			return fmt.Errorf("Cluster %s is not a ManagedCluster", cluster)
+			return reconcile, fmt.Errorf("Cluster %s is not a ManagedCluster", cluster)
Should we check the Ready status of the ManagedCluster as well? Policies only work when the cluster is ready.
Yes, this would be good to have, but I would add it to the same list of enhancements as there is also the question of what status/action we want to take if such checks fail.
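A sketch of such a check, assuming the standard open-cluster-management Available condition is what "ready" means here (helper name is hypothetical):

```go
import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	clusterv1 "open-cluster-management.io/api/cluster/v1"
)

// checkManagedClusterReady rejects clusters whose Available condition is not
// True, since policies can only be enforced on available clusters.
func checkManagedClusterReady(mc *clusterv1.ManagedCluster) error {
	if !meta.IsStatusConditionTrue(mc.Status.Conditions, clusterv1.ManagedClusterConditionAvailable) {
		return fmt.Errorf("Cluster %s is not ready", mc.Name)
	}
	return nil
}
```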
	clusterGroupUpgrade.Spec.RemediationStrategy.MaxConcurrency = newMaxConcurrency
	err = r.Client.Update(ctx, clusterGroupUpgrade)
	if err != nil {
		r.Log.Info("Error updating Cluster Group Upgrade")
Should this be r.Log.Error?
I'm returning the error below, so this info log can just be deleted.
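A sketch of that cleanup (the surrounding function's return values are simplified for illustration): drop the Info log and let a wrapped error carry the context.

```go
if err := r.Client.Update(ctx, clusterGroupUpgrade); err != nil {
	// No separate Info log; the returned error is logged once by the caller.
	return fmt.Errorf("failed to update ClusterGroupUpgrade %s: %w", clusterGroupUpgrade.Name, err)
}
```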
	// Contains the managed policies (and the namespaces) that have NonCompliant clusters
	// that require updating.
	ManagedPoliciesForUpgrade             []ManagedPolicyForUpgrade `json:"managedPoliciesForUpgrade,omitempty"`
	ManagedPoliciesCompliantBeforeUpgrade []string                  `json:"managedPoliciesCompliantBeforeUpgrade,omitempty"`
It looks like ManagedPoliciesCompliantBeforeUpgrade is just for the record? In my opinion it would be clearer to keep a single entry for managedPolicies and put all the information there: name, namespace, and compliance status.
Yes, I think it would be a good idea. I will add it to the list as changing it now would mean even more restructuring.
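A sketch of the single-entry layout suggested above (the type name, fields, and json tags are hypothetical, not the PR's API):

```go
// ManagedPolicyStatus would carry everything about one managed policy in a
// single status entry instead of two parallel lists.
type ManagedPolicyStatus struct {
	Name                   string `json:"name"`
	Namespace              string `json:"namespace"`
	CompliantBeforeUpgrade bool   `json:"compliantBeforeUpgrade"`
}
```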
/retest
Force-pushed from 2a827b7 to 703fe81
/retest-required
Lots of really good work. A couple questions but overall looks good.
@@ -0,0 +1,38 @@
apiVersion: policy.open-cluster-management.io/v1
Is this test code? If so it would be helpful to have it under a testdata directory.
curl -k -s -X PATCH -H "Accept: application/json, */*" \
  -H "Content-Type: application/merge-patch+json" \
  http://localhost:8001/apis/ran.openshift.io/v1alpha1/namespaces/$1/clustergroupupgrades/$2/status \
  --data '{"status": {"conditions":[{"lastTransitionTime": "2021-12-15T18:55:59Z", "message": "All the clusters in the CR are compliant", "reason": "UpgradeCompleted", "status": "False", "type": "Ready"}]}}'
Minor: add a newline at the end of the file to keep GitHub from flagging it.
controllers/actions.go (Outdated)
@@ -59,7 +59,7 @@ func (r *ClusterGroupUpgradeReconciler) takeActionsAfterCompletion(
	}

	// Cleanup resources
-	if *actionsAfterCompletion.DeleteObjects {
+	if actionsAfterCompletion.DeleteObjects == nil || *actionsAfterCompletion.DeleteObjects {
When possible it is better to handle default values as the CR is being read in so that you don't have to handle all the cases throughout. This can be addressed in a later cleanup if needed.
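A sketch of that approach, assuming a helper that runs right after the CR is fetched (the exact field path to DeleteObjects and the default of true are assumptions for illustration):

```go
// setDefaults normalizes optional fields once so the rest of the controller
// can dereference pointers without repeating nil checks.
func setDefaults(cgu *ranv1alpha1.ClusterGroupUpgrade) {
	if cgu.Spec.Actions.AfterCompletion.DeleteObjects == nil {
		deleteObjects := true // assumed default: clean up resources on completion
		cgu.Spec.Actions.AfterCompletion.DeleteObjects = &deleteObjects
	}
}
```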
		utils.MaxNumberOfClustersForUpgrade)

	if newMaxConcurrency != clusterGroupUpgrade.Spec.RemediationStrategy.MaxConcurrency {
		clusterGroupUpgrade.Spec.RemediationStrategy.MaxConcurrency = newMaxConcurrency
In general, we shouldn't modify a value set by the user. If the CGU is part of a GitOps flow, this update will cause the CR to not match what is in Git, and we will be stuck in a battle of updates here against the GitOps tool restoring the value. One option would be to compute the "effective" maxConcurrency and simply track and use that as an internal variable (it can be reflected in the status for visibility).
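A sketch of the "effective maxConcurrency" idea as a plain helper (the status field that would expose the result is hypothetical):

```go
// effectiveMaxConcurrency clamps the user-requested value without mutating
// the spec; the result can be tracked internally and mirrored in the status.
func effectiveMaxConcurrency(requested, clusterCount, hardLimit int) int {
	effective := requested
	if clusterCount < effective {
		effective = clusterCount
	}
	if hardLimit < effective {
		effective = hardLimit
	}
	return effective
}
```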
	cguSpec := ranv1alpha1.ClusterGroupUpgradeSpec{
-		Enable: true, // default
+		Enable: &enable,
Why is this switched to a pointer?
There is an issue in controller-runtime: projectcapsule/capsule#342, kubernetes-sigs/kubebuilder#2109.
If we try to update the CR, non-pointer boolean values are overwritten with the default value. Angie hit the same issue with deleteObjectsOnCompletion and changed it to a pointer because of that. I changed this one to avoid the same issue in the future and to be consistent.
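For context, a minimal sketch of why the pointer makes a difference (the type shown is illustrative, not the PR's API):

```go
type exampleSpec struct {
	// A plain bool cannot distinguish "omitted" from "false": an update that
	// leaves the field out decodes it to false and overwrites the stored value.
	// With *bool, an omitted field stays nil and the stored value survives.
	Enable *bool `json:"enable,omitempty"`
}
```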
	for index, crtPolicy := range clusterGroupUpgrade.Status.ManagedPoliciesForUpgrade {
		if index == policyIndex {
			return &crtPolicy
		}
	}
Is this loop to handle indexes out of range?
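If the intent is only to guard against an out-of-range index, a direct bounds check reads more clearly; a sketch (types follow the diff above, the function name is hypothetical):

```go
// getPolicyByIndex returns the policy at policyIndex, or nil if the index is
// out of range. Indexing the slice also avoids returning the address of the
// range loop variable, which points at a copy of the element.
func getPolicyByIndex(cgu *ranv1alpha1.ClusterGroupUpgrade, policyIndex int) *ranv1alpha1.ManagedPolicyForUpgrade {
	if policyIndex < 0 || policyIndex >= len(cgu.Status.ManagedPoliciesForUpgrade) {
		return nil
	}
	return &cgu.Status.ManagedPoliciesForUpgrade[policyIndex]
}
```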
@@ -416,7 +421,7 @@ func (r *ClusterGroupUpgradeReconciler) addClustersToPlacementRule(
			continue
		}

-		policyName := clusterGroupUpgrade.Name + "-" + clusterGroupUpgrade.Spec.ManagedPolicies[managedPolicyIndex]
+		policyName := clusterGroupUpgrade.Name + "-" + clusterGroupUpgrade.Status.ManagedPoliciesForUpgrade[managedPolicyIndex].Name
If the user creates two ClusterGroupUpgrade CRs with the same name in different namespaces, both remediating the same policy (perhaps for two different clusters), the generated policyName will collide.
Even if the copied policies for two different CGUs had the same name, each would be created in its own CGU's namespace, so there wouldn't be a conflict.
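A small sketch of that point: the copied policy's (namespace, name) pair always includes the CGU's namespace, so identically named CGUs in different namespaces produce distinct objects (the helper name is hypothetical):

```go
// copiedPolicyKey returns the namespace and name under which the copied
// policy is created: the CGU's own namespace plus a CGU-prefixed name.
func copiedPolicyKey(cguNamespace, cguName, policyName string) (namespace, name string) {
	return cguNamespace, cguName + "-" + policyName
}
```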
Force-pushed from 8aea548 to ccb9df7
Description:
- BZ#2042609: the policy was still copied and never deleted if all clusters were already compliant.
  - Fix: skip policies that are already compliant for all the clusters in the upgrade. Include 2 new fields in the UOCR status: managedPoliciesCompliantBeforeUpgrade (listing the policies that are compliant when the upgrade starts) and managedPoliciesForUpgrade (the policies that will be copied and enforced as part of the upgrade).
  - Fix: if the upgrade starts with all the policies compliant, it moves directly to UpgradeCompleted.
- BZ#2042502: canaries are ignored if they are not included in the clusters in the UOCR.
  - Fix: add a check in validateCR and return an error if a canary cluster is not included in the list of clusters.
- BZ#2042517: upgrade pods crashed after applying a UOCR without the remediationStrategy spec.
  - Fix: make remediationStrategy and maxConcurrency required fields. Leave the timeout default at 240 for now.
- BZ#2042601: the upgrade started when a blocking CR failed silently.
  - Fix: check the conditions of the UOCR and consider the CR as not completed if the conditions are nil.
- BZ#2040833: UOCR validation should not fail due to maxConcurrency being larger than the cluster count.
  - Fix: maxConcurrency is automatically adjusted to min(#clusters, 100, spec.remediationStrategy.maxConcurrency) if the initially requested maxConcurrency is larger than that minimum.
- Add KUTTL tests for: skipping compliant policies, blocking mechanisms, an upgrade that starts with all the policies compliant, and adjusting the maximum concurrency.
I verified a successful deployment of a cluster using full ZTP plus a custom build of this operator including these patches. Worked perfectly.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: imiller0, irinamihai, nishant-parekh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.