
Fix issues #73

Merged
merged 3 commits into from Feb 11, 2022
Conversation

@irinamihai (Collaborator) commented Jan 26, 2022

Description:

  • BZ#2042609: the policy was still copied and never deleted if all clusters were already compliant

    Fix: skip policies that all the clusters in the upgrade are already compliant with. Include 2 new fields in the UOCR status:
    managedPoliciesCompliantBeforeUpgrade (lists the policies that are compliant when the upgrade starts) and managedPoliciesForUpgrade (the policies that will be copied and enforced as part of the upgrade)
    Fix: if the upgrade starts with all the policies compliant, it moves directly to UpgradeCompleted

  • BZ#2042502: canaries are ignored if they are not included in the clusters list in the UOCR

    Fix: add a check in validateCR and return an error if a canary cluster is not included in the list of clusters

  • BZ#2042517: the upgrade pods crashed after applying a UOCR without a remediationStrategy spec

    Fix: make remediationStrategy and maxConcurrency required fields. Leave the timeout default at 240 for now

  • BZ#2042601: the upgrade started when a blocking CR had failed silently

    Fix: check the conditions of the blocking UOCR and consider the CR not completed if the conditions are nil (see the sketch after this list)

  • BZ#2040833: UOCR validation should not fail due to maxConcurrency being larger than the cluster count

    Fix: maxConcurrency is automatically adjusted to min(#clusters, 100, Spec.RemediationStrategy.maxConcurrency) if the
    initially requested maxConcurrency is larger than that minimum

  • add KUTTL tests for: skipping compliant policies, blocking mechanisms, an upgrade that starts with all the policies compliant, adjusting the maximum concurrency
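
Sketch for the blocking-CR fix above (BZ#2042601): a blocking UOCR with nil conditions is treated as not completed. This is only an illustration; the helper name and import path are assumed, the Ready/UpgradeCompleted condition names follow the status payload used elsewhere in this PR, and the status conditions are assumed to be standard metav1.Condition values.

import (
	"k8s.io/apimachinery/pkg/api/meta"

	// Import path assumed for the ClusterGroupUpgrade API package.
	ranv1alpha1 "github.com/openshift-kni/cluster-group-upgrades-operator/api/v1alpha1"
)

// isBlockingCRCompleted treats a blocking ClusterGroupUpgrade with nil
// conditions as not completed instead of letting the dependent upgrade start.
func isBlockingCRCompleted(blockingCR *ranv1alpha1.ClusterGroupUpgrade) bool {
	if blockingCR.Status.Conditions == nil {
		return false // no conditions yet: the blocking upgrade has not finished
	}
	cond := meta.FindStatusCondition(blockingCR.Status.Conditions, "Ready")
	return cond != nil && cond.Reason == "UpgradeCompleted"
}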

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 26, 2022
@@ -92,7 +92,13 @@ spec:
timeout:
default: 240
type: integer
required:
- maxConcurrency
Collaborator

Why is this required?

IMHO, it should default to 1.

Collaborator Author

It was agreed in the latest discussion that we shouldn't assume what the user wants, especially since they can start an upgrade already enabled.


+1

@irinamihai (Collaborator Author)

/retest

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 27, 2022
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 27, 2022
r.Log.Info("[Reconcile]", "CR", clusterGroupUpgrade.Name)

- err = r.validateCR(ctx, clusterGroupUpgrade)
+ reconcile, err := r.validateCR(ctx, clusterGroupUpgrade)
Contributor

This is not urgent, but consider having a new status type, e.g. "ClusterGroupUpgradeValidationFailed", with a proper reason displayed, so users have more visibility to debug and track the status; otherwise, the user has to check the log to know what happened if validation failed.
Also, currently the validation happens every time the reconciliation is triggered. Thinking about whether validation is needed only before the upgrade is enabled or during the whole lifecycle of the upgrade. It comes down to the question: do we expect the user to update the UOCR after the upgrade is enabled?

Collaborator Author

Yes, I agree we need better handling of error scenarios. I'll add this to the list of future enhancements.
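
A rough sketch of the suggested condition, inside the controllers package (the condition type/reason names and the helper are illustrative, not what this PR merges, and it assumes the reconciler embeds a controller-runtime client; it uses "k8s.io/apimachinery/pkg/api/meta" and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"):

// setValidationFailedCondition surfaces a validation failure in the CR status
// so the user does not have to read the controller log to find out what failed.
func (r *ClusterGroupUpgradeReconciler) setValidationFailedCondition(
	ctx context.Context, cgu *ranv1alpha1.ClusterGroupUpgrade, validationErr error) error {
	meta.SetStatusCondition(&cgu.Status.Conditions, metav1.Condition{
		Type:    "Ready",
		Status:  metav1.ConditionFalse,
		Reason:  "ClusterGroupUpgradeValidationFailed",
		Message: validationErr.Error(),
	})
	return r.Status().Update(ctx, cgu)
}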

@@ -1448,16 +1522,43 @@ func (r *ClusterGroupUpgradeReconciler) validateCR(ctx context.Context, clusterG
}, foundManagedCluster)

if err != nil {
- return fmt.Errorf("Cluster %s is not a ManagedCluster", cluster)
+ return reconcile, fmt.Errorf("Cluster %s is not a ManagedCluster", cluster)
Contributor

Should we check the Ready status of the ManagedCluster as well, since policies only work when the cluster is ready?

Collaborator Author

Yes, this would be good to have, but I would add it to the same list of enhancements as there is also the question of what status/action we want to take if such checks fail.
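
For the record, the check itself could be as small as the sketch below, assuming the ManagedCluster is read through the open-cluster-management.io cluster v1 API (the helper name is made up):

import (
	"k8s.io/apimachinery/pkg/api/meta"
	clusterv1 "open-cluster-management.io/api/cluster/v1"
)

// isManagedClusterAvailable reports whether the cluster's agent is reachable;
// policies can only be enforced on clusters in this state.
func isManagedClusterAvailable(mc *clusterv1.ManagedCluster) bool {
	return meta.IsStatusConditionTrue(mc.Status.Conditions, clusterv1.ManagedClusterConditionAvailable)
}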

clusterGroupUpgrade.Spec.RemediationStrategy.MaxConcurrency = newMaxConcurrency
err = r.Client.Update(ctx, clusterGroupUpgrade)
if err != nil {
r.Log.Info("Error updating Cluster Group Upgrade")
Contributor

Should this be r.Log.Error?

Collaborator Author

I'm returning the error below, so this info log can just be deleted.

// Contains the managed policies (and the namespaces) that have NonCompliant clusters
// that require updating.
ManagedPoliciesForUpgrade []ManagedPolicyForUpgrade `json:"managedPoliciesForUpgrade,omitempty"`
ManagedPoliciesCompliantBeforeUpgrade []string `json:"managedPoliciesCompliantBeforeUpgrade,omitempty"`
Contributor

Looks like ManagedPoliciesCompliantBeforeUpgrade is just for the record? In my opinion, it would be clearer to keep a single managedPolicies entry and have all the information there: name, namespace, and compliance status.

Collaborator Author

Yes, I think it would be a good idea. I will add it to the list as changing it now would mean even more restructuring.

@irinamihai (Collaborator Author)

/retest

@irinamihai irinamihai force-pushed the BZs_4.10 branch 2 times, most recently from 2a827b7 to 703fe81 on February 2, 2022 00:30
@irinamihai (Collaborator Author)

/retest-required

1 similar comment
@irinamihai (Collaborator Author)

/retest-required

@imiller0 (Contributor) left a comment

Lots of really good work. A couple of questions, but overall it looks good.

@@ -0,0 +1,38 @@
apiVersion: policy.open-cluster-management.io/v1
Contributor

Is this test code? If so it would be helpful to have it under a testdata directory.

curl -k -s -X PATCH -H "Accept: application/json, */*" \
-H "Content-Type: application/merge-patch+json" \
http://localhost:8001/apis/ran.openshift.io/v1alpha1/namespaces/$1/clustergroupupgrades/$2/status \
--data '{"status": {"conditions":[{"lastTransitionTime": "2021-12-15T18:55:59Z", "message": "All the clusters in the CR are compliant", "reason": "UpgradeCompleted", "status": "False", "type": "Ready"}]}}'
Contributor

Minor: add newlines at the end of the files to keep GitHub from flagging them.

@@ -59,7 +59,7 @@ func (r *ClusterGroupUpgradeReconciler) takeActionsAfterCompletion(
}

// Cleanup resources
- if *actionsAfterCompletion.DeleteObjects {
+ if actionsAfterCompletion.DeleteObjects == nil || *actionsAfterCompletion.DeleteObjects {
Contributor

When possible it is better to handle default values as the CR is being read in so that you don't have to handle all the cases throughout. This can be addressed in a later cleanup if needed.
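
A sketch of that approach, applied once right after the CR is fetched (the exact spec field path for the deleteObjects flag is assumed here):

// setDefaults normalizes optional fields as the CR is read in so the rest of
// the reconciler can dereference pointers without repeating nil checks.
func setDefaults(cgu *ranv1alpha1.ClusterGroupUpgrade) {
	if cgu.Spec.Actions.AfterCompletion.DeleteObjects == nil {
		deleteObjects := true // assumed default: clean up the copied objects on completion
		cgu.Spec.Actions.AfterCompletion.DeleteObjects = &deleteObjects
	}
}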

utils.MaxNumberOfClustersForUpgrade)

if newMaxConcurrency != clusterGroupUpgrade.Spec.RemediationStrategy.MaxConcurrency {
clusterGroupUpgrade.Spec.RemediationStrategy.MaxConcurrency = newMaxConcurrency
Contributor

In general, we shouldn't modify the value set by the user. If the CGU is part of a GitOps flow, this update will cause the CR to no longer match what is in Git, and we will be stuck in a battle of updates here against the GitOps tools restoring the value. One option would be to compute the "effective" maxConcurrency and simply track and use that as an internal variable (it can be reflected in the status for visibility).
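
Something along these lines, i.e. a derived value instead of a spec update (the function name is made up; utils.MaxNumberOfClustersForUpgrade is the 100-cluster cap used in this PR):

// calculateEffectiveMaxConcurrency returns min(#clusters, cap, requested)
// without touching the user-supplied spec; the result can be reported in the
// status for visibility.
func calculateEffectiveMaxConcurrency(cgu *ranv1alpha1.ClusterGroupUpgrade, clusterCount int) int {
	effective := clusterCount
	if effective > utils.MaxNumberOfClustersForUpgrade {
		effective = utils.MaxNumberOfClustersForUpgrade
	}
	if requested := cgu.Spec.RemediationStrategy.MaxConcurrency; requested > 0 && requested < effective {
		effective = requested
	}
	return effective
}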

cguSpec := ranv1alpha1.ClusterGroupUpgradeSpec{
- Enable: true, // default
+ Enable: &enable,
Contributor

Why is this switched to a pointer?

Collaborator Author

There are known issues in controller-runtime: projectcapsule/capsule#342, kubernetes-sigs/kubebuilder#2109.
When we try to update the CR, non-pointer boolean values are overwritten with the default value. Angie had the same issue with deleteObjectsOnCompletion and changed it to a pointer because of that. I changed this one to avoid the same issue in the future and also to be consistent.
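
In short, the pointer lets the API machinery distinguish "not set" from "explicitly false", so an Update does not silently reset the field. A trimmed illustration (field set and defaults omitted):

type ClusterGroupUpgradeSpec struct {
	// Enable is a pointer so that an unset value survives CR updates instead
	// of being overwritten by the schema default.
	Enable *bool `json:"enable,omitempty"`
	// ... remaining fields elided
}

// isEnabled is a nil-safe accessor for the pointer field.
func isEnabled(spec ClusterGroupUpgradeSpec) bool {
	return spec.Enable != nil && *spec.Enable
}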

Comment on lines +11 to +15
for index, crtPolicy := range clusterGroupUpgrade.Status.ManagedPoliciesForUpgrade {
	if index == policyIndex {
		return &crtPolicy
	}
}
Contributor

Is this loop to handle indexes out of range?
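
If bounds are the only concern, a direct index check is simpler (note that the range loop returns a pointer to a copy of the element, while indexing aliases the slice entry). A possible alternative, with names taken from this PR's status fields:

// getPolicyByIndex returns nil when the index is out of range.
func getPolicyByIndex(cgu *ranv1alpha1.ClusterGroupUpgrade, policyIndex int) *ranv1alpha1.ManagedPolicyForUpgrade {
	if policyIndex < 0 || policyIndex >= len(cgu.Status.ManagedPoliciesForUpgrade) {
		return nil
	}
	return &cgu.Status.ManagedPoliciesForUpgrade[policyIndex]
}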

@@ -416,7 +421,7 @@ func (r *ClusterGroupUpgradeReconciler) addClustersToPlacementRule(
continue
}

- policyName := clusterGroupUpgrade.Name + "-" + clusterGroupUpgrade.Spec.ManagedPolicies[managedPolicyIndex]
+ policyName := clusterGroupUpgrade.Name + "-" + clusterGroupUpgrade.Status.ManagedPoliciesForUpgrade[managedPolicyIndex].Name
Contributor

If the user creates two ClusterGroupUpgrade CRs with the same name but in different namespaces, and they remediate the same policy (but maybe for two different clusters), they will collide on this policyName.

Collaborator Author

Even if two copied policies for two different CGUs had the same name, they would be created in each CGU's own namespace, so we wouldn't have a conflict.

@irinamihai irinamihai force-pushed the BZs_4.10 branch 7 times, most recently from 8aea548 to ccb9df7 on February 9, 2022 16:59

@imiller0 (Contributor) left a comment

I verified successful deployment of a cluster using full ZTP plus a custom build of this operator including these patches. Worked perfectly.
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2022
@openshift-ci openshift-ci bot commented Feb 11, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: imiller0, irinamihai, nishant-parekh

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [imiller0,irinamihai,nishant-parekh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 04dccf2 into openshift-kni:main Feb 11, 2022
@irinamihai irinamihai deleted the BZs_4.10 branch April 19, 2022 20:21