-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle template errors better #136
Handle template errors better #136
Conversation
210fe53
to
450a61c
Compare
@JustinKuli a bit of an odd thought, but in a template error situation where the current configuration policy has pruneObjectBehavior, could the template-sync disable |
@mprahl it could... It would potentially lose the "createdByPolicy" information (which I think we already document as unreliable), and my instinct is that the process would be unreliable. |
What would be unreliable? I think an update and then delete would be reliable since the config-policy-controller won't notice the deletion until it has refreshed its cache and thus seen the update. I'm thinking that it's probably the more expected behavior to remove the previous template rather than having an in between state where old clusters have the old version and new clusters don't have a version of the policy template. |
The plan is to add template support to more policy types. Currently it is restricted to ConfigurationPolicies. This removes that restriction and reports hub template errors that would be added by the propagator, so that each policy controller does not need to duplicate the logic to emit those events themselves. Refs: - https://issues.redhat.com/browse/ACM-10858 Signed-off-by: Justin Kulikauskas <[email protected]>
The compliance events emitted by the template-sync controller should have priority over any created by the policy controllers. For example, in the past we had issues where a "late" compliance event from a controller would override the Pending status of a policy, and there are other template-error statuses that are emitted by the template-sync which could have the same situation. To ensure that a template-sync event is always emitted after a different compliance event, this controller should reconcile in more situations. Note that it might seem better if the template-sync controller removed the templates that have had an error, but in the case of ConfigurationPolicy, removing it can cause undesirable side-effects due to PruneObjectBehavior. Refs: - https://issues.redhat.com/browse/ACM-10858 Signed-off-by: Justin Kulikauskas <[email protected]>
When adding a template-error to the status for a template that can't be decoded, the status-sync was incorrectly appending to the details array twice. So for something that would be called "template-0", the 0th item in the details array would be almost empty, and the 1st item would have all of the status events. This was causing some events to be emitted repeatedly, since the template-sync logic to prevent this was looking at the 0th item, which would never get the event. This commit also cleans up some awkward pointer gymnastics. Signed-off-by: Justin Kulikauskas <[email protected]>
This should help reduce some duplication of status events, but it might not fully prevent it. Signed-off-by: Justin Kulikauskas <[email protected]>
With the template-sync now reconciling more often (there are no longer any cases where the reconcile can be skipped), some event duplication might be unavoidable. Signed-off-by: Justin Kulikauskas <[email protected]>
450a61c
to
3aac660
Compare
@mprahl I would prefer to more thoroughly discuss separately whether (and how) to remove templates during an error case, rather than hold this PR for the decision. I think the behavior of this PR is a step forward. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JustinKuli, mprahl The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
57ebda2
into
open-cluster-management-io:main
It appears that "special" compliance events related to template-errors can potentially be superseded by later events coming from a policy controller. Specifically, if a template is valid and gets created, but is later edited to have an error, the original template is not removed. Because undesired side-effects could be caused by removing a template (eg PruneObjectBehavior in a ConfigurationPolicy), it would be risky to "fix" this by removing the template in this situation. Instead, this PR proposes that the template-sync always re-reconcile after the status is updated, which allows it to correct the "late" event by emitting a new template-error.
The increased number of reconciles does increase the possibility of duplicate compliance events. Accordingly, some tests have been relaxed, and the duplicate detection process now refreshes the policy to try and decrease the probability. It also fixes a bug that would cause a large number of duplicated events in this scheme, because the event was being recorded on the wrong item in the history.
This PR also handles hub-template errors here, instead of in individual policy controllers.
Refs: