Add Status Conditions for FAR CR #69

razo7 · 2023-07-27T14:52:29Z

Adding three status conditions (Processing, FenceAgentActionSucceeded, and Succeeded) to help FAR in two ways:

1. Convey the status of processing the CR, whether the fence agent action succeeded (only once), and whether the whole remediation succeeded (the node was tainted with the FAR taint, the fence agent action succeeded, and the workloads have been deleted).
2. Limit which sections of the reconcile run, based on the conditions; e.g., reboot and resource deletion will finish successfully only once.

Moreover, each status condition includes a reason and a message based on the ProcessingChangeReason that changed the condition value.
The PR also updates the reconcile structure:

1. Fetch the FAR CR, validate the CR name, check the NHC timeout annotation, and add a finalizer. ProcessingChangeReason = RemediationStarted
2. Try to add the FAR taint, then build the FA command and execute it (until it succeeds). ProcessingChangeReason = FenceAgentSucceeded
3. Try to delete workloads. ProcessingChangeReason = RemediationFinished

ECOPROJECT-1411
ECOPROJECT-1484
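
For readers following the three reconcile stages above, here is a minimal sketch of how one ProcessingChangeReason could drive all three conditions. The `Condition` type is a local stand-in for `metav1.Condition` so the snippet compiles without Kubernetes modules, and the status chosen for each reason is an assumption made for illustration; the description only names the reasons, not the exact values.

```go
package main

import "fmt"

// ConditionStatus and Condition are local stand-ins for the metav1 types.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

type Condition struct {
	Type   string
	Status ConditionStatus
	Reason string
}

const (
	ProcessingType                = "Processing"
	FenceAgentActionSucceededType = "FenceAgentActionSucceeded"
	SucceededType                 = "Succeeded"
)

// conditionsFor maps a change reason to the three condition values.
// The mapping itself is assumed, not taken from the PR.
func conditionsFor(reason string) ([]Condition, error) {
	var processing, fenceAgent, succeeded ConditionStatus
	switch reason {
	case "RemediationStarted":
		processing, fenceAgent, succeeded = ConditionTrue, ConditionUnknown, ConditionUnknown
	case "FenceAgentSucceeded":
		processing, fenceAgent, succeeded = ConditionTrue, ConditionTrue, ConditionUnknown
	case "RemediationFinished":
		processing, fenceAgent, succeeded = ConditionFalse, ConditionTrue, ConditionTrue
	default:
		return nil, fmt.Errorf("unknown processing change reason %q", reason)
	}
	return []Condition{
		{Type: ProcessingType, Status: processing, Reason: reason},
		{Type: FenceAgentActionSucceededType, Status: fenceAgent, Reason: reason},
		{Type: SucceededType, Status: succeeded, Reason: reason},
	}, nil
}

func main() {
	for _, reason := range []string{"RemediationStarted", "FenceAgentSucceeded", "RemediationFinished"} {
		conds, err := conditionsFor(reason)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		for _, c := range conds {
			fmt.Printf("%-20s %-26s %s\n", reason, c.Type, c.Status)
		}
	}
}
```

Keeping the mapping in one function makes every stage transition explicit, at the cost of tying all three conditions to a single reason value.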

openshift-ci · 2023-07-27T14:52:34Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2023-07-27T14:52:37Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [razo7]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

slintes

Please use conditions

Some resources in the v1 API contain fields called phase, and associated message, reason, and other status fields. The pattern of using phase is deprecated. Newer API types should use conditions instead. Phase was essentially a state-machine enumeration field, that contradicted system-design principles and hampered evolution, since adding new enum values breaks backward compatibility. Rather than encouraging clients to infer implicit properties from phases, we prefer to explicitly expose the individual conditions that clients need to monitor. Conditions also have the benefit that it is possible to create some conditions with uniform meaning across all resource types, while still exposing others that are unique to specific resource types. See #7856 for more details and discussion.

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
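
To illustrate the difference from a single phase field, the sketch below keeps a list of independent conditions and upserts one at a time, similar in spirit to `meta.SetStatusCondition` from apimachinery. The `Condition` struct, `setCondition`, and `findCondition` are simplified local stand-ins written for this example, not the library code.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type, Status, Reason, Message string
	LastTransitionTime            time.Time
}

// epoch is a fixed reference time so the example output is deterministic.
var epoch = time.Unix(0, 0).UTC()

// setCondition adds the condition if its type is missing, otherwise updates it.
// LastTransitionTime only moves when the status actually changes.
func setCondition(conds []Condition, next Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != next.Type {
			continue
		}
		if conds[i].Status != next.Status {
			conds[i].Status = next.Status
			conds[i].LastTransitionTime = now
		}
		conds[i].Reason = next.Reason
		conds[i].Message = next.Message
		return conds
	}
	next.LastTransitionTime = now
	return append(conds, next)
}

// findCondition returns a pointer to the condition of the given type, or nil.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

func main() {
	var conds []Condition
	conds = setCondition(conds, Condition{Type: "Processing", Status: "True", Reason: "RemediationStarted"}, epoch)
	conds = setCondition(conds, Condition{Type: "Succeeded", Status: "Unknown", Reason: "RemediationStarted"}, epoch)
	conds = setCondition(conds, Condition{Type: "Succeeded", Status: "True", Reason: "RemediationFinished"}, epoch.Add(time.Minute))
	for _, c := range conds {
		fmt.Printf("%s=%s (%s) since %s\n", c.Type, c.Status, c.Reason, c.LastTransitionTime.Format(time.RFC3339))
	}
}
```

A client that only cares about `Succeeded` can read that one entry and ignore the rest, which is the property the quoted convention argues for.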

razo7 · 2023-07-30T13:23:48Z

/test 4.13-openshift-e2e

razo7 · 2023-07-30T14:54:03Z

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

mshitrit · 2023-07-31T08:38:26Z

/test 4.12-openshift-e2e

razo7 · 2023-07-31T10:58:12Z

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

slintes · 2023-07-31T12:47:32Z

api/v1alpha1/fenceagentsremediation_types.go

+
+const (
+	// RemediationStarted - CR was found, its name matches a node, and a finalizer was set
+	RemediationStarted ProcessingChangeReason = "remediationStarted"


reasons should be CamelCase
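
A small sketch of what the requested change could look like: the constant values start with an upper-case letter, and a helper checks that convention. The regular expression is a deliberately strict check written for this example (it is not the API server's validation), and the three constant names are taken from the PR description.

```go
package main

import (
	"fmt"
	"regexp"
)

// ProcessingChangeReason mirrors the type name used in the PR.
type ProcessingChangeReason string

const (
	RemediationStarted  ProcessingChangeReason = "RemediationStarted"
	FenceAgentSucceeded ProcessingChangeReason = "FenceAgentSucceeded"
	RemediationFinished ProcessingChangeReason = "RemediationFinished"
)

// camelCaseReason is an illustrative pattern: one upper-case letter followed
// by letters or digits only.
var camelCaseReason = regexp.MustCompile(`^[A-Z][A-Za-z0-9]*$`)

// isCamelCaseReason reports whether the reason value follows that pattern.
func isCamelCaseReason(r ProcessingChangeReason) bool {
	return camelCaseReason.MatchString(string(r))
}

func main() {
	candidates := []ProcessingChangeReason{
		RemediationStarted,
		FenceAgentSucceeded,
		RemediationFinished,
		"remediationStarted", // the value from the reviewed diff
	}
	for _, r := range candidates {
		fmt.Printf("%-22s CamelCase: %v\n", r, isCamelCaseReason(r))
	}
}
```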

slintes · 2023-07-31T12:51:05Z

api/v1alpha1/fenceagentsremediation_types.go

+
+	// Represents the observations of a FenceAgentsRemediation's current state.
+	// Known .status.conditions.type are: "Processing", "FenceAgentActionSucceeded", and "Succeeded".
+	// +patchMergeKey=type


patch* doesn't work on CRDs, remove please

slintes · 2023-07-31T13:16:15Z

controllers/fenceagentsremediation_controller.go

-		r.Log.Error(err, "Fence Agent response wasn't a success message", "CR's Name", req.Name)
-		return emptyResult, err
+
+	if meta.IsStatusConditionPresentAndEqual(far.Status.Conditions, commonConditions.ProcessingType, metav1.ConditionTrue) &&


IMHO the check for the processing condition should be done earlier

a better check for the FASucceeded condition would be != True. Then you can set the condition to False in case execution failed.

I think the code which prepares command execution can be moved from above inside this if block? (getting the pod, building params...)

> a better check for the FASucceeded condition would be != True.

👍🏻

> I think the code which prepares command execution can be moved from above inside this if block? (getting the pod, building params..

I thought of that, and I went with the current implementation in order to limit the code sections that are affected by the status conditions. Having said that, do you still see greater value in moving the suggested code under the if block?

> Then you can set the condition to False in case execution failed.

ATM there is no ProcessingChangeReason for this use case, but I might add one.

> Having said that do you still see a greater value of adding the suggested code under the if block?

yes, why execute all that code when it's not used 🤷🏼‍♂️

> ATM there is no ProcessingChangeReason

tbh, I dislike this "one reason for updating all conditions" pattern anyway, and this is why...
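
The suggestion in this thread (check for `!= True`, and record `False` on failure) can be sketched as follows. This is not the controller code from the PR: `Condition` is a local stand-in for `metav1.Condition`, `runFenceAgentOnce` is a hypothetical helper, and the `FenceAgentFailed` reason does not exist in the PR; it is invented here to show where a failure reason would go.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a local stand-in for metav1.Condition.
type Condition struct {
	Type, Status, Reason string
}

const fenceAgentActionSucceededType = "FenceAgentActionSucceeded"

// errAgentFailed simulates a fence agent call that did not succeed.
var errAgentFailed = errors.New("fence agent did not report success")

// isConditionTrue reports whether the condition exists and is "True".
func isConditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// setStatus updates the condition in place or appends it if missing.
func setStatus(conds []Condition, condType, status, reason string) []Condition {
	for i := range conds {
		if conds[i].Type == condType {
			conds[i].Status, conds[i].Reason = status, reason
			return conds
		}
	}
	return append(conds, Condition{Type: condType, Status: status, Reason: reason})
}

// runFenceAgentOnce executes the action unless the condition is already True.
// Preparation work (getting the pod, building params) would live inside this
// guard too, so it is skipped once the action has succeeded.
func runFenceAgentOnce(conds []Condition, execute func() error) ([]Condition, error) {
	if isConditionTrue(conds, fenceAgentActionSucceededType) {
		return conds, nil
	}
	if err := execute(); err != nil {
		// "FenceAgentFailed" is a hypothetical reason, not one from the PR.
		return setStatus(conds, fenceAgentActionSucceededType, "False", "FenceAgentFailed"), err
	}
	return setStatus(conds, fenceAgentActionSucceededType, "True", "FenceAgentSucceeded"), nil
}

func main() {
	calls := 0
	failing := func() error { calls++; return errAgentFailed }
	succeeding := func() error { calls++; return nil }

	var conds []Condition
	conds, err := runFenceAgentOnce(conds, failing)
	fmt.Println(conds, err)
	conds, err = runFenceAgentOnce(conds, succeeding)
	fmt.Println(conds, err)
	conds, err = runFenceAgentOnce(conds, succeeding)
	fmt.Println(conds, err, "executions:", calls)
}
```

Checking for "not True" rather than "present and Unknown" means a missing or `False` condition both lead to a retry, which matches the "until it succeeds" wording in the PR description.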

slintes · 2023-07-31T13:19:12Z

controllers/fenceagentsremediation_controller.go

+		}
+		r.Log.Info("FenceAgentsRemediation CR has completed to remediate the node", "Node Name", req.Name)
+
+		return ctrl.Result{Requeue: true}, nil


do we need to requeue here? 🤔

Not really. Can be discarded

slintes · 2023-07-31T14:15:41Z

api/v1alpha1/fenceagentsremediation_types.go

 	// +listType=map
 	// +listMapKey=type
 	// +operator-sdk:csv:customresourcedefinitions:type=status,displayName="conditions",xDescriptors="urn:alm:descriptor:io.kubernetes.conditions"
-	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type" protobuf:"bytes,1,rep,name=conditions"`
+	Conditions []metav1.Condition `json:"conditions,omitempty" protobuf:"bytes,1,rep,name=conditions"`


missed in the first review: why the protobuf tag?

I followed the conditions example from https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties, and it seems like the example already adds this protobuf tag anyway in the CSV description. So I will delete it from fenceagentsremediation_types.go to align with how we set conditions in other operators (without the protobuf tag), e.g. NHC's conditions.

Conditions []metav1.Condition json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type" protobuf:"bytes,1,rep,name=conditions"

That doc is primarily targeting core k8s types. So in general it's good to follow its recommendations, but not everything applies to CRDs as well 🙂
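
As a rough illustration of where this thread ends up, the sketch below declares the field with only the `json` tag and the list markers, and uses reflection to confirm which tags remain. `Condition` is a local stand-in for `metav1.Condition`, and both helper functions are hypothetical, written only for this example. The reasoning assumed here is that custom resources do not support strategic merge patch and are not served as protobuf, so the extra tags have no effect on a CRD-backed type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// Condition is a local stand-in for metav1.Condition.
type Condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// FenceAgentsRemediationStatus keeps only the json tag on Conditions.
type FenceAgentsRemediationStatus struct {
	// +listType=map
	// +listMapKey=type
	Conditions []Condition `json:"conditions,omitempty"`
}

// conditionsFieldHasTag reports whether the Conditions field carries the tag key.
func conditionsFieldHasTag(key string) bool {
	field, ok := reflect.TypeOf(FenceAgentsRemediationStatus{}).FieldByName("Conditions")
	if !ok {
		return false
	}
	_, found := field.Tag.Lookup(key)
	return found
}

// encodeStatus returns the JSON form of the status.
func encodeStatus(s FenceAgentsRemediationStatus) (string, error) {
	raw, err := json.Marshal(s)
	return string(raw), err
}

func main() {
	for _, key := range []string{"json", "patchStrategy", "patchMergeKey", "protobuf"} {
		fmt.Printf("%-14s present: %v\n", key, conditionsFieldHasTag(key))
	}
	out, err := encodeStatus(FenceAgentsRemediationStatus{
		Conditions: []Condition{{Type: "Processing", Status: "True"}},
	})
	fmt.Println(out, err)
}
```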

razo7 · 2023-07-31T15:14:58Z

/test 4.13-openshift-e2e

slintes · 2023-07-31T15:23:41Z

controllers/fenceagentsremediation_controller.go

+			r.Log.Error(err, "Invalid sharedParameters/nodeParameters from CR - edit/recreate the CR", "CR's Name", req.Name)
+			return emptyResult, nil
+		}
+		// Add FAR (medik8s) remediation taint


should the taint be in this block as well? 🤔

razo7 · 2023-07-31T15:57:47Z

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

razo7 · 2023-07-31T16:06:38Z

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

slintes · 2023-07-31T16:15:53Z

controllers/fenceagentsremediation_controller.go

+
+	var (
+		processingConditionStatus, fenceAgentActionSucceededConditionStatus, succeededConditionStatus metav1.ConditionStatus
+		conditinMessage                                                                               string


typo: conditionMessage

slintes · 2023-07-31T16:27:26Z

api/v1alpha1/fenceagentsremediation_types.go

 const (
 	// FARFinalizer is a finalizer for a FenceAgentsRemediation CR deletion
 	FARFinalizer string = "fence-agents-remediation.medik8s.io/far-finalizer"
 	// Taints
 	FARNoExecuteTaintKey = "medik8s.io/fence-agents-remediation"
+	// FenceAgentActionSucceededType is the condition type used to signal whether the Fence Agent action was succeeded successfully or not
+	FenceAgentActionSucceededType = "FenceAgentActionSucceeded"
+	// error status messages


these errors are used for tests only, correct? They shouldn't be in the api package then.

I have moved them to errors.go file

slintes · 2023-07-31T16:32:10Z

controllers/fenceagentsremediation_controller_test.go

 		log.Info("Pod exist", "pod", podName)
 	}
 }

+// verifyStatusCondition checks whether the status condition is set with the expected value
+func verifyStatusCondition(testFAR *v1alpha1.FenceAgentsRemediation, conditionType, expectedResult string, conditionStatus metav1.ConditionStatus) {


I won't block on this, but using this "expectedResult" string is strange, to say the least. The type and status params should be enough for the tests, no? Either they match or they don't.

Yes, but I added expectedResult so that Eventually can "catch" an expected error, e.g., in unit tests when the conditions haven't been set.
I have renamed the function to verifyExpectedStatusConditionError, as that better captures its essence.

you could make conditionStatus a pointer and use nil, if you want to test that a condition isn't set 🙂

But then the test would fail, since Eventually would expect utils.ConditionSetAndMatchSuccess while verifyStatusCondition/verifyExpectedStatusConditionError would return utils.ConditionSetButNoMatchError.

this works: slintes@55781ea?diff=unified
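
The pointer idea from this thread can be sketched like this. It is not the content of the linked commit, which is not reproduced here; `conditionMatches` and the local `Condition` types are hypothetical stand-ins written for this example. A nil expected status means "the condition must not be set".

```go
package main

import "fmt"

// ConditionStatus and Condition are local stand-ins for the metav1 types.
type ConditionStatus string

type Condition struct {
	Type   string
	Status ConditionStatus
}

// conditionMatches reports whether the condition is in the expected state.
// With expected == nil it returns true only if the condition is absent.
func conditionMatches(conds []Condition, condType string, expected *ConditionStatus) bool {
	for _, c := range conds {
		if c.Type == condType {
			return expected != nil && c.Status == *expected
		}
	}
	return expected == nil
}

func main() {
	isTrue := ConditionStatus("True")
	conds := []Condition{{Type: "Processing", Status: "True"}}

	fmt.Println(conditionMatches(conds, "Processing", &isTrue)) // set and matching
	fmt.Println(conditionMatches(conds, "Succeeded", nil))      // expected to be unset
	fmt.Println(conditionMatches(conds, "Succeeded", &isTrue))  // expected True but unset
}
```

A boolean result like this can be polled directly (for example inside an Eventually block), so the test needs no separate "expected result" string.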

razo7 · 2023-08-01T07:56:52Z

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

razo7 · 2023-08-01T09:22:26Z

/test 4.13-openshift-e2e

razo7 · 2023-08-01T10:43:08Z

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

razo7 · 2023-08-01T12:42:02Z

/retest

slintes · 2023-08-01T17:18:54Z

oh, CI failed to add the label

/lgtm

openshift-ci bot added the do-not-merge/work-in-progress label Jul 27, 2023

openshift-ci bot added the approved label Jul 27, 2023

razo7 changed the title ~~Add Remediation Phase for FAR CR~~ [WIP] Add Remediation Phase for FAR CR Jul 27, 2023

slintes requested changes Jul 28, 2023

openshift-ci bot assigned slintes Jul 28, 2023

razo7 force-pushed the add-status branch from f62cf51 to de1d9a7 Compare July 30, 2023 13:06

razo7 changed the title ~~[WIP] Add Remediation Phase for FAR CR~~ [WIP] Add Status Conditions for FAR CR Jul 30, 2023

razo7 mentioned this pull request Jul 30, 2023

Improve Verification of FA Success Response #70

Merged

razo7 changed the title ~~[WIP] Add Status Conditions for FAR CR~~ Add Status Conditions for FAR CR Jul 31, 2023

slintes requested changes Jul 31, 2023

slintes reviewed Jul 31, 2023

razo7 force-pushed the add-status branch from 65d2627 to e2ebaa6 Compare July 31, 2023 16:04

slintes reviewed Jul 31, 2023

razo7 force-pushed the add-status branch from e2ebaa6 to badcfa4 Compare August 1, 2023 07:54

openshift-merge-robot added the needs-rebase label Aug 1, 2023

Add three status conditions for FAR CR

0c91ae1

Processing, FenceAgentActionSucceeded, and Succeeded status conditions have been added to verify FAR remediation status

razo7 force-pushed the add-status branch from badcfa4 to 1bd3750 Compare August 1, 2023 10:22

openshift-merge-robot removed the needs-rebase label Aug 1, 2023

razo7 added 2 commits August 1, 2023 13:38

Restructure reconcile using the status conditions of FAR CR

9b28a03

Processing, FenceAgentActionSucceeded, and Succeeded status conditions help exclude the FA execution and workload deletion sections from every reconcile call. It would help us avoid any second (and more) remediation

Add tests for status conditions

1b38f7f

Unit tests and e2e tests have been added to verify the expected behaviour by looking whether the status conditions have been set and what's their value

razo7 force-pushed the add-status branch from 1bd3750 to 1b38f7f Compare August 1, 2023 10:38

razo7 marked this pull request as ready for review August 1, 2023 14:07

openshift-ci bot removed the do-not-merge/work-in-progress label Aug 1, 2023

slintes approved these changes Aug 1, 2023

openshift-ci bot added the lgtm label Aug 1, 2023

openshift-merge-robot merged commit d6df2be into medik8s:main Aug 1, 2023
1 check passed
