Rework running Fence Agent command #106

clobrano · 2023-11-21T09:53:38Z

Run the fence agent command asynchronously in a dedicated goroutine inside the controller's own container.
The goroutine is also responsible for updating the FAR status with the command outcome. For this reason, two new Status Conditions have been added to account for fence agent failures and timeouts.

The fence agent command has three new, optional, Spec values:

RetryCount is the number of times the fence agent is executed in case of failure (default: 5)
RetryInterval is the interval between fence agent retries (default: 5s)
Timeout is the timeout for each fence agent execution (default: 60s)

TODO

replace fence_ipmilan mock with software mock of exec.Command only
add fence agent failure case to updateConditions
use updateConditions in executor
split change in smaller chunks
clean up Executor runners Map when goroutine ends
investigate why e2e tests with AWS cluster time out: flaky tests 🤷

openshift-ci · 2023-11-21T10:03:08Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2023-11-21T10:03:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [clobrano]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

clobrano · 2023-11-21T12:39:41Z

/test

openshift-ci · 2023-11-21T12:39:45Z

@clobrano: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test 4.12-ci-bundle-my-bundle
/test 4.12-images
/test 4.12-openshift-e2e
/test 4.12-test
/test 4.13-ci-bundle-my-bundle
/test 4.13-images
/test 4.13-openshift-e2e
/test 4.13-test
/test 4.14-ci-bundle-my-bundle
/test 4.14-images
/test 4.14-openshift-e2e
/test 4.14-test
/test 4.15-ci-bundle-my-bundle
/test 4.15-images
/test 4.15-openshift-e2e
/test 4.15-test

Use /test all to run all jobs.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

clobrano · 2023-11-21T12:39:56Z

/test 4.15-openshift-e2e

pkg/cli/cliexecuter.go

clobrano · 2023-11-22T13:43:58Z

Working on a different approach for testing, so you might want to wait before reviewing this PR

Move UpdateConditions function and related assets to the utils package to let the Executor update FAR status when Fence Agent execution completes.

Signed-off-by: Carlo Lobrano <[email protected]>

- improved test independence
- replaced custom functions with Gomega alternatives
- other small improvements

Signed-off-by: Carlo Lobrano <[email protected]>

clobrano · 2023-11-24T17:07:44Z

/test 4.14-openshift-e2e

mshitrit · 2023-11-26T08:24:27Z

Working on a different approach for testing, so you might want to wait before reviewing this PR

Might want to add a [WIP] suffix on the PR until ready for review

pkg/cli/cliexecuter.go

- run the command directly on the container without API requests
- run the command asynchronously to free the Reconciler loop
- let the goroutine running the command also update the FAR status according to the result of the command. The status update will then trigger a new reconcile loop for the rest of the actions.
- add new conditions to handle Fence Agent failures

The goroutine running the fence agent is mapped to the FAR CR UID.

see: https://issues.redhat.com/browse/ECOPROJECT-1755

Signed-off-by: Carlo Lobrano <[email protected]>

Signed-off-by: Carlo Lobrano <[email protected]>

Drop verification of the existence of the "Success" message in the controller's logs. This check depends strongly on the implementation, which means the test might fail in the future just because the log changes (even just in the AWS fence agent). Moreover, the E2E checks already skip this control when the target node is the one where FAR resides, which means the other checks are sufficient for the test to pass.

Signed-off-by: Carlo Lobrano <[email protected]>

clobrano · 2023-12-10T19:34:19Z

/test 4.14-openshift-e2e

Signed-off-by: Carlo Lobrano <[email protected]>

- add verifyRemediationTaintExists
- add verifyRemediationConditions

clobrano · 2023-12-18T16:21:33Z

/retest

Signed-off-by: Carlo Lobrano <[email protected]>

clobrano · 2023-12-19T19:37:09Z

/retest

mshitrit · 2023-12-20T07:29:38Z

pkg/cli/cliexecuter.go

-		config:    config,
-		clientSet: clientSet,
+// NewExecuter builds an Executer with configurable runnerFunc for testing
+func NewFakeExecuter(client client.Client, fn runnerFunc) (*Executer, error) {


I think it's a good rule of thumb to separate test code from production code. Based on that, I'd expect this method to be in a different file (and once moved, it can probably be made private).

good point, I'll move it to the test code

Signed-off-by: Carlo Lobrano <[email protected]>

For coherence, also fix the boolean return value if the status update was interrupted, even though in that case we also return an error, which makes the ExponentialBackoffWithContext function exit anyway.

Signed-off-by: Carlo Lobrano <[email protected]>

The runWithRetry function uses a constant-time back-off, not a linear one.

Signed-off-by: Carlo Lobrano <[email protected]>

clobrano · 2023-12-20T14:29:39Z

/retest

Signed-off-by: Carlo Lobrano <[email protected]>

razo7 · 2023-12-20T15:17:00Z

/lgtm

clobrano · 2023-12-20T15:21:39Z

/hold
waiting for @mshitrit feedback on his change request

clobrano · 2023-12-20T15:39:15Z

/retest

slintes

overall lgtm... few comments inline. Not sure about context handling at one place 🤔

slintes · 2023-12-20T17:29:05Z

pkg/cli/fake.go

@@ -0,0 +1,19 @@
+package cli


can this be fake_test.go?

I think the _test.go makes this file a test and not usable as source for another test.
At least, only changing the name breaks the unit test

In case I'm missing something, I'll fix this in a following PR

slintes · 2023-12-20T17:54:39Z

pkg/cli/cliexecuter.go

+	retryErr = wait.ExponentialBackoffWithContext(ctx,
+		backoff,
+		func(ctx context.Context) (bool, error) {
+			ctxWithTimeout, cancel := context.WithTimeout(ctx, timeout)


do we need this? 🤔 IIUC, the context we get here is the same one we pass to ExponentialBackoffWithContext. Why would we need to cancel that one when we leave this function? Even more, isn't it an issue when we do that?

but maybe it's just too late for me to understand it completely

Why would we need to cancel that one when we leave this function

Not when we leave the function, but when we are in the middle of the function (either in the retry or during a command call), and we want to stop it (e.g. NHC time out)

Not when we leave the function

but that's what we do with the defer one line below, no? 🤔

I mixed contexts here (no pun intended 😄)

The code is in a different place now, but my intention here is not to cancel the context, but to give the exec.CommandContext a timeout to run the fence agent command

clobrano · 2023-12-21T09:46:18Z

/retest

razo7 · 2023-12-24T13:30:33Z

/lgtm

mshitrit · 2023-12-27T07:41:58Z

I'm unholding this PR since the open comments are only nits and it includes an E2E fix which is relevant for other PRs.
As @clobrano mentioned, if needed they can be addressed in a follow-up PR.
/unhold

openshift-ci bot added the do-not-merge/work-in-progress label Nov 21, 2023

openshift-ci bot added the approved label Nov 21, 2023

clobrano force-pushed the rework-fa-exec-async/1 branch 2 times, most recently from 58169ba to b35b878 Compare November 21, 2023 10:10

razo7 reviewed Nov 22, 2023

pkg/cli/cliexecuter.go (outdated, resolved)

razo7 reviewed Nov 22, 2023

pkg/cli/cliexecuter.go (outdated, resolved)

clobrano force-pushed the rework-fa-exec-async/1 branch 3 times, most recently from 8e82aa7 to d7e62f0 Compare November 24, 2023 15:14

clobrano added 2 commits November 24, 2023 18:00

Move UpdateConditions to utils package

fedd5b9

Move UpdateConditions function and related assets to the utils package to let the Executor update FAR status when Fence Agent execution completes.

Signed-off-by: Carlo Lobrano <[email protected]>

Test refactoring

1283aa2

- improved test independence
- replaced custom functions with Gomega alternatives
- other small improvements

Signed-off-by: Carlo Lobrano <[email protected]>

clobrano force-pushed the rework-fa-exec-async/1 branch from d7e62f0 to e988abf Compare November 24, 2023 17:06

clobrano changed the title ~~Rework running Fence Agent command~~ [WIP] Rework running Fence Agent command Nov 26, 2023

razo7 reviewed Nov 27, 2023

pkg/cli/cliexecuter.go (outdated, resolved)

razo7 reviewed Nov 27, 2023

pkg/cli/cliexecuter.go (outdated, resolved)

razo7 reviewed Nov 27, 2023

pkg/cli/cliexecuter.go (resolved)

clobrano force-pushed the rework-fa-exec-async/1 branch from e988abf to 9a91fcb Compare December 2, 2023 16:48

clobrano added 3 commits December 10, 2023 20:24

Add configurable Fence Agent command retries and timeout

c8e924f

Signed-off-by: Carlo Lobrano <[email protected]>

clobrano force-pushed the rework-fa-exec-async/1 branch from 9a91fcb to 4ebea0e Compare December 10, 2023 19:24

clobrano added 3 commits December 18, 2023 15:47

Fix typos and documentation in controller tests

13223d6

Signed-off-by: Carlo Lobrano <[email protected]>

Refactor common test code in single functions

12985e9

- add verifyRemediationTaintExists
- add verifyRemediationConditions

Fix typo in test message

0a8e9c0

clobrano added 3 commits December 19, 2023 16:53

Minor refactor of updateStatusWithRetry error handling

281dd59

Use const fence agent messages to help unittests

8bb393c

Signed-off-by: Carlo Lobrano <[email protected]>

Do not expose full fence agent command line

9ab9ee0

Signed-off-by: Carlo Lobrano <[email protected]>

clobrano mentioned this pull request Dec 19, 2023

Add Events for far Remediation Process #107

Merged

mshitrit reviewed Dec 20, 2023

clobrano added 3 commits December 20, 2023 10:03

Move NewFakeExecutor into a separate file

682b894

Signed-off-by: Carlo Lobrano <[email protected]>

Remove incorrect comment

a1ef1c1

The runWithRetry function uses a constant-time back-off, not a linear one.

Signed-off-by: Carlo Lobrano <[email protected]>

Rephrased error messages in updateStatusWithRetry

fee00a4

Signed-off-by: Carlo Lobrano <[email protected]>

openshift-ci bot assigned razo7 Dec 20, 2023

openshift-ci bot added the lgtm label Dec 20, 2023

openshift-ci bot added the do-not-merge/hold label Dec 20, 2023

slintes reviewed Dec 20, 2023

openshift-ci bot removed the lgtm label Dec 21, 2023

clobrano force-pushed the rework-fa-exec-async/1 branch from b665a12 to fee00a4 Compare December 21, 2023 08:14

openshift-ci bot added the lgtm label Dec 24, 2023

openshift-ci bot removed the do-not-merge/hold label Dec 27, 2023

openshift-merge-bot bot merged commit b9f55d1 into medik8s:main Dec 27, 2023
19 checks passed
