Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework running Fence Agent command #106

Merged
merged 28 commits into from
Dec 27, 2023

Conversation

clobrano
Copy link
Contributor

@clobrano clobrano commented Nov 21, 2023

Run the fence agent command asynchronously in a dedicated goroutine on the same controller's container.
The goroutine is also responsible to update FAR status with the command outcome. For this reason two new Status Conditions have been added to take into account fence agents failures or timeouts.

The fence agent command has three new, optional, Spec values:

  • RetryCount is the number of times the fencing agent will be executed in case of failures (default: 5)
  • RetryInterval is the interval between each fencing agent retries (default 5s)
  • Timeout is the timeout for each fencing agent execution (default 60s)

TODO

  • replace fence_ipmilan mock with software mock of exec.Command only
  • add fence agent failure case to updateConditions
  • use updateConditions in executor
  • split change in smaller chunks
  • clean up Executor runners Map when goroutine ends
  • investigate why e2e tests with AWS cluster time out: flaky tests 🤷

Copy link
Contributor

openshift-ci bot commented Nov 21, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

openshift-ci bot commented Nov 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@clobrano clobrano force-pushed the rework-fa-exec-async/1 branch 2 times, most recently from 58169ba to b35b878 Compare November 21, 2023 10:10
@clobrano
Copy link
Contributor Author

/test

Copy link
Contributor

openshift-ci bot commented Nov 21, 2023

@clobrano: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test 4.12-ci-bundle-my-bundle
  • /test 4.12-images
  • /test 4.12-openshift-e2e
  • /test 4.12-test
  • /test 4.13-ci-bundle-my-bundle
  • /test 4.13-images
  • /test 4.13-openshift-e2e
  • /test 4.13-test
  • /test 4.14-ci-bundle-my-bundle
  • /test 4.14-images
  • /test 4.14-openshift-e2e
  • /test 4.14-test
  • /test 4.15-ci-bundle-my-bundle
  • /test 4.15-images
  • /test 4.15-openshift-e2e
  • /test 4.15-test

Use /test all to run all jobs.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@clobrano
Copy link
Contributor Author

/test 4.15-openshift-e2e

pkg/cli/cliexecuter.go Outdated Show resolved Hide resolved
pkg/cli/cliexecuter.go Outdated Show resolved Hide resolved
@clobrano
Copy link
Contributor Author

Working on a different approach for testing, so you might want to wait before reviewing this PR

@clobrano clobrano force-pushed the rework-fa-exec-async/1 branch 3 times, most recently from 8e82aa7 to d7e62f0 Compare November 24, 2023 15:14
Move UpdateConditions function and related assets to the utils package
to let the Executor update FAR status when Fence Agent execution
completes.

Signed-off-by: Carlo Lobrano <[email protected]>
- improved test independence
- replaced custom functions with Gomega alternatives
- other small improvements

Signed-off-by: Carlo Lobrano <[email protected]>
@clobrano
Copy link
Contributor Author

/test 4.14-openshift-e2e

@mshitrit
Copy link
Member

Working on a different approach for testing, so you might want to wait before reviewing this PR

Might want to add a [WIP] suffix on the PR until ready for review

@clobrano clobrano changed the title Rework running Fence Agent command [WIP] Rework running Fence Agent command Nov 26, 2023
pkg/cli/cliexecuter.go Outdated Show resolved Hide resolved
pkg/cli/cliexecuter.go Outdated Show resolved Hide resolved
- run the command directly on the container without API requests
- run the command asynchronously to free Reconciler loop
- let the goroutine running the command also update FAR status
  accordingly to the result of the command. The status update will then
  trigger a new reconcile loop for the rest of the actions.
- add new conditions to handle Fence Agent failures

The goroutine running the fence agent is mapped to the FAR CR UID.

see: https://issues.redhat.com/browse/ECOPROJECT-1755

Signed-off-by: Carlo Lobrano <[email protected]>
Drop verification of the existence of the "Success" message in the
controller's logs.

This check is a strong dependency from the implementation, which means
the test might fail in the future just because the log changes (even
just in the AWS fence agent).

Moreover the E2E checks already skip this control when the target node
is the one where FAR resides, which means the other checks are
sufficient for the test to pass.

Signed-off-by: Carlo Lobrano <[email protected]>
@clobrano
Copy link
Contributor Author

/test 4.14-openshift-e2e

- add verifyRemediationTaintExists
- add verifyRemediationConditions
@clobrano
Copy link
Contributor Author

/retest

@clobrano
Copy link
Contributor Author

/retest

config: config,
clientSet: clientSet,
// NewExecuter builds an Executer with configurable runnerFunc for testing
func NewFakeExecuter(client client.Client, fn runnerFunc) (*Executer, error) {
Copy link
Member

@mshitrit mshitrit Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good rule of thumb to separate test code from production code, based on that I'd expect this method to be in a different file (and once moved it can probably made private ).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I'll move it to the test code

For coherence, also fix the boolean return value if the status update
was interrupted, even if in that case we are also returing an error
which will makes the ExponentialBackoffWithContext function exit anyway.

Signed-off-by: Carlo Lobrano <[email protected]>
runWithRetry function use a constant time back-off, not linear

Signed-off-by: Carlo Lobrano <[email protected]>
@clobrano
Copy link
Contributor Author

/retest

@razo7
Copy link
Member

razo7 commented Dec 20, 2023

/lgtm

@clobrano
Copy link
Contributor Author

/hold
waiting for @mshitrit feedback on his change request

@clobrano
Copy link
Contributor Author

/retest

Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm... few comments inline. Not sure about context handling at one place 🤔

@@ -0,0 +1,19 @@
package cli
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be fake_test.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the _test.go makes this file a test and not usable as source for another test.
At least, only changing the name breaks the unit test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case I'm missing something, I'll fix this in a following PR

pkg/cli/cliexecuter.go Outdated Show resolved Hide resolved
retryErr = wait.ExponentialBackoffWithContext(ctx,
backoff,
func(ctx context.Context) (bool, error) {
ctxWithTimeout, cancel := context.WithTimeout(ctx, timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this? 🤔 IIUC, the context we get here is the same which we pass to ExponentialBackoffWithContext. Why would we need to cancel that one when we leave this function. Even more, isn't that an issue when do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but maybe it's just too late for me to understand it completely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we need to cancel that one when we leave this function

Not when we leave the function, but when we are in the middle of the function (either in the retry or during a command call), and we want to stop it (e.g. NHC time out)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not when we leave the function

but that's we do with the defer one line below, not? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mixed contexts here (no pun intended 😄)

The code is in a different place now, but my intention here is not to cancel the context, but to give the exec.CommandContext a timeout to run the fence agent command

@clobrano
Copy link
Contributor Author

/retest

@razo7
Copy link
Member

razo7 commented Dec 24, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Dec 24, 2023
@mshitrit
Copy link
Member

I'm unholding this PR since only contain Nits and it has an E2E fix which is relevant for other PRs.
As @clobrano mentioned in case needed it can be addressed in followup PR.
/unhold

@openshift-merge-bot openshift-merge-bot bot merged commit b9f55d1 into medik8s:main Dec 27, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants