Handle unexpected use case where SNR's configuration is deleted #209

Merged
merged 17 commits into from
Jun 30, 2024

mshitrit
Member

@mshitrit mshitrit commented Jun 9, 2024

Why we need this PR

SNR should have exactly one default configuration, because the configuration affects the SNR agents running on each node (every node has one agent).
Deleting this configuration prevents the operator from working properly. Blocking the deletion outright is problematic (for example, it would prevent OLM cleanup), so instead we make sure that SNR is properly disabled when the configuration is deleted.

Changes made

  • Issue a webhook warning when the configuration is deleted
  • Stop outstanding remediation while there is no configuration and set Disabled status on the remediation
  • Remove the Disabled status from the remediation when the configuration is created again, in order to re-trigger the remediation
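
A minimal sketch of the three changes listed above, using plain Go types rather than the operator's real API. Every name here (Remediation, Condition, validateDelete, reconcile, and the "Disabled" and "ConfigNotFound" strings) is an assumption made for illustration and is not taken from the SNR code base.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status bool
	Reason string
}

// Remediation is a hypothetical, stripped-down remediation CR.
type Remediation struct {
	Name       string
	Conditions []Condition
}

const (
	disabledType        = "Disabled"       // hypothetical condition type
	reasonConfigMissing = "ConfigNotFound" // hypothetical condition reason
)

// validateDelete mirrors the webhook idea: the deletion is allowed (nil error),
// but a warning is returned so it can be surfaced to the user.
func validateDelete(configName string) (warnings []string, err error) {
	msg := fmt.Sprintf("deleting %q disables remediation until it is created again", configName)
	return []string{msg}, nil
}

// setCondition adds or updates a condition and reports whether anything changed.
func setCondition(r *Remediation, c Condition) bool {
	for i := range r.Conditions {
		if r.Conditions[i].Type == c.Type {
			if r.Conditions[i] == c {
				return false
			}
			r.Conditions[i] = c
			return true
		}
	}
	r.Conditions = append(r.Conditions, c)
	return true
}

// removeCondition deletes a condition by type and reports whether it was present.
func removeCondition(r *Remediation, condType string) bool {
	for i := range r.Conditions {
		if r.Conditions[i].Type == condType {
			r.Conditions = append(r.Conditions[:i], r.Conditions[i+1:]...)
			return true
		}
	}
	return false
}

// reconcile reports whether remediation may proceed. Without a config it marks
// the remediation as Disabled; once the config exists again it clears the mark.
func reconcile(r *Remediation, configExists bool) bool {
	if !configExists {
		setCondition(r, Condition{Type: disabledType, Status: true, Reason: reasonConfigMissing})
		return false
	}
	removeCondition(r, disabledType)
	return true
}

func main() {
	warnings, _ := validateDelete("self-node-remediation-config")
	fmt.Println("warning:", warnings[0])

	r := &Remediation{Name: "worker-1"}
	proceed := reconcile(r, false)
	fmt.Println("config missing -> proceed:", proceed, "conditions:", r.Conditions)
	proceed = reconcile(r, true)
	fmt.Println("config present -> proceed:", proceed, "conditions:", r.Conditions)
}
```

In a real controller the condition helpers would more likely come from the Kubernetes apimachinery library, and clearing the condition would be triggered by a watch on the configuration; the sketch only shows the state transitions.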

Which issue(s) this PR fixes

ECOPROJECT-1996

Test plan

Contributor

openshift-ci bot commented Jun 9, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved label Jun 9, 2024
@mshitrit
Member Author

mshitrit commented Jun 9, 2024

/test 4.15-openshift-e2e

@slintes
Member

slintes commented Jun 10, 2024

/hold

see my comments on the issue

@mshitrit mshitrit force-pushed the prevent-configuration-delete branch from dd666ae to 59026d3 Compare June 16, 2024 13:38
@mshitrit mshitrit force-pushed the prevent-configuration-delete branch from 59026d3 to fa2b7c2 Compare June 16, 2024 17:30
@mshitrit mshitrit changed the title [WIP] validate default configuration can't be deleted [WIP] handle configuration been deleted Jun 16, 2024
@mshitrit mshitrit changed the title [WIP] handle configuration been deleted handle configuration been deleted Jun 18, 2024
@mshitrit
Member Author

/test 4.15-openshift-e2e

@mshitrit
Member Author

/test 4.14-openshift-e2e

controllers/selfnoderemediation_controller.go (outdated, resolved)
controllers/selfnoderemediation_controller.go (outdated, resolved)
@mshitrit
Member Author

/test 4.15-openshift-e2e

Member

@slintes slintes left a comment

Something is wrong with this PR. It contains commits and modified code of changes which are in main already... :/

Contributor

@clobrano clobrano left a comment

Sorry, I still need to check the part where SNR is disabled, but I left some comments

api/v1alpha1/selfnoderemediationconfig_webhook_test.go (outdated, resolved)
controllers/selfnoderemediation_controller.go (outdated, resolved)
controllers/selfnoderemediation_controller.go (outdated, resolved)
Signed-off-by: Michael Shitrit <[email protected]>
- generalizing condition reasons
- setting a new condition on CR if no config is found and removing this condition when config is created

Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit
Member Author

/test 4.15-openshift-e2e

Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit
Member Author

/test 4.15-openshift-e2e

Member

@slintes slintes left a comment

some nits and comments

@@ -38,13 +38,9 @@ var _ = Describe("Calculator tests", func() {
 	})

 	JustBeforeEach(func() {
-		Expect(k8sClient.Create(context.Background(), snrConfig)).To(Succeed())
+		createConfig(snrConfig)
Member

can you please explain what the value of this change (introducing functions which do much more than needed IMHO) is? The tests are testing the config, they fail if it's not created, there is no need to test existence when creating and even less when deleting it 🤷

Member Author

Sure.
"GetRebootTime should return correct value" was flaky.
IMO it was because the config wasn't created fast enough to be set in the calculator.

For me it makes sense to verify that the setup steps have completed before the test starts; I find it easier to troubleshoot later.

I don't mind reverting this change, and you can introduce a different fix in a separate PR if that's something you prefer.

Member

If it's flaky, then increasing the test timeout is an easier fix but with the same effect IMHO. WDYT?

Member Author

@mshitrit mshitrit Jun 27, 2024

I agree it's easier.
But I think the alternative is a better fix, for the following reasons:

  • it'll reduce the flakiness of all the tests that have the config as a prerequisite, not just that specific one
  • we'll have an easier time troubleshooting when a test fails: if the config isn't created we get an early indication, instead of having to figure it out from the failing test (in this case, from GetRebootDuration not returning the expected value)
  • I think it simplifies the test workflow: this test is not about verifying that the config is created in etcd, but about how the config creation affects the calculator, so verifying that the config is created should be part of the setup rather than part of the test.

Member

> and not just that specific one

You need to increase the timeout on both tests of course

> in case config isn't created we have an early indication

To the best of my knowledge this never happened in unit tests so far, it's just more or less slow depending on the host. In the end the only effect of the added code is additional timeout. Or did you see any other issues than running into a timeout?

> we need to figure it out because GetRebootDuration doesn't return the expected value

The error message is pretty clear in that case.

And all that still doesn't explain the value of the existence test in the cleanup... the test would have failed in setup or in the actual test without config already, why fail it in cleanup as well?
Actually I see the old version would fail as well, which is unneeded, failures in the delete call can be ignored. The important part is that the config doesn't exist when the test finishes.

But ok, we won't agree anyway, not worth further discussion 🤷🏼‍♂️

/lgtm

Member Author

> In the end the only effect of the added code is additional timeout. Or did you see any other issues than running into a timeout?

Not sure if you wanted a reply or not so feel free to ignore (I assumed you might want a reply because it was a question).

I'll try to explain myself better with an example.
I've simulated configuration not created on time for both use cases.

In the first use case we get the following error (see below), so we still need to do some research into why it failed.
This is something I generally prefer to avoid by making sure that a test only tests what it's supposed to (in this case the value of GetRebootDuration and not the creation of the configuration)

 SelfNodeRemediationConfig not set yet, can't calculate minimum reboot duration
      {
          msg: "SelfNodeRemediationConfig not set yet, can't calculate minimum reboot duration",
         ...
      }

In the second use case (when creation is verified before the test) we get the following error, which cuts down the required investigation, and since this is done in a shared block the benefit applies to all current and future tests.

  The function passed to Eventually failed at /home/mshitrit/gitRepos/forked/medik8s/self-node/pkg/reboot/calculator_test.go:148 with:
  Expected success, but got an error:
      <*errors.StatusError | 0xc00055f040>: 
      SelfNodeRemediationConfig.self-node-remediation.medik8s.io "self-node-remediation-config" not found
      {
         ...
              Message: "SelfNodeRemediationConfig.self-node-remediation.medik8s.io \"self-node-remediation-config\" not found",
              ...
      }
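
The idea of verifying creation in a shared setup step can be sketched roughly as follows. This is a standard-library-only approximation: the real tests use Ginkgo/Gomega and a Kubernetes client, and fakeStore, eventually, verifyConfig and this version of createConfig are hypothetical stand-ins, not the helpers from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for the API server's "not found" status error.
var errNotFound = errors.New("config not found")

// fakeStore is a hypothetical in-memory replacement for the test client.
type fakeStore struct {
	configs map[string]bool
}

func (s *fakeStore) create(name string) { s.configs[name] = true }

func (s *fakeStore) get(name string) error {
	if !s.configs[name] {
		return fmt.Errorf("%q: %w", name, errNotFound)
	}
	return nil
}

// eventually retries fn until it succeeds or the timeout expires,
// returning the last error wrapped with the timeout.
func eventually(fn func() error, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("setup not ready after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// verifyConfig blocks until the config is readable or a short timeout expires.
func verifyConfig(s *fakeStore, name string) error {
	return eventually(func() error { return s.get(name) }, 200*time.Millisecond, 10*time.Millisecond)
}

// createConfig creates the config and then verifies it is readable, so a
// missing config fails in setup with a "not found" error instead of later,
// inside the assertion on the calculator.
func createConfig(s *fakeStore, name string) error {
	s.create(name)
	return verifyConfig(s, name)
}

func main() {
	store := &fakeStore{configs: map[string]bool{}}
	fmt.Println("setup error:", createConfig(store, "self-node-remediation-config"))
	fmt.Println("missing config:", verifyConfig(store, "missing"))
}
```

With this shape, a failure in setup names the missing object directly, which is the troubleshooting benefit described in the second use case.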

Signed-off-by: Michael Shitrit <[email protected]>
@openshift-ci openshift-ci bot added the lgtm label Jun 27, 2024
@slintes
Member

slintes commented Jun 27, 2024

/hold

not sure if other threads are resolved

Contributor

openshift-ci bot commented Jun 27, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, mshitrit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@clobrano
Contributor

My observations have been addressed. Lgtm too

@mshitrit mshitrit marked this pull request as ready for review June 30, 2024 07:12
@openshift-ci openshift-ci bot requested review from razo7 and slintes June 30, 2024 07:13
@mshitrit
Member Author

/retest

@mshitrit mshitrit merged commit 9eb177e into medik8s:main Jun 30, 2024
25 checks passed
@mshitrit mshitrit changed the title handle configuration been deleted Handle unexpected use case where SNR's configuration is deleted Jul 8, 2024

4 participants