Handle unexpected use case where SNR's configuration is deleted #209

mshitrit · 2024-06-09T11:12:53Z

Why we need this PR

SNR should have only one default configuration, since the configuration affects the SNR agents which are running on each node (and every node has one agent).
Deleting this configuration prevents the operator from working properly. Since preventing deletion of the configuration is problematic (for example, it prevents OLM cleanup), we make sure that SNR is properly disabled when the configuration is deleted.

Changes made

Issue a webhook warning when the configuration is deleted
Stop outstanding remediation while there is no configuration and set Disabled status on the remediation
Remove the disabled status from the remediation when the configuration is created in order to re-trigger the remediation
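
The second and third changes can be illustrated with a minimal, self-contained sketch. It is not the controller code from this PR: `Condition`, `Remediation`, `reconcile` and the condition type name `Disabled` are simplified, hypothetical stand-ins that only show the intended toggle (set a condition and stop while no configuration exists, remove it once the configuration is back).

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Message string
}

// Remediation is a hypothetical, stripped-down remediation resource.
type Remediation struct {
	Name       string
	Conditions []Condition
}

// disabledType is an assumed condition type name, chosen for illustration.
const disabledType = "Disabled"

// findCondition returns the index of the condition with the given type, or -1.
func findCondition(conds []Condition, condType string) int {
	for i, c := range conds {
		if c.Type == condType {
			return i
		}
	}
	return -1
}

// reconcile reports whether remediation may proceed.
// Without a configuration it records the Disabled condition (once) and stops.
// With a configuration it removes the condition so remediation continues.
func reconcile(r *Remediation, configExists bool) bool {
	idx := findCondition(r.Conditions, disabledType)
	if !configExists {
		if idx == -1 {
			r.Conditions = append(r.Conditions, Condition{
				Type:    disabledType,
				Message: "configuration not found, remediation is paused",
			})
		}
		return false
	}
	if idx != -1 {
		r.Conditions = append(r.Conditions[:idx], r.Conditions[idx+1:]...)
	}
	return true
}

func main() {
	r := &Remediation{Name: "worker-1"}
	fmt.Println(reconcile(r, false), len(r.Conditions)) // configuration deleted
	fmt.Println(reconcile(r, true), len(r.Conditions))  // configuration re-created
}
```

In a real controller the condition would be persisted on the resource status, and the change would typically cause the remediation to be reconciled again.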

Which issue(s) this PR fixes

ECOPROJECT-1996

Test plan

openshift-ci · 2024-06-09T11:12:58Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

mshitrit · 2024-06-09T11:59:08Z

/test 4.15-openshift-e2e

mshitrit · 2024-06-09T13:39:47Z

/test 4.15-openshift-e2e

slintes · 2024-06-10T08:19:44Z

/hold

see my comments on the issue

mshitrit · 2024-06-18T11:54:07Z

/test 4.15-openshift-e2e

mshitrit · 2024-06-18T13:26:13Z

/test 4.15-openshift-e2e

mshitrit · 2024-06-18T19:08:47Z

/test 4.15-openshift-e2e

mshitrit · 2024-06-19T06:31:28Z

/test 4.14-openshift-e2e

controllers/selfnoderemediation_controller.go

mshitrit · 2024-06-20T15:06:45Z

/test 4.15-openshift-e2e

slintes

Something is wrong with this PR. It contains commits and modified code of changes which are in main already... :/

controllers/selfnoderemediation_controller.go

controllers/tests/config/selfnoderemediationconfig_controller_test.go

api/v1alpha1/selfnoderemediationconfig_webhook_test.go

clobrano

Sorry, I still need to check the part where SNR is disabled, but I left some comments

api/v1alpha1/selfnoderemediationconfig_webhook_test.go

controllers/selfnoderemediation_controller.go


controllers/selfnoderemediation_controller.go

controllers/selfnoderemediationconfig_controller.go

controllers/tests/config/selfnoderemediationconfig_controller_test.go


mshitrit · 2024-06-25T12:04:25Z

/test 4.15-openshift-e2e


mshitrit · 2024-06-25T14:45:14Z

/test 4.15-openshift-e2e

controllers/selfnoderemediation_controller.go

slintes

some nits and comments

controllers/tests/config/selfnoderemediationconfig_controller_test.go

slintes · 2024-06-27T08:38:28Z

pkg/reboot/calculator_test.go

@@ -38,13 +38,9 @@ var _ = Describe("Calculator tests", func() {
 	})

 	JustBeforeEach(func() {
-		Expect(k8sClient.Create(context.Background(), snrConfig)).To(Succeed())
+		createConfig(snrConfig)


can you please explain what the value of this change (introducing functions which do much more than needed IMHO) is? The tests are testing the config, they fail if it's not created, there is no need to test existence when creating and even less when deleting it 🤷

mshitrit · 2024-06-27

Sure.
"GetRebootTime should return correct value" was flaky.
IMO it was because the config wasn't created fast enough to be set in the calculator.

For me it makes sense to verify the setup steps that take place before the test starts; I find it easier to troubleshoot later.

I don't mind reverting this change, and you can introduce a different fix in a separate PR if that's something you prefer.

slintes · 2024-06-27

If it's flaky, then increasing the test timeout is an easier fix with the same effect, IMHO. WDYT?

mshitrit · 2024-06-27

I agree it's easier.
But I think the alternative is a better fix, for the following reasons:

- It'll reduce the flakiness of all the tests that have the config as a prerequisite, and not just that specific one.
- We'll have an easier time troubleshooting in case a test fails: in case config isn't created we have an early indication, instead of trying to figure it out from the failing test (for example, in this case we need to figure it out because GetRebootDuration doesn't return the expected value).
- I think it simplifies the test workflow: this test is not about verifying that the config is created in etcd, but about how the config creation affects the calculator, so verifying that the config is created should be part of the setup and not part of the test.

slintes · 2024-06-27

> and not just that specific one

You need to increase the timeout on both tests of course

> in case config isn't created we have an early indication

To the best of my knowledge this has never happened in unit tests so far; it's just more or less slow depending on the host. In the end the only effect of the added code is additional timeout. Or did you see any other issues than running into a timeout?

> we need to figure it out because GetRebootDuration doesn't return the expected value

The error message is pretty clear in that case.

And all that still doesn't explain the value of the existence test in the cleanup... the test would have failed in setup or in the actual test without config already, why fail it in cleanup as well?
Actually I see the old version would fail as well, which is unneeded, failures in the delete call can be ignored. The important part is that the config doesn't exist when the test finishes.

But ok, we won't agree anyway, not worth further discussion 🤷🏼‍♂️

/lgtm

mshitrit · 2024-06-30

> In the end the only effect of the added code is additional timeout. Or did you see any other issues than running into a timeout?

Not sure if you wanted a reply or not, so feel free to ignore (I assumed you might because it was a question).

I'll try to explain myself better with an example.
I've simulated the configuration not being created in time for both use cases.

In the first use case we get the following error (see below), so we still need to do some research into why it failed.
This is something I generally prefer to avoid by making sure that a test only tests what it's supposed to (in this case the value of GetRebootDuration and not the creation of the configuration).

SelfNodeRemediationConfig not set yet, can't calculate minimum reboot duration { msg: "SelfNodeRemediationConfig not set yet, can't calculate minimum reboot duration", ... }

In the second use case (when the create is verified before the test) we get the following error, which cuts down the required investigation. Since this is done in a shared block, the benefit applies to all current and future tests.

The function passed to Eventually failed at /home/mshitrit/gitRepos/forked/medik8s/self-node/pkg/reboot/calculator_test.go:148 with: Expected success, but got an error: <*errors.StatusError | 0xc00055f040>: SelfNodeRemediationConfig.self-node-remediation.medik8s.io "self-node-remediation-config" not found { ... Message: "SelfNodeRemediationConfig.self-node-remediation.medik8s.io \"self-node-remediation-config\" not found", ... }


slintes · 2024-06-27T13:07:11Z

/hold

not sure if other threads are resolved

openshift-ci · 2024-06-27T13:21:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, mshitrit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [clobrano,mshitrit]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

clobrano · 2024-06-27T13:21:48Z

My observations have been addressed. Lgtm too

mshitrit · 2024-06-30T11:19:04Z

/retest

mshitrit · 2024-06-30T13:02:38Z

/retest

openshift-ci bot added the do-not-merge/work-in-progress label Jun 9, 2024

openshift-ci bot added the approved label Jun 9, 2024

openshift-ci bot added the do-not-merge/hold label Jun 10, 2024

openshift-merge-robot added the needs-rebase label Jun 16, 2024

mshitrit force-pushed the prevent-configuration-delete branch from dd666ae to 59026d3 Compare June 16, 2024 13:38

openshift-merge-robot removed the needs-rebase label Jun 16, 2024

mshitrit force-pushed the prevent-configuration-delete branch from 59026d3 to fa2b7c2 Compare June 16, 2024 17:30

mshitrit changed the title ~~[WIP] validate default configuration can't be deleted~~ [WIP] handle configuration been deleted Jun 16, 2024

mshitrit changed the title ~~[WIP] handle configuration been deleted~~ handle configuration been deleted Jun 18, 2024

slintes requested changes Jun 20, 2024

controllers/selfnoderemediation_controller.go (outdated, resolved)

controllers/selfnoderemediation_controller.go (outdated, resolved)

openshift-ci bot assigned slintes Jun 20, 2024

slintes reviewed Jun 21, 2024

controllers/selfnoderemediation_controller.go (outdated, resolved)

controllers/tests/config/selfnoderemediationconfig_controller_test.go (resolved)

clobrano reviewed Jun 21, 2024

api/v1alpha1/selfnoderemediationconfig_webhook_test.go (outdated, resolved)

clobrano requested changes Jun 21, 2024


openshift-ci bot assigned clobrano Jun 21, 2024

mshitrit added 6 commits June 23, 2024 13:44

handle configuration delete

562f5e8

Signed-off-by: Michael Shitrit <[email protected]>

fix logic for testing of deleted config

4e6e7f1

Signed-off-by: Michael Shitrit <[email protected]>

- creating an enum for condition types

43886dd

- genralizing condition reasons - setting a new condition on CR if no config is found and removing this condition when config is created Signed-off-by: Michael Shitrit <[email protected]>

add UT

551f87c

Signed-off-by: Michael Shitrit <[email protected]>

add message to status condition

6b0df92

Signed-off-by: Michael Shitrit <[email protected]>

verifying status condition is persisted in test code before veriying …

848a733

…it's removed Signed-off-by: Michael Shitrit <[email protected]>

clobrano requested changes Jun 24, 2024


openshift-merge-robot added the needs-rebase label Jun 25, 2024

Merge branch 'main' into prevent-configuration-delete

015e9ec

openshift-merge-robot removed the needs-rebase label Jun 25, 2024

mshitrit added 4 commits June 25, 2024 13:54

typo fix

11fd188

Signed-off-by: Michael Shitrit <[email protected]>

fix tests breaking due to merge changes

b01b339

Signed-off-by: Michael Shitrit <[email protected]>

improve comment syntax

b4f52d4

Signed-off-by: Michael Shitrit <[email protected]>

update name of test config create method

e2b1251

Signed-off-by: Michael Shitrit <[email protected]>

mshitrit added 2 commits June 25, 2024 16:32

remove confusing test case

f3c917f

Signed-off-by: Michael Shitrit <[email protected]>

fix flaky unit test

a2d72a1

Signed-off-by: Michael Shitrit <[email protected]>

slintes reviewed Jun 27, 2024

controllers/selfnoderemediation_controller.go (outdated, resolved)

controllers/selfnoderemediation_controller.go (outdated, resolved)

slintes reviewed Jun 27, 2024

controllers/selfnoderemediation_controller.go (outdated, resolved)

controllers/selfnoderemediation_controller.go (outdated, resolved)

slintes reviewed Jun 27, 2024


mshitrit added 2 commits June 27, 2024 13:25

fixing minor syntax issues

d2e830f

Signed-off-by: Michael Shitrit <[email protected]>

Revert "fix flaky unit test"

14223a0

This reverts commit a2d72a1.

openshift-ci bot added the lgtm label Jun 27, 2024

clobrano approved these changes Jun 27, 2024


mshitrit marked this pull request as ready for review June 30, 2024 07:12

openshift-ci bot removed the do-not-merge/work-in-progress label Jun 30, 2024

openshift-ci bot requested review from razo7 and slintes June 30, 2024 07:13

mshitrit merged commit 9eb177e into medik8s:main Jun 30, 2024
25 checks passed

mshitrit changed the title ~~handle configuration been deleted~~ Handle unexpected use case where SNR's configuration is deleted Jul 8, 2024

