-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle unexpected use case where SNR's configuration is deleted #209
Handle unexpected use case where SNR's configuration is deleted #209
Conversation
Skipping CI for Draft Pull Request. |
/test 4.15-openshift-e2e |
1 similar comment
/test 4.15-openshift-e2e |
/hold see my comments on the issue |
dd666ae
to
59026d3
Compare
59026d3
to
fa2b7c2
Compare
/test 4.15-openshift-e2e |
2 similar comments
/test 4.15-openshift-e2e |
/test 4.15-openshift-e2e |
/test 4.14-openshift-e2e |
/test 4.15-openshift-e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something is wrong with this PR. It contains commits and modified code of changes which are in main already... :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I still need to check the part where SNR is disabled, but I left some comments
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
- genralizing condition reasons - setting a new condition on CR if no config is found and removing this condition when config is created Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
…it's removed Signed-off-by: Michael Shitrit <[email protected]>
controllers/tests/config/selfnoderemediationconfig_controller_test.go
Outdated
Show resolved
Hide resolved
controllers/tests/config/selfnoderemediationconfig_controller_test.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
/test 4.15-openshift-e2e |
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
/test 4.15-openshift-e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some nits and comments
controllers/tests/config/selfnoderemediationconfig_controller_test.go
Outdated
Show resolved
Hide resolved
pkg/reboot/calculator_test.go
Outdated
@@ -38,13 +38,9 @@ var _ = Describe("Calculator tests", func() { | |||
}) | |||
|
|||
JustBeforeEach(func() { | |||
Expect(k8sClient.Create(context.Background(), snrConfig)).To(Succeed()) | |||
createConfig(snrConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please explain what the value of this change (introducing functions which do much more than needed IMHO) is? The tests are testing the config, they fail if it's not created, there is no need to test existence when creating and even less when deleting it 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
"GetRebootTime should return correct value" was flaky.
IMO it was because the config didn't create fast enough to be set in the calculator.
For me it makes sense to verify setup steps that takes before the test before the test starts, I find it easier to troubleshoot later.
I don't mind reverting this change, and you can introduce a different fix in a separate PR if that's something you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's flaky, than increasing the test timeout is an easier fix but with the same effect IMHO. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's easier.
But I think the alternative is a better fix, for the following reasons:
- it'll reduce the flakiness of all the tests that has a prerequisite of config and not just that specific one
- we'll have easier time troubleshooting in case test fail - in case config isn't created we have an early indication instead of trying to to figure it out from the test failing (for example in this case we need to figure it out because GetRebootDuration doesn't return the expected value)
- I think it simplify the test workflow: since this test is not about testing etcd that config is created, but to see how the config creation affects the calculator I think verifying the config is created should not be part of the test but part of the setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and not just that specific one
You need to increase the timeout on both tests of course
in case config isn't created we have an early indication
To my best knowledge this never happened in unit tests so far, it's just more or less slow depending on the host. In the end the only effect of the added code is additional timeout. Or did you see any other issues than running into a timeout?
we need to figure it out because GetRebootDuration doesn't return the expected value
The error message is pretty clear in that case.
And all that still doesn't explain the value of the existence test in the cleanup... the test would have failed in setup or in the actual test without config already, why fail it in cleanup as well?
Actually I see the old version would fail as well, which is unneeded, failures in the delete call can be ignored. The important part is that the config doesn't exist when the test finishes.
But ok, we won't agree anyway, not worth further discussion 🤷🏼♂️
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the end the only effect of the added code is additional timeout. Or did you see any other issues than running into a timeout?
Not sure if you wanted a reply or not so feel free to ignore (I assumed you might want a reply because it was a question).
I'll try to explain myself better with an example.
I've simulated configuration not created on time for both use cases.
In the first use case we'll get the following error (see below), so we still needs to do some research as to the reason it failed.
This is something I generally prefer to avoid by making sure that a test only tests what it's suppose to (in this case the value of GetRebootDuration
and not the creation of the configuration)
SelfNodeRemediationConfig not set yet, can't calculate minimum reboot duration
{
msg: "SelfNodeRemediationConfig not set yet, can't calculate minimum reboot duration",
...
}
In the second use case (when create is verified before the test) we'll get the following error, which cuts down the required investigation and since this is done in a shared block the value applied to all current and future tests.
The function passed to Eventually failed at /home/mshitrit/gitRepos/forked/medik8s/self-node/pkg/reboot/calculator_test.go:148 with:
Expected success, but got an error:
<*errors.StatusError | 0xc00055f040>:
SelfNodeRemediationConfig.self-node-remediation.medik8s.io "self-node-remediation-config" not found
{
...
Message: "SelfNodeRemediationConfig.self-node-remediation.medik8s.io \"self-node-remediation-config\" not found",
...
}
Signed-off-by: Michael Shitrit <[email protected]>
This reverts commit a2d72a1.
/hold not sure if other threads are resolved |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: clobrano, mshitrit The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
My observations have been addressed. Lgtm too |
/retest |
1 similar comment
/retest |
Why we need this PR
SNR should have only one default configuration, since the configuration affects the SNR agents which are running on each node (and every node has one agent).
Deleting this configuration will prevent the operator working properly, since preventing deletion of the configuration is problematic (for example it'll prevents OLM cleanup) we're making sure that SNR is properly disabled when the configuration is deleted.
Changes made
Which issue(s) this PR fixes
ECOPROJECT-1996
Test plan