
[WIP] Dynamically set safe time to assume node rebooted seconds #197

Conversation

@mshitrit (Member) commented Apr 18, 2024

Why we need this PR

Safe Time to reboot is a configured value; however, it can't be lower than a minimum calculated value.
The calculated value may differ between clusters, so we need a mechanism to override the configured Safe Time to reboot in case it's lower than the calculated value.
This conflict can cause the agents to crash-loop when the operator is installed, since the agents will not run with an invalid (lower than the calculated value) Safe Time to reboot.

Changes made

  • The minimum time to reboot is calculated when the SNR agent is initialized and is set in an annotation on the configuration.
  • In case the SafeTime in the configuration spec is lower than the calculated value, the value in the configuration is overridden (this can happen when another field that affects the calculation is changed).
  • When SafeTime is changed in the configuration, a webhook is used to verify the new value against the calculated minimum value that was set in the config annotation (see the sketch below).
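A minimal sketch of that webhook check, using a stand-in type and a hypothetical annotation key (the PR's actual names may differ):

```go
package main

import (
	"fmt"
	"strconv"
)

// Stand-in for the operator's SelfNodeRemediationConfig CR (the real type
// lives in api/v1alpha1). The annotation key below is hypothetical.
type config struct {
	annotations map[string]string
	safeTimeSec int
}

const minSafeTimeAnnotation = "self-node-remediation.medik8s.io/min-safe-time-seconds"

// validateSafeTime rejects a configured safe time that is lower than the
// calculated minimum recorded in the config annotation.
func validateSafeTime(c config) error {
	minStr, ok := c.annotations[minSafeTimeAnnotation]
	if !ok {
		return nil // no calculated minimum recorded yet, nothing to verify
	}
	minSec, err := strconv.Atoi(minStr)
	if err != nil {
		return fmt.Errorf("malformed min safe time annotation %q: %v", minStr, err)
	}
	if c.safeTimeSec < minSec {
		return fmt.Errorf("safe time %ds is lower than the calculated minimum %ds", c.safeTimeSec, minSec)
	}
	return nil
}

func main() {
	c := config{
		annotations: map[string]string{minSafeTimeAnnotation: "315"},
		safeTimeSec: 300,
	}
	fmt.Println(validateSafeTime(c)) // rejected: 300s < 315s
}
```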

Which issue(s) this PR fixes

ECOPROJECT-1875

Test plan

- overriding the SafeTimeToReboot value in the configuration in case it's invalid

Signed-off-by: Michael Shitrit <[email protected]>
openshift-ci bot (Contributor) commented Apr 18, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot (Contributor) commented Apr 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshitrit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mshitrit force-pushed the dynamiclly-set-SafeTimeToAssumeNodeRebootedSeconds branch from 414e9ef to 7efc1e4 on April 18, 2024 16:37
@mshitrit (Member, Author) commented:

/test 4.15-openshift-e2e

Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit force-pushed the dynamiclly-set-SafeTimeToAssumeNodeRebootedSeconds branch from 7efc1e4 to 46c4744 on April 25, 2024 13:26
@mshitrit (Member, Author) commented:

/test 4.15-openshift-e2e

@clobrano (Contributor) commented:

/lgtm
Giving a chance to get more reviews
/hold

@slintes (Member) left a comment:

  • I don't think it's a good idea to set the spec in a controller, even less so when it's overwriting a configuration value which can be set by users
  • Even more, the value we want to set is dynamic; it depends on node-specific values (watchdog timeout) and cluster size. So we can potentially have many agents fighting over the "right" value

@mshitrit (Member, Author) commented:

  • I don't think it's a good idea to set the spec in a controller, even less so when it's overwriting a configuration value which can be set by users

I agree it's not ideal; however, we will override the value only in 2 use cases:

  • When SNR starts and the default value would cause the SNR agent to crash. At that stage the user hasn't had a chance to set any value yet, and any value set later by the user will be verified by the webhook.
  • When the user changes another configuration field in a way that makes SafeTime invalid because the minimum safe time changed (which would also lead to the SNR agent crashing).

  • Even more, the value we want to set is dynamic; it depends on node-specific values (watchdog timeout) and cluster size. So we can potentially have many agents fighting over the "right" value

In case several agents have different values that make SafeTime invalid, the highest (safest) one will be applied. The annotation used by the webhook may still hold a value that is too low, though (we can discuss whether to address this later, in case it's relevant).

IMO this is the best solution among the alternatives, but I think it's worth sharing the alternatives in case you disagree or have something else in mind:

  • Removing SafeTime from the Spec (major API change)
  • Removing the default value, so the initial value will always be set dynamically
  • Doing nothing (keeping the crash-loop bug)
  • Setting a fixed higher default value for SafeTime (would reduce the occurrence of this bug without really fixing it, while increasing overall remediation time for small clusters)

@slintes (Member) commented Apr 30, 2024

we will override the value only in 2 use cases

and when users set a value that's too low, no?

Removing SafeTime from the Spec (major API change)
Removing the default value, so the initial value will always be set dynamically
Doing nothing (keeping the crash-loop bug)
Setting a fixed higher default value for SafeTime (would reduce the occurrence of this bug without really fixing it, while increasing overall remediation time for small clusters)

Just do not modify the spec; use the status instead, for reporting problems and the values that are actually used?

Spec is always the DESIRED state. When it can't be reached, modifying the spec isn't the solution.

@mshitrit (Member, Author) commented Apr 30, 2024

and when users set a value that's too low, no?

In this case it'll be rejected by the webhook.

Just do not modify the spec; use the status instead, for reporting problems and the values that are actually used?

Hmm, I don't quite follow; maybe you can give an example?

@beekhof commented May 1, 2024

I'm with Marc on this one, operators should not modify the spec.
If the value is missing or too low, either:

  1. refuse to start (taking advantage of crash-loop-backoff)
  2. refuse to make progress and report an error via the status
  3. use a calculated value and report a warning via the status

@mshitrit (Member, Author) commented May 1, 2024

I'm with Marc on this one, operators should not modify the spec. If the value is missing or too low, either:

  1. refuse to start (taking advantage of crash-loop-backoff)
  2. refuse to make progress and report an error via the status
  3. use a calculated value and report a warning via the status

I think option 1 isn't really viable: having the operator crash-loop from the get-go with the default configuration seems to me like an awful user experience.

Options 2 and 3 are better in that respect, but still not great IMO.

How about removing the default value of SafeTime and making it optional?

  • In case it's not assigned, we can use the calculated value.
  • In case the user tries to assign an invalid value, we can use the webhook to reject it.
  • In case the user manipulates another field, making minSafeTime higher than SafeTime, we can fall back to option 2 or 3.
  • Same as above for an upgrade use case where the value is already invalid.

@beekhof commented May 2, 2024

No objection, but if you need to fall back to 2 or 3 anyway... why not always use that mechanism?

@mshitrit (Member, Author) commented May 2, 2024

No objection, but if you need to fall back to 2 or 3 anyway... why not always use that mechanism?

I'm using the fallback to cover edge cases, which are rare; using this mechanism for all use cases would be a terrible user experience IMO.

For example, let's say we always use option 2, and consider the following scenario:

  • User installs SNR for the first time without modifying any of the defaults
  • SNR default safe time is 300S
  • For that specific cluster the calculated min Safe Time is 315s
  • As a result SNR will not work (and report problem in the status)

I think it's a reasonable expectation for a user to have the operator working with the default configuration, which will not be the case here.
Best case, the user notices it after installation; worst case, the user assumes that SNR is working.
Either way, I think it's a bad user experience.

@slintes (Member) commented May 2, 2024

As a result SNR will not work (and report problem in the status)

It can work; just report in the status what you are doing, e.g. that the calculated value is being used. However, I agree that the user experience isn't great when the default value doesn't work...

How about removing the default value of SafeTime and making it optional?

Wfm. Plus always report the calculated value in the status, plus a warning in case it's higher than the spec's value, when that is filled in.

In case several agents have different values that make SafeTime invalid, the highest (safest) one will be applied

That will make fencing slower than needed on nodes with a lower watchdog timeout, but I have no better idea without bigger changes to the process... wfm

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit (Member, Author) commented May 2, 2024

Plus always report the calculated value in the status

ATM it's stored in an annotation; is that good enough, or do you think it's mandatory to have it in the status?

plus a warning in case it's higher than the spec's value, when that is filled in.

Using an example to clarify, to make sure we mean the same thing (see the sketch after this list):

  • in case the min safe time changed (due to a change of another field that affects it)
  • if the new value is higher than SafeTime
  • then add a warning to the config status
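A minimal sketch of that warning path, using the standard apimachinery condition helpers; the condition type and reason names are assumptions, not the PR's actual values:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Conditions slice as it would live on the config's status.
	var conditions []metav1.Condition

	// Hypothetical condition flagging that the configured SafeTime dropped
	// below the newly calculated minimum; names are illustrative only.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type:    "SafeTimeValid",
		Status:  metav1.ConditionFalse,
		Reason:  "BelowCalculatedMinimum",
		Message: "configured safe time (300s) is below the calculated minimum (315s)",
	})

	fmt.Printf("%s=%s: %s\n", conditions[0].Type, conditions[0].Status, conditions[0].Message)
}
```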

@mshitrit (Member, Author) commented May 6, 2024

/test 4.15-openshift-e2e

- Make sure safe time can be deleted
- Additional log and events
- e2e fix

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit force-pushed the dynamiclly-set-SafeTimeToAssumeNodeRebootedSeconds branch from 9d10cfa to d200dce on May 6, 2024 18:13
@mshitrit (Member, Author) commented May 6, 2024

/test 4.15-openshift-e2e

@clobrano (Contributor) left a comment:

Only minor comments from my side.

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit (Member, Author) commented May 8, 2024

/test 4.15-openshift-e2e

@@ -46,8 +46,7 @@ type SelfNodeRemediationConfigSpec struct {
 	// node will likely lead to data corruption and violation of run-once semantics.
 	// In an effort to prevent this, the operator ignores values lower than a minimum calculated from the
 	// ApiCheckInterval, ApiServerTimeout, MaxApiErrorThreshold, PeerDialTimeout, and PeerRequestTimeout fields.
-	// +kubebuilder:validation:Minimum=0
-	// +kubebuilder:default=180
+	// +kubebuilder:validation:Minimum=1
@slintes (Member) commented:

in theory at least this can be considered as an API change. It makes CRs with 0 values invalid...

@mshitrit (Member, Author) replied:

[Context] I've added this change in order to differentiate between a 0 value that is just the default (when the field is empty) and a 0 value filled in by the user; this differentiation was needed in the webhook (see the field sketch below).

in theory at least this can be considered as an API change. It makes CRs with 0 values invalid...

That's a good point; after some thinking, I believe we are still in the clear though.

Here is my line of thought:

  • A 0 value from an older version would cause the removal of SafeTimeToAssumeNodeRebootedSeconds.
  • So IIUC the risk here is a user missing SafeTimeToAssumeNodeRebootedSeconds (that was set to 0) after an upgrade.
  • Since the risk is low and the consequences of it materializing are minor, I think we can go ahead with this change.
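For context, the usual kubebuilder way to tell an unset field apart from an explicit 0 is a pointer field with no default, roughly like this sketch (not necessarily the PR's exact field definition):

```go
type SelfNodeRemediationConfigSpec struct {
	// With a pointer and no +kubebuilder:default, nil means "not set by the
	// user", letting the operator fall back to the calculated value, while
	// Minimum=1 rejects an explicit 0 at admission time.
	// +kubebuilder:validation:Minimum=1
	// +optional
	SafeTimeToAssumeNodeRebootedSeconds *int `json:"safeTimeToAssumeNodeRebootedSeconds,omitempty"`
}
```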

In pkg/reboot/calculator.go:
return nil
}

// manageSafeRebootTimeInConfiguration does two things:
// 1. It sets Status.MinSafeTimeToAssumeNodeRebootedSeconds in case it's changed by latest calculation.
@slintes (Member) commented:

It still gives me a bit of a headache that we set a dynamic value (it depends on the size of the cluster, and on the watchdog timeout of the node)... and the last agent being started just wins... (which at least solves the cluster size issue when it increases, but not when it decreases. And having potentially different watchdog configs on the nodes is a topic of its own...)

@mshitrit (Member, Author) replied:

Those are valid points.
I do have some ideas on how to improve this.
Mainly keeping a map with a separate value for each agent, which should address all of the issues (a cluster size change is a bit tricky, but it can still be managed by comparing the other agents' entries). I don't think it's a good idea to do that in this PR, but the least we can do is discuss it; a rough sketch follows.
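A rough sketch of that per-agent idea, with hypothetical names; each agent records its own minimum, and the operator takes the max over nodes that still exist:

```go
package main

import "fmt"

// effectiveMinSafeTime picks the highest per-agent minimum among nodes that
// are currently in the cluster, so entries from removed nodes stop counting.
func effectiveMinSafeTime(perAgentMin map[string]int, currentNodes []string) int {
	max := 0
	for _, node := range currentNodes {
		if v, ok := perAgentMin[node]; ok && v > max {
			max = v
		}
	}
	return max
}

func main() {
	mins := map[string]int{"node-a": 300, "node-b": 315, "removed-node": 900}
	// removed-node's stale entry is ignored once the node leaves the cluster.
	fmt.Println(effectiveMinSafeTime(mins, []string{"node-a", "node-b"})) // 315
}
```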

@@ -172,3 +175,26 @@ func validateToleration(toleration v1.Toleration) error {
 	}
 	return nil
 }
+
+func (r *SelfNodeRemediationConfig) validateMinRebootTime() error {
+	if r.Status.MinSafeTimeToAssumeNodeRebootedSeconds == 0 {
@slintes (Member) commented:

I was referring to this comment ("below" as in "in a comment for a later line of code 😉"): #197 (comment)

It's weird that validation fails because a status field (or annotation, doesn't matter) isn't set yet, isn't it? 🤔
I think this might be a use case for the new warning, which can be returned instead of an error. That needs some dependency updates though, IIUC (see the new method signature, e.g. here: https://github.com/medik8s/node-healthcheck-operator/blob/9d59a0387a11c4d38ee45f8fb055a37727e02b74/api/v1alpha1/nodehealthcheck_webhook.go#L68)
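For reference, a sketch of how the linked signature could be applied here, assuming controller-runtime v0.15+ (where validators return admission.Warnings alongside an error) and a pointer-typed spec field; both are assumptions, not the PR's final code:

```go
import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// Sketch only: warn instead of failing while the minimum isn't known yet.
func (r *SelfNodeRemediationConfig) ValidateUpdate(old runtime.Object) (admission.Warnings, error) {
	if r.Status.MinSafeTimeToAssumeNodeRebootedSeconds == 0 {
		// Minimum not calculated yet; admit the change but surface a warning.
		return admission.Warnings{"minimum safe reboot time not calculated yet, skipping safe time validation"}, nil
	}
	if r.Spec.SafeTimeToAssumeNodeRebootedSeconds != nil &&
		*r.Spec.SafeTimeToAssumeNodeRebootedSeconds < r.Status.MinSafeTimeToAssumeNodeRebootedSeconds {
		return nil, fmt.Errorf("safe time must be at least the calculated minimum of %d seconds",
			r.Status.MinSafeTimeToAssumeNodeRebootedSeconds)
	}
	return nil, nil
}
```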

Signed-off-by: Michael Shitrit <[email protected]>
- improve method readability

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit (Member, Author) commented:

/test 4.16-openshift-e2e

- use midsentence lowercase
- remove redundant display name

Signed-off-by: Michael Shitrit <[email protected]>
@openshift-merge-robot (Contributor) commented:

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci bot (Contributor) commented Jun 6, 2024

@mshitrit: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/4.12-ci-bundle-self-node-remediation-bundle | 04a6964 | link | true | /test 4.12-ci-bundle-self-node-remediation-bundle
ci/prow/4.13-ci-bundle-self-node-remediation-bundle | 04a6964 | link | true | /test 4.13-ci-bundle-self-node-remediation-bundle
ci/prow/4.14-ci-bundle-self-node-remediation-bundle | 04a6964 | link | true | /test 4.14-ci-bundle-self-node-remediation-bundle
ci/prow/4.15-ci-bundle-self-node-remediation-bundle | 04a6964 | link | true | /test 4.15-ci-bundle-self-node-remediation-bundle
ci/prow/4.16-ci-bundle-self-node-remediation-bundle | 04a6964 | link | true | /test 4.16-ci-bundle-self-node-remediation-bundle

Full PR test history. Your PR dashboard.


@mshitrit (Member, Author) commented:

closing in favor of #214

@mshitrit closed this on Jun 24, 2024