
Configurable minimum worker nodecount #238

Open · wants to merge 3 commits into main from configurable_minimum_worker_nodecount_2024-10-02
Conversation


@novasbc novasbc commented Oct 2, 2024

Why we need this PR

The existing code requires at least one other peer worker node before remediation can occur. This precludes SNR from remediating in a configuration with 3 control plane nodes + 1 worker node, a scenario we support for bare-minimum deployments.

Changes made

  • Add a minPeersForRemediation configuration value. It defaults to 1, which maintains backward compatibility with existing deployments
  • Update getWorkerPeersResponse to take the new configuration value into account and not fail when there is no other peer and the user has configured the minimum to zero

Which issue(s) this PR fixes

Fixes #213

Test plan

@novasbc
Author

novasbc commented Oct 2, 2024

/test 4.15-openshift-e2e

Contributor

openshift-ci bot commented Oct 2, 2024

Hi @novasbc. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Oct 2, 2024

@novasbc: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test 4.15-openshift-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from e14b8aa to af2b099 Compare October 2, 2024 17:40
@slintes
Member

slintes commented Oct 8, 2024

Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes...
Also, please check the failed test.
Thanks

@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from af2b099 to 2beddcb Compare October 16, 2024 14:54
@novasbc
Author

novasbc commented Oct 16, 2024

Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes... Also, please check the failed test. Thanks

@slintes I updated the description and included the issue # as well.

Also fixed the build, which was failing with 'make verify-bundle' because the bundle hadn't been updated.

@slintes
Member

slintes commented Oct 17, 2024

Thanks!

/test 4.16-openshift-e2e

With this, one can specify the number of worker peers that must be
contactable before determining a node is unhealthy.

It covers the case in which there are 3 control plane nodes and a single
worker node, and you still want to be able to perform remediations
on that worker node.

It has a default of 1, which maintains existing behavior when the value is
not explicitly altered.
@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from 2beddcb to a99eed1 Compare October 17, 2024 20:38
@novasbc
Author

novasbc commented Oct 17, 2024

Fixed an issue that was causing a failure with make test, where the rebooter was nil

@novasbc novasbc changed the title Configurable minimum worker nodecount 2024 10 02 [WIP] Configurable minimum worker nodecount 2024 10 02 Oct 18, 2024
@novasbc
Author

novasbc commented Oct 18, 2024

/test 4.15-openshift-e2e

@novasbc
Author

novasbc commented Oct 18, 2024

/test 4.16-openshift-e2e

@novasbc novasbc changed the title [WIP] Configurable minimum worker nodecount 2024 10 02 Configurable minimum worker nodecount Oct 22, 2024
@novasbc novasbc marked this pull request as ready for review October 22, 2024 14:50
@openshift-ci openshift-ci bot requested review from mshitrit and razo7 October 22, 2024 14:50
@novasbc
Author

novasbc commented Oct 22, 2024

/test 4.15-openshift-e2e

@novasbc
Author

novasbc commented Oct 22, 2024

/test 4.13-openshift-e2e

1 similar comment
@novasbc
Author

novasbc commented Oct 23, 2024

/test 4.13-openshift-e2e

@novasbc
Author

novasbc commented Oct 23, 2024

@razo7 @mshitrit

I looked into the e2e failures reported over the past few days and realized they were due to temporary/environmental issues; when I re-ran the tests, they passed. We can't run the tests in an OpenShift environment locally, so we weren't seeing the same failures.

Anyhow, I believe this is ready for review.

Thanks!

Makefile (review thread, resolved)
// TODO: maybe we need to check if this happens too much and reboot
if peersToAsk == nil && c.config.MinPeersForRemediation != 0 || len(peersToAsk) < c.config.MinPeersForRemediation {
c.config.Log.Info("Peers list is empty and / or less than the minimum required peers for remediation, so consider the node being healthy")
//todo maybe we need to check if this happens too much and reboot
Contributor

Please, keep the older "TODO:" format, some tools look for it to count the open items :)

Author

Please, keep the older "TODO:" format, some tools look for it to count the open items :)

Apologies, some automated tooling we are using made that change 🤷‍♂️

if peersToAsk == nil || len(peersToAsk) == 0 {
c.config.Log.Info("Peers list is empty and / or couldn't be retrieved from server, nothing we can do, so consider the node being healthy")
// TODO: maybe we need to check if this happens too much and reboot
if peersToAsk == nil && c.config.MinPeersForRemediation != 0 || len(peersToAsk) < c.config.MinPeersForRemediation {
Contributor

It's a bit tricky, but as len(peersToAsk) is zero if peersToAsk is nil, I think you can get rid of the first part and just use len(peersToAsk) < c.config.MinPeersForRemediation.

  • if peersToAsk == nil (and so len(...) == 0) and c.config.MinPeersForRemediation != 0, then also len(peersToAsk) < c.config.MinPeersForRemediation is True.
  • For all the other combinations, we always need to evaluate the part after || anyway

return peers.Response{IsHealthy: true, Reason: peers.HealthyBecauseNoPeersWereFound}
}

//if MinPeersForRemediation == 0 and there are no peers to contact, assume node is unhealthy
if peersToAsk == nil || len(peersToAsk) == 0 {
Contributor

I wonder if it is meaningful to allow MinPeersForRemediation to be zero

Member

this is the purpose of the PR: changing the behaviour from "we are healthy" to "we are unhealthy" in case we have no peers, in a non-breaking way :)


// +kubebuilder:default:=1
// +kubebuilder:validation:Minimum=0
// Minimum number of peer workers/control nodes to attempt to contact before deciding if node is unhealthy or not
Contributor

I might understand the minimum value for workers, but for CP? Do we want to limit the peers to contact for both workers and CP, even if we have more than one?

Author

We do support single-node cluster installs for some configurations that go out the door


// +kubebuilder:default:=1
// +kubebuilder:validation:Minimum=0
// Minimum number of peer workers/control nodes to attempt to contact before deciding if node is unhealthy or not
Member

please add what happens when this is set to 0
nit: please move the comment above the kubebuilder markers

Member

also: does it make sense to allow a value higher than 1? If so, what happens when we have fewer workers available? 🤔

Member

In case a higher value does not make sense, I suggest switching to a boolean value

Author

also: does it make sense to allow a value higher than 1? If so, what happens when we have fewer workers available? 🤔

We wanted to make this configurable to give flexibility of choice across the different configurations this is used in, with differing node counts and use cases.

I do believe that in general it's going to be either zero or one, but we didn't want to hard-code that in case one of our configurations drives a higher value.

Author

please add what happens when this is set to 0; nit: please move the comment above the kubebuilder markers

I added more details and moved the comments

@@ -170,6 +176,7 @@ func NewDefaultSelfNodeRemediationConfig() SelfNodeRemediationConfig {
Spec: SelfNodeRemediationConfigSpec{
WatchdogFilePath: defaultWatchdogPath,
IsSoftwareRebootEnabled: defaultIsSoftwareRebootEnabled,
MinPeersForRemediation: defaultMinPeersForRemediation,
Member

this is redundant; it will be set by the API server because of the default defined in the kubebuilder marker

Author

Removed the redundant line

Contributor

openshift-ci bot commented Oct 24, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: novasbc
Once this PR has been reviewed and has the lgtm label, please ask for approval from clobrano. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/apicheck/check.go (outdated review thread, resolved)
// and don't want to remediate a node when we shouldn't. Note: It would be unusual for MinPeersForRemediation
// to be greater than 1 unless the environment has specific requirements.
if len(peersToAsk) < c.config.MinPeersForRemediation {
c.config.Log.Info("Peers list is empty and / or less than the minimum required peers for remediation, " +
Member

The log message can be confusing for this use case.
Assuming we don't use only 0 or 1 for MinPeersForRemediation, the peer list might not be empty.

Successfully merging this pull request may close these issues.

Support for remediation on single worker node configurations
5 participants