
Configurable minimum worker nodecount #238

Open · wants to merge 3 commits into main from configurable_minimum_worker_nodecount_2024-10-02
Conversation


@novasbc novasbc commented Oct 2, 2024

Why we need this PR

The existing code requires at least one other peer worker node before remediation can occur. This precludes SNR from remediating in a configuration with 3 control plane nodes + 1 worker node, a scenario we support for bare-minimum deployments.

Changes made

  • Add a minPeersForRemediation configuration value. It defaults to 1, which maintains backward compatibility with existing deployments
  • Update getWorkerPeersResponse to take the new configuration value into account and not fail when there is no other peer and the user has configured the minimum to zero

Which issue(s) this PR fixes

Fixes #213

Test plan

@novasbc
Author

novasbc commented Oct 2, 2024

/test 4.15-openshift-e2e

Contributor

openshift-ci bot commented Oct 2, 2024

Hi @novasbc. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Oct 2, 2024

@novasbc: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test 4.15-openshift-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from e14b8aa to af2b099 Compare October 2, 2024 17:40
@slintes
Member

slintes commented Oct 8, 2024

Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes...
Also, please check the failed test.
Thanks

@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from af2b099 to 2beddcb Compare October 16, 2024 14:54
@novasbc
Author

novasbc commented Oct 16, 2024

Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes... Also, please check the failed test. Thanks

@slintes I updated the description and included the issue # as well.

Also fixed the build, which was failing with 'make verify-bundle' because the bundle hadn't been updated.

@slintes
Member

slintes commented Oct 17, 2024

Thanks!

/test 4.16-openshift-e2e

With this, one can specify the number of worker peers that must be
contactable before determining a node is unhealthy.

It covers the case in which there are 3 control plane nodes and a single
worker node, and you still want to be able to perform remediations
on that worker node.

It has a default of 1, which maintains existing behavior when the value is
not explicitly altered.
@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from 2beddcb to a99eed1 Compare October 17, 2024 20:38
@novasbc
Author

novasbc commented Oct 17, 2024

Fixed an issue that was causing a failure with make test, where the rebooter was nil

@novasbc novasbc changed the title Configurable minimum worker nodecount 2024 10 02 [WIP] Configurable minimum worker nodecount 2024 10 02 Oct 18, 2024
@novasbc
Author

novasbc commented Oct 18, 2024

/test 4.15-openshift-e2e

@novasbc
Author

novasbc commented Oct 18, 2024

/test 4.16-openshift-e2e

@novasbc novasbc changed the title [WIP] Configurable minimum worker nodecount 2024 10 02 Configurable minimum worker nodecount Oct 22, 2024
@novasbc novasbc marked this pull request as ready for review October 22, 2024 14:50
@openshift-ci openshift-ci bot requested review from mshitrit and razo7 October 22, 2024 14:50
@novasbc
Author

novasbc commented Oct 22, 2024

/test 4.15-openshift-e2e

@novasbc
Author

novasbc commented Oct 22, 2024

/test 4.13-openshift-e2e

1 similar comment
@novasbc
Author

novasbc commented Oct 23, 2024

/test 4.13-openshift-e2e

@novasbc
Author

novasbc commented Oct 23, 2024

@razo7 @mshitrit

I looked into the e2e failures reported over the past few days and realized they were due to temporary/environmental issues; when I re-ran the tests, they passed. We can't run the tests in an OpenShift environment locally, so we weren't seeing the same failures.

Anyhow, I believe this is ready for review.

Thanks!

Makefile (review thread, resolved)
// TODO: maybe we need to check if this happens too much and reboot
if peersToAsk == nil && c.config.MinPeersForRemediation != 0 || len(peersToAsk) < c.config.MinPeersForRemediation {
c.config.Log.Info("Peers list is empty and / or less than the minimum required peers for remediation, so consider the node being healthy")
//todo maybe we need to check if this happens too much and reboot
Contributor

Please, keep the older "TODO:" format, some tools look for it to count the open items :)

Author

Please, keep the older "TODO:" format, some tools look for it to count the open items :)

Apologies, some automated tooling we are using made that change 🤷‍♂️

if peersToAsk == nil || len(peersToAsk) == 0 {
c.config.Log.Info("Peers list is empty and / or couldn't be retrieved from server, nothing we can do, so consider the node being healthy")
// TODO: maybe we need to check if this happens too much and reboot
if peersToAsk == nil && c.config.MinPeersForRemediation != 0 || len(peersToAsk) < c.config.MinPeersForRemediation {
Contributor

It's a bit tricky, but as len(peersToAsk) is zero if peersToAsk is nil, I think you can get rid of the first part and just use len(peersToAsk) < c.config.MinPeersForRemediation.

  • if peersToAsk == nil (and so len(...) == 0) and c.config.MinPeersForRemediation != 0, then also len(peersToAsk) < c.config.MinPeersForRemediation is True.
  • For all the other combinations, we always need to evaluate the part after || anyway

return peers.Response{IsHealthy: true, Reason: peers.HealthyBecauseNoPeersWereFound}
}

//if MinPeersForRemediation == 0 and there are no peers to contact, assume node is unhealthy
if peersToAsk == nil || len(peersToAsk) == 0 {
Contributor

I wonder if it is meaningful to allow MinPeersForRemediation to be zero

Member

this is the purpose of the PR: changing the behaviour from "we are healthy" to "we are unhealthy" in case we have no peers, in a non-breaking way :)


// +kubebuilder:default:=1
// +kubebuilder:validation:Minimum=0
// Minimum number of peer workers/control nodes to attempt to contact before deciding if node is unhealthy or not
Contributor

I might understand the minimum value for workers, but for CP? Do we want to limit the peers to contact for both workers and CP, even if we have more than one?

Author

We do support single-node cluster installs for some configurations that go out the door


// +kubebuilder:default:=1
// +kubebuilder:validation:Minimum=0
// Minimum number of peer workers/control nodes to attempt to contact before deciding if node is unhealthy or not
Member

please add what happens when this is set to 0
nit: please move the comment above the kubebuilder markers

Member

also: does it make sense to allow a value higher than 1? If so, what happens when we have fewer workers available? 🤔

Member

In case a higher value does not make sense, I suggest switching to a boolean value

Author

also: does it make sense to allow a value higher than 1? If so, what happens when we have fewer workers available? 🤔

We wanted to make this configurable to give flexibility of choice across the different configurations this is used in, with differing node counts and use cases.

I do believe that in general it's going to be either zero or one, but we didn't want to hard-code that in case one of our configurations drives a higher value.

Author

please add what happens when this is set to 0; nit: please move the comment above the kubebuilder markers

I added more details and moved the comments

@@ -170,6 +176,7 @@ func NewDefaultSelfNodeRemediationConfig() SelfNodeRemediationConfig {
Spec: SelfNodeRemediationConfigSpec{
WatchdogFilePath: defaultWatchdogPath,
IsSoftwareRebootEnabled: defaultIsSoftwareRebootEnabled,
MinPeersForRemediation: defaultMinPeersForRemediation,
Member

this is redundant; it will be set by the API server because of the default defined in the kubebuilder marker

Author

Removed the redundant line

Contributor

openshift-ci bot commented Oct 24, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: novasbc
Once this PR has been reviewed and has the lgtm label, please ask for approval from clobrano. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/apicheck/check.go (outdated review thread, resolved)
// and don't want to remediate a node when we shouldn't. Note: It would be unusual for MinPeersForRemediation
// to be greater than 1 unless the environment has specific requirements.
if len(peersToAsk) < c.config.MinPeersForRemediation {
c.config.Log.Info("Peers list is empty and / or less than the minimum required peers for remediation, " +
Member

The log message can be confusing for this use case.
Assuming we don't use only 0 or 1 for MinPeersForRemediation, the peer list might not be empty.

Successfully merging this pull request may close these issues.

Support for remediation on single worker node configurations
5 participants