Configurable minimum worker nodecount #238
Conversation
/test 4.15-openshift-e2e
Hi @novasbc. Thanks for your PR. I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@novasbc: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` comment.
Force-pushed from e14b8aa to af2b099.
Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes...
Force-pushed from af2b099 to 2beddcb.
@slintes I updated the description and included the issue # as well. Also fixed the build, which was failing on `make verify-bundle` because the bundle hadn't been updated.
Thanks!

/test 4.16-openshift-e2e
With this change, one can specify the number of worker peers that must be contactable before determining a node is unhealthy. It covers the case in which there are 3 control plane nodes and a single worker node, and you still want to be able to perform remediations on that worker node. The default is 1, which maintains the existing behavior when the value is not explicitly set.
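For illustration, a minimal config sketch of how the new field might be set. The field name `minPeersForRemediation` is assumed from the Go field `MinPeersForRemediation`, and the metadata values are placeholders:

```yaml
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationConfig
metadata:
  name: self-node-remediation-config   # placeholder name
  namespace: my-operator-namespace     # placeholder namespace
spec:
  # 0 allows remediation even when no worker peers can be contacted
  # (e.g. 3 control plane nodes + 1 worker); the default of 1 keeps
  # the previous behavior.
  minPeersForRemediation: 0
```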
Force-pushed from 2beddcb to a99eed1.
Fixed an issue which was causing a failure with `make test`, regarding the rebooter being nil.

/test 4.15-openshift-e2e

/test 4.16-openshift-e2e

/test 4.15-openshift-e2e

/test 4.13-openshift-e2e
I looked into the e2e failures reported over the past few days and realized they were due to temporary/environmental issues; when I re-ran the tests they passed. We can't run the tests in an OpenShift environment locally, so we weren't seeing the same failures. Anyhow, I believe this is ready for review. Thanks!
pkg/apicheck/check.go
Outdated
// TODO: maybe we need to check if this happens too much and reboot
if peersToAsk == nil && c.config.MinPeersForRemediation != 0 || len(peersToAsk) < c.config.MinPeersForRemediation {
	c.config.Log.Info("Peers list is empty and / or less than the minimum required peers for remediation, so consider the node being healthy")
	//todo maybe we need to check if this happens too much and reboot
Please, keep the older "TODO:" format, some tools look for it to count the open items :)
Apologies, some automated tooling we are using made that change 🤷♂️
pkg/apicheck/check.go
Outdated
if peersToAsk == nil || len(peersToAsk) == 0 {
	c.config.Log.Info("Peers list is empty and / or couldn't be retrieved from server, nothing we can do, so consider the node being healthy")
	// TODO: maybe we need to check if this happens too much and reboot
if peersToAsk == nil && c.config.MinPeersForRemediation != 0 || len(peersToAsk) < c.config.MinPeersForRemediation {
It's a bit tricky, but as `len(peersToAsk)` is zero if `peersToAsk` is `nil`, I think you can get rid of the first part and just use `len(peersToAsk) < c.config.MinPeersForRemediation`.

- If `peersToAsk == nil` (and so `len(...) == 0`) and `c.config.MinPeersForRemediation != 0`, then `len(peersToAsk) < c.config.MinPeersForRemediation` is also true.
- For all the other combinations, we always need to evaluate the part after `||` anyway.
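The reasoning above can be checked exhaustively in a small standalone program. This is a sketch with local stand-ins for the PR's `peersToAsk` slice and `MinPeersForRemediation` field, not the project's actual code:

```go
package main

import "fmt"

// simplifiedMatchesOriginal reports whether the reviewer's simplified
// condition agrees with the original compound condition for the given inputs.
func simplifiedMatchesOriginal(peers []string, min int) bool {
	// Original check from the PR: nil-and-nonzero-min, OR fewer peers than min.
	original := peers == nil && min != 0 || len(peers) < min
	// Suggested simplification: len of a nil slice is 0 in Go,
	// so the nil check is redundant.
	simplified := len(peers) < min
	return original == simplified
}

func main() {
	// Cover nil, empty, and non-empty peer lists against several minimums.
	for _, peers := range [][]string{nil, {}, {"a"}, {"a", "b"}} {
		for _, min := range []int{0, 1, 2} {
			if !simplifiedMatchesOriginal(peers, min) {
				fmt.Println("mismatch for", peers, min)
				return
			}
		}
	}
	fmt.Println("simplified condition is equivalent in all cases")
}
```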
pkg/apicheck/check.go
Outdated
	return peers.Response{IsHealthy: true, Reason: peers.HealthyBecauseNoPeersWereFound}
}

//if MinPeersForRemediation == 0 and there are no peers to contact, assume node is unhealthy
if peersToAsk == nil || len(peersToAsk) == 0 {
I wonder if it is meaningful to allow `MinPeersForRemediation` to be zero
this is the purpose of the PR: change the behaviour from "we are healthy" to "we are unhealthy" in case we have no peers, in a non-breaking way :)
// +kubebuilder:default:=1
// +kubebuilder:validation:Minimum=0
// Minimum number of peer workers/control nodes to attempt to contact before deciding if node is unhealthy or not
I might understand the minimum value for workers, but for CP? Do we want to limit the peers to contact for both workers and CP, even if we have more than one?
We do support single node cluster install for some configs that go out the door
// +kubebuilder:default:=1
// +kubebuilder:validation:Minimum=0
// Minimum number of peer workers/control nodes to attempt to contact before deciding if node is unhealthy or not
please add what happens when this is set 0
nit: please move comment above the kubebuilder markers
also: does it make sense to allow a value higher than 1? If so, what happens when we have fewer workers available? 🤔
In case higher value does not make sense, I suggest switching to a boolean value
> also: does it make sense to allow a value higher than 1? If so, what happens when we have fewer workers available? 🤔
We wanted to make this configurable because it is used across different configs, with different numbers of nodes and differing use cases, so we gave flexibility of choice.
I do believe that in general it's going to be either zero or one, but I didn't want to hard-code that in case one of our configurations drives a different value.
> please add what happens when this is set 0
> nit: please move comment above the kubebuilder markers
I added more details and moved the comments
@@ -170,6 +176,7 @@ func NewDefaultSelfNodeRemediationConfig() SelfNodeRemediationConfig {
	Spec: SelfNodeRemediationConfigSpec{
		WatchdogFilePath:        defaultWatchdogPath,
		IsSoftwareRebootEnabled: defaultIsSoftwareRebootEnabled,
		MinPeersForRemediation:  defaultMinPeersForRemediation,
this is redundant; it will be set by the API server because of the default defined in the kubebuilder marker
Removed the redundant line
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: novasbc. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
// and don't want to remediate a node when we shouldn't. Note: It would be unusual for MinPeersForRemediation
// to be greater than 1 unless the environment has specific requirements.
if len(peersToAsk) < c.config.MinPeersForRemediation {
	c.config.Log.Info("Peers list is empty and / or less than the minimum required peers for remediation, " +
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log message can be confusing for this use case. Assuming we don't use only 1 or 0 as values for `MinPeersForRemediation`, the peers list might not be empty.
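One possible way to address this comment: report the actual counts instead of claiming the list is empty. This is only a sketch; the project uses `c.config.Log` (a logr logger), which is approximated here with the standard library's `slog`, and the values are hypothetical:

```go
package main

import "log/slog"

func main() {
	// Hypothetical values; in the PR these come from c.config and the peers list.
	peersToAsk := []string{"worker-1"}
	minPeersForRemediation := 3

	if len(peersToAsk) < minPeersForRemediation {
		// State the counts explicitly, since with a minimum above 1 the
		// peers list can be non-empty and still be below the threshold.
		slog.Info("fewer peers available than the minimum required for remediation; considering the node healthy",
			"peersFound", len(peersToAsk),
			"minPeersForRemediation", minPeersForRemediation)
	}
}
```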
Why we need this PR
Existing code requires there to be at least one other peer worker node before remediation can occur, precluding SNR from remediating on a configuration with 3 control plane nodes + 1 worker node, which is a scenario that we support for bare minimum deployments.
Changes made
Which issue(s) this PR fixes
Fixes #213
Test plan