Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configurable minimum worker nodecount #238

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -160,7 +160,7 @@ manifests: controller-gen ## Generate WebhookConfiguration, ClusterRole and Cust
.PHONY: generate
generate: controller-gen protoc ## Generate code containing DeepCopy, DeepCopyInto, and DeepCopyObject method implementations. Also generate protoc / gRPC code
$(CONTROLLER_GEN) object:headerFile="hack/boilerplate.go.txt" paths="./..."
PATH=$(PATH):$(shell pwd)/bin/proto/bin && $(PROTOC) --go_out=. --go-grpc_out=. pkg/peerhealth/peerhealth.proto
PATH='$(PATH)':$(shell pwd)/bin/proto/bin && $(PROTOC) --go_out=. --go-grpc_out=. pkg/peerhealth/peerhealth.proto
clobrano marked this conversation as resolved.
Show resolved Hide resolved

.PHONY: fmt
fmt: ## Run go fmt against code.
Expand Down
7 changes: 7 additions & 0 deletions api/v1alpha1/selfnoderemediationconfig_types.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ const (
ConfigCRName = "self-node-remediation-config"
defaultWatchdogPath = "/dev/watchdog"
defaultIsSoftwareRebootEnabled = true
defaultMinPeersForRemediation = 1
)

// SelfNodeRemediationConfigSpec defines the desired state of SelfNodeRemediationConfig
Expand Down Expand Up @@ -127,6 +128,11 @@ type SelfNodeRemediationConfigSpec struct {
// CustomDsTolerations allows to add custom tolerations snr agents that are running on the ds in order to support remediation for different types of nodes.
// +optional
CustomDsTolerations []v1.Toleration `json:"customDsTolerations,omitempty"`

// +kubebuilder:default:=1
// +kubebuilder:validation:Minimum=0
// Minimum number of peer workers/control nodes to attempt to contact before deciding if node is unhealthy or not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might understand the minimum value for workers, but for CP? Do we want to limit the peers to contact for both workers and CP, even if we have more than one?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do support single node cluster install for some configs that go out the door

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add what happens when this is set 0
nit: please move comment above the kubebuilder markers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also: does it make sense allow a higher value than 1? If so, what happens when we have less workers available? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case higher value does not make sense, I suggest switching to a boolean value

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also: does it make sense allow a higher value than 1? If so, what happens when we have less workers available? 🤔

We wanted to make this configurable such that for different configs that we have this being used in - different amounts of nodes and differing use cases, we gave flexibility of choice.

I do believe that in general it's going to be either zero or one, but didn't want to make that change when one of our configurations drive it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add what happens when this is set 0 nit: please move comment above the kubebuilder markers

I added more details and moved the comments

MinPeersForRemediation int `json:"minPeersForRemediation,omitempty"`
}

// SelfNodeRemediationConfigStatus defines the observed state of SelfNodeRemediationConfig
Expand Down Expand Up @@ -170,6 +176,7 @@ func NewDefaultSelfNodeRemediationConfig() SelfNodeRemediationConfig {
Spec: SelfNodeRemediationConfigSpec{
WatchdogFilePath: defaultWatchdogPath,
IsSoftwareRebootEnabled: defaultIsSoftwareRebootEnabled,
MinPeersForRemediation: defaultMinPeersForRemediation,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is redundant, will be set by api server because of the default defined in the kubebuilder marker

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the redundant line

},
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,12 @@ spec:
its peers.
minimum: 1
type: integer
minPeersForRemediation:
default: 1
description: Minimum number of peer workers/control nodes to attempt
to contact before deciding if node is unhealthy or not
minimum: 0
type: integer
peerApiServerTimeout:
default: 5s
description: |-
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,12 @@ spec:
its peers.
minimum: 1
type: integer
minPeersForRemediation:
default: 1
description: Minimum number of peer workers/control nodes to attempt
to contact before deciding if node is unhealthy or not
minimum: 0
type: integer
peerApiServerTimeout:
default: 5s
description: |-
Expand Down
1 change: 1 addition & 0 deletions controllers/selfnoderemediationconfig_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,7 @@ func (r *SelfNodeRemediationConfigReconciler) syncConfigDaemonSet(ctx context.Co
data.Data["PeerRequestTimeout"] = snrConfig.Spec.PeerRequestTimeout.Nanoseconds()
data.Data["MaxApiErrorThreshold"] = snrConfig.Spec.MaxApiErrorThreshold
data.Data["EndpointHealthCheckUrl"] = snrConfig.Spec.EndpointHealthCheckUrl
data.Data["MinPeersForRemediation"] = snrConfig.Spec.MinPeersForRemediation
data.Data["HostPort"] = snrConfig.Spec.HostPort
data.Data["IsSoftwareRebootEnabled"] = fmt.Sprintf("\"%t\"", snrConfig.Spec.IsSoftwareRebootEnabled)

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -238,6 +238,7 @@ var _ = Describe("SNR Config Test", func() {
Expect(createdConfig.Spec.ApiServerTimeout.Seconds()).To(BeEquivalentTo(5))
Expect(createdConfig.Spec.ApiCheckInterval.Seconds()).To(BeEquivalentTo(15))
Expect(createdConfig.Spec.PeerUpdateInterval.Seconds()).To(BeEquivalentTo(15 * 60))
Expect(createdConfig.Spec.MinPeersForRemediation).To(BeEquivalentTo(1))
})
})

Expand Down
9 changes: 9 additions & 0 deletions controllers/tests/config/suite_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,10 @@ import (
"github.com/medik8s/self-node-remediation/pkg/apicheck"
"github.com/medik8s/self-node-remediation/pkg/certificates"
"github.com/medik8s/self-node-remediation/pkg/peers"

//+kubebuilder:scaffold:imports
"github.com/medik8s/self-node-remediation/pkg/reboot"
"github.com/medik8s/self-node-remediation/pkg/watchdog"
)

// These tests use Ginkgo (BDD-style Go testing framework). Refer to
Expand All @@ -52,6 +55,7 @@ var cancelFunc context.CancelFunc
var k8sClient *shared.K8sClientWrapper
var certReader certificates.CertStorageReader
var managerReconciler *controllers.SelfNodeRemediationReconciler
var dummyDog watchdog.Watchdog

func TestAPIs(t *testing.T) {
RegisterFailHandler(Fail)
Expand Down Expand Up @@ -120,12 +124,17 @@ var _ = BeforeSuite(func() {
Expect(err).ToNot(HaveOccurred())

certReader = certificates.NewSecretCertStorage(k8sClient, ctrl.Log.WithName("SecretCertStorage"), shared.Namespace)

dummyDog = watchdog.NewFake(true)
rebooter := reboot.NewWatchdogRebooter(dummyDog, ctrl.Log.WithName("rebooter"))

apiConnectivityCheckConfig := &apicheck.ApiConnectivityCheckConfig{
Log: ctrl.Log.WithName("api-check"),
MyNodeName: shared.UnhealthyNodeName,
CheckInterval: shared.ApiCheckInterval,
MaxErrorsThreshold: shared.MaxErrorThreshold,
Peers: peers,
Rebooter: rebooter,
Cfg: cfg,
CertReader: certReader,
}
Expand Down
2 changes: 2 additions & 0 deletions install/self-node-remediation-deamonset.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,8 @@ spec:
value: {{.EndpointHealthCheckUrl}}
- name: HOST_PORT
value: "{{.HostPort}}"
- name: MIN_PEERS_FOR_REMEDIATION
value: "{{.MinPeersForRemediation}}"
image: {{.Image}}
imagePullPolicy: Always
volumeMounts:
Expand Down
12 changes: 9 additions & 3 deletions pkg/apicheck/check.go
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ type ApiConnectivityCheckConfig struct {
PeerRequestTimeout time.Duration
PeerHealthPort int
MaxTimeForNoPeersResponse time.Duration
MinPeersForRemediation int
}

func New(config *ApiConnectivityCheckConfig, controlPlaneManager *controlplane.Manager) *ApiConnectivityCheck {
Expand Down Expand Up @@ -129,12 +130,17 @@ func (c *ApiConnectivityCheck) getWorkerPeersResponse() peers.Response {

c.config.Log.Info("Error count exceeds threshold, trying to ask other nodes if I'm healthy")
peersToAsk := c.config.Peers.GetPeersAddresses(peers.Worker)
if peersToAsk == nil || len(peersToAsk) == 0 {
c.config.Log.Info("Peers list is empty and / or couldn't be retrieved from server, nothing we can do, so consider the node being healthy")
// TODO: maybe we need to check if this happens too much and reboot
if peersToAsk == nil && c.config.MinPeersForRemediation != 0 || len(peersToAsk) < c.config.MinPeersForRemediation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit tricky, but as len(peersToAsk) is zero if peersToAsk is nil, I think you can get rid of the first part and just use len(peersToAsk) < c.config.MinPeersForRemediation.

  • if peersToAsk == nil (and so len(...) == 0) and c.config.MinPeersForRemediation != 0, then also len(peersToAsk) < c.config.MinPeersForRemediation is True.
  • For all the other combinations, we always need to evaluate the part after || anyway

mshitrit marked this conversation as resolved.
Show resolved Hide resolved
c.config.Log.Info("Peers list is empty and / or less than the minimum required peers for remediation, so consider the node being healthy")
//todo maybe we need to check if this happens too much and reboot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, keep the older "TODO:" format, some tools look for it to count the open items :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, keep the older "TODO:" format, some tools look for it to count the open items :)

Apologies, some automated tooling we are using made that change 🤷‍♂️

return peers.Response{IsHealthy: true, Reason: peers.HealthyBecauseNoPeersWereFound}
}

//if MinPeersForRemediation == 0 and there are no peers to contact, assume node is unhealthy
if peersToAsk == nil || len(peersToAsk) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is meaningful to allow MinPeersForRemediation to be zero

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is purpose of the PR: change the behaviour from "we are healthy" to "we are unhealthy" in case we have no peer, in a non breaking way :)

return peers.Response{IsHealthy: false, Reason: peers.UnHealthyBecauseNodeIsIsolated}
}

apiErrorsResponsesSum := 0
nrAllPeers := len(peersToAsk)
// peersToAsk is being reduced at every iteration, iterate until no peers left to ask
Expand Down
Loading