[WIP] Dynamically set safe time to assume node rebooted seconds #197

Closed
41 changes: 31 additions & 10 deletions api/v1alpha1/selfnoderemediationconfig_types.go
@@ -25,10 +25,12 @@ import (
// NOTE: json tags are required. Any new fields you add must have json tags for the fields to be serialized.

const (
-ConfigCRName = "self-node-remediation-config"
-defaultWatchdogPath = "/dev/watchdog"
-DefaultSafeToAssumeNodeRebootTimeout = 180
-defaultIsSoftwareRebootEnabled = true
+ConfigCRName = "self-node-remediation-config"
+defaultWatchdogPath = "/dev/watchdog"
+defaultIsSoftwareRebootEnabled = true
+
+// SafeTimeToAssumeNodeRebootedOverriddenConditionType is the condition type used to signal whether SNR is overriding SelfNodeRemediationConfigSpec.SafeTimeToAssumeNodeRebootedSeconds with SelfNodeRemediationConfigStatus.MinSafeTimeToAssumeNodeRebootedSeconds
+SafeTimeToAssumeNodeRebootedOverriddenConditionType = "SafeTimeToAssumeNodeRebootedOverridden"
)

// SelfNodeRemediationConfigSpec defines the desired state of SelfNodeRemediationConfig
@@ -46,8 +48,7 @@ type SelfNodeRemediationConfigSpec struct {
// node will likely lead to data corruption and violation of run-once semantics.
// In an effort to prevent this, the operator ignores values lower than a minimum calculated from the
// ApiCheckInterval, ApiServerTimeout, MaxApiErrorThreshold, PeerDialTimeout, and PeerRequestTimeout fields.
-// +kubebuilder:validation:Minimum=0
-// +kubebuilder:default=180
+// +kubebuilder:validation:Minimum=1
Member

in theory at least this can be considered as an API change. It makes CRs with 0 values invalid...

Member Author

[Context] I added this change to differentiate between a 0 value that is just the default (when the field is empty) and a 0 value filled in by the user. This differentiation was needed in the webhook.

in theory at least this can be considered as an API change. It makes CRs with 0 values invalid...

That's a good point. After some thought, I think we are still in the clear, though.

Here is my line of thought:

  • A 0 value from an older version would cause SafeTimeToAssumeNodeRebootedSeconds to be removed.
  • So IIUC the risk here is that a user finds SafeTimeToAssumeNodeRebootedSeconds (previously set to 0) missing after an upgrade.
  • Since the risk is low and the consequences of it materializing are minor, I think we can go ahead with this change.
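
The first bullet relies on how `omitempty` treats a plain `int`: a zero value is dropped on serialization, so an explicit 0 and an unset field produce the same JSON. A minimal sketch using only `encoding/json`; the `spec` type and the `marshalSpec` helper are hypothetical, cut-down stand-ins and not part of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spec is a hypothetical, cut-down stand-in for SelfNodeRemediationConfigSpec.
// Only the field under discussion is kept, with the same JSON tag as in the PR.
type spec struct {
	SafeTimeToAssumeNodeRebootedSeconds int `json:"safeTimeToAssumeNodeRebootedSeconds,omitempty"`
}

// marshalSpec serializes a spec holding the given number of seconds.
func marshalSpec(seconds int) string {
	out, err := json.Marshal(spec{SafeTimeToAssumeNodeRebootedSeconds: seconds})
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	fmt.Println(marshalSpec(0))   // the field is omitted entirely
	fmt.Println(marshalSpec(200)) // the field is present
}
```

With a plain `int`, the serialized object alone cannot tell "set to 0" apart from "not set", which is why the new minimum of 1 reserves 0 for the unset case.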

SafeTimeToAssumeNodeRebootedSeconds int `json:"safeTimeToAssumeNodeRebootedSeconds,omitempty"`

// Valid time units are "ms", "s", "m", "h".
@@ -69,7 +70,7 @@ type SelfNodeRemediationConfigSpec struct {
// Valid time units are "ms", "s", "m", "h".
// +optional
// +kubebuilder:default:="15m"
-// +kubebuilder:validation:Pattern="^(0|([0-9]+(\\.[0-9]+)?(ms|s|m|h)))$"
+// +kubebuilder:validation:Pattern="^(0|([0-9]+(\\.[0-9]+)?(ms|s|m|h)))$|^([1-9][0-9]*m0s)$"
slintes marked this conversation as resolved.
// +kubebuilder:validation:Type:=string
PeerUpdateInterval *metav1.Duration `json:"peerUpdateInterval,omitempty"`

@@ -128,6 +129,27 @@ type SelfNodeRemediationConfigSpec struct {
type SelfNodeRemediationConfigStatus struct {
// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
// Important: Run "make" to regenerate code after modifying this file

// MinSafeTimeToAssumeNodeRebootedSeconds is the minimum value that can be assigned to SelfNodeRemediationConfigSpec.SafeTimeToAssumeNodeRebootedSeconds, it is calculated and assigned dynamically.
slintes marked this conversation as resolved.
// +optional
// +kubebuilder:validation:Minimum=0
MinSafeTimeToAssumeNodeRebootedSeconds int `json:"minSafeTimeToAssumeNodeRebootedSeconds,omitempty"`

// Conditions represents the observations of a SelfNodeRemediationConfig's current state.
// Known .status.conditions.type are: "SafeTimeToAssumeNodeRebootedOverridden"
// +operator-sdk:csv:customresourcedefinitions:type=status,xDescriptors="urn:alm:descriptor:io.kubernetes.conditions"
// +listType=map
// +listMapKey=type
// +optional
Conditions []metav1.Condition `json:"conditions,omitempty"`

// LastUpdateTime is the last time the status was updated.
//
//+optional
//+kubebuilder:validation:Type=string
//+kubebuilder:validation:Format=date-time
//+operator-sdk:csv:customresourcedefinitions:type=status
LastUpdateTime *metav1.Time `json:"lastUpdateTime,omitempty"`
}
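
A rough sketch of how the new status fields could fit together, under stated assumptions: the operator uses the larger of the user value and the calculated minimum, and records in a condition keyed by type whether it overrode the user value. The `condition` struct is a simplified stand-in for `metav1.Condition`; `setCondition`, `effectiveSafeTime`, and `boolStatus` are hypothetical helpers, and treating an unset value (0) as "not overridden" is a guess, not something the PR states.

```go
package main

import "fmt"

// condition is a simplified, hypothetical stand-in for metav1.Condition.
type condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// Value taken from the constant added in this PR.
const overriddenType = "SafeTimeToAssumeNodeRebootedOverridden"

// setCondition inserts c or replaces the entry with the same Type, so the
// slice holds at most one entry per type (matching +listType=map with
// +listMapKey=type).
func setCondition(conds []condition, c condition) []condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// effectiveSafeTime returns the value to use and whether the user value was
// overridden by the calculated minimum.
func effectiveSafeTime(specSeconds, minSeconds int) (int, bool) {
	if specSeconds >= minSeconds {
		return specSeconds, false
	}
	// Assumption: an unset field (0) falls back to the minimum without
	// being reported as an override.
	return minSeconds, specSeconds != 0
}

// boolStatus maps a bool to the condition status strings.
func boolStatus(b bool) string {
	if b {
		return "True"
	}
	return "False"
}

func main() {
	var conds []condition
	for _, specSeconds := range []int{100, 200} {
		seconds, overridden := effectiveSafeTime(specSeconds, 150)
		conds = setCondition(conds, condition{Type: overriddenType, Status: boolStatus(overridden)})
		fmt.Println(seconds, conds)
	}
}
```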

//+kubebuilder:object:root=true
@@ -161,9 +183,8 @@ func NewDefaultSelfNodeRemediationConfig() SelfNodeRemediationConfig {
return SelfNodeRemediationConfig{
ObjectMeta: metav1.ObjectMeta{Name: ConfigCRName},
Spec: SelfNodeRemediationConfigSpec{
-WatchdogFilePath: defaultWatchdogPath,
-SafeTimeToAssumeNodeRebootedSeconds: DefaultSafeToAssumeNodeRebootTimeout,
-IsSoftwareRebootEnabled: defaultIsSoftwareRebootEnabled,
+WatchdogFilePath: defaultWatchdogPath,
+IsSoftwareRebootEnabled: defaultIsSoftwareRebootEnabled,
},
}
}
30 changes: 24 additions & 6 deletions api/v1alpha1/selfnoderemediationconfig_webhook.go
@@ -30,12 +30,13 @@ import (

// fields names
const (
-peerApiServerTimeout = "PeerApiServerTimeout"
-apiServerTimeout = "ApiServerTimeout"
-peerDialTimeout = "PeerDialTimeout"
-peerRequestTimeout = "PeerRequestTimeout"
-apiCheckInterval = "ApiCheckInterval"
-peerUpdateInterval = "PeerUpdateInterval"
+peerApiServerTimeout = "PeerApiServerTimeout"
+apiServerTimeout = "ApiServerTimeout"
+peerDialTimeout = "PeerDialTimeout"
+peerRequestTimeout = "PeerRequestTimeout"
+apiCheckInterval = "ApiCheckInterval"
+peerUpdateInterval = "PeerUpdateInterval"
+safeTimeToAssumeNodeRebootedSeconds = "SafeTimeToAssumeNodeRebootedSeconds"
)

// minimal time durations allowed for fields
@@ -85,6 +86,7 @@ func (r *SelfNodeRemediationConfig) ValidateUpdate(_ runtime.Object) error {
return errors.NewAggregate([]error{
r.validateTimes(),
r.validateCustomTolerations(),
r.validateMinRebootTime(),
})
}

@@ -172,3 +174,19 @@ func validateToleration(toleration v1.Toleration) error {
}
return nil
}

func (r *SelfNodeRemediationConfig) validateMinRebootTime() error {
if r.Status.MinSafeTimeToAssumeNodeRebootedSeconds == 0 {
Member

depending on a status field being set is very weird, see comment below

Member Author

It depends on the calculated value of MinSafeTimeToAssumeNodeRebootedSeconds.
Initially it was set in an annotation, but it didn't make sense to keep that annotation once we populate the Status.

Does it bother you because you consider Status fields to be for user information only?

see comment below

Not sure which one you mean 😅

Member

I was referring to this comment ("below" as in "in a comment for a later line of code" 😉): #197 (comment)

It's weird that validation fails because a status field (or annotation, doesn't matter) isn't set yet, isn't it? 🤔
I think this might be a use case for the new warning, which can be returned instead of an error. That needs some dependency updates though, IIUC (see the new method signature e.g. here: https://github.com/medik8s/node-healthcheck-operator/blob/9d59a0387a11c4d38ee45f8fb055a37727e02b74/api/v1alpha1/nodehealthcheck_webhook.go#L68)

Member Author

It's weird that validation fails because a status field (or annotation, doesn't matter) isn't set yet, isn't it? 🤔

I don't think so. IMO we should consider two main factors:

  • Why it wasn't set: in our case this field is set when the agent DS starts, so if it's not set, that indicates something went very wrong.
  • The implication of this field being empty: in our case it means we can't determine a safe value for Spec.SafeTimeToAssumeNodeRebootedSeconds, and IIUC using an unsafe value could have very bad implications.

I think this might be a use case for the new warning, which can be returned instead of an error.

Hmm, good to know! I didn't even consider the warning when I added that piece of code; I just needed a customized validator that has access to the client.
Do we actually use the warning anywhere in the code? Any idea what it does as far as user experience goes?
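
To make the warning idea concrete, here is a hedged sketch of a validator that returns a warning rather than an error when the calculated minimum is not yet available. Newer controller-runtime versions have validation methods return a warnings value alongside the error; the `warnings` type below is a local stand-in for that type (a slice of strings) so the sketch compiles on its own, and `validateWithWarning` is a hypothetical function, not code from this PR.

```go
package main

import "fmt"

// warnings is a local stand-in for controller-runtime's admission.Warnings,
// declared here only to keep the sketch self-contained.
type warnings []string

// validateWithWarning is a hypothetical variant of validateMinRebootTime:
// a missing calculated minimum yields a warning instead of rejecting the update.
func validateWithWarning(specSeconds, minSeconds int) (warnings, error) {
	if minSeconds == 0 {
		return warnings{"calculated minimum is not available yet, the value was not checked"}, nil
	}
	if specSeconds != 0 && specSeconds < minSeconds {
		return nil, fmt.Errorf("value %d is below the calculated minimum of %d", specSeconds, minSeconds)
	}
	return nil, nil
}

func main() {
	w, err := validateWithWarning(200, 0)
	fmt.Println(w, err) // one warning, no error
	w, err = validateWithWarning(200, 220)
	fmt.Println(w, err) // no warning, an error
}
```

The trade-off raised in this thread still applies: with a warning, an update goes through even though the value could not be checked against a minimum.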

return fmt.Errorf("failed to verify min value of SafeRebootTimeSec, Status.MinSafeTimeToAssumeNodeRebootedSeconds should not be empty")
}

//allow removing this optional field
if r.Spec.SafeTimeToAssumeNodeRebootedSeconds == 0 {
Member

Now I understand the change of the minimum value from 0 to 1. But IMHO using 0 as an indication of "not set" is error-prone. The field should be a pointer when it's optional IMHO...?

Member Author

I agree, I'm just not sure whether this qualifies as an API change or not 🤔, WDYT?

Member

I wouldn't consider making a field optional an API change. The other way around I would.

Member Author

Technically I don't think we are making the field optional, since IIRC a field is optional by default when it isn't marked required; we just remove the default value (in any case, I agree we are in the clear).

What I was worried about is changing from int to *int.
IIUC you don't think it's an issue?
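
For reference, a small sketch of what the pointer suggestion buys: with `*int`, a decoded object distinguishes an absent field (nil) from an explicit 0. The `specWithPointer` type and the `describe` helper are hypothetical and only illustrate the idea; they are not the change made in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// specWithPointer is a hypothetical variant of the spec in which the
// optional field is a pointer.
type specWithPointer struct {
	SafeTimeToAssumeNodeRebootedSeconds *int `json:"safeTimeToAssumeNodeRebootedSeconds,omitempty"`
}

// describe reports whether the field was absent, or which value it carried.
func describe(raw string) string {
	var s specWithPointer
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return "invalid: " + err.Error()
	}
	if s.SafeTimeToAssumeNodeRebootedSeconds == nil {
		return "not set"
	}
	return fmt.Sprintf("set to %d", *s.SafeTimeToAssumeNodeRebootedSeconds)
}

func main() {
	fmt.Println(describe(`{}`))
	fmt.Println(describe(`{"safeTimeToAssumeNodeRebootedSeconds":0}`))
	fmt.Println(describe(`{"safeTimeToAssumeNodeRebootedSeconds":200}`))
}
```

With this shape the webhook could check for nil instead of 0, and the schema minimum would not need to reserve 0 as a sentinel.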

return nil
}

if r.Status.MinSafeTimeToAssumeNodeRebootedSeconds > r.Spec.SafeTimeToAssumeNodeRebootedSeconds {
return fmt.Errorf("can not set SafeTimeToAssumeNodeRebootedSeconds value below the calculated minimum value of: %d", r.Status.MinSafeTimeToAssumeNodeRebootedSeconds)
}
return nil
}
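
The three branches of `validateMinRebootTime` can be exercised in isolation. The sketch below restates the logic over plain integers, using the same numbers as the update tests in this PR; `checkMinRebootTime` is a hypothetical helper and its error texts are shortened.

```go
package main

import (
	"errors"
	"fmt"
)

// checkMinRebootTime restates validateMinRebootTime over plain ints:
// minSeconds plays the role of Status.MinSafeTimeToAssumeNodeRebootedSeconds,
// specSeconds the role of Spec.SafeTimeToAssumeNodeRebootedSeconds.
func checkMinRebootTime(specSeconds, minSeconds int) error {
	if minSeconds == 0 {
		return errors.New("calculated minimum should not be empty")
	}
	if specSeconds == 0 {
		// The optional field was removed by the user, which is allowed.
		return nil
	}
	if minSeconds > specSeconds {
		return fmt.Errorf("value below the calculated minimum value of: %d", minSeconds)
	}
	return nil
}

func main() {
	// Same numbers as the update tests: the user asks for 200 seconds.
	for _, minSeconds := range []int{0, 220, 190} {
		fmt.Printf("spec=200 min=%d -> %v\n", minSeconds, checkMinRebootTime(200, minSeconds))
	}
}
```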
42 changes: 41 additions & 1 deletion api/v1alpha1/selfnoderemediationconfig_webhook_test.go
@@ -71,6 +71,41 @@ var _ = Describe("SelfNodeRemediationConfig Validation", func() {
// test update validation on a valid CR
testValidCR("update")

Context("SafeTimeToAssumeNodeRebootedSeconds Validation", func() {
var originalSnrc, updatedSnrc *SelfNodeRemediationConfig
BeforeEach(func() {
originalSnrc = createDefaultSelfNodeRemediationConfigCR()
updatedSnrc = originalSnrc.DeepCopy()
updatedSnrc.Spec.SafeTimeToAssumeNodeRebootedSeconds = 200
})
When("MinSafeTimeToAssumeNodeRebootedSeconds does not exist or is empty", func() {
It("validation should fail", func() {
err := updatedSnrc.ValidateUpdate(originalSnrc)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(MatchRegexp("Status.MinSafeTimeToAssumeNodeRebootedSeconds should not be empty"))
})
})
When("MinSafeTimeToAssumeNodeRebootedSeconds value is higher than user assigned value", func() {
BeforeEach(func() {
updatedSnrc.Status.MinSafeTimeToAssumeNodeRebootedSeconds = 220
})
It("validation should fail", func() {
err := updatedSnrc.ValidateUpdate(originalSnrc)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(MatchRegexp("value below the calculated minimum value of: 220"))
})
})
When("MinSafeTimeToAssumeNodeRebootedSeconds value is lower than user assigned value", func() {
BeforeEach(func() {
updatedSnrc.Status.MinSafeTimeToAssumeNodeRebootedSeconds = 190
})
It("validation should pass", func() {
Expect(updatedSnrc.ValidateUpdate(originalSnrc)).To(Succeed())
})
})

})

})

})
@@ -174,7 +209,10 @@ func testValidCR(validationType string) {
snrc.Spec.ApiCheckInterval = &metav1.Duration{Duration: 10*time.Second + 500*time.Millisecond}
snrc.Spec.PeerUpdateInterval = &metav1.Duration{Duration: 10 * time.Second}
snrc.Spec.CustomDsTolerations = []v1.Toleration{{Key: "validValue", Effect: v1.TaintEffectNoExecute}, {}, {Operator: v1.TolerationOpEqual, TolerationSeconds: pointer.Int64(-5)}, {Value: "SomeValidValue"}}

snrc.Status.MinSafeTimeToAssumeNodeRebootedSeconds = 150
if validationType == "update" {
snrc.Spec.SafeTimeToAssumeNodeRebootedSeconds = 160
}
Context("for valid CR", func() {
It("should not be rejected", func() {
var err error
@@ -231,5 +269,7 @@ func setFieldValue(snrc *SelfNodeRemediationConfig, fieldName string, value time
snrc.Spec.ApiCheckInterval = timeValue
case peerUpdateInterval:
snrc.Spec.PeerUpdateInterval = timeValue
case safeTimeToAssumeNodeRebootedSeconds:
snrc.Spec.SafeTimeToAssumeNodeRebootedSeconds = int(value)
}
}
13 changes: 12 additions & 1 deletion api/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

10 changes: 10 additions & 0 deletions bundle/manifests/self-node-remediation.clusterserviceversion.yaml
@@ -62,6 +62,16 @@ spec:
- kind: SelfNodeRemediationConfig
name: selfnoderemediationconfigs
version: v1alpha1
statusDescriptors:
- description: 'Conditions represents the observations of a SelfNodeRemediationConfig''s
current state. Known .status.conditions.type are: "SafeTimeToAssumeNodeRebootedOverridden"'
displayName: Conditions
path: conditions
x-descriptors:
- urn:alm:descriptor:io.kubernetes.conditions
- description: LastUpdateTime is the last time the status was updated.
displayName: Last Update Time
path: lastUpdateTime
version: v1alpha1
- description: SelfNodeRemediation is the Schema for the selfnoderemediations
API
@@ -149,18 +149,17 @@ spec:
peerUpdateInterval:
default: 15m
description: Valid time units are "ms", "s", "m", "h".
-pattern: ^(0|([0-9]+(\.[0-9]+)?(ms|s|m|h)))$
+pattern: ^(0|([0-9]+(\.[0-9]+)?(ms|s|m|h)))$|^([1-9][0-9]*m0s)$
type: string
safeTimeToAssumeNodeRebootedSeconds:
-default: 180
description: |-
SafeTimeToAssumeNodeRebootedSeconds is the time after which the healthy self node remediation
agents will assume the unhealthy node has been rebooted, and it is safe to recover affected workloads.
This is extremely important as starting replacement Pods while they are still running on the failed
node will likely lead to data corruption and violation of run-once semantics.
In an effort to prevent this, the operator ignores values lower than a minimum calculated from the
ApiCheckInterval, ApiServerTimeout, MaxApiErrorThreshold, PeerDialTimeout, and PeerRequestTimeout fields.
-minimum: 0
+minimum: 1
type: integer
watchdogFilePath:
default: /dev/watchdog
@@ -171,6 +170,92 @@ spec:
status:
description: SelfNodeRemediationConfigStatus defines the observed state
of SelfNodeRemediationConfig
properties:
conditions:
description: |-
Conditions represents the observations of a SelfNodeRemediationConfig's current state.
Known .status.conditions.type are: "SafeTimeToAssumeNodeRebootedOverridden"
items:
description: "Condition contains details for one aspect of the current
state of this API Resource.\n---\nThis struct is intended for
direct use as an array at the field path .status.conditions. For
example,\n\n\n\ttype FooStatus struct{\n\t // Represents the
observations of a foo's current state.\n\t // Known .status.conditions.type
are: \"Available\", \"Progressing\", and \"Degraded\"\n\t //
+patchMergeKey=type\n\t // +patchStrategy=merge\n\t // +listType=map\n\t
\ // +listMapKey=type\n\t Conditions []metav1.Condition `json:\"conditions,omitempty\"
patchStrategy:\"merge\" patchMergeKey:\"type\" protobuf:\"bytes,1,rep,name=conditions\"`\n\n\n\t
\ // other fields\n\t}"
properties:
lastTransitionTime:
description: |-
lastTransitionTime is the last time the condition transitioned from one status to another.
This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format: date-time
type: string
message:
description: |-
message is a human readable message indicating details about the transition.
This may be an empty string.
maxLength: 32768
type: string
observedGeneration:
description: |-
observedGeneration represents the .metadata.generation that the condition was set based upon.
For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
with respect to the current state of the instance.
format: int64
minimum: 0
type: integer
reason:
description: |-
reason contains a programmatic identifier indicating the reason for the condition's last transition.
Producers of specific condition types may define expected values and meanings for this field,
and whether the values are considered a guaranteed API.
The value should be a CamelCase string.
This field may not be empty.
maxLength: 1024
minLength: 1
pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
type: string
status:
description: status of the condition, one of True, False, Unknown.
enum:
- "True"
- "False"
- Unknown
type: string
type:
description: |-
type of condition in CamelCase or in foo.example.com/CamelCase.
---
Many .condition.type values are consistent across resources like Available, but because arbitrary conditions can be
useful (see .node.status.conditions), the ability to deconflict is important.
The regex it matches is (dns1123SubdomainFmt/)?(qualifiedNameFmt)
maxLength: 316
pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
type: string
required:
- lastTransitionTime
- message
- reason
- status
- type
type: object
type: array
x-kubernetes-list-map-keys:
- type
x-kubernetes-list-type: map
lastUpdateTime:
description: LastUpdateTime is the last time the status was updated.
format: date-time
type: string
minSafeTimeToAssumeNodeRebootedSeconds:
description: MinSafeTimeToAssumeNodeRebootedSeconds is the minimum
value that can be assigned to SelfNodeRemediationConfigSpec.SafeTimeToAssumeNodeRebootedSeconds,
it is calculated and assigned dynamically.
minimum: 0
type: integer
type: object
type: object
served: true