short-circuiting-backoff #673
---
title: short-circuiting backoff

authors:
- @mshitrit

reviewers:
- @beekhof
- @n1r1
- @slintes

approvers:
- @JoelSpeed
- @michaelgugino
- @enxebre

creation-date: 2021-03-01

last-updated: 2021-03-01

status: implementable

see-also:
- https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
---

# Support backoff when short-circuiting

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

By using `MachineHealthChecks` a cluster admin can configure automatic remediation of unhealthy machines and nodes.
The machine health check controller's remediation strategy is to delete the machine and let the cloud provider create a new one.
This isn't the best remediation strategy in all environments.

Any Machine that enters the `Failed` state is remediated immediately, without waiting, by the MHC.
When this occurs, if the error which caused the failure is persistent (spot price too low, configuration error), replacement Machines will also be `Failed`.
As replacement Machines start and fail, the MHC causes a hot loop of Machines being deleted and recreated.
This hot looping makes it difficult for users to find out why their Machines are failing.
Another side effect of machines constantly failing is the risk of exceeding the allowed percentage of failed machines, thus triggering the "short-circuit" mechanism, which prevents all further remediation.

With this enhancement we propose a better mechanism.
In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, it will not be remediated until a certain time period has passed, thus allowing manual intervention to break the hot loop.

## Motivation

- Prevent the remediation hot loop, in order to allow a manual fix and avoid unnecessary resource usage.

### Goals

- Create the opportunity for users to enact custom remediations for Machines that enter the `Failed` state.

### Non-Goals

- This enhancement does not seek to create a pluggable remediation system in the MHC.

## Proposal

We propose modifying the MachineHealthCheck CRD to support a failed node startup timeout. This timeout defines the period after which a `Failed` machine will be remediated by the MachineHealthCheck.

### User Stories

#### Story 1

As an admin of a hardware based cluster, I would like remediation of failed machines to be delayed before automatic re-provisioning, so that I have a time frame in which to manually analyze and fix them.

### Implementation Details/Notes/Constraints

If no value for `FailedNodeStartupTimeout` is defined in the MachineHealthCheck CR, the existing remediation flow is preserved.

In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, its remediation will be requeued by `FailedNodeStartupTimeout`.
After that time has passed, remediation is performed if the machine is still in the same state.

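The decision flow described above could look roughly like the following sketch. It uses simplified stand-in types (`machine`, `remediationDecision`, and `decideRemediation` are illustrative assumptions, not the actual machine-api types), and only models the `FailedNodeStartupTimeout` check added by this enhancement:

```go
package main

import (
    "fmt"
    "time"
)

// machine is a simplified stand-in for the Machine resource; only the
// fields consulted by this sketch are modelled.
type machine struct {
    phase       string    // e.g. "Failed"
    nodeRef     *string   // nil when no Node was ever associated
    providerID  *string   // nil when the cloud provider never assigned an ID
    lastUpdated time.Time // when the machine entered its current phase
}

// remediationDecision says whether to remediate now or requeue and re-check later.
type remediationDecision struct {
    remediateNow bool
    requeueAfter time.Duration
}

// decideRemediation sketches the proposed behaviour for a machine that the
// MHC has already selected for remediation: a Failed machine that never
// reported a NodeRef or ProviderID is only remediated once
// failedNodeStartupTimeout has elapsed.
func decideRemediation(m machine, failedNodeStartupTimeout time.Duration, now time.Time) remediationDecision {
    // Timeout unset (zero value): keep the existing behaviour and remediate immediately.
    if failedNodeStartupTimeout == 0 {
        return remediationDecision{remediateNow: true}
    }
    // Machines that got far enough to have a NodeRef or ProviderID follow the existing flow.
    if m.phase != "Failed" || m.nodeRef != nil || m.providerID != nil {
        return remediationDecision{remediateNow: true}
    }
    // Failed during startup: give the admin a window to investigate before remediating.
    if elapsed := now.Sub(m.lastUpdated); elapsed < failedNodeStartupTimeout {
        return remediationDecision{requeueAfter: failedNodeStartupTimeout - elapsed}
    }
    return remediationDecision{remediateNow: true}
}

func main() {
    m := machine{phase: "Failed", lastUpdated: time.Now().Add(-1 * time.Hour)}
    d := decideRemediation(m, 48*time.Hour, time.Now())
    fmt.Printf("remediate now: %v, requeue after: %s\n", d.remediateNow, d.requeueAfter)
}
```
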
#### MHC struct enhancement

```go
type MachineHealthCheckSpec struct {
    ...

    // FailedNodeStartupTimeout is how long a Failed machine without a
    // NodeRef or ProviderID is left alone before remediation is performed.
    // +optional
    FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"`
}
```

Review discussion on this field:

- "Andrew pointed to something: we only apply that timeout when there is no nodeRef or providerId, isn't that implicitly the same as 'during startup'? 🤔 @JoelSpeed"
- "Ack, yeah, let's make sure that's clear in the proposal because I haven't reviewed the implementation in a while and it wasn't clear to me that this only affects startup, hence the comment. Ok with it staying as is."

#### Example CRs

MachineHealthCheck:
```yaml
kind: MachineHealthCheck
apiVersion: machine.openshift.io/v1beta1
metadata:
  name: REMEDIATION_GROUP
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
spec:
  selector:
    matchLabels:
      ...
  failedNodeStartupTimeout: 48h
```
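
Since `FailedNodeStartupTimeout` is a `metav1.Duration`, the value is parsed with Go duration syntax. The snippet below is a standalone illustration (not part of the proposed implementation) showing that a string such as `48h` round-trips through the API machinery as expected:

```go
package main

import (
    "encoding/json"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    // metav1.Duration unmarshals JSON strings via time.ParseDuration,
    // so values like "30m", "1h30m" or "48h" are accepted.
    var d metav1.Duration
    if err := json.Unmarshal([]byte(`"48h"`), &d); err != nil {
        panic(err)
    }
    fmt.Println(d.Duration) // prints: 48h0m0s
}
```
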

### Risks and Mitigations

In case `FailedNodeStartupTimeout` is undefined, the default behaviour is preserved (i.e. remediation is not postponed).
The pro is that users who are unaware of the new field are not surprised by a new behaviour; the con is that those users also do not benefit from it.

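One note on how "undefined" would be detected (an implementation assumption, not mandated by this proposal): because the field is marked `+optional` with `omitempty`, an unset value reaches the controller as the zero `metav1.Duration`, which the controller can interpret as "do not postpone remediation", as in the sketch in the implementation section above.
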
## Design Details

### Open Questions

See deprecation and upgrade.

### Test Plan

The existing remediation tests will be reviewed / adapted / extended as needed.

### Graduation Criteria

TBD

#### Dev Preview -> Tech Preview

TBD

#### Tech Preview -> GA

TBD

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Implementation History

- [x] 03/01/2021: Opened enhancement PR

## Drawbacks

No known drawbacks.

## Alternatives

In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, do not perform remediation on that machine at all.

This alternative is simpler since it does not require `FailedNodeStartupTimeout`; however, it does not allow an option to retain the system's previous behaviour, and it relies completely on manual fixing of the failed machines.

Review discussion on this section:

- "In the implementation you've set a default; this will be incompatible with this statement, as you won't be able to remove the value. If you want to have no default, that would actually be preferable, as this would then preserve existing behaviour for users who upgrade."
- "You are right. Once we decide what's the best way to proceed regarding default/non-default, I'll make the proper adjustments."
- "What did we decide on this one?"
- "The default was removed in the implementation, so I guess that was the decision: openshift/machine-api-operator@e3d7784"
- "Indeed. Here is a link to the correspondence."