From cc01360ac67e8ebd76f17e0c637fd6934329eb0e Mon Sep 17 00:00:00 2001
From: mshitrit
Date: Tue, 2 Mar 2021 09:37:04 +0200
Subject: [PATCH] short-circuiting-backoff

Signed-off-by: mshitrit
---
 .../machine-api/short-circuiting-backoff.md | 164 ++++++++++++++++++
 1 file changed, 164 insertions(+)
 create mode 100644 enhancements/machine-api/short-circuiting-backoff.md

diff --git a/enhancements/machine-api/short-circuiting-backoff.md b/enhancements/machine-api/short-circuiting-backoff.md
new file mode 100644
index 0000000000..c12b9b3a9e
--- /dev/null
+++ b/enhancements/machine-api/short-circuiting-backoff.md
@@ -0,0 +1,164 @@
+---
+title: short-circuiting backoff
+
+authors:
+  - @mshitrit
+
+reviewers:
+  - @beekhof
+  - @n1r1
+  - @slintes
+
+approvers:
+  - @JoelSpeed
+  - @michaelgugino
+  - @enxebre
+
+creation-date: 2021-03-01
+
+last-updated: 2021-03-01
+
+status: implementable
+
+see-also:
+  - https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
+---
+
+# Support backoff when short-circuiting
+
+## Release Signoff Checklist
+
+- [ ] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+By using `MachineHealthChecks` a cluster admin can configure automatic remediation of unhealthy machines and nodes.
+The machine healthcheck controller's remediation strategy is deleting the machine, and letting the cloud provider
+create a new one. This isn't the best remediation strategy in all environments.
+
+Any Machine that enters the `Failed` state is remediated immediately, without waiting, by the MHC.
+When this occurs, if the error that caused the failure is persistent (spot price too low, configuration error), replacement Machines will also enter the `Failed` state.
+As replacement Machines start and fail, the MHC causes a hot loop of Machines being deleted and recreated.
+This hot loop makes it difficult for users to find out why their Machines are failing.
+
+With this enhancement we propose a better mechanism.
+If a Machine enters the `Failed` state and does not have a NodeRef or a ProviderID, it is remediated only after a certain time period has passed, which allows manual intervention to break the hot loop.
+
+## Motivation
+
+- Prevent the remediation hot loop, in order to allow a manual fix and avoid unnecessary resource usage.
+
+### Goals
+
+- Create the ability to define customized remediation for a Machine that enters the `Failed` state.
+
+### Non-Goals
+
+TBD
+
+## Proposal
+
+We propose modifying the MachineHealthCheck CRD to support a `FailedNodeStartupTimeout` field: the time period that controls when a Machine that enters the `Failed` state is remediated.
+
+### User Stories
+
+#### Story 1
+
+As an admin of a hardware-based cluster, I would like re-provisioning of failed machines to be delayed so that I have a time frame in which to manually analyze and fix them.
+
+### Implementation Details/Notes/Constraints
+
+If no value for `failedNodeStartupTimeout` is defined for the MachineHealthCheck CR, the existing remediation flow
+is preserved.
+
+If a Machine enters the `Failed` state and does not have a NodeRef or a ProviderID, its remediation is requeued for the duration of `failedNodeStartupTimeout`.
+After that time has passed, remediation is performed if the Machine is still in the same state.
+
+
+#### MHC struct enhancement
+
+```go
+type MachineHealthCheckSpec struct {
+    ...
+
+    // +optional
+    FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"`
+}
+```
+
+#### Example CRs
+
+MachineHealthCheck:
+```yaml
+kind: MachineHealthCheck
+apiVersion: machine.openshift.io/v1beta1
+metadata:
+  name: REMEDIATION_GROUP
+  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
+spec:
+  selector:
+    matchLabels:
+      ...
+  failedNodeStartupTimeout: 48h
+```
+
+### Risks and Mitigations
+
+No known risks.
+
+## Design Details
+
+### Open Questions
+
+See deprecation and upgrade.
+
+### Test Plan
+
+The existing remediation tests will be reviewed, adapted, and extended as needed.
+
+### Graduation Criteria
+
+TBD
+
+#### Examples
+
+TBD
+
+##### Dev Preview -> Tech Preview
+
+TBD
+
+##### Tech Preview -> GA
+
+TBD
+
+##### Removing a deprecated feature
+
+
+### Upgrade / Downgrade Strategy
+
+### Version Skew Strategy
+
+## Implementation History
+
+- [x] 03/01/2021: Opened enhancement PR
+
+## Drawbacks
+
+No known drawbacks.
+
+## Alternatives
+
+- Instead of delaying remediation for failed machines, cancel it.
+
+## Infrastructure Needed [optional]
+
+Use this section if you need things from the project. Examples include a new
+subproject, repos requested, GitHub details, and/or testing infrastructure.
+
+Listing these here allows the community to get the process for these resources
+started right away.