From cc01360ac67e8ebd76f17e0c637fd6934329eb0e Mon Sep 17 00:00:00 2001 From: mshitrit Date: Tue, 2 Mar 2021 09:37:04 +0200 Subject: [PATCH 1/6] short-circuiting-backoff Signed-off-by: mshitrit --- .../machine-api/short-circuiting-backoff.md | 164 ++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100644 enhancements/machine-api/short-circuiting-backoff.md diff --git a/enhancements/machine-api/short-circuiting-backoff.md b/enhancements/machine-api/short-circuiting-backoff.md new file mode 100644 index 0000000000..c12b9b3a9e --- /dev/null +++ b/enhancements/machine-api/short-circuiting-backoff.md @@ -0,0 +1,164 @@ +--- +title: short-circuiting backoff + +authors: + - @mshitrit + +reviewers: + - @beekhof + - @n1r1 + - @slintes + +approvers: + - @JoelSpeed + - @michaelgugino + - @enxebre + +creation-date: 2021-03-01 + +last-updated: 2021-03-01 + +status: implementable + +see-also: + - https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md +--- + +# Support backoff when short-circuiting + +## Release Signoff Checklist + +- [ ] Enhancement is `implementable` +- [ ] Design details are appropriately documented from clear requirements +- [ ] Test plan is defined +- [ ] Graduation criteria for dev preview, tech preview, GA +- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +## Summary + +By using `MachineHealthChecks` a cluster admin can configure automatic remediation of unhealthy machines and nodes. +The machine healthcheck controller's remediation strategy is deleting the machine, and letting the cloud provider +create a new one. This isn't the best remediation strategy in all environments. + +Any Machine that enters the `Failed` state is remediated immediately, without waiting, by the MHC +When this occurs, if the error which caused the failure is persistent (spot price too low, configuration error), replacement Machines will also be `Failed` +As replacement machines start and fail, MHC causes a hot loop of Machine being deleted and recreated +Hot loop makes it difficult for users to find out why their Machines are failing. + +With this enhancement we propose a better mechanism. +In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID it'll be remediated after a certain time period has passed - thus allowing a manual intervention in order to break to hot loop. + +## Motivation + +- Preventing remediation hot loop, in order to allow a manual fix and prevent unnecessary resource usage. + +### Goals + +- Create the ability to define customized remediation for Machine that enters the `Failed` state. + +### Non-Goals + +TBD + +## Proposal + +We propose modifying the MachineHealthCheck CRD to support a FailedNodeStartupTimeout, this is the time period which controls remediation of a machine that enters the `Failed` state. + +### User Stories + +#### Story 1 + +As an admin of a hardware based cluster, I would like failed machines to delay before automatically re-provisioning so I'll have a time frame in which to manually analyze and fix them . + +### Implementation Details/Notes/Constraints + +If no value for failedNodeStartupTimeout is defined for the MachineHealthCheck CR, the existing remediation flow +is preserved. + +In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID it's remediation will be requeued by failedNodeStartupTimeout. +After that time has passed if the machine current state remains remediation will be performed. + + +#### MHC struct enhancement + +```go + type MachineHealthCheckSpec struct { + ... + + // +optional + FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"` + } +``` + +#### Example CRs + +MachineHealthCheck: +```yaml + kind: MachineHealthCheck + apiVersion: machine.openshift.io/v1beta1 + metadata: + name: REMEDIATION_GROUP + namespace: NAMESPACE_OF_UNHEALTHY_MACHINE + spec: + selector: + matchLabels: + ... + failedNodeStartupTimeout: 48h +``` + +### Risks and Mitigations + +No known risks + +## Design Details + +### Open Questions + +See deprecation and upgrade + +### Test Plan + +The existing remediation tests will be reviewed / adapted / extended as needed. + +### Graduation Criteria + +TBD + +#### Examples + +TBD + +##### Dev Preview -> Tech Preview + +TBD + +##### Tech Preview -> GA + +TBD + +##### Removing a deprecated feature + + +### Upgrade / Downgrade Strategy + +### Version Skew Strategy + +## Implementation History + +- [x] 03/01/2021: Opened enhancement PR + +## Drawbacks + +no known drawbacks + +## Alternatives + +- Instead of delaying, canceling the remediation for failed machines. + +## Infrastructure Needed [optional] + +Use this section if you need things from the project. Examples include a new +subproject, repos requested, github details, and/or testing infrastructure. + +Listing these here allows the community to get the process for these resources +started right away. From 87495d3cbf8e9b4e9a4d551b130331f5e1c1ac8e Mon Sep 17 00:00:00 2001 From: mshitrit Date: Sun, 14 Mar 2021 10:23:14 +0200 Subject: [PATCH 2/6] PR review fixes Signed-off-by: mshitrit --- .../machine-api/short-circuiting-backoff.md | 27 ++++++++++--------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/enhancements/machine-api/short-circuiting-backoff.md b/enhancements/machine-api/short-circuiting-backoff.md index c12b9b3a9e..f3363037f7 100644 --- a/enhancements/machine-api/short-circuiting-backoff.md +++ b/enhancements/machine-api/short-circuiting-backoff.md @@ -40,10 +40,11 @@ By using `MachineHealthChecks` a cluster admin can configure automatic remediati The machine healthcheck controller's remediation strategy is deleting the machine, and letting the cloud provider create a new one. This isn't the best remediation strategy in all environments. -Any Machine that enters the `Failed` state is remediated immediately, without waiting, by the MHC -When this occurs, if the error which caused the failure is persistent (spot price too low, configuration error), replacement Machines will also be `Failed` -As replacement machines start and fail, MHC causes a hot loop of Machine being deleted and recreated -Hot loop makes it difficult for users to find out why their Machines are failing. +Any Machine that enters the `Failed` state is remediated immediately, without waiting, by the MHC. +When this occurs, if the error which caused the failure is persistent (spot price too low, configuration error), replacement Machines will also be `Failed`. +As replacement machines start and fail, MHC causes a hot loop of Machine being deleted and recreated. +This hot looping makes it difficult for users to find out why their Machines are failing. +Another side effect of machines constantly failing, is the risk of hitting the benchmark of machine failures percentage - thus triggering the "short-circuit" mechanism which will prevent all remediations. With this enhancement we propose a better mechanism. In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID it'll be remediated after a certain time period has passed - thus allowing a manual intervention in order to break to hot loop. @@ -54,11 +55,11 @@ In case a machine enters the `Failed` state and does not have a NodeRef or a Pro ### Goals -- Create the ability to define customized remediation for Machine that enters the `Failed` state. +- Create the opportunity for users to enact custom remediations for Machines that enter the `Failed` state. ### Non-Goals -TBD +- This enhancement does not seek to create a pluggable remediation system in the MHC. ## Proposal @@ -68,15 +69,15 @@ We propose modifying the MachineHealthCheck CRD to support a FailedNodeStartupTi #### Story 1 -As an admin of a hardware based cluster, I would like failed machines to delay before automatically re-provisioning so I'll have a time frame in which to manually analyze and fix them . +As an admin of a hardware based cluster, I would like failed machines to delay before automatically re-provisioning so I'll have a time frame in which to manually analyze and fix them. ### Implementation Details/Notes/Constraints -If no value for failedNodeStartupTimeout is defined for the MachineHealthCheck CR, the existing remediation flow +If no value for `FailedNodeStartupTimeout` is defined for the MachineHealthCheck CR, the existing remediation flow is preserved. -In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID it's remediation will be requeued by failedNodeStartupTimeout. -After that time has passed if the machine current state remains remediation will be performed. +In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID it's remediation will be requeued by `FailedNodeStartupTimeout`. +After that time has passed if the machine current state remains, remediation will be performed. #### MHC struct enhancement @@ -86,7 +87,7 @@ After that time has passed if the machine current state remains remediation will ... // +optional - FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"` + FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"` } ``` @@ -108,13 +109,13 @@ MachineHealthCheck: ### Risks and Mitigations -No known risks +No known risks. ## Design Details ### Open Questions -See deprecation and upgrade +See deprecation and upgrade. ### Test Plan From 8eb78e3de3cc0363d9c9284ca298ca245f5e9eaa Mon Sep 17 00:00:00 2001 From: Michael Shitrit <76515081+mshitrit@users.noreply.github.com> Date: Thu, 25 Mar 2021 09:48:06 +0200 Subject: [PATCH 3/6] Update enhancements/machine-api/short-circuiting-backoff.md Co-authored-by: Michael McCune --- enhancements/machine-api/short-circuiting-backoff.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/machine-api/short-circuiting-backoff.md b/enhancements/machine-api/short-circuiting-backoff.md index f3363037f7..124d182ac5 100644 --- a/enhancements/machine-api/short-circuiting-backoff.md +++ b/enhancements/machine-api/short-circuiting-backoff.md @@ -63,7 +63,7 @@ In case a machine enters the `Failed` state and does not have a NodeRef or a Pro ## Proposal -We propose modifying the MachineHealthCheck CRD to support a FailedNodeStartupTimeout, this is the time period which controls remediation of a machine that enters the `Failed` state. +We propose modifying the MachineHealthCheck CRD to support a failed node startup timeout. This timeout defines the period after which a `Failed` machine will be remediated by the MachineHealthCheck. ### User Stories From 1657c1f35d3aa2b8d150a299e7bdc009faf7a6d7 Mon Sep 17 00:00:00 2001 From: Michael Shitrit Date: Tue, 29 Jun 2021 09:21:28 +0300 Subject: [PATCH 4/6] Update enhancements/machine-api/short-circuiting-backoff.md Signed-off-by: Michael Shitrit --- .../machine-api/short-circuiting-backoff.md | 25 ++++++------------- 1 file changed, 7 insertions(+), 18 deletions(-) diff --git a/enhancements/machine-api/short-circuiting-backoff.md b/enhancements/machine-api/short-circuiting-backoff.md index 124d182ac5..1b9c179e17 100644 --- a/enhancements/machine-api/short-circuiting-backoff.md +++ b/enhancements/machine-api/short-circuiting-backoff.md @@ -108,8 +108,8 @@ MachineHealthCheck: ``` ### Risks and Mitigations - -No known risks. +In case `FailedNodeStartupTimeout` is undefined default behaviour is preserved (i.e. remediation is not postponed). +The Pro is that naive users aren't being surprised with a new behavior however, the con is that naive users do benefit from the new behavior. ## Design Details @@ -125,20 +125,15 @@ The existing remediation tests will be reviewed / adapted / extended as needed. TBD -#### Examples - -TBD - -##### Dev Preview -> Tech Preview +#### Dev Preview -> Tech Preview TBD -##### Tech Preview -> GA +#### Tech Preview -> GA TBD -##### Removing a deprecated feature - +#### Removing a deprecated feature ### Upgrade / Downgrade Strategy @@ -154,12 +149,6 @@ no known drawbacks ## Alternatives -- Instead of delaying, canceling the remediation for failed machines. - -## Infrastructure Needed [optional] - -Use this section if you need things from the project. Examples include a new -subproject, repos requested, github details, and/or testing infrastructure. +In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID do not perform remediation on that machine. -Listing these here allows the community to get the process for these resources -started right away. +This alternative is simpler since it does not require `FailedNodeStartupTimeout` however it does not allow an option to retain the system previous behaviour and relies completely on manual fix of the failed machines. \ No newline at end of file From 99d18f9503699b631ca1904f28aeec82885f8084 Mon Sep 17 00:00:00 2001 From: Michael Shitrit <76515081+mshitrit@users.noreply.github.com> Date: Sun, 17 Oct 2021 09:37:36 +0300 Subject: [PATCH 5/6] Update enhancements/machine-api/short-circuiting-backoff.md Co-authored-by: Andrew Beekhof --- enhancements/machine-api/short-circuiting-backoff.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/machine-api/short-circuiting-backoff.md b/enhancements/machine-api/short-circuiting-backoff.md index 1b9c179e17..bf97cb58c7 100644 --- a/enhancements/machine-api/short-circuiting-backoff.md +++ b/enhancements/machine-api/short-circuiting-backoff.md @@ -47,7 +47,7 @@ This hot looping makes it difficult for users to find out why their Machines are Another side effect of machines constantly failing, is the risk of hitting the benchmark of machine failures percentage - thus triggering the "short-circuit" mechanism which will prevent all remediations. With this enhancement we propose a better mechanism. -In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID it'll be remediated after a certain time period has passed - thus allowing a manual intervention in order to break to hot loop. +In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, it will not be remediated until after a certain time period has passed - thus allowing a manual intervention in order to break to hot loop. ## Motivation From c2e1e35a54fb138d490ce6bc37a4641e17624286 Mon Sep 17 00:00:00 2001 From: Michael Shitrit <76515081+mshitrit@users.noreply.github.com> Date: Tue, 19 Oct 2021 10:02:48 +0300 Subject: [PATCH 6/6] Update enhancements/machine-api/short-circuiting-backoff.md Co-authored-by: Michael McCune --- enhancements/machine-api/short-circuiting-backoff.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/machine-api/short-circuiting-backoff.md b/enhancements/machine-api/short-circuiting-backoff.md index bf97cb58c7..f9a86451fd 100644 --- a/enhancements/machine-api/short-circuiting-backoff.md +++ b/enhancements/machine-api/short-circuiting-backoff.md @@ -109,7 +109,7 @@ MachineHealthCheck: ### Risks and Mitigations In case `FailedNodeStartupTimeout` is undefined default behaviour is preserved (i.e. remediation is not postponed). -The Pro is that naive users aren't being surprised with a new behavior however, the con is that naive users do benefit from the new behavior. +The Pro is that naive users aren't being surprised with a new behavior however, the con is that naive users do not benefit from the new behavior. ## Design Details