From cc01360ac67e8ebd76f17e0c637fd6934329eb0e Mon Sep 17 00:00:00 2001
From: mshitrit <mshitrit@redhat.com>
Date: Tue, 2 Mar 2021 09:37:04 +0200
Subject: [PATCH] short-circuiting-backoff

Signed-off-by: mshitrit <mshitrit@redhat.com>
---
 .../machine-api/short-circuiting-backoff.md   | 164 ++++++++++++++++++
 1 file changed, 164 insertions(+)
 create mode 100644 enhancements/machine-api/short-circuiting-backoff.md

diff --git a/enhancements/machine-api/short-circuiting-backoff.md b/enhancements/machine-api/short-circuiting-backoff.md
new file mode 100644
index 0000000000..c12b9b3a9e
--- /dev/null
+++ b/enhancements/machine-api/short-circuiting-backoff.md
@@ -0,0 +1,164 @@
+---
+title: short-circuiting backoff
+
+authors:
+  - @mshitrit
+
+reviewers:
+  - @beekhof
+  - @n1r1
+  - @slintes
+
+approvers:
+  - @JoelSpeed
+  - @michaelgugino
+  - @enxebre
+
+creation-date: 2021-03-01
+
+last-updated: 2021-03-01
+
+status: implementable
+
+see-also:
+  - https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
+---
+
+# Support backoff when short-circuiting
+
+## Release Signoff Checklist
+
+- [ ] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+By using `MachineHealthChecks`, a cluster admin can configure automatic remediation of unhealthy machines and nodes.
+The machine healthcheck controller's remediation strategy is deleting the machine, and letting the cloud provider
+create a new one. This isn't the best remediation strategy in all environments.
+
+Any Machine that enters the `Failed` state is remediated immediately, without waiting, by the MHC.
+When this occurs, if the error which caused the failure is persistent (spot price too low, configuration error), replacement Machines will also be `Failed`.
+As replacement Machines start and fail, the MHC causes a hot loop of Machines being deleted and recreated.
+This hot loop makes it difficult for users to find out why their Machines are failing.
+
+With this enhancement we propose a better mechanism.
+If a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, it is remediated only after a configured time period has passed, which allows manual intervention to break the hot loop.
+
+## Motivation
+
+- Prevent the remediation hot loop, to allow a manual fix and avoid unnecessary resource usage.
+
+### Goals
+
+- Create the ability to define customized remediation for a Machine that enters the `Failed` state.
+
+### Non-Goals
+
+TBD
+
+## Proposal
+
+We propose modifying the MachineHealthCheck CRD to support a `FailedNodeStartupTimeout` field. This is the time period that controls remediation of a machine that enters the `Failed` state.
+
+### User Stories
+
+#### Story 1
+
+As an admin of a hardware-based cluster, I would like re-provisioning of failed machines to be delayed so that I have a time frame in which to manually analyze and fix them.
+
+### Implementation Details/Notes/Constraints
+
+If no value for failedNodeStartupTimeout is defined for the MachineHealthCheck CR, the existing remediation flow
+is preserved.
+
+If a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, its remediation is requeued until `failedNodeStartupTimeout` has elapsed.
+After that time has passed, if the machine is still in the same state, remediation is performed.
+
+
+#### MHC struct enhancement
+
+```go
+    type MachineHealthCheckSpec struct {
+        ...
+    
+        // +optional
+        FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"`
+    }
+```
+
+#### Example CRs
+
+MachineHealthCheck:
+```yaml
+    kind: MachineHealthCheck
+    apiVersion: machine.openshift.io/v1beta1
+    metadata:
+      name: REMEDIATION_GROUP
+      namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
+    spec:
+      selector:
+        matchLabels: 
+          ...
+      failedNodeStartupTimeout: 48h
+```
+
+### Risks and Mitigations
+
+No known risks.
+
+## Design Details
+
+### Open Questions
+
+See the sections on removing a deprecated feature and on the upgrade / downgrade strategy.
+
+### Test Plan
+
+The existing remediation tests will be reviewed / adapted / extended as needed.
+
+### Graduation Criteria
+
+TBD
+
+#### Examples
+
+TBD
+
+##### Dev Preview -> Tech Preview
+
+TBD
+
+##### Tech Preview -> GA
+
+TBD
+
+##### Removing a deprecated feature
+
+
+### Upgrade / Downgrade Strategy
+
+### Version Skew Strategy
+
+## Implementation History
+
+- [x] 03/01/2021: Opened enhancement PR
+
+## Drawbacks
+
+No known drawbacks.
+
+## Alternatives
+
+- Instead of delaying remediation for failed machines, cancel it entirely.
+
+## Infrastructure Needed [optional]
+
+Use this section if you need things from the project. Examples include a new
+subproject, repos requested, github details, and/or testing infrastructure.
+
+Listing these here allows the community to get the process for these resources
+started right away.