diff --git a/enhancements/machine-api/external-remediations.md b/enhancements/machine-api/external-remediations.md
new file mode 100644
index 0000000000..ca1a4f8fbb
--- /dev/null
+++ b/enhancements/machine-api/external-remediations.md
@@ -0,0 +1,242 @@
+---
+title: external-remediations
+authors:
+  - @slintes
+reviewers:
+  - @beekhof
+  - @n1r1
+approvers:
+  - @JoelSpeed
+  - @michaelgugino
+  - @enxebre
+creation-date: 2020-11-29
+last-updated: 2020-11-29
+status: implementable
+see-also:
+  - https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
+  - https://github.com/kubernetes-sigs/cluster-api/pull/3606
+---
+
+# External remediations
+
+## Release Signoff Checklist
+
+- [ ] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+By using `MachineHealthChecks`, a cluster admin can configure automatic remediation of unhealthy machines and nodes.
+The machine healthcheck controller's remediation strategy is to delete the machine and let the cloud provider
+create a new one. This isn't the best remediation strategy in all environments.
+
+There is already a mechanism to provide an alternative, external remediation strategy, by adding an annotation to the
+`MachineHealthCheck` and then to `Machine`s. However, this isn't very maintainable and diverges from upstream.
+
+With this enhancement we propose a better, future-proof mechanism that aligns us with the mechanism implemented upstream.
+This proposal is a backport of parts of the upstream machine healthcheck proposal [0], which
+is already implemented [1].
+
+- [0] [upstream machine healthcheck proposal](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md)
+- [1] [upstream machine healthcheck implementation](https://github.com/kubernetes-sigs/cluster-api/pull/3606)
+
+## Motivation
+
+- Environments consisting of hardware-based clusters are significantly slower to (re)provision unhealthy machines,
+so they need a remediation flow that includes at least one attempt at power-cycling unhealthy nodes.
+- Other environments and vendors also have specific remediation requirements, so there is a need to provide a generic
+mechanism for implementing custom remediation logic.
+
+### Goals
+
+- Create the ability to define customized remediation flows outside of the Machine Health Check and Machine API
+Operator codebase.
+- Migrate the existing external remediation mechanism to the new one.
+- Ensure that at most one external mechanism will attempt to recover the node.
+
+### Non-Goals
+
+TBD
+
+## Proposal
+
+We propose modifying the MachineHealthCheck CRD to add a new field, `externalRemediationTemplate`, an ObjectReference to
+a provider-specific remediation template CR.
+
+### User Stories
+
+#### Story 1
+
+As an admin of a hardware-based cluster, I would like unhealthy nodes to be power-cycled, so that I can recover
+from transient errors faster and begin application recovery sooner.
+
+#### Story 2
+
+As an admin of a hardware-based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect
+non-transient issues faster.
+If automatic power-cycles don't resolve the issue, the admin can rule out transient issues such as software bugs.
+
+### Implementation Details/Notes/Constraints
+
+If no value for `externalRemediationTemplate` is defined for the MachineHealthCheck CR, the existing remediation flow
+is preserved.
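The fallback rule above amounts to a nil check on the proposed field. The following is a minimal sketch of that check, not the controller's actual code: `ObjectReference` is a trimmed stand-in for the Kubernetes type, and `remediationPath` and the strings it returns are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// ObjectReference is a trimmed stand-in for the Kubernetes ObjectReference
// type (assumption: only the fields needed for this sketch are included).
type ObjectReference struct {
	Kind       string
	APIVersion string
	Name       string
}

// MachineHealthCheckSpec contains only the field proposed in this enhancement.
type MachineHealthCheckSpec struct {
	ExternalRemediationTemplate *ObjectReference
}

// remediationPath is a hypothetical helper that reports which flow would
// handle an unhealthy Machine: the existing delete-based flow when no
// template is set, or an external flow identified by the template kind.
func remediationPath(spec MachineHealthCheckSpec) string {
	if spec.ExternalRemediationTemplate == nil {
		return "delete-machine"
	}
	return "external:" + spec.ExternalRemediationTemplate.Kind
}

func main() {
	// No template configured: the existing remediation flow is preserved.
	fmt.Println(remediationPath(MachineHealthCheckSpec{}))

	// Template configured: remediation is handed to an external controller.
	tmpl := &ObjectReference{
		Kind:       "Metal3RemediationTemplate",
		APIVersion: "remediation.metal3.io/v1alphaX",
		Name:       "M3_REMEDIATION_GROUP",
	}
	fmt.Println(remediationPath(MachineHealthCheckSpec{ExternalRemediationTemplate: tmpl}))
}
```

Modelling the field as a pointer lets "not set" be distinguished from an empty reference, which is what keeps existing MachineHealthChecks on the current flow without any change.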
+
+If a value for `externalRemediationTemplate` is supplied and the Machine enters an unhealthy state, the template will
+be instantiated with the same name and namespace as the target Machine, which passes the remediation flow to an
+External Remediation Controller (ERC) watching for that CR.
+
+No further action (deletion or applying conditions) will be taken by the MachineHealthCheck (MHC) controller until the
+Node becomes healthy. After that, it will locate and delete the instantiated External Machine Remediation (EMR) CR.
+
+We use the same name and namespace for the External Machine Remediation CR to ensure uniqueness and lessen the
+possibility of multiple parallel remediations of the same Machine.
+
+The lifespan of the EMRs is that of the remediation process, and they are not intended to be a record of past events.
+The EMR will also contain an ownerRef to the Machine, to ensure that it does not outlive the Machine it references.
+
+The only signaling between the MHC and the external controller watching EMR CRs is the creation and deletion of the
+EMR itself. Any actions or changes that admins should be informed about should be emitted as events for consoles
+and UIs to consume if necessary. They are informational only and do not result in or expect any behaviour from the MHC,
+Node, or Machine.
+
+When the external remediation controller detects the new EMR, it starts remediation and performs whatever actions
+it deems appropriate until the EMR is deleted by the MHC. It is a detail of the ERC when and how to retry remediation
+in the event that an EMR is not deleted after the ERC considers remediation complete.
+
+The ERC may wish to register a finalizer on its CR to ensure it has an opportunity to perform any additional cleanup
+in the case that the unhealthy state was transient and the Node returned to a healthy state prior to the completion
+of the full custom ERC flow.
+
+#### MHC struct enhancement
+
+```go
+    type MachineHealthCheckSpec struct {
+        ...
+
+        // ExternalRemediationTemplate references a provider-specific remediation
+        // template CR. If unset, the existing remediation flow is preserved.
+        // +optional
+        ExternalRemediationTemplate *ObjectReference `json:"externalRemediationTemplate,omitempty"`
+    }
+```
+
+#### Example CRs
+
+MachineHealthCheck:
+```yaml
+    kind: MachineHealthCheck
+    apiVersion: machine.openshift.io/v1beta1
+    metadata:
+      name: REMEDIATION_GROUP
+      namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
+    spec:
+      selector:
+        matchLabels:
+          ...
+      externalRemediationTemplate:
+        kind: Metal3RemediationTemplate
+        apiVersion: remediation.metal3.io/v1alphaX
+        name: M3_REMEDIATION_GROUP
+```
+
+Metal3RemediationTemplate:
+```yaml
+    kind: Metal3RemediationTemplate
+    apiVersion: remediation.metal3.io/v1alphaX
+    metadata:
+      name: M3_REMEDIATION_GROUP
+      namespace: openshift-machine-api
+    spec:
+      template:
+        spec:
+          strategy: escalate
+          deleteAfterRetries: 10
+          powerOnTimeoutSeconds: 600
+          powerOffTimeoutSeconds: 120
+```
+
+Metal3Remediation:
+```yaml
+    apiVersion: remediation.metal3.io/v1alphaX
+    kind: Metal3Remediation
+    metadata:
+      name: NAME_OF_UNHEALTHY_MACHINE
+      namespace: openshift-machine-api
+      finalizers:
+      - remediation.metal3.io
+      ownerReferences:
+      - apiVersion: machine.openshift.io/v1beta1
+        kind: Machine
+        name: NAME_OF_UNHEALTHY_MACHINE
+        uid: ...
+    spec:
+      strategy: escalate
+      deleteAfterRetries: 10
+      powerOnTimeoutSeconds: 600
+      powerOffTimeoutSeconds: 120
+    status:
+      phase: power-off
+      retryCount: 1
+```
+
+### Risks and Mitigations
+
+No known risks.
+
+## Design Details
+
+### Open Questions
+
+See the open questions under "Removing a deprecated feature" and "Upgrade / Downgrade Strategy".
+
+### Test Plan
+
+The existing external remediation tests will be reviewed, adapted, and extended as needed, and the upstream tests will
+be backported as well.
+
+### Graduation Criteria
+
+TBD
+
+#### Examples
+
+TBD
+
+##### Dev Preview -> Tech Preview
+
+TBD
+
+##### Tech Preview -> GA
+
+TBD
+
+##### Removing a deprecated feature
+
+- The annotation-based external remediation needs to be deprecated.
+- Open question: for how long do we need to support both mechanisms in parallel (if at all)?
+
+### Upgrade / Downgrade Strategy
+
+- Open question: do we need an automatic MHC conversion from the existing annotation-based mechanism to the new one?
+
+### Version Skew Strategy
+
+There is a dependency between the machine-api-operator (which contains the machine healthcheck controller) and
+cluster-api-provider-baremetal (which provides the current external baremetal remediation controller); both are part
+of the OCP release payload. That means that there can be a short-lived version skew during upgrades. This isn't a
+problem though, because an updated MHC can only be applied (or an automatic conversion can only happen) when both
+controllers and their CRDs are updated: the MHC needs to have the new template field, and the remediation CRD and
+its template CRD need to exist.
+
+## Implementation History
+
+- [x] 11/30/2020: Opened enhancement PR
+
+## Drawbacks
+
+No known drawbacks.
+
+## Alternatives
+
+- Keep the existing annotation-based mechanism.
\ No newline at end of file