---
title: external-remediations
authors:
- @slintes
reviewers:
- @beekhof
- @n1r1
approvers:
- @JoelSpeed
- @michaelgugino
- @enxebre
creation-date: 2020-11-29
last-updated: 2020-11-29
status: implementable
see-also:
- https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
- https://github.com/kubernetes-sigs/cluster-api/pull/3606
---

# External remediations

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

By using `MachineHealthChecks` a cluster admin can configure automatic remediation of unhealthy machines and nodes.
The machine health check controller's current remediation strategy is to delete the machine and let the cloud provider
create a new one. This is not the best remediation strategy in all environments.

There is already a mechanism for providing an alternative, external remediation strategy, by adding an annotation to the
`MachineHealthCheck` and then to `Machine`s. However, it is not very maintainable and diverges from upstream.

With this enhancement we propose a better, future-proof mechanism that aligns us with the mechanism implemented upstream.
This proposal is a backport of parts of the upstream machine health check proposal [0], which
is also already implemented [1].

- [0] https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
- [1] https://github.com/kubernetes-sigs/cluster-api/pull/3606

## Motivation

- Environments consisting of hardware-based clusters are significantly slower to (re)provision unhealthy machines,
  so they need a remediation flow that includes at least one attempt at power-cycling unhealthy nodes.
- Other environments and vendors also have specific remediation requirements, so there is a need for a generic
  mechanism for implementing custom remediation logic.

### Goals

- Create the ability to define customized remediation flows outside of the Machine Health Check and Machine API
  Operator codebase.
- Migrate the existing external remediation mechanism to the new one.
- Ensure that at most one external mechanism will attempt to recover the node.

### Non-Goals

TBD

## Proposal

We propose modifying the MachineHealthCheck CRD to support an `externalRemediationTemplate`, an `ObjectReference` to
a provider-specific remediation template CR.

### User Stories

#### Story 1

As an admin of a hardware-based cluster, I would like unhealthy nodes to be power-cycled, so that I can recover
from transient errors faster and begin application recovery sooner.

#### Story 2

As an admin of a hardware-based cluster, I would like unhealthy nodes to be power-cycled, so that I can rule out
transient causes (which a reboot resolves) and detect non-transient issues faster.

#### Story 3

As an admin of a hardware-based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes,
so that they are automatically added back to the cluster when I fix the underlying problem.

### Implementation Details/Notes/Constraints

If no value for `externalRemediationTemplate` is defined in the MachineHealthCheck CR, the existing remediation flow
is preserved.

If a value for `externalRemediationTemplate` is supplied and the Machine enters an unhealthy state, the template will
be instantiated, with the same name and namespace as the target Machine, which passes the remediation flow to an
External Remediation Controller (ERC) watching for that CR.

No further action (deletion or applying conditions) will be taken by the MachineHealthCheck controller until the
Node becomes healthy. After that, it will locate and delete the instantiated external remediation CR.

When a Machine enters an unhealthy state, the MHC will:

* Look up the referenced template
* Instantiate the template (for simplicity, we will refer to the resulting object as an External Machine Remediation CR, or EMR)
* Force the name and namespace to match the unhealthy Machine
* Save the new object in etcd

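A minimal sketch of how the MHC controller might implement these steps, assuming a controller-runtime client and the unstructured API. The helper name `instantiateRemediationTemplate` and the import path for the Machine types are illustrative, not part of the proposal; the `.spec.template.spec` layout and the "instance kind is the template kind without the `Template` suffix" convention follow the upstream implementation:

```go
package remediation

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
)

// instantiateRemediationTemplate creates an EMR from the referenced template.
func instantiateRemediationTemplate(ctx context.Context, c client.Client,
	ref *corev1.ObjectReference, machine *machinev1.Machine) error {

	// Look up the referenced template in the Machine's namespace.
	template := &unstructured.Unstructured{}
	template.SetAPIVersion(ref.APIVersion)
	template.SetKind(ref.Kind)
	key := client.ObjectKey{Namespace: machine.Namespace, Name: ref.Name}
	if err := c.Get(ctx, key, template); err != nil {
		return err
	}

	// Extract the embedded spec for the new EMR.
	spec, found, err := unstructured.NestedMap(template.Object, "spec", "template", "spec")
	if err != nil || !found {
		return fmt.Errorf("template %s has no .spec.template.spec", ref.Name)
	}

	// Instantiate the EMR, forcing name and namespace to match the unhealthy
	// Machine, and add an ownerRef so the EMR cannot outlive the Machine.
	emr := &unstructured.Unstructured{}
	emr.SetAPIVersion(ref.APIVersion)
	emr.SetKind(strings.TrimSuffix(ref.Kind, "Template"))
	emr.SetName(machine.Name)
	emr.SetNamespace(machine.Namespace)
	if err := unstructured.SetNestedMap(emr.Object, spec, "spec"); err != nil {
		return err
	}
	emr.SetOwnerReferences([]metav1.OwnerReference{{
		APIVersion: "machine.openshift.io/v1beta1",
		Kind:       "Machine",
		Name:       machine.Name,
		UID:        machine.UID,
	}})

	// Save the new object in etcd; its creation is the signal to the ERC.
	return c.Create(ctx, emr)
}
```
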
We use the same name and namespace for the External Machine Remediation CR to ensure uniqueness and lessen the
possibility of multiple parallel remediations of the same Machine: creating a second EMR with the same name would
fail with an `AlreadyExists` error.

The lifespan of the EMRs is that of the remediation process, and they are not intended to be a record of past events.
The EMR will also contain an ownerRef to the Machine, to ensure that it does not outlive the Machine it references.

The only signaling between the MHC and the external controller watching EMR CRs is the creation and deletion of the
EMR itself. Any actions or changes that admins should be informed about should be emitted as events for consoles
and UIs to consume if necessary. They are informational only and do not result in or expect any behaviour from the MHC,
Node, or Machine.

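As an illustration only, a hypothetical ERC could surface such an event with the standard client-go recorder; the reason and message below are made up:

```go
package remediation

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitProgress records an informational event on the EMR. Events are for
// consoles and UIs only; they trigger no behaviour in the MHC, Node, or Machine.
func emitProgress(recorder record.EventRecorder, emr runtime.Object) {
	recorder.Event(emr, corev1.EventTypeNormal, "PowerCycleStarted",
		"powering off unhealthy host before powering it back on")
}
```
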
When the external remediation controller detects the new EMR it starts remediation and performs whatever actions
it deems appropriate until the EMR is deleted by the MHC. When and how to retry remediation, in the event that an
EMR is not deleted after the ERC considers remediation complete, is a detail of the ERC.

The ERC may wish to register a finalizer on its CR to ensure it has an opportunity to perform any additional cleanup
in case the unhealthy state was transient and the Node returned to a healthy state prior to the completion
of the full custom ERC flow.

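A sketch of such a finalizer flow, assuming controller-runtime's `controllerutil` helpers; the finalizer name, the `ERCReconciler` type, and the `cleanup` helper are hypothetical:

```go
package remediation

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// emrFinalizer is a hypothetical finalizer name owned by the ERC.
const emrFinalizer = "remediation.example.io/cleanup"

type ERCReconciler struct {
	client.Client
}

// reconcileFinalizer keeps the finalizer on active EMRs and runs cleanup
// before allowing deletion to complete.
func (r *ERCReconciler) reconcileFinalizer(ctx context.Context, emr *unstructured.Unstructured) error {
	if emr.GetDeletionTimestamp().IsZero() {
		// EMR is active: ensure the finalizer is present so that the MHC
		// deleting the EMR (because the Node recovered) cannot race past
		// the ERC's cleanup.
		if !controllerutil.ContainsFinalizer(emr, emrFinalizer) {
			controllerutil.AddFinalizer(emr, emrFinalizer)
			return r.Update(ctx, emr)
		}
		return nil
	}
	// EMR is being deleted: undo any in-flight remediation steps, then
	// remove the finalizer so the delete can complete.
	if controllerutil.ContainsFinalizer(emr, emrFinalizer) {
		if err := r.cleanup(ctx, emr); err != nil { // hypothetical teardown
			return err
		}
		controllerutil.RemoveFinalizer(emr, emrFinalizer)
		return r.Update(ctx, emr)
	}
	return nil
}

// cleanup is a placeholder for provider-specific teardown logic.
func (r *ERCReconciler) cleanup(ctx context.Context, emr *unstructured.Unstructured) error {
	return nil
}
```
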
#### MHC struct enhancement

```go
type MachineHealthCheckSpec struct {
	...

	// ExternalRemediationTemplate is a reference to a remediation template
	// provided by an infrastructure provider. When a Machine is unhealthy,
	// this template is instantiated instead of the Machine being deleted.
	// +optional
	ExternalRemediationTemplate *ObjectReference `json:"externalRemediationTemplate,omitempty"`
}
```

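The `ObjectReference` type here is assumed to carry at least `apiVersion`, `kind`, and `name` (as the core `corev1.ObjectReference` does); the example CRs below show how those fields are used.
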
#### Example CRs

MachineHealthCheck:
```yaml
kind: MachineHealthCheck
apiVersion: machine.openshift.io/v1beta1
metadata:
  name: REMEDIATION_GROUP
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
spec:
  selector:
    matchLabels:
      ...
  externalRemediationTemplate:
    kind: Metal3RemediationTemplate
    apiVersion: remediation.metal3.io/v1alphaX
    name: M3_REMEDIATION_GROUP
```

Metal3RemediationTemplate:
```yaml
kind: Metal3RemediationTemplate
apiVersion: remediation.metal3.io/v1alphaX
metadata:
  name: M3_REMEDIATION_GROUP
  namespace: openshift-machine-api
spec:
  template:
    spec:
      strategy: escalate
      deleteAfterRetries: 10
      powerOnTimeoutSeconds: 600
      powerOffTimeoutSeconds: 120
```

Metal3Remediation:
```yaml
apiVersion: remediation.metal3.io/v1alphaX
kind: Metal3Remediation
metadata:
  name: NAME_OF_UNHEALTHY_MACHINE
  namespace: openshift-machine-api
  finalizers:
  - remediation.metal3.io
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    kind: Machine
    name: NAME_OF_UNHEALTHY_MACHINE
    uid: ...
spec:
  strategy: escalate
  deleteAfterRetries: 10
  powerOnTimeoutSeconds: 600
  powerOffTimeoutSeconds: 120
status:
  phase: power-off
  retryCount: 1
```

### Risks and Mitigations

No known risks.

## Design Details

### Open Questions

See the deprecation and upgrade sections below.

### Test Plan

The existing external remediation tests will be reviewed, adapted, and extended as needed, and the upstream tests will
be backported as well.

### Graduation Criteria

TBD

#### Examples

TBD

##### Dev Preview -> Tech Preview

TBD

##### Tech Preview -> GA

TBD

##### Removing a deprecated feature

- The annotation based external remediation needs to be deprecated
- Open question: for how long do we need to support both mechanisms in parallel (if at all)?

### Upgrade / Downgrade Strategy

- Open question: do we need an automatic MHC conversion from the existing annotation based mechanism to the new one?

### Version Skew Strategy

There is a dependency between the machine-api-operator (which contains the machine healthcheck controller) and
cluster-api-provider-baremetal (which provides the current external baremetal remediation controller); both are part
of the OCP release payload. That means there can be a short-lived version skew during upgrades. This is not a
problem though, because an updated MHC can only be applied (or an automatic conversion can only happen) when both
controllers and their CRDs are updated: the MHC needs to have the new template field, and the remediation CRD and
its template CRD need to exist.

## Implementation History

- [x] 11/30/2020: Opened enhancement PR

## Drawbacks

No known drawbacks.

## Alternatives

- Keep the existing annotation based mechanism.