Propose to backport the "external remediation template" feature #551
Signed-off-by: Marc Sluiter <[email protected]>
##### Removing a deprecated feature

- The annotation based external remediation needs to be deprecated
- Open question: for how long do we need to support both mechanisms in parallel (if at all)?
The annotation could just be a syntactic shortcut for an equivalent `externalRemediationTemplate` if no other one is provided.
It wouldn't be too burdensome to support.
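To make the suggested fallback concrete, here is a minimal sketch of how the annotation could resolve to an equivalent template when no explicit one is set. The annotation key `example.io/external-remediation`, the `MachineHealthCheck` and `ObjectReference` classes, and the `effective_template` helper are all hypothetical names for illustration; none of them are taken from the proposal or the real API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical annotation key, used only for this illustration.
LEGACY_ANNOTATION = "example.io/external-remediation"


@dataclass(frozen=True)
class ObjectReference:
    api_version: str
    kind: str
    name: str
    namespace: str


@dataclass
class MachineHealthCheck:
    name: str
    namespace: str
    annotations: Dict[str, str] = field(default_factory=dict)
    external_remediation_template: Optional[ObjectReference] = None


def effective_template(
    mhc: MachineHealthCheck, annotation_equivalent: ObjectReference
) -> Optional[ObjectReference]:
    """Return the template the MHC should use, or None for default remediation."""
    # An explicitly configured template always wins.
    if mhc.external_remediation_template is not None:
        return mhc.external_remediation_template
    # Otherwise the legacy annotation acts as shorthand for a fixed template.
    if LEGACY_ANNOTATION in mhc.annotations:
        return annotation_equivalent
    return None
```

With this shape, existing annotated MHCs would keep working after an upgrade without being rewritten, which may also sidestep part of the conversion question.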
### Upgrade / Downgrade Strategy

- Open question: do we need an automatic MHC conversion from the existing annotation based mechanism to the new one?
Yes. Fencing must not break due to an upgrade
Does the same go for downgrade?
I.e., should we convert a specific external remediation template back to the annotation in an existing MHC on downgrade? Is this possible?
I'm wondering when and how a downgrade will ever happen...?
The enhancement template contains "Downgrade Strategy" and I remember Clayton saying this is an important one and a core platform requirement, so I guess this is a supported option.
As for "when", maybe to roll back to a previous version if you're having issues with the new version.
As for "how", no idea :)
create a new one. This isn't the best remediation strategy in all environments.

There is already a mechanism to provide an alternative, external remediation strategy, by adding an annotation to the
`MachineHealthCheck` and then to `Machine`s. However, this isn't very maintainable.
I suggest elaborating more on the downsides of having an annotation instead of a CR.
### User Stories

#### Story 1
Maybe add a story for a non-BM (non-bare-metal) case?
Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
Co-authored-by: Andrew Beekhof <[email protected]>
Interesting idea, I guess that would be a follow up though?
Signed-off-by: Marc Sluiter <[email protected]>
yeah. just something to keep in mind.
I remember we discussed this upstream, but I'm not aware of a concrete plan to do this.
Signed-off-by: Marc Sluiter <[email protected]>
/cc @JoelSpeed @michaelgugino @enxebre Hi, it was suggested to add you as approvers to this. Do you mind giving a review? Thanks!
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`.
/lifecycle stale
/remove-lifecycle stale
this reads well and the implementation generally makes sense to me. i do have a question about the interaction, or lack thereof, between the MHC and ERC. is there any consideration about the notion that the MHC could create an EMR which never gets reconciled (maybe the ERC is down or something)? i'm just curious if we would want the MHC to create an alert if an EMR hasn't been removed in like 24-48 hours?
/remove-lifecycle stale
@mshitrit fyi
Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff. Creating an alert makes sense to me. No matter if it's an ERC that is down or a machine that couldn't be remediated.
that makes sense to me, this is why i was thinking a really long timer on the alert.
I agree as well - IMO this will be a good improvement to this feature.
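As a rough illustration of the "long timer" idea discussed in this thread, the check could be as simple as comparing each EMR's creation time against a threshold. This is only a sketch under assumptions: the dict shape with `name` and `created` keys, the `stale_emrs` helper, and the 24-hour default (the low end of the 24-48 hours mentioned above) are hypothetical and not part of the proposal.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Assumed default; the thread only suggests "24-48 hours".
STALE_AFTER = timedelta(hours=24)


def stale_emrs(
    emrs: List[Dict], now: datetime, threshold: timedelta = STALE_AFTER
) -> List[str]:
    """Return the names of EMRs that have existed for longer than the threshold."""
    return [emr["name"] for emr in emrs if now - emr["created"] > threshold]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    emrs = [
        {"name": "worker-0", "created": now - timedelta(hours=30)},
        {"name": "worker-1", "created": now - timedelta(minutes=5)},
    ]
    # Each returned name would be a candidate for an alert.
    print(stale_emrs(emrs, now))
```

Because the check only looks at how long the EMR has existed, it covers both cases raised above: an ERC that is down and a machine that could not be remediated.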
This proposal is a backport of parts of the upstream machine healthcheck proposal [0], which
also is already implemented [1].

- [0] https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
- [1] https://github.com/kubernetes-sigs/cluster-api/pull/3606
Nit, any reason not to inline these links?
## Proposal

We propose modifying the MachineHealthCheck CRD to support a externalRemediationTemplate, an ObjectReference to
Nit, I think this would be better if it were slightly more specific
- We propose modifying the MachineHealthCheck CRD to support a externalRemediationTemplate, an ObjectReference to
+ We propose modifying the MachineHealthCheck CRD to add a new field, `externalRemediationTemplate`, an ObjectReference to
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect
non-transient issues faster.
Not sure this really makes all that much sense, does power cycling not effectively reset and prevent you from diagnosing the error? I don't see how this proposal helps detect the issues faster?
If automatic power-cycles don't resolve the issue, that helps you rule out transient issues like software bugs, etc.
Without these automatic power-cycles, an admin might have to try rebooting the node first to see whether the problem persists.
Once the automatic reboots are in place, the admin can skip that stage.
Perhaps we need to rephrase this.
Thanks, I've rephrased 👍
As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes,
so that they are automatically added back to the cluster when I fix the underlying problem.
Does attempting to power cycle while you are remediating the issue not actually make this problem worse? This sounds undesirable to me, if I'm working trying to fix a hardware issue, I don't want the machine to magically come back on mid way through the hardware change.
Perhaps this story can be clarified a bit. I'm not huge on baremetal these days, so I assume there are some nuances I'm not seeing here.
I believe the intention here is external issues, such as network problems (e.g. a host that can't reach the api-server).
TBH I'd expect the system to be able to recover itself in such cases even without power-cycle, so maybe this user story is not very compelling
@mshitrit Do you have any thoughts on this one?
I agree - removed
When a Machine enters an unhealthy state, the MHC will:
* Look up the referenced template
* Instantiate the template (for simplicity, we will refer to this as an External Machine Remediation CR, or EMR)
* Force the name and namespace to match the unhealthy Machine
* Save the new object in etcd
This seems to duplicate what is said in the paragraphs above, do we need it twice?
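For readers following along, the four steps in the quoted hunk can be sketched as below. This is an illustration under assumptions, not the actual controller code: plain dicts stand in for Kubernetes objects, the `templates` and `store` dicts stand in for the API server and etcd, the body to instantiate is assumed to live under `spec.template`, and `create_emr` is a hypothetical helper name.

```python
import copy
from typing import Dict, Tuple

Key = Tuple[str, str]  # (namespace, name)


def create_emr(
    machine: Dict[str, str],
    template_ref: Dict[str, str],
    templates: Dict[Key, dict],
    store: Dict[Key, dict],
) -> dict:
    """Create (or return the existing) EMR for an unhealthy machine."""
    # 1. Look up the referenced template.
    template = templates[(template_ref["namespace"], template_ref["name"])]

    # 2. Instantiate it. A deep copy keeps the template itself unchanged.
    #    Assumption: the object to create is nested under spec.template.
    emr = copy.deepcopy(template["spec"]["template"])

    # 3. Force the name and namespace to match the unhealthy Machine.
    emr.setdefault("metadata", {})
    emr["metadata"]["name"] = machine["name"]
    emr["metadata"]["namespace"] = machine["namespace"]

    # 4. Save the new object (the dict stands in for etcd). An existing EMR
    #    for the same machine is left untouched.
    key = (machine["namespace"], machine["name"])
    return store.setdefault(key, emr)
```

Matching the EMR's name and namespace to the Machine means a lookup by the Machine's own key is enough to find its remediation object, and a second unhealthy report for the same Machine does not create a duplicate.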
## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new
subproject, repos requested, github details, and/or testing infrastructure.

Listing these here allows the community to get the process for these resources
started right away.
I think we can drop this heading
Force-pushed from 7050b35 to 34e3d68.
@mshitrit I'm pretty happy to give my approval, just wanted your input on one thread before we do, seems maybe a redundant user story?
As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes,
so that they are automatically added back to the cluster when I fix the underlying problem.
- Improve phrasing
- Remove redundant parts
- Remove trailing spaces

Signed-off-by: Michael Shitrit <[email protected]>
Force-pushed from 34e3d68 to 817c3d9.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
With this enhancement we propose to backport the "external remediation template" feature.
See