Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose to backport the "external remediation template" feature #551

Merged

Conversation

slintes
Copy link
Member

@slintes slintes commented Nov 30, 2020

@slintes
Copy link
Member Author

slintes commented Nov 30, 2020

/cc @beekhof @n1r1

a 1st round of review is appreciated before spreading this, thanks!

@jwforres jwforres removed their request for review November 30, 2020 20:49
##### Removing a deprecated feature

- The annotation based external remediation needs to be deprecated
- Open question: for how long do we need to support both mechanisms in parallel (if at all)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annotation could just be a syntactic shortcut for an equivalent externalRemediationTemplate if no other one is provided.
Wouldn't be too burdensome to support.


### Upgrade / Downgrade Strategy

- Open question: do we need an automatic MHC conversion from the existing annotation based mechanism to the new one?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Fencing must not break due to an upgrade

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same goes for downgrade?
i.e. should we convert specific external remediation template to annotation in existing MHC on downgrade? is this possible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering when and how a downgrade will ever happen...?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The enhancement template contains "Downgrade Strategy" and I remember Clayton saying this is an important one and a core platform requirement, so I guess this is a supported option.

as for "when", maybe to rollback version if you're having an issues with the new version.
as for "how", no idea :)

enhancements/baremetal/external-remediations.md Outdated Show resolved Hide resolved
create a new one. This isn't the best remediation strategy in all environments.

There is already a mechanism to provide an alternative, external remediation strategy, by adding an annotation to the
`MachineHealthCheck` and then to `Machine`s. However, this is isn't very maintainable.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to elaborate more on the downsides of having an annotation instead of CR.


### User Stories

#### Story 1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a story for non-BM case?

enhancements/baremetal/external-remediations.md Outdated Show resolved Hide resolved
enhancements/baremetal/external-remediations.md Outdated Show resolved Hide resolved
enhancements/baremetal/external-remediations.md Outdated Show resolved Hide resolved

### Upgrade / Downgrade Strategy

- Open question: do we need an automatic MHC conversion from the existing annotation based mechanism to the new one?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same goes for downgrade?
i.e. should we convert specific external remediation template to annotation in existing MHC on downgrade? is this possible?

@n1r1
Copy link

n1r1 commented Dec 1, 2020

Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
This will allow separation of detection (MHC) and remediation.
So if a user didn't specify externalRemediationTemplate, MHC will create a CR that the default remediation controller will consume.

@slintes
Copy link
Member Author

slintes commented Dec 1, 2020

Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
This will allow separation of detection (MHC) and remediation.
So if a user didn't specify externalRemediationTemplate, MHC will create a CR that the default remediation controller will consume.

Interesting idea, I guess that would be a follow up though?
Are there similar plans upstream already?

@n1r1
Copy link

n1r1 commented Dec 1, 2020

Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
This will allow separation of detection (MHC) and remediation.
So if a user didn't specify externalRemediationTemplate, MHC will create a CR that the default remediation controller will consume.

Interesting idea, I guess that would be a follow up though?

yeah. just something to keep in mind.

Are there similar plans upstream already?

I remember we discussed this upstream, but I'm not aware of a concrete plan to do this.

Signed-off-by: Marc Sluiter <[email protected]>
@slintes
Copy link
Member Author

slintes commented Dec 2, 2020

/cc @JoelSpeed @michaelgugino @enxebre

Hi, it was suggested to add you as approvers to this. Do you mind giving a review? Thanks!

@openshift-bot
Copy link

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2021
@JoelSpeed
Copy link
Contributor

/remove-lifecycle stale.

@elmiko
Copy link
Contributor

elmiko commented Mar 30, 2021

this reads well and the implementation generally makes sense to me. i do have a question about the interaction, or lack thereof, between the MHC and ERC. is there any consideration about the notion that the MHC could create an EMR which never gets reconciled (maybe the ERC is down or something)?

i'm just curious if we would want the MHC to create an alert if an EMR hasn't been removed in like 24-48 hours?

@elmiko
Copy link
Contributor

elmiko commented Mar 30, 2021

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 30, 2021
@slintes
Copy link
Member Author

slintes commented Apr 1, 2021

@mshitrit fyi

@n1r1
Copy link

n1r1 commented Apr 1, 2021

i do have a question about the interaction, or lack thereof, between the MHC and ERC. is there any consideration about the notion that the MHC could create an EMR which never gets reconciled (maybe the ERC is down or something)?

i'm just curious if we would want the MHC to create an alert if an EMR hasn't been removed in like 24-48 hours?

Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff.

Creating an alert makes sense to me. No matter if it's an ERC that is down or a machine that couldn't be remediated.

@elmiko
Copy link
Contributor

elmiko commented Apr 1, 2021

Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff.

that makes sense to me, this is why i was thinking a really long timer on the alert.

@mshitrit
Copy link
Contributor

mshitrit commented Apr 2, 2021

Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff.

I agree as well - IMO this will be a good improvement to this feature.
However I don't think the lack of it should block us from merging to the current release.
/cc @beekhof

Comment on lines 40 to 44
This proposal is a backport of parts of the upstream machine healthcheck proposal [0], which
also is already implemented [1].

- [0] https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
- [1] https://github.com/kubernetes-sigs/cluster-api/pull/3606
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, any reason not to inline these links?


## Proposal

We propose modifying the MachineHealthCheck CRD to support a externalRemediationTemplate, an ObjectReference to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, I think this would be better if it were slightly more specific

Suggested change
We propose modifying the MachineHealthCheck CRD to support a externalRemediationTemplate, an ObjectReference to
We propose modifying the MachineHealthCheck CRD to add a new field, `externalRemediationTemplate`, an ObjectReference to

Comment on lines +78 to +79
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect
non-transient issues faster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this really makes all that much sense, does power cycling not effectively reset and prevent you from diagnosing the error? I don't see how this proposal helps detect the issues faster?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If automatic power-cycles don't resolve the issue it helps you to rule out transient issues like software bugs, etc.

If an admin wouldn't have these automatic power-cycles, he might have try to reboot the node first to see if the problem persists or not.
Once he have the automatic reboots, he can skip that stage.

Perhaps we need to rephrase this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've rephrased 👍

Comment on lines 83 to 84
As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes,
so that they are automatically added back to the cluster when I fix the underlying problem.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does attempting to power cycle while you are remediating the issue not actually make this problem worse? This sounds undesirable to me, if I'm working trying to fix a hardware issue, I don't want the machine to magically come back on mid way through the hardware change.

Perhaps this story can be clarified a bit., I'm not huge on baremetal these days so I assume there's some nuances I'm not seeing here

Copy link

@n1r1 n1r1 Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the intention here is external issues, such as network problems (e.g. a host that can't reach the api-server).
TBH I'd expect the system to be able to recover itself in such cases even without power-cycle, so maybe this user story is not very compelling

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mshitrit Do you have any thoughts on this one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - removed

Comment on lines 98 to 102
When a Machine enters an unhealthy state, the MHC will:
* Look up the referenced template
* Instantiate the template (for simplicity, we will refer to this as a External Machine Remediation CR, or EMR)
* Force the name and namespace to match the unhealthy Machine
* Save the new object in etcd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to duplicate what is said in the paragraphs above, do we need it twice?

Comment on lines 254 to 260
## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new
subproject, repos requested, github details, and/or testing infrastructure.

Listing these here allows the community to get the process for these resources
started right away.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop this heading

@mshitrit mshitrit force-pushed the external-remediation-template branch from 7050b35 to 34e3d68 Compare April 8, 2021 06:23
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mshitrit I'm pretty happy to give my approval, just wanted your input on one thread before we do, seems maybe a redundant user story?

Comment on lines 83 to 84
As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes,
so that they are automatically added back to the cluster when I fix the underlying problem.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mshitrit Do you have any thoughts on this one?

_ Improve phrasing
- Remove redundant parts
- Remove trailing spaces

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit mshitrit force-pushed the external-remediation-template branch from 34e3d68 to 817c3d9 Compare April 18, 2021 06:26
@JoelSpeed
Copy link
Contributor

/approve

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2021
@mshitrit
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 19, 2021
@openshift-merge-robot openshift-merge-robot merged commit a658e5a into openshift:master Apr 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants