Propose to backport the "external remediation template" feature #551

slintes · 2020-11-30T18:42:52Z

With this enhancement we propose to backport the "external remediation template" feature.

See

Signed-off-by: Marc Sluiter <[email protected]>

slintes · 2020-11-30T18:43:40Z

/cc @beekhof @n1r1

a 1st round of review is appreciated before spreading this, thanks!


beekhof · 2020-12-01T00:31:57Z

enhancements/baremetal/external-remediations.md

+##### Removing a deprecated feature
+
+- The annotation based external remediation needs to be deprecated
+- Open question: for how long do we need to support both mechanisms in parallel (if at all)?


The annotation could just be a syntactic shortcut for an equivalent externalRemediationTemplate if no other one is provided.
Wouldn't be too burdensome to support.
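
A minimal sketch of that shortcut idea, assuming the MHC is handled as a plain dictionary. The annotation key and value, the `DEFAULT_TEMPLATE_FOR_ANNOTATION` reference, and the `effective_template` helper are all hypothetical names chosen for illustration; the thread does not spell them out. An explicit `externalRemediationTemplate` wins, and the annotation is only consulted when no template is set.

```python
from typing import Optional

# Assumed annotation key and value; not confirmed anywhere in this thread.
REMEDIATION_ANNOTATION = "machine.openshift.io/remediation-strategy"
EXTERNAL_BAREMETAL = "external-baremetal"

# Hypothetical template reference that the annotation would expand to.
DEFAULT_TEMPLATE_FOR_ANNOTATION = {
    "apiVersion": "example.io/v1alpha1",
    "kind": "ExampleRemediationTemplate",
    "name": "annotation-default",
}


def effective_template(mhc: dict) -> Optional[dict]:
    """Return the template reference the MHC should use, or None for default remediation."""
    # An explicitly configured template always takes precedence.
    explicit = (mhc.get("spec") or {}).get("externalRemediationTemplate")
    if explicit:
        return explicit
    # Otherwise treat the legacy annotation as shorthand for an equivalent template.
    annotations = (mhc.get("metadata") or {}).get("annotations") or {}
    if annotations.get(REMEDIATION_ANNOTATION) == EXTERNAL_BAREMETAL:
        return dict(DEFAULT_TEMPLATE_FOR_ANNOTATION)
    return None
```

With this shape, existing annotated MHCs would keep working unchanged, and both mechanisms would share one code path after resolution.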

beekhof · 2020-12-01T00:32:43Z

enhancements/baremetal/external-remediations.md

+
+### Upgrade / Downgrade Strategy
+
+- Open question: do we need an automatic MHC conversion from the existing annotation based mechanism to the new one? 


Yes. Fencing must not break due to an upgrade

same goes for downgrade?
i.e. should we convert specific external remediation template to annotation in existing MHC on downgrade? is this possible?

I'm wondering when and how a downgrade will ever happen...?

The enhancement template contains "Downgrade Strategy" and I remember Clayton saying this is an important one and a core platform requirement, so I guess this is a supported option.

as for "when", maybe to roll back the version if you're having issues with the new version.
as for "how", no idea :)


n1r1 · 2020-12-01T07:43:24Z

enhancements/baremetal/external-remediations.md

+create a new one. This isn't the best remediation strategy in all environments.
+
+There is already a mechanism to provide an alternative, external remediation strategy, by adding an annotation to the
+`MachineHealthCheck` and then to `Machine`s. However, this is isn't very maintainable.


I suggest elaborating more on the downsides of using an annotation instead of a CR.

n1r1 · 2020-12-01T07:44:56Z

enhancements/baremetal/external-remediations.md

+
+### User Stories 
+
+#### Story 1


Maybe add a story for non-BM case?


n1r1 · 2020-12-01T07:54:58Z

enhancements/baremetal/external-remediations.md

+
+### Upgrade / Downgrade Strategy
+
+- Open question: do we need an automatic MHC conversion from the existing annotation based mechanism to the new one? 


same goes for downgrade?
i.e. should we convert specific external remediation template to annotation in existing MHC on downgrade? is this possible?

n1r1 · 2020-12-01T08:00:43Z

Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
This will allow separation of detection (MHC) and remediation.
So if a user didn't specify externalRemediationTemplate, MHC will create a CR that the default remediation controller will consume.


slintes · 2020-12-01T12:21:23Z

> Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
> This will allow separation of detection (MHC) and remediation.
> So if a user didn't specify externalRemediationTemplate, MHC will create a CR that the default remediation controller will consume.

Interesting idea, I guess that would be a follow up though?
Are there similar plans upstream already?


n1r1 · 2020-12-01T13:41:05Z

> Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
> This will allow separation of detection (MHC) and remediation.
> So if a user didn't specify externalRemediationTemplate, MHC will create a CR that the default remediation controller will consume.

> Interesting idea, I guess that would be a follow up though?

yeah. just something to keep in mind.

> Are there similar plans upstream already?

I remember we discussed this upstream, but I'm not aware of a concrete plan to do this.


slintes · 2020-12-02T17:18:47Z

/cc @JoelSpeed @michaelgugino @enxebre

Hi, it was suggested to add you as approvers to this. Do you mind giving a review? Thanks!

openshift-bot · 2021-03-02T19:39:15Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

JoelSpeed · 2021-03-03T10:49:26Z

/remove-lifecycle stale.

elmiko · 2021-03-30T21:43:01Z

this reads well and the implementation generally makes sense to me. i do have a question about the interaction, or lack thereof, between the MHC and ERC. is there any consideration about the notion that the MHC could create an EMR which never gets reconciled (maybe the ERC is down or something)?

i'm just curious if we would want the MHC to create an alert if an EMR hasn't been removed in like 24-48 hours?

elmiko · 2021-03-30T21:43:55Z

/remove-lifecycle stale

slintes · 2021-04-01T16:32:12Z

@mshitrit fyi

n1r1 · 2021-04-01T19:59:00Z

> i do have a question about the interaction, or lack thereof, between the MHC and ERC. is there any consideration about the notion that the MHC could create an EMR which never gets reconciled (maybe the ERC is down or something)?
>
> i'm just curious if we would want the MHC to create an alert if an EMR hasn't been removed in like 24-48 hours?

Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff.

Creating an alert makes sense to me. No matter if it's an ERC that is down or a machine that couldn't be remediated.
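
A rough sketch of the check being discussed, assuming EMRs are available as dictionaries carrying a standard `metadata.creationTimestamp` in RFC 3339 form. The `stale_emrs` helper and the 48-hour default are illustrative only; the thread floats "24-48 hours" without settling on a value, and nothing here reflects an actual MHC implementation.

```python
from datetime import datetime, timedelta, timezone


def stale_emrs(emrs, now, max_age=timedelta(hours=48)):
    """Return the names of EMRs that have existed for longer than max_age."""
    stale = []
    for emr in emrs:
        meta = emr["metadata"]
        # Kubernetes timestamps end in "Z"; fromisoformat needs an explicit offset.
        created = datetime.fromisoformat(meta["creationTimestamp"].replace("Z", "+00:00"))
        if now - created > max_age:
            stale.append(meta["name"])
    return stale


if __name__ == "__main__":
    now = datetime(2021, 4, 1, 20, 0, tzinfo=timezone.utc)
    emrs = [
        {"metadata": {"name": "worker-0", "creationTimestamp": "2021-03-29T12:00:00Z"}},
        {"metadata": {"name": "worker-1", "creationTimestamp": "2021-04-01T12:00:00Z"}},
    ]
    # Only the EMR older than 48 hours would be a candidate for an alert.
    print(stale_emrs(emrs, now))
```

The check deliberately does not ask why the EMR is still there, which matches the point above: a down ERC and a machine that could not be remediated both deserve attention.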

elmiko · 2021-04-01T20:04:00Z

> Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff.

that makes sense to me, this is why i was thinking a really long timer on the alert.

mshitrit · 2021-04-02T05:01:26Z

> Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff.

I agree as well - IMO this will be a good improvement to this feature.
However I don't think the lack of it should block us from merging to the current release.
/cc @beekhof

JoelSpeed · 2021-04-07T11:56:34Z

enhancements/machine-api/external-remediations.md

+This proposal is a backport of parts of the upstream machine healthcheck proposal [0], which
+also is already implemented [1].
+
+- [0] https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
+- [1] https://github.com/kubernetes-sigs/cluster-api/pull/3606


Nit, any reason not to inline these links?

JoelSpeed · 2021-04-07T11:59:12Z

enhancements/machine-api/external-remediations.md

+
+## Proposal
+
+We propose modifying the MachineHealthCheck CRD to support a externalRemediationTemplate, an ObjectReference to


Nit, I think this would be better if it were slightly more specific

Suggested change:

-We propose modifying the MachineHealthCheck CRD to support a externalRemediationTemplate, an ObjectReference to
+We propose modifying the MachineHealthCheck CRD to add a new field, `externalRemediationTemplate`, an ObjectReference to

JoelSpeed · 2021-04-07T12:00:51Z

enhancements/machine-api/external-remediations.md

+As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect
+non-transient issues faster.


Not sure this really makes all that much sense, does power cycling not effectively reset and prevent you from diagnosing the error? I don't see how this proposal helps detect the issues faster?

If automatic power-cycles don't resolve the issue it helps you to rule out transient issues like software bugs, etc.

If an admin didn't have these automatic power-cycles, they might have to try rebooting the node first to see whether the problem persists.
Once they have the automatic reboots, they can skip that stage.

Perhaps we need to rephrase this.

Thanks, I've rephrased 👍

JoelSpeed · 2021-04-07T12:03:23Z

enhancements/machine-api/external-remediations.md

+As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes,
+so that they are automatically added back to the cluster when I fix the underlying problem.


Does attempting to power cycle while you are remediating the issue not actually make this problem worse? This sounds undesirable to me, if I'm working trying to fix a hardware issue, I don't want the machine to magically come back on mid way through the hardware change.

Perhaps this story can be clarified a bit. I'm not huge on baremetal these days, so I assume there are some nuances I'm not seeing here.

I believe the intention here is external issues, such as network problems (e.g. a host that can't reach the api-server).
TBH I'd expect the system to be able to recover by itself in such cases even without a power-cycle, so maybe this user story is not very compelling.

@mshitrit Do you have any thoughts on this one?

I agree - removed

JoelSpeed · 2021-04-07T12:07:47Z

enhancements/machine-api/external-remediations.md

+When a Machine enters an unhealthy state, the MHC will:
+* Look up the referenced template
+* Instantiate the template (for simplicity, we will refer to this as a External Machine Remediation CR, or EMR)
+* Force the name and namespace to match the unhealthy Machine
+* Save the new object in etcd


This seems to duplicate what is said in the paragraphs above, do we need it twice?
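
For reference, the four steps quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: objects are plain dictionaries, the `templates` and `store` dictionaries stand in for the API server and etcd, the helper names are hypothetical, and deriving the EMR kind by stripping a `Template` suffix and copying `spec.template.spec` is an assumption, not something this thread confirms.

```python
import copy


def instantiate_emr(template: dict, machine: dict) -> dict:
    """Build an EMR from a template, forcing name and namespace to match the Machine."""
    kind = template["kind"]
    # Assumption: the EMR kind is the template kind without the "Template" suffix.
    if kind.endswith("Template"):
        kind = kind[: -len("Template")]
    body = copy.deepcopy((template.get("spec") or {}).get("template") or {})
    return {
        "apiVersion": template["apiVersion"],
        "kind": kind,
        "metadata": {
            "name": machine["metadata"]["name"],
            "namespace": machine["metadata"]["namespace"],
        },
        "spec": body.get("spec", {}),
    }


def remediate(machine: dict, template_ref: dict, templates: dict, store: dict) -> dict:
    """Look up the referenced template, instantiate it, and save the result."""
    template = templates[(template_ref["kind"], template_ref["name"])]
    emr = instantiate_emr(template, machine)
    key = (emr["kind"], emr["metadata"]["namespace"], emr["metadata"]["name"])
    # Keep an existing EMR untouched so repeated reconciles stay idempotent.
    return store.setdefault(key, emr)
```

Because the EMR shares its name and namespace with the unhealthy Machine, a remediation controller can find the Machine from the EMR without any extra reference field.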

JoelSpeed · 2021-04-07T12:14:11Z

enhancements/machine-api/external-remediations.md

+## Infrastructure Needed [optional]
+
+Use this section if you need things from the project. Examples include a new
+subproject, repos requested, github details, and/or testing infrastructure.
+
+Listing these here allows the community to get the process for these resources
+started right away.


I think we can drop this heading

JoelSpeed

@mshitrit I'm pretty happy to give my approval, just wanted your input on one thread before we do, seems maybe a redundant user story?

JoelSpeed · 2021-04-14T14:59:59Z

enhancements/machine-api/external-remediations.md

+As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes,
+so that they are automatically added back to the cluster when I fix the underlying problem.


@mshitrit Do you have any thoughts on this one?


JoelSpeed · 2021-04-19T10:27:16Z

/approve

openshift-ci-robot · 2021-04-19T10:27:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JoelSpeed]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mshitrit · 2021-04-19T10:30:38Z

/lgtm

Added external remediation template proposal

625b35d

Signed-off-by: Marc Sluiter <[email protected]>

openshift-ci-robot requested review from jwforres and sttts November 30, 2020 18:43

openshift-ci-robot requested review from beekhof and n1r1 November 30, 2020 18:43

jwforres removed their request for review November 30, 2020 20:49

beekhof reviewed Dec 1, 2020

beekhof reviewed Dec 1, 2020

beekhof reviewed Dec 1, 2020

n1r1 reviewed Dec 1, 2020

slintes and others added 2 commits December 1, 2020 13:06

Update enhancements/baremetal/external-remediations.md

7ca934a

Co-authored-by: Andrew Beekhof <[email protected]>

Update enhancements/baremetal/external-remediations.md

c5b121e

Co-authored-by: Andrew Beekhof <[email protected]>

Moved to machine-api and adressed feedback

99d7e56

Signed-off-by: Marc Sluiter <[email protected]>

Added approvers

8976c2d

Signed-off-by: Marc Sluiter <[email protected]>

openshift-ci-robot requested review from enxebre, JoelSpeed and michaelgugino December 2, 2020 17:18

openshift-ci-robot added the lifecycle/stale label Mar 2, 2021

openshift-ci-robot removed the lifecycle/stale label Mar 30, 2021

openshift-ci-robot requested a review from beekhof April 2, 2021 05:01

JoelSpeed reviewed Apr 7, 2021

mshitrit force-pushed the external-remediation-template branch from 7050b35 to 34e3d68 Compare April 8, 2021 06:23

JoelSpeed reviewed Apr 14, 2021

- Inlining Links

817c3d9

- Improve phrasing
- Remove redundant parts
- Remove trailing spaces

Signed-off-by: Michael Shitrit <[email protected]>

mshitrit force-pushed the external-remediation-template branch from 34e3d68 to 817c3d9 Compare April 18, 2021 06:26

openshift-ci-robot added the approved label Apr 19, 2021

openshift-ci-robot assigned mshitrit Apr 19, 2021

openshift-ci-robot added the lgtm label Apr 19, 2021

openshift-merge-robot merged commit a658e5a into openshift:master Apr 19, 2021

mshitrit mentioned this pull request May 10, 2021

WIP ✨ Alert old emr kubernetes-sigs/cluster-api#4571

Closed

