short-circuiting-backoff #673
Conversation
/wip
Signed-off-by: mshitrit <[email protected]>
Force-pushed a66a371 to cc01360.
added a couple comments, and i have a couple questions.
i see you've added the note about the failed state being paired with the missing noderef and providerid, but i wonder if this feature should be broader in scope? for example, a cluster admin might want to check all mhc remediations, and want to have a blanket timeout for any machine that is going to be remediated. i'm curious if this is something you considered?
the name of this is "short circuiting backoff", but i didn't see mention of the short-circuit mechanism. i imagine there will be cases where the machines held in the failed state cause the mhc to go above its max unhealthy limit; is this the backoff being addressed? assuming so, i think it should be mentioned more specifically.
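For context, a minimal sketch of the maxUnhealthy short-circuit guard referenced above; the function and parameter names are simplified assumptions rather than the actual machine-api-operator code.

```go
package main

import "fmt"

// shouldShortCircuit sketches the existing MachineHealthCheck guard: once the
// number of unhealthy machines exceeds the maxUnhealthy budget, the controller
// stops remediating. Machines stuck in a Failed state still count as unhealthy,
// so they can keep the check short-circuited indefinitely.
func shouldShortCircuit(unhealthyCount, maxUnhealthy int) bool {
	return unhealthyCount > maxUnhealthy
}

func main() {
	fmt.Println(shouldShortCircuit(3, 2)) // true: remediation is paused
	fmt.Println(shouldShortCircuit(1, 2)) // false: remediation proceeds
}
```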
Hmm, I could be wrong, but I think what you are referring to is NodeStartUpTime, which already exists.
Good point, thanks!
looking good to me, thanks for the update. @mshitrit is this still wip or are you ready for labels?
Hi @elmiko |
hey @mshitrit, this generally looks good to me. i added some suggestions that i think help to clarify, and a few grammar nits.
Signed-off-by: mshitrit <[email protected]>
Force-pushed 314eafd to 87495d3.
this is looking good to me. we might remove some of the unused sections, but i'm ok as is.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: elmiko. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Please take a look at the comments I've added. There are some problems I think need exploring further before we can merge, around defaulting/disabling the new field, and also around what happens to existing users when they upgrade: will this change the behaviour for them, and is that desirable?
> If no value for `FailedNodeStartupTimeout` is defined for the MachineHealthCheck CR, the existing remediation flow is preserved.
In the implementation you've set a default, which is incompatible with this statement: you won't be able to remove the value. If you want to have no default, that would actually be preferable, as it would then preserve existing behaviour for users who upgrade.
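A minimal sketch of the no-default behaviour being discussed, assuming a simplified spec type where a zero Duration means "field not set"; this is an illustration, not the merged machine-api-operator code.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-in for the MachineHealthCheck spec field.
type mhcSpec struct {
	// Zero value means the user did not set failedNodeStartupTimeout.
	FailedNodeStartupTimeout time.Duration
}

// useShortCircuitPath returns true only when the new field is explicitly set,
// so clusters that never touch the field keep today's remediation flow.
func useShortCircuitPath(spec mhcSpec) bool {
	return spec.FailedNodeStartupTimeout != 0
}

func main() {
	fmt.Println(useShortCircuitPath(mhcSpec{}))                                          // false: existing behaviour preserved
	fmt.Println(useShortCircuitPath(mhcSpec{FailedNodeStartupTimeout: 5 * time.Minute})) // true: new path enabled
}
```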
You are right.
Once we decide on the best way to proceed regarding default/no default, I'll make the proper adjustments.
What did we decide on this one?
The default was removed in the implementation, so I guess that was the decision
openshift/machine-api-operator@e3d7784
Indeed. Here is a link to the correspondence.
> ### Risks and Mitigations
>
> No known risks.
Currently, you're setting a default in the implementation, and this would affect existing users. It may be worth discussing the pros/cons of a default here so we know whether to have one or not.
That's a good point. Here is my take on that:
- No default (pro): naive users aren't surprised by a new behavior.
- Default (pro): naive users do benefit from the new behavior.

I guess the real question here is whether this new behavior benefits all users or not. Let me know what you think.
I think for now let's keep no default and maintain the existing behaviour, but can we get these pros/cons fleshed out in the risks/mitigations section within the doc?
Co-authored-by: Michael McCune <[email protected]>
New changes are detected. LGTM label has been removed.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`. /lifecycle stale
No real objections to this. I would like to see some extra details fleshed out; I've highlighted where these are in the document.
Did we ever start a conversation upstream about this? If so, it would be good to link that into this document too.
> ### Risks and Mitigations
>
> No known risks.
I think for now let's keep no default and maintain the existing behaviour, but can we get these pros/cons fleshed out in the risks/mitigations section within the doc?
Hi @JoelSpeed, we will have to revert, as the markdownlint check is required. I'll pass this as feedback to the archs, though; it seems weird to enforce these titles for all enhancements.
Force-pushed ba99f6c to 16d55ac.
Force-pushed fe61274 to 1657c1f.
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. If this issue is safe to close now please do so with `/close`. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting `/reopen`. /close
@openshift-bot: Closed this PR.
We still want this. /reopen
@slintes: Reopened this PR.
Hi @JoelSpeed and @elmiko, we want to revive this task. I understand there was already an lgtm on this, and only the linter prevented it from being merged. The linter is green today; is anything else left before merging? Thanks in advance 🙂
Did we propose this feature upstream at all?
> If no value for `FailedNodeStartupTimeout` is defined for the MachineHealthCheck CR, the existing remediation flow is preserved.
What did we decide on this one?
```go
...
// +optional
FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"`
```
Coming back to this, is `Startup` really involved here? Won't this `FailedNode` timeout apply to all failed nodes? Should this just be a `FailedNodeTimeout`?
Good question.
From reading this enhancement I'd say the same.
From reading the implementation I'd say `Startup` makes sense, because the timeout is applied to the machine's creation timestamp. I'm not sure that implementation is correct, though. Maybe the timeout should be applied to the time when the machine reached the failed phase? (Not sure if that information is available...)
@beekhof @mshitrit WDYT?
Per the implementation, `failedNodeStartupTimeout` kicks in for machines whose nodes presumably failed to start, so I basically agree with Marc that the name does make sense, assuming the implementation is correct.
Here is the relevant implementation code.
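Roughly, the check being discussed looks like the sketch below: a Failed machine that never got a nodeRef or providerID, and whose creation timestamp is older than the configured timeout, is remediated immediately. The names and types here are simplified assumptions, not the exact implementation code.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified view of the fields the check relies on;
// the real controller works with the Machine API types instead.
type machine struct {
	Phase             string
	CreationTimestamp time.Time
	HasNodeRef        bool
	HasProviderID     bool
}

// needsImmediateRemediation sketches the short-circuiting check: a Failed machine
// with no nodeRef and no providerID presumably never started its node, so once
// failedNodeStartupTimeout has elapsed since creation it is remediated right away
// instead of waiting for the usual node-startup handling.
func needsImmediateRemediation(m machine, failedNodeStartupTimeout time.Duration, now time.Time) bool {
	if m.Phase != "Failed" || m.HasNodeRef || m.HasProviderID {
		return false // machine isn't failed, or its node got past startup; the timeout does not apply
	}
	return now.Sub(m.CreationTimestamp) > failedNodeStartupTimeout
}

func main() {
	m := machine{Phase: "Failed", CreationTimestamp: time.Now().Add(-10 * time.Minute)}
	fmt.Println(needsImmediateRemediation(m, 5*time.Minute, time.Now())) // true: remediate immediately
}
```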
Andrew pointed something out: we only apply that timeout when there is no nodeRef or providerId; isn't that implicitly the same as "during startup"? 🤔 @JoelSpeed
Ack, yeah, let's make sure that's clear in the proposal, because I haven't reviewed the implementation in a while and it wasn't clear to me that this only affects startup, hence the comment. OK with it staying as is.
Not as far as I'm aware.
Co-authored-by: Andrew Beekhof <[email protected]>
i re-read this again today; it's looking mostly good to me, but i found one small error in the text.
also, what are we doing about the sections marked "TBD"?
My folks didn't, but this idea came from the cloud team, so maybe one of your people did?
I would expect it to be supported when shipped, so nothing is needed for the TBD sections.
Co-authored-by: Michael McCune <[email protected]>
Originally we wanted to remove them, but that caused the build to fail, so we decided to keep them as is.
thanks for updating the text @mshitrit, and for answering the TBD question. i'm good with this.
/lgtm
/approve
Would like to make sure we have an approval from the dragonfly team on this as well. We should also pursue pushing the same design upstream (kubernetes-sigs/cluster-api#3106).
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: elmiko, JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for the approval and the pointer to the upstream issue; will have a look. /hold cancel
Signed-off-by: mshitrit [email protected]