
short-circuiting-backoff #673

Merged

Conversation

@mshitrit (Contributor) commented Mar 1, 2021

Signed-off-by: mshitrit [email protected]

@mshitrit (Contributor, Author) commented Mar 1, 2021

/wip

@elmiko (Contributor) left a comment

added a couple comments, and i have a couple questions.

i see you've added the note about the failed state being paired with the missing noderef and providerid, but i wonder if this feature should be broader in scope? for example, a cluster admin might want to check all mhc remediations, and want to have a blanket timeout for any machine that is going to be remediated. i'm curious if this is something you considered?

the name of this is "short circuiting backoff", but i didn't see mention of the short circuit mechanisms. i imagine there will be cases where the machines held in the failed state will cause the mhc to go above its max unhealthy limit, is this the backoff being addressed? assuming so, i think it should be mentioned more specifically.

@mshitrit (Contributor, Author) commented Mar 8, 2021

> i see you've added the note about the failed state being paired with the missing noderef and providerid, but i wonder if this feature should be broader in scope? for example, a cluster admin might want to check all mhc remediations, and want to have a blanket timeout for any machine that is going to be remediated. i'm curious if this is something you considered?

Hmm, I could be wrong, but I think what you are referring to is NodeStartUpTime, which already exists.

> the name of this is "short circuiting backoff", but i didn't see mention of the short circuit mechanisms. i imagine there will be cases where the machines held in the failed state will cause the mhc to go above its max unhealthy limit, is this the backoff being addressed? assuming so, i think it should be mentioned more specifically.

Good point - Thanks !
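
(For context, the short-circuit mechanism discussed in this thread can be sketched as follows. This is an illustrative Go sketch with made-up names, not the actual machine-api-operator code: machines held in the Failed phase keep counting as unhealthy, so they can push the MachineHealthCheck past its maxUnhealthy limit and block all further remediation.)

```go
package main

import "fmt"

// target is an illustrative stand-in for a MachineHealthCheck remediation target.
type target struct {
	phase     string // machine phase, e.g. "Running" or "Failed"
	unhealthy bool   // node failed its health-check conditions
}

// needsRemediation reflects the point under discussion: machines in the Failed
// phase always count as unhealthy, regardless of node conditions.
func (t target) needsRemediation() bool {
	return t.unhealthy || t.phase == "Failed"
}

// shortCircuited reports whether the MHC would stop remediating because the
// number of unhealthy targets exceeds maxUnhealthy.
func shortCircuited(targets []target, maxUnhealthy int) bool {
	count := 0
	for _, t := range targets {
		if t.needsRemediation() {
			count++
		}
	}
	return count > maxUnhealthy
}

func main() {
	// Two machines stuck in Failed plus one unhealthy node trip maxUnhealthy=2,
	// so remediation is blocked until something removes the Failed machines.
	targets := []target{{phase: "Failed"}, {phase: "Failed"}, {phase: "Running", unhealthy: true}}
	fmt.Println(shortCircuited(targets, 2)) // true
}
```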

@elmiko (Contributor) commented Mar 8, 2021

looking good to me, thanks for the update. @mshitrit is this still wip or are you ready for labels?

@mshitrit (Contributor, Author) commented Mar 9, 2021

Hi @elmiko
I'm ready for labels, bring it on 😄

@elmiko (Contributor) left a comment

hey @mshitrit , this generally looks good to me. i added some suggestions that i think help to clarify and a few grammar nits.

Signed-off-by: mshitrit <[email protected]>
@elmiko (Contributor) left a comment

this is looking good to me. we might remove some of the unused sections, but i'm ok as is.
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 15, 2021
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elmiko
To complete the pull request process, please assign miciah after the PR has been reviewed.
You can assign the PR to them by writing /assign @miciah in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JoelSpeed (Contributor) left a comment

Please take a look at the comments I've added. There are some problems I think need exploring further before we can merge, around defaulting/disabling the new field and also what happens to existing users when they upgrade: will this change the behaviour for them, and is that desirable?

Comment on lines +76 to +77
If no value for `FailedNodeStartupTimeout` is defined for the MachineHealthCheck CR, the existing remediation flow
is preserved.
Contributor:

In the implementation you've set a default; that is incompatible with this statement, since you won't be able to remove the value. Having no default would actually be preferable, as that would preserve existing behaviour for users who upgrade.

Contributor (Author):

You are right.
Once we decide on the best way to proceed regarding default/non-default, I'll make the proper adjustments.

Contributor:

What did we decide on this one?

Member:

The default was removed in the implementation, so I guess that was the decision
openshift/machine-api-operator@e3d7784

Contributor (Author):

Indeed.
Here is a link to the correspondence.
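
(For context, the no-default decision reached above can be sketched roughly as follows; the helper name is an illustrative assumption, not the actual controller code. With no default, an unset `failedNodeStartupTimeout` decodes to a zero duration, which keeps the new behaviour switched off and preserves the existing remediation flow for users who upgrade.)

```go
package mhc

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// failedNodeStartupTimeoutExpired is an illustrative helper, not the real
// controller code. With no default, an unset field decodes to a zero Duration,
// which disables the new behaviour and preserves the existing remediation flow
// for MachineHealthChecks that existed before the upgrade.
func failedNodeStartupTimeoutExpired(timeout metav1.Duration, machineCreated metav1.Time) bool {
	if timeout.Duration == 0 {
		return false // field not set: existing behaviour, never time out
	}
	return time.Since(machineCreated.Time) > timeout.Duration
}
```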


### Risks and Mitigations

No known risks.
Contributor:

Currently, you're setting a default in the implementation, and this would affect existing users; it may be worth discussing the pros/cons of a default here so we know whether to have one or not.

Contributor (Author):

That's a good point, here is my take on it:
no default (pro) - naive users aren't surprised by a new behavior.
default (pro) - naive users do benefit from the new behavior.
I guess the real question here is whether this new behavior benefits all users or not.
Let me know what you think.

Contributor:

I think for now let's keep no default and maintain the existing behaviour, but can we get these pros/cons fleshed out in the risks/mitigations section within the doc?

@openshift-ci-robot

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 25, 2021
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 23, 2021
@JoelSpeed (Contributor) left a comment

No real objections to this; I would like to see some extra details fleshed out, and I've highlighted where these are in the document.

Did we ever start a conversation upstream about this? If so, it would be good to link that into this document too.


@mshitrit (Contributor, Author)

Hi @JoelSpeed,
Looks like removing the unused sections is causing the ci/prow/markdownlint Job to fail.
Do you want me to revert those changes, or keep them ?

enhancements/machine-api/short-circuiting-backoff.md missing "### Graduation Criteria"
enhancements/machine-api/short-circuiting-backoff.md missing "#### Dev Preview -> Tech Preview"
enhancements/machine-api/short-circuiting-backoff.md missing "#### Tech Preview -> GA"
enhancements/machine-api/short-circuiting-backoff.md missing "#### Removing a deprecated feature"
enhancements/machine-api/short-circuiting-backoff.md missing "### Upgrade / Downgrade Strategy"

@JoelSpeed (Contributor)

> Do you want me to revert those changes, or keep them ?

We will have to revert, as the markdownlint job is required. I'll pass this on as feedback to the archs though; it seems weird to enforce these titles for all enhancements.

@mshitrit mshitrit force-pushed the short_circuit_backoff branch 3 times, most recently from ba99f6c to 16d55ac Compare June 29, 2021 11:56
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 29, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Aug 28, 2021
@openshift-ci bot commented Aug 28, 2021

@openshift-bot: Closed this PR.

In response to this:

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting /reopen.
> Mark the issue as fresh by commenting /remove-lifecycle rotten.
> Exclude this issue from closing again by commenting /lifecycle frozen.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@slintes (Member) commented Oct 11, 2021

We still want this

/reopen
/remove-lifecycle rotten
/test all

@openshift-ci openshift-ci bot reopened this Oct 11, 2021
@openshift-ci bot commented Oct 11, 2021

@slintes: Reopened this PR.

In response to this:

> We still want this
>
> /reopen
> /remove-lifecycle rotten
> /test all

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 11, 2021
@slintes (Member) commented Oct 11, 2021

Hi @JoelSpeed and @elmiko, we want to revive this task. I understand there was already a lgtm on this, and only the linter prevented it from being merged. The linter is green today; is anything else left before merging? Thanks in advance 🙂

@JoelSpeed (Contributor) left a comment

Did we propose this feature upstream at all?


...

// +optional
FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"`
Contributor:

Coming back to this, is Startup really involved here? Won't this FailedNode timeout apply to all failed nodes? Should this just be a FailedNodeTimeout?

Member:

Good question.
From reading this enhancement I'd say the same.
From reading the implementation I'd say Startup makes sense, because the timeout is applied to the machine's creation timestamp. Not sure if that implementation is correct though. Maybe the timeout should be applied to the time when the machine reached the failed phase? (Not sure if that information is available...)
@beekhof @mshitrit WDYT?

Contributor (Author):

Per the implementation, failedNodeStartupTimeout kicks in for machines whose nodes presumably failed to start, so basically I agree with Marc that the name makes sense, assuming the implementation is correct.
Here is the relevant implementation code.

Member:

Andrew pointed to something: we only apply that timeout when there is no nodeRef or providerId, isn't that implicitly the same as "during startup"? 🤔 @JoelSpeed

Contributor:

Ack, yeah, let's make sure that's clear in the proposal, because I haven't reviewed the implementation in a while and it wasn't clear to me that this only affects startup, hence the comment. OK with it staying as is.
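
(To make the startup-only scope concrete, here is a rough sketch of the gate described in this thread, using simplified illustrative types rather than the real Machine API objects: the timeout applies only to Failed machines that never obtained a nodeRef or providerID, measured from the machine's creation timestamp.)

```go
package mhc

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// failedMachine is a simplified, illustrative stand-in for the Machine object;
// only the fields relevant to this thread are modelled.
type failedMachine struct {
	Phase             string      // machine phase, e.g. "Failed"
	HasNodeRef        bool        // true once status.nodeRef has been set
	HasProviderID     bool        // true once spec.providerID has been set
	CreationTimestamp metav1.Time // when the Machine object was created
}

// startupTimedOut sketches the gate described in the thread: the timeout applies
// only to Failed machines that never obtained a nodeRef or providerID (i.e. never
// finished starting up), measured from the machine's creation timestamp.
func startupTimedOut(m failedMachine, timeout metav1.Duration) bool {
	if m.Phase != "Failed" || m.HasNodeRef || m.HasProviderID {
		return false
	}
	if timeout.Duration == 0 {
		return false // no timeout configured: keep existing behaviour
	}
	return time.Since(m.CreationTimestamp.Time) > timeout.Duration
}
```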

@mshitrit (Contributor, Author)

> Did we propose this feature upstream at all?

Not as far as I'm aware.

@elmiko (Contributor) left a comment

re-read this again today, it's looking mostly good to me but i found one small error in the text.

also, what are we doing about the sections marked "TBD" ?

@beekhof (Contributor) commented Oct 19, 2021

> Did we propose this feature upstream at all?

My folks didn't, but this idea came from the cloud team so maybe one of your people did?

@beekhof (Contributor) commented Oct 19, 2021

> also, what are we doing about the sections marked "TBD" ?

I would expect it to be supported when shipped, so nothing is needed for the TBD sections.

@mshitrit (Contributor, Author) commented Oct 19, 2021

> also, what are we doing about the sections marked "TBD" ?

Originally we wanted to remove them, but that caused the build to fail, so we decided to keep them as is.
I think @JoelSpeed may have queried further on this issue.

@elmiko (Contributor) left a comment

thanks for updating the text @mshitrit , and answering the TBD question. i'm good with this

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2021
@JoelSpeed (Contributor)

/approve
/hold

Would like to make sure we have an approval from the dragonfly team on this as well. We should also pursue pushing the same design upstream (kubernetes-sigs/cluster-api#3106)

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 20, 2021
@openshift-ci bot commented Oct 20, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 20, 2021
@slintes (Member) commented Oct 20, 2021

Thanks for the approval and the pointer to the upstream issue, will have a look.
Team dragonfly is fine with this :)

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 20, 2021
@openshift-merge-robot openshift-merge-robot merged commit 38c8422 into openshift:master Oct 20, 2021
Labels: approved, lgtm
8 participants