short-circuiting-backoff #673

Merged

enhancements/machine-api/short-circuiting-backoff.md

---
title: short-circuiting backoff
authors:
  - "@mshitrit"
reviewers:
  - "@beekhof"
  - "@n1r1"
  - "@slintes"
approvers:
  - "@JoelSpeed"
  - "@michaelgugino"
  - "@enxebre"
creation-date: 2021-03-01
last-updated: 2021-03-01
status: implementable
see-also:
  - https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
---

# Support backoff when short-circuiting

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

By using `MachineHealthChecks` a cluster admin can configure automatic remediation of unhealthy machines and nodes.
The machine health check controller's remediation strategy is to delete the machine and let the cloud provider
create a new one. This is not the best remediation strategy in all environments.

Any Machine that enters the `Failed` state is remediated immediately by the MHC, without waiting.
When this occurs, if the error that caused the failure is persistent (for example a spot price that is too low, or a configuration error), the replacement Machines will also be `Failed`.
As replacement Machines start and fail, the MHC causes a hot loop of Machines being deleted and recreated.
This hot looping makes it difficult for users to find out why their Machines are failing.
Another side effect of machines constantly failing is the risk of exceeding the maximum allowed percentage of unhealthy machines, which triggers the "short-circuit" mechanism and prevents all further remediation.

With this enhancement we propose a better mechanism.
If a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, it will not be remediated until a certain time period has passed, thus allowing manual intervention in order to break the hot loop.

## Motivation

- Prevent the remediation hot loop, in order to allow a manual fix and avoid unnecessary resource usage.

### Goals

- Create the opportunity for users to enact custom remediations for Machines that enter the `Failed` state.

### Non-Goals

- This enhancement does not seek to create a pluggable remediation system in the MHC.

## Proposal

We propose modifying the MachineHealthCheck CRD to support a failed node startup timeout. This timeout defines how long the MachineHealthCheck waits before remediating a machine that entered the `Failed` state without a NodeRef or a ProviderID.

### User Stories

#### Story 1

As an admin of a hardware-based cluster, I would like remediation of failed machines to be delayed, so that I have a time frame in which to manually analyze and fix them before they are automatically re-provisioned.

### Implementation Details/Notes/Constraints

If no value for `FailedNodeStartupTimeout` is defined for the MachineHealthCheck CR, the existing remediation flow
is preserved.
> **Comment on lines +76 to +77**
>
> **Contributor:** In the implementation you've set a default, which is incompatible with this statement: you won't be able to remove the value. If you want to have no default, that would actually be preferable, as it would preserve existing behaviour for users who upgrade.
>
> **Contributor Author:** You are right. Once we decide on the best way to proceed regarding a default / no default, I'll make the proper adjustments.
>
> **Contributor:** What did we decide on this one?
>
> **Member:** The default was removed in the implementation, so I guess that was the decision: openshift/machine-api-operator@e3d7784
>
> **Contributor Author:** Indeed. Here is a link to the correspondence.

If a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, its remediation will be requeued by `FailedNodeStartupTimeout`.
If the machine's state has not changed once that time has passed, remediation will be performed.
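
The sketch below is illustrative only and is not the controller implementation; it assumes, per the review discussion later in this document, that the timeout is measured from the machine's creation timestamp, and the helper name `remediationDelay` is hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// remediationDelay is an illustrative helper (not the actual MHC code).
// Given a machine in the Failed phase, it returns how long remediation
// should be requeued, assuming the timeout is measured from the machine's
// creation timestamp. A zero return value means "remediate now".
func remediationDelay(created time.Time, hasNodeRef, hasProviderID bool, timeout time.Duration, now time.Time) time.Duration {
	// The delay only applies to machines that never started:
	// no NodeRef and no ProviderID.
	if hasNodeRef || hasProviderID {
		return 0
	}
	// A zero timeout means the field was left unset, so the existing
	// behaviour (immediate remediation) is preserved.
	if timeout == 0 {
		return 0
	}
	remaining := created.Add(timeout).Sub(now)
	if remaining < 0 {
		return 0
	}
	return remaining
}

func main() {
	created := time.Now().Add(-1 * time.Hour)
	delay := remediationDelay(created, false, false, 48*time.Hour, time.Now())
	fmt.Printf("requeue remediation after %s\n", delay) // roughly 47h
}
```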


#### MHC struct enhancement

```go
type MachineHealthCheckSpec struct {
	...

	// +optional
	FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"`
}
```

> **Comment on `FailedNodeStartupTimeout`**
>
> **Contributor:** Coming back to this, is Startup really involved here? Won't this FailedNode timeout apply to all failed nodes? Should this just be a FailedNodeTimeout?
>
> **Member:** Good question. From reading this enhancement I'd say the same. From reading the implementation I'd say Startup makes sense, because the timeout is applied to the machine's creation timestamp. Not sure if that implementation is correct though. Maybe the timeout should be applied to the time when the machine reached the failed phase? (Not sure if that information is available...) @beekhof @mshitrit WDYT?
>
> **Contributor Author:** Per the implementation, failedNodeStartupTimeout kicks in for machines whose nodes presumably failed to start, so I basically agree with Marc that the name does make sense, assuming the implementation is correct. Here is the relevant implementation code.
>
> **Member:** Andrew pointed to something: we only apply that timeout when there is no nodeRef or providerId; isn't that implicitly the same as "during startup"? 🤔 @JoelSpeed
>
> **Contributor:** Ack, yeah, let's make sure that's clear in the proposal because I haven't reviewed the implementation in a while and it wasn't clear to me that this only affects startup, hence the comment. OK with it staying as is.

#### Example CRs

MachineHealthCheck:
```yaml
kind: MachineHealthCheck
apiVersion: machine.openshift.io/v1beta1
metadata:
  name: REMEDIATION_GROUP
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
spec:
  selector:
    matchLabels:
      ...
  failedNodeStartupTimeout: 48h
```
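
For additional context, a fuller illustrative CR is sketched below. Apart from `failedNodeStartupTimeout`, the fields shown (`unhealthyConditions`, `maxUnhealthy`) come from the existing MachineHealthCheck API; the name, namespace, label and values are placeholders:

```yaml
kind: MachineHealthCheck
apiVersion: machine.openshift.io/v1beta1
metadata:
  name: example-workers-mhc
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  maxUnhealthy: 40%
  failedNodeStartupTimeout: 48h
```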

### Risks and Mitigations
If `FailedNodeStartupTimeout` is undefined, the default behaviour is preserved (i.e. remediation is not postponed).
The pro is that users who are unaware of the feature are not surprised by a new behaviour; the con is that those users also do not benefit from it.

## Design Details

### Open Questions

See deprecation and upgrade.

### Test Plan

The existing remediation tests will be reviewed / adapted / extended as needed.
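
As a sketch of what such a test might cover, the table-driven example below exercises the hypothetical `remediationDelay` helper from the earlier sketch; the real tests would instead drive the MHC reconciler against fake Machine objects:

```go
package main

import (
	"testing"
	"time"
)

// TestRemediationDelay checks the requeue decision for a Failed machine
// under the hypothetical remediationDelay helper defined earlier.
func TestRemediationDelay(t *testing.T) {
	now := time.Date(2021, 3, 1, 12, 0, 0, 0, time.UTC)
	cases := []struct {
		name          string
		age           time.Duration // how long ago the machine was created
		hasNodeRef    bool
		timeout       time.Duration
		wantImmediate bool
	}{
		{"timeout unset: existing behaviour preserved", 1 * time.Hour, false, 0, true},
		{"has NodeRef: delay does not apply", 1 * time.Hour, true, 48 * time.Hour, true},
		{"within timeout: remediation requeued", 1 * time.Hour, false, 48 * time.Hour, false},
		{"timeout elapsed: remediate now", 49 * time.Hour, false, 48 * time.Hour, true},
	}
	for _, c := range cases {
		delay := remediationDelay(now.Add(-c.age), c.hasNodeRef, false, c.timeout, now)
		if (delay == 0) != c.wantImmediate {
			t.Errorf("%s: got delay %v", c.name, delay)
		}
	}
}
```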

### Graduation Criteria

TBD

#### Dev Preview -> Tech Preview

TBD

#### Tech Preview -> GA

TBD

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Implementation History

- [x] 03/01/2021: Opened enhancement PR

## Drawbacks

No known drawbacks.

## Alternatives

If a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, do not perform remediation on that machine at all.

This alternative is simpler, since it does not require `FailedNodeStartupTimeout`; however, it does not offer an option to retain the system's previous behaviour and relies entirely on a manual fix of the failed machines.