---
title: short-circuiting backoff

authors:
- @mshitrit

reviewers:
- @beekhof
- @n1r1
- @slintes

approvers:
- @JoelSpeed
- @michaelgugino
- @enxebre

creation-date: 2021-03-01

last-updated: 2021-03-01

status: implementable

see-also:
  - https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
---

# Support backoff when short-circuiting

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

By using `MachineHealthChecks` a cluster admin can configure automatic remediation of unhealthy machines and nodes.
The machine health check controller's remediation strategy is to delete the machine and let the cloud provider
create a new one. This is not the best remediation strategy in all environments.

Any Machine that enters the `Failed` state is remediated immediately, without waiting, by the MHC.
When this occurs, if the error which caused the failure is persistent (spot price too low, configuration error), replacement Machines will also be `Failed`.
As replacement Machines start and fail, the MHC causes a hot loop of Machines being deleted and recreated.
This hot loop makes it difficult for users to find out why their Machines are failing.

With this enhancement we propose a better mechanism.
In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, it will only be remediated after a certain time period has passed, thus allowing manual intervention to break the hot loop.

## Motivation

- Preventing a remediation hot loop, in order to allow a manual fix and prevent unnecessary resource usage.

### Goals

- Create the ability to define customized remediation for Machines that enter the `Failed` state.

### Non-Goals

TBD

## Proposal

We propose modifying the MachineHealthCheck CRD to support a new `failedNodeStartupTimeout` field. This is the time period that controls when a machine that enters the `Failed` state is remediated.

### User Stories

#### Story 1

As an admin of a hardware-based cluster, I would like remediation of failed machines to be delayed before they are automatically re-provisioned, so that I have a time frame in which to manually analyze and fix them.

### Implementation Details/Notes/Constraints

If no value for `failedNodeStartupTimeout` is defined in the MachineHealthCheck CR, the existing remediation flow
is preserved.

In case a machine enters the `Failed` state and does not have a NodeRef or a ProviderID, its remediation will be requeued for `failedNodeStartupTimeout`.
After that time has passed, if the machine remains in the same state, remediation will be performed.
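
The gating logic might look roughly like the following. This is a minimal sketch, not the actual controller change: the helper name `remediationDelay`, the use of `Status.LastUpdated` to approximate when the Machine entered `Failed`, and the import path are assumptions for illustration.

```go
package mhc

import (
	"time"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
)

// remediationDelay returns how long remediation of a Failed machine should be
// postponed. Zero means "remediate now"; a positive value means "requeue and
// re-check after this duration".
func remediationDelay(machine *machinev1.Machine, mhc *machinev1.MachineHealthCheck) time.Duration {
	// Only machines that failed before ever joining the cluster are delayed,
	// i.e. those with neither a NodeRef nor a ProviderID.
	if machine.Status.NodeRef != nil {
		return 0
	}
	if machine.Spec.ProviderID != nil && *machine.Spec.ProviderID != "" {
		return 0
	}
	timeout := mhc.Spec.FailedNodeStartupTimeout.Duration
	if timeout == 0 {
		// Field not set: preserve the existing immediate remediation.
		return 0
	}
	// Approximate when the machine entered Failed by its last status update;
	// the real implementation may track this differently.
	failedAt := time.Now()
	if machine.Status.LastUpdated != nil {
		failedAt = machine.Status.LastUpdated.Time
	}
	if remaining := timeout - time.Since(failedAt); remaining > 0 {
		return remaining
	}
	return 0
}
```

When the returned duration is positive, the reconciler would requeue the MachineHealthCheck (e.g. via `reconcile.Result{RequeueAfter: delay}`) instead of remediating immediately.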


#### MHC struct enhancement

```go
type MachineHealthCheckSpec struct {
	...

	// FailedNodeStartupTimeout is the time to wait before remediating a
	// Machine that entered the Failed state without a NodeRef or ProviderID.
	// When not set, such Machines are remediated immediately (the existing flow).
	// +optional
	FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"`
}
```

#### Example CRs

MachineHealthCheck:
```yaml
kind: MachineHealthCheck
apiVersion: machine.openshift.io/v1beta1
metadata:
  name: REMEDIATION_GROUP
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
spec:
  selector:
    matchLabels:
      ...
  failedNodeStartupTimeout: 48h
```
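
Note that `failedNodeStartupTimeout` is parsed as a standard Kubernetes duration (Go `time.Duration` syntax), so values such as `30m`, `1h`, or `48h` are all valid.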

### Risks and Mitigations

No known risks.

## Design Details

### Open Questions

See deprecation and upgrade

### Test Plan

The existing remediation tests will be reviewed / adapted / extended as needed.

### Graduation Criteria

TBD

#### Examples

TBD

##### Dev Preview -> Tech Preview

TBD

##### Tech Preview -> GA

TBD

##### Removing a deprecated feature


### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Implementation History

- [x] 03/01/2021: Opened enhancement PR

## Drawbacks

No known drawbacks.

## Alternatives

- Instead of delaying the remediation of failed machines, canceling it entirely.

## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new
subproject, repos requested, github details, and/or testing infrastructure.

Listing these here allows the community to get the process for these resources
started right away.