add worker only controller with operator for machine health check #1950

Merged
merged 2 commits into from
Apr 5, 2022
7 changes: 7 additions & 0 deletions README.md
@@ -116,6 +116,13 @@ questions or comments.

* machineset: Ensures that a minimum of two worker replicas are met.

* machinehealthcheck: Ensures the MachineHealthCheck resource is running as configured so that at most one worker node at a time is automatically
  reconciled when it has been not ready for at least 5 minutes.
  * The CR is applied only when both `aro.machinehealthcheck.enabled` and `aro.machinehealthcheck.managed` are set to `"true"`.
  * When `aro.machinehealthcheck.enabled` is `"true"` and `aro.machinehealthcheck.managed` is `"false"`, the CR is removed from the cluster.
  * If `aro.machinehealthcheck.enabled` is `"false"`, no action is taken to modify the CR.
  * More information about the MHC CR can be found [in the OpenShift documentation on MHC](https://docs.openshift.com/container-platform/4.9/machine_management/deploying-machine-health-checks.html)

* monitoring: Ensures that the OpenShift monitoring configuration in the `openshift-monitoring` namespace is consistent and immutable.

* node: Force deletes pods when a node fails to drain for 1 hour. It should clear up any pods that refuse to be evicted on a drain due to violating a pod disruption budget.
5 changes: 5 additions & 0 deletions cmd/aro/operator.go
@@ -31,6 +31,7 @@ import (
"github.com/Azure/ARO-RP/pkg/operator/controllers/genevalogging"
"github.com/Azure/ARO-RP/pkg/operator/controllers/imageconfig"
"github.com/Azure/ARO-RP/pkg/operator/controllers/machine"
"github.com/Azure/ARO-RP/pkg/operator/controllers/machinehealthcheck"
"github.com/Azure/ARO-RP/pkg/operator/controllers/machineset"
"github.com/Azure/ARO-RP/pkg/operator/controllers/monitoring"
"github.com/Azure/ARO-RP/pkg/operator/controllers/muo"
@@ -216,6 +217,10 @@ func operator(ctx context.Context, log *logrus.Entry) error {
mgr)).SetupWithManager(mgr); err != nil {
return fmt.Errorf("unable to create controller %s: %v", autosizednodes.ControllerName, err)
}
if err = (machinehealthcheck.NewReconciler(
arocli, dh)).SetupWithManager(mgr); err != nil {
return fmt.Errorf("unable to create controller %s: %v", machinehealthcheck.ControllerName, err)
}
}

if err = (checker.NewReconciler(
2 changes: 2 additions & 0 deletions pkg/api/defaults.go
@@ -68,6 +68,8 @@ func DefaultOperatorFlags() OperatorFlags {
"aro.imageconfig.enabled": flagTrue,
"aro.machine.enabled": flagTrue,
"aro.machineset.enabled": flagTrue,
"aro.machinehealthcheck.enabled": flagFalse,
"aro.machinehealthcheck.managed": flagFalse,
"aro.monitoring.enabled": flagTrue,
"aro.nodedrainer.enabled": flagTrue,
"aro.pullsecret.enabled": flagTrue,
244 changes: 244 additions & 0 deletions pkg/operator/controllers/machinehealthcheck/bindata.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

27 changes: 27 additions & 0 deletions pkg/operator/controllers/machinehealthcheck/doc.go
@@ -0,0 +1,27 @@
package machinehealthcheck

// Copyright (c) Microsoft Corporation.
// Licensed under the Apache License 2.0.

/*

The controller in this package aims to ensure that MachineHealthCheck objects
exist and are correctly configured to automatically mitigate non-ready worker nodes.

There are two flags which control the operations performed by the controller:

aro.machinehealthcheck.enabled:
- When set to false, the controller is a no-op and performs no further action.
- When set to true, the controller goes on to check the managed flag.

aro.machinehealthcheck.managed:
- When set to false, the controller attempts to remove the aro-machinehealthcheck CR from the cluster.
  This effectively disables the MHC we deploy and prevents the automatic reconciliation of nodes.
- When set to true, the controller deploys or overwrites the aro-machinehealthcheck CR in the cluster.
  This enables the cluster to self-heal when at most 1 worker node has been not ready for at least 5 minutes.

The aro-machinehealthcheck CR is configured so that it takes no action if 2 or more worker nodes are not ready.
More information about how the MHC works can be found here:
https://docs.openshift.com/container-platform/4.9/machine_management/deploying-machine-health-checks.html

*/
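
The two-flag behaviour described in the doc comment above can be summarised as a small decision function. This is a sketch under stated assumptions, not the controller's actual code: the flags are modelled as a plain `map[string]string`, the `action` type and `decide` function are hypothetical, and any value other than `"true"` (including a missing flag) is assumed to mean false.

```go
package main

import "fmt"

// Flag names taken from the package documentation.
const (
	flagEnabled = "aro.machinehealthcheck.enabled"
	flagManaged = "aro.machinehealthcheck.managed"
)

// action is a hypothetical enumeration of what the controller would do.
type action string

const (
	actionNoop   action = "noop"   // leave the CR untouched
	actionRemove action = "remove" // delete the aro-machinehealthcheck CR
	actionApply  action = "apply"  // deploy or overwrite the CR
)

// decide maps the two operator flags to the documented behaviour.
// Assumption: anything other than the string "true" counts as false.
func decide(flags map[string]string) action {
	if flags[flagEnabled] != "true" {
		return actionNoop
	}
	if flags[flagManaged] != "true" {
		return actionRemove
	}
	return actionApply
}

func main() {
	cases := []map[string]string{
		{flagEnabled: "false", flagManaged: "false"},
		{flagEnabled: "true", flagManaged: "false"},
		{flagEnabled: "true", flagManaged: "true"},
	}
	for _, flags := range cases {
		fmt.Printf("enabled=%s managed=%s -> %s\n",
			flags[flagEnabled], flags[flagManaged], decide(flags))
	}
}
```

Checking `enabled` first means the `managed` flag has no effect when the controller is disabled, so with the defaults added in this change (both flags false) an existing CR would be left in place rather than removed.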
7 changes: 7 additions & 0 deletions pkg/operator/controllers/machinehealthcheck/generate.go
@@ -0,0 +1,7 @@
package machinehealthcheck

// Copyright (c) Microsoft Corporation.
// Licensed under the Apache License 2.0.

//go:generate go run ../../../../vendor/github.com/go-bindata/go-bindata/go-bindata -nometadata -pkg $GOPACKAGE -prefix staticresources staticresources/...
//go:generate gofmt -s -l -w bindata.go