Skip to content

Commit

Permalink
when mhc is managed create an alert for frequent remediation (#2123)
Browse files Browse the repository at this point in the history
  • Loading branch information
s-amann authored May 31, 2022
1 parent 8753830 commit 7bba674
Show file tree
Hide file tree
Showing 18 changed files with 4,075 additions and 18 deletions.
6 changes: 1 addition & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -116,11 +116,7 @@ questions or comments.

* machineset: Ensures that a minimum of two worker replicas are met.

* machinehealthcheck: Ensures the MachineHealthCheck resource is running as configured so that at most one worker node at a time is automatically
reconciled when not ready for at least 5 minutes.
* The CR will only be applied when both `aro.machinehealthcheck.managed` and `aro.machinehealthcheck.enabled` are set to `"true"`.
* When `aro.machinehealthcheck.enabled` is `"false"` and `aro.machinehealthcheck.managed` is `"false"` the CR will be removed from the cluster.
* If `aro.machinehealthcheck.enabled` is `"false"` no actions will be taken to modify the CR.
* machinehealthcheck: Ensures the MachineHealthCheck resource is running as configured. See [machinehealthcheck/doc.go](pkg/operator/controllers/machinehealthcheck/doc.go)
* More information around the MHC CR can be found [in openshift documentation of MHC](https://docs.openshift.com/container-platform/4.9/machine_management/deploying-machine-health-checks.html)

* monitoring: Ensures that the OpenShift monitoring configuration in the `openshift-monitoring` namespace is consistent and immutable.
Expand Down
1 change: 1 addition & 0 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ require (
github.com/openshift/machine-config-operator v3.11.0+incompatible
github.com/pires/go-proxyproto v0.6.2
github.com/pkg/errors v0.9.1
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.48.1
github.com/prometheus/client_golang v1.12.1
github.com/prometheus/common v0.33.0
github.com/sirupsen/logrus v1.8.1
Expand Down
1 change: 1 addition & 0 deletions go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -2348,6 +2348,7 @@ github.com/proglottis/gpgme v0.1.1/go.mod h1:fPbW/EZ0LvwQtH8Hy7eixhp1eF3G39dtx7G
github.com/prometheus-community/prom-label-proxy v0.2.0/go.mod h1:XdjyZg7LCbCC5FADHtpgNp6kQ0W9beXVGfmcvndMj5Y=
github.com/prometheus-operator/prometheus-operator v0.48.1/go.mod h1:lXJz0R74XkhfLwsw5mIeLPJZOVIQp6kQyMyMjOU+MWY=
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.44.1/go.mod h1:3WYi4xqXxGGXWDdQIITnLNmuDzO5n6wYva9spVhR4fg=
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.48.1 h1:OGC7+ktZ6h8xI99VB6i8iuiXecdhUmwto9vbGzoVMac=
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.48.1/go.mod h1:3WYi4xqXxGGXWDdQIITnLNmuDzO5n6wYva9spVhR4fg=
github.com/prometheus-operator/prometheus-operator/pkg/client v0.48.1/go.mod h1:k4BrWlVQQsvBiTcDnKEMgyh/euRxyxgrHdur/ZX/sdA=
github.com/prometheus/alertmanager v0.20.0/go.mod h1:9g2i48FAyZW6BtbsnvHtMHQXl2aVtrORKwKVCQ+nbrg=
Expand Down
2 changes: 2 additions & 0 deletions hack/validate-imports/validate-imports.go
Original file line number Diff line number Diff line change
Expand Up @@ -157,6 +157,8 @@ func acceptableNames(path string) []string {
return []string{"coreosarch"}
case "github.com/openshift/installer/pkg/rhcos":
return []string{"rhcospkg"}
case "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1":
return []string{"monitoringv1"}
case "golang.org/x/crypto/ssh":
return []string{"", "cryptossh"}
case "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1":
Expand Down
27 changes: 25 additions & 2 deletions pkg/operator/controllers/machinehealthcheck/bindata.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 7 additions & 5 deletions pkg/operator/controllers/machinehealthcheck/doc.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,9 @@ package machinehealthcheck

/*
The controller in this package aims to ensure that MachineHealthCheck objects
exist and are correctly configured to automatically mitigate non-ready worker nodes.
The controller in this package aims to ensure the ARO MachineHealthCheck CR and MHC Remediation Alert
exist and are correctly configured to automatically mitigate non-ready worker nodes and create an in-cluster alert
when remediation is occuring frequently.
There are two flags which control the operations performed by the controller:
Expand All @@ -15,10 +16,11 @@ aro.machinehealthcheck.enabled:
- When set to true, the controller continues on to check the managed flag
aro.machinehealthcheck.managed
- When set to false, the controller will attempt to remove the aro-machinehealthcheck CR from the cluster.
- When set to false, the controller will attempt to remove the aro-machinehealthcheck CR and the MHC Remediation alert from the cluster.
This should effectively disable the MHC we deploy and prevent the automatic reconciliation of nodes.
- When set to true, the controller will deploy/overwrite the aro-machinehealthcheck CR to the cluster.
This enables the cluster to self heal when at most 1 worker node goes not ready for at least 5 minutes.
- When set to true, the controller will deploy/overwrite the aro-machinehealthcheck CR and the MHC Remediation alert to the cluster.
This enables the cluster to self heal when at most 1 worker node goes not ready for at least 5 minutes and alert when remediation
occurs 2 or more times within an hour.
The aro-machinehealth check is configured in a way that if 2 worker nodes go not ready it will not take any action.
More information about how the MHC works can be found here:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -52,10 +52,16 @@ func (r *Reconciler) Reconcile(ctx context.Context, request ctrl.Request) (ctrl.
if err != nil {
return reconcile.Result{RequeueAfter: time.Hour}, err
}

err = r.dh.EnsureDeleted(ctx, "PrometheusRule", "openshift-machine-api", "mhc-remediation-alert")
if err != nil {
return reconcile.Result{RequeueAfter: time.Hour}, err
}
return reconcile.Result{}, nil
}

var resources []kruntime.Object

// this loop prevents us from hard coding resource strings
// and ensures all static resources are accounted for.
for _, assetName := range AssetNames() {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ func TestReconciler(t *testing.T) {
wantErr: "",
},
{
name: "Managed Feature Flag is false: ensure mhc is deleted",
name: "Managed Feature Flag is false: ensure mhc and its alert are deleted",
arocli: arofake.NewSimpleClientset(&arov1alpha1.Cluster{
ObjectMeta: metav1.ObjectMeta{
Name: arov1alpha1.SingletonClusterName,
Expand All @@ -74,6 +74,7 @@ func TestReconciler(t *testing.T) {
}),
mocks: func(mdh *mock_dynamichelper.MockInterface) {
mdh.EXPECT().EnsureDeleted(gomock.Any(), "MachineHealthCheck", "openshift-machine-api", "aro-machinehealthcheck").Times(1)
mdh.EXPECT().EnsureDeleted(gomock.Any(), "PrometheusRule", "openshift-machine-api", "mhc-remediation-alert").Times(1)
},
wantErr: "",
},
Expand All @@ -96,6 +97,26 @@ func TestReconciler(t *testing.T) {
wantErr: "Could not delete mhc",
wantRequeueAfter: time.Hour,
},
{
name: "Managed Feature Flag is false: mhc deletes but mhc alert fails to delete, an error is returned",
arocli: arofake.NewSimpleClientset(&arov1alpha1.Cluster{
ObjectMeta: metav1.ObjectMeta{
Name: arov1alpha1.SingletonClusterName,
},
Spec: arov1alpha1.ClusterSpec{
OperatorFlags: arov1alpha1.OperatorFlags{
enabled: strconv.FormatBool(true),
managed: strconv.FormatBool(false),
},
},
}),
mocks: func(mdh *mock_dynamichelper.MockInterface) {
mdh.EXPECT().EnsureDeleted(gomock.Any(), "MachineHealthCheck", "openshift-machine-api", "aro-machinehealthcheck").Times(1)
mdh.EXPECT().EnsureDeleted(gomock.Any(), "PrometheusRule", "openshift-machine-api", "mhc-remediation-alert").Return(errors.New("Could not delete mhc alert"))
},
wantErr: "Could not delete mhc alert",
wantRequeueAfter: time.Hour,
},
{
name: "Managed Feature Flag is true: dynamic helper ensures resources",
arocli: arofake.NewSimpleClientset(&arov1alpha1.Cluster{
Expand Down Expand Up @@ -142,10 +163,7 @@ func TestReconciler(t *testing.T) {
tt.mocks(mdh)

ctx := context.Background()
r := &Reconciler{
arocli: tt.arocli,
dh: mdh,
}
r := NewReconciler(tt.arocli, mdh)
request := ctrl.Request{}
request.Name = "cluster"

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: mhc-remediation-alert
namespace: openshift-machine-api
labels:
prometheus: k8s
role: alert-rules
spec:
groups:
- name: sre-mhc-remediation-alert
rules:
- alert: SREMachineHealthCheckRemediationRateHigh
expr: increase(mapi_machinehealthcheck_remediation_success_total [60m]) > 1
Annotations:
Message: worker nodes have been remediated 2 or more times in the last hour this may indicate an unstable workload running on the cluster
labels:
severity: warning
3 changes: 2 additions & 1 deletion pkg/util/scheme/scheme.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ import (
machinev1beta1 "github.com/openshift/api/machine/v1beta1"
securityv1 "github.com/openshift/api/security/v1"
mcv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
rbacv1 "k8s.io/api/rbac/v1"
Expand Down Expand Up @@ -39,7 +40,7 @@ func init() {
utilruntime.Must(rbacv1defaults.RegisterDefaults(scheme.Scheme))
utilruntime.Must(machinev1beta1.AddToScheme(scheme.Scheme))
utilruntime.Must(consolev1.AddToScheme(scheme.Scheme))

utilruntime.Must(monitoringv1.AddToScheme(scheme.Scheme))
// AzureMachineProviderSpec is not registered by default
scheme.Scheme.AddKnownTypes(machinev1beta1.GroupVersion, &machinev1beta1.AzureMachineProviderSpec{})
}
Loading

0 comments on commit 7bba674

Please sign in to comment.