Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MHC support #224

Merged
merged 31 commits into from
Dec 20, 2023
Merged

MHC support #224

merged 31 commits into from
Dec 20, 2023

Conversation

slintes
Copy link
Member

@slintes slintes commented Jul 4, 2023

Handle failed machines and unhealthy nodes based on MachineHealthChecks.
This basically is a copy of the controller and unit tests from https://github.com/openshift/machine-api-operator/tree/master/pkg/controller/machinehealthcheck.
Where applicable, existing NHC code was reused.

Limitations, can potentailly be handled in a follow up:

  • no internal remediation like in MHC, we can consider to use Machine Deletion Remediation instead
  • no annotation based baremetal remediation, we can consider to use Metal3 Remediation instead
  • unit tests don't use envtest but old style fake clients

Done:

  • Unit tests
  • Fix NHC e2e 🙈
  • handle events
  • handle metrics
  • add e2e tests: for this we first need a way to disable the existing MHC controller, probably with a OCP feature gate

ECOPROJECT-773

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 4, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 4, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jul 4, 2023
@slintes
Copy link
Member Author

slintes commented Jul 4, 2023

/test ?

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 4, 2023

@slintes: The following commands are available to trigger required jobs:

  • /test 4.11-ci-bundle-my-bundle
  • /test 4.11-images
  • /test 4.11-openshift-e2e
  • /test 4.11-test
  • /test 4.12-ci-bundle-my-bundle
  • /test 4.12-images
  • /test 4.12-openshift-e2e
  • /test 4.12-test
  • /test 4.13-ci-bundle-my-bundle
  • /test 4.13-images
  • /test 4.13-openshift-e2e
  • /test 4.13-test
  • /test 4.14-ci-bundle-my-bundle
  • /test 4.14-images
  • /test 4.14-openshift-e2e
  • /test 4.14-test

Use /test all to run all jobs.

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@slintes
Copy link
Member Author

slintes commented Jul 4, 2023

/test 4.13-openshift-e2e

1 similar comment
@slintes
Copy link
Member Author

slintes commented Jul 5, 2023

/test 4.13-openshift-e2e

return result, err
}

if !r.ReconcileMHC {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be at the top of method ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, the MHCChecker above isn't new

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is, in case the feature is disabled wouldn't it make sense to return immediately and skip other tests ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MHCChecker needs to always run 🤷🏼‍♂️

controllers/shared.go Outdated Show resolved Hide resolved
@slintes
Copy link
Member Author

slintes commented Jul 18, 2023

checking old NHC e2e tests

/test 4.11-openshift-e2e

@slintes
Copy link
Member Author

slintes commented Jul 18, 2023

/test 4.11-openshift-e2e

@slintes
Copy link
Member Author

slintes commented Jul 18, 2023

/test 4.11-openshift-e2e

@slintes
Copy link
Member Author

slintes commented Jul 18, 2023

/test 4.12-openshift-e2e

@slintes
Copy link
Member Author

slintes commented Jul 18, 2023

/test 4.12-openshift-e2e

@slintes
Copy link
Member Author

slintes commented Jul 19, 2023

/test 4.12-openshift-e2e

@slintes
Copy link
Member Author

slintes commented Nov 21, 2023

After a lot of "fun" with rebase, fixing merge conflicts, and getting leases to work with MHC, unit tests are green again (locally at least). Let's see how old NHC e2e looks like now.

Not ready for review! Still several TODOs in code. And no e2e for MHC yet.

/test 4.13-openshift-e2e

Signed-off-by: Marc Sluiter <[email protected]>
MHC needs to also remediate failed Machines which
don't have a node yet. Because of that, the CR name
always is the Machine name. And we can't create
a node lease in this case.

Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Works on OCP 4.14+ only

Signed-off-by: Marc Sluiter <[email protected]>
Otherwise we might miss MHCChecker update in case
getting MHC fails

Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
@slintes
Copy link
Member Author

slintes commented Dec 20, 2023

rebased on #272

@clobrano
Copy link
Contributor

/lgtm

Signed-off-by: Marc Sluiter <[email protected]>
@openshift-ci openshift-ci bot removed the lgtm label Dec 20, 2023
@slintes
Copy link
Member Author

slintes commented Dec 20, 2023

/test 4.12-openshift-e2e

Copy link
Contributor

openshift-ci bot commented Dec 20, 2023

@slintes: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.11-openshift-e2e cae72df link true /test 4.11-openshift-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@slintes
Copy link
Member Author

slintes commented Dec 20, 2023

/test 4.13-openshift-e2e
/test 4.15-openshift-e2e

@slintes
Copy link
Member Author

slintes commented Dec 20, 2023

/hold cancel
/override e2e-k8s

Copy link
Contributor

openshift-ci bot commented Dec 20, 2023

@slintes: Overrode contexts on behalf of slintes: e2e-k8s

In response to this:

/hold cancel
/override e2e-k8s

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@clobrano
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Dec 20, 2023
@openshift-merge-bot openshift-merge-bot bot merged commit 5f73549 into medik8s:main Dec 20, 2023
19 of 20 checks passed
@slintes slintes deleted the mhc_support branch March 4, 2024 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants