
OCPBUGS-13558: fix: ensure LVMVolumeGroupNodeStatus is removed by dedicated cleanup controller in case of multi-node #372

Merged
1 commit merged on Aug 14, 2023

Conversation

jakobmoellerdev
Contributor

@jakobmoellerdev jakobmoellerdev commented Jul 31, 2023

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Unfortunately, since the status update of LVMCluster currently assumes that the LVMVolumeGroupNodeStatus is present, the only ways I found to fix this issue are to either:

  1. Delete the LVMVolumeGroupNodeStatus in the status update of LVMCluster when the node no longer exists and the status check finds an orphaned status without a node. This is potentially expensive, since LVMCluster then has to compare all nodes and check whether each LVMVolumeGroupNodeStatus still has a matching node. The removal is also delayed relative to the Node deletion.
  2. Delete the LVMVolumeGroupNodeStatus in a new reconcile loop that listens for node changes and uses a finalizer to protect node deletion until the LVMVolumeGroupNodeStatus has been cleaned up. This is the "clean" solution for handling deletion. Note that this reconcile loop only needs to run outside of SNO; in SNO it can be disabled. The danger is that if the finalizer is not removed properly, node removal can fail. The upside is that, in theory, we can later use this hook to remove the created VGs from the node if necessary.

Currently, the PR uses a new controller.

The main integration test was changed so that it now removes the node (which triggers the removal of the Status object) instead of deleting the Status object directly, which automatically covers this use case.
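
For illustration, here is a minimal, hedged sketch of the controller-plus-finalizer approach described in option 2 above, written against controller-runtime. The package path, type names, finalizer string, and the assumption that the status object is named after its Node are illustrative, not necessarily what the merged PR does:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	lvmv1alpha1 "github.com/openshift/lvm-operator/api/v1alpha1" // assumed import path
)

// Hypothetical finalizer name used only to illustrate the pattern.
const nodeCleanupFinalizer = "lvm.topolvm.io/node-removal-hook"

// NodeRemovalReconciler watches Nodes and removes the LVMVolumeGroupNodeStatus
// belonging to a Node before that Node is allowed to disappear.
type NodeRemovalReconciler struct {
	client.Client
	Namespace string // namespace in which node-status objects live (assumption)
}

func (r *NodeRemovalReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	node := &corev1.Node{}
	if err := r.Get(ctx, req.NamespacedName, node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if node.DeletionTimestamp.IsZero() {
		// Node is alive: make sure our finalizer protects its deletion.
		if controllerutil.AddFinalizer(node, nodeCleanupFinalizer) {
			return ctrl.Result{}, r.Update(ctx, node)
		}
		return ctrl.Result{}, nil
	}

	// Node is being deleted: clean up the matching status object, then release the Node.
	nodeStatus := &lvmv1alpha1.LVMVolumeGroupNodeStatus{}
	key := types.NamespacedName{Name: node.Name, Namespace: r.Namespace}
	if err := r.Get(ctx, key, nodeStatus); err == nil {
		if err := r.Delete(ctx, nodeStatus); err != nil {
			return ctrl.Result{}, err
		}
	} else if client.IgnoreNotFound(err) != nil {
		return ctrl.Result{}, err
	}

	if controllerutil.RemoveFinalizer(node, nodeCleanupFinalizer) {
		return ctrl.Result{}, r.Update(ctx, node)
	}
	return ctrl.Result{}, nil
}

func (r *NodeRemovalReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&corev1.Node{}).Complete(r)
}
```

In a multi-node deployment this reconciler would be registered with the manager; in SNO it could simply not be set up, which matches the "disabled in SNO" idea above.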

@openshift-ci
Contributor

openshift-ci bot commented Jul 31, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 31, 2023
@jakobmoellerdev jakobmoellerdev changed the title fix: ensure LVMVolumeGroupNodeStatus is removed by dedicated cleanup controller in case of multi-node OCPBUGS-13558: fix: ensure LVMVolumeGroupNodeStatus is removed by dedicated cleanup controller in case of multi-node Jul 31, 2023
@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 31, 2023
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 31, 2023
@openshift-ci-robot

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-13558, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Currently, the PR uses a new controller, but I assume I could also introduce behavior based on the NodeStatus instead. This would drop the NodeStatus when the Node no longer exists. The concept would be similar, but we would not need another controller.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 31, 2023
@openshift-ci-robot

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-13558, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.15.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Unfortunately, since the status update of LVMCluster currently assumes that the LVMVolumeGroupNodeStatus is present, the only ways I found to fix this issue are to either:

  1. Delete the LVMVolumeGroupNodeStatus in the status update of LVMCluster when the node no longer exists and the status check finds an orphaned status without a node. This is potentially expensive, since LVMCluster then has to compare all nodes and check whether each LVMVolumeGroupNodeStatus still has a matching node. The removal is also delayed relative to the Node deletion.
  2. Delete the LVMVolumeGroupNodeStatus in a new reconcile loop that listens for node changes and uses a finalizer to protect node deletion until the LVMVolumeGroupNodeStatus has been cleaned up. This is the "clean" solution for handling deletion. Note that this reconcile loop only needs to run outside of SNO; in SNO it can be disabled. The danger is that if the finalizer is not removed properly, node removal can fail.

Currently, the PR uses a new controller.

During my investigation, I had some trouble understanding the status update because the "vgCount" was aggregated for the entire LVMCluster. I rewrote it so that the check is done per deviceClass (which can later be extended) and introduced some debug logs that can be activated by passing the --zap-devel flag to the controller. The functionality and logic of the check are still the same.
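
As background, --zap-devel is one of the standard flags registered by controller-runtime's zap logging integration via zap.Options.BindFlags; it switches the logger into development mode (debug level, console encoder). A generic sketch of how this is typically wired into a manager's main.go (a common pattern, not the operator's exact code) looks like:

```go
package main

import (
	"flag"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// BindFlags registers --zap-devel, --zap-log-level, --zap-encoder, etc.
	opts := zap.Options{}
	opts.BindFlags(flag.CommandLine)
	flag.Parse()

	ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
	setupLog := ctrl.Log.WithName("setup")

	// V(1) messages are only emitted at debug level, e.g. with --zap-devel.
	setupLog.V(1).Info("debug logging active")

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		setupLog.Error(err, "unable to create manager")
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		setupLog.Error(err, "manager exited with error")
		os.Exit(1)
	}
}
```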

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-13558, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.15.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Unfortunately, since the status update of LVMCluster currently assumes that the LVMVolumeGroupNodeStatus is present, the only ways I found to fix this issue are to either:

  1. Delete the LVMVolumeGroupNodeStatus in the status update of LVMCluster when the node no longer exists and the status check finds an orphaned status without a node. This is potentially expensive, since LVMCluster then has to compare all nodes and check whether each LVMVolumeGroupNodeStatus still has a matching node. The removal is also delayed relative to the Node deletion.
  2. Delete the LVMVolumeGroupNodeStatus in a new reconcile loop that listens for node changes and uses a finalizer to protect node deletion until the LVMVolumeGroupNodeStatus has been cleaned up. This is the "clean" solution for handling deletion. Note that this reconcile loop only needs to run outside of SNO; in SNO it can be disabled. The danger is that if the finalizer is not removed properly, node removal can fail.

Currently, the PR uses a new controller.

During my investigation, I had some trouble understanding the status update because the "vgCount" was aggregated for the entire LVMCluster. I rewrote it so that the check is done per deviceClass (which can later be extended) and introduced some debug logs that can be activated by passing the --zap-devel flag to the controller. The functionality and logic of the check are still the same.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-13558, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.15.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Unfortunately, since the status update of LVMCluster currently assumes that the LVMVolumeGroupNodeStatus is present, the only ways I found to fix this issue are to either:

  1. Delete the LVMVolumeGroupNodeStatus in the status update of LVMCluster when the node no longer exists and the status check finds an orphaned status without a node. This is potentially expensive, since LVMCluster then has to compare all nodes and check whether each LVMVolumeGroupNodeStatus still has a matching node. The removal is also delayed relative to the Node deletion.
  2. Delete the LVMVolumeGroupNodeStatus in a new reconcile loop that listens for node changes and uses a finalizer to protect node deletion until the LVMVolumeGroupNodeStatus has been cleaned up. This is the "clean" solution for handling deletion. Note that this reconcile loop only needs to run outside of SNO; in SNO it can be disabled. The danger is that if the finalizer is not removed properly, node removal can fail. The upside is that, in theory, we can later remove the created VGs from the node with that hook.

Currently, the PR uses a new controller.

During my investigation, I had some trouble understanding the status update because the "vgCount" was aggregated for the entire LVMCluster. I rewrote it so that the check is done per deviceClass (which can later be extended) and introduced some debug logs that can be activated by passing the --zap-devel flag to the controller. The functionality and logic of the check are still the same.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-13558, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.15.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Unfortunately, since the status update of LVMCluster currently assumes that the LVMVolumeGroupNodeStatus is present, the only ways I found to fix this issue are to either:

  1. Delete the LVMVolumeGroupNodeStatus in the status update of LVMCluster when the node no longer exists and the status check finds an orphaned status without a node. This is potentially expensive, since LVMCluster then has to compare all nodes and check whether each LVMVolumeGroupNodeStatus still has a matching node. The removal is also delayed relative to the Node deletion.
  2. Delete the LVMVolumeGroupNodeStatus in a new reconcile loop that listens for node changes and uses a finalizer to protect node deletion until the LVMVolumeGroupNodeStatus has been cleaned up. This is the "clean" solution for handling deletion. Note that this reconcile loop only needs to run outside of SNO; in SNO it can be disabled. The danger is that if the finalizer is not removed properly, node removal can fail. The upside is that, in theory, we can later use this hook to remove the created VGs from the node if necessary.

Currently, the PR uses a new controller.

During my investigation, I had some trouble understanding the status update because the "vgCount" was aggregated for the entire LVMCluster. I rewrote it so that the check is done per deviceClass (which can later be extended) and introduced some debug logs that can be activated by passing the --zap-devel flag to the controller. The functionality and logic of the check are still the same.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jakobmoellerdev
Contributor Author

/test all

@openshift-ci-robot

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-13558, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.15.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Unfortunately, since the status update of LVMCluster currently assumes that the LVMVolumeGroupNodeStatus is present, the only ways I found to fix this issue are to either:

  1. Delete the LVMVolumeGroupNodeStatus in the status update of LVMCluster when the node no longer exists and the status check finds an orphaned status without a node. This is potentially expensive, since LVMCluster then has to compare all nodes and check whether each LVMVolumeGroupNodeStatus still has a matching node. The removal is also delayed relative to the Node deletion.
  2. Delete the LVMVolumeGroupNodeStatus in a new reconcile loop that listens for node changes and uses a finalizer to protect node deletion until the LVMVolumeGroupNodeStatus has been cleaned up. This is the "clean" solution for handling deletion. Note that this reconcile loop only needs to run outside of SNO; in SNO it can be disabled. The danger is that if the finalizer is not removed properly, node removal can fail. The upside is that, in theory, we can later use this hook to remove the created VGs from the node if necessary.

Currently, the PR uses a new controller.

During my investigation, I had some trouble understanding the status update because the "vgCount" was aggregated for the entire LVMCluster. I rewrote it so that the check is done per deviceClass (which can later be extended) and introduced some debug logs that can be activated by passing the --zap-devel flag to the controller. The functionality and logic of the check are still the same.

TODO: an integration test mimicking the nodes (if that's even possible with envtest).
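
Mimicking nodes is possible with envtest, since the API server it starts serves core resources such as Node. A hedged Ginkgo/Gomega sketch of such a test (k8sClient, testNamespace, and the lvmv1alpha1/apierrors imports are assumed to come from the usual suite setup; this is illustrative, not the PR's actual test) might look like:

```go
// Assumes a running envtest suite with the cleanup controller started, plus
// imports: context, time, corev1, metav1, types,
// apierrors "k8s.io/apimachinery/pkg/api/errors", and lvmv1alpha1.
It("removes the LVMVolumeGroupNodeStatus when its Node is deleted", func() {
	ctx := context.Background()

	// Simulate a worker node joining the cluster.
	node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "worker-1"}}
	Expect(k8sClient.Create(ctx, node)).To(Succeed())

	// Create the per-node status object (normally created on the node itself).
	nodeStatus := &lvmv1alpha1.LVMVolumeGroupNodeStatus{
		ObjectMeta: metav1.ObjectMeta{Name: node.Name, Namespace: testNamespace},
	}
	Expect(k8sClient.Create(ctx, nodeStatus)).To(Succeed())

	// Removing the node should trigger the cleanup controller ...
	Expect(k8sClient.Delete(ctx, node)).To(Succeed())

	// ... which should delete the now-orphaned status object.
	key := types.NamespacedName{Name: node.Name, Namespace: testNamespace}
	Eventually(func() bool {
		err := k8sClient.Get(ctx, key, &lvmv1alpha1.LVMVolumeGroupNodeStatus{})
		return apierrors.IsNotFound(err)
	}).WithTimeout(30 * time.Second).Should(BeTrue())
})
```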

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jakobmoellerdev jakobmoellerdev marked this pull request as ready for review August 1, 2023 15:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 1, 2023
@openshift-ci openshift-ci bot requested review from jerpeter1 and qJkee August 1, 2023 15:18
@openshift-ci-robot

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-13558, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.15.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Unfortunately, since the status update of LVMCluster currently assumes that the LVMVolumeGroupNodeStatus is present, the only ways I found to fix this issue are to either:

  1. Delete the LVMVolumeGroupNodeStatus in the status update of LVMCluster when the node no longer exists and the status check finds an orphaned status without a node. This is potentially expensive, since LVMCluster then has to compare all nodes and check whether each LVMVolumeGroupNodeStatus still has a matching node. The removal is also delayed relative to the Node deletion.
  2. Delete the LVMVolumeGroupNodeStatus in a new reconcile loop that listens for node changes and uses a finalizer to protect node deletion until the LVMVolumeGroupNodeStatus has been cleaned up. This is the "clean" solution for handling deletion. Note that this reconcile loop only needs to run outside of SNO; in SNO it can be disabled. The danger is that if the finalizer is not removed properly, node removal can fail. The upside is that, in theory, we can later use this hook to remove the created VGs from the node if necessary.

Currently, the PR uses a new controller.

During my investigation, I had some trouble understanding the status update because the "vgCount" was aggregated for the entire LVMCluster. I rewrote it so that the check is done per deviceClass (which can later be extended) and introduced some debug logs that can be activated by passing the --zap-devel flag to the controller. The functionality and logic of the check are still the same.

The main integration test was changed so that it now removes the node (which triggers the removal of the Status object) instead of deleting the Status object directly, which automatically covers this use case.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@codecov-commenter

codecov-commenter commented Aug 1, 2023

Codecov Report

Merging #372 (59b26bc) into main (a962b90) will increase coverage by 40.42%.
Report is 8 commits behind head on main.
The diff coverage is 64.93%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##             main     #372       +/-   ##
===========================================
+ Coverage   16.59%   57.01%   +40.42%     
===========================================
  Files          24       28        +4     
  Lines        2061     2138       +77     
===========================================
+ Hits          342     1219      +877     
+ Misses       1693      828      -865     
- Partials       26       91       +65     
Files Changed Coverage Δ
controllers/node_removal_controller.go 56.00% <56.00%> (ø)
controllers/node_removal_controller_watches.go 61.53% <61.53%> (ø)
pkg/cluster/leaderelection.go 66.66% <66.66%> (ø)
pkg/cluster/sno.go 72.72% <72.72%> (ø)
controllers/lvmcluster_controller_watches.go 91.42% <100.00%> (+91.42%) ⬆️

... and 10 files with indirect coverage changes

@jakobmoellerdev
Contributor Author

/test all

@jakobmoellerdev
Contributor Author

/hold as testing revealed that I may have introduced a deletion issue

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2023
@jakobmoellerdev
Contributor Author

jakobmoellerdev commented Aug 2, 2023

It seems like there is an edge case where the VG is sometimes not removed. Not sure why, but it is unrelated to this PR. When I can reproduce it, I will open a separate issue.

main.go — review comment (outdated, resolved)
@jakobmoellerdev
Contributor Author

/hold

@jakobmoellerdev
Contributor Author

/jira refresh

@openshift-ci-robot

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-13558, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.15.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 9, 2023
@openshift-ci-robot

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-13558, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.15.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Unfortunately, since the status update of LVMCluster currently assumes that the LVMVolumeGroupNodeStatus is present, the only ways I found to fix this issue are to either:

  1. Delete the LVMVolumeGroupNodeStatus in the status update of LVMCluster when the node no longer exists and the status check finds an orphaned status without a node. This is potentially expensive, since LVMCluster then has to compare all nodes and check whether each LVMVolumeGroupNodeStatus still has a matching node. The removal is also delayed relative to the Node deletion.
  2. Delete the LVMVolumeGroupNodeStatus in a new reconcile loop that listens for node changes and uses a finalizer to protect node deletion until the LVMVolumeGroupNodeStatus has been cleaned up. This is the "clean" solution for handling deletion. Note that this reconcile loop only needs to run outside of SNO; in SNO it can be disabled. The danger is that if the finalizer is not removed properly, node removal can fail. The upside is that, in theory, we can later use this hook to remove the created VGs from the node if necessary.

Currently, the PR uses a new controller.

The main integration test was changed so that it now removes the node (which triggers the removal of the Status object) instead of deleting the Status object directly, which automatically covers this use case.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@qJkee
Contributor

qJkee commented Aug 9, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2023
@jakobmoellerdev
Contributor Author

/cc @suleymanakbas91 to get another opinion on that controller approach.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 11, 2023
@jakobmoellerdev
Contributor Author

/test all

@jakobmoellerdev
Contributor Author

/test verify

@suleymanakbas91
Contributor

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jakobmoellerdev, suleymanakbas91

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 14, 2023
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2023
@suleymanakbas91
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2023

@jakobmoellerdev: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@suleymanakbas91 suleymanakbas91 merged commit bb74044 into openshift:main Aug 14, 2023
@openshift-ci-robot

@jakobmoellerdev: Jira Issue OCPBUGS-13558: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-13558 has been moved to the MODIFIED state.

In response to this:

When the operator is deployed in a multi-node environment, the LVMVolumeGroupNodeStatus is orphaned when a node is removed. This is unintended. We need logic that can successfully remove this NodeStatus whenever a Node is removed.

Unfortunately, since the status update of LVMCluster currently assumes that the LVMVolumeGroupNodeStatus is present, the only ways I found to fix this issue are to either:

  1. Delete the LVMVolumeGroupNodeStatus in the status update of LVMCluster when the node no longer exists and the status check finds an orphaned status without a node. This is potentially expensive, since LVMCluster then has to compare all nodes and check whether each LVMVolumeGroupNodeStatus still has a matching node. The removal is also delayed relative to the Node deletion.
  2. Delete the LVMVolumeGroupNodeStatus in a new reconcile loop that listens for node changes and uses a finalizer to protect node deletion until the LVMVolumeGroupNodeStatus has been cleaned up. This is the "clean" solution for handling deletion. Note that this reconcile loop only needs to run outside of SNO; in SNO it can be disabled. The danger is that if the finalizer is not removed properly, node removal can fail. The upside is that, in theory, we can later use this hook to remove the created VGs from the node if necessary.

Currently, the PR uses a new controller.

The main integration test was changed so that it now removes the node (which triggers the removal of the Status object) instead of deleting the Status object directly, which automatically covers this use case.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
