Metal3DataTemplate deletion stuck, finalizer never cleared #1994

Closed
tmmorin opened this issue Sep 20, 2024 · 6 comments · Fixed by #2000
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue is ready to be actively worked on.

Comments

@tmmorin
Contributor

tmmorin commented Sep 20, 2024

What steps did you take and what happened

When deleting all resources for a capm3 cluster a Metal3DataTemplate resource remains pending deletion, its metal3datatemplate.infrastructure.cluster.x-k8s.io finalizer never gets cleared.

What did you expect to happen:

No resource should have remained.

**TL;DR: me speculating on the cause**

Disclaimer: my familiarity with capm3 is low, so please take what follows as a humble hypothesis ;)

The Metal3DataTemplate controller's reconcileDelete does not seem able to take into account that a Metal3DataClaim is no longer present. More specifically, DataTemplateManager.UpdateDatas blindly assumes that all Metal3DataClaims referenced by Metal3Data resources actually exist.

Context

In the context of the Sylva project CI pipelines, since we upgraded to capm3 1.8.1, we have been regularly observing a case where a Metal3DataTemplate remains stuck pending deletion because its metal3datatemplate.infrastructure.cluster.x-k8s.io finalizer is never cleared (for reference: our related gitlab issue).

The context is a workload cluster deletion scenario: we delete the CAPI Cluster object first; once the Cluster resource is deleted, we delete all the CAPI resources that our system defines (KubeadmControlPlane, MachineDeployments, all the XXX template resources, etc.), and then the namespace.

(I realized while digging into this issue that we were deleting the Metal3DataTemplate too early, possibly before the Metal3Machines using it were deleted, so I suspect this may be contributing to triggering the issue. We're aware that we probably need to wait and delete resources in a more orderly way...)

I dug into one specific occurrence and am sharing my observations here.

Observations

The observations shared are based on the dumps of resources and logs that we take after a CI job failure.

All the data can be found here: https://gitlab.com/sylva-projects/sylva-core/-/jobs/7852801513. The interesting bits are under management-cluster-dump, which has a clusterctl-describe.txt, Nodes.summary.txt, and Metal3* files with resource dumps (the relevant ones are in the kubeadm-capm3-virt namespace, and the capm3-system directory has the capm3 controller logs). I'm directly attaching the Metal3DataTemplate resource dump and the capm3 controller logs to this issue.

At the point where the dump is taken we see the following:

  • no capm3 resource remains, except our Metal3DataTemplate
  • extract of the state of this resource:
  kind: Metal3DataTemplate
  metadata:
    annotations:
      meta.helm.sh/release-name: cluster
      meta.helm.sh/release-namespace: kubeadm-capm3-virt
    creationTimestamp: "2024-09-18T08:24:11Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-09-18T09:47:51Z"    # <<<<<<<<<<<<<<<<<<<<<<
    finalizers:
    - metal3datatemplate.infrastructure.cluster.x-k8s.io      # <<<<<<<<<
   ...
    name: wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f  # <<<<<<<<<<<<<
  
    managedFields:
    ...
    - apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:indexes:
            .: {}
            f:wc-1458191268-kubeadm-capm3-virt-control-plane-fn667: {}
          f:lastUpdated: {}
      manager: cluster-api-provider-metal3-manager
      operation: Update
      subresource: status
      time: "2024-09-18T09:51:51Z"

      # (this managedFields entry is the most recent one)
  status:
    indexes:
      wc-1458191268-kubeadm-capm3-virt-control-plane-fn667: 0  # <<<<<<<<<<<<<<<<<<<
    lastUpdated: "2024-09-18T09:51:51Z"

The status.indexes above is interesting: it is related to why the finalizer can't be removed. It designates a Metal3DataClaim resource (wc-1458191268-kubeadm-capm3-virt-control-plane-fn667) that was deleted quite early.

  • the namespace was deleted at 09:53:32 (see below; the Metal3Data seems to have been deleted at that point): deletionTimestamp: "2024-09-18T09:53:32Z"

Grepping the logs of the capm3 controllers:

I0918 09:47:50.684503       1 metal3datatemplate_manager.go:383] "msg"="Deleting Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-8x9vs" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.713494       1 metal3datatemplate_manager.go:421] "msg"="Deleted Metal3Data" "Metal3Data"="wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-2" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.713522       1 metal3datatemplate_manager.go:435] "msg"="Deleted Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-8x9vs" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.758422       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.785723       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.951225       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.960604       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.960713       1 metal3datatemplate_manager.go:383] "msg"="Deleting Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-fn667" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.980052       1 metal3datatemplate_manager.go:421] "msg"="Deleted Metal3Data" "Metal3Data"="wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.980079       1 metal3datatemplate_manager.go:435] "msg"="Deleted Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-fn667" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.998850       1 metal3datatemplate_manager.go:383] "msg"="Deleting Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-7tjgn" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.023508       1 metal3datatemplate_manager.go:421] "msg"="Deleted Metal3Data" "Metal3Data"="wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-1" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.023536       1 metal3datatemplate_manager.go:435] "msg"="Deleted Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-7tjgn" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.075542       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.083012       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.092095       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.360889       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:51:51.222374       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:51:51.234904       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}

(no log after 09:51:51 for our Metal3DataTemplate)

The above shows, among other things, that the Metal3DataClaims were deleted quite early (around 09:47:51).

  • on related resources (same name prefix):

I0918 09:53:40.305945       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-provisioning-pool"
I0918 09:53:40.327924       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-provisioning-pool"
I0918 09:53:40.343495       1 metal3data_manager.go:1468] "msg"="Address claim not found" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}
I0918 09:53:40.343533       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-primary-pool"
I0918 09:53:40.364283       1 metal3data_manager.go:1468] "msg"="Address claim not found" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}
I0918 09:53:40.391932       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-provisioning-pool"
I0918 09:53:40.399177       1 metal3data_manager.go:1468] "msg"="Address claim not found" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}
I0918 09:53:40.399424       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-primary-pool"
I0918 09:53:40.406140       1 metal3data_manager.go:1468] "msg"="Address claim not found" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}
I0918 09:53:40.412083       1 metal3data_controller.go:79] "msg"="failed to Patch Metal3Data" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}

This shows that:

  • until 09:53:40, the wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0 Metal3Data was still around and being reconciled
  • (the dump tells us that at some point later it was gone, presumably when the namespace was deleted)

Tentative analysis

I've been looking at the code to understand why the Metal3DataTemplate reconciliation could have been unable to remove the finalizer, and unable to remove the non-existent Metal3DataClaim from status.indexes.

I'm trying to infer what the code would have seen between 09:47:51 (when the deletionTimestamp was set on the Metal3DataTemplate and when the Metal3DataClaims were deleted) and the deletion of the Metal3Data (which happened after 09:53:40).

So, in this window:

  • the Metal3DataTemplate is present
  • the Metal3DataClaim wc-1458191268-kubeadm-capm3-virt-control-plane-fn667 is absent (all the relevant Metal3DataClaims are absent)
  • the Metal3Data is present

The code in DataTemplateManager.UpdateDatas seems to (see the simplified sketch after this list):

  • rely on getIndexes to build a map of index integers to Metal3DataClaim names, based on listing Metal3Data resources and looking at their Spec.Claim.Name
    • in our scenario this would result in finding the Metal3Data, and returning map[0: wc-1458191268-kubeadm-capm3-virt-control-plane-fn667]
  • then UpdateDatas lists Metal3DataClaims and, for those it finds, calls indexes, err = m.updateData(ctx, &dataClaim, indexes) to update the indexes
    • in our scenario, since no Metal3DataClaim exists, it would never reach indexes, err = m.updateData(ctx, &dataClaim, indexes)
  • then it returns the length of indexes
    • in our scenario, this would be 1 (what getIndexes returned, unmodified since the claim loop is never entered)
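
To make that concrete, here is a tiny, standalone Go model of the flow just described. This is not the actual capm3 code; the types and names are illustrative stand-ins that only mimic the behaviour I infer from UpdateDatas and getIndexes.

```go
package main

import "fmt"

// metal3Data is a stand-in for the fields of Metal3Data that matter here.
type metal3Data struct {
	claimName string // stand-in for Spec.Claim.Name
	index     int    // stand-in for the index carried by the Metal3Data
}

// getIndexes mimics the described behaviour: it builds index -> claim name
// purely from the Metal3Data resources that still exist.
func getIndexes(datas []metal3Data) map[int]string {
	indexes := map[int]string{}
	for _, d := range datas {
		indexes[d.index] = d.claimName
	}
	return indexes
}

func main() {
	// The window observed in the logs: one Metal3Data still present...
	datas := []metal3Data{{claimName: "wc-1458191268-kubeadm-capm3-virt-control-plane-fn667", index: 0}}
	// ...but every Metal3DataClaim already deleted.
	var claims []string

	indexes := getIndexes(datas)
	for range claims {
		// updateData() would normally adjust the indexes here, but the loop
		// body is never entered because no Metal3DataClaim exists anymore.
	}

	// reconcileDelete would see a non-zero allocation count and keep the finalizer.
	fmt.Println("allocationsCount =", len(indexes)) // prints: allocationsCount = 1
}
```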

This is consistent with the allocationsCount seen by reconcileDelete being 1, with the finalizer hence not being removed, and with the status.indexes we see on the Metal3DataTemplate resource.

It seems that UpdateDatas should check that the Metal3DataClaim names returned by getIndexes actually exist, and remove from 'indexes' the ones that don't.
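
For illustration only, a minimal sketch of such a check, assuming a controller-runtime client and the capm3 v1beta1 API types; the helper name and its wiring into UpdateDatas are my assumptions, not the actual implementation:

```go
package example

import (
	"context"

	infrav1 "github.com/metal3-io/cluster-api-provider-metal3/api/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// pruneStaleIndexes (hypothetical helper) keeps only the index entries whose
// Metal3DataClaim can still be fetched from the API server.
func pruneStaleIndexes(ctx context.Context, c client.Client, namespace string, indexes map[int]string) (map[int]string, error) {
	live := map[int]string{}
	for idx, claimName := range indexes {
		claim := &infrav1.Metal3DataClaim{}
		err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: claimName}, claim)
		if apierrors.IsNotFound(err) {
			// The claim is gone: drop the stale entry so the allocation count
			// can eventually reach zero and the finalizer can be removed.
			continue
		}
		if err != nil {
			return nil, err
		}
		live[idx] = claimName
	}
	return live, nil
}
```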

I would speculate that #1478 may have made this latent issue occur more frequently. Indeed #1478, "Before cleaning up Metal3Data, check that the Metal3DataClaim is gone", merged in the 1.8.x development timeframe, could create the problematic condition: the Metal3DataClaim is gone, but the Metal3Data still points to it.

Worth noting

The logs show various issues occurring at different points. I haven't dug into those specifically, on the assumption that despite transient errors the processing should still be able to converge, and also because I suspect that the way Sylva tears down CAPI and CAPI provider resources may not be orderly enough and might be causing these issues.

Environment:

  • Cluster-api version: 1.7.6
  • Cluster-api-provider-metal3 version: 1.8.1
  • Environment (metal3-dev-env or other): Sylva project CI pipelines (more details below)
  • Kubernetes version: 1.28.12

/kind bug

@metal3-io-bot metal3-io-bot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels Sep 20, 2024
@tmmorin
Contributor Author

tmmorin commented Sep 20, 2024

@kashifest
Member

Thanks for reporting the issue, we will check and get back to you on this. At first glance it might be related to #1357, but I can't confirm yet.

@tmmorin
Contributor Author

tmmorin commented Sep 22, 2024

> Thanks for reporting the issue, we will check and get back to you on this. At first glance it might be related to #1357, but I can't confirm yet.

Thank you.

I also wondered about that, but we don't use templateReference in our manifests.

@tmmorin
Contributor Author

tmmorin commented Sep 22, 2024

My analysis in the description focused on the question "why does the Metal3DataTemplate reconcileDelete fail to remove the finalizer when some Metal3Data is present but the Metal3DataClaim is absent?".

The question that remains is "why isn't the Metal3DataTemplate finalizer removed when no Metal3Data and no Metal3DataClaims remain?". To this question, I think the answer may be simpler: reconcileDelete should probably requeue for finalizer-cleanup retries by returning ctrl.Result{Requeue: true} if it did not unset the finalizer.
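
A minimal sketch of what I mean, assuming a standard controller-runtime reconciler; the function and its parameter are illustrative, not the actual capm3 code:

```go
package example

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// reconcileDeleteSketch (hypothetical) shows the requeue idea: as long as the
// allocation count is non-zero the finalizer stays, but we explicitly ask to
// be reconciled again instead of waiting for a watch event that may never come.
func reconcileDeleteSketch(allocationsCount int) (ctrl.Result, error) {
	if allocationsCount > 0 {
		// No Metal3DataClaim remains to generate watch events, so without this
		// requeue the Metal3DataTemplate could stay stuck with its finalizer.
		return ctrl.Result{Requeue: true}, nil
	}
	// Safe to remove the metal3datatemplate.infrastructure.cluster.x-k8s.io
	// finalizer here; deletion can then complete.
	return ctrl.Result{}, nil
}
```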

@tmmorin
Contributor Author

tmmorin commented Sep 23, 2024

I have submitted #1997 for review to address the requeue aspect (described here).

It seems to me sufficient to allow the Metal3DataTemplate resource to ultimately be deleted; making reconcileDelete remove the finalizer when some Metal3Data is present but the Metal3DataClaim is absent could be a further improvement.

@Rozzii
Member

Rozzii commented Sep 25, 2024

/triage accepted
Thanks for pushing the fix also @tmmorin

@metal3-io-bot metal3-io-bot added triage/accepted Indicates an issue is ready to be actively worked on. and removed needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels Sep 25, 2024
pierrecregut added a commit to pierrecregut/cluster-api-provider-metal3 that referenced this issue Sep 26, 2024
DataTemplate deletion has a watch over DataClaims but counts Data.
We ensure that only Data for which a live claim exists are counted,
to avoid races during multiple DataClaim deletions that result in a
positive count of Data when all claims have been deleted.

Fixes: metal3-io#1994
Co-authored-by: Pierre Crégut <[email protected]>
Co-authored-by: Thomas Morin <[email protected]>
Signed-off-by: Pierre Crégut <[email protected]>
pierrecregut added a commit to pierrecregut/cluster-api-provider-metal3 that referenced this issue Sep 27, 2024
DataTemplate deletion has a watch over DataClaims, but deletion
can only be performed when Data are completely removed (finalizer
executed) because Data need the template.

We keep track of whether Data and DataClaims still exist.
* When DataClaims exist, we do not remove the finalizer and wait
  for reconcile events from the watch.
* If we have neither DataClaims nor Data, we can safely
  remove the finalizer. Deletion is complete.
* Otherwise we have to busy-wait on Data deletion, as there is
  no claim left to generate events but we still need to wait
  until the Data finalizers have been executed.

Fixes: metal3-io#1994
Co-authored-by: Pierre Crégut <[email protected]>
Co-authored-by: Thomas Morin <[email protected]>
Signed-off-by: Pierre Crégut <[email protected]>
metal3-io-bot pushed a commit to metal3-io-bot/cluster-api-provider-metal3 that referenced this issue Oct 3, 2024
Some Metal3Data and Metal3DataTemplates were not destroyed
in rare circumstances due to missing reconcile events caused
by two distinct problems.

DataTemplate deletion has a watch over DataClaims, but deletion
can only be performed when Data are completely removed (finalizer
executed) because the Data resource requires the template.

We keep track of whether Data and DataClaims still exist.
* When DataClaims exist, we do not remove the finalizer and wait
  for reconcile events from the watch.
* If we have neither DataClaims nor Data, we can safely
  remove the finalizer. Deletion is complete.
* Otherwise we have to busy-wait on Data deletion, as there is
  no claim left to generate events but we still need to wait
  until the Data finalizers have been executed.

Actively wait for DataClaim deletion: if the Metal3Data is deleted
before the Metal3DataClaim, we must actively wait for the claim
deletion, as no other reconciliation event will be triggered.

Fixes: metal3-io#1994
Co-authored-by: Pierre Crégut <[email protected]>
Co-authored-by: Thomas Morin <[email protected]>
Signed-off-by: Pierre Crégut <[email protected]>