Auto-Restoring deleted ServiceImports and EndPointSlices #696

Open
davidohana opened this issue Mar 2, 2022 · 12 comments

Comments

@davidohana
Contributor

davidohana commented Mar 2, 2022

Not sure whether to call this a bug or an enhancement.

What happened:

  • When deleting a service import or an endpoint slice locally, it stays deleted locally and the deletion is synced to the broker.
  • When deleting a service import or an endpoint slice from the broker, nothing happens.

What you expected to happen:

The agent should recreate the service import / endpoint slice after some time, since the "source of truth" is the export and it still exists. Currently this only happens after a lighthouse agent restart.

Additional notes:

For comparison, if I delete a k8s endpoint slice, it is recreated immediately.
I realize that these edge cases are only possible if someone with sufficient permissions deletes those objects manually.

Slack discussion link: https://kubernetes.slack.com/archives/C010RJV694M/p1645966829687039

@davidohana davidohana added the bug Something isn't working label Mar 2, 2022
@vthapar
Contributor

vthapar commented Mar 2, 2022

@davidohana Could you clarify what you mean by deleting locally? I expect you mean deleting locally on the source cluster where the export exists, but just wanted to be sure that is the case.

@davidohana
Contributor Author

Yes, that's what I mean - deleting on the origin cluster.

@dfarrell07 dfarrell07 added enhancement New feature or request and removed bug Something isn't working labels Jun 7, 2022
@dfarrell07
Member

There are likely other objects that would fail in similar ways if a user deletes them, because they are internal objects that we don't expect users to mess with. To handle this properly we would need a more systematic review of the code base, which would likely benefit from an EP and a broader discussion about what should change and how.

@stale

stale bot commented Oct 15, 2022

This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Oct 15, 2022
@dfarrell07
Member

@vthapar pointed out that finalizers may be a good, low-overhead way of doing this.
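
For concreteness, a minimal sketch of what that could look like, using controller-runtime's controllerutil helpers rather than Submariner's actual code (the finalizer name and function here are made up for illustration):

```go
// Hypothetical sketch of the finalizer suggestion; not Submariner's actual code.
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
	mcsv1a1 "sigs.k8s.io/mcs-api/pkg/apis/v1alpha1"
)

// protectionFinalizer is a made-up finalizer name used only in this sketch.
const protectionFinalizer = "lighthouse.submariner.io/protect"

// ensureFinalizer adds the protection finalizer so a manual delete only marks the
// ServiceImport for deletion instead of removing it outright, giving the agent a
// chance to react before the object actually disappears.
func ensureFinalizer(ctx context.Context, c client.Client, si *mcsv1a1.ServiceImport) error {
	if controllerutil.ContainsFinalizer(si, protectionFinalizer) {
		return nil
	}

	controllerutil.AddFinalizer(si, protectionFinalizer)

	return c.Update(ctx, si)
}
```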

@stale stale bot removed the wontfix This will not be worked on label Oct 18, 2022
@stale

stale bot commented Feb 18, 2023

This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Feb 18, 2023
@dfarrell07
Member

Discussing this again, it's important to note that monitoring resources across many clusters for deletion comes with overhead that doesn't exist for the parallel case of endpoint slices in a single K8s cluster. If there's a use case for this, we'd be happy to discuss it and prioritize the work. Again, this likely applies to other resources, and finalizers might enable this with reasonable overhead.

This might be kinda-related to submariner-io/enhancements#161, or at least a good time to think about finalizers for various resources.

@stale stale bot removed the wontfix This will not be worked on label Feb 21, 2023
@dfarrell07 dfarrell07 added the wontfix and technical-debt labels Feb 21, 2023
@stale stale bot removed the wontfix This will not be worked on label Feb 21, 2023
@dfarrell07 dfarrell07 added the wontfix and priority:low labels and removed the enhancement and wontfix labels Feb 21, 2023
@stale

stale bot commented Aug 13, 2023

This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Aug 13, 2023
@dfarrell07 dfarrell07 removed the wontfix This will not be worked on label Aug 22, 2023
@tpantelis
Contributor

While a finalizer would prevent out-of-band deletion, we wouldn't be able to update the resource since it's in the process of being deleted.

A simple solution would be to specify a resync period with the informers.
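
As a rough illustration of the resync idea (plain client-go, not the actual lighthouse agent wiring; the 2-minute period is an arbitrary example value):

```go
// Minimal sketch: a shared informer factory with a non-zero resync period, so
// cached objects are periodically re-delivered to the update handler and anything
// missing downstream can be re-checked and restored.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	client := kubernetes.NewForConfigOrDie(config)

	// The 2-minute resync period is an arbitrary example value.
	factory := informers.NewSharedInformerFactory(client, 2*time.Minute)

	informer := factory.Discovery().V1().EndpointSlices().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// UpdateFunc also fires on every resync, even when nothing changed;
		// that is the hook a controller could use to re-check derived state.
		UpdateFunc: func(oldObj, newObj interface{}) {
			fmt.Println("resync/update delivered")
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	// Block; in a real agent this would run until shutdown.
	select {}
}
```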

@tpantelis
Contributor

I tested finalizer behavior and, after deletion with a finalizer still present, the resource can still be updated. You just can't modify the DeletionTimestamp or add another finalizer. However, using a finalizer merely to prevent out-of-band deletion can be problematic if the resource isn't deleted by the controller that added the finalizer. So if Submariner isn't properly uninstalled, the finalizers would remain and block namespace deletion, requiring manual removal of the finalizers.

A safer solution is to recreate a resource on out-of-band deletion. We can do this fairly simply via the resource syncer: on deletion of a resource, check if the originating resource still exists and, if so, re-queue it so the deleted resource is re-created. If not, process the deleted resource normally. E.g., ServiceImports are created from ServiceExports, so on deletion of a ServiceImport, check if the originating ServiceExport still exists in the resource syncer cache.
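
A rough sketch of that delete-handling flow, written against plain client-go primitives rather than Submariner's resource syncer (the function, cache, and queue names are illustrative only):

```go
// Sketch of the "restore on out-of-band deletion" flow described above; names
// and wiring are illustrative, not Submariner's actual resource syncer API.
package sketch

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
	mcsv1a1 "sigs.k8s.io/mcs-api/pkg/apis/v1alpha1"
)

// onServiceImportDeleted handles deletion of a ServiceImport. If the originating
// ServiceExport is still present in the local cache, the key is re-queued so the
// ServiceImport gets re-created; otherwise the deletion is processed normally.
func onServiceImportDeleted(si *mcsv1a1.ServiceImport, exportCache cache.Indexer,
	queue workqueue.RateLimitingInterface, processDeletion func(*mcsv1a1.ServiceImport)) {
	key := si.Namespace + "/" + si.Name

	if _, exists, err := exportCache.GetByKey(key); err == nil && exists {
		// The source of truth (the ServiceExport) still exists, so re-queue
		// the key and let the normal create path restore the ServiceImport.
		queue.Add(key)
		return
	}

	// No originating export: treat this as a genuine deletion.
	processDeletion(si)
}
```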

@t0lya

t0lya commented Feb 15, 2024

My team runs a multi-cluster setup with Submariner's service discovery (v0.15.2), where we might be seeing a case of this issue.

Our workloads can be placed on a particular cluster based on nodeSelectors in the Deployment spec. When a Deployment's nodeSelectors are updated in a way that triggers a cluster placement change, we have a controller that deletes the Service/ServiceExport/Deployment from the old source cluster and immediately re-creates these three objects in the new source cluster.

We are seeing that this results in the deletion of the aggregated ServiceImport by the submariner-lighthouse-agent running on the old source cluster. The submariner-lighthouse-agent running on the new source cluster never re-creates the aggregated ServiceImport.

To mitigate the issue, we had to manually re-create the Service/ServiceExport on the new cluster to trigger ServiceImport re-creation on the broker cluster.
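
For reference, the manual mitigation roughly amounts to something like this sketch using the generated mcs-api clientset (the service name and namespace are placeholders, and the Service itself would be re-created analogously with the core clientset):

```go
// Hypothetical sketch of manually re-creating a ServiceExport on the new source
// cluster so Lighthouse re-creates the aggregated ServiceImport.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	mcsv1a1 "sigs.k8s.io/mcs-api/pkg/apis/v1alpha1"
	mcsclient "sigs.k8s.io/mcs-api/pkg/client/clientset/versioned"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	client := mcsclient.NewForConfigOrDie(config)

	// Placeholder name/namespace; use the exported Service's actual values.
	export := &mcsv1a1.ServiceExport{
		ObjectMeta: metav1.ObjectMeta{Name: "my-service", Namespace: "my-namespace"},
	}

	_, err = client.MulticlusterV1alpha1().ServiceExports("my-namespace").
		Create(context.TODO(), export, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}
```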

@tpantelis Will your solution handle the above scenario? And is there a timeline for when a solution will be rolled out?

@github-actions
github-actions bot commented Jun 16, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jun 16, 2024
@tpantelis tpantelis removed the stale label Jun 16, 2024