describe delete protection for resources to avoid client-side delete coordination #380

Closed
wants to merge 1 commit into from

Conversation

deads2k
Contributor

@deads2k deads2k commented Jun 17, 2020

This came up in a couple of areas with ManifestWork and, to a lesser degree, in OpenShift. OpenShift's operators don't generally face this problem because bulk-deleting all namespaces isn't something we do. The one time we actually tried, it wedged on pretty much this.

Interestingly, as I've described it here so far, it would be possible for us to use this to protect certain namespaces using other immortal ones and effectively 403 everyone's delete. We probably shouldn't do that.

@pmorie

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2020
2. Check the .spec.criteria for each matching CriticalService.
If all .spec.criteria are satisfied, ALLOW the delete.
If not all .spec.criteria are satisfied, do something sane in spec and DENY the delete.
3. As a special case, a CriticalService cannot be deleted until its .spec.provider is no longer present.
Contributor

I'm not sure I see why this is necessary - it seems like it should be reasonable to mark something as non-critical after marking it critical

Contributor

It's necessary to enforce the order: if you delete all resources at once, the apiserver might delete the CriticalService first, and then the other resources we tried to protect are left unprotected.

Contributor

To clarify - you can remove a CriticalService either by deleting the resource referenced in .spec.provider or by removing the .spec.provider section entirely?

(Without the latter, you cannot "unmark" a provider as critical.)
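For illustration, a minimal sketch of the step-3 special case discussed in this thread; the type and function names are placeholders, not part of the proposal:

```go
package webhook

import "fmt"

// providerRef mirrors the group/resource/namespace/name shape of .spec.provider.
type providerRef struct {
	Group, Resource, Namespace, Name string
}

// existsFunc stands in for a dynamic-client lookup of an arbitrary resource.
type existsFunc func(ref providerRef) (bool, error)

// allowCriticalServiceDelete sketches the special case: a CriticalService may
// only be deleted once the resource referenced by .spec.provider is gone.
func allowCriticalServiceDelete(provider providerRef, exists existsFunc) (bool, error) {
	found, err := exists(provider)
	if err != nil {
		return false, fmt.Errorf("checking provider %s/%s: %w", provider.Namespace, provider.Name, err)
	}
	// Deny while the provider is still present; this forces "provider first,
	// CriticalService last" even when everything is deleted in bulk.
	return !found, nil
}
```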


## Proposal

We will create a validating admission webhook that intercepts DELETEs of namespaces, criticalservices, and any resource listed as a .spec.provider.
Contributor

which means it mutates its own admission registration object depending on which provider resources are defined?
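To make that concrete: one way this could work is a controller that regenerates the webhook's DELETE rules from the distinct provider group/resources. A sketch using the upstream admissionregistration/v1 types (the CRD's API group here is an assumption):

```go
package webhook

import (
	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// deleteRulesFor builds the DELETE rules for the validating webhook from the
// provider resources referenced by existing CriticalServices, plus namespaces
// and the criticalservices resource itself. A controller would re-apply these
// rules to the ValidatingWebhookConfiguration whenever the set of providers
// changes.
func deleteRulesFor(providers []schema.GroupResource) []admissionregistrationv1.RuleWithOperations {
	resources := append([]schema.GroupResource{
		{Group: "", Resource: "namespaces"},
		// The API group of the CRD is hypothetical here.
		{Group: "example.openshift.io", Resource: "criticalservices"},
	}, providers...)

	rules := make([]admissionregistrationv1.RuleWithOperations, 0, len(resources))
	for _, gr := range resources {
		rules = append(rules, admissionregistrationv1.RuleWithOperations{
			Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Delete},
			Rule: admissionregistrationv1.Rule{
				APIGroups:   []string{gr.Group},
				APIVersions: []string{"*"},
				Resources:   []string{gr.Resource},
			},
		})
	}
	return rules
}
```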


```go
type CriticalServiceCriteria struct{
	Type      CriticalServiceCriteriaType
	Finalizer *FinalizerCriticalServiceCriteria
```
Contributor

doc when these two match
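One way to document when the two fields match, following the usual Kubernetes discriminated-union convention (the constant name and the empty Finalizer struct below are stand-ins for illustration):

```go
package api

// CriticalServiceCriteriaType discriminates the union in CriticalServiceCriteria.
type CriticalServiceCriteriaType string

// FinalizerCriteriaType is an assumed name for the finalizer-based criterion.
const FinalizerCriteriaType CriticalServiceCriteriaType = "Finalizer"

// CriticalServiceCriteria is a discriminated union: Type selects which member
// is populated, and exactly that member must be non-nil.
type CriticalServiceCriteria struct {
	// Type names the criterion kind being used.
	Type CriticalServiceCriteriaType
	// Finalizer must be set if and only if Type == FinalizerCriteriaType.
	Finalizer *FinalizerCriticalServiceCriteria
}

// FinalizerCriticalServiceCriteria's real fields live in the proposal; elided here.
type FinalizerCriticalServiceCriteria struct{}
```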

```go
package api

type CriticalService struct{
```
Contributor

The "service" term is overloaded. CriticalResource? CriticalComponent?

In the case of a service, I would also expect multiple providers: there could be a deployment, a service, and more that contribute to that service.


I like CriticalResource, agree that service is overloaded

```
coolresources.my.crd.group/some-instance
```

This construct also means that namespaces in a management cluster can be be deleted when managed clusters are removed


s/can be/can/g


1. Check all CriticalServices to see if this particular instance is protected, if not ALLOW the delete.
2. Check the .spec.criteria for each matching CriticalService.
If all .spec.criteria are satisfied, ALLOW the delete.


I think "satisfied" means that whatever is defined in .spec.criteria no longer exists?

Because deletion can happen in any order.
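For reference, the two quoted steps as a sketch of the webhook's decision path; `matches` and `satisfied` are placeholders for the lookups the webhook would actually perform:

```go
package webhook

// deleteTarget identifies the object whose DELETE is being admitted.
type deleteTarget struct {
	Group, Resource, Namespace, Name string
}

// criterion stands in for one entry of .spec.criteria; satisfied would report
// whether the blocking condition no longer holds (e.g. the referenced resource
// or finalizer is gone).
type criterion interface {
	satisfied(target deleteTarget) bool
}

// criticalService is a placeholder shape for the API object in this sketch.
type criticalService struct {
	name     string
	criteria []criterion
}

// matches would report whether this CriticalService protects the target.
func (c criticalService) matches(t deleteTarget) bool { return false }

// allowDelete sketches steps 1 and 2: allow if nothing protects the target,
// otherwise allow only when every criterion of every matching CriticalService
// is satisfied.
func allowDelete(target deleteTarget, services []criticalService) (bool, string) {
	for _, cs := range services {
		if !cs.matches(target) {
			continue // step 1: not protected by this CriticalService
		}
		for _, c := range cs.criteria {
			if !c.satisfied(target) {
				return false, "blocked by CriticalService " + cs.name // step 2: DENY
			}
		}
	}
	return true, "" // step 1 or 2: nothing matched, or all criteria satisfied: ALLOW
}
```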


When a bulk delete happens, the effective order will be


To confirm: the bulk delete happens over several loops, and in each loop some resource deletions are DENIED, until eventually all resources are deleted.


This is enforced without client deletion coordination.

#### Story 2


I think that will also guard against any unexpected deletion of the operator deployment?

1. Check all CriticalServices to see if this particular instance is protected, if not ALLOW the delete.
2. Check the .spec.criteria for each matching CriticalService.
If all .spec.criteria are satisfied, ALLOW the delete.
If not all .spec.criteria are satisfied, do something sane in spec and DENY the delete.

Define "something sane"? And do you mean "in status"?
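On "something sane": at minimum the webhook can deny with a descriptive message in the admission response; whether that reason is also mirrored into the CriticalService's status is exactly the open question here. A sketch of the denial path using the upstream admission/v1 types (the status-condition part is deliberately left out):

```go
package webhook

import (
	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// denyDelete builds the webhook's response for a blocked DELETE, carrying a
// human-readable reason back to the client. Recording the same reason on the
// CriticalService's status would be a separate controller's job.
func denyDelete(uid types.UID, reason string) *admissionv1.AdmissionResponse {
	return &admissionv1.AdmissionResponse{
		UID:     uid,
		Allowed: false,
		Result: &metav1.Status{
			Status:  metav1.StatusFailure,
			Reason:  metav1.StatusReasonForbidden,
			Message: reason,
		},
	}
}
```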


This is enforced without client deletion coordination.

#### Story 2


Another story: the provider and the criteria could be the same kind of resource. Taking ManifestWork as an example, we could have a ManifestWork that defines the deployment of the agent operating the ManifestWork APIs, for the purpose of upgrading the agent in the future. That means there will be a critical ManifestWork and other ManifestWorks, and the critical ManifestWork should be deleted only after all the other ManifestWorks. I think the CriticalService API should be able to handle such a case as well.

@deads2k
Contributor Author

deads2k commented Jul 1, 2020

Note to self: I need a way to indicate in spec that this CriticalResource instance wants to be deleted, so that the webhook stops protecting it.
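One possible shape for that escape hatch, purely illustrative (the field name and the spec type are assumptions, not part of the proposal):

```go
package api

// CriticalServiceSpec sketch; only the hypothetical escape-hatch field is shown.
type CriticalServiceSpec struct {
	// DeletionRequested, when true, would tell the webhook to stop protecting
	// this instance and its provider so that both can be deleted.
	// (Hypothetical field; the actual mechanism is the open question above.)
	DeletionRequested bool

	// ... provider, criteria, etc. as in the proposal.
}
```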

@qiujian16

qiujian16 commented Jul 2, 2020

It seems this might create a deadlock. For example, if we define a critical resource like:

```yaml
kind: CriticalService
spec:
  provider:
    group: apps
    resource: deployments
    namespace: default
    name: default-deployment
  criteria:
  - type: SpecificResource
    specificResource:
      group: v1
      resource: Namespace
      name: default
```

We won't be able to delete either the namespace or the deployment. Should we disallow creating such a critical resource?

@sttts
Contributor

sttts commented Jul 2, 2020

I don't think we can programmatically detect whether we have a deadlock, at least not in an exact manner. There can be things like finalizers whose logic depends on, e.g., annotations that are invisible to the CriticalResource implementation. If we try to find cycles that might possibly cause deadlocks, we restrict ourselves in expressiveness. Do we want that, just to be on the safe side? I think we can let the developer think about what he/she is doing.

@sttts
Contributor

sttts commented Jul 2, 2020

@deads2k please change the wording:

  • provider => resource
  • criteria => deletionBlockers
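Applied to the spec, those renames would read roughly like this (the field types and the surrounding shape are assumptions, not the proposal's actual API):

```go
package api

// Spec sketch with the suggested renames applied:
// .spec.provider -> .spec.resource
// .spec.criteria -> .spec.deletionBlockers
type CriticalServiceSpec struct {
	// Resource (formerly provider) is the protected resource.
	Resource ResourceReference
	// DeletionBlockers (formerly criteria) must all be cleared before the
	// protected resource may be deleted.
	DeletionBlockers []DeletionBlocker
}

// ResourceReference is a placeholder for the group/resource/namespace/name tuple.
type ResourceReference struct {
	Group, Resource, Namespace, Name string
}

// DeletionBlocker stands in for the criteria union defined in the proposal.
type DeletionBlocker struct{}
```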

2. coolresources.my.crd.group/some-instance is deleted, but waits to be finalized
3. coolresources.my.crd.group/some-instance is finalized
4. crd/coolresources.my.crd.group is finalized
5. deployment.apps/finalizer-deployment is deleted, but waits to be finalized
Contributor

what if a deployment doesn't have a finalizer?

6. namespace/finalizer-namespace is deleted, but waits to be finalized
7. deployment.apps/finalizer-deployment is finalized
8. namespace/finalizer-namespace is finalized
9. criticalservices/for-finalizer-deployment is deleted
Contributor

Will there be an active controller that will try to remove criticalservices/for-finalizer-deployment?
As I understand it, the first call will fail because it will be protected by the validating webhook, right?

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 26, 2020
@p0lyn0mial
Contributor

/remove-lifecycle stale

@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deads2k deads2k reopened this Jun 2, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Jun 2, 2021

@deads2k: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/markdownlint | ac94eff | link | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jul 2, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 2, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
