describe delete protection for resources to avoid client-side delete coordination #380
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
2. Check the .spec.criteria for each matching CriticalService.
   If all .spec.criteria are satisfied, ALLOW the delete.
   If not all .spec.criteria are satisfied, do something sane in spec and DENY the delete.
3. As a special case, a CriticalService cannot be deleted until its .spec.provider is no longer present.
I'm not sure I see why this is necessary - it seems like it should be reasonable to mark something as non-critical after marking it critical
it's necessary to enforce the order: if you delete all resources at once, the apiserver might delete the CriticalService first and then the other resources we tried to protect are unprotected.
To clarify - you can remove a CriticalService by either deleting the resource referenced in .spec.provider or by removing the .spec.provider section entirely?
(Without the latter, you can not "unmark" a provider as critical)
## Proposal

We will create a validating admission webhook that intercepts DELETEs of namespaces, criticalservices, and any resource listed as a .spec.provider.
which means it mutates its own admission registration object depending on which provider resources are defined?
type CriticalServiceCriteria struct{
	Type CriticalServiceCriteriaType
	Finalizer *FinalizerCriticalServiceCriteria
doc when these two match
```go
package api

type CriticalService struct{
the "service" term is overloaded. CriticalResource? CriticalComponent?
In case of service, I would also expect multiple providers. There could be a deployment, a service and more which contribute to that service.
I like CriticalResource, agree that service is overloaded
coolresources.my.crd.group/some-instance
```

This construct also means that namespaces in a management cluster can be be deleted when managed clusters are removed
s/can be/can/g
1. Check all CriticalServices to see if this particular instance is protected, if not ALLOW the delete.
2. Check the .spec.criteria for each matching CriticalService.
   If all .spec.criteria are satisfied, ALLOW the delete.
I think "satisfied" means that whatever is defined in .spec.criteria no longer exists?
because deletion can happen in any order.

When a bulk delete happens, the effective order will be
To confirm: the bulk delete happens over several loops, and in each loop some resource deletions are DENIED, until all resources are deleted?
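One way to picture the retry behaviour asked about here is a small simulation in which a client repeatedly tries to delete everything, and a denied delete is simply retried on the next pass. This is an illustrative sketch only: `bulkDelete`, the `deps` map (a stand-in for .spec.criteria), and the pass-based retry model are assumptions, not part of the proposal.

```go
package main

import "fmt"

// bulkDelete repeatedly attempts to delete every remaining resource.
// deps maps a protected resource to the resources that must be gone before
// it may be deleted (a hypothetical stand-in for .spec.criteria).
// It returns the order in which deletes succeeded and whether all succeeded.
func bulkDelete(resources []string, deps map[string][]string) ([]string, bool) {
	remaining := make(map[string]bool, len(resources))
	for _, r := range resources {
		remaining[r] = true
	}
	var order []string
	for len(remaining) > 0 {
		progressed := false
		for _, r := range resources {
			if !remaining[r] {
				continue
			}
			denied := false
			for _, d := range deps[r] {
				if remaining[d] {
					denied = true // the webhook would DENY: a criterion still exists
					break
				}
			}
			if !denied {
				delete(remaining, r)
				order = append(order, r)
				progressed = true
			}
		}
		if !progressed {
			return order, false // every remaining delete is denied: wedged
		}
	}
	return order, true
}

func main() {
	resources := []string{
		"namespace/finalizer-namespace",
		"deployment.apps/finalizer-deployment",
		"coolresources.my.crd.group/some-instance",
	}
	deps := map[string][]string{
		"namespace/finalizer-namespace":        {"deployment.apps/finalizer-deployment"},
		"deployment.apps/finalizer-deployment": {"coolresources.my.crd.group/some-instance"},
	}
	order, ok := bulkDelete(resources, deps)
	fmt.Println(order, ok)
}
```

In this model the client never has to know the order; it only has to keep retrying, and a pass that makes no progress signals a wedge.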
This is enforced without client deletion coordination.

#### Story 2
I think that will also guard against any unexpected deletion of the operator deployment?
1. Check all CriticalServices to see if this particular instance is protected, if not ALLOW the delete.
2. Check the .spec.criteria for each matching CriticalService.
   If all .spec.criteria are satisfied, ALLOW the delete.
   If not all .spec.criteria are satisfied, do something sane in spec and DENY the delete.
Define "something sane"? And, do you mean "in status" ?
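The decision steps quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ResourceRef`, the simplified `CriticalService` shape, and `decideDelete` are hypothetical names; a criterion is treated as satisfied once the resource it names no longer exists; and the "something sane" step is left as a comment because the proposal does not define it.

```go
package main

import "fmt"

// ResourceRef identifies one resource instance (hypothetical shape).
type ResourceRef struct {
	Group, Resource, Namespace, Name string
}

// CriticalService is a simplified, hypothetical stand-in for the proposed API:
// Provider is the protected resource, Criteria must be gone before it is deleted.
type CriticalService struct {
	Name     string
	Provider ResourceRef
	Criteria []ResourceRef
}

// decideDelete returns true (ALLOW) or false (DENY) for a DELETE of target,
// plus a reason when denied. exists reports whether a resource is still present.
func decideDelete(target ResourceRef, services []CriticalService, exists func(ResourceRef) bool) (bool, string) {
	for _, cs := range services {
		if cs.Provider != target {
			continue // step 1: this CriticalService does not protect the target
		}
		for _, c := range cs.Criteria {
			if exists(c) {
				// step 2: an unsatisfied criterion means DENY.
				// (Recording "something sane" on the object is not modelled here.)
				return false, fmt.Sprintf("%s: %s/%s still exists", cs.Name, c.Resource, c.Name)
			}
		}
	}
	return true, ""
}

func main() {
	deploy := ResourceRef{Group: "apps", Resource: "deployments", Namespace: "ns", Name: "finalizer-deployment"}
	cr := ResourceRef{Group: "my.crd.group", Resource: "coolresources", Namespace: "ns", Name: "some-instance"}
	services := []CriticalService{{Name: "for-finalizer-deployment", Provider: deploy, Criteria: []ResourceRef{cr}}}

	present := map[ResourceRef]bool{deploy: true, cr: true}
	exists := func(r ResourceRef) bool { return present[r] }

	allowed, reason := decideDelete(deploy, services, exists)
	fmt.Println(allowed, reason)

	delete(present, cr) // the criterion resource is gone
	allowed, _ = decideDelete(deploy, services, exists)
	fmt.Println(allowed)
}
```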
This is enforced without client deletion coordination.

#### Story 2
Another story is that the provider and criteria could be the same kind of resource. Take manifestwork as an example. We could have a manifestwork defining the deployment of the agent that operates the manifestwork APIs, for the purpose of upgrading the agent in the future. That means there will be a critical manifestwork and other manifestworks, and the critical manifestwork should be deleted only after all the other manifestworks. I think the CriticalService API should be able to handle this case as well.
note to self, I need a way to indicate in spec that this criticalresource instance wants to be deleted, so stop protecting it.
It seems this might produce a deadlock, for example if we define a critical resource like:

    kind: CriticalService
    spec:
      provider:
        group: apps
        resource: deployments
        namespace: default
        name: default-deployment
      criteria:
      - type: SpecificResource
        specificResource:
          group: v1
          resource: Namespace
          name: default

We won't be able to delete either the namespace or the deployment. Should we disallow creating this kind of criticalresource?
I don't think we can programmatically detect whether we have a deadlock, at least not in an exact manner. There can be things like finalizers whose logic depends on e.g. annotations that are invisible to the
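Exact detection may indeed be out of reach, but the statically visible case raised above (a criterion that names a specific resource) could in principle be checked for cycles when a CriticalService is created. The following is a best-effort sketch, not a proposed implementation: `hasCycle`, the string keys, and the implicit "a namespace waits for its contents" edge are all assumptions made for illustration, and finalizer-driven dependencies are not covered.

```go
package main

import "fmt"

// hasCycle reports whether the "must be deleted after" graph contains a cycle.
// waitsFor[a] lists the resources that must be gone before a can be deleted.
func hasCycle(waitsFor map[string][]string) bool {
	const (
		unvisited = iota // zero value for nodes not yet seen
		inProgress
		done
	)
	state := make(map[string]int)
	var visit func(n string) bool
	visit = func(n string) bool {
		switch state[n] {
		case inProgress:
			return true // reached a node already on the current path
		case done:
			return false
		}
		state[n] = inProgress
		for _, m := range waitsFor[n] {
			if visit(m) {
				return true
			}
		}
		state[n] = done
		return false
	}
	for n := range waitsFor {
		if visit(n) {
			return true
		}
	}
	return false
}

func main() {
	waitsFor := map[string][]string{
		// From the CriticalService: the deployment is protected until the namespace is gone.
		"deployments/default/default-deployment": {"namespaces/default"},
		// Implicit (assumed): a namespace is only removed after its contents are deleted.
		"namespaces/default": {"deployments/default/default-deployment"},
	}
	fmt.Println("deadlock:", hasCycle(waitsFor))
}
```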
@deads2k please change the wording:
2. coolresources.my.crd.group/some-instance is deleted, but waits to be finalized
3. coolresources.my.crd.group/some-instance is finalized
4. crd/coolresources.my.crd.group is finalized
5. deployment.apps/finalizer-deployment is deleted, but waits to be finalized
what if a deployment doesn't have a finalizer?
6. namespace/finalizer-namespace is deleted, but waits to be finalized
7. deployment.apps/finalizer-deployment is finalized
8. namespace/finalizer-namespace is finalized
9. criticalservices/for-finalizer-deployment is deleted
will there be an active controller that will try to remove criticalservices/for-finalizer-deployment ?
as I understand it the first call will fail because it will be protected by the validation webhook, right ?
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close
@openshift-bot: Closed this PR. In response to this:
[APPROVALNOTIFIER] This PR is APPROVED Approval requirements bypassed by manually added approval. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. The pull request process is described here
@deads2k: The following test failed, say
This came up in a couple areas with ManifestWork and to a lesser degree in openshift. OpenShift's operators don't generally face this problem because bulk delete of all namespaces isn't a thing we do. The one time we actually tried, it wedged on pretty much this.
Interestingly, as I've described it here so far, it would be possible for us to use this to protect certain namespaces using other immortal ones and effectively 403 everyone's delete. We probably shouldn't do that.
@pmorie