-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
describe delete protection for resources to avoid client-side delete coordination #380
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,233 @@ | ||
--- | ||
title: generic-delete-protection | ||
authors: | ||
- "@deads2k" | ||
reviewers: | ||
- "@sttts" | ||
approvers: | ||
- "@derekwaynecarr" | ||
creation-date: 2020-06-17 | ||
last-updated: 2020-06-17 | ||
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced | ||
see-also: | ||
- https://github.com/open-cluster-management/backlog/issues/2648#issuecomment-645170280 | ||
- https://github.com/open-cluster-management/backlog/issues/2348#issuecomment-642231925 | ||
replaces: | ||
superseded-by: | ||
--- | ||
|
||
# Generic Delete Protection | ||
|
||
## Release Signoff Checklist | ||
|
||
- [ ] Enhancement is `implementable` | ||
- [ ] Design details are appropriately documented from clear requirements | ||
- [ ] Test plan is defined | ||
- [ ] Graduation criteria for dev preview, tech preview, GA | ||
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) | ||
|
||
## Open Questions | ||
|
||
1. At this moment, adoption is racy because deployment.apps/name is insufficient to distinguish based on UID, but to allow | ||
freedom of creation ordering, the CriticalService cannot include a UID. | ||
We may be able to create a status field that tracks a UID, but the intent isn't fully clear. | ||
If adoption doesn't work, we can end up in a state where ManifestWork cannot be finalized because the CriticalService | ||
isn't actually eligible for deletion. | ||
The simplest solution seem to be to state an intent of "best-effort-deletion" or "orphan-resources". | ||
This could be added as a spec field which could be set post-deletion, pre-finalization. | ||
|
||
## Summary | ||
|
||
By running the platform on the cluster, there are complications that arise regarding deletion of certain critical services. | ||
These services include webhook admission plugins and finalizers. | ||
This design is about trying to find a way to prevent permanently "stuck" resources without trying to coordinate resource deletion | ||
in clients. | ||
|
||
## Motivation | ||
|
||
If pods for admission-webhook-a, which protect resource-a, are deleted, then it becomes impossible to create, update, or | ||
delete any resource-a until the webhookconfiguration is deleted or the pods are restored. | ||
If pods for finalizer-a, which protect resource-a, are deleted, then it becomes impossible to delete any resource-a until | ||
the pods are restored. | ||
Resolving resource deletion ordering in clients is impractical. | ||
|
||
### Goals | ||
|
||
1. Eliminate the need for ordered client deletion. | ||
|
||
### Non-Goals | ||
|
||
1. Prevent permanently stuck resources in cases where the resource that is stuck and the the pod providing the critical service | ||
are in the same namespace. Don't do that. | ||
|
||
## Proposal | ||
|
||
We will create a validating admission webhook that intercepts DELETEs of namespaces, criticalservices, and any resource listed as a .spec.provider. | ||
Realistically, we will hard code deployments.apps and have that as the only valid resource in .spec.provider to start. | ||
When a DELETE arrives for this resource we will.. | ||
|
||
1. Check all CriticalServices to see if this particular instance is protected, if not ALLOW the delete. | ||
2. Check the .spec.criteria for each matching CriticalService. | ||
If all .spec.criteria are satisfied, ALLOW the delete. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think satisfied means what defined in .spec.criteria no longer exists? |
||
If not all .spec.criteria are satisfied, do something sane in spec and DENY the delete. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Define "something sane"? And, do you mean "in status" ? |
||
3. As a special case, a CriticalService cannot be deleted until its .spec.provider is no longer present. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure I see why this is necessary - it seems like it should be reasonable to mark something as non-critical after marking it critical There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it's necessary to enforce the order: if you delete all resources at once, the apiserver might delete the CriticalService first and then the other resources we tried to protect are unprotected. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To clarify - you can remove a CriticalService by either deleting the resource referenced in (Without the latter, you can not "unmark" a provider as critical) |
||
|
||
```go | ||
package api | ||
|
||
type CriticalService struct{ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the "service" term is overloaded. CriticalResource? CriticalComponent? In case of service, I would also expect multiple providers. There could be a deployment, a service and more which contribute to that service. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like |
||
Spec CriticalServiceSpec | ||
} | ||
|
||
type CriticalServiceSpec struct{ | ||
Provider CriticalServiceProvider | ||
Criteria []CriticalServiceCriteria | ||
} | ||
|
||
type GroupResource struct{ | ||
Group string | ||
Resource string | ||
} | ||
|
||
type CriticalServiceProvider struct{ | ||
// only allow deployments.apps to start | ||
GroupResource | ||
Namespace string | ||
Name string | ||
} | ||
|
||
type CriticalServiceCriteriaType string | ||
var( | ||
FinalizerType CriticalServiceCriteriaType = "Finalizer" | ||
SpecificResourceType CriticalServiceCriteriaType = "SpecificResource" | ||
) | ||
|
||
type CriticalServiceCriteria struct{ | ||
Type CriticalServiceCriteriaType | ||
Finalizer *FinalizerCriticalServiceCriteria | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. doc when these two match |
||
SpecificResource *SpecificResourceCriticalServiceCriteria | ||
} | ||
|
||
type FinalizerCriticalServiceCriteria struct{ | ||
GroupResource | ||
FinalizerName string | ||
} | ||
|
||
type SpecificResourceCriticalServiceCriteria struct{ | ||
GroupResource | ||
Namespace string | ||
Name string | ||
} | ||
``` | ||
|
||
### User Stories | ||
|
||
#### Deployment provides finalizer processing for CRD | ||
```yaml | ||
kind: CriticalService | ||
spec: | ||
provider: | ||
group: apps | ||
resource: deployments | ||
namespace: finalizer-namespace | ||
name: finalizer-deployment | ||
criteria: | ||
- type: Finalizer | ||
finalizer: | ||
group: my.crd.group | ||
resource: coolresources | ||
finalizerName: my.crd.group/super-important | ||
- type: SpecificResource | ||
specificResource: | ||
group: apiextensions.k8s.io | ||
resource: CustomResourceDefinition | ||
name: coolresources.my.crd.group | ||
``` | ||
|
||
|
||
This would allow three separate ManifestWorks to all be deleted at the same time and avoid conflicting with each other | ||
as the deletes are finalized on individual resources. | ||
```yaml | ||
kind: ManifestWork | ||
metadata: | ||
name: finalizer | ||
spec: | ||
criticalservices/for-finalizer-deployment | ||
namespace/finalizer-namespace | ||
deployment.apps/finalizer-deployment | ||
--- | ||
kind: ManifestWork | ||
metadata: | ||
name: crd | ||
spec: | ||
crd/coolresources.my.crd.group | ||
--- | ||
kind: ManifestWork | ||
metadata: | ||
name: cr | ||
spec: | ||
coolresources.my.crd.group/some-instance | ||
``` | ||
|
||
It would also allow them to be combined into a single ManifestWork. | ||
```yaml | ||
kind: ManifestWork | ||
metadata: | ||
name: all | ||
spec: | ||
criticalservices/for-finalizer-deployment | ||
namespace/finalizer-namespace | ||
deployment.apps/finalizer-deployment | ||
crd/coolresources.my.crd.group | ||
coolresources.my.crd.group/some-instance | ||
``` | ||
|
||
This construct also means that namespaces in a management cluster can be be deleted when managed clusters are removed | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/can be/can/g |
||
because deletion can happen in any order. | ||
|
||
|
||
When a bulk delete happens, the effective order will be | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To confirm, the bulk delete happens in several loops, that in each loop some resource deletion is |
||
1. crd/coolresources.my.crd.group is deleted, but waits to be finalized | ||
2. coolresources.my.crd.group/some-instance is deleted, but waits to be finalized | ||
3. coolresources.my.crd.group/some-instance is finalized | ||
4. crd/coolresources.my.crd.group is finalized | ||
5. deployment.apps/finalizer-deployment is deleted, but waits to be finalized | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what if a deployment doesn't have a finalizer? |
||
6. namespace/finalizer-namespace is deleted, but waits to be finalized | ||
7. deployment.apps/finalizer-deployment is finalized | ||
8. namespace/finalizer-namespace is finalized | ||
9. criticalservices/for-finalizer-deployment is deleted | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. will there be an active controller that will try to remove criticalservices/for-finalizer-deployment ? |
||
|
||
This is because | ||
1. CRDs are not finalized until all the CR instances are removed | ||
2. deployment.apps/finalizer-deployment and namespace/finalizer-namespace cannot be deleted until the finalizer is | ||
removed from all coolresources.my.crd.group and the crd/coolresources.my.crd.group is finalized. | ||
3. criticalservices/for-finalizer-deployment cannot be deleted until deployment.apps/finalizer-deployment is finalized | ||
|
||
This is enforced without client deletion coordination. | ||
|
||
#### Story 2 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that will also guard any unexpected deletion of operator deployment? |
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another story is that the provider and criteria could be the same kind of resources. Taking manifestwork as an example. We could have a manifestwork defining the deployment of agent that operats manifetwork apis, for the purpose of agent upgrade in the future. It means there will be a |
||
### Implementation Details/Notes/Constraints [optional] | ||
|
||
|
||
### Risks and Mitigations | ||
|
||
|
||
## Design Details | ||
|
||
### Test Plan | ||
|
||
### Upgrade / Downgrade Strategy | ||
|
||
### Version Skew Strategy | ||
|
||
## Drawbacks | ||
|
||
The idea is to find the best form of an argument why this enhancement should _not_ be implemented. | ||
|
||
## Alternatives | ||
|
||
Similar to the `Drawbacks` section the `Alternatives` section is used to | ||
highlight and record other possible approaches to delivering the value proposed | ||
by an enhancement. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which means it mutates its own admission registration object depending on which provider resources are defined?