-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-1847: Auto remove PVCs created by StatefulSet #1915
Conversation
Hi @kk-src. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc @mattcary @dsu-igeek |
@kk-src: GitHub didn't allow me to request PR reviews from the following users: dsu-igeek. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc @msau42 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks in good shape to me. A couple of minor comments only.
Thanks!
the StatefulSet is deleted. As can be seen by the discussion in the issue | ||
[55045](https://github.com/kubernetes/kubernetes/issues/55045) there are several use | ||
cases where the PVCs which are automatically created are deleted as well. There is | ||
merit to the suggestion that the PVCs should not be deleted since they have a different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Probably more precise to say something like "In many StatefulSet use cases, the PVCs have a different lifecycle than the pods of the StatefulSet, and should not be deleted at the same time. Because of this, PVC deletion will be opt-in for users."
If `VolumeReclaimPolicy` is set to `RemoveOnPodDeletion` Pod is set as the owner of the PVC. | ||
When a Pod is deleted, the PVC corresponding to the Pod also gets deleted. During scale up, a | ||
check is made to see if the ownership is already set and if the Pod UID is that of the current | ||
Pod in consideration. In case of a mismatch the scale up will exit out of the current reconcile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little unclear to me. Is "Pod UID" the pod UID in the owner reference of the PVC? and what is "the current Pod in consideration"---wouldn't the controller be looking at all pods & pvcs at once?
I think it might be more clear to just state the behavior rather than saying what goes wrong with different behavior, something like:
When a Pod is deleted, the PVC owned by the Pod is also deleted. During scale-up, for each Pod of the StatefulSet, the PVCs are gathered. If a PVC has an OwnerRef that does not match this Pod, the PVC must not have yet been deleted and the controller will exit the current reconcile loop and attempt to reconcile in the next iteration. This avoids a race with PVC deletion.
PVCs previously in use before scale down will be used again when the scale up occurs. The PVC deletion | ||
should happen only after the Pod gets deleted. Since the Pod ownership has `blockOwnerDeletion` set to | ||
`true` pods will get deleted before the StatefulSet is deleted. The `blockOwnerDeletion` for PVCs will | ||
be set to `false` which ensures that PVC deletion happens only after the StatefulSet is deleted. This |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's very elegant, nice!
--> | ||
|
||
## Design Details | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should specify that ownerRefs on PVCs are only set for those that are created for volumes provisioned as part of the the statefulset, ie in VolumeClaimTemplates with a StorageClass specified, and no dataSource, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you tell? Someone could manually create a dynamically provisioned PVC with the same name that the StatefulSet controller expects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Does that mean we should add an annotation when a PVC is created by the SS controller for a pod?
We don't want to add an owner reference I think as that would make it harder to not change behavior for a retain policy.
It seems to me that an annotation would work well here but maybe there is an idiomatic alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have added text indicating static naming convention used by the statefulset will be reused here to identity the PVCs in question.
* Retain - this is the default policy and is considered in cases where no policy is specified. | ||
This would be the existing behaviour - when a StatefulSet is deleted, no action is taken with | ||
respect to the PVCs created by the StatefulSet. | ||
* RemoveOnPodDeletion - Whenever a pod is deleted, the corresponding PVC is deleted as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RemoveOnPodDeletion
is very similar to "generic ephemeral inline volumes" (https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1698-generic-ephemeral-volumes): volumes get created for a pod and deleted together with it. Just the mechanism is different.
I'm undecided whether it makes sense to have two different approaches for the same use case. On the one hand, it fits into the proposed API. On the other hand, additional code will be needed to implement it and that code will be more complex than the one for RemoveOnStatefulSetDeletion
(like dealing with slow PVC removal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean we want to change the semantics do that the PVC is only removed on StatefulSet scaledown? (and change the name to RemoveOnScaledown or similar)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a string for the reclaim policy is better than a boolean, I would just remove the RemoveOnPodDeletion
option unless there's a good reason for having it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean change the semantics so that the three options are:
- Retain (current behavior)
- RemoveOnScaledown (delete PVs on scaledown but not if pod is deleted/restarted for another reason such as being unhealthy)
- RemoveOnStatefulSetDeletion (as currently described)
This would fit the use cases of customer feedback that we've heard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so we are saying, this KEP is mainly for the scale down scenario and not any pod deletion scenario? It would help to make this clear here and completely remove RemoveOnPodDeletion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense. The only options will be Retain, RemoveOnScaledown, RemoveOnStatefulSetDeletion.
|
||
The following changes are required: | ||
|
||
1. Add `VolumeReclaimPolicy` entry into StatefulSet spec inorder to make this feature an opt-in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the word VolumeReclaimPolicy is confusing, it seems to conflict in meaning with reclaimPolicy in PV. Should we make it more clear pvcReclaimPolicy ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
is deleted as part of a scale down. Since the subsequent scale up needs to create fresh PVCs, it will | ||
be slower than scale ups relying on existing PVCs(from earlier scale ups). | ||
|
||
User would set the VolumeReclaimPolicy as 'RemoveOnPodDeletion' ensuring PVCs are deleted when corresponding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you confirm, you dont need RemoveOnPodDeletion, but just RemoveOnScaleDown ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if someone deletes the pod manually kubectl delete pod
, do you expect the PVC to be deleted ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense. If a pod is manually deleted, the SS will recreate it so there won't be a problem of a PVC to clean up. Only on scaledown (or SS deletion) are there PVCs that currently need to be manually removed.
The two options RemoveOnScaleDown and RemoveOnStatefulSetDeletion allow the operator to distinguish between the case where scaledowns and scaleups will be frequent (and PVCs should only be deleted at the end, when the whole SS is deleted) and the case when they are infrequent and it's more important to conserve PV resources than have rapid scaleup.
At least that's where I've been coming from with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would someone want both behaviors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think I implicitly assumed deletion of the stateful set is equivalent to scaling down to zero, so that RemoveOnScaleDown contains RemoveOnStatefulSetDeletion.
Probably I'm being imprecise and we should call that out explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like manual deletion needs more consideration. When a Pod is deleted, the PVC still has owner references to the deleted Pod. The StatefulSet controller will bring up the Pod again, but the reconcile logic would exit since the Pod owner is different. Eventually the GC would delete the older PVC and the next reconcile will construct new PVC and add reference to the newly created Pod. This might not be the desired behavior as we are currently talking about only deletions during scale down not during the Pod deletion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
I think the behavior we want is that if the SS controller is deleting a pod it knows if it's scaling down and can delete the PVC appropriately.
Otherwise the SS controller will bring the pod back up (eg if it fails) and in that case we want it to reuse the PVC even on RemoveOnScaleDown.
Will it be enough for the SS controller to only put the ownerref when it is deleting a pod for scaledown? Precisely, when the SS controller wants to remove a pod for scaledown, it first puts an ownerref on the PVC and then deletes the pod (rather than putting on the ownerref at PVC creation as we currently have it).
I don't know if this means an extra reconcile loop, where first we confirm the ownerref has been placed and only then we delete the pod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I learned a few things from discussions and thanks for this KEP.
Will it be enough for the SS controller to only put the ownerref when it is deleting a pod for scaledown? Precisely, when the SS controller wants to remove a pod for scaledown, it first puts an ownerref on the PVC and then deletes the pod (rather than putting on the ownerref at PVC creation as we currently have it).
Good one but when we are patching with ownerref I think we have to set BlockOwnerRef as false. If we follow this approach there will be cases where PVC remains in the cluster even after sts is scaledown.
Ex: User deleted the sts pod and within no time(before sts controller reconciles for sts) user also performed scale down in this case, PVC will remain the same(since ownerref is not yet set). Is this case will be marked as known limitations? Correct me if my assumption is wrong in the above case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mittachaitu Good point, I'd have to look at the statefulset controller but you're probably right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kk-src and I looked through the current SS controller code and came to the conclusion that if a manual deletion races with the scaledown, the controller will not do the scaledown and instead wait for all the pods to come back. So I don't think we need to change the proposed logic.
If the race happens in the other direction (user deleted after we start scaling down), it's okay as we're only setting the owner refs for PVCs with pods that are going to be deleted anyway.
At least that's how I reasoned it through, LMK if I missed something.
f74d4f0
to
0d23d75
Compare
Hi @kk-src Enhancements Lead here. Will there be any work for this in 1.20? Also, as a reminder to be included in a release: The KEP must be merged in an implementable state Thanks |
@kikisdeliveryservice - Yes, the current plan is to make it by enhancement freeze and work through the implementation in the 1.20 time frame. |
7b2dcab
to
7d77e95
Compare
|
||
The following changes are required: | ||
|
||
1. Add `VolumeReclaimPolicy` entry into StatefulSet spec inorder to make this feature an opt-in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
This might be a good place to talk about core concepts and how they relate. | ||
--> | ||
|
||
### Risks and Mitigations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A discussion of durability risks with the changed behavior (even though it is opt-in) should be added here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kk-src Maybe something like the following?
If the PVCReclaimPolicy is changed from its default of "Retain", then PVs will be deleted on SS scaledown or deletion. A user will lose data on those PVs if proactive effort is not taken to replicate that data. However, PVs associated with the StatefulSet will be more durable than ephemeral volumes would be, as they are only deleted on scaledown or StatefulSet deletion, and not on other pod lifecycle events like being rescheduled to a new node, even with the new retain policies.
User would set the VolumeReclaimPolicy as 'RemoveOnScaledown' ensuring PVCs are deleted when corresponding | ||
Pods are deleted. New Pods created during sclae up followed by a scaledown will wait for freshly created PVCs. | ||
|
||
### Notes/Constraints/Caveats (optional) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add information on applicability of the feature across configurations. Does the feature work with Local PVCs for instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kk-src Maybe something like the following?
This feature applies to PVs which are dynamically provisioned from the volumeClaimTemplate of a StatefulSet. Any PVC and PV provisioned from this mechanism will function with this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it have to be dynamically provisioned from the StatefulSet controller? Since this behavior is opt-in, the user should understand what they're getting into?
For example, today, you can manually create a PV object, and set Delete reclaim policy. The provisioner will delete the volume even though it didn't provision it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what if pods have multiple PVCs? Should all of them be deleted? I don't think that would make sense, there might be a shared volume.
Theres a 1:1 PVC:PV relationship so that doesn't come up in that case.
I think it's much simpler to define reasonable behavior if we scope to the volumeClaimTemplates only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point. Scoping the behavior seems reasonable to me, I would just clarify that this is about PVC objects being created by the StatefulSet controller, and not about PV objects being dynamically provisioned. You can have StatefulSet create PVCs that can be bound to pre-created PVs and those should still be in scope for the Statefulset reclaim policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 In the design section we are talking about the PVCs created from the VolumeClaimTemplate in the StatefulSet. That's exactly the set we're interested in.
eg, if a user created a PVC that matched the naming convention of the volumeClaimTemplate, a pod created for the statefulset would attach to it, so it seems unsurprising that the new reclaim policy would cause that PVC to be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I think I misunderstood then. I thought you wanted PVCs manually created by the user with the same naming convention to NOT be part of the reclaim policy. But it sounds like we do want it part of the reclaim policy.
And PVCs referenced outside of volumeClaimTemplates, ie in volumes
are out of scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry, there was a discussion that only partially made it into the comments. I think that the version down in "design details" is correct: if the statefulset has a volumeClaimTemplate, the static naming scheme defines a PVC for each pod which is called here the associated PVC of a pod. These associated PVCs are the ones that are deleted or not according to the policy.
A new field named `VolumeReclaimPolicy` of the type `StatefulSetVolumeReclaimPolicy` | ||
will be added to the StatefulSet. This field represents the user indication on whether the | ||
associated PVCs can be deleted or not. If this field is not set, the default behaviour is | ||
to leave the PVCs as is during the StatefulSet deletion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just say that the default policy is Retain.
|
||
If `VolumeReclaimPolicy` is set to `RemoveOnScaledown`, Pod is set as the owner of the PVCs created | ||
from the `VolumeClaimTemplates` just before the scale down is performed by the statefulset controller. | ||
When a Pod is deleted, the PVC owned by the Pod is also deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add here that if the SS is deleted the behavior is as in RemoveOnStatefulSetDeletion.
--> | ||
|
||
## Design Details | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Does that mean we should add an annotation when a PVC is created by the SS controller for a pod?
We don't want to add an owner reference I think as that would make it harder to not change behavior for a retain policy.
It seems to me that an annotation would work well here but maybe there is an idiomatic alternative.
|
||
### Volume reclaim policy for the StatefulSet created PVCs | ||
A new field named `VolumeReclaimPolicy` of the type `StatefulSetVolumeReclaimPolicy` | ||
will be added to the StatefulSet. This field represents the user indication on whether the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this field be patched on an already existing stateful set? I think it can be, as long as we don't make any guarantees about what happens if a scaleup/down/delete is in progress during the patching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note to ensure that in alpha we won't allow updating this field after creation to avoid any confusion. Based user feedback we can change this in beta.
This might be a good place to talk about core concepts and how they relate. | ||
--> | ||
|
||
### Risks and Mitigations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kk-src Maybe something like the following?
If the PVCReclaimPolicy is changed from its default of "Retain", then PVs will be deleted on SS scaledown or deletion. A user will lose data on those PVs if proactive effort is not taken to replicate that data. However, PVs associated with the StatefulSet will be more durable than ephemeral volumes would be, as they are only deleted on scaledown or StatefulSet deletion, and not on other pod lifecycle events like being rescheduled to a new node, even with the new retain policies.
User would set the VolumeReclaimPolicy as 'RemoveOnScaledown' ensuring PVCs are deleted when corresponding | ||
Pods are deleted. New Pods created during sclae up followed by a scaledown will wait for freshly created PVCs. | ||
|
||
### Notes/Constraints/Caveats (optional) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kk-src Maybe something like the following?
This feature applies to PVs which are dynamically provisioned from the volumeClaimTemplate of a StatefulSet. Any PVC and PV provisioned from this mechanism will function with this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments on the reqd fields
925442d
to
6e39193
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @kk-src
This looks good from an enhancements reqs perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For beta, I think we'll have to think more carefully on the disablement/rollback scenario.
|
||
* **How can this feature be enabled / disabled in a live cluster?** | ||
- [x] Other | ||
- The feature is enabled only if a user explicitly fills up the `PersistentVolumeClaimReclaimPolicy` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new field needs to be protected by a Kubernetes alpha feature gate. It is the feature gate that controls if the feature is enabled/disabled in a Kubernetes cluster.
* **What happens if we reenable the feature if it was previously rolled back?** | ||
For Statefulset PVCs which were previously set to a non-Retain `PersistentVolumeClaimReclaimPolicy` | ||
the Statefulset is going to come back with VolumeClaimTemplates set to default, but | ||
the experience would be that of non-retain policy - either get deleted on scale down or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be reconciling so that we remove any ownerrefs if they change the policy back? Although it will still be racy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're only adding the ownerrefs while a scaledown is in progress, so I think it will have to be reconciling given idempotency? Anytime we see desired replicas < actual replicas we'll set ownerrefs etc, which should regenerate any that have been manually removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also the StatefulSetDelete policy where we add the ownerref on the statefulset object at creation time right?
be slower than scale ups relying on existing PVCs(from earlier scale ups). | ||
|
||
User would set the `PersistentVolumeClaimReclaimPolicy` as 'RemoveOnScaledown' ensuring PVCs are deleted when corresponding | ||
Pods are deleted. New Pods created during sclae up followed by a scaledown will wait for freshly created PVCs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: sclae
will be added to the StatefulSet. This field represents the user indication on whether the | ||
associated PVCs can be deleted or not. The default policy would be `Retain`. | ||
|
||
If `PersistentVolumeClaimReclaimPolicy` is set to `RemoveOnScaledown`, Pod is set as the owner of the PVCs created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it may be more clear here to say explicitly that an ownerref is created rather than just "pod is set as the owner"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be made more clear by rearranging like his:
A PersistentVolumeClaimReclaimPolicy
of RemoveOnScaledown
affects both changing the number of replicas in the StatefulSet as well as StatefulSet deletion. When the controller finds that the number of replicas is being reduced, it identifies the pods that are being deleted as well as any PVCs created from the VolumeClaimTemplate of those pods. The OwnerRef of those PVCs is set to the associated pod. When a Pod is deleted, this causes the associated PVC to be deleted as well. When the RemoveOnScaledown
policy is set etc (keep what you have written).
6e39193
to
dd324aa
Compare
86cc94f
to
13f1c05
Compare
will be added to the StatefulSet. This field represents the user indication on whether the | ||
associated PVCs can be deleted or not. The default policy would be `Retain`. | ||
|
||
If `PersistentVolumeClaimReclaimPolicy` is set to `RemoveOnScaledown`, Pod is set as the owner of the PVCs created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be made more clear by rearranging like his:
A PersistentVolumeClaimReclaimPolicy
of RemoveOnScaledown
affects both changing the number of replicas in the StatefulSet as well as StatefulSet deletion. When the controller finds that the number of replicas is being reduced, it identifies the pods that are being deleted as well as any PVCs created from the VolumeClaimTemplate of those pods. The OwnerRef of those PVCs is set to the associated pod. When a Pod is deleted, this causes the associated PVC to be deleted as well. When the RemoveOnScaledown
policy is set etc (keep what you have written).
- Add unit, functional, upgrade and downgrade tests to automated k8s test. | ||
|
||
|
||
<!-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the comment block.
@@ -0,0 +1,590 @@ | |||
<!-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this comment block, and the others.
There is a new field getting added to the StatefulSet. The upgrade will not | ||
change the previously expected behaviour of existing Statefulset. | ||
|
||
If the statefulset had been set with the RemoveOnStatefulSetDeletion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe say
On a downgrade, the PersistentVolumeClaimReclaimPolicy field will be removed from any StatefulSets. If a scaledown or delete is in process when the downgrade happens, any existing OwnerRefs on PVCs will not be removed so that in most cases when the scaledown completes, unused PVCs will be deleted. However, there may be edge cases causing only some of the unused PVCs to be deleted. As unused PVCs remaining after a scaledown is the expected behavior of the downgraded clusters no further effort will be made to remove them.
Hence no change in any user visible behaviour change by default. | ||
|
||
* **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?** | ||
Yes, but with side effects for users who already started using the feature by means of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will be sufficient to say yes here, and refer to the downgrade section above on what happens on rollback.
I think that disabling the feature would also mean removing the new policy field?
I'm not sure what you mean by annotating the PVCs here.
feature flag was diabled, the PVCs could get deleted. | ||
|
||
* **What happens if we reenable the feature if it was previously rolled back?** | ||
The reconcile loop which removes references on disablement will not come into action. Since the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have no effect, as the policy fields were removed during the rollback, so when reenabled all StatefulSets in the cluster will be using the retain policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way feature disabling/rollback works is that the field remains stored in etcd, however controllers will not process the field (because it's protected by a feature gate). Then if you re-enable the feature again, then you can start processing the field again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I think that means that the RemoveOnDeletion needs to be reconciled, that is make sure associated PVCs have ownerrefs to the statefulset.
I don't see this as too much of a problem if things aren't exactly consistent. For example, one might set RemoveOnScaledown on a 10 replica statefulset, disable the feature, scale down to 6 pods, then re-enable the feature; PVCs 7-10 are going to not have been deleted and we can't be expected to delete them. So if the disabling or re-enabling happens during a scaledown and some PVCs are missed or deleted, the behavior is almost the same as before when we were okay with the orphaned PVCs 7-10.
The main thing I think is to make sure "future" scaledown/deletions work correctly, and as long as we get the statefulset ownerref for RemoveOnDeletion that will be covered.
should happen only after the Pod gets deleted. Since the Pod ownership has `blockOwnerDeletion` set to | ||
`true` pods will get deleted before the StatefulSet is deleted. The `blockOwnerDeletion` for PVCs will | ||
be set to `false` which ensures that PVC deletion happens only after the StatefulSet is deleted. This | ||
chain of ownership ensures that Pod deletion occurs before the PVCs are deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if I delete a RemoveOnStatefulSetDeletion
StatefulSet with orphan policy? Will the PVCs be gone while the Pods stay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the orphan policy do you mean kubectl delete --cascade=false?
In that case I think nothing should be deleted other than the statefulset resource, the semantics for cascade seem clear. That would mean removing any ownership IIUC.
At any rate, even if the PVCs were deleted, the protection controller would prevent them from being finalized until the pods were deleted. But it might be unexpected to have the PVCs disappear when the pods are eventually deleted.
be slower than scale ups relying on existing PVCs(from earlier scale ups). | ||
|
||
User would set the `PersistentVolumeClaimDeletePolicy` as 'RemoveOnScaledown' ensuring PVCs are deleted when corresponding | ||
Pods are deleted. New Pods created during scale up followed by a scaledown will wait for freshly created PVCs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a real world example of such an application ?if PVC should be deleted when pod is deleted, how is this different than a pod using an emptyDir ? I would vote for some more concrete use cases here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The volume is retained if the pod is rescheduled, eg if the node goes down or is upgraded.
+1 to adding the explicitly so it's clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or statefulset rolling update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update the pod rescheduling note shortly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk-src is swamped and has agreed to let me clone this PR and try to get it in for the enhancements freeze. I'll do that shortly and we can continue the discussion there. I'll apply all the suggestions in comments on this doc, or continue the conversation, on the new PR.
should happen only after the Pod gets deleted. Since the Pod ownership has `blockOwnerDeletion` set to | ||
`true` pods will get deleted before the StatefulSet is deleted. The `blockOwnerDeletion` for PVCs will | ||
be set to `false` which ensures that PVC deletion happens only after the StatefulSet is deleted. This | ||
chain of ownership ensures that Pod deletion occurs before the PVCs are deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the orphan policy do you mean kubectl delete --cascade=false?
In that case I think nothing should be deleted other than the statefulset resource, the semantics for cascade seem clear. That would mean removing any ownership IIUC.
At any rate, even if the PVCs were deleted, the protection controller would prevent them from being finalized until the pods were deleted. But it might be unexpected to have the PVCs disappear when the pods are eventually deleted.
I have created #2440. Let's move the party over there. Thanks! |
…4050) Approach inspired by kubernetes/enhancements#1915 Adds an new volumeClaimDeletePolicy to the Elasticsearch Spec on the cluster level (not per NodeSet) apiVersion: elasticsearch.k8s.elastic.co/v1 kind: Elasticsearch metadata: name: es spec: version: 7.10.1 volumeClaimDeletePolicy: DeleteOnScaledownAndClusterDeletion nodeSets: - name: default count: 2 Possible values are DeleteOnScaledownAndClusterDeletion (default), DeleteOnScaledownOnly. RemoveOnScaledownAndClusterDeletion relies on an owner reference pointing to the Elasticsearch resource to garbage collect PVCs once the Elasticsearch cluster has been deleted (existing behaviour). It also runs additional garbage collection to remove PVCs on each reconciliation that are no longer in use because either the whole node set has been removed or individual nodes have been scaled down (existing behaviour). RemoveOnScaledownOnly means the PVCs are kept around after the cluster has been deleted. This is implemented by removing the owner reference. Removal of PVCs on scale down happens as before. Switching from one to the other strategy is allowed and is implemented by avoiding the StatefulSet templating mechanism. This is mainly because the PVC template in StatefulSets are considered immutable and it would require StatefulSets to be recreated in order to change the PVC ownership. Instead the operator edits the PVCs after they have been created by the StatefulSet controller.
/close |
@mattcary: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Auto remove PVCs created by StatefulSet
Issue: #1847