
Fix lifecycle and ControllerUnpublishVolume description #533

Conversation

YuikoTakada

What type of PR is this?
This PR adds a new transition that moves directly from PUBLISHED / VOL_READY to CREATED via the ControllerUnpublishVolume RPC when the control plane cannot reach the node (for example, when the node is shut down due to a hardware failure or a software problem).

What this PR does / why we need it:
The current CSI volume lifecycle is not designed for the case where the node is unreachable.

When a node is shut down or in a non-recoverable state, such as a hardware failure or a broken OS, the Node Plugin cannot issue NodeUnpublishVolume / NodeUnstageVolume. In this case, we want the volume to reach the CREATED state (the volume is detached from the node, and pods are evicted to another node and running).
But in the current CSI volume lifecycle, there is no transition from PUBLISHED / VOL_READY / NODE_READY to CREATED.
As a result, k8s does not follow the CSI spec: the status moves from PUBLISHED directly to CREATED without going through the VOL_READY and/or NODE_READY states.

We need to update the CSI volume lifecycle to account for the case where the node is unreachable.
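The problem above can be sketched as a tiny state machine. This is an illustrative model only: the state names come from the spec's lifecycle figure, and the `unwind` helper is a hypothetical sketch of the per-state teardown RPCs, showing how a volume gets stuck short of CREATED when the node plugin is unreachable.

```go
package main

import "fmt"

// Volume lifecycle states from the CSI spec's lifecycle figure.
type State string

const (
	Created   State = "CREATED"
	NodeReady State = "NODE_READY"
	VolReady  State = "VOL_READY"
	Published State = "PUBLISHED"
)

// specTransitions models the per-state teardown step defined by the
// current spec: each teardown RPC moves the volume back exactly one state.
var specTransitions = map[State]State{
	Published: VolReady,  // NodeUnpublishVolume
	VolReady:  NodeReady, // NodeUnstageVolume
	NodeReady: Created,   // ControllerUnpublishVolume
}

// unwind walks the teardown path. NodeUnpublishVolume and NodeUnstageVolume
// require the node plugin, so if the node is unreachable the walk stalls
// before reaching CREATED; the spec offers no further transition.
func unwind(from State, nodeReachable bool) State {
	cur := from
	for cur != Created {
		if !nodeReachable && cur != NodeReady {
			return cur // stuck
		}
		cur = specTransitions[cur]
	}
	return cur
}

func main() {
	fmt.Println(unwind(Published, true))  // reaches CREATED
	fmt.Println(unwind(Published, false)) // stuck at PUBLISHED
}
```

The transition this PR proposes would, in these terms, let ControllerUnpublishVolume map PUBLISHED and VOL_READY directly to CREATED when the node cannot be reached.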

Which issue(s) this PR fixes:

Fixes #512

Special notes for your reviewer:

Does this PR introduce an API-breaking change?:

none

@YuikoTakada
Author

@jdef This PR fixes #512 . Please take a look.

@YuikoTakada YuikoTakada changed the title Fix lifecycle Fix lifecycle and ControllerUnpublishVolume description Dec 15, 2022
@jdef
Member

jdef commented Dec 15, 2022

(a) is there value having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume? (i.e. via some flag field in the request message) this could possibly be a valuable signal to the SP that additional validation/GC/state-repair needs to happen on the backend. perhaps SP authors can chime in here.

(b) this is a breaking change w/ respect to SP expectations re: volume lifecycle. it may make sense to guard this w/ a specific capability in the controller service. and only then should a CO have expectations that such state transitions are reliable. otherwise SPs may receive calls in an unexpected/unsupported order. k8s isn't the only CO on the block, and just because it breaks the rules currently doesn't mean that all SPs need to support that pattern.
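Suggestions (a) and (b) could be sketched roughly as follows. Everything here is hypothetical: the `SkippedNodeCleanup` field and the `UNPUBLISH_WITHOUT_NODE_CLEANUP` capability do not exist in the CSI spec; they only illustrate what the flag and the capability guard might look like from the SP side.

```go
package main

import (
	"errors"
	"fmt"
)

// (a) A hypothetical flag on the request telling the SP that the CO
// skipped the node-side cleanup calls (NodeUnpublishVolume / NodeUnstageVolume).
type ControllerUnpublishVolumeRequest struct {
	VolumeID           string
	NodeID             string
	SkippedNodeCleanup bool // hypothetical flag field
}

// (b) A hypothetical controller-service capability the SP advertises to
// opt in to out-of-order teardown; without it, the call is rejected.
const UnpublishWithoutNodeCleanup = "UNPUBLISH_WITHOUT_NODE_CLEANUP"

type Controller struct {
	capabilities map[string]bool
}

func (c *Controller) ControllerUnpublishVolume(req ControllerUnpublishVolumeRequest) error {
	if req.SkippedNodeCleanup {
		if !c.capabilities[UnpublishWithoutNodeCleanup] {
			return errors.New("SP does not support bypassing node cleanup")
		}
		// The flag signals that extra validation / GC / state repair may
		// be needed on the backend (e.g. a cloud-style "force detach").
		fmt.Println("force-detaching", req.VolumeID, "from", req.NodeID)
		return nil
	}
	fmt.Println("detaching", req.VolumeID, "from", req.NodeID)
	return nil
}

func main() {
	c := &Controller{capabilities: map[string]bool{UnpublishWithoutNodeCleanup: true}}
	_ = c.ControllerUnpublishVolume(ControllerUnpublishVolumeRequest{
		VolumeID: "vol-1", NodeID: "node-1", SkippedNodeCleanup: true,
	})
}
```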

@YuikoTakada
Author

There are already COs whose implementations violate the CSI spec. They violate it reluctantly, because there is no transition for the case where a node is shut down or in a non-recoverable state. It makes sense to align the spec with the actual implementations.

@jdef
Member

jdef commented Dec 20, 2022

reluctant COs or otherwise, this change seems to break the contract defined in the spec. it's unclear whether my proposals are under consideration. do you have any feedback re: the suggestions I made?

@yosshy

yosshy commented Dec 22, 2022

This PR looks like it has room to describe more detail, but I believe its direction is right because:

  1. The CSI spec lacks consideration for forced volume detachment on a failed node.
  2. K8s already makes an undocumented (but reasonable) ControllerUnpublishVolume call to detach an attached volume from a failed node.
  3. CSI plugins, including NetApp Trident, allow (2).

See logs for details.

If someone appends ANOTHER call to the spec for this case, as in #477, we will have to modify K8s and the CSI plugins that support (2). We don't want to change them because they are working very well now. So it's better to modify the spec to follow the current as-is workflow, which is what this PR does.

@tgross

tgross commented Jan 4, 2023

(a) is there value having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume? (i.e. via some flag field in the request message) this could possibly be a valuable signal to the SP that additional validation/GC/state-repair needs to happen on the backend. perhaps SP authors can chime in here.

With Nomad, our experience is that the primary reason we need to bypass the proper cleanup calls is that we can't communicate with the node plugin (e.g., its host isn't running). Right now we don't track the failure state specifically and just log the event and bump the internal state machine along, but it's totally feasible for us to implement. For SP plugins like AWS EBS, the controller could use that information to force-detach instead of detach.

(b) this is a breaking change w/ respect to SP expectations re: volume lifecycle. it may make sense to guard this w/ a specific capability in the controller service. and only then should a CO have expectations that such state transitions are reliable. otherwise SPs may receive calls in an unexpected/unsupported order. k8s isn't the only CO on the block, and just because it breaks the rules currently doesn't mean that all SPs need to support that pattern.

As a CO implementer I'd suggest that if we were to have such a capability, the spec should provide specific guidance on what the CO and SP should expect as the correct behavior when that capability is missing. By a strict reading of the current spec, if the node plugin is unavailable the state machine cannot continue, and the volume is unrecoverably lost from the perspective of the CO. Needless to say, this is an awful user experience. So if the spec is going to be particular about it, I'd love to be able to point at the plugin author who isn't providing that capability, so that we can suggest our users use a better plugin. 😁
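The fallback behavior described here could look roughly like this on the CO side. The capability name is hypothetical and the returned tokens are placeholders; the point is that without the capability, a strict reading of the spec leaves the CO with no recoverable path.

```go
package main

import "fmt"

// Hypothetical capability name; not part of the CSI spec.
const UnpublishWithoutNodeCleanup = "UNPUBLISH_WITHOUT_NODE_CLEANUP"

// teardownAction sketches the CO's decision for a volume whose node may
// be unreachable, given the SP's advertised controller capabilities.
func teardownAction(nodeReachable bool, spCaps map[string]bool) string {
	switch {
	case nodeReachable:
		return "normal" // NodeUnpublish -> NodeUnstage -> ControllerUnpublish
	case spCaps[UnpublishWithoutNodeCleanup]:
		return "bypass" // ControllerUnpublish with a skipped-cleanup signal
	default:
		return "stuck" // unrecoverable without operator intervention
	}
}

func main() {
	caps := map[string]bool{UnpublishWithoutNodeCleanup: true}
	fmt.Println(teardownAction(false, caps))
}
```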

K8s already has an undocumented (but reasonable) ControllerUnpublishVolume call to detach an attached volume to a failed node.

fwiw, Nomad implements something similar as well.

@YuikoTakada
Author

@jdef

(a) is there value having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume? (i.e. via some flag field in the request message) this could possibly be a valuable signal to the SP that additional validation/GC/state-repair needs to happen on the backend. perhaps SP authors can chime in here.

Thank you for the good advice. I also think your ideas, having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume, i.e. via some flag field in the request message, are good.
I've updated just the figure, so PTAL. If you think this change is good, I'll update the ControllerUnpublishVolume description as well.

@bswartz
Contributor

bswartz commented Jan 17, 2023

It's true that today k8s already violates the spec, but only in situations where it's known to be safe. The workarounds that exist to ensure safety when doing this still rely on external knowledge, though, which means they either need a human in the loop (not suitable for lights-out operation) or rely on some external component with special access to machine state (not available in all deployments).

I'm going to put some more effort into bringing back my proposed solution in #477. The remaining objections seem to be related to the fact that the proposal only covers the CSI layer, and people want a proposal that covers the upper layers too. To this end I will write up a KEP that describes a comprehensive proposal.

@jsafrane
Contributor

(a) is there value having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume?

Yes, as mentioned above, some cloud-y CSI drivers could invoke a "force detach". Some more traditional (FC, iSCSI) drivers could at least try to clean up the LUN mapping on the server (target) side, because the client (initiator) is down / unreachable.

@jsafrane
Contributor

(b) this is a breaking change w/ respect to SP expectations re: volume lifecycle. it may make sense to guard this w/ a specific capability in the controller service. and only then should a CO have expectations that such state transitions are reliable. otherwise SPs may receive calls in an unexpected/unsupported order. k8s isn't the only CO on the block, and just because it breaks the rules currently doesn't mean that all SPs need to support that pattern.

As a CO implementer I'd suggest that if we were to have such a capability, the spec should provide specific guidance to what the CO and SP should expect as to the correct behavior if that capability is missing. Because by a strict reading of the current spec if the node plugin is unavailable the state machine cannot continue and the volume is unrecoverably lost from the perspective of the CO.

As another CO implementer (Kubernetes), I have the same point. Only a human can bring the volume back to a working state, because the CO has no message that would tell the SP to recover the volume from such a state.

@YuikoTakada
Author

@bswartz Thank you for your comments. If you can refine #477 or create another new PR, I'd be glad to discuss it.

@YuikoTakada
Author

@bswartz I read #477 again. #477 suggests creating new states, QUARANTINED_SP and QUARANTINED_S. This means that if #477 is merged, COs (of course including Kubernetes) will not work well, and many COs will need to be updated. That has a big impact.
Could you please consider refining #477 so that it doesn't introduce a new state and instead just makes the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume (i.e. via some flag field in the request message)?
Thanks!

@yosshy

yosshy commented Feb 11, 2023

I'm going to put some more effort into bringing back my proposed solution in #477. The remaining objections seem to be related to the fact that the proposal only covers the CSI layer and people want to a proposal that covers the upper layers too. To this end I will write up a KEP that describes a comprehensive proposal.

We found that k8s calls NodeStageVolume and NodePublishVolume for attached volumes on a restarted node. The status of such an attached volume is PUBLISHED, so this is another violation of the state transition diagram in Fig. 6, but it's reasonable behavior for COs because the attached volume might have been detached before the node restarted. Please consider this when you update #477.

BTW, #477 looks like storage fencing, which was the first plan for non-graceful shutdown in k8s, but that was eventually changed to use node fencing. Is my understanding incorrect?

@bswartz
Contributor

bswartz commented Jan 31, 2024

We intend to fix the broken behavior in Kubernetes:
kubernetes/kubernetes#120344

Initially the fix will be opt-in, but eventually the old spec-violating behavior will be removed entirely. Based on that plan, this PR is no longer needed.

@bswartz bswartz closed this Jan 31, 2024
@YuikoTakada
Author

@bswartz Thank you for your notice. I'll check it.


Successfully merging this pull request may close these issues.

Volume Lifecycle is not correspond to actual k8s behavior
6 participants