
Add new GroupControllerService and VolumeGroupSnapshot CSI RPCs #519

Merged: 13 commits, Feb 21, 2023

Conversation

@xing-yang (Contributor) commented Aug 25, 2022

KEP in Kubernetes: kubernetes/enhancements#1551

@nixpanic left a comment

Thanks for proposing these changes! Except for missing secrets in some of the procedures, we should be able to implement VolumeGroups with Ceph-CSI this way.

@xing-yang force-pushed the volume_group branch 2 times, most recently from a5808fb to cf06518 on September 14, 2022 13:57
@xing-yang changed the title from "WIP: Add VolumeGroup and VolumeGroupSnapshot CSI RPCs" to "Add VolumeGroup and VolumeGroupSnapshot CSI RPCs" on Sep 14, 2022
@jdef (Member) commented Sep 14, 2022

This model first-classes the "group" concept as a thing w/ its own metadata. Is that really needed? Does CSI really care about group metadata? If so, why?

What if the model was inverted, e.g. the Volume concept was extended w/ a list of "group names", which would just be opaque identifiers from a CSI perspective. We didn't add "Zone" or "Topology" as first-class objects managed by CSI RPCs - why would we do the same for groups?

If we're aiming for the simplest API that still provides value, this current proposal seems a bit heavy. Consistency (for snapshot) groups, replication groups, isolation groups, etc. definitely have value for different use cases. I wonder why CSI cares to distinguish among them at this point? If a volume object could self-identify as belonging to groups "a" and "b", isn't that enough to begin implementing support for some of the desired, advanced use cases from the perspective of both the CO and SP? If not, why?

If the answer is "because the KEP needs these things" then my next question is "why does the KEP propose separate CRDs to manage groups vs extending the Volume k8s resource w/ group attributes" - but that's arguably a question that belongs on the KEP thread :)

@xing-yang (Contributor Author)

> This model first-classes the "group" concept as a thing w/ its own metadata. Is that really needed? Does CSI really care about group metadata? If so, why?
>
> What if the model was inverted, e.g. the Volume concept was extended w/ a list of "group names", which would just be opaque identifiers from a CSI perspective. We didn't add "Zone" or "Topology" as first-class objects managed by CSI RPCs - why would we do the same for groups?
>
> If we're aiming for the simplest API that still provides value, this current proposal seems a bit heavy. Consistency (for snapshot) groups, replication groups, isolation groups, etc. definitely have value for different use cases. I wonder why CSI cares to distinguish among them at this point? If a volume object could self-identify as belonging to groups "a" and "b", isn't that enough to begin implementing support for some of the desired, advanced use cases from the perspective of both the CO and SP? If not, why?
>
> If the answer is "because the KEP needs these things" then my next question is "why does the KEP propose separate CRDs to manage groups vs extending the Volume k8s resource w/ group attributes" - but that's arguably a question that belongs on the KEP thread :)

@jdef That’s an interesting idea. If we only need to support group snapshots, then just adding a group name attribute to a volume object would be sufficient. If we want to support more group types, however, relying only on attributes in volume objects could be confusing. A group could be for group snapshots, group replication, placement, etc., so in addition to the group name we also need to specify what type of group it is. Although this PR has not discussed replication groups, the plan is to make this design generic so that it can be extended to support replication groups in the future. A replication group has additional group-specific attributes, e.g., whether it is primary or secondary, and we would also want to introduce RPCs to support operations such as failover and failback at the group level.

Regarding why we didn’t add “Zone” or “Topology” as first-class objects in CSI, my understanding is that those are not internal attributes of the storage system. Support for zone and topology does not consider the internal topology of the storage system. That’s why we are also exploring how to use the group concept to handle placement, which is associated with the storage system’s internal topology.
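
To make the distinction concrete, here is a purely hypothetical sketch in the spec's proto style (this enum and its values are not part of the proposal) of how a group's purpose could be modeled:

enum VolumeGroupType {
  // Hypothetical sketch only; not part of this PR.
  VOLUME_GROUP_TYPE_UNSPECIFIED = 0;
  // Group whose members can be snapshotted together consistently.
  VOLUME_GROUP_TYPE_SNAPSHOT = 1;
  // Group whose members are replicated together (primary/secondary,
  // failover, failback).
  VOLUME_GROUP_TYPE_REPLICATION = 2;
  // Group used to express placement within the storage system's
  // internal topology.
  VOLUME_GROUP_TYPE_PLACEMENT = 3;
}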

@nixpanic left a comment

Thanks for the update! With the secrets added it should be possible for Ceph-CSI to implement the new procedures.

@xing-yang (Contributor Author)

A few people have reviewed the design and confirmed it works for their storage systems. See details here:
https://docs.google.com/document/d/1c_48xZuuvyfYrvgIHgwNFf4AHGrbvaUsbRdKfvtRLdQ/edit#heading=h.gnlmzi84mhax

@rbo54 left a comment

Overall these changes look very good to me. I did leave a couple of comments but they are only advisory. I believe we can implement this specification for our arrays.

@msau42 commented Oct 6, 2022

> @jdef That’s an interesting idea. If we only need to support group snapshots, then just adding a group name attribute to a volume object would be sufficient. If we want to support more group types, however, relying only on attributes in volume objects could be confusing. A group could be for group snapshots, group replication, placement, etc., so in addition to the group name we also need to specify what type of group it is. Although this PR has not discussed replication groups, the plan is to make this design generic so that it can be extended to support replication groups in the future. A replication group has additional group-specific attributes, e.g., whether it is primary or secondary, and we would also want to introduce RPCs to support operations such as failover and failback at the group level.

Following this idea a bit more, is it required that the group needs to distinguish its purpose/type? At least in the current proposal, the Group does not indicate whether it's for snapshots or other purposes, and there's a separate GroupSnapshot. I imagine in the future, too, you could use the same group for multiple purposes, such as replication.

Should we consider this alternative? Or at least see where we run into difficulties?

@xing-yang (Contributor Author) commented Oct 6, 2022

> > @jdef That’s an interesting idea. If we only need to support group snapshots, then just adding a group name attribute to a volume object would be sufficient. If we want to support more group types, however, relying only on attributes in volume objects could be confusing. A group could be for group snapshots, group replication, placement, etc., so in addition to the group name we also need to specify what type of group it is. Although this PR has not discussed replication groups, the plan is to make this design generic so that it can be extended to support replication groups in the future. A replication group has additional group-specific attributes, e.g., whether it is primary or secondary, and we would also want to introduce RPCs to support operations such as failover and failback at the group level.
>
> Following this idea a bit more, is it required that the group needs to distinguish its purpose/type? At least in the current proposal, the Group does not indicate whether it's for snapshots or other purposes, and there's a separate GroupSnapshot. I imagine in the future, too, you could use the same group for multiple purposes, such as replication.
>
> Should we consider this alternative? Or at least see where we run into difficulties?

When we support group snapshots, replication groups, etc., it is required to distinguish the group type because the CSI driver may need to call different APIs for different group types.

When we take a group snapshot, the CSI driver needs to call a storage API to do so, and that API needs a list of volumes as input. If we don't have a VolumeGroup, then we still need to add new RPCs that take a list of volumes as input.
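
As an illustration of that point, a minimal sketch of a group-snapshot request that carries a list of volume IDs (field names and numbers here are illustrative, not the exact text merged into the spec):

message CreateVolumeGroupSnapshotRequest {
  // Suggested name for the group snapshot, chosen by the CO.
  string name = 1;
  // IDs of the volumes to be snapshotted together as one group.
  repeated string source_volume_ids = 2;
  // Secrets required by the plugin, if any.
  map<string, string> secrets = 3;
  // Plugin-specific parameters.
  map<string, string> parameters = 4;
}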

@jdef (Member) commented Oct 6, 2022 via email

@xing-yang (Contributor Author)

> I'm trying to understand the leap from volume lifecycle mgmt to group lifecycle mgmt. Why does a CO need to explicitly create or modify or destroy groups? Can't a CO use a group, perhaps created out of band, by name ref? Dropping VolumeGroup does not preclude you from passing a groupName parameter to a snapshot rpc. Asking CSI API to enumerate all vendor group features seems pitted against the desire to maintain a lowest common denominator, minimal API.

@jdef, you replied before I edited my previous response to remove its last sentence. Actually, even if we only support group snapshots, we still need a new CSI CreateGroupSnapshot RPC that allows us to pass a list of volume IDs to the CSI driver. CSI aims to support multiple storage systems: for some, simply passing a volume group name is enough, but for others a list of volume IDs is required as well.

@msau42 commented Oct 6, 2022

> When we support group snapshots, replication groups, etc., it is required to distinguish the group type because the CSI driver may need to call different APIs for different group types.
>
> When we take a group snapshot, the CSI driver needs to call a storage API to do so, and that API needs a list of volumes as input. If we don't have a VolumeGroup, then we still need to add new RPCs that take a list of volumes as input.

When you create a volume group, do you need to state what its purpose is during group creation? For my reference, can you point to a storage system that would need this? If this is common, then maybe we should not have a generic group concept.

@xing-yang (Contributor Author)

> When you create a volume group, do you need to state what its purpose is during group creation? For my reference, can you point to a storage system that would need this? If this is common, then maybe we should not have a generic group concept.

@msau42 Not every driver needs to differentiate that, but some do. Here's the documentation of the Generic Volume Group in OpenStack Cinder: https://docs.openstack.org/cinder/latest/contributor/groups.html. It was extended to support both group snapshots and replication groups, and it was implemented by multiple storage vendors. At least one Cinder driver, from Dell EMC, calls different APIs to create a group that supports consistent group snapshots versus a group that supports replication.

@jdef (Member) left a comment

I took a spin through the PR and left some notes. Why are secrets everywhere - seems like overkill?

This adds a TON of surface area. Has anyone considered breaking volume group things into a separate service? VolumeGroupController or something - because, really, there are a lot of RPCs here that a plugin has to implement (at least stubs) when it doesn't support volume groups at all. We've tended to lean away from defining new services, but maybe a new alpha service in this case makes sense?

Volume groups, as a concept, feel pretty opinionated here. I still wonder if there's a simpler representation that we can aim for.

@xing-yang (Contributor Author)

> I took a spin through the PR and left some notes. Why are secrets everywhere - seems like overkill?

@jdef Regarding the secrets: there are CSI drivers that require a secret to be passed in to every volume operation. For that reason, we have added secrets to most existing RPCs. Since we are introducing new RPCs in this PR, we are adding secrets to them at the same time, so that we don't have to submit follow-up PRs to add secrets to these RPCs in the future.

I'll get back to your other comments later. Thanks!
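
For reference, the recurring secrets pattern described above looks roughly like this in the spec's proto style (a sketch; field numbers and the exact message contents may differ from the merged text):

message DeleteVolumeGroupSnapshotRequest {
  // ID of the group snapshot to delete.
  string group_snapshot_id = 1;
  // Secrets required by the plugin to complete the deletion, if any,
  // marked with the csi_secret option so COs treat them as sensitive.
  map<string, string> secrets = 2 [(csi_secret) = true];
}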

@xing-yang (Contributor Author)

Hi @jdef, I addressed all your comments. Please take a look. Thanks.

@xing-yang changed the title from "Add VolumeGroup and VolumeGroupSnapshot CSI RPCs" to "Add VolumeGroupSnapshot CSI RPCs" on Dec 20, 2022
@xing-yang changed the title from "Add VolumeGroupSnapshot CSI RPCs" to "Add new GroupControllerService and VolumeGroupSnapshot CSI RPCs" on Dec 20, 2022
@xing-yang (Contributor Author)

I made the following changes:

  • Added a new GroupControllerService to address comments from @jdef, and moved the VolumeGroupSnapshot RPCs to the new service (see the sketch below).
  • Based on discussions in the last CSI community sync, I removed the VolumeGroup RPCs and kept only the VolumeGroupSnapshot RPCs for supporting volume group snapshots. A group for volumes will be introduced in the future when we work on ReplicationGroup.
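
A rough sketch of the shape of the new group controller service described above (names follow the discussion in this thread; the merged spec is authoritative for the exact RPC and message definitions):

service GroupController {
  // Reports which group controller capabilities the plugin supports.
  rpc GroupControllerGetCapabilities (GroupControllerGetCapabilitiesRequest)
    returns (GroupControllerGetCapabilitiesResponse) {}

  // Takes a snapshot of a set of volumes as one group.
  rpc CreateVolumeGroupSnapshot (CreateVolumeGroupSnapshotRequest)
    returns (CreateVolumeGroupSnapshotResponse) {}

  // Deletes a previously created group snapshot.
  rpc DeleteVolumeGroupSnapshot (DeleteVolumeGroupSnapshotRequest)
    returns (DeleteVolumeGroupSnapshotResponse) {}

  // Returns information about an existing group snapshot.
  rpc GetVolumeGroupSnapshot (GetVolumeGroupSnapshotRequest)
    returns (GetVolumeGroupSnapshotResponse) {}
}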

spec.md (excerpt under discussion):

> In VolumeGroupSnapshot message, both snapshots and group_snapshot_id are required fields.
>
> If an error occurs before all the individual snapshots are created when creating a group snapshot of multiple volumes and a group_snapshot_id is not yet available for CO to do clean up, SP SHOULD do clean up and make sure no snapshots are leaked.
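
For context, a rough sketch of the VolumeGroupSnapshot message being discussed (field numbers and exact types are illustrative):

message VolumeGroupSnapshot {
  // Identifier of the group snapshot on the storage system. REQUIRED.
  string group_snapshot_id = 1;
  // The individual snapshots that make up this group snapshot. REQUIRED.
  repeated Snapshot snapshots = 2;
  // Timestamp of when the group snapshot was taken.
  .google.protobuf.Timestamp creation_time = 3;
  // Whether the group snapshot is ready to use (post-processing done).
  bool ready_to_use = 4;
}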
@jdef (Member) commented Jan 31, 2023

If create-volume-group-snapshot (CVGS) fails after some subset of snapshots was created w/ some subset failing (because some post-processing op failed), then a subsequent CVGS RPC is expected to report the error... as long as the SP hasn't already cleaned up after the failed CVGS op, right?

Because if the SP is doing cleanup, and the cleanup happens very quickly then subsequent CVGS could kick off a secondary operation (that may or may not fail) like the first. This could happen in such a manner that the CO never actually observes the failure gRPC because the SP is so quick to clean up between CO invocations of (an async-failing) CVGS. It's entirely possible that in such cases, the CO could reissue CVGS calls indefinitely w/ the same parameter set, not knowing that it will never be completed by the SP because the SP is mopping things up so quickly - all the CO might see is that ready_to_use never becomes true.

Are there holes in the spec like this elsewhere, where the SP can perform GC in a manner that leaves the CO blind to underlying errors? In general, we've shied away from async operations in CSI. Designing w/ post-processing steps that complete out-of-band effectively makes the API async. Can we avoid that?

IMO this needs more thought: the CO should be able to observe a failure and report it back to either some controller, or admin. Which likely means that magical auto-cleanup for failed CVGS (and maybe even for single create-snapshot) ops, performed by the SP, should go away.

Thoughts?

@xing-yang (Contributor Author) replied

> If create-volume-group-snapshot (CVGS) fails after some subset of snapshots was created w/ some subset failing (because some post-processing op failed), then a subsequent CVGS RPC is expected to report the error... as long as the SP hasn't already cleaned up after the failed CVGS op, right?

create-volume-group-snapshot (CVGS) will not return success with a group_snapshot_id before all the snapshots are cut. Post-processing happens after that point. The sentence "SP SHOULD do clean up and make sure no snapshots are leaked" refers to the situation before all the snapshots are cut successfully. So if CVGS fails after some subset of snapshots was created, a gRPC error code is returned to the CO at that point; it does not rely on a subsequent call to report the error. The create operation is synchronous until those IDs are returned. Post-processing is async, but the CO can clean up those resources once post-processing starts, because it already has the IDs and does not need to rely on the SP for cleanup. I'll clarify in the spec that "SP clean up" refers to before post-processing starts.

> IMO this needs more thought: the CO should be able to observe a failure and report it back to either some controller, or admin. Which likely means that magical auto-cleanup for failed CVGS (and maybe even for single create-snapshot) ops, performed by the SP, should go away.

Cleanup by SP is needed if an error occurs before a CO can get group_snapshot_id and snapshot_ids. Without those IDs, the CO has no way to clean up even if it observes a failure. During our discussions, @bswartz did suggest adding a new delete API that takes all the parameters of a create operation. This applies not just to CreateSnapshot and CreateVolumeGroupSnapshot, but also to CreateVolume.

A contributor replied

I'll put an agenda item on tomorrow's monthly meeting to discuss this issue. It's a wider class of problems that affects failed creates for all CSI objects: volumes and snapshots included.

A contributor replied

When Xing and I discussed this, the idea was that the SP would only clean up when reporting a terminal error. Non-terminal errors would not trigger a clean up, and cleanup would never occur until after an error has been reported to the CO.

At the time I was able to convince myself that this guarantee was "good enough" to stay out of trouble, and follows the existing pattern of CSI.

@xing-yang (Contributor Author) replied

I updated the comments to this:

If an error occurs before all the individual snapshots are cut when creating a group snapshot of multiple volumes and a group_snapshot_id is not yet available for CO to do clean up, SP MUST return an error, and SP SHOULD also do clean up and make sure no snapshots are leaked.

@xing-yang (Contributor Author) replied

@jdef We discussed this in today's CSI Community Sync and decided not to introduce a name parameter in the Delete operation. Can you please review the updated PR and see if it looks good?

@xing-yang (Contributor Author)

Comments are addressed.

@torredil left a comment

COs may want to be able to require a minimum level of consistency for group snapshots. Users may want to be able to discover the consistency of a group snapshot that was previously created. To address these use cases, I propose adding a new optional field to CreateVolumeGroupSnapshotRequest and VolumeGroupSnapshot:

enum VolumeGroupSnapshotConsistency {
  CONSISTENCY_UNSPECIFIED = 0;
  CONSISTENCY_CRASH_CONSISTENT = 1;
  CONSISTENCY_FILESYSTEM_CONSISTENT = 2;
  CONSISTENCY_APPLICATION_CONSISTENT = 3;
}
message CreateVolumeGroupSnapshotRequest {
  ...
  // This field is OPTIONAL. Indicates that the request should fail
  // if the Plugin cannot guarantee snapshot consistency at the
  // requested level or higher.
  VolumeGroupSnapshotConsistency consistency = 5;
}
message VolumeGroupSnapshot {
  ...
  // This field is OPTIONAL. Indicates the consistency of the snapshot
  // as ascertained by the Plugin and the underlying storage provider
  // at the time that the group snapshot was created. A value of
  // CONSISTENCY_UNSPECIFIED indicates that the consistency level is
  // not known.
  VolumeGroupSnapshotConsistency consistency = 5;
}

@bswartz (Contributor) commented Feb 1, 2023

> COs may want to be able to require a minimum level of consistency for group snapshots. Users may want to be able to discover the consistency of a group snapshot that was previously created. To address these use cases, I propose adding a new optional field to CreateVolumeGroupSnapshotRequest and VolumeGroupSnapshot:
>
> enum VolumeGroupSnapshotConsistency {
>   CONSISTENCY_UNSPECIFIED = 0;
>   CONSISTENCY_CRASH_CONSISTENT = 1;
>   CONSISTENCY_FILESYSTEM_CONSISTENT = 2;
>   CONSISTENCY_APPLICATION_CONSISTENT = 3;
> }

It's true that better degrees of consistency exist and that end users may want to know about them. Unfortunately CSI is too low level to provide any guarantees other than "crash consistent". The better forms of consistency must be implemented at a higher level and thus should be tracked at a higher level.

@xing-yang (Contributor Author)

Note that there is a WIP KEP in Kubernetes that tries to address application consistency: kubernetes/enhancements#1995

@xing-yang (Contributor Author)

@saad-ali I have addressed your comment. PTAL. Thanks.

@xing-yang (Contributor Author)

@saad-ali Your comments are addressed. Please take another look. Thanks.

@saad-ali (Member) left a comment

LGTM
Thanks for all the hard work on this @xing-yang !!

@humblec (Contributor) commented Feb 8, 2023

Indeed, thanks @xing-yang for incorporating all the review feedback and getting to consensus!

@jdef (Member) left a comment

This is pretty close to the finish line. Mostly nits, except for concerns around the newly added delete RPCs (and dissonance w/ the newly added get RPC) - PTAL.

@xing-yang (Contributor Author)

@jdef @bswartz All new comments are addressed. Please take a look. Thanks.

@xing-yang (Contributor Author)

@jdef Your comments are addressed. PTAL. Thanks.

@jdef (Member) left a comment

LGTM

@jdef merged commit 47a0fa4 into container-storage-interface:master on Feb 21, 2023