Add support for multi-writer PD #415
Conversation
Welcome @sschmitt!
Hi @sschmitt. Thanks for your PR. I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
@@ -288,6 +297,66 @@ func (cloud *CloudProvider) insertRegionalDisk(ctx context.Context, volKey *meta
	return nil
}

func (cloud *CloudProvider) insertRegionalAlphaDisk(ctx context.Context, volKey *meta.Key, diskType string, capBytes int64, capacityRange *csi.CapacityRange, replicaZones []string, snapshotID, diskEncryptionKmsKey string, multiWriter bool) error {
	diskToCreateAlpha := &computealpha.Disk{
Thanks for doing this! It looks pretty good as-is, but I think there's also a way to de-dupe some of this copy-pasted code.
Assuming zonal here:
1) Have one `cloud.insertDisk` that takes in all the parameters
2) Create a v1 disk with logic up to line 318 here
3) Make `insertOp` an `interface{}` type (this part is kind of nasty, suggestions welcome):
    if multiWriter {
        alphadisk := convertV1DiskToV1AlphaDisk(disk) // you have to write this
        insertOp = cloud.alphaService.Insert(alphadisk)
    } else {
        insertOp = cloud.service.Insert(disk)
    }
    Then the rest of the error handling logic is shared.
4) Type-assert on `insertOp` and call the respective `waitForOp` on it
5) Share the other logic
This would dedupe most of the code and only have branching for the actual GCE API Insert call and the WaitForOp.
Let me know what you think. Happy to discuss more
/cc @msau42 @misterikkit
Thanks for the review. I notice the `waitForOp` methods only refer to `op.Name`. I wonder if it's okay to call the non-alpha operations API to check on the status of an alpha operation. My thinking there is that we're not leveraging alpha features of the operations API.
I'm not sure, so I'll err on the side of caution here. The solution you recommended works just fine.
Ack, I didn't notice that. We could just grab the op name then and avoid the whole type-assertion thing.
Wait... I looked at it again and we still might need to, since we're doing the op get on `svc.ZoneOperations.Get(project, zone, op.Name).Context(ctx).Do()` - that's the v1 service. If we have an alpha service op, maybe we can't "find it" unless we use a `v1alpha.ZoneOperations` call? Could you verify?
David, I ran some tests and found that the v1 operations API is able to get the status of beta and alpha operations.
This can easily be tested using the in-browser API Explorer:
https://cloud.google.com/compute/docs/reference/rest/beta/disks/insert
https://cloud.google.com/compute/docs/reference/rest/v1/zoneOperations/get
I think there also might be some opportunity to reduce the duplication between Zonal and Regional methods but I'll skip that for now. You can let me know your thoughts there. I'm happy to make additional changes.
In the meantime I'll address the rest of your comments and add another commit.
Let's keep this change focused on adding the multiwriter stuff - feel free to open up a fix to reduce zonal+regional duplication afterwards if you're interested. That would be really cool 👍
Ack on the ops - let's dedupe them and make sure to add a comment in the code with that exact finding (so someone doesn't come in and waste cycles trying to "fix" it later).
Mostly looks good, just some comments about code de-dupe and maybe some missed pieces.
Could you please write some E2E tests for this too? You can see examples in:
test/e2e/tests/single_zone_e2e_test.go
and run them locally with:
test/run-e2e-local.sh
	insertOp, err := cloud.alphaService.RegionDisks.Insert(cloud.project, volKey.Region, diskToCreateAlpha).Context(ctx).Do()
	if err != nil {
		if IsGCEError(err, "alreadyExists") {
			disk, err := cloud.GetDisk(ctx, volKey)
Does `GetDisk` need to be changed as well? If we are to get a disk that is multi-writer, what happens (since it is only on the alpha struct)?
You may be able to leverage `CloudDisk` here (originally created for this different API problem for RePD).
Yes, GetDisk has to be updated. CloudDisk looks to be the way to go.
			if err != nil {
				return err
			}
			err = cloud.ValidateExistingDisk(ctx, disk, diskType,
Validation might also need to be changed to check if the `multiWriter` field is equal.
@@ -346,14 +346,19 @@ func TestCreateVolumeArguments(t *testing.T) {
		},
	},
	{
		name: "fail with MULTI_NODE_MULTI_WRITER capability",
		name: "success with block/MULTI_NODE_MULTI_WRITER capabilities",
What about fs/MULTI_NODE_MULTI_WRITER? Should we fail?
@nikhilkathare Any details here?
@mattcary: Hi Matt, fs/MULTI_NODE_MULTI_WRITER has been handled above as the mount/MULTI_NODE_MULTI_WRITER capability.
@davidz627 I think the e2e is going to require the boskos projects to be whitelisted before we can merge it.
@davidz627 E2E is a little tricky because this feature is still in Alpha. I also notice that the E2E tests are hardcoded for us-central1. Multi-writer PD is currently only available in us-east1-a in whitelisted projects. Should I write E2E tests and leave them commented out or skip them?
This could probably be changed pretty easily - please do so unless it seems like it would be significant additional investment.
I will take a look at how to resolve this and try get the projects all whitelisted - I will update this PR when I know more.
Yes, have them skipped for now. However, you should be able to run the tests in your own whitelisted project[s]. Thanks!
@davidz627 Perfect. Thanks for the guidance.
…riter PD size of 200GB.
@@ -182,15 +182,22 @@ func (cloud *FakeCloudProvider) ValidateExistingDisk(ctx context.Context, resp *
		return fmt.Errorf("disk already exists with incompatible type. Need %v. Got %v",
			diskType, respType[len(respType)-1])
	}

	// We are assuming here that a multiWriter disk could be used as non-multiWriter
	if multiWriter && !resp.GetMultiWriter() {
How about the other way around?
The other way around would be when the existing disk is enabled for multi-writer but the user didn't ask for that capability. I assume here that the user would be okay with that.
It's actually quite challenging to check the opposite because the user might not have Alpha API access. I suppose we could first try alpha and if that fails fall back to v1.
pkg/gce-pd-csi-driver/controller.go
Outdated
@@ -298,7 +305,7 @@ func (gceCS *GCEControllerServer) ControllerPublishVolume(ctx context.Context, r
		PublishContext: nil,
	}

	_, err = gceCS.CloudProvider.GetDisk(ctx, volKey)
	_, err = gceCS.CloudProvider.GetDisk(ctx, volKey, gce.V1)
Why is this definitely `V1`? Couldn't it be an alpha multiwriter disk that we want to get here?
The v1 API returns disks that use alpha features, it just doesn't have any information with respect to those features. Here the API call is used to detect the existence of the disk. The disk itself is thrown away and only the error code is parsed. I didn't see a need to use anything other than v1.
	alphaDiskToCreate := convertV1DiskToAlphaDisk(diskToCreate)
	alphaDiskToCreate.MultiWriter = multiWriter
	insertOp, err = cloud.alphaService.RegionDisks.Insert(cloud.project, volKey.Region, alphaDiskToCreate).Context(ctx).Do()
	if err == nil {
nit: `if insertOp != nil` instead?
	var (
		err        error
		opName     string
		apiVersion = V1
Let's call this `gceAPIVersion` to disambiguate.
@davidz627 Are there any further concerns? You had a question here, I wasn't sure if you saw my response.
@@ -257,7 +257,11 @@ func testLifecycleWithVerify(volID string, volName string, instance *remote.Inst
	if secondMountVerify != nil {
		// Mount disk somewhere else
		secondPublishDir := filepath.Join("/tmp/", volName, "secondmount")
Sorry if I'm missing something from the test setup, but will this test multizonal shared PD?
Shared PD is not yet supported on multizone. This is the fix added to handle the function testLifecycleWithVerify with useBlock=true.
Thanks for clarifying.
/assign @mattcary
Thanks for resurrecting this PR, Nikhil. It looks good to me, just a couple of general questions:
Hi Matt, Thanks for your review.
Cool, thanks for the info.
/lgtm
@nikhilkathare this lgtm! Thanks for picking this up! One last thing, would you be able to squash the commits to reduce some of the "merge branch" and "address comments" commits? If you haven't tried it before, I would suggest practicing in another branch before doing it here. Take a look here for pointers on how to do the squashing.
Force-pushed from 9eb436f to 6af5e68.
/cc @msau42 @mattcary Thanks for reviewing the code. I tried to squash the commits using rebase, but I could squash only the final commits (the last 5), which were done by me. When I tried to squash further, rebase errored out with conflicts while applying commits and exited for me to resolve them manually in other commits. Since the commits in this PR were made with a lot of gap between them, many other commits have landed in between, which the squash is not able to resolve. Should we create a new branch and move the change from this branch to the new one, or is there a better approach? Let us know what approach you would like us to take.
Hi @nikhilkathare, can you try just squashing your own commits then? It's not a big deal if we can't get it to work; it's mainly to try to clean up the commit listing a little bit.
@msau42 Thanks for the quick response. Yes, I have squashed my five changes into one and pushed; this led to the removal of the lgtm tag from the PR. No further squashing could be done, as there is a merge (which contains other commits) between our PR commits.
/lgtm Thank you so much!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: msau42, sschmitt. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/retest
/kind feature
/cc @msau42
What this PR does / why we need it:
Adds multi writer support (currently alpha in GCP).
Special notes for your reviewer:
There's a bit of code duplication due to alpha copies of certain methods. I found it challenging to reduce the duplication, but I'd be open to suggestions on how to refactor. On the other hand, it might just be a temporary situation until the APIs move to beta/GA.
Does this PR introduce a user-facing change?: