Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PodCidr is required even if "azure" networkPlugin is specified in kubernetesClustersClient.CreateOrUpdate method #2178

Closed
lfshr opened this issue Jul 3, 2018 · 25 comments
Labels
customer-reported Issues that are reported by GitHub users external to the Azure organization.

Comments

@lfshr
Copy link

lfshr commented Jul 3, 2018

NetworkProfile.PodCidr is required even if "azure" network is specified in the NetworkProfile.NetworkPlugin field. As far as I know this is only a mandatory field when kubenet is used.

@tombuildsstuff
Copy link
Contributor

👋

Looking into this from our side - I wonder if we could work around this if the validation wasn't present for this field and passing an empty string? Admittedly, I don't know if this field has server-side validation too; but that may allow this to work without changes in the service too?

@joshgav
Copy link
Contributor

joshgav commented Jul 18, 2018

Hi @tombuildsstuff, I believe the validation comes from the spec: https://github.com/Azure/azure-rest-api-specs/blob/master/specification/containerservices/resource-manager/Microsoft.ContainerService/stable/2018-03-31/managedClusters.json#L977

Following up with the containers team now.

@lfshr
Copy link
Author

lfshr commented Jul 19, 2018

@joshgav the spec has a default value. The SDK doesn't seem to adhere to this default value when no value is passed.

@marstr
Copy link
Member

marstr commented Jul 19, 2018

@tombuildsstuff says:
I wonder if we could work around this if the validation wasn't present for this field and passing an empty string? Admittedly, I don't know if this field has server-side validation too; but that may allow this to work without changes in the service too?

Need help working around the validation to test this?

@lfshr
Copy link
Author

lfshr commented Jul 19, 2018

@marstr I'm wondering if this belongs in autorest? See above comment.

I may be getting products confused. Does autorest generate the SDK from the spec?

@marstr
Copy link
Member

marstr commented Jul 19, 2018

You are not confused, @Ifshr :) AutoRest generates the SDKs from specs. Even if the issue is ultimately one with AutoRest, I don't mind having the conversation here.

With regards to the default value, it's by design that it not get included by our client. It's in the Swagger for the sake of documentation only:

As per https://swagger.io/specification/v2/ (Thanks @jhendrixMSFT)
Declares the value of the parameter that the server will use if none is provided, for example a "count" to control the number of results per page might default to 100 if not supplied by the client in the request. (Note: "default" has no meaning for required parameters.) See https://tools.ietf.org/html/draft-fge-json-schema-validation-00#section-6.2. Unlike JSON Schema this value MUST conform to the defined type for this parameter.

If it really is just our validation that is getting in the way, you can work around it by writing your own method which calls the Preparer and Sender without the validation clause. Other than that, I'm investigating now to see if there's unexpected validation happening.

edit: rewording mitigation steps for accuracy.

edit: another mitigation option would be to pass the CIDR 0.0.0.0 for the unused Cidr.

edit: grammar

@marstr
Copy link
Member

marstr commented Jul 19, 2018

The relevant validation line, just to speed anybody else up who's investigating.

Chain: []validation.Constraint{{Target: "parameters.ManagedClusterProperties.NetworkProfile.PodCidr", Name: validation.Pattern, Rule: `^([0-9]{1,3}\.){3}[0-9]{1,3}(\/([0-9]|[1-2][0-9]|3[0-2]))?$`, Chain: nil}}},

@marstr
Copy link
Member

marstr commented Jul 20, 2018

Here's the state of things as I understand them now, after reading this thread, hashicorp/terraform-provider-azurerm#1479, and a tremendous amount of our code regarding validation (please pardon the redundancy):

  • When Terraform creates a managed cluster with the "azure" mode, the "podCidr" field is not required.
  • Terraform is therefor not providing "podCidr"
  • This is unexpectedly causing a server error, about not being able to parse the unprovided value.
  • @lfshr and @tombuildsstuff tried working around this by using the default value 10.244.0.0/24, however it ends up looking like drift and therefor is unusable as a work around.
  • @lfshr and @tombuildsstuff would like to try providing the empty string to the service, to see if that tricks the server past its problems.
  • The validation line noted in the comment above gets in the way of that, because the empty string doesn't pass it.
  • @lfshr and @tombuildsstuff can still use the work around I mention above, where you replace the CreateOrUpdate method with your own helper to bypass our validation for the sake of your experiment.
  • There may still be a server bug that needs to be fixed, but there is no problem with the Go SDK itself.

I'm going to leave this issue open to help host the conversation, but it is important to note, @joshgav and @jhendrixMSFT that there isn't really any work for our team to do to support this.

@marstr
Copy link
Member

marstr commented Jul 20, 2018

Hey @lfshr, could you expand on your comment here:

@tombuildsstuff pod_cidr should not be required when the network_profile is set to "azure" but it is. I've opened an issue here: #2178

Sorry, I should have picked up on this issue. (and probably should not have included my workaround)

What error do you receive when you don't provide podCidr?

@tombuildsstuff
Copy link
Contributor

@lfshr also out of interest does this work when the validation in the Go SDK is commented out locally?

@lfshr
Copy link
Author

lfshr commented Jul 21, 2018

@marstr sure thing. Here is the output from Terraform when applying a plan without a podcidr:

Error: Error applying plan:

1 error(s) occurred:

* azurerm_kubernetes_cluster.test: 1 error(s) occurred:

* azurerm_kubernetes_cluster.test: containerservice.ManagedClustersClient#CreateOrUpdate: Invalid input: autorest/validation: validation failed: parameter=parameters.ManagedClusterProperties.NetworkProfile.PodCidr constraint=Pattern value="" details: value doesn't match pattern ^([0-9]{1,3}\.){3}[0-9]{1,3}(\/([0-9]|[1-2][0-9]|3[0-2]))?$

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

Here is the plan:

resource "azurerm_resource_group" "test" {
  name     = "acctest2RG"
  location = "West Europe"
}

resource "azurerm_virtual_network" "test" {
  name                = "acctest2virtnet"
  address_space       = ["172.0.0.0/16"]
  location            = "${azurerm_resource_group.test.location}"
  resource_group_name = "${azurerm_resource_group.test.name}"

  tags {
    environment = "Testing"
  }
}

resource "azurerm_subnet" "test" {
  name                 = "acctest2subnet"
  resource_group_name  = "${azurerm_resource_group.test.name}"
  virtual_network_name = "${azurerm_virtual_network.test.name}"
  address_prefix       = "172.0.2.0/24"
}

resource "azurerm_kubernetes_cluster" "test" {
  name                = "acctest2aks"
  location            = "${azurerm_resource_group.test.location}"
  resource_group_name = "${azurerm_resource_group.test.name}"
  dns_prefix          = "acctest2aks"
  kubernetes_version  = "1.7.7"

  linux_profile {
    admin_username = "acctest2user"

    ssh_key {
      key_data = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCqaZoyiz1qbdOQ8xEf6uEu1cCwYowo5FHtsBhqLoDnnp7KUTEBN+L2NxRIfQ781rxV6Iq5jSav6b2Q8z5KiseOlvKA/RF2wqU0UPYqQviQhLmW6THTpmrv/YkUCuzxDpsH7DUDhZcwySLKVVe0Qm3+5N2Ta6UYH3lsDf9R9wTP2K/+vAnflKebuypNlmocIvakFWoZda18FOmsOoIVXQ8HWFNCuw9ZCunMSN62QGamCe3dL5cXlkgHYv7ekJE15IA9aOJcM7e90oeTqo+7HTcWfdu0qQqPWY5ujyMw/llas8tsXY85LFqRnr3gJ02bAscjc477+X+j/gkpFoN1QEmt [email protected]"
    }
  }

  agent_pool_profile {
    name           = "default"
    count          = "2"
    vm_size        = "Standard_DS2_v2"
    vnet_subnet_id = "${azurerm_subnet.test.id}"
  }

  network_profile {
    network_plugin     = "azure"
    dns_service_ip     = "10.10.0.10"
    docker_bridge_cidr = "172.18.0.1/16"
    service_cidr       = "10.10.0.0/16"
  }

  service_principal {
    client_id     = "<redacted>"
    client_secret = "<redacted>"
  }
}

@tombuildsstuff I can verify that removing the validation rule in managedclusters.go fixes the issue. I also see that there are validation rules on all the other IP fields. I'm not 100% sure what fields are required when using kubenet, but this may also be an issue with kubenet clusters. I can try and validate this later tonight.

Code removed:

{Target: "parameters.ManagedClusterProperties.NetworkProfile.PodCidr", Name: validation.Null, Rule: false,
    Chain: []validation.Constraint{{Target: "parameters.ManagedClusterProperties.NetworkProfile.PodCidr", Name: validation.Pattern, Rule: `^([0-9]{1,3}\.){3}[0-9]{1,3}(\/([0-9]|[1-2][0-9]|3[0-2]))?$`, Chain: nil}}},

Chain: []validation.Constraint{{Target: "parameters.ManagedClusterProperties.NetworkProfile.PodCidr", Name: validation.Null, Rule: false,
Chain: []validation.Constraint{{Target: "parameters.ManagedClusterProperties.NetworkProfile.PodCidr", Name: validation.Pattern, Rule: `^([0-9]{1,3}\.){3}[0-9]{1,3}(\/([0-9]|[1-2][0-9]|3[0-2]))?$`, Chain: nil}}},

@tombuildsstuff
Copy link
Contributor

@lfshr

@tombuildsstuff I can verify that removing the validation rule in managedclusters.go fixes the issue. I also see that there are validation rules on all the other IP fields. I'm not 100% sure what fields are required when using kubenet, but this may also be an issue with kubenet clusters. I can try and validate this later tonight.

So given that works with the validation disabled in the SDK it sounds like this is just incorrect Swagger (which in turn generates the invalid validation in the SDK) - as such it should be possible to fix this by sending a PR to remove the line which @marstr highlighted above: https://github.com/Azure/azure-rest-api-specs/blob/master/specification/containerservices/resource-manager/Microsoft.ContainerService/stable/2018-03-31/managedClusters.json#L977 (or asking the Service Team to fix it, who would do the same thing).

Once the Swagger's fixed - a PR will be automatically sent to the Go (and other) SDK's to update this, which will make it in the next release; at which point we can update and ship
hashicorp/terraform-provider-azurerm#1479

Hope that helps! :)

@lfshr
Copy link
Author

lfshr commented Jul 22, 2018

@tombuildsstuff the only issue with removing the field is that the fact that the validation pattern is a valid pattern for checking that the field is a valid IP Address with a CIDR notation.

What if the required = false flag was added to the parameter in the swagger. Would autorest generate code that allowed this to be nullable?

@tombuildsstuff
Copy link
Contributor

@lfshr my understanding is that Required:false will allow a null value but not an empty string; which would mean it doesn’t handle this case, but @marstr could confirm for sure.

From our side that validation can be done within Terraform (I’m assuming that the same regex exists for non-empty strings at the API end too?)

@lfshr
Copy link
Author

lfshr commented Jul 23, 2018

@tombuildsstuff was that not just the plan for the workaround? Does terraform schema not supply nil when a value is not present in the tf file? If so then required: false should (in theory) fix the issue.

We could do validation at the terraform side for sure (and probably should to ensure early failures), but we're not the only consumer of this API.

@tombuildsstuff
Copy link
Contributor

@lfshr

Does terraform schema not supply nil when a value is not present in the tf file? If so then required: false should (in theory) fix the issue.

nope, Terraform provides an empty string which would be caught by this validation.

was that not just the plan for the workaround?

In the case of the Swagger - unless the Regex applies to all uses of that field then the validation is incorrect - so whilst we're not the only consumer, this is currently broken for anybody who attempts this operation; as such Required: false wouldn't solve this unfortunately and I believe this validation needs to be removed for anybody to be able to use the API?

@mbrancato
Copy link

mbrancato commented Jul 23, 2018

In regards to the high-level request to expand on this comment by @marstr (#2178 (comment)), the pod_cidr shouldn't be required because with the 'azure' plugin it obviously pulls IPs from the assigned subnet. This affects multiple Azure SDKs.

I've done some initial tracing on how the Azure CLI does this (via the Azure Python SDK), and being Python, it first sets the pod_cidr to None and then, later on it instantiates a class which (when using Python3) will take a None type on the pod_cidr and change it to the default CIDR (10.244.0.0/16) as a string. From my testing, the Azure CLI sends a default pod_cidr and never sends an empty string, even when the user does not specify. The network profile forces the creation of the default pod_cidr value.

https://github.com/Azure/azure-sdk-for-python/blob/ce296258253a8eb25c98fbabc4d75066460b7d0a/azure-mgmt-containerservice/azure/mgmt/containerservice/models/container_service_network_profile_py3.py#L60-L67

Is it possible for the Go SDK to mimic this behavior, and send a default pod_cidr when the user does not specify a value or when the 'azure' network plugin is specified

edit: I see the previous comment referring to this as "drift", updated this comment to request the Go SDK mimic the Python SDK.

@marstr
Copy link
Member

marstr commented Jul 24, 2018

@mbrancato says:
Is it possible for the Go SDK to mimic this behavior, and send a default pod_cidr when the user does not specify a value or when the 'azure' network plugin is specified.

Possible, yes. However, I believe moving to client-side defined defaults is probably not in conformance with the Swagger spec as noted below:

Declares the value of the parameter that the server will use if none is provided, for example a "count" to control the number of results per page might default to 100 if not supplied by the client in the request.

If there were going to be any change, I'd prefer it be that Python/CLI moves to not send defaults. But that is another conversation. :)

@marstr
Copy link
Member

marstr commented Jul 24, 2018

Alright folks, I wrote up a little console application that just tries to create an AKS cluster with various Network Profiles to help repro/drive conversation around this bug. You can find the code here:
https://github.com/marstr/repro2178

I accept PRs to it, so if any body involved sees something that ought to be changed to more accurately reflect the problem, please send a PR!

@lfshr
Copy link
Author

lfshr commented Jul 25, 2018

@marstr there is also Azure/AKS#506 #2176 and #2177
I don't know whether you want to keep these issues separate or cover them all under the same issue.

#2176 and #2177 are less important imo as at the minute 'calico' is implied.

@metacpp
Copy link

metacpp commented Jul 25, 2018

@marstr the PR was merged, no action item from your side. You can close this issue.

@tombuildsstuff
Copy link
Contributor

@metacpp I believe there's still a behavioural bug in the API which needs to be fixed here - so I'd suggest we leave this issue open until that's resolved.

Other API's in Azure use conditional validation for fields such that the validation's only applied when the values are used - this allows empty strings to be sent for these values when they're not required (and they return a nil value from the API, which allows consumers to confirm they're not set). As such I believe the AKS API should be updated to match the behaviour of the other API's here.

@marstr could we reach out to the Service Team about applying conditional validation here?

@jhendrixMSFT
Copy link
Member

@tombuildsstuff Martin is out for the next few days so I'll answer. The swagger authoring for the podCidr field indicates that if you don't want to send a value for it you must pass nil (non-nil values must match the regex provided in the constraint). There was an internal email thread about this and the RP confirmed that this is by design; the API cannot accept an empty string for the podCidr field. Does this answer your question?

@tombuildsstuff
Copy link
Contributor

@jhendrixMSFT

Martin is out for the next few days so I'll answer.

thanks :)

There was an internal email thread about this and the RP confirmed that this is by design; the API cannot accept an empty string for the podCidr field. Does this answer your question?

Sure, technically the swagger matches the Resource Provider/API; the issue I'm raising is more that this RP fundamentally behaves differently to other RP's (which apply validation conditionally as needed when conditional logic is needed).

In addition the validation requirements for the AKS API aren't documented - and the API returns some pretty unhelpful errors when something's gone wrong (the latest case being this one - but we've had a generic 400 error returned in the past).

The validation requirements of the RP are worth calling out specifically since they've changed multiple times within the same API version (and apparently it's going to change again). Whilst I can understand this may improve the User Experience in the long term - it's challenging to provide meaningful validation for this resource whilst this Stable API (v2018-03-31) is a moving target (we had similar issues with the Preview, but that's more understandable).

Whilst this API technically works as-is according to the Swagger - I think there's some behavioural/usability issues with the AKS API which need to be addressed to provide a better user experience to consumers of the API. In particular the behaviour of a Stable API shouldn't change without a version bump (even if that's changing the validation) since that can still break downstream consumers - what do you think?

@vladbarosan
Copy link

Hi @tombuildsstuff since there is no action the Go SDK ill close this. Please open this issue in the specs repo and provide an exact ask for the Service team ( or you can reference this issue if you think its enough )

@RickWinter RickWinter added the customer-reported Issues that are reported by GitHub users external to the Azure organization. label Jul 12, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
customer-reported Issues that are reported by GitHub users external to the Azure organization.
Projects
None yet
Development

No branches or pull requests

9 participants