Disabling public network access and using UserDefinedRouting #3690

Closed
andrewkreuzer opened this issue May 31, 2023 · 27 comments
Labels
Feedback General feedback

Comments

@andrewkreuzer

Describe your scenario
I have created a cluster with outbound_type = UserDefinedRouting and public_network_access_enabled = false using the Terraform provider. I am now receiving the error:

Code="BadRequest" Message="UserDefinedRouting is not supported when Cluster has public network access set to Disabled.

or from the portal:

Failed to save Kubernetes service 'MyCluster'.
Error: UserDefinedRouting is not supported when Cluster has
public network access set to Disabled.

A support ticket was opened and I was told:

When running a Terraform plan that includes the option/value "publicnetworkaccess: 'disabled'" and using a UDR, the cluster creation should have failed validation and the cluster should not have been created. Prior to the last Azure CLI update, this validation was skipped and the cluster was allowed to be built, however, that should not have been allowed

Feedback
I'm confused as to why this is not supported.

Setting private_cluster_enabled keeps the API endpoint within the VNet, setting public_network_access_enabled to false keeps the load balancer within our VNet, and using outbound_type UserDefinedRouting to route egress traffic through our firewall ensures we control all outbound traffic. The fact that this was allowed and the cluster is functioning makes it even more confusing. If this is not intended to be supported, why does it work?

We're now stuck in a state where we can't make changes to the cluster unless we enable public access (which would cause cluster re-creation)... and we have three clusters.

If there's something I'm misunderstanding, or a technical reason why this is not supported, I would be grateful for some insight.
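Roughly, the cluster is defined like this (a simplified Terraform sketch only; the resource group, subnet reference, names, and node sizes are placeholders rather than our actual values):

resource "azurerm_kubernetes_cluster" "example" {
  name                = "MyCluster"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "mycluster"

  # API server reachable only via a private endpoint inside the VNet
  private_cluster_enabled = true

  # The setting the API now rejects in combination with UserDefinedRouting
  public_network_access_enabled = false

  default_node_pool {
    name           = "system"
    node_count     = 3
    vm_size        = "Standard_D4s_v5"
    vnet_subnet_id = azurerm_subnet.aks.id # subnet whose route table sends 0.0.0.0/0 to the firewall
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    outbound_type  = "userDefinedRouting" # egress goes through the UDR/firewall; no managed outbound IP
  }
}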

@andrewkreuzer added the Feedback General feedback label May 31, 2023
@zadigus

zadigus commented Jun 1, 2023

For what it's worth, you can find here a repository that reproduces the issue. I have been experiencing the very same problem since last Tuesday; it had been working fine for months before that.

@matthiasguentert

matthiasguentert commented Jun 1, 2023

Same issue here while trying to upgrade from 1.24.6 to 1.25.6 using the Azure CLI.

az aks upgrade --name <cluster> --resource-group <group> --subscription <subscription> --no-wait --kubernetes-version "1.25.6"
Kubernetes may be unavailable during cluster upgrades.
 Are you sure you want to perform this operation? (y/N): y
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version 1.25.6. Continue? (y/N): y
(BadRequest) UserDefinedRouting is not supported when Cluster has public network access set to Disabled.
Code: BadRequest
Message: UserDefinedRouting is not supported when Cluster has public network access set to Disabled.

It seems as if the Azure RM API has changed, as I was able to upgrade another cluster (created with the exact same version of the Terraform module) two weeks ago. The Terraform module uses:

...
network_profile {
  ...
  outbound_type = "userDefinedRouting"
}
public_network_access_enabled = false
private_cluster_enabled = true

How can I upgrade my clusters now? 😐

@jvikes11

jvikes11 commented Jun 1, 2023

According to my ticket with Microsoft, this is the new normal:

Recent changes to cluster validation during creation or update have caused this validation error to surface for customers using public_network_access_enabled=false in their Terraform template with UDR. Public_network_access_enabled must be set to True in the Terraform template for deployment to succeed. It is not possible to deploy a cluster with public_network_access_enabled=false via Az CLI or Azure Portal.

Will update with any solution we find - it is not happening to all subscriptions, so it might be a phased rollout.

@zadigus

zadigus commented Jun 1, 2023

What is the public_network_access_enabled = false option doing, other than setting very restrictive network security rules on the AKS workers subnet? Isn't that option redundant anyway?

@andrewkreuzer
Author

andrewkreuzer commented Jun 1, 2023

What is the public_network_access_enabled = false option doing, other than setting very restrictive network security rules on the AKS workers subnet?

It's my understanding that it places the default load balancer in your VNet as opposed to making it publicly accessible
(the load balancer is named kubernetes-internal instead of kubernetes)

@zadigus

zadigus commented Jun 1, 2023

What is the public_network_access_enabled = false option doing, other than setting very restrictive network security rules on the AKS workers subnet?

It's my understanding that it places the default load balancer in your VNet as opposed to making it publicly accessible

Well, I don't know how you deploy your AKS cluster, but in my case, my internal load-balancer (ILB) is deployed through the nginx ingress with the following annotations (among others):

controller:
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      service.beta.kubernetes.io/azure-pls-name: "some-name"
      service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "some-private-subnet"

Therefore, my ILB gets deployed to a private subnet, and this has nothing to do with the public_network_access_enabled = false option.

@andrewkreuzer
Author

Ya, I didn't have to add those annotations because the load balancer was created in the subnet we have designated for AKS (node pools, API server private endpoint, and the load balancer's private frontend IPs).

@zadigus

zadigus commented Jun 1, 2023

@andrewkreuzer ok I didn't know it was possible

@andrewkreuzer
Author

I'm beginning to believe it shouldn't be

@andrewkreuzer
Author

My bad, I do have those annotations:

annotations:
  service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz

@jvikes11

jvikes11 commented Jun 1, 2023

I haven't tried this yet because my test cluster was recreated, but this is a quick fix that Microsoft gave me to change the public access property without recreating the cluster:

az deployment group create --template-file clusterupdate_publicaccess.bicep -g ResourceGroupoftheCluster

Apparently if you have the private cluster enabled the public access option isn't needed and the cluster is still closed off from public access.

@matthiasguentert

@jvikes11 what does the content of clusterupdate_publicaccess.bicep look like?

@zadigus

zadigus commented Jun 2, 2023

I haven't tried this yet because my test cluster was recreated, but this is a quick fix that Microsoft gave me to change the public access property without recreating the cluster:

az deployment group create --template-file clusterupdate_publicaccess.bicep -g ResourceGroupoftheCluster

Apparently if you have the private cluster enabled the public access option isn't needed and the cluster is still closed off from public access.

that might be true if, and only if, your workers are deployed to a private subnet

@matthiasguentert

In my case, there are two potential methods for fixing this configuration issue (taken from the troubleshooting guide within the Azure portal):

  1. Update the cluster through ARM or Bicep and Azure CLI. An example Bicep script is available below and can be used via az deployment group create --template-file clusterupdate.bicep -g
param cluster_name string
param location string

resource aks_cluster 'Microsoft.ContainerService/managedClusters@2023-03-02-preview' = {
  name: cluster_name
  location: location
  properties: {
    publicNetworkAccess: 'Enabled'
  }
}

For convenience, here is the ARM version as well

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "metadata": {
    "_generator": {
      "name": "bicep",
      "version": "0.17.1.54307",
      "templateHash": "15811174133861481625"
    }
  },
  "parameters": {
    "cluster_name": {
      "type": "string"
    },
    "location": {
      "type": "string"
    }
  },
  "resources": [
    {
      "type": "Microsoft.ContainerService/managedClusters",
      "apiVersion": "2023-03-02-preview",
      "name": "[parameters('cluster_name')]",
      "location": "[parameters('location')]",
      "properties": {
        "publicNetworkAccess": "Enabled"
      }
    }
  ]
}

  2. Update the value in your Terraform code or ARM template to set publicNetworkAccess: "Enabled" (or in the case of Terraform, public_network_access_enabled = true) and apply the updated template. In the case of Terraform, this will result in a deletion and recreation of the cluster.
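In Terraform terms, option 2 is roughly the following change (a sketch only; the remaining arguments of the resource are assumed unchanged):

resource "azurerm_kubernetes_cluster" "example" {
  # ... all other arguments unchanged ...

  # was false; corresponds to publicNetworkAccess: 'Enabled' in the Bicep/ARM templates above
  public_network_access_enabled = true
}

As noted above, applying this through the provider results in destroying and recreating the cluster, which is why the Bicep/ARM path in option 1 is the practical fix.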

@andrewkreuzer
Author

does this create a public load balancer "kubernetes"?

@zadigus

zadigus commented Jun 2, 2023

I have a pretty big battery of tests for my private infrastructure on Azure, and just removing the problematic option has kept my tests green. I validate AKS privacy, hopefully deeply enough. So my initial guess was very likely correct: the parameter is redundant with other settings, like the NSG of the subnet the AKS workers are deployed into.

@matthiasguentert

Just verified with Microsoft support that the cluster remains private.

Meanwhile, we recommend you to set PublicNetworkAccess=Enabled to unblock the cluster upgrade as your cluster is private (setting it to Enabled won't expose your cluster to the public internet; your apiserver is still exposed on the private vNet only).

@andrewkreuzer
Author

Ya, that covers the API private endpoint, which is controlled by private_cluster_enabled or the --enable-private-cluster az CLI flag, but is there a load balancer with a public IP named "kubernetes" after running the above template?

hashicorp/terraform-provider-azurerm#18221 (comment)

@zadigus

zadigus commented Jun 2, 2023

Ya, that covers the API private endpoint, which is controlled by private_cluster_enabled or the --enable-private-cluster az CLI flag, but is there a load balancer with a public IP named "kubernetes" after running the above template?

hashicorp/terraform-provider-azurerm#18221 (comment)

in my case, where I explicitly set

public_network_access_enabled     = true

I get the kubernetes-internal load-balancer.

@kevinkrp93
Contributor

@phealy @chasewilson - please take a look at this

@andrewkreuzer
Author

Thanks everyone for the feedback

The fix posted by @matthiasguentert allows you to update the cluster's configuration without having to redeploy.

I think there are a few things that could be added to Azure's documentation to better describe what these parameters do. It's still unclear exactly what the PublicNetworkAccess[1] REST parameter does, and adding a description of Enabled and Disabled to that page would be helpful.

The documentation for UDR configuration[2] describes that with UDR a load balancer isn't created until a service of type LoadBalancer is created within the cluster, which explains the configuration outcomes above.

AKS clusters with an outbound type of UDR get a standard load balancer only when the first Kubernetes service of type loadBalancer is deployed. The load balancer is configured with a public IP address for inbound requests and a backend pool for inbound requests. The Azure cloud provider configures inbound rules, but it doesn't configure outbound public IP address or outbound rules. Your UDR is the only source for egress traffic.

And you won't get a public IP unless you explicitly request one with a LoadBalancer service:

When using an outbound type of UDR, a load balancer public IP address for inbound requests isn't created unless you configure a service of type loadbalancer. AKS never creates a public IP address for outbound requests if you set an outbound type of UDR.
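To make the quoted behaviour concrete, here is a rough sketch of the kind of service that triggers the load balancer creation under UDR, expressed with the Terraform kubernetes provider (names, selector, and ports are hypothetical; the internal annotation keeps the frontend IP private):

resource "kubernetes_service" "example" {
  metadata {
    name      = "example-svc"
    namespace = "default"
    annotations = {
      # keep the allocated frontend IP private; omit this and a public inbound IP is requested
      "service.beta.kubernetes.io/azure-load-balancer-internal" = "true"
    }
  }

  spec {
    type = "LoadBalancer"

    selector = {
      app = "example"
    }

    port {
      port        = 80
      target_port = 8080
    }
  }
}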

A section stating that public network access must be enabled for this configuration would be helpful, as there is no mention of it in this documentation.

And finally after running the above fix there is no perceived change in the cluster configuration displayed in the Azure portal.

[1] REST API: AKS - PublicNetworkAccess
[2] Customize cluster egress with a user-defined routing

@cfBrianMiller

Why was this closed? This is a very misleading configuration setting

@zadigus

zadigus commented Jun 12, 2023

Indeed, I don't think anything has been solved. The parameter is still there and its use is still a mystery.

@andrewkreuzer
Author

andrewkreuzer commented Jun 12, 2023

from support:

The comments and speculation for this feature on the GitHub issue aren't correct. This has nothing to do with the load balancer creation that's triggered by AKS.
Public Network Access allows users to block connectivity to the API server at the time of cluster creation. Thinking along the lines of authorized IP ranges but in reverse - we block everything until told otherwise where with authorized ranges, we allow all until told otherwise. The issue is really with the UDR component - if the outbound type was LB, we know the egress IP and can add an allow in the public network access deny rule so that nodes can register. With UDR, we don't have that luxury since we don't know exactly where someone's traffic is going to route to or which or how many public IPs are in play.
If you're combining public network access with a private cluster, there's no value in using the public network access - your API server is already as private as we can make it since it's only accessible via PE within the vnet. If you've got a public control plane, using authorized IP ranges at the time of the cluster creation accomplishes the same configuration through a different approach.
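For public control planes, the authorized IP ranges alternative mentioned above would look roughly like this in Terraform (a sketch only; the block layout varies by azurerm provider version, with older versions using the top-level api_server_authorized_ip_ranges argument, and the CIDR is a placeholder):

resource "azurerm_kubernetes_cluster" "public_example" {
  # ... name, location, node pool, identity, etc. ...

  private_cluster_enabled = false

  api_server_access_profile {
    # only these ranges may reach the public API server endpoint
    authorized_ip_ranges = ["203.0.113.0/24"]
  }
}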

Strange, though, that in the azurerm TF provider issue linked above, and in my own experience, a public load balancer was created when public_network_access was set to true.

Anyways, hopefully this clears things up for you.

@zadigus

zadigus commented Jun 12, 2023

@andrewkreuzer I am a bit lost with that explanation from your support engineer, as what they describe seems to correspond to the private_cluster_enabled parameter in the terraform resource azurerm_kubernetes_cluster. I'd be curious to know what the difference between the private_cluster_enabled and public_network_access_enabled parameters is.

@palma21
Member

palma21 commented Jun 14, 2023

Hey folks,

Hoping this can clear some questions. This went unnoticed, apologies.

The full functionality of PublicNetworkAccess (PNA) is not really completed yet, which is why we shouldn't have any docs about it out; let us know if you found any out there that we need to look into. It seems TF released this and there may be differing interpretations of what it means/does.

We're rushing some docs for this reason, but for the time being Private Cluster or the equivalent API Server VNet integration (in preview) are really the only things that affect your cluster control plane networking exposure.

Your nodes' and services' exposure is controlled by you: internal/external services, NSG/FW, etc.
(Getting an LB on the nodes or not is about your cluster outbound type, not related to this feature either; inbound LB rules are defined by your k8s services.)

PNA has no effect with private clusters. This change was part of the development process of the feature, but we were not aware TF was already exposing it, and with a default (to disable, if I understood correctly).

For private clusters, on both the current mode and VNet integration, it's N/A since they already have no public connections allowed.

We were testing whether, for public clusters, we should allow setting Disabled and what the behaviors there would be, and outbound type UDR was an outlier since we can't get communication back from the nodes (but our change caught all clusters, not just public ones). It was an oversight not to check whether some client was already using this config; it was wrongly assumed not, since this was not fully out.

This is not a required or finished property, and I'm not sure of the TF context for starting to support it, but we'll try to reach out to revert that.

Sorry for the confusion

@clarenceb

Currently, it looks like the TF Azure provider always sets a value (the default is true). See: kubernetes_cluster_resource.go#L1416 (and PR #18705).

@ghost locked as resolved and limited conversation to collaborators Jul 14, 2023