Service principal creation isn't finished before other resources start provisioning #156

Closed
tillig opened this issue Oct 8, 2019 · 8 comments

Comments

@tillig

tillig commented Oct 8, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureAD Provider) Version

Terraform v0.12.10

  • provider.azuread v0.6.0
  • provider.azurerm v1.35.0
  • provider.random v2.2.1

Affected Resource(s)

  • azuread_service_principal

Terraform Configuration Files

(Subscription and tenant ID are not the real ones.)

provider "azurerm" {
  version         = "=1.35.0"
  subscription_id = "122da2bf-07eb-473c-acb3-1c9f666d3d32"
  tenant_id       = "e2b4ae1f-3aa1-421c-9e65-86fafc7f05e8"
}

provider "azuread" {
  version = "=0.6.0"
}

provider "random" {
  version = "~> 2.2"
}

variable "cluster_name" {
  type    = "string"
  default = "tillig-k8s"
}

resource "random_string" "kubernetes_sp_password" {
  length  = 32
  special = true
}

resource "azuread_application" "kubernetes" {
  name                       = "${var.cluster_name}-kubernetes"
  available_to_other_tenants = false
}

resource "azuread_service_principal" "kubernetes" {
  application_id = "${azuread_application.kubernetes.application_id}"
}

resource "azuread_service_principal_password" "kubernetes" {
  service_principal_id = "${azuread_service_principal.kubernetes.id}"
  value                = "${random_string.kubernetes_sp_password.result}"
  end_date_relative    = "17520h" #expire in 2 years
}

# Network contributor required to use LoadBalancer resources
resource "azurerm_role_assignment" "kubernetes" {
  scope                = "${azurerm_resource_group.kubernetes.id}"
  role_definition_name = "Network Contributor"
  principal_id         = "${azuread_service_principal.kubernetes.id}"
}

resource "azurerm_resource_group" "kubernetes" {
  name     = "${var.cluster_name}"
  location = "West US"
}

resource "azurerm_kubernetes_cluster" "kubernetes" {
  name                = "${var.cluster_name}"
  location            = "${azurerm_resource_group.kubernetes.location}"
  resource_group_name = "${azurerm_resource_group.kubernetes.name}"
  dns_prefix          = "${var.cluster_name}-tfrm"
  addon_profile {
    kube_dashboard {
      enabled = "true"
    }
  }
  agent_pool_profile {
    name            = "default"
    count           = 3
    vm_size         = "Standard_DS2_v2"
    os_type         = "Linux"
    os_disk_size_gb = 30
    type            = "VirtualMachineScaleSets"
  }
  service_principal {
    client_id     = "${azuread_application.kubernetes.application_id}"
    client_secret = "${random_string.kubernetes_sp_password.result}"
  }
}

Expected Behavior

I expect the provisioning to occur without issue: application, service principal, resource group, and Kubernetes cluster.

Actual Behavior

The Kubernetes cluster failed to provision because the service principal had not yet finished being created when the cluster provisioning started. (Again, not the real IDs here.)

Error: Error creating Managed Kubernetes Cluster "tillig-k8s" (Resource Group "tillig-k8s"): containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="ServicePrincipalNotFound" Message="Service principal clientID: d8261536-1b1c-40d1-b096-7757f032418a not found in Active Directory tenant e2b4ae1f-3aa1-421c-9e65-86fafc7f05e8, Please see https://aka.ms/aks-sp-help for more details."

Running the provisioning again, with no changes to the configuration, succeeds. By the time the second run happens, the service principal has finished being created and the Kubernetes cluster provisioning can proceed.

Steps to Reproduce

  1. terraform apply to start things off.
  2. Watch error come in.
  3. terraform apply to finish provisioning.

Given this is fairly timing-related, I didn't re-run it several times to try to catch the debug output. If that's required, I can try allocating time to that.

@katbyte
Collaborator

katbyte commented Oct 11, 2019

Hi @tillig,

This is most likely an eventual consistency error caused by AAD replication. We have done our best to prevent replication issues by trying to get the service principal from the API until we successfully get it 10 times in a row. This has fixed most replication problems, but it's not perfect. There isn't too much we can do here until the Graph API calls only return once the object is fully replicated and available.

@tillig
Author

tillig commented Oct 11, 2019

Hmmm. I appreciate your looking at this. Is there some issue or feedback item with Azure I can +1 or otherwise use to encourage them to solve that? Also, is there some custom code or something I could run as a workaround, like a "sleep and retry" or "try/catch/retry"? A configuration value to increase the number of retries already happening? I'm grasping at straws on this one; I don't know much about AAD replication.

Or is it potentially better (recommended?) to create service principals in a separate Terraform execution from deployment of features requiring those service principals to force a sort of "manual delay" and allow replication to finish?
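
For what it's worth, one way to approximate that "manual delay" without maintaining a second configuration would be targeted applies; this is only a rough sketch against the resource names above, not an officially recommended flow:

terraform apply -target=azuread_service_principal_password.kubernetes
# ...wait for AAD replication to catch up, then create the rest (resource group, role assignment, AKS)...
terraform apply

Since -target pulls in dependencies, the first pass also creates the azuread_application and azuread_service_principal resources.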

@katbyte
Collaborator

katbyte commented Oct 11, 2019

It's a hard call. I have toyed with the idea of exposing the replication wait constants, but I'm afraid that may or may not work, as the server Terraform hits could be different from the one AKS is internally querying. The replication waits we added seem to ensure a Terraform run will always complete, but not AKS.

I have seen people add null resources/local exec with a sleep before, but that is far from ideal. Creating the SP & creds separately would most likely solve the issue for you, but that is also not ideal.
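
A minimal sketch of that sleep approach against the config above (the 30-second delay is an arbitrary guess, and it assumes a shell with a sleep command is available wherever Terraform runs):

resource "null_resource" "delay_after_sp" {
  # Blindly give AAD replication a head start before AKS tries to use the SP.
  provisioner "local-exec" {
    command = "sleep 30"
  }
  triggers = {
    service_principal_id = "${azuread_service_principal.kubernetes.id}"
  }
}

Adding depends_on = [null_resource.delay_after_sp] to the azurerm_kubernetes_cluster resource then forces the cluster to wait for the sleep. It doesn't guarantee the SP is visible; it only widens the window, which is part of why it's far from ideal.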

@techspeque

techspeque commented Oct 31, 2019

The replication issue is strange enough; here are some more findings to add to the context.

Whilst working on a workaround I found that the one mentioned in #4 (comment) did not work.

Whilst trying to figure out a solution I started to get Message="The credentials in ServicePrincipalProfile were invalid." when trying to combine these two workarounds: #4 (comment) and #4 (comment)

The Azure CLI folks are also experiencing a similar issue with the aks resource: Azure/azure-cli#9585. It appears to be due to the replication and eventual consistency of Azure AD, and it is indeed a known Microsoft issue: Azure/AKS#1206

A valid workaround proposed by Microsoft is described here: Azure/AKS#1206 (comment) and here: https://docs.microsoft.com/en-us/azure/aks/troubleshooting#im-receiving-errors-that-my-service-principal-was-not-found-when-i-try-to-create-a-new-cluster-without-passing-in-an-existing-one

So your thinking, @katbyte, was spot on with the solutions, of which neither is good...
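
A rough Terraform translation of that documented "wait for the service principal" approach, assuming the Azure CLI and a POSIX shell are available wherever Terraform runs (the polling loop and the 10-second interval are illustrative, not Microsoft's exact script):

resource "null_resource" "wait_for_sp" {
  # Poll AAD until the service principal is actually visible, rather than sleeping blindly.
  provisioner "local-exec" {
    command = "until az ad sp show --id ${azuread_application.kubernetes.application_id} > /dev/null 2>&1; do sleep 10; done"
  }
}

With a depends_on = [null_resource.wait_for_sp] hook on the azurerm_kubernetes_cluster resource, this at least waits on the directory itself instead of a fixed timer.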

@drdamour

Could we make the recheck count configurable?

@manicminer
Contributor

Just a note that this issue is a duplicate of #128. Since we've added mitigation to try to work around this, I've closed that issue and this one too.

On the workarounds proposed in the linked issue Azure/AKS#1206: we've added the get-10-times step as a partial workaround, which hopefully goes some way towards mitigating this problem, but retrying to establish a new session using the created app/SP is not really viable for us (at least right now) since there's no way to tell whether an app/SP is authorized to perform whatever action we might attempt.

If you're still affected by this, I encourage you to raise it as an Azure support issue; ultimately, any additional steps we take in the provider to mitigate early OK responses and replication delays are working around upstream issues. If you have a specific idea or strategy for further improving our handling here, please do open a new issue for discussion. Thanks!

@manicminer
Contributor

Duplicate of #128

@manicminer manicminer marked this as a duplicate of #128 Jun 26, 2020
@ghost

ghost commented Jul 26, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked and limited conversation to collaborators Jul 26, 2020