Service principal creation isn't finished before other resources start provisioning #156

Closed
tillig opened this issue Oct 8, 2019 · 8 comments

Comments

@tillig

tillig commented Oct 8, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureAD Provider) Version

Terraform v0.12.10

  • provider.azuread v0.6.0
  • provider.azurerm v1.35.0
  • provider.random v2.2.1

Affected Resource(s)

  • azuread_service_principal

Terraform Configuration Files

(Subscription and tenant ID are not the real ones.)

provider "azurerm" {
  version         = "=1.35.0"
  subscription_id = "122da2bf-07eb-473c-acb3-1c9f666d3d32"
  tenant_id       = "e2b4ae1f-3aa1-421c-9e65-86fafc7f05e8"
}

provider "azuread" {
  version = "=0.6.0"
}

provider "random" {
  version = "~> 2.2"
}

variable "cluster_name" {
  type    = "string"
  default = "tillig-k8s"
}

resource "random_string" "kubernetes_sp_password" {
  length  = 32
  special = true
}

resource "azuread_application" "kubernetes" {
  name                       = "${var.cluster_name}-kubernetes"
  available_to_other_tenants = false
}

resource "azuread_service_principal" "kubernetes" {
  application_id = "${azuread_application.kubernetes.application_id}"
}

resource "azuread_service_principal_password" "kubernetes" {
  service_principal_id = "${azuread_service_principal.kubernetes.id}"
  value                = "${random_string.kubernetes_sp_password.result}"
  end_date_relative    = "17520h" #expire in 2 years
}

# Network contributor required to use LoadBalancer resources
resource "azurerm_role_assignment" "kubernetes" {
  scope                = "${azurerm_resource_group.kubernetes.id}"
  role_definition_name = "Network Contributor"
  principal_id         = "${azuread_service_principal.kubernetes.id}"
}

resource "azurerm_resource_group" "kubernetes" {
  name     = "${var.cluster_name}"
  location = "West US"
}

resource "azurerm_kubernetes_cluster" "kubernetes" {
  name                = "${var.cluster_name}"
  location            = "${azurerm_resource_group.kubernetes.location}"
  resource_group_name = "${azurerm_resource_group.kubernetes.name}"
  dns_prefix          = "${var.cluster_name}-tfrm"
  addon_profile {
    kube_dashboard {
      enabled = "true"
    }
  }
  agent_pool_profile {
    name            = "default"
    count           = 3
    vm_size         = "Standard_DS2_v2"
    os_type         = "Linux"
    os_disk_size_gb = 30
    type            = "VirtualMachineScaleSets"
  }
  service_principal {
    client_id     = "${azuread_application.kubernetes.application_id}"
    client_secret = "${random_string.kubernetes_sp_password.result}"
  }
}

Expected Behavior

I expect the provisioning to occur without issue: application, service principal, resource group, and Kubernetes cluster.

Actual Behavior

The Kubernetes cluster failed to provision because the service principal had not yet finished being created when the cluster provisioning started. (Again, not the real IDs here.)

Error: Error creating Managed Kubernetes Cluster "tillig-k8s" (Resource Group "tillig-k8s"): containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="ServicePrincipalNotFound" Message="Service principal clientID: d8261536-1b1c-40d1-b096-7757f032418a not found in Active Directory tenant e2b4ae1f-3aa1-421c-9e65-86fafc7f05e8, Please see https://aka.ms/aks-sp-help for more details."

Running the provisioning again, with no changes to the configuration, succeeds. By the time the second run happens, the service principal has finished being created and the Kubernetes cluster provisioning can proceed.

Steps to Reproduce

  1. terraform apply to start things off.
  2. Watch error come in.
  3. terraform apply to finish provisioning.

Given this is fairly timing-related, I didn't re-run it several times to try to catch the debug output. If that's required, I can try allocating time to that.

@katbyte
Collaborator

katbyte commented Oct 11, 2019

Hi @tillig,

This is most likely an eventual consistency error caused by AAD replication. We have done our best to prevent replication issues by trying to get the service principal from the API until we successfully get it 10 times in a row. This has fixed most replication problems, but it's not perfect. There isn't too much we can do here until the Graph API calls only return once the object is fully replicated and available.

@tillig
Author

tillig commented Oct 11, 2019

Hmmm. I appreciate your looking at this. Is there some issue or feedback item with Azure I can +1 or otherwise use to encourage them to solve that? Also, is there some custom code or something I could run as a workaround, like a "sleep and retry" or "try/catch/retry"? A configuration value to increase the number of retries already happening? I'm grasping at straws on this one; I don't know much about AAD replication.

Or is it potentially better (recommended?) to create service principals in a separate Terraform execution from deployment of features requiring those service principals to force a sort of "manual delay" and allow replication to finish?
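
For what it's worth, one way to approximate that "manual delay" without maintaining a second configuration would be targeted applies; this is only a rough sketch against the resource names above, not an officially recommended flow:

terraform apply -target=azuread_service_principal_password.kubernetes
# ...wait for AAD replication to catch up, then create the rest (resource group, role assignment, AKS)...
terraform apply

Since -target pulls in dependencies, the first pass also creates the azuread_application and azuread_service_principal resources.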

@katbyte
Collaborator

katbyte commented Oct 11, 2019

It's a hard call. I have toyed with the idea of exposing the replication wait constants, but I'm afraid that may or may not work, as the server Terraform hits could be different from the one AKS is internally querying. The replication waits we added seem to ensure a Terraform run will always complete, but not AKS.

I have seen people add null resources/local exec with a sleep before, but that is far from ideal. Creating the SP & creds separately would most likely solve the issue for you, but that is also not ideal.
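
A minimal sketch of that sleep approach against the config above (the 30-second delay is an arbitrary guess, and it assumes a shell with a sleep command is available wherever Terraform runs):

resource "null_resource" "delay_after_sp" {
  # Blindly give AAD replication a head start before AKS tries to use the SP.
  provisioner "local-exec" {
    command = "sleep 30"
  }
  triggers = {
    service_principal_id = "${azuread_service_principal.kubernetes.id}"
  }
}

Adding depends_on = [null_resource.delay_after_sp] to the azurerm_kubernetes_cluster resource then forces the cluster to wait for the sleep. It doesn't guarantee the SP is visible; it only widens the window, which is part of why it's far from ideal.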

@techspeque

techspeque commented Oct 31, 2019

The replication issue is strange enough; here are some more findings to add to the context.

Whilst working on a workaround I found that the one mentioned in #4 (comment) did not work.

Whilst trying to figure out a solution I started to get Message="The credentials in ServicePrincipalProfile were invalid." when trying to combine these two workarounds: #4 (comment) and #4 (comment)

The Azure CLI folks are also experiencing a similar issue with the aks resource: Azure/azure-cli#9585. It appears to be due to the replication and eventual consistency of Azure AD, and it is indeed a known Microsoft issue: Azure/AKS#1206

A valid workaround proposed by Microsoft is described here: Azure/AKS#1206 (comment) and here: https://docs.microsoft.com/en-us/azure/aks/troubleshooting#im-receiving-errors-that-my-service-principal-was-not-found-when-i-try-to-create-a-new-cluster-without-passing-in-an-existing-one

So your thinking, @katbyte, was spot on with the solutions, of which neither is good...
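
A rough Terraform translation of that documented "wait for the service principal" approach, assuming the Azure CLI and a POSIX shell are available wherever Terraform runs (the polling loop and the 10-second interval are illustrative, not Microsoft's exact script):

resource "null_resource" "wait_for_sp" {
  # Poll AAD until the service principal is actually visible, rather than sleeping blindly.
  provisioner "local-exec" {
    command = "until az ad sp show --id ${azuread_application.kubernetes.application_id} > /dev/null 2>&1; do sleep 10; done"
  }
}

With a depends_on = [null_resource.wait_for_sp] hook on the azurerm_kubernetes_cluster resource, this at least waits on the directory itself instead of a fixed timer.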

@drdamour

Could we make the recheck count configurable?

@manicminer
Contributor

Just a note that this issue is a duplicate of #128. Since we've added mitigation to try to work around this, I've closed that issue and this one too.

On the workarounds proposed in the linked issue Azure/AKS#1206: we've added the get-10-times step as a partial workaround, which hopefully goes some way towards mitigating this problem, but retrying to establish a new session using the created app/SP is not really viable for us (at least right now) since there's no way to tell whether an app/SP is authorized to perform whatever action we might attempt.

If you're still affected by this, I encourage you to raise it as an Azure support issue; ultimately, any additional steps we take in the provider to mitigate early OK responses and replication delays are working around upstream issues. If you have a specific idea or strategy for further improving our handling here, please do open a new issue for discussion. Thanks!

@manicminer
Contributor

Duplicate of #128

@manicminer manicminer marked this as a duplicate of #128 Jun 26, 2020
@ghost

ghost commented Jul 26, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked and limited conversation to collaborators Jul 26, 2020