Service principal creation isn't finished before other resources start provisioning #156
Comments
Hi @tillig, this is most likely an eventual consistency error caused by AAD replication. We have done our best to prevent replication issues by trying to get the service principal from the API until we successfully get it 10 times in a row. This has fixed most replication problems, but it's not perfect. There isn't much more we can do here until the Graph API calls only return once the object is fully replicated and available. |
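To make the "get it 10 times in a row" check concrete, here is a minimal Python sketch of that polling pattern. It is not the provider's actual code: the function name `wait_for_consistent_reads`, the `fetch` callable, and the default attempt limit and delay are all hypothetical, chosen only for illustration.

```python
import time
from typing import Callable, Optional


def wait_for_consistent_reads(
    fetch: Callable[[], Optional[dict]],
    required_successes: int = 10,
    max_attempts: int = 100,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `fetch` finds the object `required_successes` times in a row.

    `fetch` is a hypothetical stand-in for a directory lookup: it returns the
    object, returns None, or raises LookupError when the object is not visible.
    """
    streak = 0
    for _ in range(max_attempts):
        try:
            found = fetch() is not None
        except LookupError:
            found = False
        # A single miss resets the streak, since a different replica may have answered.
        streak = streak + 1 if found else 0
        if streak >= required_successes:
            return True
        sleep(delay_seconds)
    return False
```

Even a full streak only shows that the replicas answering this caller have the object. A different service reading from another replica can still miss it, which matches the "not perfect" caveat above.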
Hmmm. I appreciate your looking at this. Is there some issue or feedback item with Azure I can +1 or do something with to encourage them to solve that? Also, is there some custom code or something I could run as a workaround, like... a "sleep and retry" or "try/catch/retry" or something? A configuration value to increase the number of retries already happening? Grasping at straws on this one, I don't know much about AAD replication. Or is it potentially better (recommended?) to create service principals in a separate Terraform execution from deployment of features requiring those service principals to force a sort of "manual delay" and allow replication to finish? |
It's a hard call. I have toyed with the idea of exposing the replication wait constants, but I'm afraid that may or may not work, as the server Terraform hits could be different from the one AKS is internally querying. The replication waits we added seem to ensure a Terraform run will always complete, but not AKS. I have seen people add null resources/local-exec with a sleep before, but that is far from ideal. Creating the SP & creds separately would most likely solve the issue for you, but that is also not ideal. |
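For anyone who wants the "sleep and retry" workaround outside of Terraform itself, a wrapper along these lines is one possibility. This is only a sketch under stated assumptions: `retry_with_delay` is a hypothetical helper, not something the provider or Terraform offers, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_delay(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds, pausing between failed attempts.

    Re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception = RuntimeError("operation was never attempted")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_error = err
            if attempt < attempts:
                # Give replication some time before trying again.
                sleep(delay_seconds)
    raise last_error
```

The `operation` passed in could be, for example, a function that runs `terraform apply` through `subprocess.run(..., check=True)`. Like the null resource with a sleep, this only papers over the replication delay and shares the same "far from ideal" caveat.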
The replication issue is strange enough that I have some more findings to add to the context. Whilst working on a workaround I found that the one mentioned in #4 (comment) did not work. Whilst trying to figure out a solution I found that Azure CLI users are also experiencing a similar issue with the AKS resource: Azure/azure-cli#9585. It seems to be due to replication and eventual consistency in Azure AD, and it is indeed a known Microsoft issue: Azure/AKS#1206. A valid workaround proposed by Microsoft is described here: Azure/AKS#1206 (comment) and here: https://docs.microsoft.com/en-us/azure/aks/troubleshooting#im-receiving-errors-that-my-service-principal-was-not-found-when-i-try-to-create-a-new-cluster-without-passing-in-an-existing-one So your thinking, @katbyte, was spot on with the solutions, of which neither is good... |
Could we make the recheck count configurable? |
Just a note that this issue is a duplicate of #128. Since we've added mitigation to try to work around this, I've closed that issue and this one too. On the workarounds proposed in the linked issue Azure/AKS#1206: we've added the get-10-times step as a partial workaround, which hopefully goes some way towards mitigating this problem, but retrying to establish a new session using the created app/SP is not really viable for us (at least right now), since there's no way to tell whether an app/SP is authorized to do whatever action we might attempt. If you're still affected by this, I encourage you to raise this as an Azure support issue. Ultimately, any additional steps we take in the provider to mitigate early OK responses and replication delays are working around upstream issues. If you have a specific idea or strategy for further improving our handling here, please do open a new issue for discussion. Thanks! |
Duplicate of #128 |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks! |
Terraform (and AzureAD Provider) Version
Terraform v0.12.10
Affected Resource(s)
azuread_service_principal
Terraform Configuration Files
(Subscription and tenant ID are not the real ones.)
Expected Behavior
I expect the provisioning to occur without issue: application, service principal, resource group, and Kubernetes cluster.
Actual Behavior
The Kubernetes cluster failed to provision because, when its provisioning started, the service principal had not yet finished being created. (Again, not the real IDs here.)
Running the provisioning again - no changes to the configuration - will succeed. By the time the provisioning runs the second time the service principal has finished being created and the Kubernetes cluster provisioning can proceed.
Steps to Reproduce
1. terraform apply to start things off.
2. terraform apply to finish provisioning.
Given this is fairly timing-related, I didn't re-run this several times to try and catch the debug output. If that's required, I can try allocating time to that.