Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(Azure): add support for workload identity #3111

Closed

Conversation

weisdd
Copy link
Contributor

@weisdd weisdd commented Oct 25, 2022

Description

Context

In Release 2022-10-10, AKS has brought support for Workload Identity Federation which will eventually replace Pod-Managed Identities.
This is an important change, because it resolves several limitations of Pod-Managed Identities described here.

As I see, there is a PR #2885 that tried to add the functionality, though it haven't got any traction since July. Plus, it doesn't tackle federated token renewal. Thus, I decided to come up with a different implementation to see if it finds any support.

Implementation details

adal package has a function called NewServicePrincipalTokenFromFederatedToken which packs a federated token into a struct &ServicePrincipalFederatedSecret{jwt: jwt,} and passes it to NewServicePrincipalTokenWithSecret through secret field.

After that, we get an instance of *adal.ServicePrincipalToken. Unfortunately, it doesn't expose any methods to replace jwt in secret. So, suddenly, we have the following issue:
By default, the federated token is valid only for 1h, whereas the AAD token that is received in exchange for federated - for 24h (docs). Thus, after roughly 24h, the implicitly called .Refresh() method for *adal.ServicePrincipalToken would start producing errors due to stale jwt.

adal doesn't seem to offer any native support for tracking changes in the file with federated token. The only sane method that I was able to find by looking through the library was taking advantage of customRefreshFunc, which is supposed to return a fresh access token or an error if any. To avoid having to reinvent the wheel, I wanted to reuse as many existing methods as possible.

The final logic is the following: whenever an access token is about to expire (or just empty right after an instance of *adal.ServicePrincipalToken is generated), the library would automatically call a wrapper function that would read the federated token from disk and use it to obtain a new access token by invoking the standard process through explicit call of .Refresh() method.

On one of my AKS clusters, I used the patched build of external-dns with an additional diagnostic message, which would highlight any attempts to refresh an access token, and everything worked well. So, I think we have a robust setup now.

The updated user documentation is based on a similar section that was written for Pod-Managed Identities. It might still be a bit rough, I could easily fix it in case of any issues.

Fixes #2724

Extra information

The adal package that is currently used in Azure provider is scheduled to be deprecated by March 31 2023, so we'll eventually need to migrate away from it. Though that's a different question.

In case you're curious how the same is implemented in the official sidecar proxy, you can have a look at the code here.

Checklist

  • Unit tests updated
  • End user documentation updated

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 25, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Oct 25, 2022
@k8s-ci-robot
Copy link
Contributor

Welcome @weisdd!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 25, 2022
@weisdd
Copy link
Contributor Author

weisdd commented Oct 27, 2022

Updated description to add more details and to say that, on one of my AKS clusters, I used the patched build of external-dns with an additional diagnostic message, which would highlight any attempts to refresh an access token, and everything worked well. So, I think we have a robust setup now.

@weisdd
Copy link
Contributor Author

weisdd commented Nov 7, 2022

@seanmalloy @szuecs Please, let me know what is the current estimate on when you could potentially review the PR. - Any timeline is fine for me, it just a matter of settings some expectations. :)
On my end, I'm ready to bring any required changes to the PR in a timely manner.

Thanks!

@NissesSenap
Copy link

I have read the documentation from an Azure admin point of view and I think the docs looks good.
Looking forward to the feature.

Great job @weisdd

@weisdd
Copy link
Contributor Author

weisdd commented Dec 1, 2022

Updated the PR a bit:

@szuecs @seanmalloy would be great if you could take a look at the PR. Thanks!

@phillebaba
Copy link
Contributor

@weisdd may i suggest waiting for #3040 to be merged before moving along with this PR? I have updated to the lastest Azure Go SDK which removes all use of the adal lib. This will most likely simplify the logic to add support for wokload identities.

@k8s-ci-robot
Copy link
Contributor

@weisdd: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 3, 2022
@weisdd
Copy link
Contributor Author

weisdd commented Dec 5, 2022

@phillebaba Well, go-sdk doesn't have native support for workload identity yet, it'll only be added in Beta around December/January. Till that time, this code snippet can potentially be used (I haven't tested whether it has any token renewal issues myself). So, it won't make things easier, at least for now.

Based on the recent discussion in Slack, maintainers are busy, and there's kind of pause till a plugins provider is ready, so I'm unlikely to see any reviews in near future. Thus, I guess there's not much sense in maintaining the PR anyway, and by the time yours is merged, there's likely to be a new go-sdk version out with support for Workload Identity.

@phillebaba
Copy link
Contributor

I have totally missed that issue, thanks for the link. Yeah let's just hope the release is not just pushed forward further past January in that case.

@hbuckle
Copy link

hbuckle commented Jan 10, 2023

Looks like a beta azidentity package has been published with support for workload identity, would be great to get it into external-dns

@pinkfloydx33
Copy link

Any word on this? The existing pod-identity/msi feature (which the azure/azure-private-dns providers support) was deprecated in October and will see total archival/sunset by September of this year. The new Azure Workload MSI feature will be GA in March. Without movement here it will no longer be possible to use MSI to connect to Azure DNS zones without specifying credentials (for an SP, not an MSI).

Azure has provided a side car container that intercepts token requests (sorta like pod identity) and forwards them through the federated credentials flow. While it does indeed work, it is meant as a stop gap for use during migration and not as a long term solution (even the azure docs state as much)

We've already seen support added to other libraries: cert-manager, keda, external-secrets, akv2k8s (and I'm sure others, thanks at least in part @weisdd!)... There's definitely a general need/want for this throughout the ecosystem.

Also note that the instructions for the external-dns azure provider here includes instructions for pod-identity as well as enabling the AKS-native pod-identity "v1" Azure preview feature, which should not be enabled anymore. The docs are pushing people towards technologies that are deprecated and/or don't/won't soon exist!!

@weisdd
Copy link
Contributor Author

weisdd commented Jan 25, 2023

I'll update the PR once #3040 is merged.


#### Update labels and annotations on ExternalDNS service account

To instruct Workload Identity webhook to inject a projected token into the ExternalDNS pod, the service account needs to have a label `azure.workload.identity/use: "true"` and an annotation `azure.workload.identity/client-id: <IDENTITY_CLIENT_ID>`:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weisdd - Went through the tutorial to test out building AKS cluster with workload identity and getting external DNS fully working only using managed identity. Thanks so much for the hard work on this and really hoping we can get this merged in the near future.

There was one step that was missing which is adding the necessary annotations to the pod (not just the service account). There were a few errors that keep showing up about the token missing from the azure.json file, so after some research found this portion in the docs that should be included in this tutorial for others that are trying to enable workload identity for their AKS clusters.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pod label was not a requirement during the original private preview. It was announced/communicated it was going to be required for GA, and was then subsequently rolled out to AKS in the next update (still during preview!). Point being, it wasn't a requirement when those instructions were written 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slickmode Thanks for the review!
Indeed, the tutorial was written before v1.0.0-beta.0 was released last month, which introduced the need for the extra label to limit the number of pods intercepted by a webhook (through an object selector) and enforce failurePolicy: Fail. As far as I understand, the design for annotations and labels is finalized now, there's already v1.0.0-rc.0 release, v1.0.0 is expected in a few weeks time (hopefully, it won't get delayed).
I'd be happy to update my PR to include the label in the guide, just wasn't sure if there's any chance for the PR to be reviewed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slickmode I've just pushed the updated guide to make sure we won't forget it whenever maintainers get to review the PR.
@pinkfloydx33 thanks for your comments here and in other places, they're always helpful :)

…kload identity v1.0.0-beta.0 and newer

Signed-off-by: Igor Beliakov <[email protected]>
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weisdd
Once this PR has been reviewed and has the lgtm label, please assign seanmalloy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

```bash
$ kubectl patch serviceaccount external-dns --namespace "default" --patch \
"{\"metadata\": {\"annotations\": {\"azure.workload.identity/client-id\": \"${IDENTITY_CLIENT_ID}\"}}}"
$ kubectl patch deployment external-dns --namespace "default" --patch \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this should be part of the deployment. https://github.com/kubernetes-sigs/external-dns/blob/master/charts/external-dns/templates/deployment.yaml

Add the client_id to the values.yaml and if .Values.provider==azure (provider already exists in the values.yaml) and if the client_id for workload identity was set, add the label to the deployment plus add the annotation to the service account

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebader Many months ago, when I was working on this section, I just tried to make it look similar to other sections of the guide (to keep the guide consistent). When looking at the official external-dns chart, it doesn't seem to have much focus on offering something special for the providers external-dns has integrations with. So, I'm not sure if the changes you suggested can get accepted by maintainers. Though, we can simply rely on podLabels and serviceAccount.annotations of the existing chart.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lean towards sebader in that I don't think patching should be encouraged but I think I lean a bit towards weisdd with that podLabels, podAnnotations and serviceAccount.labels is the way to go and explicitly setting them. Having logic for automatically setting them feels fragile since Microsoft keeps changing the way things behave.

@giuliocalzolari
Copy link

giuliocalzolari commented May 19, 2023

Do you have any updates or ETA ?
Would be amazing to have such kind of feature

@tsolodov
Copy link

Hi folks, any chance it will be completed ? What blocks it?

@weisdd
Copy link
Contributor Author

weisdd commented Aug 24, 2023

Well, in the beginning, maintainers were simply overloaded with requests (e.g. too many provider integrations to maintain => they mentioned the need to develop a new plugin system), and, then, I guess, the PR just got covered with dust :)

The code itself works just fine. The same workaround for token renewal has been accepted by cert-manager project last year, and I haven't seen any issues there. Now, as #3040 seems to be getting closer to be merged, my PR will need to be rewritten using the updated SDK. That should be a trivial change as the SDK supports token renewals out of the box.
For now, I'm just keeping an eye on #3040 and highlighting it all to Microsoft once in a while during our meetings.

@nbjohnson
Copy link

nbjohnson commented Sep 1, 2023

@weisdd Looks like #3040 just got merged which is great news, cant wait for this guy to get updated and hopefully merged as well

@weisdd
Copy link
Contributor Author

weisdd commented Sep 1, 2023

@nbjohnson that's great news :) I've just rewritten the code, and currently preparing a cluster to test my new build.

@nbjohnson
Copy link

nbjohnson commented Sep 1, 2023

@nbjohnson that's great news :) I've just rewritten the code, and currently preparing a cluster to test my new build.

@weisdd Great to hear, thanks for all your work on the workload identity support. Then we just have to bring this to the maintainer's attention

@weisdd
Copy link
Contributor Author

weisdd commented Sep 1, 2023

@nbjohnson Thanks!
I'll let them know once the PR is ready for review :)

@weisdd
Copy link
Contributor Author

weisdd commented Sep 1, 2023

Hey everyone, I decided not to update this PR to make sure the adal implementation is not lost through a force-push. - It might be used by someone who maintains a custom build.

I've just opened another PR #3906 that implements Workload Identity using azidentity. The code there is much simpler as token renewal is available out of the box. Also, I added a section with helm values that you can use with the official chart.
Please, take a look there :)

@weisdd weisdd closed this Sep 1, 2023
@weisdd
Copy link
Contributor Author

weisdd commented Sep 1, 2023

P.S. For those who wants to test the new build, you can find it here: quay.io/weisdd/external-dns:v0.13.5-wi0
It's built using the following commands:

export IMAGE=quay.io/weisdd/external-dns
export VERSION=v0.13.5-wi0
make build.push/multiarch

NOTE: ko has to be installed first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Azure Workload Identities