use workload identity for all prow integrations #180

Closed
listx opened this issue Feb 13, 2020 · 24 comments

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

listx commented Feb 13, 2020

This way, we don't have to give service account creds to Prow admins.

listx commented Feb 13, 2020

/cc @fejta

fejta commented Feb 13, 2020

Do you know of anywhere that you are explicitly requiring a secret.json file? If not, then we can just remove it and validate that things work.

fejta commented Feb 13, 2020

ref kubernetes/test-infra#15806

listx commented Feb 14, 2020

Here are the promoter's Prow jobs that use volume-mounted creds:

fejta commented Feb 14, 2020

listx commented Feb 14, 2020

Then the service account creds are never loaded. This means go-containerregistry will have to fall back to some other auth mechanism (presumably).

I am not sure how WI interacts with go-containerregistry's Copy() function, which the promoter uses to push images.

jonjohnsonjr commented:

I am not sure how WI interacts with go-containerregistry's Copy() function, which the promoter uses to push images.

That loads credentials from the docker config file. I don't have much context for this issue, but assuming you're doing something like:

$ gcloud auth configure-docker

... to wire up gcloud as a credential helper, this might work? I'm not sure if gcloud's credential helper actually works with workload identity.

Here's a primer on docker auth: https://github.com/google/go-containerregistry/blob/master/pkg/authn/README.md

If you can get a shell in one of these pods where you haven't activated a secret.json file, you could try:

echo "gcr.io" | docker-credential-gcloud get

to see if that returns valid credentials. You should get an access token as the Secret.

This will check the token you get back:

 curl "https://www.googleapis.com/oauth2/v1/tokeninfo?access_token=$(echo 'gcr.io' | docker-credential-gcloud get | jq .Secret -r)"

If that doesn't work there are definitely things we can do to work around this, let me know.

fejta commented Feb 14, 2020

Can you point me to the code that does this? I've mostly just had to do:

  • Only call gcloud auth activate-service-account with the key file if the env variable is set
  • Always call gcloud auth configure-docker, even when the env variable isn't set

Example: kubernetes/test-infra@966711a#diff-9f51d141e18f12470d41311d7ced5631
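
A minimal sketch of that flow, assuming the key-file path arrives in GOOGLE_APPLICATION_CREDENTIALS (the variable name is an assumption here, not something stated in this thread):

#!/usr/bin/env bash
set -euo pipefail

# Activate explicit service-account creds only when a key file was provided.
# Under Workload Identity no key file is mounted, so this branch is skipped
# and gcloud keeps the pod's ambient identity.
if [[ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]]; then
  gcloud auth activate-service-account --key-file="${GOOGLE_APPLICATION_CREDENTIALS}"
fi

# Always register gcloud as the Docker credential helper for *.gcr.io,
# whether or not a key file was present.
gcloud auth configure-docker --quiet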

jonjohnsonjr commented:

Can you point me to the code that does this?

The crane package uses authn.DefaultKeychain by default: https://github.com/google/go-containerregistry/blob/4336215636f7ace860f1e499cf5033d12073a44b/pkg/crane/options.go#L33

That is just a thin wrapper around github.com/docker/cli/cli/config, which fetches credentials based on your config file as you would expect (i.e. it works with gcloud auth configure-docker).

From the linked diff, things should just work, assuming gcloud has been set up appropriately.
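
Concretely, that lookup chain can be exercised from a shell in the pod (a sketch; it assumes jq and the gcloud SDK's docker-credential-gcloud binary are on PATH):

# DefaultKeychain reads ~/.docker/config.json, finds the credential helper
# registered for the registry, and shells out to it -- roughly:
helper=$(jq -r '.credHelpers["gcr.io"] // empty' ~/.docker/config.json)  # "gcloud" after configure-docker
echo "gcr.io" | "docker-credential-${helper}" get                        # should print an access token as the Secret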

fejta commented Feb 19, 2020

Linus, do you have opinions about the way you want these jobs to run gcloud auth configure-docker?

It isn't obvious to me that google/cloud-sdk comes preconfigured for docker. The most obvious solution would be to add a layer that adds this file as /root/.docker/config.json:

{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud",
    "staging-k8s.gcr.io": "gcloud",
    "marketplace.gcr.io": "gcloud"
  }
}

I can send a PR to do that, or I can do something else if you'd prefer.

fejta commented Feb 19, 2020

fejta commented Feb 19, 2020

/assign

listx commented Feb 20, 2020

Linus, do you have opinions about the way you want these jobs to run gcloud auth configure-docker?

It isn't obvious to me that google/cloud-sdk comes preconfigured for docker. The most obvious solution would be to add a layer that adds this file as /root/.docker/config.json:

{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud",
    "staging-k8s.gcr.io": "gcloud",
    "marketplace.gcr.io": "gcloud"
  }
}

I can send a PR to do that, or I can do something else if you'd prefer.

FWIW I already set up a JSON file like that in the Prow jobs. E.g. https://github.com/kubernetes-sigs/k8s-container-image-promoter/blob/90a75d7781a432cd5e9b56446af4d49747ff986b/test-e2e/cip/e2e-entrypoint-from-container.sh#L31-L34

listx commented Mar 10, 2020

The kubernetes/test-infra#16724 PR reverted kubernetes/test-infra#16463, because we saw permissions issues like this:

E0310 19:54:40.826604      11 inventory.go:1621] Request {{0 gcr.io/k8s-staging-cluster-api eu.gcr.io/k8s-artifacts-prod/cluster-api k8s-infra-gcr-promoter@k8s-artifacts-prod.iam.gserviceaccount.com cluster-api-controller cluster-api-controller sha256:b36ed8334a9d95116eb6dfb84eb54e58319fb8a3b2b4d97e014381e6ce144b2e  v0.3.0} <nil>}: error(s) encountered: [{running writeImage() failed to copy index: GET https://eu.gcr.io/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fcluster-api%2Fcluster-api-controller%3Apush%2Cpull&service=eu.gcr.io: UNAUTHORIZED: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication}]

This was most likely caused by the lack of a WI permission binding that needed to be actuated in the k8s-artifacts-prod project for the Prow bot account, like so:

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:k8s-prow-builds.svc.id.goog[test-pods/k8s-artifacts-prod]" \
  k8s-infra-gcr-promoter@k8s-artifacts-prod.iam.gserviceaccount.com

I need to add this binding into the infra scripts set up here.
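
As a sanity check once the binding exists, the pod's effective identity can be inspected via the standard GKE metadata endpoints (a sketch run from a shell inside the Prow pod; the expected values are assumptions based on the binding above):

# Which Google service account does this pod's Kubernetes service account map to?
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
# Expected (assuming the KSA annotation is in place): k8s-infra-gcr-promoter@k8s-artifacts-prod.iam.gserviceaccount.com

# Can the pod actually mint an access token for that account? This is the part
# that fails without the roles/iam.workloadIdentityUser binding.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"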

listx commented Mar 11, 2020

Pasting from kubernetes/k8s.io#655 (comment):

I think we have to actuate this first, then un-revert kubernetes/test-infra#16463 and see if the ci-k8sio-cip job (or post-k8sio-cip job) still works. Once all the jobs have migrated to WI, the JSON keys can be revoked.

/cc @thockin

listx commented Mar 24, 2020

Once kubernetes/test-infra#16917 merges I will have a much better understanding of WI, enabling me to port the remaining Prow jobs to WI systematically. The concept is simple but it's a bit of grunt work to get all the details right.

listx commented Mar 25, 2020

Since kubernetes/k8s.io#695 and kubernetes/test-infra#16948, the ci-k8sio-cip job is working!

I have some other more urgent matters to attend to (need to fix up #199), but I now understand the pattern to apply to workload-identity-ize the rest of the jobs.

fejta commented Apr 10, 2020

/unassign
/assign @listx

Awesome!

k8s-ci-robot assigned listx and unassigned fejta Apr 10, 2020
listx added the priority/important-longterm label Jun 4, 2020
fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Sep 2, 2020
listx commented Sep 3, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Sep 3, 2020
fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jan 14, 2021
fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 13, 2021
fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot commented:

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
