
Cannot pull private AWS ECR image from controller node #620

Closed
cknowles opened this issue Aug 15, 2016 · 14 comments · Fixed by kubernetes-retired/kube-aws#35

Comments

@cknowles

cknowles commented Aug 15, 2016

I'm not sure if this is a kube-aws issue or a Kubernetes issue, so hopefully someone can shed some light on it. I'm using kube-aws 0.8.0, and when launching a DaemonSet with a private AWS ECR image it seems that the controller node won't pull the image but all the worker nodes will.

The error I get in the pod logs is:

container "private_name" in pod "private_name-5wafj" is waiting to start: image can't be pulled

Steps to reproduce

  1. Launch a cluster using kube-aws; the default configuration seems to have the same issue, although we are using the multi-AZ setup
  2. Add a DaemonSet with a single container referencing a private AWS ECR image from the same AWS account (a minimal example is sketched below)
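
For reference, a minimal DaemonSet along these lines might look like the sketch below (the account ID, region, repository and resource names are placeholders, not the actual ones in use here):

# Hypothetical minimal DaemonSet referencing a private ECR image
# (placeholders: ACCOUNT_ID, REPO, TAG).
kubectl create -f - <<'EOF'
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: ecr-pull-test
spec:
  template:
    metadata:
      labels:
        app: ecr-pull-test
    spec:
      containers:
      - name: ecr-pull-test
        image: ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com/REPO:TAG
EOF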

I will try with the latest kube-aws version soon just to check whether it's still an issue.

Current workaround

None that I know of that will get the private image running on the controller. For now I've added a label to the workers and used nodeSelector in the DaemonSet so it doesn't get scheduled on the controller at all. In my case the daemon is monitoring related, so I need it to run on the controller as well.
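
Concretely, that workaround looks something like this (the label key and value here are arbitrary examples):

# Label every worker node (label key/value are arbitrary examples)
kubectl label nodes <worker-node-name> role=worker

# Then restrict the DaemonSet to labelled nodes via the pod spec:
#   spec:
#     template:
#       spec:
#         nodeSelector:
#           role: worker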

@whereisaaron

Hi, I haven't seen a problem with DaemonSets specifically, but when I did have issues with ~0.7 I found the journalctl logs on the nodes pretty informative. They normally said why the image couldn't be pulled, e.g. DNS or authentication.

Does the image pull work if you manually specify your own ImagePullSecret?

@cknowles
Author

Good ideas, thanks.

I checked the controller journalctl logs and found this:

Aug 15 05:04:27 ip-10-0-0-50.eu-west-1.compute.internal kubelet-wrapper[1508]: I0815 05:04:27.961740    1508 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/secret/default-token-icqo6" (spec.Name: "default-token-icqo6") to pod "792283c9-62a0-11e6-8d91-0a90a4438ea5" (UID: "792283c9-62a0-11e6-8d91-0a90a4438ea5"). Volume is already mounted to pod, but remount was requested.
Aug 15 05:04:27 ip-10-0-0-50.eu-west-1.compute.internal kubelet-wrapper[1508]: I0815 05:04:27.964366    1508 operation_executor.go:720] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/default-token-icqo6" (spec.Name: "default-token-icqo6") pod "792283c9-62a0-11e6-8d91-0a90a4438ea5" (UID: "792283c9-62a0-11e6-8d91-0a90a4438ea5").
Aug 15 05:04:28 ip-10-0-0-50.eu-west-1.compute.internal dockerd[1199]: time="2016-08-15T05:04:28.203174835Z" level=error msg="Handler for GET /images/REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED/json returned error: No such image: REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED"
Aug 15 05:04:28 ip-10-0-0-50.eu-west-1.compute.internal kubelet-wrapper[1508]: E0815 05:04:28.203765    1508 docker_manager.go:2085] container start failed: ImagePullBackOff: Back-off pulling image "REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED"
Aug 15 05:04:28 ip-10-0-0-50.eu-west-1.compute.internal kubelet-wrapper[1508]: E0815 05:04:28.204043    1508 pod_workers.go:183] Error syncing pod 792283c9-62a0-11e6-8d91-0a90a4438ea5, skipping: failed to "StartContainer" for "REDACTED" with ImagePullBackOff: "Back-off pulling image \"REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED\""

At first glance it seems like it can't find the image. The fact that the DaemonSet manages to pull the exact same definition on the workers means that isn't really the case, but I've heard on the AWS forums that there are still a few discrepancies between the Docker Registry API and the AWS ECR implementation. Out of curiosity I checked one of the workers and, strangely, it has this log line as well but still manages to pull the image:

Aug 13 13:08:47 ip-10-0-0-224.eu-west-1.compute.internal dockerd[1251]: time="2016-08-13T13:08:47.598642022Z" level=error msg="Handler for GET /images/REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED/json returned error: No such image: REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED"

I also tried to set the ImagePullSecret as you mentioned. I logged into ECR locally, grabbed the auth out of ~/.docker/config.json and pushed that into the namespace, followed by editing the DaemonSet to add imagePullSecrets. The cluster did not seem to cycle the pod on the controller at all, but after manually deleting it the newly scheduled pod pulled the image fine.
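
For anyone else trying this, a sketch of one way to create such a secret and wire it in (secret name, namespace and registry URL are placeholders; kubectl's docker-registry secret type does the base64 packaging for you):

# Create a registry secret from the ECR credentials (for ECR the username is
# always "AWS"; the password is the decoded authorization token).
kubectl create secret docker-registry ecr-pull-secret \
  --docker-server=https://ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$ECR_PASSWORD" \
  --docker-email=none \
  --namespace=default

# Reference it from the DaemonSet pod spec:
#   spec:
#     template:
#       spec:
#         imagePullSecrets:
#         - name: ecr-pull-secret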

This seems to indicate a specific problem with ECR auth for DaemonSets on the controller.

@whereisaaron

Good news @c-knowles - a workaround at least.

The ECR tokens expire every 12 hours in every region, so you'll need to update the secret regularly. You can fetch the token from the API with aws ecr get-authorization-token --region=$REGION

If you would like to automate that, you can use a script like this to update a Secret, either before each deployment or as a scheduled job. You can then use the Secret name with imagePullSecrets.

https://github.com/whereisaaron/kubernetes-aws-scripts/blob/master/create-ecr-imagepullsecret.sh
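
Roughly, the refresh boils down to something like the following (secret name and shell variables are placeholders, and the --dry-run | apply trick assumes a reasonably recent kubectl; the linked script is the complete version):

# Fetch a fresh ECR token (valid for ~12 hours) and decode the password part.
TOKEN=$(aws ecr get-authorization-token --region "$REGION" \
  --output text --query 'authorizationData[].authorizationToken' \
  | base64 -d | cut -d: -f2)

# Recreate the pull secret idempotently so this can run as a scheduled job.
kubectl create secret docker-registry ecr-imagepullsecret \
  --docker-server="https://${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com" \
  --docker-username=AWS \
  --docker-password="$TOKEN" \
  --docker-email=none \
  --dry-run -o yaml | kubectl apply -f -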

@cknowles
Author

Yeah, it's an OK workaround for now. I'd like to see if we can work out the root cause; should I report it in the Kubernetes project as well? I wondered whether reporting in both places is standard practice for kube-aws users when it seems like a core issue like this.

@ewang

ewang commented Aug 17, 2016

I think this issue is due to kube-aws setting the ECR permissions on the IAMWorker role, but not on the IAMController role. That's probably why setting imagePullSecrets manually worked for you.

@cknowles
Author

@ewang I've just tested that; it seems like something more involved, unfortunately. A shame, as editing stack-template.json before deploy would have been a better workaround.

On my existing cluster, I first ensured the master could not pull the private image for the DaemonSet. Then I manually added the policy below to the IAMRoleController role in the AWS console and deleted the pod from the controller. The image still fails to pull with the message in my original report.

{
   "Action": [
       "ecr:GetAuthorizationToken",
       "ecr:BatchCheckLayerAvailability",
       "ecr:GetDownloadUrlForLayer",
       "ecr:GetRepositoryPolicy",
       "ecr:DescribeRepositories",
       "ecr:ListImages",
       "ecr:BatchGetImage"
   ],
   "Resource": "*",
   "Effect": "Allow"
}
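
For completeness, the same statement can be attached from the CLI rather than the console along these lines (the role and policy names here are examples; kube-aws generates its own role names):

# controller-ecr-policy.json wraps the statement above in a policy document:
#   { "Version": "2012-10-17", "Statement": [ ...the statement above... ] }
aws iam put-role-policy \
  --role-name <controller-role-name> \
  --policy-name ecr-read-only \
  --policy-document file://controller-ecr-policy.json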

@colhom
Contributor

colhom commented Aug 29, 2016

@c-knowles is your ECR registry in the same region as your cluster? If not, the endpoint would still be accessible, but the kubelet will be unable to fetch the credentials from the metadata service.

If you check the kubelet logs (journalctl -u kubelet -f) on a node that is failing to pull the ECR image, you should be able to confirm this as the root cause.
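
For example, to pull out just the image pull and credential related lines (the kubelet's ECR credential provider messages typically come from aws_credentials.go):

# Follow the kubelet log and filter for image pull / ECR credential messages
journalctl -u kubelet -f | grep -iE 'ecr|aws_credentials|pull'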

@cknowles
Author

Afraid not, they are the same region (both eu-west-1). On a slightly different point, since this issue was filed I've moved some of the images which are failing to pull to a different AWS account and granted cross account permissions on the repos. Pulling still continues to work on the workers but not on the controller.

@Thermus

Thermus commented Sep 3, 2016

@c-knowles: Adding the necessary argument to the argument list of the kubelet.service file on the controller and adding a policy to the IAM role of the controller worked for me. I'm running 0.7.1 or 0.8.0, can't remember. Same problem as #518

@Thermus

Thermus commented Sep 4, 2016

I had to rebuild the cluster today with the newest version and I ran into the same problem as @c-knowles.

@cknowles
Author

cknowles commented Sep 5, 2016

Should I give that a go or are you saying that workaround no longer works on the latest version (0.8.1)?

@Thermus

Thermus commented Sep 5, 2016

@c-knowles It started working after a couple of hours. I guess Kubernetes tries fetching the credentials every couple of hours, irrespective of whether the previous attempt failed.

TL;DR: it works with the added policy and the required argument (which is already there in the latest version of coreos-kubernetes). So, based on the default installation, you just have to add the IAM policy to the controller role and then wait or restart the controller.
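
If you don't want to wait for the next credential refresh, restarting the kubelet on the controller should force one. On the CoreOS nodes kube-aws sets up, that is roughly (assuming the kubelet runs as the kubelet.service systemd unit):

# On the controller node: restart the kubelet so it fetches fresh ECR
# credentials using the instance role.
sudo systemctl restart kubelet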

@cknowles
Author

cknowles commented Sep 7, 2016

Thanks, I've just tried that and it looks like it works. The bit I had not tried before was waiting a little while; I expected it to work immediately. This time around I had already updated the IAMRoleController role in my earlier test, so I just removed my workaround and the controller node then successfully pulled the image.

@aaronlevy
Contributor

I'm going to close this issue as it seems like it has been resolved. Please let me know if it should be reopened.
