
Cannot pull private AWS ECR image from controller node #620

Closed
cknowles opened this issue Aug 15, 2016 · 14 comments · Fixed by kubernetes-retired/kube-aws#35

Comments

@cknowles

cknowles commented Aug 15, 2016

I'm not sure if this is a kube-aws issue or a Kubernetes issue, so hopefully someone can shed some light on it. I'm using kube-aws 0.8.0, and when launching a DaemonSet with a private AWS ECR image it seems that the controller node won't pull the image but all the worker nodes will.

The error I get in the pod logs is:

container "private_name" in pod "private_name-5wafj" is waiting to start: image can't be pulled

Steps to reproduce

  1. Launch a cluster using kube-aws; the default configuration seems to have the same issue, although we are using the multi-AZ setup
  2. Add a DaemonSet with a single container referencing a private AWS ECR image from the same AWS account (a minimal example is sketched below)
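
For reference, a minimal DaemonSet along these lines might look like the sketch below (the account ID, region, repository and resource names are placeholders, not the actual ones in use here):

# Hypothetical minimal DaemonSet referencing a private ECR image
# (placeholders: ACCOUNT_ID, REPO, TAG).
kubectl create -f - <<'EOF'
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: ecr-pull-test
spec:
  template:
    metadata:
      labels:
        app: ecr-pull-test
    spec:
      containers:
      - name: ecr-pull-test
        image: ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com/REPO:TAG
EOF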

I will try with the latest kube-aws version soon just to check whether it's still an issue.

Current workaround

None that I know of that will get the private image running on the controller. For now I've added a label to the workers and used nodeSelector in the DaemonSet so it doesn't get scheduled on the controller at all. In my case the daemon is monitoring related, so I need it to run on the controller as well.
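
Concretely, that workaround looks something like this (the label key and value here are arbitrary examples):

# Label every worker node (label key/value are arbitrary examples)
kubectl label nodes <worker-node-name> role=worker

# Then restrict the DaemonSet to labelled nodes via the pod spec:
#   spec:
#     template:
#       spec:
#         nodeSelector:
#           role: worker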

@whereisaaron

Hi, I haven't seen a problem with DaemonSets specifically, but when I did have issues with ~0.7 I found the journalctl logs on the nodes pretty informative. They normally said why the image couldn't be pulled, e.g. DNS or authentication.

Does the image pull work if you manually specify your own ImagePullSecret?

@cknowles
Author

Good ideas, thanks.

I checked the controller journalctl logs and found this:

Aug 15 05:04:27 ip-10-0-0-50.eu-west-1.compute.internal kubelet-wrapper[1508]: I0815 05:04:27.961740    1508 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/secret/default-token-icqo6" (spec.Name: "default-token-icqo6") to pod "792283c9-62a0-11e6-8d91-0a90a4438ea5" (UID: "792283c9-62a0-11e6-8d91-0a90a4438ea5"). Volume is already mounted to pod, but remount was requested.
Aug 15 05:04:27 ip-10-0-0-50.eu-west-1.compute.internal kubelet-wrapper[1508]: I0815 05:04:27.964366    1508 operation_executor.go:720] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/default-token-icqo6" (spec.Name: "default-token-icqo6") pod "792283c9-62a0-11e6-8d91-0a90a4438ea5" (UID: "792283c9-62a0-11e6-8d91-0a90a4438ea5").
Aug 15 05:04:28 ip-10-0-0-50.eu-west-1.compute.internal dockerd[1199]: time="2016-08-15T05:04:28.203174835Z" level=error msg="Handler for GET /images/REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED/json returned error: No such image: REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED"
Aug 15 05:04:28 ip-10-0-0-50.eu-west-1.compute.internal kubelet-wrapper[1508]: E0815 05:04:28.203765    1508 docker_manager.go:2085] container start failed: ImagePullBackOff: Back-off pulling image "REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED"
Aug 15 05:04:28 ip-10-0-0-50.eu-west-1.compute.internal kubelet-wrapper[1508]: E0815 05:04:28.204043    1508 pod_workers.go:183] Error syncing pod 792283c9-62a0-11e6-8d91-0a90a4438ea5, skipping: failed to "StartContainer" for "REDACTED" with ImagePullBackOff: "Back-off pulling image \"REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED\""

At first glance it seems like it can't find the image. The fact that the DaemonSet manages to pull the exact same definition on the workers means that isn't really the case, but I've heard on the AWS forums that there are still a few discrepancies between the Docker Registry API and the AWS ECR implementation. Out of curiosity I checked one of the workers and, strangely, it has this log line as well but still manages to pull the image:

Aug 13 13:08:47 ip-10-0-0-224.eu-west-1.compute.internal dockerd[1251]: time="2016-08-13T13:08:47.598642022Z" level=error msg="Handler for GET /images/REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED/json returned error: No such image: REDACTED.dkr.ecr.eu-west-1.amazonaws.com/REDACTED"

I also tried to set the ImagePullSecret as you mentioned. I logged into ECR locally, grabbed the auth out of ~/.docker/config.json and pushed that into the namespace, followed by editing the DaemonSet to add imagePullSecrets. The cluster did not seem to cycle the pod on the controller at all, but after manually deleting it the newly scheduled pod pulled the image fine.
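
For anyone else trying this, a sketch of one way to create such a secret and wire it in (secret name, namespace and registry URL are placeholders; kubectl's docker-registry secret type does the base64 packaging for you):

# Create a registry secret from the ECR credentials (for ECR the username is
# always "AWS"; the password is the decoded authorization token).
kubectl create secret docker-registry ecr-pull-secret \
  --docker-server=https://ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$ECR_PASSWORD" \
  --docker-email=none \
  --namespace=default

# Reference it from the DaemonSet pod spec:
#   spec:
#     template:
#       spec:
#         imagePullSecrets:
#         - name: ecr-pull-secret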

This seems to indicate a specific problem with ECR auth for DaemonSets on the controller.

@whereisaaron

Good news @c-knowles - a workaround at least.

The ECR tokens expire every 12 hours in every region, so you'll need to update the secret regularly. You can fetch the token from the API with aws ecr get-authorization-token --region=$REGION

If you would like to automate that, you can use a script like this to update a Secret, either before each deployment or as a scheduled job. You can then use the Secret name with imagePullSecrets.

https://github.com/whereisaaron/kubernetes-aws-scripts/blob/master/create-ecr-imagepullsecret.sh
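
Roughly, the refresh boils down to something like the following (secret name and shell variables are placeholders, and the --dry-run | apply trick assumes a reasonably recent kubectl; the linked script is the complete version):

# Fetch a fresh ECR token (valid for ~12 hours) and decode the password part.
TOKEN=$(aws ecr get-authorization-token --region "$REGION" \
  --output text --query 'authorizationData[].authorizationToken' \
  | base64 -d | cut -d: -f2)

# Recreate the pull secret idempotently so this can run as a scheduled job.
kubectl create secret docker-registry ecr-imagepullsecret \
  --docker-server="https://${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com" \
  --docker-username=AWS \
  --docker-password="$TOKEN" \
  --docker-email=none \
  --dry-run -o yaml | kubectl apply -f -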

@cknowles
Author

Yeah, it's an OK workaround for now. I'd like to see if we can work out the root cause; should I report it in the Kubernetes project as well? I wondered whether reporting in both places is standard practice for kube-aws users when it seems like a core issue like this.

@ewang

ewang commented Aug 17, 2016

I think this issue is due to kube-aws setting the ECR permissions on the IAMWorker role, but not on the IAMController role. That's probably why setting imagePullSecrets manually worked for you.

@cknowles
Author

@ewang I've just tested that; it seems like something more involved, unfortunately. A shame, as editing stack-template.json before deploy would have been a better workaround.

On my existing cluster, I first ensured the master could not pull the private image for the DaemonSet. Then I manually added the policy below to the IAMRoleController role in the AWS console and deleted the pod from the controller. The image still fails to pull with the message in my original report.

{
   "Action": [
       "ecr:GetAuthorizationToken",
       "ecr:BatchCheckLayerAvailability",
       "ecr:GetDownloadUrlForLayer",
       "ecr:GetRepositoryPolicy",
       "ecr:DescribeRepositories",
       "ecr:ListImages",
       "ecr:BatchGetImage"
   ],
   "Resource": "*",
   "Effect": "Allow"
}
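
For completeness, the same statement can be attached from the CLI rather than the console along these lines (the role and policy names here are examples; kube-aws generates its own role names):

# controller-ecr-policy.json wraps the statement above in a policy document:
#   { "Version": "2012-10-17", "Statement": [ ...the statement above... ] }
aws iam put-role-policy \
  --role-name <controller-role-name> \
  --policy-name ecr-read-only \
  --policy-document file://controller-ecr-policy.json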

@colhom
Contributor

colhom commented Aug 29, 2016

@c-knowles is your ECR registry in the same region as your cluster? If not, the endpoint would still be accessible, but the kubelet will be unable to fetch the credentials from the metadata service.

If you check the kubelet logs (journalctl -u kubelet -f) on a node that is failing to pull the ECR image, you should be able to confirm this as the root cause.
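
For example, to pull out just the image pull and credential related lines (the kubelet's ECR credential provider messages typically come from aws_credentials.go):

# Follow the kubelet log and filter for image pull / ECR credential messages
journalctl -u kubelet -f | grep -iE 'ecr|aws_credentials|pull'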

@cknowles
Author

Afraid not, they are the same region (both eu-west-1). On a slightly different point, since this issue was filed I've moved some of the images which are failing to pull to a different AWS account and granted cross account permissions on the repos. Pulling still continues to work on the workers but not on the controller.

@Thermus

Thermus commented Sep 3, 2016

@c-knowles: Adding the necessary argument to the argument list of the kubelet.service file on the controller and adding a policy to the IAM role of the controller worked for me. I'm running 0.7.1 or 0.8.0, can't remember. Same problem as #518

@Thermus

Thermus commented Sep 4, 2016

I had to rebuild the cluster today with the newest version and I ran into the same problem as @c-knowles.

@cknowles
Author

cknowles commented Sep 5, 2016

Should I give that a go or are you saying that workaround no longer works on the latest version (0.8.1)?

@Thermus

Thermus commented Sep 5, 2016

@c-knowles It started working after a couple of hours. I guess Kubernetes tries fetching the credentials every couple of hours, irrespective of whether the previous attempt failed.

TL;DR: it works with the added policy and the required argument (which is already there in the latest version of coreos-kubernetes). So, based on the default installation, you just have to add the IAM policy to the controller role and then wait or restart the controller.
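
If you don't want to wait for the next credential refresh, restarting the kubelet on the controller should force one. On the CoreOS nodes kube-aws sets up, that is roughly (assuming the kubelet runs as the kubelet.service systemd unit):

# On the controller node: restart the kubelet so it fetches fresh ECR
# credentials using the instance role.
sudo systemctl restart kubelet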

@cknowles
Author

cknowles commented Sep 7, 2016

Thanks, I've just tried that and it looks like it works. The bit I had not tried before was waiting a little while; I expected it to work immediately. This time around I had already updated the IAMRoleController role in my earlier test, so I just removed my workaround and the controller node then successfully pulled the image.

@aaronlevy
Contributor

I'm going to close this issue as it seems like it has been resolved. Please let me know if it should be reopened.
