
installer not completing : incorrect ocp default registry format registry@sha@sha #933

Closed
kikisdeliveryservice opened this issue Dec 17, 2018 · 23 comments

Comments

@kikisdeliveryservice (Contributor) commented Dec 17, 2018

Version

$ openshift-install version
bin/openshift-install v0.7.0-master
and
bin/openshift-install v0.7.0

Platform (aws|libvirt|openstack):

aws

What happened?

Ran the installer and got errors related to pulling an image. The image reference appears to be malformed, containing two sha256 digests:
Dec 17 19:44:54 ip-10-0-9-101 bootkube.sh[3794]: unable to pull quay.io/openshift-release-dev/ocp-release@sha256@sha256:4f02d5c7183360a519a7c7dbe601f58123c9867cd5721ae503072ae62920575b: error getting default registries to try: invalid reference format

What you expected to happen?

I expected the installer to run to completion.

How to reproduce it (as minimally and precisely as possible)?

Run the installer; note that it hangs at the DEBUG lines below for 20 minutes:

DEBUG added kube-scheduler.157140aa428fc565: ip-10-0-3-94_a271882c-024e-11e9-ae96-024b8209fb6a became leader
DEBUG added kube-controller-manager.157140ab00eeb507: ip-10-0-3-94_a26c1b70-024e-11e9-98fb-024b8209fb6a became leader

until proceeding to the error:

ERROR: logging before flag.Parse: E1217 15:14:13.773270 1521 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
WARNING RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 148
WARNING Failed to connect events watcher: Get https://k-api.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=148&watch=true: dial tcp 52.37.184.199:6443: connect: connection refused

The pastebin logs below occur at the same time the WARNINGs appear in the main terminal.

Anything else we need to know?

Output from running journalctl -b -u bootkube --no-pager on bootstrap:
http://pastebin.test.redhat.com/685235

@kikisdeliveryservice (Contributor, Author)

cc: @abhinavdahiya @wking

@kikisdeliveryservice kikisdeliveryservice changed the title incorrect default registry format registry@sha@sha installer not completing : incorrect default registry format registry@sha@sha Dec 17, 2018
@kikisdeliveryservice kikisdeliveryservice changed the title installer not completing : incorrect default registry format registry@sha@sha installer not completing : incorrect ocp default registry format registry@sha@sha Dec 17, 2018
@cgwalters (Member)

Probably related to this code:

# convert the release image pull spec to an "absolute" form if a digest is available - this is
# safe to resolve after the actions above because podman will not pull the image once it is 
# locally available
if ! release=$( podman inspect {{.ReleaseImage}} -f '{{"{{"}} index .RepoDigests 0 {{"}}"}}' ) || [[ -z "${release}" ]]; then
	echo "Warning: Could not resolve release image to pull by digest" 2>&1
	release="{{.ReleaseImage}}"

Hmm. Why is this a warning instead of fatal? (And shouldn't that be 1>&2?)
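For illustration, a fail-fast rework along the lines @cgwalters suggests might look like the sketch below. This is only a sketch: the pull spec is an example from the logs, and the real bootkube.sh template renders {{.ReleaseImage}} and the Go-template braces before this code runs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fail-fast sketch of the excerpt above (illustrative only; the real
# bootkube.sh template expands {{.ReleaseImage}} at render time).
release_image="quay.io/openshift-release-dev/ocp-release:4.0.0-4"  # example pull spec

# Resolve the locally pulled release image to its digest-pinned reference.
if ! release=$(podman inspect "${release_image}" -f '{{ index .RepoDigests 0 }}') || [[ -z "${release}" ]]; then
    echo "Error: could not resolve release image to pull by digest" 1>&2  # 1>&2, as cgwalters notes
    exit 1  # fatal instead of a warning, so the failure surfaces immediately
fi
```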

@kikisdeliveryservice (Contributor, Author) commented Dec 17, 2018

If it's an invalid reference format and that format/target isn't going to change, it makes more sense to fail with a fatal error than to emit warnings for a long time only to fail eventually because of it.

So there are really two issues: (1) the doubled sha256 digest, and (2) that the installer takes so long to permanently fail on the bad format.

@wking (Member) commented Dec 18, 2018

Excerpted from the linked logs:

Dec 17 18:50:00 ip-10-0-9-101 bootkube.sh[4397]: Pulling release image...
Dec 17 18:50:02 ip-10-0-9-101 bootkube.sh[4397]: Trying to pull quay.io/openshift-release-dev/ocp-release:4.0.0-4...Getting image source signatures

Huh, update payload 4.0.0-4 was installer v0.6.0. Are you sure these logs are from installer v0.7.0?

Here's the hung wait in those logs:

Dec 17 18:57:38 ip-10-0-9-101 bootkube.sh[4397]: Pod Status:openshift-cluster-version/cluster-version-operator-7bff475b99-l7qfb        Pending
Dec 17 19:17:08 ip-10-0-9-101 bootkube.sh[4397]: Error: error while checking pod status: timed out waiting for the condition

And on the next bootkube.service run there's the doubled @sha256 issue:

Dec 17 19:17:14 ip-10-0-9-101 systemd[1]: Started Bootstrap a Kubernetes cluster.
Dec 17 19:17:14 ip-10-0-9-101 bootkube.sh[13694]: unable to pull quay.io/openshift-release-dev/ocp-release@sha256@sha256:4f02d5c7183360a519a7c7dbe601f58123c9867cd5721ae503072ae62920575b: error getting default registries to try: invalid reference format

Both v0.6.0 and v0.7.0 have the same code @cgwalters excerpted above. And once the wrapping templating is expanded, it seems to be working fine for me with my random Podman build from last week:

$ podman version
Version:       0.12.2-dev
Go Version:    go1.10.3
Git Commit:    "ec4cada3d1eabc77d9691a71fe2c99e3bf9343d6-dirty"
Built:         Wed Dec 12 23:58:05 2018
OS/Arch:       linux/amd64
$ podman pull quay.io/openshift-release-dev/ocp-release:4.0.0-4
$ podman inspect quay.io/openshift-release-dev/ocp-release:4.0.0-4 -f '{{ index .RepoDigests 0 }}'
quay.io/openshift-release-dev/ocp-release@sha256:4f02d5c7183360a519a7c7dbe601f58123c9867cd5721ae503072ae62920575b

The next time you see this, can you grab podman version and that inspect command of the hung bootstrap node?

Re: "why the warning?", @smarterclayton motivated it with a reference to broken registries, although I'm still not clear on the details there.

@kikisdeliveryservice (Contributor, Author)

@wking Haven't run into this since that day, so closing the issue, but will reopen if I see the behaviour again.

@lveyde (Contributor) commented Jan 9, 2019

Jan 09 16:27:24 test1-bootstrap bootkube.sh[24200]: unable to pull quay.io/openshift-release-dev/ocp-release@sha256@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea: error getting default registries to try: invalid reference format

@lveyde (Contributor) commented Jan 9, 2019

podman version

Version: 0.11.1.1
Go Version: go1.10.2
OS/Arch: linux/amd64

@lveyde (Contributor) commented Jan 9, 2019

podman pull quay.io/openshift-release-dev/ocp-release:4.0.0-9
Trying to pull quay.io/openshift-release-dev/ocp-release:4.0.0-9...Getting image source signatures
Skipping fetch of repeat blob sha256:a02a4930cb5d36f3290eb84f4bfa30668ef2e9fe3a1fb73ec015fc58b9958b17
Skipping fetch of repeat blob sha256:47e4121c7dbd4743f868526085b5bb36f4dff5cec8c1a3f992a6b7f2bf06403c
Skipping fetch of repeat blob sha256:ca9c3b8314517310a56ef66741850ee0acce144bb241778cc13635b62fb990b0
Skipping fetch of repeat blob sha256:6266d9236cf1f855894ae5b171bed487a1453703cea1265dae6f6cf1a64b2e76
Writing manifest to image destination
Storing signatures
d0bc83da1db0c2fa8af4ac4176bd3fff1607b18178b38732b5408d72ab7c0784

podman inspect quay.io/openshift-release-dev/ocp-release:4.0.0-9 -f '{{ index .RepoDigests 0 }}'
quay.io/openshift-release-dev/ocp-release@sha256@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea
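The doubled digest in that inspect output is easy to check for mechanically: a well-formed digest reference ("repo@sha256:<hex>") contains exactly one "@". A small sketch (the function name is mine, not from the installer):

```shell
#!/usr/bin/env bash
set -euo pipefail

# A well-formed digest reference contains exactly one '@' separator.
is_valid_digest_ref() {
    local at_only="${1//[^@]/}"   # strip everything except '@' characters
    [[ "${#at_only}" -eq 1 ]]
}

ref="quay.io/openshift-release-dev/ocp-release@sha256@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea"
is_valid_digest_ref "${ref}" && echo "valid" || echo "invalid"   # prints "invalid" for the doubled digest
```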

@wking (Member) commented Jan 9, 2019

containers/podman#2106

@lveyde (Contributor) commented Jan 9, 2019

@wking Thanks for the update!

lveyde added a commit to lveyde/installer that referenced this issue Jan 9, 2019
Due to an issue in podman, it incorrectly returns digests with a doubled @sha256 in the name.

This patch fixes the issue in bootkube.sh by replacing double occurrences of @sha256 with a single occurrence, as normally expected.

I.e.
quay.io/openshift-release-dev/ocp-release@sha256@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea

Reference to the issue:
openshift#933

Signed-off-by: Lev Veyde <[email protected]>
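The guard in that commit can be sketched as a simple bash string substitution (the function name is illustrative; the actual patch edits the bootkube.sh template):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Collapse the doubled "@sha256@sha256:" prefix produced by the podman bug
# into the single "@sha256:" that a valid reference requires. Already-valid
# references pass through unchanged.
normalize_digest_ref() {
    echo "${1/@sha256@sha256:/@sha256:}"
}

normalize_digest_ref "quay.io/openshift-release-dev/ocp-release@sha256@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea"
# -> quay.io/openshift-release-dev/ocp-release@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea
```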
@wking (Member) commented Jan 15, 2019

Re-opening until we get Podman 1.0.

@wking wking reopened this Jan 15, 2019
@jatanmalde

I have faced the same issue with installer 0.9.1. How do we fix the extra keyword that appears when the ocp-release image is pulled?

I have also seen the 0.10 releases. Do those fix this issue?

@wking (Member) commented Jan 16, 2019

0.9.1 and 0.10.0 both pin RHCOS 47.249 with podman 0.11.1.1. To get the fix from containers/podman#2106, we'll need podman 1.0. We expect to get it in the next installer release (or in that ballpark). In the meantime, you can repeat installs until you get through (for some reason we don't see this often in CI), or patch your installer with something like #1032.

@wking (Member) commented Jan 17, 2019

RHCOS 47.268 picked up podman 1.0. Hopefully we can pin that build (or later) in our next release.

@wking (Member) commented Jan 28, 2019

0.11.0 bumped to RHCOS 47.280 with Podman 1.0, so I think this is fixed.

/close

@openshift-ci-robot (Contributor)

@wking: Closing this issue.

In response to this:

0.11.0 bumped to RHCOS 47.280 with Podman 1.0, so I think this is fixed.

/close


@eparis (Member) commented Jan 31, 2019

Jan 31 00:53:06 ip-10-0-4-46 systemd[1]: Started Bootstrap a Kubernetes cluster.
Jan 31 00:53:06 ip-10-0-4-46 bootkube.sh[29983]: unable to pull quay.io/openshift-release-dev/ocp-release@sha256@sha256:8580a118ce951dd241e4a4b73a0e5f4cda3b56088b6c1ab56ccadbf8e270fb1d: error getting default registries to try: invalid reference format
Jan 31 00:53:06 ip-10-0-4-46 systemd[1]: bootkube.service: main process exited, code=exited, status=125/n/a
Jan 31 00:53:06 ip-10-0-4-46 systemd[1]: Unit bootkube.service entered failed state.
Jan 31 00:53:06 ip-10-0-4-46 systemd[1]: bootkube.service failed.
^C
[root@ip-10-0-4-46 ~]# rpm -q podman
podman-1.0.0-1.git82e8011.el7.x86_64

@eparis eparis reopened this Jan 31, 2019
@wking (Member) commented Jan 31, 2019

Grr. Do you still have this cluster, @eparis? I'd like podman inspect ... output as seen here.

@fridim commented Jan 31, 2019

@wking yes, it's still up.
I'm adding your key there and sending you access info

@wking (Member) commented Jan 31, 2019

Fix in flight with containers/podman#2251, although we'll need to wait for that to land in libpod and percolate through into RHCOS.

@lveyde (Contributor) commented Feb 1, 2019 via email

@wking (Member) commented Feb 1, 2019

Or... you could merge my PR, which fixes the issue right now and doesn't break anything once/when podman is finally fixed.

True. But the merge queue is long, and this is a 1% issue. So I'm fine letting it slide, but I'm also fine if other maintainers want to land your guard.

lveyde added a commit to lveyde/installer that referenced this issue Feb 14, 2019
Due to an issue in podman, it incorrectly returns digests with a doubled @sha256 in the name.

This patch fixes the issue in bootkube.sh by replacing double occurrences of @sha256 with a single occurrence, as normally expected.

I.e.
quay.io/openshift-release-dev/ocp-release@sha256@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea

Reference to the issue:
openshift#933

Signed-off-by: Lev Veyde <[email protected]>
@eparis (Member) commented Feb 20, 2019

I'm going to go ahead and close this issue, as I believe this has been fixed in podman. It's not an installer issue, and I don't think we should paper over a broken runtime.

@eparis eparis closed this as completed Feb 20, 2019