
[4.7 vSphere IPI] systemd-resolved configuration fails on installation #690

Closed
fortinj66 opened this issue Jun 14, 2021 · 10 comments · Fixed by openshift/machine-config-operator#2780

Comments

@fortinj66
Contributor

fortinj66 commented Jun 14, 2021

Describe the bug
With the newest stable release, systemd-resolved is misconfigured due to a missing directory:

Jun 14 14:25:14 localhost nm-dispatcher[1148]: /etc/NetworkManager/dispatcher.d/30-resolv-prepender: line 47: /etc/systemd/resolved.conf.d/60-kni.conf: No such file or directory
Jun 14 14:25:14 localhost nm-dispatcher[1078]: req:4 'up' [ens192], "/etc/NetworkManager/dispatcher.d/30-resolv-prepender": complete: failed with Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 1.
Jun 14 14:25:15 localhost NetworkManager[1039]: <warn>  [1623680715.0501] dispatcher: (4) /etc/NetworkManager/dispatcher.d/30-resolv-prepender failed (failed): Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 1.

/etc/systemd/resolved.conf.d should be created by FCOS, but it is not.

Since DNS resolution is now broken, the cluster does not complete installation.

This seems to be an FCOS 34 issue, as it does not happen with FCOS 33.

Note that the latest stable release uses FCOS 34 as the initial bootstrap image; prior stable releases used FCOS 33.
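The failure mode in the logs can be reproduced in isolation. This is a sketch using scratch paths, not the real node paths; the file name mirrors the 60-kni.conf from the error message:

```shell
#!/bin/sh
# Reproduces the failure mode from the dispatcher logs above: a shell
# redirection into a file whose parent directory does not exist fails,
# which is why 30-resolv-prepender exits with status 1.
# DEMO_DIR is a scratch path standing in for /etc/systemd/resolved.conf.d.
DEMO_DIR=./resolved.conf.d-demo
rm -rf "$DEMO_DIR"

if echo 'DNS=1.2.3.4' > "$DEMO_DIR/60-kni.conf" 2>/dev/null; then
  echo "unexpected: write succeeded"
else
  echo "write failed: No such file or directory"   # same error as in the logs
fi

# Creating the directory first (what the eventual fix does) makes the write succeed.
mkdir -p "$DEMO_DIR"
echo 'DNS=1.2.3.4' > "$DEMO_DIR/60-kni.conf"
echo "write succeeded after mkdir -p"
```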

Version
OpenShift Installer 4.7.0-0.okd-2021-06-13-090745

How reproducible
100%

@vrutkovs
Member

Does this happen before the -firstboot service runs? It seems the installer / machine-config-operator scripts should be creating the dir just in case, too.

@vrutkovs vrutkovs pinned this issue Jun 14, 2021
@fortinj66
Contributor Author

fortinj66 commented Jun 14, 2021

Does this happen before the -firstboot service runs? It seems the installer / machine-config-operator scripts should be creating the dir just in case, too.

My feeling is that since this is supposed to be provided by FCOS, and has been in the past, it should be there...
We shouldn't have to worry about whether a system-level directory exists...

Adding the directory "fixes" the symptom but not the underlying issue.

EDIT: This turns out not to be an FCOS issue but an OKD config issue, as mentioned below: OKD was the component creating the directory structure described above.

@fortinj66
Contributor Author

I need to look at the previous release to see when that directory gets created

@fortinj66
Contributor Author

So, fortunately, this is not an issue with FCOS...

Workaround until the fix above is implemented:

SSH into each affected node, then run:

sudo su -
mkdir -p /etc/systemd/resolved.conf.d/
reboot
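The manual steps above can be wrapped in a small idempotent script, since mkdir -p is a no-op when the directory already exists. A sketch, with CONF_DIR defaulting to a scratch path for unprivileged testing; on a real node it would be /etc/systemd/resolved.conf.d, run as root:

```shell
#!/bin/sh
# Idempotent form of the workaround above: ensure the drop-in directory
# that 30-resolv-prepender writes into actually exists.
set -eu

# Scratch default for testing; on a node: CONF_DIR=/etc/systemd/resolved.conf.d
CONF_DIR="${CONF_DIR:-./resolved.conf.d}"

# mkdir -p succeeds whether or not the directory already exists,
# so this is safe to re-run on every node.
mkdir -p "$CONF_DIR"
echo "ensured $CONF_DIR exists"
```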

@vrutkovs
Member

https://amd64.origin.releases.ci.openshift.org/releasestream/4.7.0-0.okd/release/4.7.0-0.okd-2021-06-14-203151 should have the installer/machine-config-operator fixes for this (but since they were squashed into existing commits, no diff is displayed).

vrutkovs/machine-config-operator@53bfabc
vrutkovs/installer@97d2611

@fortinj66
Contributor Author

fortinj66 commented Jun 14, 2021

I was able to successfully install the cluster:

INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/root/poc-4.7/okd-install/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.poc-c1v4.os.maeagle.corp
INFO Login to the console with user: "kubeadmin", and password: "xxx"
DEBUG Time elapsed per stage:
DEBUG     Infrastructure: 1m27s
DEBUG Bootstrap Complete: 24m12s
DEBUG                API: 4m36s
DEBUG  Bootstrap Destroy: 23s
DEBUG  Cluster Operators: 23m22s
INFO Time elapsed: 49m36s

@bobby0724

Hey @fortinj66, is there any chance you can help me check this error?

dial tcp: lookup image-registry.openshift-image-registry.svc on 10.10.8.132:53: no such host

@chrisu001

Same problem here, @bobby0724.

pivot: 47.34.202106101121-0 (2021-06-10T11:24:19Z)
FCOS:  34.20210529.3.0 (2021-06-14T14:45:28Z)

The /etc/resolv.conf on the node looks like:

search .
nameserver XX.XX.XX.XX

Within the dns-default pod of the openshift-dns namespace, the "." is missing:

search 
nameserver XX.XX.XX.XX

This leads to a "parse of /etc/resolv.conf failed" error within the dns-node-resolver container of the dns-default pod, so the /etc/hosts entry for the image-registry domain was not created by the container.
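A minimal illustration of the difference described above, using scratch files; the grep check is a crude stand-in for what a resolver parser does, not the actual dns-node-resolver code:

```shell
#!/bin/sh
# "search ." has one domain (the root); "search " has none, and an empty
# search list is what trips the parse error in dns-node-resolver.
printf 'search .\nnameserver 10.10.8.132\n' > good-resolv.conf
printf 'search \nnameserver 10.10.8.132\n'  > bad-resolv.conf

for f in good-resolv.conf bad-resolv.conf; do
  # Crude check: does the search line carry at least one domain token?
  if grep -Eq '^search[[:space:]]+[^[:space:]]' "$f"; then
    echo "$f: search line has at least one domain"
  else
    echo "$f: empty search list (would fail to parse)"
  fi
done
```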

My quickfix was to add "Domains=foo" to /etc/systemd/resolved.conf via a MachineConfig.

With that, CRI-O is working as expected with the correct /etc/hosts:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.30.33.172 image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
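That quickfix can also be sketched as a systemd-resolved drop-in (the resolved.conf key is spelled `Domains=`). Here `example.internal`, the file name, and the scratch CONF_DIR default are placeholders; on a node the directory would be /etc/systemd/resolved.conf.d from this issue, and the file would be delivered via a MachineConfig rather than written by hand:

```shell
#!/bin/sh
# Writes a resolved.conf drop-in that sets a search domain, so the
# generated /etc/resolv.conf gains a non-empty "search" line.
set -eu
CONF_DIR="${CONF_DIR:-./resolved.conf.d}"   # on a node: /etc/systemd/resolved.conf.d
DOMAIN="${DOMAIN:-example.internal}"        # placeholder domain

mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/60-domain.conf" <<EOF
[Resolve]
Domains=$DOMAIN
EOF

# Show the generated drop-in.
cat "$CONF_DIR/60-domain.conf"
```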

validated with:

oc run -it --attach --rm test --image=image-registry.openshift-image-registry.svc:5000/openshift/cli --command -- bash
If you don't see a command prompt, try pressing enter.
[root@test /]# oc version
Client Version: v4.2.0-alpha.0-996-g9b9f77a
Kubernetes Version: v1.20.0-1077+2817867655bb7b-dirty

Of course, this is only a workaround. I assume the root cause lies somewhere in the chain FCOS -> systemd-resolved -> openshift-dns pods -> parse error of resolv.conf, but I haven't investigated further yet.

@bobby0724

Thanks for the explanation. I updated my DHCP settings to deliver the hostname and domain, reinstalled the cluster using UPI with DHCP, and now the issue is gone.

