
[4.7 vSphere IPI] systemd-resolved configuration fails on installation #690

Closed
fortinj66 opened this issue Jun 14, 2021 · 10 comments · Fixed by openshift/machine-config-operator#2780

Comments

@fortinj66
Contributor

fortinj66 commented Jun 14, 2021

Describe the bug
With the newest stable release, systemd-resolved is misconfigured due to a missing directory:

Jun 14 14:25:14 localhost nm-dispatcher[1148]: /etc/NetworkManager/dispatcher.d/30-resolv-prepender: line 47: /etc/systemd/resolved.conf.d/60-kni.conf: No such file or directory
Jun 14 14:25:14 localhost nm-dispatcher[1078]: req:4 'up' [ens192], "/etc/NetworkManager/dispatcher.d/30-resolv-prepender": complete: failed with Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 1.
Jun 14 14:25:15 localhost NetworkManager[1039]: <warn>  [1623680715.0501] dispatcher: (4) /etc/NetworkManager/dispatcher.d/30-resolv-prepender failed (failed): Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 1.

/etc/systemd/resolved.conf.d should be created by FCOS, but it is not.

Since DNS resolution is now broken, the cluster does not complete installation.

This seems to be an FCOS 34 issue, as it does not happen with FCOS 33.

Note that the latest stable release uses FCOS 34 as the initial bootstrap image; prior stable releases used FCOS 33.
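The failure mode in the logs can be reproduced in isolation. This is a sketch using scratch paths, not the real node paths; the file name mirrors the 60-kni.conf from the error message:

```shell
#!/bin/sh
# Reproduces the failure mode from the dispatcher logs above: a shell
# redirection into a file whose parent directory does not exist fails,
# which is why 30-resolv-prepender exits with status 1.
# DEMO_DIR is a scratch path standing in for /etc/systemd/resolved.conf.d.
DEMO_DIR=./resolved.conf.d-demo
rm -rf "$DEMO_DIR"

if echo 'DNS=1.2.3.4' > "$DEMO_DIR/60-kni.conf" 2>/dev/null; then
  echo "unexpected: write succeeded"
else
  echo "write failed: No such file or directory"   # same error as in the logs
fi

# Creating the directory first (what the eventual fix does) makes the write succeed.
mkdir -p "$DEMO_DIR"
echo 'DNS=1.2.3.4' > "$DEMO_DIR/60-kni.conf"
echo "write succeeded after mkdir -p"
```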

Version
OpenShift Installer 4.7.0-0.okd-2021-06-13-090745

How reproducible
100%

@vrutkovs
Member

Does this happen before the -firstboot service runs? It seems the installer / machine-config-operator scripts should be creating the dir just in case, too.

@vrutkovs vrutkovs pinned this issue Jun 14, 2021
@fortinj66
Contributor Author

fortinj66 commented Jun 14, 2021

Does this happen before the -firstboot service runs? It seems the installer / machine-config-operator scripts should be creating the dir just in case, too.

My feeling is that since this is supposed to be provided by FCOS, and has been in the past, it should be there...
We shouldn't have to worry about whether a system-level directory exists...

Adding the directory "fixes" the symptom but not the underlying issue.

EDIT: This turns out not to be an FCOS issue but an OKD config issue, as mentioned below: OKD was the component creating the directory structure described above.

@fortinj66
Contributor Author

I need to look at the previous release to see when that directory gets created

@fortinj66
Contributor Author

So, fortunately, this is not an issue with FCOS...

Workaround until the fix above is implemented:

SSH into each affected node, then run:

sudo su -
mkdir -p /etc/systemd/resolved.conf.d/
reboot
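The manual steps above can be wrapped in a small idempotent script, since mkdir -p is a no-op when the directory already exists. A sketch, with CONF_DIR defaulting to a scratch path for unprivileged testing; on a real node it would be /etc/systemd/resolved.conf.d, run as root:

```shell
#!/bin/sh
# Idempotent form of the workaround above: ensure the drop-in directory
# that 30-resolv-prepender writes into actually exists.
set -eu

# Scratch default for testing; on a node: CONF_DIR=/etc/systemd/resolved.conf.d
CONF_DIR="${CONF_DIR:-./resolved.conf.d}"

# mkdir -p succeeds whether or not the directory already exists,
# so this is safe to re-run on every node.
mkdir -p "$CONF_DIR"
echo "ensured $CONF_DIR exists"
```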

@vrutkovs
Member

https://amd64.origin.releases.ci.openshift.org/releasestream/4.7.0-0.okd/release/4.7.0-0.okd-2021-06-14-203151 should have the installer/machine-config-operator fixes for this (but since they were squashed into existing commits, no diff is displayed).

vrutkovs/machine-config-operator@53bfabc
vrutkovs/installer@97d2611

@fortinj66
Contributor Author

fortinj66 commented Jun 14, 2021

I was able to successfully install the cluster:

INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/root/poc-4.7/okd-install/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.poc-c1v4.os.maeagle.corp
INFO Login to the console with user: "kubeadmin", and password: "xxx"
DEBUG Time elapsed per stage:
DEBUG     Infrastructure: 1m27s
DEBUG Bootstrap Complete: 24m12s
DEBUG                API: 4m36s
DEBUG  Bootstrap Destroy: 23s
DEBUG  Cluster Operators: 23m22s
INFO Time elapsed: 49m36s

@bobby0724

Hey @fortinj66, is there any chance you can help me check this error?

dial tcp: lookup image-registry.openshift-image-registry.svc on 10.10.8.132:53: no such host

@chrisu001

Same problem here, @bobby0724.

pivot: 47.34.202106101121-0 (2021-06-10T11:24:19Z)
FCOS:  34.20210529.3.0 (2021-06-14T14:45:28Z)

The /etc/resolv.conf on the node looks like:

search .
nameserver XX.XX.XX.XX

Within the dns-default pod of the openshift-dns namespace, the "." is missing:

search 
nameserver XX.XX.XX.XX

This leads to a "parse of /etc/resolv.conf failed" error within the dns-node-resolver container of the dns-default pod, so the /etc/hosts entry for the image-registry domain was not created by the container.
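A minimal illustration of the difference described above, using scratch files; the grep check is a crude stand-in for what a resolver parser does, not the actual dns-node-resolver code:

```shell
#!/bin/sh
# "search ." has one domain (the root); "search " has none, and an empty
# search list is what trips the parse error in dns-node-resolver.
printf 'search .\nnameserver 10.10.8.132\n' > good-resolv.conf
printf 'search \nnameserver 10.10.8.132\n'  > bad-resolv.conf

for f in good-resolv.conf bad-resolv.conf; do
  # Crude check: does the search line carry at least one domain token?
  if grep -Eq '^search[[:space:]]+[^[:space:]]' "$f"; then
    echo "$f: search line has at least one domain"
  else
    echo "$f: empty search list (would fail to parse)"
  fi
done
```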

My quickfix was to add "Domains=foo" to /etc/systemd/resolved.conf via a MachineConfig.

With that, CRI-O is working as expected with the correct /etc/hosts:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.30.33.172 image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
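That quickfix can also be sketched as a systemd-resolved drop-in (the resolved.conf key is spelled `Domains=`). Here `example.internal`, the file name, and the scratch CONF_DIR default are placeholders; on a node the directory would be /etc/systemd/resolved.conf.d from this issue, and the file would be delivered via a MachineConfig rather than written by hand:

```shell
#!/bin/sh
# Writes a resolved.conf drop-in that sets a search domain, so the
# generated /etc/resolv.conf gains a non-empty "search" line.
set -eu
CONF_DIR="${CONF_DIR:-./resolved.conf.d}"   # on a node: /etc/systemd/resolved.conf.d
DOMAIN="${DOMAIN:-example.internal}"        # placeholder domain

mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/60-domain.conf" <<EOF
[Resolve]
Domains=$DOMAIN
EOF

# Show the generated drop-in.
cat "$CONF_DIR/60-domain.conf"
```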

validated with:

oc run -it --attach --rm test --image=image-registry.openshift-image-registry.svc:5000/openshift/cli --command -- bash
If you don't see a command prompt, try pressing enter.
[root@test /]# oc version
Client Version: v4.2.0-alpha.0-996-g9b9f77a
Kubernetes Version: v1.20.0-1077+2817867655bb7b-dirty

Of course, this is only a workaround. I assume the root cause lies somewhere in the chain FCOS -> systemd-resolved -> openshift-dns pods -> parse error of resolv.conf, but I haven't investigated further yet.

@bobby0724

Thanks for the explanation. I updated my DHCP settings to deliver the hostname and domain, reinstalled the cluster using UPI with DHCP, and now the issue is gone.

