Hostname resolution via reverse DNS lookups broken in OKD 4.7/4.8 #648
Comments
I don't think it's a good idea. How was the initial hostname passed to the cluster - via DHCP/kernel args/custom ignition? |
FCOS docs state this as a method to set the hostname (https://docs.fedoraproject.org/en-US/fedora-coreos/hostname/), so I thought it might be ok to try. Also /usr/local/sbin/set-valid-hostname.sh states that /etc/hostname is authoritative. In my case the hostname is in DHCP and DNS, and since I am in control of both of them they can be considered pretty static. But don't get me wrong, I don't insist on manually setting the hostname if there are other means to get a valid hostname. It was just a check of whether the installation would run through if the issue (every node is 'localhost' after the first reboot) was put aside. It did. I have now installed the cluster without static IPs and got a valid hostname at first boot and later. It was taken from DHCP4 in both cases and the 'transient hostname' was set. /usr/local/sbin/set-valid-hostname.sh was happy and the cluster nodes are up. |
I experimented a bit with dhcp and static ips. My goal is to have nodes with ipv4 and ipv6 on the primary interface and ipv4 static ips without dhcp on the second interface that is used for storage. Please note that I didn't install a full cluster every time, so I don't know if the first experiments would have resulted in a working cluster.
ATM I have a running cluster that was installed with method 4 including dual stack hosts and ips for the storage network. |
Right, that's the only feasible option at the moment. If initially the interface can be configured via DHCP, then the whole static IP configuration (along with search domain) can be set in NM keyfiles. This however won't work on all setups - but we depend on search domain configuration in dracut to close the feature gap |
I have already installed a number of clusters with static IPs. At the beginning (before Afterburn/dracut became available in FCOS), I modified the boot images (initial kernel arguments) for the initial boot and set the config for ens192 via ignition with an NM config file. When Afterburn became available, I first used it only for the initial boot IP config and kept the NM config files in addition to this. After a number of tests I found that the Afterburn config worked so well that I decided to use only the Afterburn IP config and no longer set NM config files. I did also notice that I then no longer have a way to specify the search suffix. But I have not yet found that this causes problems. @gudroot, are you sure this causes problems with the internal registry? Perhaps we do not use this enough to even notice? How would I notice this? @vrutkovs, with some recent OKD release there was also the change that after cluster deployment, the IP config is no longer on the ens192 interface but on br-ex. The Afterburn IP config ends up in a NM config file named |
We haven't done any OKD 4.8 releases yet, so let's keep this ticket on topic - and file a new one if the change is needed |
I can confirm that the problem that @gudroot describes, the internal registry not being resolvable when "dns-search=." is set, is not specific to 4.8. It already happens at least with 4.7-2021-06-04 as well. I am pretty sure it did not happen with earlier releases. But I will verify this with another deployment/test cycle later today or tomorrow. Once I have more details I will create another issue and reference this one. |
@kai-uwe-rommel you seem to have found out already: to show the search domain problem simply run an image from the local registry like |
Meanwhile I have found out that the problem already appears with OKD 4.7 2021-06-04 as well, not only with 4.8. |
What was the behaviour in previous versions? I don't think an empty DNS domain is ever a correct situation. Edit: I updated to 2021-06-04 and NM didn't append |
We are not talking about the DNS domain. When you specify a FQDN via dracut/Afterburn as the hostname, then the domain part of it is used as the DNS domain. The problem is the search domain (or search suffix list). |
As a workaround I re-enabled the code in my deployment scripts that also creates NetworkManager config files via ignition in addition to the Afterburn vSphere config string. This for now solves the problem and I again have a working search domain suffix. I create a default.nmconnection referring to ens192 and the OKD setup picks up the data from it correctly for the br-ex config file it generates. (I had stopped generating NetworkManager config files as dracut/Afterburn worked so well. And a search suffix was not a problem so far - but now with the value of "." it is a problem). |
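The workaround above boils down to shipping a NetworkManager keyfile that names the search domain explicitly. Below is a minimal sketch of generating such a keyfile, not the deployment script referred to in the comment. The function name `build_keyfile` is hypothetical, and the addresses (192.0.2.0/24 documentation range) and the domain `example.com` are placeholders; only the interface name ens192 comes from this thread.

```python
import configparser
import io


def build_keyfile(interface, address, gateway, dns_servers, search_domains):
    """Render a NetworkManager keyfile with a static IPv4 address and a search list.

    Hypothetical helper for illustration; all values are supplied by the caller.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser["connection"] = {
        "id": "default",
        "type": "ethernet",
        "interface-name": interface,
    }
    parser["ipv4"] = {
        "method": "manual",
        # keyfile syntax: address1=<ip>/<prefix>,<gateway>
        "address1": f"{address},{gateway}",
        # list values in keyfiles are separated by semicolons
        "dns": "".join(f"{server};" for server in dns_servers),
        "dns-search": "".join(f"{domain};" for domain in search_domains),
    }
    buf = io.StringIO()
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder values, not taken from any real cluster.
    print(build_keyfile("ens192", "192.0.2.10/24", "192.0.2.1",
                        ["192.0.2.53"], ["example.com"]))
```

The resulting text would be written to a file such as default.nmconnection under /etc/NetworkManager/system-connections/ via ignition; NetworkManager expects keyfiles there to be readable by root only (mode 0600).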
My bad, I meant search domain. I don't think it ever got set previously - and I don't see why that would affect the deployment - |
Well that's the theory. :-) In practice, we see this difference now.
With the new release now that fails:
With all previous releases this was working fine. |
in my opinion the problem is in the node-resolver pods. Since I don't have the broken cluster any more, I'll have to recall this from memory:
on the host.
I am not sure a search domain "." offers any advantages over not having a search domain at all, but having an empty search domain is IMHO syntactically incorrect. It reminds me of this old bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=669163 Maybe dig and nslookup are not the only tools that stumble upon that kind of error in resolv.conf. |
No, I don't have the "broken" cluster available at the moment because I already tested/implemented the workaround. |
Perhaps it's similar to #690, could you check if this still happens on OKD 4.8 RC3? |
@vrutkovs, I think these are two different problems. The #690 issue talks about a missing /etc/systemd/resolved.conf.d directory. This case here is (also) about the "search ." problem. I don't have time to test OKD 4.8 RC versions at the moment but I do always test new 4.7 releases as I need them for my projects. So I just installed a 4.7 2021-06-19 cluster and still see the "search ." problem that this issue here is (also) about. But the /etc/systemd/resolved.conf.d directory does exist (#690 also is about 4.7). So the primary topic of #690 is not our problem here. But in #690 @bobby0724 and @chrisu001 were later hijacking the issue a bit also for the "search ." problem which is apparently unrelated to @fortinj66's /etc/systemd/resolved.conf.d problem. |
I was hoping a working resolved.conf configuration would resolve (or at least alleviate) this issue, but alas. Looks like there isn't anything to fix in OKD right now. It appears the preferred way to configure this is:
|
I can confirm, that the bug
node resolv.conf:
node-resolver pod
so the node-resolver pod still fails
Hint: this is an automated test-setup based on UPI Virtualboxes with static ip/kernelargs |
Are we talking about the search domain issue or the hostname issue now? IMO the hostname issue affects every dual stack cluster, since you cannot use both dhcp4 and dhcp6 at the same time (I tried it and only got IPv4), but you can't use static ips via afterburn either as this would trigger the search domain issue. Last time I checked dual stack was on the roadmap for 4.8, so if it still is, there should be some kind of idea on how to set this up correctly.
This would be a temporary fix for the search domain issue until dracut is capable of configuring a search domain? |
#698 (comment) - perhaps it's SELinux preventing the hostname from being set? |
Currently I'm already solving the "search=..." problem by adding a small NetworkManager file during ignition (to keep things together, static IP config also happens in this stage). |
The search=dot problem seems to be more common than I thought: #694 |
I can see the same selinux messages in the logs captured on a cluster that is affected by the hostname=localhost problem that was the subject of this issue before we started to discuss overly short search domains. I'll try another installation when selinux-policy 34.11-1.fc34 hits 4.8 |
@gudroot, how are you assigning hostnames? via DHCP or reverse DNS lookup? edit: Nevermind, I see it is by DHCP... |
@fortinj66 you did ask gudroot and not me but I would still like to add my $0.02 ... |
But you shouldn't have to... FCOS should be able to resolve the hostnames either by DHCP or by reverse lookup. Reverse lookup seems to be completely broken in FCOS 34. I'm going to test DHCP assignment later this morning. |
In theory, many things ought to work much better than they actually do. |
No, luckily NM team has COPR, so we can use 1.26 or 1.32 builds - or try 1.30.2 from fedora stable repo (latest from I mirrored CI test releases to
Please give these a try. |
I've tested all three versions...
|
So, it seems the issue with search domains with static IPs in FCOS 34 is related to systemd-resolved:
This line: |
The equivalent in FCOS 33:
No search line is written if there are no domains |
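To check whether a node is affected, one can look for a search list that consists only of the root domain. The following is a minimal sketch, assuming the symptom is a literal `search .` line in resolv.conf as described elsewhere in this thread. The function names and the sample text are hypothetical and are not output captured from a real node.

```python
def search_domains(resolv_conf_text):
    """Return the domains of the last 'search' line, or [] if there is none.

    Per resolv.conf(5), only the last search directive is used.
    """
    domains = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Lines starting with '#' or ';' are comments.
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if fields[0] == "search":
            domains = fields[1:]
    return domains


def has_root_only_search(resolv_conf_text):
    """True if the effective search list is just the root domain '.'."""
    return search_domains(resolv_conf_text) == ["."]


if __name__ == "__main__":
    # Illustrative input only.
    sample = "nameserver 192.0.2.53\nsearch .\n"
    print(has_root_only_search(sample))
```

On a node, the same check could be run against the contents of /etc/resolv.conf.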
Actually, I'd even regard it as smart if in the case there is no search domain specified explicitly, it would use the domain suffix from the hostname if that was specified as a FQDN ... |
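The fallback suggested here can be sketched in a few lines. This only illustrates the idea and is not how NetworkManager or systemd-resolved behave; the function name `default_search_domain` and the host names are hypothetical.

```python
def default_search_domain(hostname):
    """Return the domain part of an FQDN, or None for a short hostname."""
    host = hostname.rstrip(".")  # tolerate a trailing root dot
    _, sep, domain = host.partition(".")
    return domain if sep and domain else None


if __name__ == "__main__":
    for name in ("node1.okd.example.com", "localhost"):
        print(name, "->", default_search_domain(name))
```

With this rule, a node named node1.okd.example.com would get okd.example.com as its search domain, while a short name such as localhost would yield no search line at all.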
Seems it was implemented by systemd/systemd#17201, so we might want to revert systemd-* to v246 |
Also, this bug is like a million comments long. Could anyone summarize which problems we're hitting, what the workarounds are and which packages need to be updated/downgraded? IIUC it's two bugs:
|
edit: actually, prepender won't work as it doesn't exist for UPI installs.... |
|
Someone will have to file this upstream with FCOS and resolved. Any fixup commits will have to contain a proper justification and links to the filed issues. |
@kai-uwe-rommel @bobby0724 @gudroot would you folks be able to test: It's based off the It has a workaround for the |
I want to make sure the 'fix' works for folks other than me before I go through the rest of the hoops... |
Will do tonight. |
Seems to be fixed. No search line in resolv.conf at all (when configuring static IP purely via Afterburn, i.e. no search domain specified). |
Were you able to deploy new Deployments and pods? I was able to with my testing... |
Yes, looks all normal. |
fixes for |
thanks for fixing this issue |
I followed the conversation above and wanted to know whether there is a workaround for this issue. What should resolv.conf look like to work? |
Latest OKD now runs a service which removes If you're hitting an issue with similar symptoms and |
Hi, just installed
How can I get rid of it? |
I cannot reproduce this. I just installed a cluster with the same version and also static IPs and do not see this problem. |
Describe the bug
I installed a 4.8 RC2 cluster (vsphere UPI with static ips via ignition). The nodes came up for the first time with their correct hostname from DHCP4 but after the reboot their hostname was 'localhost'. I then reinstalled the cluster and provided an ignition snippet that sets /etc/hostname and the cluster is now running more or less fine. This was different before, i.e. I didn't have to set /etc/hostname with releases up to and including 4.7.
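For readers who want to reproduce the workaround, here is a minimal sketch of an ignition snippet that writes /etc/hostname. It is not the reporter's actual snippet. The function name `hostname_ignition`, the host name, and the Ignition spec version 3.2.0 are assumptions; check which spec version your release accepts.

```python
import json
from urllib.parse import quote


def hostname_ignition(hostname, spec_version="3.2.0"):
    """Build an Ignition config (as a dict) that writes /etc/hostname.

    Hypothetical helper; spec_version is an assumption, not taken from the report.
    """
    # Ignition accepts inline file contents as a percent-encoded data URL.
    source = "data:," + quote(hostname + "\n")
    return {
        "ignition": {"version": spec_version},
        "storage": {
            "files": [
                {
                    "path": "/etc/hostname",
                    "mode": 0o644,  # serialized as decimal 420 in JSON
                    "overwrite": True,
                    "contents": {"source": source},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Placeholder host name.
    print(json.dumps(hostname_ignition("node1.example.com"), indent=2))
```

The resulting JSON would have to be merged into each node's ignition config, with a different host name per node.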
Version
4.8.0-0.okd-2021-05-22-053824 on vsphere UPI
How reproducible
Always without the hostname ignition snippet.
Log bundle
If you really need a log bundle I can reinstall the cluster and create one on thursday. I however doubt it is really necessary as the issue appears to happen before the installer does much of its job.