DNS failed when using more than one node #751

Closed

akhenakh opened this issue Aug 22, 2019 · 15 comments
@akhenakh

I've been chasing this bug for months: pods on agent nodes can't talk to CoreDNS.

This is a fresh 0.8.1 arm64 deployment with 3 nodes (one master, two agents), but the same issue existed with previous k3s versions, on kernel 4.4 and 5.3; the host OS is Arch.
iptables v1.8.3 (legacy)

Using the default install script:

curl -sfL https://get.k3s.io | K3S_URL=https://rk0:6443 K3S_TOKEN=xxxx sh -

Expected:
The 10.43.0.10 DNS service (and, I assume, the whole network) should be correctly set up on each node.
It's easy to test, since the host can't reach the DNS when the problem appears:

dig www.google.com @10.43.0.10 

I found that scaling up the coredns deployment moves the working node to an agent, leaving the master node unable to reach the DNS:

sudo kubectl scale -n kube-system deployment.v1.apps/coredns --replicas=3 
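To see where the replicas actually land after scaling, something like the following can help (a sketch; it assumes the k3s coredns pods carry the usual k8s-app=kube-dns label):

# List the coredns pods together with the node each one is scheduled on
sudo kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide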

For some reason it sometimes just works from all 3 nodes, but most of the time it doesn't.
I've tried starting the agent manually after the boot sequence is complete, with no luck, and compared the iptables output; everything looks fine.
I've also tried pointing CoreDNS to 8.8.8.8 directly, with no result.
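For reference, pointing CoreDNS at an upstream such as 8.8.8.8 is usually done in the Corefile; a rough sketch, assuming the default coredns ConfigMap in kube-system and the stock forward directive:

# Open the Corefile for editing (the ConfigMap is assumed to be named "coredns")
sudo kubectl -n kube-system edit configmap coredns

# change:  forward . /etc/resolv.conf
# to:      forward . 8.8.8.8

# Recreate the pods so they pick up the change (assumes the k8s-app=kube-dns label)
sudo kubectl -n kube-system delete pod -l k8s-app=kube-dns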

@jadsonlourenco

jadsonlourenco commented Aug 25, 2019

I got the same issue last month. I'm using k3s v0.7.0 on the master, with 3 nodes, but coredns deploys only on one node, even though the node selector is "beta.kubernetes.io/os = linux".
To "solve" this issue I created a clone of the coredns deployment for each node...
Note: this doesn't really solve the issue; Kubernetes needs just one instance of CoreDNS. Putting a copy of CoreDNS on the local node restored the DNS service for pods on that node, but broke it on the other node.

EDIT: In my case I've tested running 2 k3s VMs on VirtualBox (using a shared network), with the same OS as I run on my server, Ubuntu 18.04, and the default DNS of k3s worked fine. So I think my problem is related to my "router" (MikroTik), even though I disabled all the firewall rules; my servers are behind a NAT too. I will keep trying.

EDIT 2: Well, I've installed a new cluster with the same setup using kubeadm with the Weave CNI: one VM on DigitalOcean (all-in-one k8s) and one bare-metal machine behind the NAT (MikroTik), with all ports forwarded (dst-nat) to my local node. Pod communication worked fine, but it doesn't work if Weave Net encryption is enabled (the default when using the Rancher command to deploy the cluster).
It doesn't work using Rancher with Canal for networking either. I will try all the Rancher options to narrow down the cause, but I think the issue is not related to the OS (iptables) but to the MikroTik NAT.

@akhenakh
Author

Same LAN here.

@akhenakh
Author

akhenakh commented Oct 17, 2019

I've patched 0.9.1 to use host-gw instead of vxlan and all problems disappeared.

Since 0.10 brings options for flannel, are you interested in a patch to enable host-gw?

The diff for 0.9.1 is very small:

--- a/pkg/agent/flannel/flannel.go
+++ b/pkg/agent/flannel/flannel.go
@@ -29,6 +29,7 @@ import (
        log "k8s.io/klog"

        // Backends need to be imported for their init() to get executed and them to register
+ _ "github.com/coreos/flannel/backend/hostgw"
        _ "github.com/coreos/flannel/backend/vxlan"
 )

diff --git a/pkg/agent/flannel/setup.go b/pkg/agent/flannel/setup.go
index c2da4f34..a6f6f11b 100644
--- a/pkg/agent/flannel/setup.go
+++ b/pkg/agent/flannel/setup.go
@@ -38,7 +38,7 @@ const (
        netJSON = `{
     "Network": "%CIDR%",
     "Backend": {
-    "Type": "vxlan"
+    "Type": "host-gw"
     }
 }
 `
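For anyone on 0.10 or later, the backend can likely be switched without patching; a minimal sketch, assuming the --flannel-backend server flag and the install script's INSTALL_K3S_EXEC variable:

# On the server node (assumes the --flannel-backend flag from k3s >= 0.10)
k3s server --flannel-backend=host-gw

# Or via the install script
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--flannel-backend=host-gw" sh -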

@erikwilson
Contributor

I think it would be okay to have an option for host-gw, not sure how @ibuildthecloud feels about it.

It would be good to get to the bottom of the issue, though.

@johnae

johnae commented Nov 5, 2019

I have a similar issue, not sure if it's the same. I noticed this problem when deploying k3s to more than one node. In my case it seems the master node cannot resolve DNS while the other nodes can, so any workload that ends up on the master fails to connect to things. For example, when deploying external-dns or cert-manager, if they end up on the master they fail.

@pckbls

pckbls commented Dec 3, 2019

Can confirm, I'm experiencing exactly the same issue as @johnae on k3s version v1.0.0 (18bd921) on a multi-node Raspberry Pi setup. Have you been able to find a workaround for that problem other than scheduling pods onto nodes other than master?

@ghost

ghost commented Dec 6, 2019

I have a k3s 1.0 cluster with 3 masters and 2 agents, and the same problem here. To add some details:

  • Only the node running coredns can resolve via 10.43.0.10; the others (both masters and agents) can't.
  • Pods with hostNetwork: true can't resolve via it; pods on the normal in-cluster network are okay.
  • TCP works everywhere, e.g. dig SERVICE @10.43.0.10 +tcp on any node is fine (see the quick check below).
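A quick way to reproduce that UDP/TCP difference from any node (a sketch; the record name is just an example):

# UDP (the default transport) - reportedly fails on nodes not running the coredns pod
dig kubernetes.default.svc.cluster.local @10.43.0.10 +time=2 +tries=1

# TCP - reportedly works from every node
dig kubernetes.default.svc.cluster.local @10.43.0.10 +tcp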

@ghost

ghost commented Dec 17, 2019

Another workaround/fix is to use NodeLocal DNSCache https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/.

Hope it helps.

@akhenakh
Author

I'm thinking this issue happens when your DNS server is one of the hosts itself.

@Neonox31

Neonox31 commented Apr 24, 2020

I feel like I have a similar problem.

With only the master node, DNS works well.
With additional nodes, DNS still works, but only for pods that are not on the same node as the coredns pod.

I'm thinking this issue happens when your DNS server is one of the hosts itself.

Indeed! My DNS server is deployed on the same host.

@e3b0c442

I'm thinking this issue happens when your DNS server is one of the hosts itself.

Can confirm this is not universal. I believe I am running into this issue, and my upstream DNS server is external to both the router and any k3s node.

I had thought that perhaps it might be an issue with a mixed-architecture cluster; my master is running on a Raspberry Pi 4 with Raspbian Buster; I have one worker node on AMD64/Ubuntu 18.04. I haven't been able to test out the multi-arch theory due to lack of nodes (I only have the one RPi right now).

Another commonality I see mentioned in this thread is that I have a MikroTik router. I will go down that rabbit hole momentarily. I think it is a fair possibility that this, or something on the host-OS side, is the issue and that it's related to VXLAN, because I can only ping pod IPs on the local node in the cluster.

@e3b0c442

And I've resolved my issue.

Check your firewalls -- make sure that your nodes can communicate with each other on UDP port 8472 (assuming you're using the default VXLAN backend for Flannel).
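One way to confirm the VXLAN traffic is actually arriving (a sketch, assuming iptables on the hosts; 192.168.1.0/24 is a placeholder for your node subnet):

# Watch for VXLAN packets on the wire while generating pod traffic from another node
sudo tcpdump -ni any udp port 8472

# Make sure the host firewall accepts them (placeholder subnet)
sudo iptables -I INPUT -p udp --dport 8472 -s 192.168.1.0/24 -j ACCEPT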

@akhenakh this could explain why the host-gw backend was working and VXLAN was not, in your case.

@deosrc

deosrc commented Dec 22, 2020

I'm thinking this issue happens when your DNS server is one of the hosts itself.

This seems to be the case for me.

Everything was working fine when I had DHCP and DNS handled by my router, forwarding DNS requests to a DNS server inside my cluster (Pi-hole).

When I tried changing DHCP and DNS to use Pi-hole directly (no changes to the pods, only router settings), the pods using hostNetwork: true all broke (they are locked to the same node). All of the requests from the host network seem to go through coredns and fail when it tries to resolve them via 8.8.8.8.

Playing around inside one of the host-network pods, I found that queries to 10.43.0.1 resolved fine for cluster DNS, as did querying 127.0.0.1.

EDIT: Sorry, I had some config here to try to get all names to resolve, but queries seemed to go only to the first nameserver. Specifying both didn't have the desired effect.

@stale

stale bot commented Jul 30, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jul 30, 2021
@stale stale bot closed this as completed Aug 13, 2021
@pchang388

And I've resolved my issue.

Check your firewalls -- make sure that your nodes can communicate with each other on UDP port 8472 (assuming you're using the default VXLAN backend for Flannel).

@akhenakh this could explain why the host-gw backend was working and VXLAN was not, in your case.

I know this issue is long closed/stale, but I just wanted to comment that this still works for me. The default flannel mode is vxlan for k3s installations, and opening port 8472 on the Ubuntu hosts worked. This port needs to be open for multi-node clusters; I did not notice the problem when using a single node.

Still works as of version v1.25.5+k3s2.
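On Ubuntu with ufw, that can look roughly like this (a sketch; 192.168.1.0/24 stands in for your node subnet):

# Allow VXLAN traffic from the other nodes (placeholder subnet)
sudo ufw allow from 192.168.1.0/24 to any port 8472 proto udp
sudo ufw reload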
