DNS failed when using more than one node #751

Closed

akhenakh opened this issue Aug 22, 2019 · 15 comments
@akhenakh

I've been chasing this bug for months: pods on agent nodes can't talk to CoreDNS.

This is a fresh 0.8.1 arm64 deployment with 3 nodes (one master, two agents), but the same issue existed with previous k3s versions, on kernel 4.4 and 5.3; the host OS is Arch.
iptables v1.8.3 (legacy)

Using the default install script:

curl -sfL https://get.k3s.io | K3S_URL=https://rk0:6443 K3S_TOKEN=xxxx sh -

Expected:
The 10.43.0.10 DNS service (and, I assume, the whole network) should be correctly set up on each node.
It's easy to test, since the host can't reach the DNS when the problem appears:

dig www.google.com @10.43.0.10 

I found that scaling up the coredns deployment moves the working node to an agent, leaving the master node unable to reach the DNS:

sudo kubectl scale -n kube-system deployment.v1.apps/coredns --replicas=3 
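To see where the replicas actually land after scaling, something like the following can help (a sketch; it assumes the k3s coredns pods carry the usual k8s-app=kube-dns label):

# List the coredns pods together with the node each one is scheduled on
sudo kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide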

For some reason it sometimes just works from all 3 nodes, but most of the time it doesn't.
I've tried starting the agent manually after the boot sequence is complete, with no luck, and compared the iptables output; everything looks fine.
I've also tried pointing CoreDNS to 8.8.8.8 directly, with no result.
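For reference, pointing CoreDNS at an upstream such as 8.8.8.8 is usually done in the Corefile; a rough sketch, assuming the default coredns ConfigMap in kube-system and the stock forward directive:

# Open the Corefile for editing (the ConfigMap is assumed to be named "coredns")
sudo kubectl -n kube-system edit configmap coredns

# change:  forward . /etc/resolv.conf
# to:      forward . 8.8.8.8

# Recreate the pods so they pick up the change (assumes the k8s-app=kube-dns label)
sudo kubectl -n kube-system delete pod -l k8s-app=kube-dns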

@jadsonlourenco

jadsonlourenco commented Aug 25, 2019

I got the same issue last month. I'm using k3s v0.7.0 on the master, with 3 nodes, but coredns deploys only on one node, even though the node selector is "beta.kubernetes.io/os = linux".
To "solve" this issue I created a clone of the coredns deployment for each node...
Note: this doesn't really solve the issue; Kubernetes needs just one instance of CoreDNS. Putting a copy of CoreDNS on the local node restored the DNS service for pods on that node, but broke it on the other node.

EDIT: In my case I've tested running 2 k3s VMs on VirtualBox (using a shared network), with the same OS as I run on my server, Ubuntu 18.04, and the default DNS of k3s worked fine. So I think my problem is related to my "router" (MikroTik), even though I disabled all the firewall rules; my servers are behind a NAT too. I will keep trying.

EDIT 2: Well, I've installed a new cluster with the same setup using kubeadm with the Weave CNI: one VM on DigitalOcean (all-in-one k8s) and one bare-metal machine behind the NAT (MikroTik), with all ports forwarded (dst-nat) to my local node. Pod communication worked fine, but it doesn't work if Weave Net encryption is enabled (the default when using the Rancher command to deploy the cluster).
It doesn't work using Rancher with Canal for networking either. I will try all the Rancher options to narrow down the cause, but I think the issue is not related to the OS (iptables) but to the MikroTik NAT.

@akhenakh
Author

Same LAN here.

@akhenakh
Author

akhenakh commented Oct 17, 2019

I've patched 0.9.1 to use host-gw instead of vxlan and all problems disappeared.

Since 0.10 brings options for flannel, are you interested in a patch to enable host-gw?

The diff for 0.9.1 is very small:

--- a/pkg/agent/flannel/flannel.go
+++ b/pkg/agent/flannel/flannel.go
@@ -29,6 +29,7 @@ import (
        log "k8s.io/klog"

        // Backends need to be imported for their init() to get executed and them to register
+ _ "github.com/coreos/flannel/backend/hostgw"
        _ "github.com/coreos/flannel/backend/vxlan"
 )

diff --git a/pkg/agent/flannel/setup.go b/pkg/agent/flannel/setup.go
index c2da4f34..a6f6f11b 100644
--- a/pkg/agent/flannel/setup.go
+++ b/pkg/agent/flannel/setup.go
@@ -38,7 +38,7 @@ const (
        netJSON = `{
     "Network": "%CIDR%",
     "Backend": {
-    "Type": "vxlan"
+    "Type": "host-gw"
     }
 }
 `
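For anyone on 0.10 or later, the backend can likely be switched without patching; a minimal sketch, assuming the --flannel-backend server flag and the install script's INSTALL_K3S_EXEC variable:

# On the server node (assumes the --flannel-backend flag from k3s >= 0.10)
k3s server --flannel-backend=host-gw

# Or via the install script
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--flannel-backend=host-gw" sh -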

@erikwilson
Contributor

I think it would be okay to have an option for host-gw, not sure how @ibuildthecloud feels about it.

It would be good to get to the bottom of the issue, though.

@johnae

johnae commented Nov 5, 2019

I have a similar issue, not sure if it's the same. I noticed this problem when deploying k3s to more than one node. In my case it seems the master node cannot resolve DNS while the other nodes can, so any workload that ends up on the master fails to connect to things. For example, when deploying external-dns or cert-manager, if they end up on the master they fail.

@pckbls

pckbls commented Dec 3, 2019

Can confirm, I'm experiencing exactly the same issue as @johnae on k3s version v1.0.0 (18bd921) on a multi-node Raspberry Pi setup. Have you been able to find a workaround for that problem other than scheduling pods onto nodes other than master?

@ghost

ghost commented Dec 6, 2019

I have a k3s 1.0 cluster with 3 masters and 2 agents, and the same problem here. To add some details:

  • Only the node running coredns can resolve via 10.43.0.10; the others (both masters and agents) can't.
  • Pods with hostNetwork: true can't resolve via it; pods on the normal in-cluster network are okay.
  • TCP works everywhere, e.g. dig SERVICE @10.43.0.10 +tcp on any node is fine (see the quick check below).
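A quick way to reproduce that UDP/TCP difference from any node (a sketch; the record name is just an example):

# UDP (the default transport) - reportedly fails on nodes not running the coredns pod
dig kubernetes.default.svc.cluster.local @10.43.0.10 +time=2 +tries=1

# TCP - reportedly works from every node
dig kubernetes.default.svc.cluster.local @10.43.0.10 +tcp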

@ghost

ghost commented Dec 17, 2019

Another workaround/fix is to use NodeLocal DNSCache https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/.

Hope it helps.

@akhenakh
Author

I'm thinking this issue happens when your DNS server is one of the hosts itself.

@Neonox31

Neonox31 commented Apr 24, 2020

I feel like I have a similar problem.

With only the master node, DNS works well.
With additional nodes, DNS still works, but only for pods that are not on the same node as the coredns pod.

I'm thinking this issue happens when your DNS server is one of the hosts itself.

Indeed! My DNS server is deployed on the same host.

@e3b0c442

I'm thinking this issue happens when your DNS server is one of the hosts itself.

Can confirm this is not universal. I believe I am running into this issue, and my upstream DNS server is external to both the router and any k3s node.

I had thought that perhaps it might be an issue with a mixed-architecture cluster; my master is running on a Raspberry Pi 4 with Raspbian Buster; I have one worker node on AMD64/Ubuntu 18.04. I haven't been able to test out the multi-arch theory due to lack of nodes (I only have the one RPi right now).

Another commonality I see mentioned in this thread is that I have a MikroTik router. I will go down that rabbit hole momentarily. I think it is a fair possibility that this, or something on the host-OS side, is the issue and that it's related to VXLAN, because I can only ping pod IPs on the local node in the cluster.

@e3b0c442

And I've resolved my issue.

Check your firewalls -- make sure that your nodes can communicate with each other on UDP port 8472 (assuming you're using the default VXLAN backend for Flannel).
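One way to confirm the VXLAN traffic is actually arriving (a sketch, assuming iptables on the hosts; 192.168.1.0/24 is a placeholder for your node subnet):

# Watch for VXLAN packets on the wire while generating pod traffic from another node
sudo tcpdump -ni any udp port 8472

# Make sure the host firewall accepts them (placeholder subnet)
sudo iptables -I INPUT -p udp --dport 8472 -s 192.168.1.0/24 -j ACCEPT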

@akhenakh this could explain why the host-gw backend was working and VXLAN was not, in your case.

@deosrc

deosrc commented Dec 22, 2020

I'm thinking this issue happens when your DNS server is one of the hosts itself.

This seems to be the case for me.

Everything was working fine when I had DHCP and DNS handled by my router, forwarding DNS requests to a DNS server inside my cluster (Pi-hole).

When I tried changing DHCP and DNS to use Pi-hole directly (no changes to the pods, only router settings), the pods using hostNetwork: true all broke (they are locked to the same node). All of the requests from the host network seem to go through coredns and fail when it tries to resolve them via 8.8.8.8.

Playing around inside one of the host-network pods, I found that queries to 10.43.0.1 resolved fine for cluster DNS, as did querying 127.0.0.1.

EDIT: Sorry, I had some config here to try to get all names to resolve, but queries seemed to go only to the first nameserver. Specifying both didn't have the desired effect.

@stale

stale bot commented Jul 30, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jul 30, 2021
@stale stale bot closed this as completed Aug 13, 2021
@pchang388

And I've resolved my issue.

Check your firewalls -- make sure that your nodes can communicate with each other on UDP port 8472 (assuming you're using the default VXLAN backend for Flannel).

@akhenakh this could explain why the host-gw backend was working and VXLAN was not, in your case.

I know this issue is long closed/stale, but I just wanted to comment that this still works for me. The default flannel mode is vxlan for k3s installations, and opening port 8472 on the Ubuntu hosts worked. This port needs to be open for multi-node clusters; I did not notice the problem when using a single node.

Still works as of version v1.25.5+k3s2.
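On Ubuntu with ufw, that can look roughly like this (a sketch; 192.168.1.0/24 stands in for your node subnet):

# Allow VXLAN traffic from the other nodes (placeholder subnet)
sudo ufw allow from 192.168.1.0/24 to any port 8472 proto udp
sudo ufw reload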
