DNS resolution issues when deploying workloads to k3s #4486

Closed
1 task done
ajvn opened this issue Nov 13, 2021 · 13 comments

@ajvn

ajvn commented Nov 13, 2021

Environmental Info:
K3s Version:

k3s version v1.22.3+k3s1 (61a2aab2)
go version go1.16.8

This happens with 1.21.4, 1.21.5, and 1.21.6 as well, across RCs; I haven't checked other versions.

Node(s) CPU architecture, OS, and Version:

Linux RPI4-2 5.11.0-1021-raspi #22-Ubuntu SMP PREEMPT Wed Oct 6 17:30:38 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
NAME="Ubuntu"
VERSION="21.04 (Hirsute Hippo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 21.04"
VERSION_ID="21.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=hirsute
UBUNTU_CODENAME=hirsute

Cluster Configuration:

Raspberry Pi 4s, 8 GB and 4 GB versions.

NAME     STATUS   ROLES                  AGE   VERSION
rpi4-0   Ready    control-plane,master   20d   v1.22.3+k3s1
rpi4-2   Ready    agent                  20d   v1.22.3+k3s1
rpi4-1   Ready    agent                  20d   v1.22.3+k3s1

Contents of /etc/hosts:

# Localhost block
127.0.0.1 localhost
192.168.0.150 rpi4-0.localhost.localdomain
192.168.0.151 rpi4-1.localhost.localdomain
192.168.0.152 rpi4-2.localhost.localdomain
192.168.0.200 rpi4-nfs.localhost.localdomain

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Describe the bug:

The problem occurs while trying to deploy Pi-hole to the cluster: the nodes become unable to resolve the public hostnames of the various container image registries.
I've only tried the Pi-hole deployment, but I assume any other would fail with the same issue.

20s         Warning   Failed  pod/pihole-78d8dbbb75-5rltq      Failed to pull image "pihole/pihole:2021.10.1": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/pihole/pihole:2021.10.1": failed to resolve reference "docker.io/pihole/pihole:2021.10.1": failed to do request: Head "https://registry-1.docker.io/v2/pihole/pihole/manifests/2021.10.1": dial tcp: lookup registry-1.docker.io: Try again
11s         Warning   Failed  pod/svclb-pihole-dns-tcp-hkmhc   Failed to pull image "rancher/klipper-lb:v0.3.4": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/klipper-lb:v0.3.4": failed to resolve reference "docker.io/rancher/klipper-lb:v0.3.4": failed to do request: Head "https://registry-1.docker.io/v2/rancher/klipper-lb/manifests/v0.3.4": dial tcp: lookup registry-1.docker.io: Try again

Here's a more visual way to observe the issue across the nodes. This is what happens when I ping Google before and during the deployment:

╰─➤  ping google.com
PING google.com (142.250.186.174) 56(84) bytes of data.
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=1 ttl=115 time=34.9 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=2 ttl=115 time=28.4 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=3 ttl=115 time=27.5 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=4 ttl=115 time=24.7 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=5 ttl=115 time=24.7 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=6 ttl=115 time=29.0 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=7 ttl=115 time=22.3 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=8 ttl=115 time=22.0 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=9 ttl=115 time=23.0 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=10 ttl=115 time=22.6 ms
64 bytes from 142.250.186.174: icmp_seq=11 ttl=115 time=23.1 ms
64 bytes from 142.250.186.174: icmp_seq=12 ttl=115 time=22.2 ms
64 bytes from 142.250.186.174: icmp_seq=13 ttl=115 time=24.6 ms
64 bytes from 142.250.186.174: icmp_seq=14 ttl=115 time=22.0 ms
64 bytes from 142.250.186.174: icmp_seq=15 ttl=115 time=23.7 ms
64 bytes from 142.250.186.174: icmp_seq=16 ttl=115 time=22.2 ms
64 bytes from 142.250.186.174: icmp_seq=17 ttl=115 time=26.1 ms
64 bytes from 142.250.186.174: icmp_seq=18 ttl=115 time=24.6 ms
64 bytes from 142.250.186.174: icmp_seq=19 ttl=115 time=24.2 ms
64 bytes from 142.250.186.174: icmp_seq=20 ttl=115 time=25.2 ms
^C64 bytes from 142.250.186.174: icmp_seq=21 ttl=115 time=27.5 ms

--- google.com ping statistics ---
21 packets transmitted, 21 received, 0% packet loss, time 60193ms
rtt min/avg/max/mdev = 22.029/24.981/34.886/3.057 ms

After the deployment is removed, DNS starts working normally again. While the deployment is in progress, DNS is completely broken across the nodes.

Before deployment/after deployment removal:

telnet registry-1.docker.io 80
Trying 52.204.76.244...
Connected to registry-1.docker.io.
Escape character is '^]'.

Steps To Reproduce:

I'm using Ansible to set up the cluster:

  • Task for preparing the master node:
---
- name: Install sqlite3 to enable K3S state backups
  apt:
    name: sqlite3
    state: present

- name: Create Rancher configuration directory
  ansible.builtin.file:
    path: /etc/rancher/k3s
    state: directory
    mode: '0755'

- name: Upload server configuration file
  ansible.builtin.copy:
    src: ../extras/server-config.yaml
    dest: /etc/rancher/k3s/config.yaml
    owner: root
    group: root
    mode: '0400'

- name: Ensure agent-token value is present in config file
  ansible.builtin.lineinfile:
    path: /etc/rancher/k3s/config.yaml
    line: 'agent-token: {{ agent_token }}'
  no_log: True

- name: Upload systemd service file
  ansible.builtin.copy:
    src: ../extras/k3s-server.service
    dest: /etc/systemd/system/k3s.service
    owner: root
    group: root
    mode: '0644'

- name: Setup systemd service
  ansible.builtin.systemd:
    name: k3s.service
    state: started
    enabled: yes
    daemon_reload: yes
  • Task for preparing agent nodes:
---
- name: Create Rancher configuration directory
  ansible.builtin.file:
    path: /etc/rancher/k3s
    state: directory
    mode: '0755'

- name: Upload agent configuration file
  ansible.builtin.copy:
    src: ../extras/agent-config.yaml
    dest: /etc/rancher/k3s/config.yaml
    owner: root
    group: root
    mode: '0400'

- name: Ensure agent-token value is present in config file
  ansible.builtin.lineinfile:
    path: /etc/rancher/k3s/config.yaml
    line: 'token: {{ agent_token }}'
  no_log: True

- name: Upload systemd service file
  ansible.builtin.copy:
    src: ../extras/k3s-agent.service
    dest: /etc/systemd/system/k3s.service
    owner: root
    group: root
    mode: '0644'

- name: Setup systemd service
  ansible.builtin.systemd:
    name: k3s.service
    state: started
    enabled: yes
    daemon_reload: yes
  • Task for preparing all of the RPIs:
---
- name: Upload hosts file
  ansible.builtin.copy:
    src: ../extras/hosts
    dest: /etc/hosts
    owner: root
    group: root
    mode: '0644'

- name: Download k3s binary
  get_url:
    url: '{{ k3s_download_url }}'
    dest: /usr/local/bin/k3s
    checksum: '{{ k3s_download_checksum }}'
    mode: '0744'

- name: Install nfs-common package
  apt:
    name: nfs-common
    state: present
  • Master config:
datastore-endpoint: "sqlite"
disable:
  - "local-storage"
write-kubeconfig-mode: "0600"
node-label:
  - "node=admin"
agent-token: "<token>" #This is being provided during the playbook run, from the vault.
  • Agent config:
server: "https://rpi4-0.localhost.localdomain:6443"
node-label:
  - "node=agent"
token: "<token>" #This is being provided during the playbook run, from the vault.
  • Master systemd service:
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStart=/usr/local/bin/k3s server --config /etc/rancher/k3s/config.yaml
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
  • Agent systemd service:
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
ExecStart=/usr/local/bin/k3s agent --config /etc/rancher/k3s/config.yaml
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

Expected behavior:

Being able to deploy workloads with working DNS.

Actual behavior:

Not being able to deploy workloads because DNS is broken.

Additional context / logs:

Some logs from the k3s master node:

Nov 13 07:28:05 RPI4-0 k3s[2133853]: time="2021-11-13T07:28:05Z" level=info msg="Handling backend connection request [rpi4-1]"
Nov 13 07:28:06 RPI4-0 k3s[2133853]: I1113 07:28:06.595401 2133853 kubelet_volumes.go:160] "Cleaned up orphaned pod volumes dir" podUID=0883c307-a89d-4177-bbf5-6c6eafc4afe9 path="/var/lib/kubelet/pods/0883c307-a89d-4177-bbf5-6c6eafc4afe9/volumes"
Nov 13 07:28:12 RPI4-0 k3s[2133853]: I1113 07:28:12.648979 2133853 job_controller.go:406] enqueueing job kube-system/helm-install-traefik
Nov 13 07:28:13 RPI4-0 k3s[2133853]: I1113 07:28:13.861804 2133853 job_controller.go:406] enqueueing job kube-system/helm-install-traefik-crd
Nov 13 07:28:16 RPI4-0 k3s[2133853]: I1113 07:28:16.460564 2133853 event.go:291] "Event occurred" object="kube-system/traefik-97b44b794-4bbcr" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Cancelling deletion of Pod kube-system/traefik-97b44b794-4bbcr"
Nov 13 07:28:16 RPI4-0 k3s[2133853]: I1113 07:28:16.460736 2133853 event.go:291] "Event occurred" object="kube-system/helm-install-traefik--1-wcff5" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Cancelling deletion of Pod kube-system/helm-install-traefik--1-wcff5"
Nov 13 07:28:16 RPI4-0 k3s[2133853]: I1113 07:28:16.460803 2133853 event.go:291] "Event occurred" object="kube-system/helm-install-traefik-crd--1-mrqwq" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Cancelling deletion of Pod kube-system/helm-install-traefik-crd--1-mrqwq"
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.745224 2133853 remote_image.go:114] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"docker.io/rancher/klipper-lb:v0.3.4\": failed to copy: httpReadSeeker: failed open: failed to do request: Get \"https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/62/625882d9991e41d98c4b3d51384d1c8dd99cc36246a81cbfbbdadb9b7828ff3f/data?verify=1636791499-P%2FHfvq3iYd335r0eMn%2FIMbzkn88%3D\": dial tcp: lookup production.cloudflare.docker.com: Try again" image="rancher/klipper-lb:v0.3.4"
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.745503 2133853 kuberuntime_image.go:51] "Failed to pull image" err="rpc error: code = Unknown desc = failed to pull and unpack image \"docker.io/rancher/klipper-lb:v0.3.4\": failed to copy: httpReadSeeker: failed open: failed to do request: Get \"https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/62/625882d9991e41d98c4b3d51384d1c8dd99cc36246a81cbfbbdadb9b7828ff3f/data?verify=1636791499-P%2FHfvq3iYd335r0eMn%2FIMbzkn88%3D\": dial tcp: lookup production.cloudflare.docker.com: Try again" image="rancher/klipper-lb:v0.3.4"
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.745926 2133853 kuberuntime_manager.go:898] container &Container{Name:lb-port-80,Image:rancher/klipper-lb:v0.3.4,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:lb-port-80,HostPort:80,ContainerPort:80,Protocol:TCP,HostIP:,},},Env:[]EnvVar{EnvVar{Name:SRC_PORT,Value:80,ValueFrom:nil,},EnvVar{Name:DEST_PROTO,Value:TCP,ValueFrom:nil,},EnvVar{Name:DEST_PORT,Value:80,ValueFrom:nil,},EnvVar{Name:DEST_IPS,Value:10.43.234.3,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_ADMIN],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod svclb-traefik-7trbj_kube-system(88ce2f96-0d05-4ea1-8c08-caab28afc45d): ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/klipper-lb:v0.3.4": failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/62/625882d9991e41d98c4b3d51384d1c8dd99cc36246a81cbfbbdadb9b7828ff3f/data?verify=1636791499-P%2FHfvq3iYd335r0eMn%2FIMbzkn88%3D": dial tcp: lookup production.cloudflare.docker.com: Try again
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.748964 2133853 pod_workers.go:836] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"lb-port-80\" with ErrImagePull: \"rpc error: code = Unknown desc = failed to pull and unpack image \\\"docker.io/rancher/klipper-lb:v0.3.4\\\": failed to copy: httpReadSeeker: failed open: failed to do request: Get \\\"https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/62/625882d9991e41d98c4b3d51384d1c8dd99cc36246a81cbfbbdadb9b7828ff3f/data?verify=1636791499-P%2FHfvq3iYd335r0eMn%2FIMbzkn88%3D\\\": dial tcp: lookup production.cloudflare.docker.com: Try again\", failed to \"StartContainer\" for \"lb-port-443\" with ImagePullBackOff: \"Back-off pulling image \\\"rancher/klipper-lb:v0.3.4\\\"\"]" pod="kube-system/svclb-traefik-7trbj" podUID=88ce2f96-0d05-4ea1-8c08-caab28afc45d
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.925672 2133853 pod_workers.go:836] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"lb-port-80\" with ImagePullBackOff: \"Back-off pulling image \\\"rancher/klipper-lb:v0.3.4\\\"\", failed to \"StartContainer\" for \"lb-port-443\" with ImagePullBackOff: \"Back-off pulling image \\\"rancher/klipper-lb:v0.3.4\\\"\"]" pod="kube-system/svclb-traefik-7trbj" podUID=88ce2f96-0d05-4ea1-8c08-caab28afc45d
Nov 13 07:28:31 RPI4-0 k3s[2133853]: E1113 07:28:31.976407 2133853 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: an error on the server ("Internal Server Error: \"/apis/metrics.k8s.io/v1beta1?timeout=32s\": the server could not find the requested resource") has prevented the request from succeeding
Nov 13 07:28:32 RPI4-0 k3s[2133853]: W1113 07:28:32.043970 2133853 garbagecollector.go:703] failed to discover some groups: map[metrics.k8s.io/v1beta1:an error on the server ("Internal Server Error: \"/apis/metrics.k8s.io/v1beta1?timeout=32s\": the server could not find the requested resource") has prevented the request from succeeding]
Nov 13 07:28:39 RPI4-0 k3s[2133853]: I1113 07:28:39.187155 2133853 event.go:291] "Event occurred" object="kube-system/kube-dns" kind="Endpoints" apiVersion="v1" type="Warning" reason="FailedToUpdateEndpoint" message="Failed to update endpoint kube-system/kube-dns: Operation cannot be fulfilled on endpoints \"kube-dns\": the object has been modified; please apply your changes to the latest version and try again"
Nov 13 07:28:39 RPI4-0 k3s[2133853]: I1113 07:28:39.943575 2133853 event.go:291] "Event occurred" object="pihole/pihole" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"cluster.local/nfs-subdir-external-provisioner\" or manually created by system administrator"
Nov 13 07:28:39 RPI4-0 k3s[2133853]: I1113 07:28:39.945143 2133853 event.go:291] "Event occurred" object="pihole/pihole" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"cluster.local/nfs-subdir-external-provisioner\" or manually created by system administrator"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.170912 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-tcp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-tcp-xjfkp"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.246659 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-tcp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-tcp-vrpqm"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.264243 2133853 controller.go:611] quota admission added evaluator for: ingresses.networking.k8s.io
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.287714 2133853 event.go:291] "Event occurred" object="kube-system/metrics-server" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled down replica set metrics-server-86cbb8457f to 0"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.308915 2133853 event.go:291] "Event occurred" object="pihole/pihole" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set pihole-78d8dbbb75 to 1"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.327121 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-udp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-udp-g7wzl"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.327216 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-tcp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-tcp-hkmhc"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.336341 2133853 topology_manager.go:200] "Topology Admit Handler"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: W1113 07:28:40.390339 2133853 container.go:586] Failed to update stats for container "/kubepods/besteffort/pod719e5583-5ef1-4fa8-b943-d1aa325d941c": /sys/fs/cgroup/cpuset/kubepods/besteffort/pod719e5583-5ef1-4fa8-b943-d1aa325d941c/cpuset.mems found to be empty, continuing to push stats
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.408076 2133853 event.go:291] "Event occurred" object="kube-system/metrics-server-86cbb8457f" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulDelete" message="Deleted pod: metrics-server-86cbb8457f-94rsl"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.681743 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-udp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-udp-ttprg"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: E1113 07:28:40.745346 2133853 available_controller.go:524] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.17.169:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.17.169:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.43.17.169:443: connect: connection refused
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.802135 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-udp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-udp-ljd2l"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.863705 2133853 topology_manager.go:200] "Topology Admit Handler"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.902157 2133853 trace.go:205] Trace[1302298980]: "GuaranteedUpdate etcd3" type:*core.Pod (13-Nov-2021 07:28:40.378) (total time: 523ms):
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[1302298980]: ---"Transaction committed" 523ms (07:28:40.901)
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[1302298980]: [523.831973ms] [523.831973ms] END
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.903379 2133853 trace.go:205] Trace[698766046]: "Create" url:/api/v1/namespaces/pihole/pods/svclb-pihole-dns-udp-g7wzl/binding,user-agent:k3s/v1.22.3+k3s1 (linux/arm64) kubernetes/61a2aab/scheduler,audit-id:e4ab70b1-045e-4bcd-8a63-9af39c2fced8,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (13-Nov-2021 07:28:40.377) (total time: 526ms):
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[698766046]: ---"Object stored in database" 524ms (07:28:40.902)
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[698766046]: [526.101945ms] [526.101945ms] END
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.936346 2133853 trace.go:205] Trace[197358589]: "Create" url:/apis/events.k8s.io/v1/namespaces/pihole/events,user-agent:k3s/v1.22.3+k3s1 (linux/arm64) kubernetes/61a2aab/scheduler,audit-id:f8d49856-0dea-4ec7-aa8f-ec5938c8809f,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (13-Nov-2021 07:28:40.342) (total time: 593ms):
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[197358589]: ---"Object stored in database" 593ms (07:28:40.935)
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[197358589]: [593.793712ms] [593.793712ms] END
Nov 13 07:28:41 RPI4-0 k3s[2133853]: I1113 07:28:40.960095 2133853 trace.go:205] Trace[1547416]: "Create" url:/api/v1/namespaces/pihole/pods,user-agent:k3s/v1.22.3+k3s1 (linux/arm64) kubernetes/61a2aab/system:serviceaccount:kube-system:replicaset-controller,audit-id:ccab7655-e9c7-4384-8a5a-5e81bb653eaa,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (13-Nov-2021 07:28:40.383) (total time: 576ms):
Nov 13 07:28:41 RPI4-0 k3s[2133853]: Trace[1547416]: ---"Object stored in database" 563ms (07:28:40.947)
Nov 13 07:28:41 RPI4-0 k3s[2133853]: Trace[1547416]: [576.974211ms] [576.974211ms] END

Backporting

  • Needs backporting to older releases (if it's indeed a K3s issue, and not an issue with my setup)
@manuelbuil
Contributor

I have a few questions.

1 - Where are you trying to resolve hostnames: on the node, or inside a pod? Do both fail when you experience the problem?

2 - The problem happens when you deploy the pihole deployment. Does it happen with other deployments too?

3 - Can you please share how you are installing the pihole deployment?

@ajvn
Author

ajvn commented Nov 15, 2021

1 - On the node(s).
2 - It happens with anything that tries to pull new images from public repositories; you can see in the logs above that it also happens when pulling the rancher/klipper-lb:v0.3.4 image.
3 - Using this Helm chart: https://github.com/MoJo2600/pihole-kubernetes/tree/master/charts/pihole
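
For context, here is a minimal sketch of one way such a chart can be installed on k3s, using the HelmChart manifests that k3s's bundled Helm controller picks up from /var/lib/rancher/k3s/server/manifests/ on the server node. The chart repository URL and the values shown are assumptions for illustration, not necessarily how the reporter installed it:

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: pihole
  namespace: kube-system
spec:
  # Assumed chart repository for the MoJo2600 pihole chart.
  repo: https://mojo2600.github.io/pihole-kubernetes/
  chart: pihole
  # Target namespace is assumed to already exist (it matches the "pihole"
  # namespace seen in the logs above).
  targetNamespace: pihole
  valuesContent: |-
    # Illustrative values only; the DNS-related values the reporter
    # actually used appear later in this thread.
    podDnsConfig:
      enabled: true
      policy: "ClusterFirstWithHostNet"
      nameservers:
        - 127.0.0.1
        - 208.67.222.222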

@manuelbuil
Contributor

2 - It happens with anything that tries to pull new images from public repositories; you can see in the logs above that it also happens when pulling the rancher/klipper-lb:v0.3.4 image.

I think I did not explain myself correctly, let me clarify :). You mentioned "After the deployment is removed, DNS starts working normally again." By "deployment", I understand you mean the pihole deployment. My question is: what happens if you don't deploy pihole and deploy something else instead, is DNS broken too? Or does that happen only when deploying pihole?

@ajvn
Author

ajvn commented Nov 15, 2021

It used to happen with any deployment on 1.21.x, but I haven't tested anything besides pihole on 1.22.x. Let me do that after I'm done with work, and then I'll report back.

Thanks for taking the time to address this issue.

@dhermanns

Same problem here. It worked just fine yesterday. I tried older chart versions down to 2.5.1 with no success.
Could it just be hitting Docker pull quotas today?

@manuelbuil
Contributor

Same problem here. It worked just fine yesterday. I tried older chart versions down to 2.5.1 with no success. Could it just be hitting Docker pull quotas today?

Also with pihole?

@dhermanns

Yes - but I've drilled it down now. In my case it was a simple DNS resolution issue that I solved by fixing the nameserver in /etc/resolv.conf.

@ajvn
Author

ajvn commented Nov 30, 2021

@manuelbuil Apologies for the delay; it's been a busy couple of weeks. I've just tried Gogs, and downloading images works properly, so it seems to be Pi-hole related. Let me know if you want to investigate this further; if not, I'll close the issue and keep investigating on the Pi-hole side.

Thank you.

@manuelbuil
Contributor

If it's related to Pihole, I'd prefer to close the issue to avoid confusion :)

@ajvn
Author

ajvn commented Nov 30, 2021

Will do. I'll post a solution here if I manage to figure it out.

@ajvn ajvn closed this as completed Nov 30, 2021
@ajvn
Author

ajvn commented Dec 21, 2021

Confirming that it was indeed not an issue with K3s, but rather with the Pi-hole installation; I managed to get it working a couple of days ago. If anyone is facing a similar issue, feel free to reach out, as this is probably not the right place for that kind of help and the solution is a bit complicated.

@diogosilva30

@ajvn can you explain your solution? I'm struggling with the same issue. Sorry for commenting here; I reckon this is not the appropriate place, but I couldn't find any contact information on your GitHub profile.

diogosilva30 added a commit to diogosilva30/k3s.dsilva.dev that referenced this issue May 20, 2023
- When deploying pihole on port 53 of the Kubernetes cluster, the cluster would fail on any type of request / DNS lookup. It turns out the VM's configured DNS was "127.0.0.53" (a local target), instead of an upstream DNS server like Cloudflare or Google.

Refs: k3s-io/k3s#4486 MoJo2600/pihole-kubernetes#88
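
For anyone landing here with the same symptom, a minimal sketch of pointing a node at an upstream resolver instead of the local 127.0.0.53 stub, assuming the node uses netplan (the Ubuntu default); the file name, interface name, and resolver addresses below are placeholders, not taken from this issue:

# Hypothetical /etc/netplan/99-upstream-dns.yaml
network:
  version: 2
  ethernets:
    eth0:                            # replace with the node's actual interface
      dhcp4: true
      dhcp4-overrides:
        use-dns: false               # ignore DNS servers handed out by DHCP
      nameservers:
        addresses: [1.1.1.1, 8.8.8.8]   # any reachable upstream resolvers

After running netplan apply, resolvectl status should show the configured upstream servers being used for that interface rather than only the local stub.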
@ajvn
Author

ajvn commented May 20, 2023

@diogosilva30 Hello, unfortunately I don't recall what the fix was, and my pihole Git history starts 2 days after my comment stating I've found the solution.

If you'd like, we can continue in the project you referenced in this issue: open an issue there and tag me, and maybe we can compare your setup to mine and reverse-engineer the differences.

P.S.
One thing from my values.yaml that seems relevant to this (using the MoJo2600 chart):

podDnsConfig:
  enabled: true
  policy: "ClusterFirstWithHostNet"
  nameservers:
    - 127.0.0.1
    - 208.67.222.222 # OpenDNS public nameserver
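
Presumably the intent of that list is for the pod to resolve through the Pi-hole instance listening on the node's loopback (127.0.0.1) once it is running, with OpenDNS as an external fallback so lookups still work before Pi-hole is up; that reading is an inference from the snippet, not something confirmed in the thread.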
