Update from RKE v1.2.11 to v1.3.1 leads to unreachable Rancher UI #2702

Closed

kagehisa opened this issue Sep 22, 2021 · 22 comments
@kagehisa

I recently tried to update to Rancher version 2.6, and since I was at it I also updated RKE from version 1.2.11 to version 1.3.1. I'm running a single-node setup on a bare-metal server with self-signed certificates behind a layer 4 load balancer (a multi-node setup is planned for the future). After the update I read in this documentation for the configuration of the network options that I need to add network_mode: none to my ingress configuration since there was a change in Kubernetes. I did this as seen in my cluster configuration below. Nevertheless, after deploying the cluster, creating the namespace, adding the certificates and deploying Rancher as shown in the steps below, I'm unable to curl or browse the Rancher UI, even though kubectl -n cattle-system rollout status deploy/rancher tells me the deployment was successful and kubectl -n cattle-system get pods shows that it is running.
I tried to gather as much information as possible, including the ingress logs, the Rancher logs and the log output from my load balancer when trying to curl the Rancher server.
All I can see is that there seems to be no reachable endpoint. The configuration (without the ingress addition) worked fine for RKE 1.2.11 and Rancher 2.5. I have no idea where I should look next or what might be the cause of the issue, so any help in that direction is much appreciated.

RKE version:

v1.3.1

Docker version: (docker version)

Client:
 Version:           20.10.6-ce
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        8728dd246c3a
 Built:             Thu Apr 15 12:00:00 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.6-ce
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd246c3a
  Built:            Thu Apr 15 12:00:00 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.1.5_catatonit
  GitCommit:

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
os-release:

NAME="SLES"
VERSION="15-SP3"
VERSION_ID="15.3"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP3"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp3"
DOCUMENTATION_URL="https://documentation.suse.com/"

uname -r:
5.3.18-59.19-default

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Bare-metal

cluster.yml file:

nodes:
  - address: lff-l3vsrv103.internal.de
    user: cowherder
    role:
      - controlplane
      - worker
      - etcd

services:
  etcd:
    backup_config:
      interval_hours: 12
      retention: 6

# Cluster level SSH private key
# Used if no ssh information is set for the node
ssh_key_path: /home/cowherder/.ssh/rancher_admin_id_rsa

# Set the name of the Kubernetes cluster
cluster_name: lff_adm_cluster

prefix_path: /opt/rke

ignore_docker_version: false

ingress:
  provider: nginx
  network_mode: none
  options:
    use-proxy-protocol: true
  extra_args:
    http-port: 80
    https-port: 443

Steps to Reproduce:

  • use rke to deploy cluster
  • kubectl create namespace cattle-system
  • kubectl -n cattle-system create secret tls tls-rancher-ingress --cert=/opt/rancher_inst/ssl_rancheradm.internal.de/tls.crt --key=/opt/rancher_inst/ssl_rancheradm.internal.de/tls.key
  • kubectl -n cattle-system create secret generic tls-ca --from-file=cacerts.pem=/opt/rancher_inst/lff_root_ca/cacerts.pem
  • helm install rancher rancher-latest/rancher --namespace cattle-system --set hostname=rancheradm.internal.de --set ingress.tls.source=secret --set privateCA=true --set bootstrapPassword=startPass --set replicas=3 --set proxy=http://www.proxy.internal.de:80 --set noProxy=127.0.0.1\\,localhost\\,0.0.0.0\\,10.0.0.0/8\\,cattle-system.svc\\,.svc\\,.cluster.local\\,.internal.de
  • Wait for rancher to be deployed
  • curl -k -vvv https://rancheradm.internal.de
    Results:
* Uses proxy env variable no_proxy == '127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de,10.48.90.148,10.48.90.144,10.48.90.145,10.48.90.137'
*   Trying 10.48.90.137:443...
* TCP_NODELAY set
* Connected to rancheradm.internal.de (10.48.90.137) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to rancheradm.internal.de:443
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to rancheradm.internal.de:443

Ingress log

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       nginx-0.48.1-rancher1
  Build:         git-4bf680f6c
  Repository:    https://github.com/rancher/ingress-nginx.git
  nginx version: nginx/1.20.1

-------------------------------------------------------------------------------

I0922 07:13:34.857467       7 flags.go:211] "Watching for Ingress" class="nginx"
W0922 07:13:34.857632       7 flags.go:216] Ingresses with an empty class will also be processed by this Ingress controller
W0922 07:13:34.858709       7 client_config.go:614] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0922 07:13:34.858966       7 main.go:241] "Creating API client" host="https://10.43.0.1:443"
I0922 07:13:34.871961       7 main.go:285] "Running in Kubernetes cluster" major="1" minor="21" git="v1.21.5" state="clean" commit="aea7bbadd2fc0cd689de94a54e5b7b758869d691" platform="linux/amd64"
I0922 07:13:35.540937       7 main.go:105] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"
I0922 07:13:35.543437       7 main.go:115] "Enabling new Ingress features available since Kubernetes v1.18"
W0922 07:13:35.545694       7 main.go:127] No IngressClass resource with name nginx found. Only annotation will be used.
I0922 07:13:35.642104       7 ssl.go:532] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"
I0922 07:13:35.689372       7 nginx.go:254] "Starting NGINX Ingress controller"
I0922 07:13:35.740182       7 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"nginx-configuration", UID:"f9766ac2-382b-4116-820f-993befcdb30b", APIVersion:"v1", ResourceVersion:"721", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/nginx-configuration
I0922 07:13:36.889935       7 nginx.go:296] "Starting NGINX process"
I0922 07:13:36.890275       7 nginx.go:316] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I0922 07:13:36.890337       7 leaderelection.go:243] attempting to acquire leader lease ingress-nginx/ingress-controller-leader-nginx...
I0922 07:13:36.891207       7 controller.go:148] "Configuration changes detected, backend reload required"
I0922 07:13:36.905403       7 leaderelection.go:253] successfully acquired lease ingress-nginx/ingress-controller-leader-nginx
I0922 07:13:36.905746       7 status.go:84] "New leader elected" identity="nginx-ingress-controller-ldjzf"
I0922 07:13:37.065143       7 status.go:204] "POD is not ready" pod="ingress-nginx/nginx-ingress-controller-ldjzf" node="lff-l3vsrv103.bfd.bayern.de"
I0922 07:13:37.141787       7 controller.go:165] "Backend successfully reloaded"
I0922 07:13:37.142037       7 controller.go:176] "Initial sync, sleeping for 1 second"
I0922 07:13:37.143115       7 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"nginx-ingress-controller-ldjzf", UID:"5fc10fe4-9be2-4605-b710-70e969fdaac5", APIVersion:"v1", ResourceVersion:"811", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration

W0922 07:19:47.329540       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
W0922 07:19:47.329597       7 controller.go:1207] Error getting SSL certificate "cattle-system/tls-rancher-ingress": local SSL certificate cattle-system/tls-rancher-ingress was not found. Using default certificate
I0922 07:19:47.398349       7 main.go:112] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I0922 07:19:47.425950       7 backend_ssl.go:66] "Adding secret to local store" name="cattle-system/tls-rancher-ingress"
I0922 07:19:47.426203       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"d6e5e5ca-9a99-41df-b4cf-36c6db250461", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"1370", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W0922 07:19:50.578546       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0922 07:19:50.579051       7 controller.go:148] "Configuration changes detected, backend reload required"
I0922 07:19:50.771484       7 controller.go:165] "Backend successfully reloaded"
I0922 07:19:50.776575       7 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"nginx-ingress-controller-ldjzf", UID:"5fc10fe4-9be2-4605-b710-70e969fdaac5", APIVersion:"v1", ResourceVersion:"811", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0922 07:19:53.911917       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0922 07:20:37.199375       7 status.go:284] "updating Ingress status" namespace="cattle-system" ingress="rancher" currentValue=[] newValue=[{IP: Hostname:lff-l3vsrv103.bfd.bayern.de Ports:[]}]
W0922 07:20:37.292629       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0922 07:20:37.292722       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"d6e5e5ca-9a99-41df-b4cf-36c6db250461", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"1902", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0922 07:22:26.280388       7 main.go:112] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I0922 07:22:26.542352       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"d6e5e5ca-9a99-41df-b4cf-36c6db250461", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"2901", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0922 07:22:47.285329       7 store.go:453] "secret was updated and it is used in ingress annotations. Parsing" secret="cattle-system/tls-rancher-ingress"

Log output from kubectl -n cattle-system logs -f rancher-75b8bc6df6-k8vcs
rancher_log.txt

Log output from nginx load balancer

2021/09/22 08:26:11 [notice] 38139#38139: built by gcc 7.5.0 (SUSE Linux)
2021/09/22 08:26:11 [notice] 38139#38139: OS: Linux 5.3.18-59.19-default
2021/09/22 08:26:11 [notice] 38139#38139: getrlimit(RLIMIT_NOFILE): 1024:524288
2021/09/22 08:26:11 [notice] 38139#38139: start worker processes
2021/09/22 08:26:11 [notice] 38139#38139: start worker process 38150
2021/09/22 08:26:11 [notice] 38139#38139: start worker process 38151
2021/09/22 08:26:11 [notice] 38139#38139: start worker process 38152
2021/09/22 08:26:11 [notice] 38139#38139: start worker process 38153
2021/09/22 08:27:22 [info] 38150#38150: *1 client 10.48.90.148:42518 connected to 0.0.0.0:443
2021/09/22 08:27:22 [error] 38150#38150: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.148, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/22 08:56:53 [info] 38150#38150: *3 client 10.48.90.137:51176 connected to 0.0.0.0:443
2021/09/22 08:56:53 [error] 38150#38150: *3 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/22 08:58:13 [info] 38150#38150: *5 client 10.48.90.137:51182 connected to 0.0.0.0:443
2021/09/22 08:58:13 [error] 38150#38150: *5 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/22 09:04:43 [info] 38150#38150: *7 client 10.48.90.137:51186 connected to 0.0.0.0:443
2021/09/22 09:04:43 [error] 38150#38150: *7 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/22 09:05:12 [info] 38150#38150: *9 client 10.48.90.137:51192 connected to 0.0.0.0:443
2021/09/22 09:05:12 [error] 38150#38150: *9 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/22 09:05:28 [info] 38150#38150: *11 client 10.48.90.137:55166 connected to 0.0.0.0:80
2021/09/22 09:05:28 [error] 38150#38150: *11 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:80, upstream: "10.48.90.148:80", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/22 09:32:25 [info] 38150#38150: *13 client 10.48.90.137:51206 connected to 0.0.0.0:443
2021/09/22 09:32:25 [error] 38150#38150: *13 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/22 10:29:19 [info] 38150#38150: *15 client 10.48.25.34:56601 connected to 0.0.0.0:443
2021/09/22 10:29:19 [error] 38150#38150: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.25.34, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
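
For completeness, a minimal sketch of the layer-4 NGINX configuration sitting in front of the node (this is not the exact file used here; the node IP is taken from the logs above, and proxy_protocol is assumed to be on to match the use-proxy-protocol: true ingress option):

stream {
  upstream rancher_https {
    server 10.48.90.148:443;   # the single RKE node
  }
  upstream rancher_http {
    server 10.48.90.148:80;
  }
  server {
    listen 443;
    proxy_pass rancher_https;
    proxy_protocol on;         # matches use-proxy-protocol: true on the ingress
  }
  server {
    listen 80;
    proxy_pass rancher_http;
    proxy_protocol on;
  }
}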
@superseb
Contributor

You are talking about an upgrade, but the steps to reproduce show a clean install. I am more interested in this part: "After the update I read in this Documentation for the configuration of the network options that I need to add network_mode: none to my ingress configuration since there was a change in Kubernetes." Please describe why you added this, as I don't see a relation between your setup and the new version that would require it. This also helps me understand what you are trying to achieve, as what you are seeing seems perfectly fine for what you have configured; based on what you are trying to achieve I can make better suggestions.

@kagehisa
Author

What I want to achieve is a Rancher installation that is reachable via my load balancer, as it was with the previous RKE version and Rancher 2.5, and if that works, to then switch to a multi-node setup. Sorry if I didn't describe my case clearly. I added the ingress network part because of this text in the linked documentation:

For Kubernetes v1.21 and up, the NGINX ingress controller no longer runs in hostNetwork: true but uses hostPorts for port 80 and port 443. This was done so the admission webhook can be configured to be accessed using ClusterIP so it can only be reached inside the cluster. If you want to change the mode and/or the ports, see the options below.

I first tried it without this addition, but that didn't work either (it always ended in a 502 Bad Gateway error when I curled the Rancher server over the load balancer). So after some searching I found the mentioned documentation and added the ingress network part.
If this is not needed I will remove it, but that doesn't really change the result.
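
For reference, the change described in the quoted documentation boils down to how the controller publishes ports 80/443 on the node. Simplified sketch (not the actual controller manifest, just the relevant pod spec fields):

# Kubernetes <= 1.20: the controller pod shared the host network
spec:
  hostNetwork: true

# Kubernetes >= 1.21 default: only ports 80/443 are mapped onto the host
spec:
  containers:
    - name: controller
      ports:
        - containerPort: 80
          hostPort: 80
        - containerPort: 443
          hostPort: 443

# With network_mode: none, neither of the above applies, so nothing on the
# node itself answers on 80/443 and the controller is only reachable through
# a Service that has to be created separately.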

@superseb
Contributor

There is no change needed; your change made the ingress controller no longer exposed to the outside, which causes the connection refused logging in your NGINX load balancer. We will need the logs from the setup without this setting, the outputs from https://rancher.com/docs/rancher/v2.6/en/troubleshooting/rancherha/, and the output from the curl commands.

@kagehisa
Author

kagehisa commented Sep 22, 2021

Ok, I redeployed the Cluster without the changes. Here are the requested logs and outputs.

Current cluster.conf

nodes:
  - address: lff-l3vsrv103.internal.de
    user: cowherder
    role:
      - controlplane
      - worker
      - etcd
        
services:
  etcd:
    backup_config:
      interval_hours: 12
      retention: 6

# Cluster level SSH private key
# Used if no ssh information is set for the node
ssh_key_path: /home/cowherder/.ssh/rancher_admin_id_rsa

# Set the name of the Kubernetes cluster
cluster_name: lff_adm_cluster

prefix_path: /opt/rke

ignore_docker_version: false

ingress:
  provider: nginx
  options:
    use-proxy-protocol: true

curl -k -vvv https://rancheradm.internal.de

* Uses proxy env variable no_proxy == '127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de,10.48.90.148,10.48.90.144,10.48.90.145,10.48.90.137'
*   Trying 10.48.90.137:443...
* TCP_NODELAY set
* Connected to rancheradm.internal.de (10.48.90.137) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=DE; ST=Pers; O=Internal; OU=LfF; CN=rancheradm.internal.de; serialNumber=1023752
*  start date: Aug 19 13:03:14 2021 GMT
*  expire date: Aug 19 13:03:14 2022 GMT
*  issuer: C=DE; O=Pers; OU=IT-Internal; CN=Internal-CA-2019
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55e50fc34820)
> GET / HTTP/2
> Host: rancheradm.internal.de
> User-Agent: curl/7.66.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 502
< date: Wed, 22 Sep 2021 11:55:52 GMT
< content-type: text/html
< content-length: 150
< strict-transport-security: max-age=15724800; includeSubDomains
<
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
* Connection #0 to host rancheradm.internal.de left intact

The corresponding log entry from the nginx load balancer

2021/09/22 13:55:49 [info] 38150#38150: *35 client 10.48.90.137:51246 connected to 0.0.0.0:443
2021/09/22 13:55:49 [info] 38150#38150: *35 proxy 10.48.90.137:51406 connected to 10.48.90.148:443
2021/09/22 13:55:52 [info] 38150#38150: *35 client disconnected, bytes from/to client:849/3325, bytes from/to upstream:3325/897

Bypassing the load balancer with curl -k -vvv https://rancheradm.internal.de --resolve rancheradm.internal.de:443:10.48.90.148

* Added rancheradm.internal.de:443:10.48.90.148 to DNS cache
* Uses proxy env variable no_proxy == '127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de,10.48.90.148,10.48.90.144,10.48.90.145,10.48.90.137'
* Hostname rancheradm.internal.de was found in DNS cache
*   Trying 10.48.90.148:443...
* TCP_NODELAY set
* Connected to rancheradm.internal.de (10.48.90.148) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to rancheradm.internal.de:443
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to rancheradm.internal.de:443

Rancher log kubectl -n cattle-system logs -f rancher-75b8bc6df6-gxdkh

rancher_log.txt

Ingress log kubectl -n ingress-nginx logs -f

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       nginx-0.48.1-rancher1
  Build:         git-4bf680f6c
  Repository:    https://github.com/rancher/ingress-nginx.git
  nginx version: nginx/1.20.1

-------------------------------------------------------------------------------

I0922 11:49:00.526080       8 flags.go:211] "Watching for Ingress" class="nginx"
W0922 11:49:00.526122       8 flags.go:216] Ingresses with an empty class will also be processed by this Ingress controller
W0922 11:49:00.526364       8 client_config.go:614] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0922 11:49:00.526680       8 main.go:241] "Creating API client" host="https://10.43.0.1:443"
I0922 11:49:00.558645       8 main.go:285] "Running in Kubernetes cluster" major="1" minor="21" git="v1.21.5" state="clean" commit="aea7bbadd2fc0cd689de94a54e5b7b758869d691" platform="linux/amd64"
I0922 11:49:01.041753       8 main.go:105] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"
I0922 11:49:01.044458       8 main.go:115] "Enabling new Ingress features available since Kubernetes v1.18"
W0922 11:49:01.047109       8 main.go:127] No IngressClass resource with name nginx found. Only annotation will be used.
I0922 11:49:01.161892       8 ssl.go:532] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"
I0922 11:49:01.196787       8 nginx.go:254] "Starting NGINX Ingress controller"
I0922 11:49:01.244227       8 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"nginx-configuration", UID:"78b806c8-e78f-4a23-af65-ff8df57cc18c", APIVersion:"v1", ResourceVersion:"722", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/nginx-configuration
I0922 11:49:02.397319       8 nginx.go:296] "Starting NGINX process"
I0922 11:49:02.397715       8 nginx.go:316] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I0922 11:49:02.397769       8 leaderelection.go:243] attempting to acquire leader lease ingress-nginx/ingress-controller-leader-nginx...
I0922 11:49:02.401149       8 controller.go:148] "Configuration changes detected, backend reload required"
I0922 11:49:02.728400       8 leaderelection.go:253] successfully acquired lease ingress-nginx/ingress-controller-leader-nginx
I0922 11:49:02.728581       8 status.go:84] "New leader elected" identity="nginx-ingress-controller-2qlgb"
I0922 11:49:02.750337       8 status.go:204] "POD is not ready" pod="ingress-nginx/nginx-ingress-controller-2qlgb" node="lff-l3vsrv103.internal.de"
I0922 11:49:02.892526       8 controller.go:165] "Backend successfully reloaded"
I0922 11:49:02.892750       8 controller.go:176] "Initial sync, sleeping for 1 second"
I0922 11:49:02.894537       8 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"nginx-ingress-controller-2qlgb", UID:"dafb9190-9de8-4ed2-913c-ab68df8742aa", APIVersion:"v1", ResourceVersion:"809", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0922 11:51:05.059317       8 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
W0922 11:51:05.059360       8 controller.go:1207] Error getting SSL certificate "cattle-system/tls-rancher-ingress": local SSL certificate cattle-system/tls-rancher-ingress was not found. Using default certificate
I0922 11:51:05.124349       8 main.go:112] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I0922 11:51:05.185810       8 backend_ssl.go:66] "Adding secret to local store" name="cattle-system/tls-rancher-ingress"
I0922 11:51:05.186054       8 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"babe268e-308b-46cd-8e91-b0ff30e3de70", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"1020", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W0922 11:51:08.359554       8 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0922 11:51:08.359822       8 controller.go:148] "Configuration changes detected, backend reload required"
I0922 11:51:08.577551       8 controller.go:165] "Backend successfully reloaded"
I0922 11:51:08.594568       8 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"nginx-ingress-controller-2qlgb", UID:"dafb9190-9de8-4ed2-913c-ab68df8742aa", APIVersion:"v1", ResourceVersion:"809", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0922 11:51:11.692938       8 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0922 11:52:03.320181       8 status.go:284] "updating Ingress status" namespace="cattle-system" ingress="rancher" currentValue=[] newValue=[{IP: Hostname:lff-l3vsrv103.internal.de Ports:[]}]
W0922 11:52:03.482712       8 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0922 11:52:03.483139       8 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"babe268e-308b-46cd-8e91-b0ff30e3de70", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"1566", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0922 11:53:43.909459       8 main.go:112] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I0922 11:53:44.429465       8 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"babe268e-308b-46cd-8e91-b0ff30e3de70", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"2513", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0922 11:54:04.989951       8 store.go:453] "secret was updated and it is used in ingress annotations. Parsing" secret="cattle-system/tls-rancher-ingress"
2021/09/22 11:55:50 [error] 70#70: *1724 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.48.90.137, server: rancheradm.internal.de, request: "GET / HTTP/2.0", upstream: "http://10.42.0.9:80/", host: "rancheradm.internal.de"
2021/09/22 11:55:51 [error] 70#70: *1724 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.48.90.137, server: rancheradm.internal.de, request: "GET / HTTP/2.0", upstream: "http://10.42.0.9:80/", host: "rancheradm.internal.de"
2021/09/22 11:55:52 [error] 70#70: *1724 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.48.90.137, server: rancheradm.internal.de, request: "GET / HTTP/2.0", upstream: "http://10.42.0.9:80/", host: "rancheradm.internal.de"
10.48.90.137 - - [22/Sep/2021:11:55:52 +0000] "GET / HTTP/2.0" 502 150 "-" "curl/7.66.0" 36 3.056 [cattle-system-rancher-80] [] 10.42.0.9:80, 10.42.0.9:80, 10.42.0.9:80 0, 0, 0 1.012, 1.024, 1.024 502, 502, 502 692c84ba208c21d08092f9ecfac89f6f
2021/09/22 12:02:49 [error] 70#70: *3494 broken header: "^V^C^A^B^@^A^@^A�^C^C�����^A^P�6^]F��^?^Nx%�KE=�'ҡ�J�^^YiG l��0ͳ1=��褚�����~��L        �%��^?�q�>^@>^S^B^S^C^S^A�,�0^@��̨̩�+�/^@��$�(^@k" while reading PROXY protocol, client: 10.48.90.137, server: 0.0.0.0:443

kubectl -n cattle-system get pods -l app=rancher -o wide

NAME                       READY   STATUS    RESTARTS   AGE   IP          NODE                          NOMINATED NODE   READINESS GATES
rancher-75b8bc6df6-gxdkh   1/1     Running   0          35m   10.42.0.9   lff-l3vsrv103.internal.de   <none>           <none>

kubectl -n cattle-system describe pods -l app=rancher

Name:         rancher-75b8bc6df6-gxdkh
Namespace:    cattle-system
Priority:     0
Node:         lff-l3vsrv103.internal.de/10.48.90.148
Start Time:   Wed, 22 Sep 2021 13:51:05 +0200
Labels:       app=rancher
              pod-template-hash=75b8bc6df6
              release=rancher
Annotations:  cni.projectcalico.org/containerID: 2cac66eb0748b024703b68f68cff56bc71db46d4545f6821e5a97d68098b4600
              cni.projectcalico.org/podIP: 10.42.0.9/32
              cni.projectcalico.org/podIPs: 10.42.0.9/32
Status:       Running
IP:           10.42.0.9
IPs:
  IP:           10.42.0.9
Controlled By:  ReplicaSet/rancher-75b8bc6df6
Containers:
  rancher:
    Container ID:  docker://401b3430835f5f4c57c4a05d27234773e06e502776afd1dd354d59a482b521a5
    Image:         rancher/rancher:v2.6.0
    Image ID:      docker-pullable://rancher/rancher@sha256:8aef5ceaf816c6610de49debe78fbd36deda3a33733fac47f9096640fc20eb48
    Port:          80/TCP
    Host Port:     0/TCP
    Args:
      --http-listen-port=80
      --https-listen-port=443
      --add-local=true
    State:          Running
      Started:      Wed, 22 Sep 2021 13:51:07 +0200
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:80/healthz delay=60s timeout=1s period=30s #success=1 #failure=3
    Readiness:      http-get http://:80/healthz delay=5s timeout=1s period=30s #success=1 #failure=3
    Environment:
      CATTLE_NAMESPACE:           cattle-system
      CATTLE_PEER_SERVICE:        rancher
      HTTP_PROXY:                 http://www.proxy.internal.de:80
      HTTPS_PROXY:                http://www.proxy.internal.de:80
      NO_PROXY:                   127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de
      CATTLE_BOOTSTRAP_PASSWORD:  <set to the key 'bootstrapPassword' in secret 'bootstrap-secret'>  Optional: false
    Mounts:
      /etc/rancher/ssl/cacerts.pem from tls-ca-volume (ro,path="cacerts.pem")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6ftjw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  tls-ca-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tls-ca
    Optional:    false
  kube-api-access-6ftjw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 cattle.io/os=linux:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Normal   Scheduled         39m                   default-scheduler  Successfully assigned cattle-system/rancher-75b8bc6df6-gxdkh to lff-l3vsrv103.bfd.bayern.de
  Normal   Pulled            39m                   kubelet            Container image "rancher/rancher:v2.6.0" already present on machine
  Normal   Created           39m                   kubelet            Created container rancher
  Normal   Started           39m                   kubelet            Started container rancher
  Warning  Unhealthy         37m (x4 over 38m)     kubelet            Readiness probe failed: Get "http://10.42.0.9:80/healthz": dial tcp 10.42.0.9:80: connect: connection refused
  Warning  Unhealthy         37m (x2 over 37m)     kubelet            Liveness probe failed: Get "http://10.42.0.9:80/healthz": dial tcp 10.42.0.9:80: connect: connection refused
  Warning  DNSConfigForming  3m57s (x34 over 39m)  kubelet            Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de

kubectl -n cattle-system logs -l app=rancher

2021/09/22 12:33:19 [ERROR] error syncing 'local/p-dhdcr': handler system-image-upgrade-controller: upgrade cluster local system service logging failed: get template system-library-rancher-logging failed: catalogtemplates.management.cattle.io "cattle-global-data/system-library-rancher-logging" not found, requeuing
2021/09/22 12:34:45 [ERROR] error syncing 'library': handler catalog: Clone failed: Cloning into 'management-state/catalog-cache/380859f1003fe7603cddc6c15b34b7263f1f0deaa92ddcde465811d032ee7078'...
fatal: unable to access 'https://git.rancher.io/charts/': Could not resolve proxy: www.proxy.internal.de
: exit status 128, requeuing
2021/09/22 12:34:45 [ERROR] error syncing 'system-library': handler catalog: Clone failed: Cloning into 'management-state/catalog-cache/6bda4fa402575032915dc735ad66131a2f8a36ff38c6fb593402ad315fbfdf85'...
fatal: unable to access 'https://git.rancher.io/system-charts/': Could not resolve proxy: www.proxy.internal.de
: exit status 128, handler system-image-upgrade-catalog-controller: upgrade cluster local system service alerting failed: get template cattle-global-data:system-library-rancher-monitoring failed, catalogtemplates.management.cattle.io "cattle-global-data/system-library-rancher-monitoring" not found, requeuing
2021/09/22 12:34:45 [ERROR] error syncing 'helm3-library': handler catalog: Clone failed: Cloning into 'management-state/catalog-cache/f341cfdfa521a9aa2b993cb34b26bb91b2d173ef1a7df8d41b8921b0e4f82788'...
fatal: unable to access 'https://git.rancher.io/helm3-charts/': Could not resolve proxy: www.proxy.internal.de
: exit status 128, requeuing

kubectl -n cattle-system get events

LAST SEEN   TYPE      REASON              OBJECT                                  MESSAGE
43m         Normal    Scheduled           pod/helm-operation-2k6nt                Successfully assigned cattle-system/helm-operation-2k6nt to lff-l3vsrv103.internal.de
43m         Warning   FailedMount         pod/helm-operation-2k6nt                MountVolume.SetUp failed for volume "pod-impersonation-helm-op-fznqn-token-xlqdm" : secret "pod-impersonation-helm-op-fznqn-token-xlqdm" not found
43m         Warning   FailedMount         pod/helm-operation-2k6nt                MountVolume.SetUp failed for volume "data" : secret "helm-operation-pvxmn" not found
43m         Warning   DNSConfigForming    pod/helm-operation-2k6nt                Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
43m         Normal    Pulled              pod/helm-operation-2k6nt                Container image "rancher/shell:v0.1.10" already present on machine
43m         Normal    Created             pod/helm-operation-2k6nt                Created container helm
43m         Normal    Started             pod/helm-operation-2k6nt                Started container helm
43m         Normal    Pulled              pod/helm-operation-2k6nt                Container image "rancher/shell:v0.1.10" already present on machine
43m         Normal    Created             pod/helm-operation-2k6nt                Created container proxy
43m         Normal    Started             pod/helm-operation-2k6nt                Started container proxy
43m         Normal    Scheduled           pod/helm-operation-pnzmb                Successfully assigned cattle-system/helm-operation-pnzmb to lff-l3vsrv103.internal.de
42m         Warning   DNSConfigForming    pod/helm-operation-pnzmb                Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
43m         Normal    Pulled              pod/helm-operation-pnzmb                Container image "rancher/shell:v0.1.10" already present on machine
43m         Normal    Created             pod/helm-operation-pnzmb                Created container helm
43m         Normal    Started             pod/helm-operation-pnzmb                Started container helm
43m         Normal    Pulled              pod/helm-operation-pnzmb                Container image "rancher/shell:v0.1.10" already present on machine
43m         Normal    Created             pod/helm-operation-pnzmb                Created container proxy
43m         Normal    Started             pod/helm-operation-pnzmb                Started container proxy
44m         Normal    Scheduled           pod/helm-operation-t4qrz                Successfully assigned cattle-system/helm-operation-t4qrz to lff-l3vsrv103.internal.de
43m         Warning   DNSConfigForming    pod/helm-operation-t4qrz                Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
44m         Normal    Pulled              pod/helm-operation-t4qrz                Container image "rancher/shell:v0.1.10" already present on machine
44m         Normal    Created             pod/helm-operation-t4qrz                Created container helm
44m         Normal    Started             pod/helm-operation-t4qrz                Started container helm
44m         Normal    Pulled              pod/helm-operation-t4qrz                Container image "rancher/shell:v0.1.10" already present on machine
44m         Normal    Created             pod/helm-operation-t4qrz                Created container proxy
44m         Normal    Started             pod/helm-operation-t4qrz                Started container proxy
42m         Normal    Scheduled           pod/helm-operation-vgdcr                Successfully assigned cattle-system/helm-operation-vgdcr to lff-l3vsrv103.internal.de
42m         Warning   DNSConfigForming    pod/helm-operation-vgdcr                Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
42m         Normal    Pulled              pod/helm-operation-vgdcr                Container image "rancher/shell:v0.1.10" already present on machine
42m         Normal    Created             pod/helm-operation-vgdcr                Created container helm
42m         Normal    Started             pod/helm-operation-vgdcr                Started container helm
42m         Normal    Pulled              pod/helm-operation-vgdcr                Container image "rancher/shell:v0.1.10" already present on machine
42m         Normal    Created             pod/helm-operation-vgdcr                Created container proxy
42m         Normal    Started             pod/helm-operation-vgdcr                Started container proxy
46m         Normal    Scheduled           pod/rancher-75b8bc6df6-gxdkh            Successfully assigned cattle-system/rancher-75b8bc6df6-gxdkh to lff-l3vsrv103.internal.de
72s         Warning   DNSConfigForming    pod/rancher-75b8bc6df6-gxdkh            Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
46m         Normal    Pulled              pod/rancher-75b8bc6df6-gxdkh            Container image "rancher/rancher:v2.6.0" already present on machine
46m         Normal    Created             pod/rancher-75b8bc6df6-gxdkh            Created container rancher
46m         Normal    Started             pod/rancher-75b8bc6df6-gxdkh            Started container rancher
44m         Warning   Unhealthy           pod/rancher-75b8bc6df6-gxdkh            Readiness probe failed: Get "http://10.42.0.9:80/healthz": dial tcp 10.42.0.9:80: connect: connection refused
44m         Warning   Unhealthy           pod/rancher-75b8bc6df6-gxdkh            Liveness probe failed: Get "http://10.42.0.9:80/healthz": dial tcp 10.42.0.9:80: connect: connection refused
46m         Normal    SuccessfulCreate    replicaset/rancher-75b8bc6df6           Created pod: rancher-75b8bc6df6-gxdkh
42m         Normal    Scheduled           pod/rancher-webhook-7f84b74ddb-qfsnz    Successfully assigned cattle-system/rancher-webhook-7f84b74ddb-qfsnz to lff-l3vsrv103.internal.de
42m         Warning   FailedMount         pod/rancher-webhook-7f84b74ddb-qfsnz    MountVolume.SetUp failed for volume "tls" : secret "rancher-webhook-tls" not found
2m41s       Warning   DNSConfigForming    pod/rancher-webhook-7f84b74ddb-qfsnz    Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
42m         Normal    Pulled              pod/rancher-webhook-7f84b74ddb-qfsnz    Container image "rancher/rancher-webhook:v0.2.0" already present on machine
42m         Normal    Created             pod/rancher-webhook-7f84b74ddb-qfsnz    Created container rancher-webhook
42m         Normal    Started             pod/rancher-webhook-7f84b74ddb-qfsnz    Started container rancher-webhook
42m         Normal    SuccessfulCreate    replicaset/rancher-webhook-7f84b74ddb   Created pod: rancher-webhook-7f84b74ddb-qfsnz
42m         Normal    ScalingReplicaSet   deployment/rancher-webhook              Scaled up replica set rancher-webhook-7f84b74ddb to 1
46m         Normal    ScalingReplicaSet   deployment/rancher                      Scaled up replica set rancher-75b8bc6df6 to 1
43m         Normal    Sync                ingress/rancher                         Scheduled for sync

kubectl -n cattle-system get ingress

NAME      CLASS    HOSTS                    ADDRESS                       PORTS     AGE
rancher   <none>   rancheradm.internal.de   lff-l3vsrv103.internal.de   80, 443   50m

kubectl -n ingress-nginx logs -l app=ingress-nginx

W0922 11:52:03.482712       8 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0922 11:52:03.483139       8 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"babe268e-308b-46cd-8e91-b0ff30e3de70", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"1566", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0922 11:53:43.909459       8 main.go:112] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I0922 11:53:44.429465       8 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"babe268e-308b-46cd-8e91-b0ff30e3de70", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"2513", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0922 11:54:04.989951       8 store.go:453] "secret was updated and it is used in ingress annotations. Parsing" secret="cattle-system/tls-rancher-ingress"
2021/09/22 11:55:50 [error] 70#70: *1724 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.48.90.137, server: rancheradm.internal.de, request: "GET / HTTP/2.0", upstream: "http://10.42.0.9:80/", host: "rancheradm.internal.de"
2021/09/22 11:55:51 [error] 70#70: *1724 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.48.90.137, server: rancheradm.internal.de, request: "GET / HTTP/2.0", upstream: "http://10.42.0.9:80/", host: "rancheradm.internal.de"
2021/09/22 11:55:52 [error] 70#70: *1724 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.48.90.137, server: rancheradm.internal.de, request: "GET / HTTP/2.0", upstream: "http://10.42.0.9:80/", host: "rancheradm.internal.de"
10.48.90.137 - - [22/Sep/2021:11:55:52 +0000] "GET / HTTP/2.0" 502 150 "-" "curl/7.66.0" 36 3.056 [cattle-system-rancher-80] [] 10.42.0.9:80, 10.42.0.9:80, 10.42.0.9:80 0, 0, 0 1.012, 1.024, 1.024 502, 502, 502 692c84ba208c21d08092f9ecfac89f6f
2021/09/22 12:02:49 [error] 70#70: *3494 broken header: "�������6F��x%�KE=�'ҡ�J�^iG l��0ͳ1=��褚�����~��L        �%���q�>>�,�0��̩�̨̩̪�+(k" while reading PROXY protocol, client: 10.48.90.137, server: 0.0.0.0:443

Leader Election

{"holderIdentity":"rancher-75b8bc6df6-gxdkh","leaseDurationSeconds":45,"acquireTime":"2021-09-22T11:52:53Z","renewTime":"2021-09-22T12:46:02Z","leaderTransitions":0}

I hope this helps. I can see that something is wrong, but I cannot make much sense of these log outputs.
Thank you for helping so far.

@samstride

samstride commented Sep 24, 2021

I am running into this exact same issue.

Upgrading from RKE version 1.2.11 to RKE version 1.3.1 makes the UI inaccessible.

Downgrading brings back the UI.

This is what the network and ingress sections look like in the cluster.yml file:

network:
  plugin: calico

ingress:
  provider: nginx
  options:
    use-forwarded-headers: "true"

@superseb
Contributor

We'll need the same outputs and logs requested above to cross-reference. Also, could both of you please run the networking and DNS tests from https://rancher.com/docs/rancher/v2.6/en/troubleshooting/ to make sure the basics are working.

@samstride

samstride commented Sep 27, 2021

@superseb, sorry, I had to roll back. I might try again later in the week and will collect logs.

Briefly, this is what works:

rke version 1.2.11,  kubernetes 1.20.9
rke version 1.3.1,  kubernetes 1.20.11

This does not work:

rke version 1.3.1,  kubernetes 1.21.5

I also had ingress issues when I upgraded rancher (via helm) to kubernetes v1.21.5.

I have a feeling it could be related to kubernetes/ingress-nginx#7510.

@kagehisa
Author

kagehisa commented Sep 27, 2021

Overlaytest
All hosts can reach each other => Overlay works
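
For reference, the overlay test from the Rancher troubleshooting docs essentially runs a test pod on every node and pings every pod IP from every other pod. A rough manual equivalent (the target IP is taken from the pod listing earlier in this thread, the pod name is purely illustrative):

kubectl get pods -o wide --all-namespaces        # list pod IPs and the nodes they run on
kubectl run overlaytest --rm -it --restart=Never --image=busybox:1.28 -- ping -c 3 10.42.0.9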

MTU settings on nodes ip addr | grep mtu

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
44: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
82: cali2bb2ff5648d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
84: cali0ea55270f10@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default

I cannot check them on the network devices at the moment, but I will try to get hold of the network admin.

DNS

kubectl -n kube-system get pods -l k8s-app=kube-dns

NAME                       READY   STATUS    RESTARTS   AGE
coredns-685d6d555d-fwczp   1/1     Running   0          17m
coredns-685d6d555d-rkhgk   1/1     Running   0          17m

kubectl -n kube-system get svc -l k8s-app=kube-dns

NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP,9153/TCP   19m

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default

- If you don't see a command prompt, try pressing enter.
- nslookup: can't resolve 'kubernetes.default'
- pod "busybox" deleted
- pod default/busybox terminated (Error)

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com

- If you don't see a command prompt, try pressing enter.
- nslookup: can't resolve 'www.google.com'
- pod "busybox" deleted
- pod default/busybox terminated (Error)

DNS resolution on host system works fine though.

DNS resolve test on all nodes with ds-dnstest.yml
Failed on every node.

Core DNS specific

kubectl -n kube-system logs -l k8s-app=kube-dns

[ERROR] plugin/errors: 2 www.google.com.internal.de. A: read udp 10.42.0.3:38677->10.48.25.12:53: read: no route to host
[ERROR] plugin/errors: 2 www.google.com. A: read udp 10.42.0.3:58930->10.173.230.153:53: i/o timeout
[ERROR] plugin/errors: 2 www.google.com.internal.de. A: read udp 10.42.0.3:33851->10.173.230.153:53: read: no route to host
[ERROR] plugin/errors: 2 www.google.com. A: read udp 10.42.0.3:45903->10.48.25.12:53: i/o timeout
[ERROR] plugin/errors: 2 web.de. AAAA: read udp 10.42.0.3:44863->10.48.25.12:53: i/o timeout
[ERROR] plugin/errors: 2 web.de. AAAA: read udp 10.42.0.3:59853->10.173.230.153:53: i/o timeout
[ERROR] plugin/errors: 2 web.de. A: read udp 10.42.0.3:33204->10.173.77.153:53: i/o timeout
[ERROR] plugin/errors: 2 web.de. A: read udp 10.42.0.3:48611->10.173.77.153:53: i/o timeout
[ERROR] plugin/errors: 2 web.de. A: read udp 10.42.0.3:59439->10.173.77.153:53: i/o timeout
[ERROR] plugin/errors: 2 web.de.internal.de. A: read udp 10.42.0.3:60442->10.173.77.153:53: i/o timeout
[ERROR] plugin/errors: 2 2343882539559209997.5833510137765733991. HINFO: read udp 10.42.1.4:38572->10.48.25.12:53: read: no route to host
[ERROR] plugin/errors: 2 www.google.com. AAAA: read udp 10.42.1.4:43688->10.173.77.153:53: i/o timeout
[ERROR] plugin/errors: 2 www.google.com. AAAA: read udp 10.42.1.4:60269->10.173.230.153:53: i/o timeout
[ERROR] plugin/errors: 2 www.google.com.internal.de. AAAA: read udp 10.42.1.4:48978->10.173.230.153:53: i/o timeout
[ERROR] plugin/errors: 2 www.google.com. AAAA: read udp 10.42.1.4:48327->10.173.230.153:53: i/o timeout
[ERROR] plugin/errors: 2 web.de. AAAA: read udp 10.42.1.4:55961->10.173.230.153:53: i/o timeout
[ERROR] plugin/errors: 2 web.de. AAAA: read udp 10.42.1.4:38270->10.48.25.12:53: i/o timeout
[ERROR] plugin/errors: 2 web.de. A: read udp 10.42.1.4:49732->10.48.25.12:53: i/o timeout

kubectl -n kube-system get configmap coredns -o go-template={{.data.Corefile}}

.:53 {
    errors
    health {
      lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . "/etc/resolv.conf"
    cache 30
    loop
    reload
    loadbalance
} # STUBDOMAINS - Rancher specific change

CoreDNS resolv.conf
CoreDNS and the host /etc/resolv.conf are identical.
Tested with kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'

CoreDNS log file after enabling query logging

[INFO] plugin/reload: Running configuration MD5 = eaf30f4371adc42d450f7260c14ca6ad
[INFO] Reloading complete
[INFO] 127.0.0.1:38508 - 65460 "HINFO IN 80328382937443733.4147782752510101583. udp 55 false 512" NOERROR - 0 6.003067151s
[ERROR] plugin/errors: 2 80328382937443733.4147782752510101583. HINFO: read udp 10.42.1.4:56774->10.173.230.153:53: i/o timeout
[INFO] 127.0.0.1:56814 - 48992 "HINFO IN 80328382937443733.4147782752510101583. udp 55 false 512" NOERROR - 0 4.001242368s
[ERROR] plugin/errors: 2 80328382937443733.4147782752510101583. HINFO: read udp 10.42.1.4:60721->10.173.230.153:53: read: no route to host
[INFO] 127.0.0.1:48154 - 22201 "HINFO IN 80328382937443733.4147782752510101583. udp 55 false 512" NOERROR - 0 2.000738084s
[ERROR] plugin/errors: 2 80328382937443733.4147782752510101583. HINFO: read udp 10.42.1.4:38380->10.48.25.12:53: read: no route to host
[INFO] 127.0.0.1:58532 - 8589 "HINFO IN 80328382937443733.4147782752510101583. udp 55 false 512" NOERROR - 0 0.000183s
[ERROR] plugin/errors: 2 80328382937443733.4147782752510101583. HINFO: read udp 10.42.1.4:51196->10.173.77.153:53: read: no route to host
[INFO] 127.0.0.1:47432 - 25027 "HINFO IN 5553935729069624834.8065737092722111741. udp 57 false 512" NOERROR - 0 2.000415414s
[ERROR] plugin/errors: 2 5553935729069624834.8065737092722111741. HINFO: read udp 10.42.0.3:38800->10.48.25.12:53: i/o timeout
[INFO] 127.0.0.1:35368 - 8604 "HINFO IN 5553935729069624834.8065737092722111741. udp 57 false 512" NOERROR - 0 2.000581613s
[ERROR] plugin/errors: 2 5553935729069624834.8065737092722111741. HINFO: read udp 10.42.0.3:60249->10.173.77.153:53: i/o timeout
[INFO] 127.0.0.1:46832 - 42835 "HINFO IN 5553935729069624834.8065737092722111741. udp 57 false 512" NOERROR - 0 2.000752913s
[ERROR] plugin/errors: 2 5553935729069624834.8065737092722111741. HINFO: read udp 10.42.0.3:39132->10.173.230.153:53: i/o timeout
[INFO] 127.0.0.1:55847 - 12168 "HINFO IN 5553935729069624834.8065737092722111741. udp 57 false 512" NOERROR - 0 2.001013913s
[ERROR] plugin/errors: 2 5553935729069624834.8065737092722111741. HINFO: read udp 10.42.0.3:52820->10.48.25.12:53: i/o timeout
[INFO] 127.0.0.1:40863 - 26393 "HINFO IN 5553935729069624834.8065737092722111741. udp 57 false 512" NOERROR - 0 2.000321014s
[ERROR] plugin/errors: 2 5553935729069624834.8065737092722111741. HINFO: read udp 10.42.0.3:40247->10.173.77.153:53: i/o timeout

Again, I really appreciate the help. It seems CoreDNS has an issue. Any ideas what this may be?

@superseb
Contributor

What OS are you running, and can you share docker info? Does any pod have external connectivity? (ping 8.8.8.8)
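
For example, with the same throwaway busybox pod pattern used for the DNS checks above:

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- ping -c 3 8.8.8.8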

@kagehisa
Author

Hi, I'm running on SLES 15 SP3. Sorry I completely forgot about docker info in my first post.

Docker info

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 28
  Running: 14
  Paused: 0
  Stopped: 14
 Images: 17
 Server Version: 20.10.6-ce
 Storage Driver: btrfs
  Build Version: Btrfs v4.15
  Library Version: 102
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux oci runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version:
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.3.18-59.24-default
 Operating System: SUSE Linux Enterprise Server 15 SP3
 OSType: linux
 Architecture: x86_64
 CPUs: 1
 Total Memory: 5.523GiB
 Name: lff-l3vsrv103
 ID: P3IL:P5PM:HP2I:RPIH:OXBX:NDBR:IJFF:EH4P:NPKD:6RV7:5ME5:FTXI
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http://www.proxy.internal.de:80
 HTTPS Proxy: http://www.proxy.internal.de:80
 No Proxy: 127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de,harbor.internal.de,10.48.90.148,10.48.90.144,10.48.90.145,10.48.90.137
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  harbor.internal.de
  127.0.0.0/8
 Live Restore Enabled: false

Pinging the outside is not possible due to the corporate proxy, but using kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- ping 10.173.230.153 against the IPs of our DNS servers works just fine.

@superseb
Contributor

And I assume manually querying the DNS server for a DNS record does not work from any pod (but does work on the host)?
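
For example, pointing busybox's nslookup directly at one of the upstream servers that shows up in the CoreDNS errors above (server IP taken from those logs):

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com 10.48.25.12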

@kagehisa
Author

kagehisa commented Sep 27, 2021

Yes. nslookup works on the host side but not from the pods.

Ah, and I forgot: these are the ports I opened in my firewall according to the documentation:

    - 22/tcp
    - 80/tcp
    - 443/tcp
    - 2376/tcp
    - 2379/tcp
    - 2380/tcp
    - 6443/tcp
    - 8472/udp
    - 9099/tcp
    - 9796/tcp
    - 10250/tcp
    - 10254/tcp
    - 30000-32767/tcp
    - 30000-32767/udp

@superseb
Contributor

I'd have to run some more checks, but if the pods can't talk to your DNS server, it has to be either CNI related (the CNI was updated between 1.20 and 1.21), so please check/supply logs from the CNI pods, or something with the host firewall, but that is more unlikely as it's the same hosts being used.
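
For this cluster that would be the canal pods in kube-system, e.g. (the pod name is just an example and will differ):

kubectl -n kube-system get pods -o wide | grep canal
kubectl -n kube-system logs canal-z8tht --all-containers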

@kagehisa
Author

Hmm, OK, but the pods have no problem pinging my DNS servers' IP addresses or the IPs of the other nodes. Only pinging outside of our company network is not possible, due to our corporate proxy.

Calico-kube-controllers log

2021-09-27 12:02:31.802 [INFO][1] main.go 92: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0927 12:02:31.804448       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2021-09-27 12:02:31.805 [INFO][1] main.go 113: Ensuring Calico datastore is initialized
2021-09-27 12:02:31.840 [INFO][1] main.go 153: Getting initial config snapshot from datastore
2021-09-27 12:02:32.045 [INFO][1] main.go 156: Got initial config snapshot
2021-09-27 12:02:32.045 [INFO][1] watchersyncer.go 89: Start called
2021-09-27 12:02:32.045 [INFO][1] main.go 173: Starting status report routine
2021-09-27 12:02:32.045 [INFO][1] main.go 182: Starting Prometheus metrics server on port 9094
2021-09-27 12:02:32.045 [INFO][1] main.go 418: Starting controller ControllerType="Node"
2021-09-27 12:02:32.045 [INFO][1] node_controller.go 143: Starting Node controller
2021-09-27 12:02:32.045 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2021-09-27 12:02:32.045 [INFO][1] node_syncer.go 65: Node controller syncer status updated: wait-for-ready
2021-09-27 12:02:32.045 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2021-09-27 12:02:32.045 [INFO][1] watchercache.go 174: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
2021-09-27 12:02:32.046 [INFO][1] watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2021-09-27 12:02:32.049 [INFO][1] resources.go 349: Main client watcher loop
2021-09-27 12:02:32.102 [INFO][1] watchercache.go 271: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2021-09-27 12:02:32.102 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2021-09-27 12:02:32.102 [INFO][1] node_syncer.go 65: Node controller syncer status updated: resync
2021-09-27 12:02:32.102 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2021-09-27 12:02:32.145 [INFO][1] node_controller.go 159: Node controller is now running
2021-09-27 12:02:32.145 [INFO][1] ipam.go 69: Synchronizing IPAM data
2021-09-27 12:02:32.150 [INFO][1] watchercache.go 271: Sending synced update ListRoot="/calico/ipam/v2/assignment/"
2021-09-27 12:02:32.150 [INFO][1] ipam.go 78: Node and IPAM data is in sync
2021-09-27 12:02:32.151 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2021-09-27 12:02:32.151 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2021-09-27 12:02:32.151 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2021-09-27 12:02:32.151 [INFO][1] node_syncer.go 65: Node controller syncer status updated: in-sync
2021-09-27 12:02:32.165 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints

kubectl -n kube-system logs -f canal-z8tht calico-node

2021-09-27 12:02:28.114 [INFO][9] startup/startup.go 390: Early log level set to info
2021-09-27 12:02:28.115 [INFO][9] startup/startup.go 406: Using NODENAME environment for node name lff-l3vsrv103.internal.de
2021-09-27 12:02:28.115 [INFO][9] startup/startup.go 418: Determined node name: lff-l3vsrv103.internal.de
2021-09-27 12:02:28.115 [INFO][9] startup/startup.go 103: Starting node lff-l3vsrv103.internal.de with version v3.19.2
2021-09-27 12:02:28.117 [INFO][9] startup/startup.go 450: Checking datastore connection
2021-09-27 12:02:28.132 [INFO][9] startup/startup.go 474: Datastore connection verified
2021-09-27 12:02:28.133 [INFO][9] startup/startup.go 113: Datastore is ready
2021-09-27 12:02:28.179 [INFO][9] startup/startup.go 774: Using autodetected IPv4 address on interface eth0: 10.48.90.148/26
2021-09-27 12:02:28.179 [INFO][9] startup/startup.go 591: Node IPv4 changed, will check for conflicts
2021-09-27 12:02:28.233 [INFO][9] startup/startup.go 967: Selected default IP pool is '192.168.0.0/16'
2021-09-27 12:02:28.233 [INFO][9] startup/startup.go 706: FELIX_IPV6SUPPORT is false through environment variable
2021-09-27 12:02:28.285 [INFO][9] startup/startup.go 228: Using node name: lff-l3vsrv103.internal.de
CALICO_NETWORKING_BACKEND is none - no BGP daemon running
Calico node started successfully
2021-09-27 12:02:29.773 [INFO][38] monitor-addresses/startup.go 390: Early log level set to info
2021-09-27 12:02:29.785 [INFO][38] monitor-addresses/startup.go 406: Using NODENAME environment for node name lff-l3vsrv103.internal.de
2021-09-27 12:02:29.805 [INFO][38] monitor-addresses/startup.go 418: Determined node name: lff-l3vsrv103.internal.de
2021-09-27 12:03:02.474 [WARNING][39] felix/table.go 652: Detected out-of-sync inserts, marking for resync actualRuleIDs=[]string{"", "O3lYWMrLQYEMJtB5", "", "", "", "", "", "", ""} chainName="POSTROUTING" expectedRuleIDs=[]string{"O3lYWMrLQYEMJtB5", "", "", "", "", "", "", "", ""} ipVersion=0x4 table="nat"

kubectl -n kube-system logs -f canal-z8tht kube-flannel

I0927 12:02:27.980557       1 main.go:520] Determining IP address of default interface
I0927 12:02:27.981786       1 main.go:533] Using interface with name eth0 and address 10.48.90.148
I0927 12:02:27.981937       1 main.go:550] Defaulting external address to interface address (10.48.90.148)
W0927 12:02:27.982033       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0927 12:02:28.010785       1 kube.go:116] Waiting 10m0s for node controller to sync
I0927 12:02:28.012350       1 kube.go:299] Starting kube subnet manager
I0927 12:02:29.014478       1 kube.go:123] Node controller sync successful
I0927 12:02:29.014501       1 main.go:254] Created subnet manager: Kubernetes Subnet Manager - lff-l3vsrv103.bfd.bayern.de
I0927 12:02:29.014506       1 main.go:257] Installing signal handlers
I0927 12:02:29.014724       1 main.go:392] Found network config - Backend type: vxlan
I0927 12:02:29.014803       1 vxlan.go:123] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0927 12:02:29.083610       1 iface.go:154] removed IP address 10.42.2.0/32 flannel.1 from flannel.1
I0927 12:02:29.086433       1 main.go:357] Current network or subnet (10.42.0.0/16, 10.42.1.0/24) is not equal to previous one (0.0.0.0/0, 0.0.0.0/0), trying to recycle old iptables rules
I0927 12:02:29.313311       1 iptables.go:172] Deleting iptables rule: -s 0.0.0.0/0 -d 0.0.0.0/0 -j RETURN
I0927 12:02:29.362112       1 iptables.go:172] Deleting iptables rule: -s 0.0.0.0/0 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0927 12:02:29.363125       1 iptables.go:172] Deleting iptables rule: ! -s 0.0.0.0/0 -d 0.0.0.0/0 -j RETURN
I0927 12:02:29.363909       1 iptables.go:172] Deleting iptables rule: ! -s 0.0.0.0/0 -d 0.0.0.0/0 -j MASQUERADE --random-fully
I0927 12:02:29.364753       1 main.go:307] Setting up masking rules
I0927 12:02:29.365806       1 main.go:315] Changing default FORWARD chain policy to ACCEPT
I0927 12:02:29.366790       1 main.go:323] Wrote subnet file to /run/flannel/subnet.env
I0927 12:02:29.366879       1 main.go:327] Running backend.
I0927 12:02:29.366962       1 main.go:345] Waiting for all goroutines to exit
I0927 12:02:29.367178       1 vxlan_network.go:59] watching for new subnet leases
I0927 12:02:29.378138       1 iptables.go:148] Some iptables rules are missing; deleting and recreating rules
I0927 12:02:29.378282       1 iptables.go:172] Deleting iptables rule: -s 10.42.0.0/16 -d 10.42.0.0/16 -j RETURN
I0927 12:02:29.379383       1 iptables.go:172] Deleting iptables rule: -s 10.42.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0927 12:02:29.380494       1 iptables.go:172] Deleting iptables rule: ! -s 10.42.0.0/16 -d 10.42.1.0/24 -j RETURN
I0927 12:02:29.381408       1 iptables.go:172] Deleting iptables rule: ! -s 10.42.0.0/16 -d 10.42.0.0/16 -j MASQUERADE --random-fully
I0927 12:02:29.382499       1 iptables.go:160] Adding iptables rule: -s 10.42.0.0/16 -d 10.42.0.0/16 -j RETURN
I0927 12:02:29.384322       1 iptables.go:160] Adding iptables rule: -s 10.42.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0927 12:02:29.386131       1 iptables.go:160] Adding iptables rule: ! -s 10.42.0.0/16 -d 10.42.1.0/24 -j RETURN
I0927 12:02:29.388805       1 iptables.go:160] Adding iptables rule: ! -s 10.42.0.0/16 -d 10.42.0.0/16 -j MASQUERADE --random-fully

kubectl -n kube-system logs -f canal-z8tht install-cni

time="2021-09-27T12:02:23Z" level=info msg="Running as a Kubernetes pod" source="install.go:140"
time="2021-09-27T12:02:23Z" level=info msg="Installed /host/opt/cni/bin/bandwidth"
time="2021-09-27T12:02:23Z" level=info msg="Installed /host/opt/cni/bin/calico"
time="2021-09-27T12:02:23Z" level=info msg="Installed /host/opt/cni/bin/calico-ipam"
time="2021-09-27T12:02:23Z" level=info msg="Installed /host/opt/cni/bin/flannel"
time="2021-09-27T12:02:24Z" level=info msg="Installed /host/opt/cni/bin/host-local"
time="2021-09-27T12:02:24Z" level=info msg="Installed /host/opt/cni/bin/install"
time="2021-09-27T12:02:24Z" level=info msg="Installed /host/opt/cni/bin/loopback"
time="2021-09-27T12:02:24Z" level=info msg="Installed /host/opt/cni/bin/portmap"
time="2021-09-27T12:02:24Z" level=info msg="Installed /host/opt/cni/bin/tuning"
time="2021-09-27T12:02:24Z" level=info msg="Wrote Calico CNI binaries to /host/opt/cni/bin\n"
time="2021-09-27T12:02:24Z" level=info msg="CNI plugin version: v3.19.2\n"
time="2021-09-27T12:02:24Z" level=info msg="/host/secondary-bin-dir is not writeable, skipping"
time="2021-09-27T12:02:24Z" level=info msg="Using CNI config template from CNI_NETWORK_CONFIG environment variable." source="install.go:319"
time="2021-09-27T12:02:24Z" level=info msg="Created /host/etc/cni/net.d/10-canal.conflist"
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "log_level": "WARNING",
      "log_file_path": "/var/log/calico/cni/cni.log",
      "datastore_type": "kubernetes",
      "nodename": "lff-l3vsrv103.bfd.bayern.de",
      "mtu": 1450,
      "ipam": {
          "type": "host-local",
          "subnet": "usePodCidr"
      },
      "policy": {
          "type": "k8s"
      },
      "kubernetes": {
          "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    }
  ]
}
time="2021-09-27T12:02:24Z" level=info msg="Removing /host/etcd/cni/net.d/10-calico.conflist"
time="2021-09-27T12:02:24Z" level=warning msg="Failed to remove 10-calico.conflist" error="remove /host/etcd/cni/net.d/10-calico.conflist: no such file or directory"
time="2021-09-27T12:02:24Z" level=info msg="Done configuring CNI.  Sleep= false"

kubectl -n kube-system logs -f canal-z8tht flexvol-driver
File exists /host/driver/uds. Copy over

kubectl -n kube-system logs -f rke-network-plugin-deploy-job-jfxvw

configmap/canal-config created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/kubecontrollersconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/canal-flannel created
clusterrolebinding.rbac.authorization.k8s.io/canal-calico created
daemonset.apps/canal created
serviceaccount/canal created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created
Warning: policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
poddisruptionbudget.policy/calico-kube-controllers created

If you can think of anything else you may need, please tell me. These were all the logs I could think of.

@kagehisa
Copy link
Author

kagehisa commented Sep 27, 2021

Ok, I just looked around and found this issue. After reading it I tried disabling firewalld, and DNS resolution for pods now works. It seems a change in either Calico or firewalld is causing this problem.
Since I have to enable firewalld on our host systems (it is the default firewall for SLES), it would be nice to know whether there is a fix for this or a plan to fix it.
Tomorrow I will check whether the Rancher 2.6 deployment also works with firewalld disabled.
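For completeness, this is roughly how I tested it (a sketch; run on each node, then retry the lookup from a pod):

sudo systemctl stop firewalld
kubectl run -it --rm --restart=Never dnstest --image=busybox:1.28 -- nslookup kubernetes.default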

Edit:
I deployed Rancher 2.6 without firewalld, but I still only get the following error when running curl against the Rancher address.

* Uses proxy env variable no_proxy == '127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de,10.48.90.148,10.48.90.144,10.48.90.145,10.48.90.137'
*   Trying 10.48.90.137:443...
* TCP_NODELAY set
* Connected to rancheradm.internal.de (10.48.90.137) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to rancheradm.internal.de:443
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to rancheradm.internal.de:443

I will run the above-mentioned checks regarding the Rancher HA deployment again tomorrow and see if I can spot anything.

@kagehisa
Copy link
Author

Ok, I did the Rancher HA checks again without firewalld on any node, and there are only a few log entries that seem strange to me.
First, the log from my load balancer:

2021/09/28 08:22:54 [notice] 59816#59816: using the "epoll" event method
2021/09/28 08:22:54 [notice] 59816#59816: nginx/1.19.8
2021/09/28 08:22:54 [notice] 59816#59816: built by gcc 7.5.0 (SUSE Linux)
2021/09/28 08:22:54 [notice] 59816#59816: OS: Linux 5.3.18-59.24-default
2021/09/28 08:22:54 [notice] 59816#59816: getrlimit(RLIMIT_NOFILE): 1024:524288
2021/09/28 08:22:54 [notice] 59816#59816: start worker processes
2021/09/28 08:22:54 [notice] 59816#59816: start worker process 59826
2021/09/28 08:22:54 [notice] 59816#59816: start worker process 59827
2021/09/28 08:22:54 [notice] 59816#59816: start worker process 59828
2021/09/28 08:22:54 [notice] 59816#59816: start worker process 59829
2021/09/28 08:24:55 [info] 59826#59826: *1 client 10.48.90.137:51668 connected to 0.0.0.0:443
2021/09/28 08:24:55 [error] 59826#59826: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:24:55 [error] 59826#59826: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.144:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:24:55 [error] 59826#59826: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.145:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:39:02 [info] 59826#59826: *5 client 10.48.90.137:51676 connected to 0.0.0.0:443
2021/09/28 08:39:02 [error] 59826#59826: *5 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.145:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:39:02 [error] 59826#59826: *5 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.144:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:39:02 [error] 59826#59826: *5 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:39:11 [info] 59826#59826: *9 client 10.48.90.137:51684 connected to 0.0.0.0:443
2021/09/28 08:39:11 [error] 59826#59826: *9 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:39:11 [warn] 59826#59826: *9 upstream server temporarily disabled while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.148:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:39:11 [error] 59826#59826: *9 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.144:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:39:11 [warn] 59826#59826: *9 upstream server temporarily disabled while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.144:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:39:11 [error] 59826#59826: *9 connect() failed (111: Connection refused) while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.145:443", bytes from/to client:0/0, bytes from/to upstream:0/0
2021/09/28 08:39:11 [warn] 59826#59826: *9 upstream server temporarily disabled while connecting to upstream, client: 10.48.90.137, server: 0.0.0.0:443, upstream: "10.48.90.145:443", bytes from/to client:0/0, bytes from/to upstream:0/0

The corresponding curl output curl -vvv -k https://rancheradm.internal.de

* Uses proxy env variable no_proxy == '127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de,10.48.90.148,10.48.90.144,10.48.90.145,10.48.90.137'
*   Trying 10.48.90.137:443...
* TCP_NODELAY set
* Connected to rancheradm.internal.de (10.48.90.137) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to rancheradm.internal.de:443
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to rancheradm.internal.de:443

Pod details

Name:         rancher-75b8bc6df6-2wbkj
Namespace:    cattle-system
Priority:     0
Node:         lff-l3vsrv104.internal.de/10.48.90.144
Start Time:   Mon, 27 Sep 2021 15:45:19 +0200
Labels:       app=rancher
              pod-template-hash=75b8bc6df6
              release=rancher
Annotations:  cni.projectcalico.org/containerID: 2c2d0574bec90f05c111c2eaa17955bffb104ba574eebcfeb681f903fb7e42fe
              cni.projectcalico.org/podIP: 10.42.0.6/32
              cni.projectcalico.org/podIPs: 10.42.0.6/32
Status:       Running
IP:           10.42.0.6
IPs:
  IP:           10.42.0.6
Controlled By:  ReplicaSet/rancher-75b8bc6df6
Containers:
  rancher:
    Container ID:  docker://3ed451d3112a79db07b4c44ac099de165f5aba7a022d49518372cc0407e51599
    Image:         rancher/rancher:v2.6.0
    Image ID:      docker-pullable://rancher/rancher@sha256:8aef5ceaf816c6610de49debe78fbd36deda3a33733fac47f9096640fc20eb48
    Port:          80/TCP
    Host Port:     0/TCP
    Args:
      --http-listen-port=80
      --https-listen-port=443
      --add-local=true
    State:          Running
      Started:      Mon, 27 Sep 2021 15:47:42 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 27 Sep 2021 15:46:12 +0200
      Finished:     Mon, 27 Sep 2021 15:47:39 +0200
    Ready:          True
    Restart Count:  1
    Liveness:       http-get http://:80/healthz delay=60s timeout=1s period=30s #success=1 #failure=3
    Readiness:      http-get http://:80/healthz delay=5s timeout=1s period=30s #success=1 #failure=3
    Environment:
      CATTLE_NAMESPACE:           cattle-system
      CATTLE_PEER_SERVICE:        rancher
      HTTP_PROXY:                 http://www.proxy.internal.de:80
      HTTPS_PROXY:                http://www.proxy.internal.de:80
      NO_PROXY:                   127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de,harbor.internal.de
      CATTLE_BOOTSTRAP_PASSWORD:  <set to the key 'bootstrapPassword' in secret 'bootstrap-secret'>  Optional: false
    Mounts:
      /etc/rancher/ssl/cacerts.pem from tls-ca-volume (ro,path="cacerts.pem")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdftk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  tls-ca-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tls-ca
    Optional:    false
  kube-api-access-qdftk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 cattle.io/os=linux:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From     Message
  ----     ------            ----                  ----     -------
  Warning  DNSConfigForming  3m7s (x816 over 17h)  kubelet  Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de


Name:         rancher-75b8bc6df6-mnd67
Namespace:    cattle-system
Priority:     0
Node:         lff-l3vsrv103.internal.de/10.48.90.148
Start Time:   Mon, 27 Sep 2021 15:45:19 +0200
Labels:       app=rancher
              pod-template-hash=75b8bc6df6
              release=rancher
Annotations:  cni.projectcalico.org/containerID: b0357da88a47ee4b696ab2b96ee2f7403cd329e42a51c1d230468e2d8749bcfc
              cni.projectcalico.org/podIP: 10.42.1.12/32
              cni.projectcalico.org/podIPs: 10.42.1.12/32
Status:       Running
IP:           10.42.1.12
IPs:
  IP:           10.42.1.12
Controlled By:  ReplicaSet/rancher-75b8bc6df6
Containers:
  rancher:
    Container ID:  docker://239de0f73aa34891b734af45e4f472443d66b99963857c196c83065b5c82697b
    Image:         rancher/rancher:v2.6.0
    Image ID:      docker-pullable://rancher/rancher@sha256:8aef5ceaf816c6610de49debe78fbd36deda3a33733fac47f9096640fc20eb48
    Port:          80/TCP
    Host Port:     0/TCP
    Args:
      --http-listen-port=80
      --https-listen-port=443
      --add-local=true
    State:          Running
      Started:      Mon, 27 Sep 2021 15:47:41 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 27 Sep 2021 15:46:09 +0200
      Finished:     Mon, 27 Sep 2021 15:47:39 +0200
    Ready:          True
    Restart Count:  1
    Liveness:       http-get http://:80/healthz delay=60s timeout=1s period=30s #success=1 #failure=3
    Readiness:      http-get http://:80/healthz delay=5s timeout=1s period=30s #success=1 #failure=3
    Environment:
      CATTLE_NAMESPACE:           cattle-system
      CATTLE_PEER_SERVICE:        rancher
      HTTP_PROXY:                 http://www.proxy.internal.de:80
      HTTPS_PROXY:                http://www.proxy.internal.de:80
      NO_PROXY:                   127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de,harbor.internal.de
      CATTLE_BOOTSTRAP_PASSWORD:  <set to the key 'bootstrapPassword' in secret 'bootstrap-secret'>  Optional: false
    Mounts:
      /etc/rancher/ssl/cacerts.pem from tls-ca-volume (ro,path="cacerts.pem")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qz4p7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  tls-ca-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tls-ca
    Optional:    false
  kube-api-access-qz4p7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 cattle.io/os=linux:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                    From     Message
  ----     ------            ----                   ----     -------
  Warning  DNSConfigForming  2m21s (x817 over 17h)  kubelet  Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de


Name:         rancher-75b8bc6df6-mwnsd
Namespace:    cattle-system
Priority:     0
Node:         lff-l3vsrv105.internal.de/10.48.90.145
Start Time:   Mon, 27 Sep 2021 15:45:19 +0200
Labels:       app=rancher
              pod-template-hash=75b8bc6df6
              release=rancher
Annotations:  cni.projectcalico.org/containerID: 85637c9cc56f34b0224d3d5b9feb7f36d4b1e8f767c6267a064048ab317ad223
              cni.projectcalico.org/podIP: 10.42.2.5/32
              cni.projectcalico.org/podIPs: 10.42.2.5/32
Status:       Running
IP:           10.42.2.5
IPs:
  IP:           10.42.2.5
Controlled By:  ReplicaSet/rancher-75b8bc6df6
Containers:
  rancher:
    Container ID:  docker://e66c018dac29eb5ccb6d0a4fecd20f4a9ddbce683cf8ce3d6e8c8d0f3444a2b6
    Image:         rancher/rancher:v2.6.0
    Image ID:      docker-pullable://rancher/rancher@sha256:8aef5ceaf816c6610de49debe78fbd36deda3a33733fac47f9096640fc20eb48
    Port:          80/TCP
    Host Port:     0/TCP
    Args:
      --http-listen-port=80
      --https-listen-port=443
      --add-local=true
    State:          Running
      Started:      Mon, 27 Sep 2021 15:46:07 +0200
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:80/healthz delay=60s timeout=1s period=30s #success=1 #failure=3
    Readiness:      http-get http://:80/healthz delay=5s timeout=1s period=30s #success=1 #failure=3
    Environment:
      CATTLE_NAMESPACE:           cattle-system
      CATTLE_PEER_SERVICE:        rancher
      HTTP_PROXY:                 http://www.proxy.internal.de:80
      HTTPS_PROXY:                http://www.proxy.internal.de:80
      NO_PROXY:                   127.0.0.1,localhost,0.0.0.0,10.0.0.0/8,cattle-system.svc,.svc,.cluster.local,.internal.de,harbor.internal.de
      CATTLE_BOOTSTRAP_PASSWORD:  <set to the key 'bootstrapPassword' in secret 'bootstrap-secret'>  Optional: false
    Mounts:
      /etc/rancher/ssl/cacerts.pem from tls-ca-volume (ro,path="cacerts.pem")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cxdbg (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  tls-ca-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tls-ca
    Optional:    false
  kube-api-access-cxdbg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 cattle.io/os=linux:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                    From     Message
  ----     ------            ----                   ----     -------
  Warning  DNSConfigForming  2m58s (x815 over 17h)  kubelet  Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de

Pod container logs

W0928 06:35:20.600904      34 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0928 06:35:35.121620      34 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0928 06:35:36.150802      34 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0928 06:40:59.606000      34 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0928 06:41:21.383041      34 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
W0928 06:41:49.357556      34 warnings.go:80] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
W0928 06:42:06.123413      34 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0928 06:43:16.152739      34 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0928 06:48:56.607135      34 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0928 06:49:13.359627      34 warnings.go:80] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
W0928 06:14:29.437039      34 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0928 06:17:03.559540      34 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0928 06:24:01.439016      34 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0928 06:26:05.562649      34 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0928 06:33:50.441171      34 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0928 06:34:42.564952      34 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0928 06:39:19.445064      34 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0928 06:39:47.567111      34 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0928 06:45:37.446959      34 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0928 06:46:16.569904      34 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
2021/09/27 23:47:17 [INFO] Retrieve data.json from local path /var/lib/rancher-data/driver-metadata/data.json
2021/09/27 23:47:19 [INFO] kontainerdriver azurekubernetesservice listening on address 127.0.0.1:40707
2021/09/27 23:47:19 [INFO] kontainerdriver amazonelasticcontainerservice listening on address 127.0.0.1:42787
2021/09/27 23:47:19 [INFO] kontainerdriver googlekubernetesengine listening on address 127.0.0.1:43263
2021/09/27 23:47:19 [INFO] kontainerdriver azurekubernetesservice stopped
2021/09/27 23:47:19 [INFO] dynamic schema for kontainerdriver azurekubernetesservice updating
2021/09/27 23:47:19 [INFO] kontainerdriver googlekubernetesengine stopped
2021/09/27 23:47:19 [INFO] dynamic schema for kontainerdriver googlekubernetesengine updating
2021/09/27 23:47:19 [INFO] kontainerdriver amazonelasticcontainerservice stopped
2021/09/27 23:47:19 [INFO] dynamic schema for kontainerdriver amazonelasticcontainerservice updating

Namespace Events

LAST SEEN   TYPE      REASON             OBJECT                                 MESSAGE
19m         Normal    Scheduled          pod/busybox                            Successfully assigned cattle-system/busybox to lff-l3vsrv103.internal.de
19m         Warning   DNSConfigForming   pod/busybox                            Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
19m         Normal    Pulled             pod/busybox                            Container image "busybox:1.28" already present on machine
19m         Normal    Created            pod/busybox                            Created container busybox
19m         Normal    Started            pod/busybox                            Started container busybox
4m6s        Warning   DNSConfigForming   pod/rancher-75b8bc6df6-2wbkj           Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
3m20s       Warning   DNSConfigForming   pod/rancher-75b8bc6df6-mnd67           Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
3m57s       Warning   DNSConfigForming   pod/rancher-75b8bc6df6-mwnsd           Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de
3m47s       Warning   DNSConfigForming   pod/rancher-webhook-7f84b74ddb-rb44l   Search Line limits were exceeded, some search paths have been omitted, the applied search line is: cattle-system.svc.cluster.local svc.cluster.local cluster.local internal.de

Ingress log

I0927 13:45:22.684670       7 controller.go:148] "Configuration changes detected, backend reload required"
I0927 13:45:22.764174       7 controller.go:165] "Backend successfully reloaded"
I0927 13:45:22.765007       7 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"nginx-ingress-controller-2kbjg", UID:"c6f2dedc-2731-4bfe-9a56-c4ca21e32bd6", APIVersion:"v1", ResourceVersion:"999", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0927 13:45:49.262185       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0927 13:45:49.262666       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"bb16da2b-76d7-440d-a5a6-90c60e6fae8e", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"13786", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W0927 13:46:08.407951       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
W0927 13:46:11.761520       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
W0927 13:46:15.077548       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0927 13:49:04.862086       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"bb16da2b-76d7-440d-a5a6-90c60e6fae8e", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"15650", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0927 13:49:08.712227       7 store.go:453] "secret was updated and it is used in ingress annotations. Parsing" secret="cattle-system/tls-rancher-ingress"
I0927 13:45:22.769840       7 controller.go:165] "Backend successfully reloaded"
I0927 13:45:22.771926       7 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"nginx-ingress-controller-gjkkl", UID:"18db3139-7566-4c9c-b296-caa1fbdfb5aa", APIVersion:"v1", ResourceVersion:"1006", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0927 13:45:49.261375       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0927 13:45:49.261709       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"bb16da2b-76d7-440d-a5a6-90c60e6fae8e", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"13786", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W0927 13:46:08.409307       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
W0927 13:46:11.755570       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
W0927 13:46:15.076064       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0927 13:49:04.706013       7 main.go:112] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I0927 13:49:04.891157       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"bb16da2b-76d7-440d-a5a6-90c60e6fae8e", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"15650", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0927 13:49:08.717546       7 store.go:453] "secret was updated and it is used in ingress annotations. Parsing" secret="cattle-system/tls-rancher-ingress"
I0927 13:45:22.733604       7 controller.go:165] "Backend successfully reloaded"
I0927 13:45:22.739540       7 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"nginx-ingress-controller-k9b2z", UID:"e091a9e5-b6c9-4059-8573-827c52a11ce8", APIVersion:"v1", ResourceVersion:"996", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
I0927 13:45:49.236966       7 status.go:284] "updating Ingress status" namespace="cattle-system" ingress="rancher" currentValue=[] newValue=[{IP: Hostname:lff-l3vsrv103.internal.de Ports:[]} {IP: Hostname:lff-l3vsrv104.bfd.bayern.de Ports:[]} {IP: Hostname:lff-l3vsrv105.bfd.bayern.de Ports:[]}]
W0927 13:45:49.259294       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0927 13:45:49.259769       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"bb16da2b-76d7-440d-a5a6-90c60e6fae8e", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"13786", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W0927 13:46:08.406243       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
W0927 13:46:11.739653       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
W0927 13:46:15.073253       7 controller.go:992] Service "cattle-system/rancher" does not have any active Endpoint.
I0927 13:49:04.838886       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"bb16da2b-76d7-440d-a5a6-90c60e6fae8e", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"15650", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0927 13:49:08.635770       7 store.go:453] "secret was updated and it is used in ingress annotations. Parsing" secret="cattle-system/tls-rancher-ingress"

kubectl -n cattle-system describe ingress

Name:             rancher
Namespace:        cattle-system
Address:          lff-l3vsrv103.internal.de,lff-l3vsrv104.internal.de,lff-l3vsrv105.internal.de
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
TLS:
  tls-rancher-ingress terminates rancheradm.internal.de
Rules:
  Host                    Path  Backends
  ----                    ----  --------
  rancheradm.lff.bybn.de
                             rancher:80 (10.42.0.6:80,10.42.1.12:80,10.42.2.5:80)
Annotations:              field.cattle.io/publicEndpoints:
                            [{"addresses":["","",""],"port":443,"protocol":"HTTPS","serviceName":"cattle-system:rancher","ingressName":"cattle-system:rancher","hostna...
                          meta.helm.sh/release-name: rancher
                          meta.helm.sh/release-namespace: cattle-system
                          nginx.ingress.kubernetes.io/proxy-connect-timeout: 30
                          nginx.ingress.kubernetes.io/proxy-read-timeout: 1800
                          nginx.ingress.kubernetes.io/proxy-send-timeout: 1800
Events:                   <none>

The only thing I can see is that there is still some DNS issue, according to the DNSConfigForming warnings in the namespace events. And the cattle-system/rancher messages in the ingress logs also look strange to me.
Is there anything I missed?
Performing the DNS tests mentioned above works fine now, so I don't really have any idea where to continue from here.
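For reference, whether the rancher service actually has active endpoints behind the ingress can be checked with something like (a sketch):

kubectl -n cattle-system get endpoints rancher
kubectl -n cattle-system get pods -l app=rancher -o wide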

@superseb
Copy link
Contributor

The path is client -> load balancer -> host's ingress controller -> pod.

As before, you can bypass the load balancer to rule that out and connect to the host's ingress controller directly. Tailing the ingress controller log on the host you are connecting to should give you info on what happens when the connection is made. You can also raise the logging verbosity of the ingress controller. The first step is to see if you can properly connect to the host's ingress controller; if that works, you can check whether the ingress controller can connect to any of the Rancher pods. I assume they are all active and ready.
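For example, connecting to one node's ingress controller directly while keeping the original hostname could look like this (a sketch; substitute the node IP you want to test and the ingress pod running on that node):

curl -vk --resolve rancheradm.internal.de:443:10.48.90.148 https://rancheradm.internal.de
kubectl -n ingress-nginx get pods -o wide
kubectl -n ingress-nginx logs -f <nginx-ingress-controller-pod-on-that-node>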

Because of the addition of the admission webhook in the new ingress-nginx controller, the network mode was changed from hostNetwork to hostPort. This might also cause some problems (although it shouldn't, especially with the firewall fully disabled), but a last resort would be to force the ingress controller back to the old mode (although this exposes the admission webhook to the outside, which is why we changed it to hostPort):

ingress:
  provider: nginx
  network_mode: hostNetwork
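If you go that route, the snippet above goes into the ingress section of cluster.yml and is applied with another rke up run, for example:

rke up --config cluster.yml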

@kagehisa
Copy link
Author

So after checking the ingress connection from different angles without a positive result, I redeployed the cluster with RKE, and after that it worked with the firewall disabled. I am able to connect to the previously deployed Rancher server.

After using the fix described in this comment, I am now able to use Rancher together with firewalld on all nodes.
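For anyone else hitting this, the kind of firewalld adjustment involved looks roughly like the following (a sketch on my side, assuming the default canal pod network 10.42.0.0/16 and service network 10.43.0.0/16; see the linked comment for the exact steps):

sudo firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16
sudo firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16
sudo firewall-cmd --permanent --add-masquerade
sudo firewall-cmd --reload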

Thank you very much for your help and patience.

From my side this issue can be closed. It seems that it was a problem between calico/canal and firewalld.

@superseb
Copy link
Contributor

Thanks for letting us know. @samstride Please file a new issue with the requested info if you are still experiencing issues.

@samstride
Copy link

samstride commented Sep 29, 2021

@superseb , setting network_mode: hostNetwork resolved it for me. Thanks.

@immanuelfodor
Copy link

although this exposes the admission webhook to the outside which is why we changed it to hostPort

What is the attack vector here? How can it be exploited? Should I worry if I change it back to the previous setting?

@wzrdtales
Copy link

That is a breaking change and wasn't announced properly. Also, hostPort does not seem to be configured correctly: it only forwards 127.0.0.1 and internal addresses, but not anything else.
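For what it's worth, one way to see the difference between the two modes on a node is to compare what is actually listening with what is only redirected by the CNI portmap plugin (a sketch; output will vary per setup):

# hostNetwork: the ingress controller itself listens on the node
sudo ss -ltnp | grep -E ':(80|443) '
# hostPort: no listening socket; traffic is redirected via nat table rules
sudo iptables -t nat -L CNI-HOSTPORT-DNAT -n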
