
[Release 1.30] Windows workloads cannot be deleted post upgrade, stuck in Terminating #6534

Closed
rbrtbnfgl opened this issue Aug 12, 2024 · 6 comments

@rbrtbnfgl
Contributor

backport for #5551

@mdrahman-suse
Contributor

mdrahman-suse commented Aug 20, 2024

@rbrtbnfgl I am still seeing the issue with v1.30.4-rc1+rke2r1

Test Steps:

  • Installed rke2 v1.30.3+rke2r1 on 1 Linux server, 1 Linux agent and 1 Windows agent node
  • Ensured the cluster was up
  • Deployed a Windows workload
  • Exec'd into the Windows pod and performed nslookup
  • Observed that the cluster DNS virtual IP (10.43.0.10) was returned
  • Upgraded all the nodes to v1.30.4-rc1+rke2r1 manually (see the sketch after this list)
  • Ensured the cluster was up
  • Exec'd into the Windows pod and performed nslookup
  • Observed that the cluster DNS virtual IP was no longer returned
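
The exact upgrade commands aren't included here; for reference, a manual in-place upgrade of the Linux nodes typically looks roughly like the following (a sketch assuming the standard install script; use the rke2-agent service on agent nodes):

$ curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION="v1.30.4-rc1+rke2r1" sh -   # re-run the installer pinned to the target version
$ sudo systemctl restart rke2-server                                                    # restart so the new binary takes over

On the Windows agent, the rke2.exe binary is replaced and the rke2 service restarted (Restart-Service -Name rke2), as shown in the validation steps further down.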

Before upgrade

$ rke2 -v
rke2 version v1.30.3+rke2r1 (e742dc53b463d205773cfb25633309671ce6777c)
go version go1.22.5 X:boringcrypto

$ kubectl get nodes
NAME                                        STATUS   ROLES                       AGE     VERSION
ip-xxx-xx-2-1.us-east-2.compute.internal    Ready    control-plane,etcd,master   26m   v1.30.3+rke2r1
ip-xxx-xx-3-23.us-east-2.compute.internal   Ready    <none>                      24m   v1.30.3+rke2r1
ip-ac1f02eb                                 Ready    <none>                      18m   v1.30.3

$ kubectl get pods -A | grep win
default           pod/win-webserver-6678868fb5-c5bqh                                      1/1     Running     0              19m
default           pod/win-webserver-6678868fb5-n5f6w                                      1/1     Running     0              19m

$ kubectl exec -it pod/win-webserver-6678868fb5-c5bqh -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10
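
(Side note, not from the report.) The 10.43.0.10 address here is the ClusterIP of the rke2-coredns service, i.e. rke2's default cluster DNS; it can be cross-checked with:

$ kubectl -n kube-system get svc rke2-coredns-rke2-coredns   # the CLUSTER-IP column should show 10.43.0.10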

After upgrade

$ rke2 -v
rke2 version v1.30.4-rc1+rke2r1 (ba87c79317a1550fac473208e0b23a3d5fd2eeaf)
go version go1.22.5 X:boringcrypto

$ kgn    # presumably a shell alias for "kubectl get nodes"
NAME                                        STATUS   ROLES                       AGE     VERSION
ip-xxx-xx-2-1.us-east-2.compute.internal    Ready    control-plane,etcd,master   1h20m   v1.30.4+rke2r1
ip-xxx-xx-3-23.us-east-2.compute.internal   Ready    <none>                      1h18m  v1.30.4+rke2r1
ip-ac1f02eb                                 Ready    <none>                      1h13m   v1.30.4

$ kgp | grep win    # kgp presumably aliases "kubectl get pods -A"
default           win-webserver-6678868fb5-c5bqh                                      1/1     Running     0             53m
default           win-webserver-6678868fb5-n5f6w                                      1/1     Running     0             53m


$ kubectl exec -it pod/win-webserver-6678868fb5-c5bqh -- powershell.exe
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

PS C:\> nslookup
*** Default servers are not available
Default Server:  UnKnown
Address:  127.0.0.1
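
(Added check, not part of the original report.) The DNS server the pod is actually left with can be inspected directly, e.g.:

$ kubectl exec -it pod/win-webserver-6678868fb5-c5bqh -- powershell.exe ipconfig /all   # shows the adapter's configured DNS servers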

Please advise

@mdrahman-suse
Contributor

@rbrtbnfgl So it looks like the fix works with the flannel CNI but not with the Calico CNI
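
For reference, the CNI is selected per cluster in /etc/rancher/rke2/config.yaml; the two setups compared here presumably correspond to something like this (illustrative snippet, not copied from the environment):

cni: calico    # Calico setup
# vs.
cni: flannel   # flannel setup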

With calico:

#6534 (comment)

With flannel:

Before upgrade

$ rke2 -v
rke2 version v1.30.3+rke2r1 (e742dc53b463d205773cfb25633309671ce6777c)
go version go1.22.5 X:boringcrypto

$ kgn
NAME                                          STATUS   ROLES                       AGE   VERSION
ip-xxx-xx-13-156.us-east-2.compute.internal   Ready    <none>                      30m   v1.30.3+rke2r1
ip-xxx-xx-3-69.us-east-2.compute.internal     Ready    control-plane,etcd,master   32m   v1.30.3+rke2r1
ip-ac1f2610                                   Ready    <none>                      28m   v1.30.3

$ kgp | grep win
default       win-webserver-6778785459-s8587                                       1/1     Running     0          12m

$ kubectl exec -it pod/win-webserver-6778785459-s8587 -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10

After upgrade

$ rke2 -v
rke2 version v1.30.4-rc1+rke2r1 (ba87c79317a1550fac473208e0b23a3d5fd2eeaf)
go version go1.22.5 X:boringcrypto

$ kgn
NAME                                          STATUS   ROLES                       AGE   VERSION
ip-xxx-xx-13-156.us-east-2.compute.internal   Ready    <none>                      41m   v1.30.4+rke2r1
ip-xxx-xx-3-69.us-east-2.compute.internal     Ready    control-plane,etcd,master   43m   v1.30.4+rke2r1
ip-ac1f2610                                   Ready    <none>                      39m   v1.30.4

$ kgp | grep win
default       win-webserver-6778785459-s8587                                       1/1     Running            0               23m

$ kubectl exec -it pod/win-webserver-6778785459-s8587 -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10

BUT

There is an issue with the upgrade that causes the flannel and rke2-ingress-nginx controller pods to go into an Error / CrashLoopBackOff state

ubuntu@ip-172-31-3-69:~$ kgp
NAMESPACE     NAME                                                                 READY   STATUS             RESTARTS        AGE
default       win-webserver-6778785459-s8587                                       1/1     Running            0               25m
kube-system   cloud-controller-manager-ip-172-31-3-69.us-east-2.compute.internal   1/1     Running            0               10m
kube-system   etcd-ip-172-31-3-69.us-east-2.compute.internal                       1/1     Running            0               44m
kube-system   helm-install-rke2-coredns-gv99b                                      0/1     Completed          0               9m53s
kube-system   helm-install-rke2-flannel-246kd                                      0/1     Completed          0               9m53s
kube-system   helm-install-rke2-ingress-nginx-wqzqx                                0/1     Completed          0               9m53s
kube-system   helm-install-rke2-metrics-server-2hnj2                               0/1     Completed          0               9m53s
kube-system   helm-install-rke2-snapshot-controller-crd-hdcqs                      0/1     Completed          0               9m53s
kube-system   helm-install-rke2-snapshot-controller-hdt2g                          0/1     Completed          0               9m53s
kube-system   helm-install-rke2-snapshot-validation-webhook-7lb8r                  0/1     Completed          0               9m53s
kube-system   kube-apiserver-ip-172-31-3-69.us-east-2.compute.internal             1/1     Running            0               10m
kube-system   kube-controller-manager-ip-172-31-3-69.us-east-2.compute.internal    1/1     Running            1 (10m ago)     10m
kube-system   kube-flannel-ds-sfc6z                                                0/1     CrashLoopBackOff   6 (3m30s ago)   9m41s
kube-system   kube-flannel-ds-zdkvs                                                0/1     CrashLoopBackOff   8 (2m33s ago)   9m19s
kube-system   kube-proxy-ip-172-31-13-156.us-east-2.compute.internal               1/1     Running            0               8m10s
kube-system   kube-proxy-ip-172-31-3-69.us-east-2.compute.internal                 1/1     Running            0               10m
kube-system   kube-scheduler-ip-172-31-3-69.us-east-2.compute.internal             1/1     Running            0               10m
kube-system   rke2-coredns-rke2-coredns-64dcf4f58b-vlt2p                           1/1     Running            0               43m
kube-system   rke2-coredns-rke2-coredns-6bb85f9dd8-22hlc                           0/1     Pending            0               9m42s
kube-system   rke2-coredns-rke2-coredns-6bb85f9dd8-qhvbt                           0/1     Running            0               9m41s
kube-system   rke2-coredns-rke2-coredns-autoscaler-7b9c797d64-glgnm                1/1     Running            0               9m42s
kube-system   rke2-ingress-nginx-controller-bmfnz                                  0/1     CrashLoopBackOff   6 (21s ago)     8m2s
kube-system   rke2-ingress-nginx-controller-thm6d                                  0/1     CrashLoopBackOff   6 (35s ago)     9m15s
kube-system   rke2-metrics-server-868fc8795f-tfhxv                                 1/1     Running            0               43m
kube-system   rke2-snapshot-controller-7dcf5d5b46-sqckq                            1/1     Running            1 (10m ago)     43m
kube-system   rke2-snapshot-validation-webhook-bf7bbd6fc-7lrlr                     1/1     Running            0               44m
$ k logs -n kube-system pod/kube-flannel-ds-sfc6z
Defaulted container "kube-flannel" out of: kube-flannel, install-cni-plugins (init), install-cni (init)
I0821 01:30:47.986053       1 main.go:211] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: version:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true ifaceCanReach: subnetFile:/run/flannel/subnet.env publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W0821 01:30:47.986258       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0821 01:30:48.009934       1 kube.go:139] Waiting 10m0s for node controller to sync
I0821 01:30:48.010052       1 kube.go:469] Starting kube subnet manager
I0821 01:30:48.016282       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.1.0/24]
I0821 01:30:48.016347       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.0.0/24]
I0821 01:30:48.016359       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.2.0/24]
I0821 01:30:49.011027       1 kube.go:146] Node controller sync successful
I0821 01:30:49.011104       1 main.go:231] Created subnet manager: Kubernetes Subnet Manager - ip-172-31-3-69.us-east-2.compute.internal
I0821 01:30:49.011116       1 main.go:234] Installing signal handlers
I0821 01:30:49.011330       1 main.go:452] Found network config - Backend type: vxlan
...
I0821 01:30:49.021925       1 nftables.go:47] Starting flannel in nftables mode...
E0821 01:30:49.022154       1 main.go:353] no nftables support: could not find nftables binary: exec: "nft": executable file not found in $PATH
I0821 01:30:49.022335       1 main.go:432] Stopping shutdownHandler...
$ k logs -n kube-system rke2-ingress-nginx-controller-bmfnz
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.10.1-hardened2
  Build:         git-6c2923297
  Repository:    https://github.com/rancher/ingress-nginx
  nginx version: nginx/1.25.3

-------------------------------------------------------------------------------

W0821 01:33:01.566466       8 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0821 01:33:01.574103       8 main.go:205] "Creating API client" host="https://10.43.0.1:443"
$ k describe pod -n kube-system rke2-ingress-nginx-controller-bmfnz
...
Events:
  Type     Reason             Age                   From               Message
  ----     ------             ----                  ----               -------
  Normal   Scheduled          10m                   default-scheduler  Successfully assigned kube-system/rke2-ingress-nginx-controller-bmfnz to ip-172-31-13-156.us-east-2.compute.internal
  Normal   Pulling            10m                   kubelet            Pulling image "rancher/nginx-ingress-controller:v1.10.1-hardened2"
  Normal   Pulled             10m                   kubelet            Successfully pulled image "rancher/nginx-ingress-controller:v1.10.1-hardened2" in 14.856s (14.856s including waiting). Image size: 293192075 bytes.
  Normal   Created            9m38s (x2 over 10m)   kubelet            Created container rke2-ingress-nginx-controller
  Normal   Started            9m38s (x2 over 10m)   kubelet            Started container rke2-ingress-nginx-controller
  Normal   Killing            9m38s                 kubelet            Container rke2-ingress-nginx-controller failed liveness probe, will be restarted
  Warning  FailedPreStopHook  9m38s                 kubelet            PreStopHook failed
  Normal   Pulled             9m38s                 kubelet            Container image "rancher/nginx-ingress-controller:v1.10.1-hardened2" already present on machine
  Warning  Unhealthy          5m38s (x39 over 10m)  kubelet            Readiness probe failed: Get "http://10.42.1.16:10254/healthz": dial tcp 10.42.1.16:10254: connect: connection refused
  Warning  BackOff            46s (x20 over 5m28s)  kubelet            Back-off restarting failed container rke2-ingress-nginx-controller in pod rke2-ingress-nginx-controller-bmfnz_kube-system(e688ab3f-d439-4e28-bd2e-02fc499cbb88)
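
(Hypothetical follow-up, not from the report.) Since the probe URL is visible in the events, one quick check from the Linux node hosting the pod is to hit it directly; a connection refused there confirms the controller process never came up, rather than a network-path problem:

$ curl -v http://10.42.1.16:10254/healthz   # pod IP and port taken from the readiness-probe event above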

@brandond
Member

I0821 01:30:49.021925       1 nftables.go:47] Starting flannel in nftables mode...
E0821 01:30:49.022154       1 main.go:353] no nftables support: could not find nftables binary: exec: "nft": executable file not found in $PATH

This appears to be on the Linux node, not the Windows node? And not related to the Windows changes as far as I can tell...

@rbrtbnfgl
Contributor Author

rbrtbnfgl commented Aug 21, 2024

The issue is fixed if you upgrade from a fixed version to another fixed version. If you upgrade from a version without the fix, the issue will still occur and you need to restart the node to clear it. You can easily reproduce the issue by restarting the service, which is similar to what happens when you upgrade the node (see the sketch below). This fix is only Calico-related; if there are any issues with flannel, it would be better to open a new issue.
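
A minimal reproduction along those lines, using the same commands as in the validation further down (pod name is a placeholder): restart the rke2 service on the Windows agent, then re-check DNS from inside the Windows pod.

PS C:\Users\Administrator> Restart-Service -Name rke2

$ kubectl exec -it pod/<windows-pod> -- powershell.exe nslookup   # on an unfixed version this now reports "Default servers are not available"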

@rbrtbnfgl
Contributor Author

I created a new issue for this flannel bug #6601

@mdrahman-suse
Contributor

Validated with RC v1.30.4-rc1+rke2r1

Environment / Config

  • Ubuntu 22.04 (server and worker nodes)
  • Windows Server 2022 Datacenter
  • 1 Linux server, 1 Linux worker, 1 Windows worker
  • Windows workload winapp.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      containers:
        - name: windowswebserver
          image: mcr.microsoft.com/windows/servercore:ltsc2022
          command:
            - powershell.exe
            - -command
            - "<#code used from https://gist.github.com/19WAS85/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ;  ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$ip=(Get-NetAdapter | Get-NetIpAddress); $$header='<html><body><H1>Windows Container Web Server</H1>' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='<p>IP {0} callerCount {1} ' -f $$ip[1].IPAddress,$$callerCounts.Item($$_) } ;$$footer='</body></html>' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus)  } ; "
      nodeSelector:
        kubernetes.io/os: windows
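
The workload can be deployed and waited on in the usual way, e.g. (illustrative commands, not part of the report; the deployment lands in the default namespace):

$ kubectl apply -f winapp.yaml
$ kubectl rollout status deployment/win-webserver --timeout=15m   # the Windows image pull can take ~12m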

Testing

  • Create an rke2 cluster with the setup mentioned
  • Ensure the cluster comes up
  • Deploy the Windows workload
  • Wait for the workload to be Running (~12m)
  • Exec into the workload and run nslookup
  • Log in to the Windows worker node and restart the rke2 service
  • Exec into the workload and run nslookup again
  • Validate that the responses before and after the restart match

Replication

$ rke2 -v
rke2 version v1.30.3+rke2r1 (e742dc53b463d205773cfb25633309671ce6777c)
go version go1.22.5 X:boringcrypto
  • Nodes are up
$ kgn
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-xxx-31-4-155.us-east-2.compute.internal   Ready    <none>                      37m   v1.30.3+rke2r1
ip-xxx-31-5-117.us-east-2.compute.internal   Ready    control-plane,etcd,master   39m   v1.30.3+rke2r1
ip-ac1f2610                                  Ready    <none>                      35m   v1.30.3
  • Windows pod is up
$ kgp | grep win
default           win-webserver-6778785459-csk4x                                        1/1     Running     0          10m
  • Exec into pod and nslookup (Before restarting rke2 in Windows)
$ kubectl exec -it pod/win-webserver-6778785459-csk4x -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10
  • Restart rke2 service in Windows
 C:\usr\local\bin\rke2.exe --version
rke2.exe version v1.30.3+rke2r1 (e742dc53b463d205773cfb25633309671ce6777c)
go version go1.22.5

PS C:\Users\Administrator> Get-Service -Name rke2

Status   Name               DisplayName
------   ----               -----------
Running  rke2               rke2


PS C:\Users\Administrator> Restart-Service -Name rke2
  • Exec into pod and nslookup (After restarting rke2 in Windows)
$ kubectl exec -it pod/win-webserver-6778785459-csk4x -- powershell.exe nslookup
*** Default servers are not available
Default Server:  UnKnown
Address:  127.0.0.1

Validation

$ rke2 -v
rke2 version v1.30.4-rc1+rke2r1 (ba87c79317a1550fac473208e0b23a3d5fd2eeaf)
go version go1.22.5 X:boringcrypto
  • Nodes are up
$ kgn
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-xxx-xx-13-66.us-east-2.compute.internal   Ready    control-plane,etcd,master   43m   v1.30.4+rke2r1
ip-xxx-xx-8-164.us-east-2.compute.internal   Ready    <none>                      41m   v1.30.4+rke2r1
ip-ac1f2610                                  Ready    <none>                      38m   v1.30.4
  • Windows pod is up
$ kgp | grep win
default           win-webserver-6778785459-g4rrk                                        1/1     Running     0             13m
  • Exec into pod and nslookup (Before restarting rke2 in Windows)
$ kubectl exec -it pod/win-webserver-6778785459-g4rrk -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10
  • Restart rke2 service in Windows
C:\usr\local\bin\rke2.exe --version
rke2.exe version v1.30.4-rc1+rke2r1 (ba87c79317a1550fac473208e0b23a3d5fd2eeaf)
go version go1.22.5
PS C:\Users\Administrator> Get-Service -Name rke2

Status   Name               DisplayName
------   ----               -----------
Running  rke2               rke2


PS C:\Users\Administrator> Restart-Service -Name rke2
  • Exec into pod and nslookup (After restarting rke2 in Windows)
$ kubectl exec -it pod/win-webserver-6778785459-g4rrk -- powershell.exe nslookup
Default Server:  rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
Address:  10.43.0.10

NOTE: Does not work when upgrading from a version without this fix
