VIP access during cluster upgrade #7859

Closed
WinterNis opened this issue Oct 16, 2023 · 1 comment · Fixed by #7865

Bug Report

K8s upgrade fails due to failing VIP access.

Description

During a k8s version upgrade, VIP access fails while the node that currently holds the VIP is upgrading.
The apiserver itself upgrades correctly, but the upgrade command fails with a network error.

Relaunching the upgrade command continues the upgrade further (since the apiserver did upgrade correctly), but it then fails again when upgrading the kubelet of the node holding the VIP.

Talos should probably move the VIP before trying to upgrade the components on the node, or wait a bit for the VIP to move instead of failing immediately.
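
As a rough illustration of the "wait for the VIP to move" idea, a sketch like the one below (not Talos code; the address and timings are made up for this example) could poll the VIP's apiserver port and only give up after a deadline instead of aborting on the first connection error:

    // Minimal sketch: wait for the VIP to answer again before continuing an
    // upgrade step. Address and timeouts are hypothetical, for illustration only.
    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // waitForVIP polls the VIP's apiserver port until it accepts TCP
    // connections again or the deadline expires.
    func waitForVIP(addr string, timeout time.Duration) error {
        deadline := time.Now().Add(timeout)
        for {
            conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
            if err == nil {
                conn.Close()
                return nil // the VIP answers again, safe to continue
            }
            if time.Now().After(deadline) {
                return fmt.Errorf("VIP %s still unreachable: %w", addr, err)
            }
            time.Sleep(2 * time.Second)
        }
    }

    func main() {
        if err := waitForVIP("192.168.70.10:6443", 2*time.Minute); err != nil {
            fmt.Println(err)
        }
    }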

Notes:

  • Control plane node IPs are 192.168.70.101, 192.168.70.102, 192.168.70.103
  • Configured VIP is 192.168.70.10

Talos endpoints configuration in .talos/config:

    talos-cluster:
        endpoints:
            - 192.168.70.101
            - 192.168.70.102
            - 192.168.70.103

kubeconfig endpoint:

    server: https://192.168.70.10:6443

Logs

First upgrade attempt:

talosctl --nodes 192.168.70.101 upgrade-k8s --to 1.27.6
automatically detected the lowest Kubernetes version 1.26.9
discovered controlplane nodes ["192.168.70.103" "192.168.70.101" "192.168.70.102"]
discovered worker nodes ["192.168.70.201" "192.168.70.202"]
checking for removed Kubernetes component flags
checking for removed Kubernetes API resource versions
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.103": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.101": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.102": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.201": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.202": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
updating "kube-apiserver" to version "1.27.6"
 > "192.168.70.103": starting update
 > update kube-apiserver: v1.26.9 -> 1.27.6
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kube-apiserver pod update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > update kube-apiserver: v1.26.9 -> 1.27.6
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kube-apiserver pod update
failed updating service "kube-apiserver": error updating node "192.168.70.101": 2 error(s) occurred:
        config version mismatch: got "1", expected "2"
        Get "https://192.168.70.10:6443/api/v1/namespaces/kube-system/pods?labelSelector=k8s-app+%3D+kube-apiserver": dial tcp 192.168.70.10:6443: connectex: No connection could be made because the target machine actively refused it.

Second attempt:

talosctl --nodes 192.168.70.101 upgrade-k8s --to 1.27.6
automatically detected the lowest Kubernetes version 1.26.9
discovered controlplane nodes ["192.168.70.103" "192.168.70.101" "192.168.70.102"]
discovered worker nodes ["192.168.70.201" "192.168.70.202"]
checking for removed Kubernetes component flags
checking for removed Kubernetes API resource versions
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.103": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.101": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.102": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.201": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.202": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
updating "kube-apiserver" to version "1.27.6"
 > "192.168.70.103": starting update
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kube-apiserver pod update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kube-apiserver pod update
 < "192.168.70.101": successfully updated
 > "192.168.70.102": starting update
 > update kube-apiserver: v1.26.9 -> 1.27.6
 > "192.168.70.102": machine configuration patched
 > "192.168.70.102": waiting for kube-apiserver pod update
 < "192.168.70.102": successfully updated
updating "kube-controller-manager" to version "1.27.6"
 > "192.168.70.103": starting update
 > update kube-controller-manager: v1.26.9 -> 1.27.6
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kube-controller-manager pod update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > update kube-controller-manager: v1.26.9 -> 1.27.6
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kube-controller-manager pod update
 < "192.168.70.101": successfully updated
 > "192.168.70.102": starting update
 > update kube-controller-manager: v1.26.9 -> 1.27.6
 > "192.168.70.102": machine configuration patched
 > "192.168.70.102": waiting for kube-controller-manager pod update
 < "192.168.70.102": successfully updated
updating "kube-scheduler" to version "1.27.6"
 > "192.168.70.103": starting update
 > update kube-scheduler: v1.26.9 -> 1.27.6
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kube-scheduler pod update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > update kube-scheduler: v1.26.9 -> 1.27.6
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kube-scheduler pod update
 < "192.168.70.101": successfully updated
 > "192.168.70.102": starting update
 > update kube-scheduler: v1.26.9 -> 1.27.6
 > "192.168.70.102": machine configuration patched
 > "192.168.70.102": waiting for kube-scheduler pod update
 < "192.168.70.102": successfully updated
updating kube-proxy to version "1.27.6"
 > "192.168.70.103": starting update
 > "192.168.70.101": starting update
 > "192.168.70.102": starting update
updating kubelet to version "1.27.6"
 > "192.168.70.103": starting update
 > update kubelet: 1.26.9 -> 1.27.6
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kubelet restart
 > "192.168.70.103": waiting for node update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > update kubelet: 1.26.9 -> 1.27.6
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kubelet restart
 > "192.168.70.101": waiting for node update
failed upgrading kubelet: error updating node "192.168.70.101": 1 error(s) occurred:
        Get "https://192.168.70.10:6443/api/v1/nodes": read tcp 192.168.70.1:56646->192.168.70.10:6443: wsarecv: An existing connection was forcibly closed by the remote host.

Environment

Client:
        Tag:         v1.5.3
        SHA:         cb21c671
        Built:
        Go version:  go1.20.8
        OS/Arch:     windows/amd64
Server:
        NODE:        192.168.70.101
        Tag:         v1.5.3
        SHA:         cb21c671
        Built:
        Go version:  go1.20.8
        OS/Arch:     linux/amd64
        Enabled:     RBAC

- Kubernetes version (`kubectl version --short`):

Client Version: v1.27.3
Kustomize Version: v5.0.1
Server Version: v1.26.9

- Platform:
Host: Windows 10 Pro
Talos running in VMs managed by Hyper-V
smira added a commit to smira/go-kubernetes that referenced this issue Oct 16, 2023
See siderolabs/talos#7859

When running e.g. `talosctl` on Windows, we should retry Windows
connection errors as well.

Also drop dependency on hashicorp/go-version in favor of semver package.

Signed-off-by: Andrey Smirnov <[email protected]>
@smira self-assigned this Oct 16, 2023
@WinterNis (Author) commented:

After a discussion with @smira on Slack, it turns out the issue has nothing to do with the upgrade workflow or the VIP.

We are running talosctl from Windows, and Windows-specific client errors are not retryable.
This explains why the upgrade fails on the first error, when we would have expected the CLI to wait and retry.

Seems like it will be fixed in siderolabs/go-kubernetes#11.
Thanks for the fix!
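
For context, the gist of the fix (a sketch of the idea only, not the actual siderolabs/go-kubernetes#11 change): the client already retries Unix-style connection errors, but the Windows-specific wording seen in the logs above ("connectex: ...", "wsarecv: ...") was not recognized as retryable, so the first transient error aborted the upgrade. Something along these lines, where the exact strings and helper are assumptions for illustration:

    // Illustrative only: treat both Unix-style and Windows-style connection
    // failures as retryable. Not the real go-kubernetes implementation.
    package retry

    import (
        "errors"
        "net"
        "strings"
    )

    // IsRetryableNetworkError reports whether err looks like a transient
    // connection failure, including the Windows wording from the logs above.
    func IsRetryableNetworkError(err error) bool {
        if err == nil {
            return false
        }

        var netErr net.Error
        if errors.As(err, &netErr) && netErr.Timeout() {
            return true
        }

        msg := err.Error()
        for _, fragment := range []string{
            "connection refused",       // Linux/macOS dial failure
            "connection reset by peer", // Linux/macOS read failure
            "connectex: No connection could be made",              // Windows dial failure
            "wsarecv: An existing connection was forcibly closed", // Windows read failure
        } {
            if strings.Contains(msg, fragment) {
                return true
            }
        }
        return false
    }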

smira added a commit to smira/talos that referenced this issue Oct 17, 2023
Containerd 1.7.7, Linux 6.1.58.

Fixes siderolabs#7859

Signed-off-by: Andrey Smirnov <[email protected]>