VIP access during cluster upgrade #7859

Closed
WinterNis opened this issue Oct 16, 2023 · 1 comment · Fixed by #7865

Bug Report

K8s upgrade fails due to failing VIP access.

Description

During a k8s version upgrade, VIP access fails while the node that currently holds the VIP is upgrading.
The apiserver itself upgrades correctly, but the upgrade command fails with a network error.

Relaunching the upgrade command continues the upgrade further (since the apiserver did upgrade correctly), but it then fails again when upgrading the kubelet of the node holding the VIP.

Talos should probably move the VIP before trying to upgrade the components on the node, or wait a bit for the VIP to move instead of failing immediately.
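
As a rough illustration of the "wait for the VIP to move" idea, a sketch like the one below (not Talos code; the address and timings are made up for this example) could poll the VIP's apiserver port and only give up after a deadline instead of aborting on the first connection error:

    // Minimal sketch: wait for the VIP to answer again before continuing an
    // upgrade step. Address and timeouts are hypothetical, for illustration only.
    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // waitForVIP polls the VIP's apiserver port until it accepts TCP
    // connections again or the deadline expires.
    func waitForVIP(addr string, timeout time.Duration) error {
        deadline := time.Now().Add(timeout)
        for {
            conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
            if err == nil {
                conn.Close()
                return nil // the VIP answers again, safe to continue
            }
            if time.Now().After(deadline) {
                return fmt.Errorf("VIP %s still unreachable: %w", addr, err)
            }
            time.Sleep(2 * time.Second)
        }
    }

    func main() {
        if err := waitForVIP("192.168.70.10:6443", 2*time.Minute); err != nil {
            fmt.Println(err)
        }
    }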

Notes:

  • Control plane node IPs are 192.168.70.101, 192.168.70.102, 192.168.70.103
  • Configured VIP is 192.168.70.10

Talos endpoints configuration in .talos/config:

    talos-cluster:
        endpoints:
            - 192.168.70.101
            - 192.168.70.102
            - 192.168.70.103

kubeconfig endpoint:

    server: https://192.168.70.10:6443

Logs

First upgrade attempt:

talosctl --nodes 192.168.70.101 upgrade-k8s --to 1.27.6
automatically detected the lowest Kubernetes version 1.26.9
discovered controlplane nodes ["192.168.70.103" "192.168.70.101" "192.168.70.102"]
discovered worker nodes ["192.168.70.201" "192.168.70.202"]
checking for removed Kubernetes component flags
checking for removed Kubernetes API resource versions
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.103": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.101": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.102": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.201": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.202": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
updating "kube-apiserver" to version "1.27.6"
 > "192.168.70.103": starting update
 > update kube-apiserver: v1.26.9 -> 1.27.6
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kube-apiserver pod update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > update kube-apiserver: v1.26.9 -> 1.27.6
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kube-apiserver pod update
failed updating service "kube-apiserver": error updating node "192.168.70.101": 2 error(s) occurred:
        config version mismatch: got "1", expected "2"
        Get "https://192.168.70.10:6443/api/v1/namespaces/kube-system/pods?labelSelector=k8s-app+%3D+kube-apiserver": dial tcp 192.168.70.10:6443: connectex: No connection could be made because the target machine actively refused it.

Second attempt:

talosctl --nodes 192.168.70.101 upgrade-k8s --to 1.27.6
automatically detected the lowest Kubernetes version 1.26.9
discovered controlplane nodes ["192.168.70.103" "192.168.70.101" "192.168.70.102"]
discovered worker nodes ["192.168.70.201" "192.168.70.202"]
checking for removed Kubernetes component flags
checking for removed Kubernetes API resource versions
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-apiserver:v1.27.6
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-controller-manager:v1.27.6
 > "192.168.70.103": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.101": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.102": pre-pulling registry.k8s.io/kube-scheduler:v1.27.6
 > "192.168.70.103": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.101": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.102": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.201": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
 > "192.168.70.202": pre-pulling ghcr.io/siderolabs/kubelet:v1.27.6
updating "kube-apiserver" to version "1.27.6"
 > "192.168.70.103": starting update
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kube-apiserver pod update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kube-apiserver pod update
 < "192.168.70.101": successfully updated
 > "192.168.70.102": starting update
 > update kube-apiserver: v1.26.9 -> 1.27.6
 > "192.168.70.102": machine configuration patched
 > "192.168.70.102": waiting for kube-apiserver pod update
 < "192.168.70.102": successfully updated
updating "kube-controller-manager" to version "1.27.6"
 > "192.168.70.103": starting update
 > update kube-controller-manager: v1.26.9 -> 1.27.6
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kube-controller-manager pod update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > update kube-controller-manager: v1.26.9 -> 1.27.6
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kube-controller-manager pod update
 < "192.168.70.101": successfully updated
 > "192.168.70.102": starting update
 > update kube-controller-manager: v1.26.9 -> 1.27.6
 > "192.168.70.102": machine configuration patched
 > "192.168.70.102": waiting for kube-controller-manager pod update
 < "192.168.70.102": successfully updated
updating "kube-scheduler" to version "1.27.6"
 > "192.168.70.103": starting update
 > update kube-scheduler: v1.26.9 -> 1.27.6
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kube-scheduler pod update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > update kube-scheduler: v1.26.9 -> 1.27.6
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kube-scheduler pod update
 < "192.168.70.101": successfully updated
 > "192.168.70.102": starting update
 > update kube-scheduler: v1.26.9 -> 1.27.6
 > "192.168.70.102": machine configuration patched
 > "192.168.70.102": waiting for kube-scheduler pod update
 < "192.168.70.102": successfully updated
updating kube-proxy to version "1.27.6"
 > "192.168.70.103": starting update
 > "192.168.70.101": starting update
 > "192.168.70.102": starting update
updating kubelet to version "1.27.6"
 > "192.168.70.103": starting update
 > update kubelet: 1.26.9 -> 1.27.6
 > "192.168.70.103": machine configuration patched
 > "192.168.70.103": waiting for kubelet restart
 > "192.168.70.103": waiting for node update
 < "192.168.70.103": successfully updated
 > "192.168.70.101": starting update
 > update kubelet: 1.26.9 -> 1.27.6
 > "192.168.70.101": machine configuration patched
 > "192.168.70.101": waiting for kubelet restart
 > "192.168.70.101": waiting for node update
failed upgrading kubelet: error updating node "192.168.70.101": 1 error(s) occurred:
        Get "https://192.168.70.10:6443/api/v1/nodes": read tcp 192.168.70.1:56646->192.168.70.10:6443: wsarecv: An existing connection was forcibly closed by the remote host.

Environment

Client:
        Tag:         v1.5.3
        SHA:         cb21c671
        Built:
        Go version:  go1.20.8
        OS/Arch:     windows/amd64
Server:
        NODE:        192.168.70.101
        Tag:         v1.5.3
        SHA:         cb21c671
        Built:
        Go version:  go1.20.8
        OS/Arch:     linux/amd64
        Enabled:     RBAC

- Kubernetes version (`kubectl version --short`):

Client Version: v1.27.3
Kustomize Version: v5.0.1
Server Version: v1.26.9

- Platform:
Host: Windows 10 Pro
Talos running in VMs managed by Hyper-V
smira added a commit to smira/go-kubernetes that referenced this issue Oct 16, 2023
See siderolabs/talos#7859

When running e.g. `talosctl` on Windows, we should retry Windows
connection errors as well.

Also drop dependency on hashicorp/go-version in favor of semver package.

Signed-off-by: Andrey Smirnov <[email protected]>
@smira self-assigned this Oct 16, 2023
@WinterNis (Author) commented:

After a discussion with @smira on Slack, it turns out the issue has nothing to do with the upgrade workflow or the VIP.

We are running talosctl from Windows, and Windows-specific client errors are not retryable.
This explains why the upgrade fails on the first error, when we would have expected the CLI to wait and retry.

Seems like it will be fixed in siderolabs/go-kubernetes#11.
Thanks for the fix!
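
For context, the gist of the fix (a sketch of the idea only, not the actual siderolabs/go-kubernetes#11 change): the client already retries Unix-style connection errors, but the Windows-specific wording seen in the logs above ("connectex: ...", "wsarecv: ...") was not recognized as retryable, so the first transient error aborted the upgrade. Something along these lines, where the exact strings and helper are assumptions for illustration:

    // Illustrative only: treat both Unix-style and Windows-style connection
    // failures as retryable. Not the real go-kubernetes implementation.
    package retry

    import (
        "errors"
        "net"
        "strings"
    )

    // IsRetryableNetworkError reports whether err looks like a transient
    // connection failure, including the Windows wording from the logs above.
    func IsRetryableNetworkError(err error) bool {
        if err == nil {
            return false
        }

        var netErr net.Error
        if errors.As(err, &netErr) && netErr.Timeout() {
            return true
        }

        msg := err.Error()
        for _, fragment := range []string{
            "connection refused",       // Linux/macOS dial failure
            "connection reset by peer", // Linux/macOS read failure
            "connectex: No connection could be made",              // Windows dial failure
            "wsarecv: An existing connection was forcibly closed", // Windows read failure
        } {
            if strings.Contains(msg, fragment) {
                return true
            }
        }
        return false
    }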

smira added a commit to smira/talos that referenced this issue Oct 17, 2023
Containerd 1.7.7, Linux 6.1.58.

Fixes siderolabs#7859

Signed-off-by: Andrey Smirnov <[email protected]>