talosctl upgrade --stage ... immediately reboots node #6150

Closed
magicite opened this issue Aug 24, 2022 · 3 comments · Fixed by #6184
@magicite

Bug Report

Description

The docs for Upgrading Talos Linux state:

In these cases, you can use the --stage flag. This puts the upgrade artifacts on disk, and adds some metadata to a disk partition that gets checked very early in the boot process. The node is not rebooted by the upgrade --stage process. However, whenever the system does next reboot, Talos sees that it needs to apply an upgrade, and will do so immediately.

However, when I use this option with Talos 1.1.1, the targeted node immediately reboots and then performs the install.

Logs

Target the node:

# talosctl upgrade --nodes 172.30.223.121 --stage --image ghcr.io/siderolabs/installer:v1.1.2
NODE             ACK                        STARTED
172.30.223.121   Upgrade request received   2022-08-24 14:08:31.152577842 -0500 CDT m=+5.464175834

Console of node:

[ 5522.601526] [talos] upgrade request received: preserve false, staged true, force false
[ 5522.696489] [talos] validating "ghcr.io/siderolabs/installer:v1.1.2"
[ 5527.969347] [talos] stageUpgrade sequence: 12 phase(s)
[ 5528.030976] [talos] phase cleanup (1/12): 1 tasks(s)
[ 5528.090542] [talos] task stopAllPods (1/1): starting
[ 5528.150083] [talos] task stopAllPods (1/1): waiting for kubelet lifecycle finalizers
[ 5528.243031] [talos] removing shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth0", "ip": "172.30.223.50"}
[ 5528.435812] [talos] removed address 172.30.223.50/32 from "eth0" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
[ 5528.629392] [talos] task stopAllPods (1/1): shutting down kubelet gracefully
[ 5554.014891] cni0: port 1(veth363fbf6c) entered disabled state
[ 5554.084134] device veth363fbf6c left promiscuous mode
[ 5554.144704] cni0: port 1(veth363fbf6c) entered disabled state
[ 5554.317896] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 2522, container kubelet)
[ 5554.613277] [talos] service[kubelet](Finished): Service finished successfully
[ 5554.704213] [talos] skipping pod kube-system/coredns-77c7b7d9b-67b96, state SANDBOX_NOTREADY
[ 5554.806589] [talos] skipping pod kube-system/kube-proxy-2cgp2, state SANDBOX_NOTREADY
[ 5554.900466] [talos] skipping pod kube-system/kube-flannel-285pz, state SANDBOX_NOTREADY
[ 5554.996428] [talos] skipping pod kube-system/kube-proxy-2cgp2, state SANDBOX_NOTREADY
[ 5555.090297] [talos] skipping pod kube-system/kube-flannel-285pz, state SANDBOX_NOTREADY
[ 5555.186263] [talos] skipping pod kube-system/kube-apiserver-talos-172-30-223-121, state SANDBOX_NOTREADY
[ 5555.299916] [talos] skipping pod kube-system/kube-scheduler-talos-172-30-223-121, state SANDBOX_NOTREADY
[ 5555.413544] [talos] skipping pod kube-system/kube-apiserver-talos-172-30-223-121, state SANDBOX_NOTREADY
[ 5555.527184] [talos] skipping pod kube-system/coredns-77c7b7d9b-67b96, state SANDBOX_NOTREADY
[ 5555.628340] [talos] skipping pod kube-system/kube-controller-manager-talos-172-30-223-121, state SANDBOX_NOTREADY
[ 5555.751410] [talos] task stopAllPods (1/1): done, 27.661532624s
[ 5555.822381] [talos] phase cleanup (1/12): done, 27.792073486s
[ 5555.891264] [talos] phase dbus (2/12): 1 tasks(s)
[ 5555.947680] [talos] task stopDBus (1/1): starting
[ 5556.004156] [talos] task stopDBus (1/1): done, 56.486275ms
[ 5556.069933] [talos] phase dbus (2/12): done, 178.681845ms
[ 5556.134646] [talos] phase leave (3/12): 1 tasks(s)
[ 5556.192117] [talos] task leaveEtcd (1/1): starting
[ 5556.281019] [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 2555, container etcd)
[ 5556.528920] [talos] service[etcd](Finished): Service finished successfully
[ 5556.611628] [talos] task leaveEtcd (1/1): done, 419.535486ms
[ 5556.679568] [talos] phase leave (3/12): done, 544.930101ms
[ 5556.745342] [talos] phase stopEverything (4/12): 1 tasks(s)
[ 5556.812183] [talos] task stopAllServices (1/1): starting
[ 5556.875932] [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 2461, container trustd)
[ 5556.991674] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 1217, container apid)
[ 5557.101232] [talos] service[udevd](Stopping): Sending SIGTERM to Process(["/sbin/udevd" "--resolve-names=never"])
[ 5557.224283] [talos] service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
[ 5557.411807] [talos] service[machined](Finished): Service finished successfully
[ 5557.498403] [talos] service[trustd](Finished): Service finished successfully
[ 5557.582912] [talos] service[udevd](Finished): Service finished successfully
[ 5557.666372] [talos] service[cri](Finished): Service finished successfully
[ 5557.747763] [talos] service[apid](Finished): Service finished successfully
[ 5557.830258] [talos] service[containerd](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"])
[ 5558.071626] [talos] service[containerd](Finished): Service finished successfully
[ 5558.160339] [talos] task stopAllServices (1/1): done, 1.348209341s
[ 5558.234442] [talos] phase stopEverything (4/12): done, 1.489144769s
[ 5558.309573] [talos] phase unmountUser (5/12): 1 tasks(s)
[ 5558.373283] [talos] task unmountUserDisks (1/1): starting
[ 5558.438016] [talos] task unmountUserDisks (1/1): done, 64.745838ms
[ 5558.512134] [talos] phase unmountUser (5/12): done, 202.572929ms
[ 5558.584150] [talos] phase umount (6/12): 2 tasks(s)
[ 5558.642647] [talos] task unmountPodMounts (2/2): starting
[ 5558.707431] [talos] task unmountOverlayFilesystems (1/2): starting
[ 5558.781525] [talos] task unmountPodMounts (2/2): done, 64.873685ms
[ 5558.866833] [talos] task unmountOverlayFilesystems (1/2): done, 224.183268ms
[ 5558.951362] [talos] phase umount (6/12): done, 367.227894ms
[ 5559.018210] [talos] phase unmountBind (7/12): 1 tasks(s)
[ 5559.081923] [talos] task unmountSystemDiskBindMounts (1/1): starting
[ 5559.158166] [talos] task unmountSystemDiskBindMounts (1/1): unmounting /system/state
[ 5559.251067] XFS (sda5): Unmounting Filesystem
[ 5559.323670] [talos] task unmountSystemDiskBindMounts (1/1): unmounting /var
[ 5559.472553] XFS (sda6): Unmounting Filesystem
[ 5559.800089] [talos] task unmountSystemDiskBindMounts (1/1): done, 718.186983ms
[ 5559.886700] [talos] phase unmountBind (7/12): done, 868.510615ms
[ 5559.958698] [talos] phase unmountSystem (8/12): 2 tasks(s)
[ 5560.024505] [talos] task unmountStatePartition (2/2): starting
[ 5560.094438] [talos] task unmountEphemeralPartition (1/2): starting
[ 5560.168659] [talos] task unmountEphemeralPartition (1/2): done, 144.109781ms
[ 5560.253160] [talos] task unmountStatePartition (2/2): done, 144.193338ms
[ 5560.333496] [talos] phase unmountSystem (8/12): done, 374.807625ms
[ 5560.407570] [talos] phase mountBoot (9/12): 1 tasks(s)
[ 5560.469185] [talos] task mountBootPartition (1/1): starting
[ 5560.573975] XFS (sda3): Mounting V5 Filesystem
[ 5560.842527] XFS (sda3): Ending clean mount
[ 5560.894486] [talos] task mountBootPartition (1/1): done, 425.317406ms
[ 5560.971703] [talos] phase mountBoot (9/12): done, 564.146111ms
[ 5561.041617] [talos] phase kexec (10/12): 1 tasks(s)
[ 5561.100111] [talos] task kexecPrepare (1/1): starting
[ 5562.956659] [talos] prepared kexec environment kernel="/boot/A/vmlinuz" initrd="/boot/A/initramfs.xz" cmdline="talos.platform=metal talos.config=http://172.30.223.27:8081/configdata?uuid= console=ttyS0 console=tty0 init_on_alloc=1 slab_nomerge pti=on consoleblank=0 n"
[ 5563.513413] [talos] task kexecPrepare (1/1): done, 2.413363912s
[ 5563.584386] [talos] phase kexec (10/12): done, 2.542828566s
[ 5563.651179] [talos] phase unmountBoot (11/12): 1 tasks(s)
[ 5563.715921] [talos] task unmountBootPartition (1/1): starting
[ 5563.791458] XFS (sda3): Unmounting Filesystem
[ 5563.877206] [talos] task unmountBootPartition (1/1): done, 161.300444ms
[ 5563.956493] [talos] phase unmountBoot (11/12): done, 305.322915ms
[ 5564.029530] [talos] phase reboot (12/12): 1 tasks(s)
[ 5564.089117] [talos] task reboot (1/1): starting
[ 5574.144075] [talos] WARNING: failed to drain controllers: context deadline exceeded
[ 5574.236164] [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "rpc error: code = Canceled desc = context canceled"}
[ 5574.432021] [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KmsgLogDeliveryController", "error": "error sending logs: dial tcp [fd5c:914b:f534:d003::1]:4001: operation was canceled"}
[ 5574.665297] [talos] unmounted / (/dev/loop0)
[ 5574.716502] [talos] controller runtime finished
[ 5574.770832] [talos] unmounted /system/libexec/apid/apid (/dev/loop0)
[ 5574.847008] [talos] unmounted /system/libexec/trustd/trustd (/dev/loop0)
[ 5574.927336] [talos] waiting for sync...
[ 5574.973337] [talos] sync done
[ 5575.008942] kvm: exiting hardware virtualization
[ 5575.335909] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 5575.420026] mlx4_core 0000:05:00.0: mlx4_shutdown was called
[ 5577.056590] kexec_core: Starting new kernel
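
The console output above can be condensed to its phase list, which makes it easier to see that the stageUpgrade sequence ends with the kexec and reboot phases. This is a minimal sketch, assuming the console output has been saved to a file; the helper name `list_phases` and the file name `console.log` are hypothetical and not part of Talos.

```shell
# list_phases: read a Talos console log on stdin and print "N/TOTAL name"
# for every phase start line. Hypothetical helper, not part of talosctl.
list_phases() {
  # Phase start lines look like:
  #   [ 5528.030976] [talos] phase cleanup (1/12): 1 tasks(s)
  # The matching "done" lines do not contain "tasks(s)" and are skipped.
  sed -n -E 's#.*\[talos\] phase ([A-Za-z]+) \(([0-9]+/[0-9]+)\): [0-9]+ tasks\(s\).*#\2 \1#p'
}

# Usage (assumes the log was saved as console.log):
#   list_phases < console.log
```

Run against the log in this report, the last two lines printed would be the kexec and reboot phases, which is the behavior being reported.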

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
talosctl -n 172.30.223.121 version
Client:
	Tag:         v1.1.1
	SHA:         40a050c6
	Built:
	Go version:  go1.18.4
	OS/Arch:     linux/amd64
Server:
	NODE:        172.30.223.121
	Tag:         v1.1.1
	SHA:         40a050c6
	Built:
	Go version:  go1.18.4
	OS/Arch:     linux/amd64
	Enabled:     RBAC
  • Kubernetes version: [kubectl version --short]
kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.24.3
Kustomize Version: v4.5.4
Server Version: v1.24.2
  • Platform: Baremetal Dell R620s; deployed through sidero
@smira
Member

smira commented Aug 24, 2022

Thanks, it looks like a bug in the docs.

@mrwulf

mrwulf commented Aug 30, 2022

While this is technically a bug in the docs, the option's functionality doesn't seem to match its name. If the functionality stays the same, --force or --immediate seems more accurate.

@smira
Member

smira commented Aug 30, 2022

Naming is hard, but neither --force nor --immediate describes the feature.

A staged upgrade performs the upgrade after the node reboots, while a "normal" upgrade performs it before shutdown. So technically, staged upgrade = 2 reboots, normal upgrade = 1 reboot.

A staged upgrade makes it possible to work around some issues, which we haven't seen for a while, where the upgrade fails to wipe the disk because some workloads can't be stopped fully.
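
The ordering described in this comment can be written down as a small model. This is only an illustrative sketch of the explanation above, not Talos code: the function names `upgrade_steps` and `count_reboots` and the step labels are made up for the example.

```shell
# upgrade_steps MODE: print the order of events for MODE = normal | staged,
# following the description in the comment. Illustrative model only.
upgrade_steps() {
  case "$1" in
    normal)
      # The upgrade is applied before shutdown, so a single reboot suffices.
      printf '%s\n' 'install upgrade' 'reboot into new version'
      ;;
    staged)
      # The upgrade is staged, the node reboots, the upgrade is applied
      # after that reboot, and a second reboot boots the new version.
      printf '%s\n' 'stage upgrade' 'reboot' 'install upgrade' 'reboot into new version'
      ;;
    *)
      echo "usage: upgrade_steps normal|staged" >&2
      return 1
      ;;
  esac
}

# count_reboots MODE: count the steps that start with "reboot".
count_reboots() {
  upgrade_steps "$1" | grep -c '^reboot'
}

# Usage:
#   count_reboots normal
#   count_reboots staged
```

In this model the staged mode has one more reboot than the normal mode, matching the "2 reboots" versus "1 reboot" summary in the comment.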

@smira smira self-assigned this Aug 31, 2022
smira added a commit to smira/talos that referenced this issue Aug 31, 2022
Update what's new, upgrading docs.

Fix up instances of `master` leftover in the docs.

Fix the formatting of kernel params reference.

Fixes siderolabs#6150

Signed-off-by: Andrey Smirnov <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 18, 2024