`talosctl upgrade --stage ...` immediately reboots node #6150

magicite · 2022-08-24T19:13:07Z

Bug Report

Description

The docs for Upgrading Talos Linux state:

In these cases, you can use the --stage flag. This puts the upgrade artifacts on disk, and adds some metadata to a disk partition that gets checked very early in the boot process. The node is not rebooted by the upgrade --stage process. However, whenever the system does next reboot, Talos sees that it needs to apply an upgrade, and will do so immediately.

However, when I use this option with Talos 1.1.1, the targeted node reboots immediately and then performs the install.

Logs

Target the node:

# talosctl upgrade --nodes 172.30.223.121 --stage --image ghcr.io/siderolabs/installer:v1.1.2
NODE             ACK                        STARTED
172.30.223.121   Upgrade request received   2022-08-24 14:08:31.152577842 -0500 CDT m=+5.464175834

Console of node:

[ 5522.601526] [talos] upgrade request received: preserve false, staged true, force false
[ 5522.696489] [talos] validating "ghcr.io/siderolabs/installer:v1.1.2"
[ 5527.969347] [talos] stageUpgrade sequence: 12 phase(s)
[ 5528.030976] [talos] phase cleanup (1/12): 1 tasks(s)
[ 5528.090542] [talos] task stopAllPods (1/1): starting
[ 5528.150083] [talos] task stopAllPods (1/1): waiting for kubelet lifecycle finalizers
[ 5528.243031] [talos] removing shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth0", "ip": "172.30.223.50"}
[ 5528.435812] [talos] removed address 172.30.223.50/32 from "eth0" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
[ 5528.629392] [talos] task stopAllPods (1/1): shutting down kubelet gracefully
[ 5554.014891] cni0: port 1(veth363fbf6c) entered disabled state
[ 5554.084134] device veth363fbf6c left promiscuous mode
[ 5554.144704] cni0: port 1(veth363fbf6c) entered disabled state
[ 5554.317896] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 2522, container kubelet)
[ 5554.613277] [talos] service[kubelet](Finished): Service finished successfully
[ 5554.704213] [talos] skipping pod kube-system/coredns-77c7b7d9b-67b96, state SANDBOX_NOTREADY
[ 5554.806589] [talos] skipping pod kube-system/kube-proxy-2cgp2, state SANDBOX_NOTREADY
[ 5554.900466] [talos] skipping pod kube-system/kube-flannel-285pz, state SANDBOX_NOTREADY
[ 5554.996428] [talos] skipping pod kube-system/kube-proxy-2cgp2, state SANDBOX_NOTREADY
[ 5555.090297] [talos] skipping pod kube-system/kube-flannel-285pz, state SANDBOX_NOTREADY
[ 5555.186263] [talos] skipping pod kube-system/kube-apiserver-talos-172-30-223-121, state SANDBOX_NOTREADY
[ 5555.299916] [talos] skipping pod kube-system/kube-scheduler-talos-172-30-223-121, state SANDBOX_NOTREADY
[ 5555.413544] [talos] skipping pod kube-system/kube-apiserver-talos-172-30-223-121, state SANDBOX_NOTREADY
[ 5555.527184] [talos] skipping pod kube-system/coredns-77c7b7d9b-67b96, state SANDBOX_NOTREADY
[ 5555.628340] [talos] skipping pod kube-system/kube-controller-manager-talos-172-30-223-121, state SANDBOX_NOTREADY
[ 5555.751410] [talos] task stopAllPods (1/1): done, 27.661532624s
[ 5555.822381] [talos] phase cleanup (1/12): done, 27.792073486s
[ 5555.891264] [talos] phase dbus (2/12): 1 tasks(s)
[ 5555.947680] [talos] task stopDBus (1/1): starting
[ 5556.004156] [talos] task stopDBus (1/1): done, 56.486275ms
[ 5556.069933] [talos] phase dbus (2/12): done, 178.681845ms
[ 5556.134646] [talos] phase leave (3/12): 1 tasks(s)
[ 5556.192117] [talos] task leaveEtcd (1/1): starting
[ 5556.281019] [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 2555, container etcd)
[ 5556.528920] [talos] service[etcd](Finished): Service finished successfully
[ 5556.611628] [talos] task leaveEtcd (1/1): done, 419.535486ms
[ 5556.679568] [talos] phase leave (3/12): done, 544.930101ms
[ 5556.745342] [talos] phase stopEverything (4/12): 1 tasks(s)
[ 5556.812183] [talos] task stopAllServices (1/1): starting
[ 5556.875932] [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 2461, container trustd)
[ 5556.991674] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 1217, container apid)
[ 5557.101232] [talos] service[udevd](Stopping): Sending SIGTERM to Process(["/sbin/udevd" "--resolve-names=never"])
[ 5557.224283] [talos] service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
[ 5557.411807] [talos] service[machined](Finished): Service finished successfully
[ 5557.498403] [talos] service[trustd](Finished): Service finished successfully
[ 5557.582912] [talos] service[udevd](Finished): Service finished successfully
[ 5557.666372] [talos] service[cri](Finished): Service finished successfully
[ 5557.747763] [talos] service[apid](Finished): Service finished successfully
[ 5557.830258] [talos] service[containerd](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"])
[ 5558.071626] [talos] service[containerd](Finished): Service finished successfully
[ 5558.160339] [talos] task stopAllServices (1/1): done, 1.348209341s
[ 5558.234442] [talos] phase stopEverything (4/12): done, 1.489144769s
[ 5558.309573] [talos] phase unmountUser (5/12): 1 tasks(s)
[ 5558.373283] [talos] task unmountUserDisks (1/1): starting
[ 5558.438016] [talos] task unmountUserDisks (1/1): done, 64.745838ms
[ 5558.512134] [talos] phase unmountUser (5/12): done, 202.572929ms
[ 5558.584150] [talos] phase umount (6/12): 2 tasks(s)
[ 5558.642647] [talos] task unmountPodMounts (2/2): starting
[ 5558.707431] [talos] task unmountOverlayFilesystems (1/2): starting
[ 5558.781525] [talos] task unmountPodMounts (2/2): done, 64.873685ms
[ 5558.866833] [talos] task unmountOverlayFilesystems (1/2): done, 224.183268ms
[ 5558.951362] [talos] phase umount (6/12): done, 367.227894ms
[ 5559.018210] [talos] phase unmountBind (7/12): 1 tasks(s)
[ 5559.081923] [talos] task unmountSystemDiskBindMounts (1/1): starting
[ 5559.158166] [talos] task unmountSystemDiskBindMounts (1/1): unmounting /system/state
[ 5559.251067] XFS (sda5): Unmounting Filesystem
[ 5559.323670] [talos] task unmountSystemDiskBindMounts (1/1): unmounting /var
[ 5559.472553] XFS (sda6): Unmounting Filesystem
[ 5559.800089] [talos] task unmountSystemDiskBindMounts (1/1): done, 718.186983ms
[ 5559.886700] [talos] phase unmountBind (7/12): done, 868.510615ms
[ 5559.958698] [talos] phase unmountSystem (8/12): 2 tasks(s)
[ 5560.024505] [talos] task unmountStatePartition (2/2): starting
[ 5560.094438] [talos] task unmountEphemeralPartition (1/2): starting
[ 5560.168659] [talos] task unmountEphemeralPartition (1/2): done, 144.109781ms
[ 5560.253160] [talos] task unmountStatePartition (2/2): done, 144.193338ms
[ 5560.333496] [talos] phase unmountSystem (8/12): done, 374.807625ms
[ 5560.407570] [talos] phase mountBoot (9/12): 1 tasks(s)
[ 5560.469185] [talos] task mountBootPartition (1/1): starting
[ 5560.573975] XFS (sda3): Mounting V5 Filesystem
[ 5560.842527] XFS (sda3): Ending clean mount
[ 5560.894486] [talos] task mountBootPartition (1/1): done, 425.317406ms
[ 5560.971703] [talos] phase mountBoot (9/12): done, 564.146111ms
[ 5561.041617] [talos] phase kexec (10/12): 1 tasks(s)
[ 5561.100111] [talos] task kexecPrepare (1/1): starting
[ 5562.956659] [talos] prepared kexec environment kernel="/boot/A/vmlinuz" initrd="/boot/A/initramfs.xz" cmdline="talos.platform=metal talos.config=http://172.30.223.27:8081/configdata?uuid= console=ttyS0 console=tty0 init_on_alloc=1 slab_nomerge pti=on consoleblank=0 n"
[ 5563.513413] [talos] task kexecPrepare (1/1): done, 2.413363912s
[ 5563.584386] [talos] phase kexec (10/12): done, 2.542828566s
[ 5563.651179] [talos] phase unmountBoot (11/12): 1 tasks(s)
[ 5563.715921] [talos] task unmountBootPartition (1/1): starting
[ 5563.791458] XFS (sda3): Unmounting Filesystem
[ 5563.877206] [talos] task unmountBootPartition (1/1): done, 161.300444ms
[ 5563.956493] [talos] phase unmountBoot (11/12): done, 305.322915ms
[ 5564.029530] [talos] phase reboot (12/12): 1 tasks(s)
[ 5564.089117] [talos] task reboot (1/1): starting
[ 5574.144075] [talos] WARNING: failed to drain controllers: context deadline exceeded
[ 5574.236164] [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "rpc error: code = Canceled desc = context canceled"}
[ 5574.432021] [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KmsgLogDeliveryController", "error": "error sending logs: dial tcp [fd5c:914b:f534:d003::1]:4001: operation was canceled"}
[ 5574.665297] [talos] unmounted / (/dev/loop0)
[ 5574.716502] [talos] controller runtime finished
[ 5574.770832] [talos] unmounted /system/libexec/apid/apid (/dev/loop0)
[ 5574.847008] [talos] unmounted /system/libexec/trustd/trustd (/dev/loop0)
[ 5574.927336] [talos] waiting for sync...
[ 5574.973337] [talos] sync done
[ 5575.008942] kvm: exiting hardware virtualization
[ 5575.335909] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 5575.420026] mlx4_core 0000:05:00.0: mlx4_shutdown was called
[ 5577.056590] kexec_core: Starting new kernel
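
The lines that matter in the output above are the sequence name (`stageUpgrade`) and the phase list ending in `reboot`. A minimal sketch for pulling those out of a saved console capture, relying only on the line format shown in this report; the `summarize_phases` helper and the `console.log` file name are hypothetical and not part of Talos:

```shell
# Hypothetical helper: print the sequence header and each phase start line
# from a saved Talos console log passed as the first argument.
summarize_phases() {
  # Sequence header, e.g. "stageUpgrade sequence: 12 phase(s)"
  sed -n 's/.*\[talos\] \([A-Za-z]* sequence: [0-9]* phase(s)\).*/\1/p' "$1"
  # Phase start lines, e.g. "phase cleanup (1/12): 1 tasks(s)" becomes "1/12 cleanup".
  # The "done" lines for each phase do not match and are skipped.
  sed -n 's/.*\[talos\] phase \([A-Za-z]*\) (\([0-9]*\/[0-9]*\)): [0-9]* tasks(s).*/\2 \1/p' "$1"
}
```

Called as `summarize_phases console.log` on the capture above, the last phase listed is `12/12 reboot`, which is the behavior this report is about.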

Environment

Talos version: [talosctl version --nodes <problematic nodes>]

talosctl -n 172.30.223.121 version
Client:
	Tag:         v1.1.1
	SHA:         40a050c6
	Built:
	Go version:  go1.18.4
	OS/Arch:     linux/amd64
Server:
	NODE:        172.30.223.121
	Tag:         v1.1.1
	SHA:         40a050c6
	Built:
	Go version:  go1.18.4
	OS/Arch:     linux/amd64
	Enabled:     RBAC

Kubernetes version: [kubectl version --short]

kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.24.3
Kustomize Version: v4.5.4
Server Version: v1.24.2

Platform: Baremetal Dell R620s; deployed through sidero


smira · 2022-08-24T19:14:25Z

Thanks, it looks like a bug in the docs.

mrwulf · 2022-08-30T19:19:24Z

While technically a bug in the docs, this option's functionality doesn't seem to match its name. If the functionality stays the same, --force or --immediate seem more accurate.

smira · 2022-08-30T20:02:59Z

Naming is hard, but neither --force nor --immediate describes the feature.

A staged upgrade performs the upgrade after the node reboots; a "normal" upgrade performs it before shutdown. So technically, staged upgrade = 2 reboots, normal upgrade = 1 reboot.

A staged upgrade works around an issue, which we haven't seen for a while, where the upgrade fails to wipe the disk because some workloads can't be stopped fully.
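
The difference in ordering can be sketched as a toy model. This is not Talos code: the step names and both helper functions are illustrative assumptions, chosen only to mirror the description above (install before shutdown versus install after a reboot).

```shell
# Simplified model of the two orderings; step names are made up for illustration.
upgrade_steps() {
  case "$1" in
    normal)
      # Install runs before shutdown, then a single reboot into the new version.
      printf '%s\n' stop-workloads install reboot ;;
    staged)
      # The request is recorded, the node reboots, the install runs early in
      # the next boot, and a second reboot brings up the new version.
      printf '%s\n' record-staged-upgrade stop-workloads reboot install reboot ;;
    *)
      echo "usage: upgrade_steps normal|staged" >&2
      return 1 ;;
  esac
}

# Count how many reboots a given mode goes through in this model.
count_reboots() {
  upgrade_steps "$1" | grep -c '^reboot$'
}
```

In this model, `count_reboots normal` prints 1 and `count_reboots staged` prints 2, matching the "2 reboots vs. 1 reboot" summary. In both modes the first reboot follows the request right away, which is why --stage does not defer the reboot.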

Update what's new, upgrading docs. Fix up instances of `master` leftover in the docs. Fix the formatting of kernel params reference. Fixes siderolabs#6150 Signed-off-by: Andrey Smirnov <[email protected]>

smira self-assigned this Aug 31, 2022

smira mentioned this issue Aug 31, 2022

docs: update docs for upcoming 1.2.0 release #6184

Merged

talos-bot closed this as completed in a798dbd Aug 31, 2022

github-actions bot locked as resolved and limited conversation to collaborators Jun 18, 2024
