🐛 Talos turns shutdown into reboot event #7854

salkin · 2023-10-13T09:58:35Z

Bug Report

Talos turns a shutdown event into a reboot when a misbehaving pod does not obey SIGTERM.

Description

After a shutdown is triggered through the Shutdown API, Talos starts shutting down, but the shutdown turns into a reboot when one pod misbehaves and fails to stop.

Logs

[ 1571.984742] [talos] stopped pod default/tpm-device-plugin-wz92s
[ 1572.284172] [talos] task stopAllPods (1/1): failed: failed stopping pod FILTERED_OUT_POD: ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:810be2f79aa5cc74e8fd6029d3f3c2d20c9061aad38a9ad20899ea44c39ecd38,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
[ 1574.050049] [talos] phase cleanup (1/9): failed
[ 1574.287490] [talos] shutdown sequence: failed
[ 1574.515824] [talos] shutdown failed: error running phase 1 in shutdown sequence: task 1/1: failed, failed stopping pod app-teamcomms-stable-912bmhxawq/mcs-1-6bcbcb69d7-v6drz: ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:810be2f79aa5cc74e8fd6029d3f3c2d20c9061aad38a9ad20899ea44c39ecd38,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
[ 1576.481870] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 1913, container apid)
[ 1576.944240] [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 4124, container etcd)
[ 1577.407055] [talos] service[udevd](Stopping): Sending SIGTERM to Process(["/sbin/udevd" "--resolve-names=never"])
[ 1577.924690] [talos] service[machined](Finished): Service finished successfully
[ 1578.294614] [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 4047, container trustd)
[ 1578.783092] [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
[ 1579.493072] [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
[ 1580.164745] [talos] service[udevd](Finished): Service finished successfully
[ 1580.521130] [talos] service[apid](Finished): Service finished successfully
[ 1580.874347] [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
[ 1581.546399] [talos] service[etcd](Finished): Service finished successfully
[ 1581.900214] [talos] service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
[ 1582.719632] [talos] service[cri](Finished): Service finished successfully
[ 1583.159952] [talos] service[trustd](Finished): Service finished successfully
[ 1583.522368] [talos] service[containerd](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"])
[ 1584.538867] [talos] service[containerd](Finished): Service finished successfully
[ 1584.921394] [talos] fatal sequencer error in "shutdown" sequence: message:"sequence failed: error running phase 1 in shutdown sequence: task 1/1: failed, failed stopping pod FILTERED_OUT_POD: ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:810be2f79aa5cc74e8fd6029d3f3c2d20c9061aad38a9ad20899ea44c39ecd38,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
[ 1587.168198] [talos] rebooting in 10 seconds
[ 1588.395069] [talos] rebooting in 9 seconds
[ 1589.617832] [talos] rebooting in 8 seconds
[ 1590.838854] [talos] rebooting in 7 seconds
[ 1592.061347] [talos] rebooting in 6 seconds
[ 1593.282258] [talos] rebooting in 5 seconds
[ 1594.502522] [talos] rebooting in 4 seconds
[ 1595.723035] [talos] rebooting in 3 seconds
[ 1596.943063] [talos] rebooting in 2 seconds
[ 1597.625279] [talos] controller runtime finished
[ 1598.162033] [talos] rebooting in 1 seconds
[ 1599.379261] [talos] rebooting in 0 seconds

Environment

Talos version: Talos 1.4.6
Kubernetes version: 1.27.6
Platform: bare-metal

Fixes siderolabs#7854. Talos runs an emergency handler if a sequence experiences an unrecoverable failure. The emergency handler was unconditionally executing the "reboot" action if no other action was received (an action is only received if the sequence completes successfully), so a Shutdown request could result in a reboot when an error occurred during the shutdown phase. This is not a pretty fix, but it's hard to deliver the intent from one part of the code to another right now, so instead use a global variable which stores the default emergency intention and gets overridden early in the Shutdown sequence. Signed-off-by: Andrey Smirnov <[email protected]>

Fixes siderolabs#7854. Talos runs an emergency handler if a sequence experiences an unrecoverable failure. The emergency handler was unconditionally executing the "reboot" action if no other action was received (an action is only received if the sequence completes successfully), so a Shutdown request could result in a reboot when an error occurred during the shutdown phase. This is not a pretty fix, but it's hard to deliver the intent from one part of the code to another right now, so instead use a global variable which stores the default emergency intention and gets overridden early in the Shutdown sequence. Signed-off-by: Andrey Smirnov <[email protected]> (cherry picked from commit 474fa04)
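
For readers who want to see the shape of the fix described in the commit message, here is a minimal sketch in Go. It is not the Talos source: every identifier (`Action`, `SetEmergencyAction`, `runShutdown`, `finalAction`, `errDeadline`) is hypothetical and chosen only for illustration. It assumes a package-level default that starts as "reboot" and is overridden as the first step of the shutdown sequence, so that a failure in a later phase falls back to power-off rather than reboot.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Action is the final machine action (hypothetical type, not the Talos API).
type Action int32

const (
	Reboot Action = iota
	PowerOff
)

func (a Action) String() string {
	if a == PowerOff {
		return "poweroff"
	}
	return "reboot"
}

// defaultEmergencyAction is the global fallback used when a sequence fails.
// Its zero value corresponds to Reboot, mirroring the old behavior.
var defaultEmergencyAction int32

// SetEmergencyAction overrides the fallback action.
func SetEmergencyAction(a Action) { atomic.StoreInt32(&defaultEmergencyAction, int32(a)) }

// EmergencyAction returns the current fallback action.
func EmergencyAction() Action { return Action(atomic.LoadInt32(&defaultEmergencyAction)) }

// errDeadline stands in for the runtime error seen in the logs.
var errDeadline = errors.New("context deadline exceeded")

// runShutdown records the intent first, then runs a single simulated phase.
func runShutdown(stopAllPods func() error) (Action, error) {
	SetEmergencyAction(PowerOff) // override early, before anything can fail
	if err := stopAllPods(); err != nil {
		return Reboot, fmt.Errorf("error running phase 1 in shutdown sequence: %w", err)
	}
	return PowerOff, nil
}

// finalAction mimics the emergency handler: on failure no action was
// received from the sequence, so it uses the stored default.
func finalAction(requested Action, err error) Action {
	if err != nil {
		return EmergencyAction()
	}
	return requested
}

func main() {
	action := finalAction(runShutdown(func() error { return errDeadline }))
	fmt.Println(action) // prints "poweroff"
}
```

With this arrangement, a failing `stopAllPods` step like the one in the logs above would still end in power-off, because the fallback was changed before the failing phase ran. A global is a blunt tool, as the commit message itself admits, but it avoids threading the intent through every layer between the sequencer and the handler.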

smira self-assigned this Oct 16, 2023

smira mentioned this issue Nov 9, 2023

Release 1.6.0 checklist #7561

Closed

smira changed the title ~~Talos turns shutdown into reboot event~~ 🐛 Talos turns shutdown into reboot event Nov 28, 2023

smira mentioned this issue Dec 4, 2023

fix: store and execute desired action on emergency action #8028

Merged

talos-bot closed this as completed in 474fa04 Dec 4, 2023

github-actions bot locked as resolved and limited conversation to collaborators Jun 7, 2024
