
Kubelet cannot be restarted via API if it's not already running #4665

Closed
p3lim opened this issue Dec 8, 2021 · 5 comments · Fixed by #4675
p3lim commented Dec 8, 2021

Bug Report

Description

$ talosctl service -n talos03
NODE      SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
talos03   apid         Running   OK       2m29s ago     Health check successful
talos03   containerd   Running   OK       2m44s ago     Health check successful
talos03   cri          Running   OK       2m35s ago     Health check successful
talos03   kubelet      Failed    ?        2m35s ago     Condition failed: 1 error occurred:
          * resource HostnameStatuses.net.talos.dev(network/hostname@undefined) doesn't exist


talos03   machined     Running   ?        2m49s ago     Service started as goroutine
talos03   udevd        Running   OK       2m41s ago     Health check successful

$ talosctl service -n talos03 kubelet
NODE     talos03
ID       kubelet
STATE    Failed
HEALTH   ?
EVENTS   [Failed]: Condition failed: 1 error occurred:
         * resource HostnameStatuses.net.talos.dev(network/hostname@undefined) doesn't exist
         (2m41s ago)
         [Waiting]: Waiting for service "cri" to be "up" (2m43s ago)
         [Waiting]: Waiting for service "cri" to be "up", time sync, network, nodename (2m44s ago)

$ talosctl -n talos03 service kubelet restart
error starting service: 1 error occurred:
        * talos03: rpc error: code = Unknown desc = service "kubelet" doesn't support start operation via API

If the kubelet service is in the "Failed" state it cannot be restarted via the API; restarting works fine if the service is already running.

Ref #4407.
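The failure mode can be sketched as a state guard: a start (or restart) request is only honored while the service is already running, so a service that has transitioned to Failed can never be brought back through the API. The type and function names below are hypothetical, for illustration only; this is not the actual Talos code.

```go
package main

import (
	"errors"
	"fmt"
)

// State models a service lifecycle state (hypothetical names,
// not the actual Talos event set).
type State int

const (
	Waiting State = iota
	Running
	Failed
)

// canStart mirrors the buggy behavior described above: a start
// via the API is only permitted while the service is Running,
// so a Failed service can never be brought back.
func canStart(s State) error {
	if s != Running {
		return errors.New(`service "kubelet" doesn't support start operation via API`)
	}
	return nil
}

func main() {
	fmt.Println(canStart(Running)) // nil: restart of a running service works
	fmt.Println(canStart(Failed))  // error, matching the talosctl output above
}
```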

Logs

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
$ talosctl -n talos03 version
Client:
        Tag:         v0.14.0-alpha.2
        SHA:         f7c87d1d
        Built:
        Go version:  go1.17.3
        OS/Arch:     linux/amd64
Server:
        NODE:        talos03
        Tag:         v0.14.0-alpha.2
        SHA:         f7c87d1d
        Built:
        Go version:  go1.17.3
        OS/Arch:     linux/amd64
        Enabled:     RBAC
  • Kubernetes version: [kubectl version --short]
    • 1.23.0-rc.0
  • Platform:
    • x86_64
@smira smira added this to the v0.14 milestone Dec 9, 2021

smira commented Dec 9, 2021

Thanks, we should get this addressed, and the root cause (that condition-failed error) shouldn't happen either.

@smira smira self-assigned this Dec 9, 2021

smira commented Dec 9, 2021

This is a bug in Talos, but interestingly enough, this code was removed from Talos as of 0.14.0-beta.0. So the actual issue of the kubelet failing to restart is not fixed, but the condition-failed bug shouldn't exist anymore.

smira added a commit to smira/talos that referenced this issue Dec 13, 2021
In addition to restart action, allow also start action.

If the service fails to start, it transitions to `Failed` state and it
should be actually started to bring it back to running state.

Fixes siderolabs#4665

Also GC'ed now unused condition (it had been used before kubelet started
being controlled via COSI).

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit ab42886)
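The fix described in the commit message above can be sketched as widening the guard: in addition to restarting a Running service, a start action is accepted once the service has stopped (Failed or Finished). This is an illustrative sketch with hypothetical names, not the actual patch, which lives in siderolabs/talos #4675.

```go
package main

import (
	"errors"
	"fmt"
)

// State models a service lifecycle state (hypothetical names).
type State int

const (
	Waiting State = iota
	Running
	Failed
	Finished
)

// requestStart sketches the fixed behavior: a start is now accepted
// when the service has stopped, i.e. reached Failed or Finished,
// bringing it back toward the running state.
func requestStart(s State) error {
	switch s {
	case Failed, Finished:
		return nil // bring the stopped service back up
	case Running:
		return errors.New("service already running, use restart instead")
	default:
		return errors.New("service is still starting")
	}
}

func main() {
	fmt.Println(requestStart(Failed)) // nil: start is now permitted
}
```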

p3lim commented Dec 14, 2021

but the bug with condition failed shouldn't exist anymore

So this would potentially fix #4574 (sorry for not providing those logs in a reasonable amount of time, it's been hectic lately).
I've got some time set aside tomorrow to test out beta1 and I'll try to reproduce the condition error.


p3lim commented Dec 15, 2021

Did some tests today: after ~20 reboots I no longer got the condition-failed error, but I started seeing failures to get DHCP packets on the bond interface in the console. I did 10 reboots with kexec and 10 reboots with -m powercycle, and the DHCP issue happened once with each.

This is something I'm unable to provide logs for, since the node was unable to boot properly and thus no API was available.

Best I can do is this screenshot from the console:
[console screenshot: repeated DHCP failure messages on the bond interface]

This keeps repeating forever, and I have to manually force reboot the node.


smira commented Dec 15, 2021

This feels like a completely different issue; we should probably move it to #4574. I think it might be helpful to compare the logs from before the error appears — there might be something about the bond setup that makes a difference.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 22, 2024