
Kubelet cannot be restarted via API if it's not already running #4665

Closed
p3lim opened this issue Dec 8, 2021 · 5 comments · Fixed by #4675
p3lim commented Dec 8, 2021

Bug Report

Description

$ talosctl service -n talos03
NODE      SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
talos03   apid         Running   OK       2m29s ago     Health check successful
talos03   containerd   Running   OK       2m44s ago     Health check successful
talos03   cri          Running   OK       2m35s ago     Health check successful
talos03   kubelet      Failed    ?        2m35s ago     Condition failed: 1 error occurred:
          * resource HostnameStatuses.net.talos.dev(network/hostname@undefined) doesn't exist


talos03   machined     Running   ?        2m49s ago     Service started as goroutine
talos03   udevd        Running   OK       2m41s ago     Health check successful

$ talosctl service -n talos03 kubelet
NODE     talos03
ID       kubelet
STATE    Failed
HEALTH   ?
EVENTS   [Failed]: Condition failed: 1 error occurred:
         * resource HostnameStatuses.net.talos.dev(network/hostname@undefined) doesn't exist
         (2m41s ago)
         [Waiting]: Waiting for service "cri" to be "up" (2m43s ago)
         [Waiting]: Waiting for service "cri" to be "up", time sync, network, nodename (2m44s ago)

$ talosctl -n talos03 service kubelet restart
error starting service: 1 error occurred:
        * talos03: rpc error: code = Unknown desc = service "kubelet" doesn't support start operation via API

If the kubelet service is in the "Failed" state it cannot be restarted via the API; restarting works fine if the service is already running.

Ref #4407.
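The failure mode can be sketched as a state guard: a start (or restart) request is only honored while the service is already running, so a service that has transitioned to Failed can never be brought back through the API. The type and function names below are hypothetical, for illustration only; this is not the actual Talos code.

```go
package main

import (
	"errors"
	"fmt"
)

// State models a service lifecycle state (hypothetical names,
// not the actual Talos event set).
type State int

const (
	Waiting State = iota
	Running
	Failed
)

// canStart mirrors the buggy behavior described above: a start
// via the API is only permitted while the service is Running,
// so a Failed service can never be brought back.
func canStart(s State) error {
	if s != Running {
		return errors.New(`service "kubelet" doesn't support start operation via API`)
	}
	return nil
}

func main() {
	fmt.Println(canStart(Running)) // nil: restart of a running service works
	fmt.Println(canStart(Failed))  // error, matching the talosctl output above
}
```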

Logs

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
$ talosctl -n talos03 version
Client:
        Tag:         v0.14.0-alpha.2
        SHA:         f7c87d1d
        Built:
        Go version:  go1.17.3
        OS/Arch:     linux/amd64
Server:
        NODE:        talos03
        Tag:         v0.14.0-alpha.2
        SHA:         f7c87d1d
        Built:
        Go version:  go1.17.3
        OS/Arch:     linux/amd64
        Enabled:     RBAC
  • Kubernetes version: [kubectl version --short]
    • 1.23.0-rc.0
  • Platform:
    • x86_64
@smira smira added this to the v0.14 milestone Dec 9, 2021

smira commented Dec 9, 2021

Thanks, we should get this addressed, and the root cause (that condition-failed error) shouldn't happen either.

@smira smira self-assigned this Dec 9, 2021

smira commented Dec 9, 2021

This is a bug in Talos, but interestingly enough, this code was removed from Talos as of 0.14.0-beta.0. So the actual issue of the kubelet failing to restart is not fixed, but the condition-failed bug shouldn't exist anymore.

smira added a commit to smira/talos that referenced this issue Dec 13, 2021
In addition to restart action, allow also start action.

If the service fails to start, it transitions to `Failed` state and it
should be actually started to bring it back to running state.

Fixes siderolabs#4665

Also GC'ed now unused condition (it had been used before kubelet started
being controlled via COSI).

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit ab42886)
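The fix described in the commit message above can be sketched as widening the guard: in addition to restarting a Running service, a start action is accepted once the service has stopped (Failed or Finished). This is an illustrative sketch with hypothetical names, not the actual patch, which lives in siderolabs/talos #4675.

```go
package main

import (
	"errors"
	"fmt"
)

// State models a service lifecycle state (hypothetical names).
type State int

const (
	Waiting State = iota
	Running
	Failed
	Finished
)

// requestStart sketches the fixed behavior: a start is now accepted
// when the service has stopped, i.e. reached Failed or Finished,
// bringing it back toward the running state.
func requestStart(s State) error {
	switch s {
	case Failed, Finished:
		return nil // bring the stopped service back up
	case Running:
		return errors.New("service already running, use restart instead")
	default:
		return errors.New("service is still starting")
	}
}

func main() {
	fmt.Println(requestStart(Failed)) // nil: start is now permitted
}
```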

p3lim commented Dec 14, 2021

but the bug with condition failed shouldn't exist anymore

So this would potentially fix #4574 (sorry for not providing those logs in a reasonable amount of time, it's been hectic lately).
I've got some time set aside tomorrow to test out beta1 and I'll try to reproduce the condition error.


p3lim commented Dec 15, 2021

Did some tests today: after ~20 reboots I no longer got the condition-failed error, but I started seeing failures to get DHCP packets on the bond interface in the console. I did 10 reboots with kexec and 10 reboots with -m powercycle, and the DHCP issue happened once with each.

This is something I'm unable to provide logs for, since the node was unable to boot properly and thus no API was available.

Best I can do is this screenshot from the console:
[console screenshot: repeated DHCP failure messages on the bond interface]

This keeps repeating forever, and I have to manually force reboot the node.


smira commented Dec 15, 2021

This feels like a completely different issue; we should probably move it to #4574. I think it might be helpful to compare the logs from before the error appears — there might be something about the bond setup that makes a difference.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 22, 2024