Infinite loop in the RouteSpecController causing a lot of CPU consumption #7592

Closed
donch opened this issue Aug 9, 2023 · 3 comments · Fixed by #7597


donch commented Aug 9, 2023

Bug Report

Description

When installing a node of my Talos cluster, the RouteSpecController generates a large number of logs and consumes a lot of CPU.
The gateways displayed in the console are also endlessly "flapping".
The initial network configuration was made either through the console (network settings) or by editing the GRUB configuration; the bug appears in both cases.

Logs

10.14.28.66: user: warning: [2023-08-09T15:12:40.99345371Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "10.14.28.254", "table": "main", "link": "eth0"}
10.14.28.66: user: warning: [2023-08-09T15:12:40.99563171Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "10.14.28.254", "table": "main", "link": "eth0"}
10.14.28.66: user: warning: [2023-08-09T15:12:40.99786871Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "10.14.28.254", "table": "main", "link": "eth0"}
10.14.28.66: user: warning: [2023-08-09T15:12:41.00009071Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "10.14.28.254", "table": "main", "link": "eth0"}

It seems that there are two routespecs involved in this loop instead of one:

talosctl --nodes 10.14.28.66 get routespecs -o yaml
node: 10.14.28.66
metadata:
    namespace: network
    type: RouteSpecs.net.talos.dev
    id: inet4/10.14.28.254//0
    version: 2
    owner: network.RouteMergeController
    phase: running
    created: 2023-08-09T15:02:58Z
    updated: 2023-08-09T15:02:58Z
    finalizers:
        - network.RouteSpecController
spec:
    family: inet4
    dst: ""
    src: ""
    gateway: 10.14.28.254
    outLinkName: eth0
    table: main
    scope: global
    type: unicast
    flags: ""
    protocol: static
    layer: platform
---
node: 10.14.28.66
metadata:
    namespace: network
    type: RouteSpecs.net.talos.dev
    id: inet4/10.14.28.254//1024
    version: 2
    owner: network.RouteMergeController
    phase: running
    created: 2023-08-09T15:03:11Z
    updated: 2023-08-09T15:03:11Z
    finalizers:
        - network.RouteSpecController
spec:
    family: inet4
    dst: ""
    src: ""
    gateway: 10.14.28.254
    outLinkName: eth0
    table: main
    priority: 1024
    scope: global
    type: unicast
    flags: ""
    protocol: static
    layer: configuration
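
For illustration, a minimal Go sketch splitting the two IDs (the field layout of family/gateway/destination/priority is inferred from the spec bodies above, not taken from the Talos source) shows that only the trailing priority component differs:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // IDs copied verbatim from the routespecs output above.
    ids := []string{
        "inet4/10.14.28.254//0",
        "inet4/10.14.28.254//1024",
    }

    for _, id := range ids {
        parts := strings.Split(id, "/")
        fmt.Printf("family=%s gateway=%s dst=%q priority=%s\n",
            parts[0], parts[1], parts[2], parts[3])
    }
    // Everything except the priority (0 vs 1024) is identical; the other
    // visible difference in the specs is the layer (platform vs configuration).
}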

Environment

  • Talos version:
Client:
	Tag:         v1.5.0-beta.0
	SHA:         de763409
	Built:
	Go version:  go1.20.7
	OS/Arch:     darwin/arm64
Server:
	NODE:        10.14.28.66
	Tag:         v1.5.0-beta.0-11-g8a94ae93e
	SHA:         8a94ae93
	Built:
	Go version:  go1.20.7
	OS/Arch:     linux/amd64
	Enabled:     RBAC

The bug is also present in at least 1.4.6, 1.4.7, and 1.5.0-beta.0.

  • Kubernetes version:
Client Version: v1.26.2
Kustomize Version: v4.5.7
Server Version: v1.26.1
  • Platform: Proxmox VMs

smira commented Aug 9, 2023

So the bug in Talos is that it incorrectly handles two routes which are identical except for the priority (metric). A workaround for you might be to delete that route from the machine configuration, as it's already coming from the platform source. A fix will be done for 1.5, thanks for reporting it!
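
To make the failure mode concrete, here is a minimal, self-contained Go sketch (a toy model with made-up names `kernelRoute` and `findByGateway`, not the actual Talos controller, which manages routes via netlink) of two specs that differ only in priority being reconciled against a lookup that ignores the priority:

package main

import "fmt"

// kernelRoute is a toy stand-in for what the kernel stores; the real
// controller works with netlink routes, this only models the fight.
type kernelRoute struct {
    gateway  string
    priority uint32
}

// findByGateway looks a route up while ignoring the priority -- the
// flawed matching this issue is about.
func findByGateway(kernel []kernelRoute, spec kernelRoute) int {
    for i, r := range kernel {
        if r.gateway == spec.gateway {
            return i
        }
    }
    return -1
}

func main() {
    // Desired routes, as in the resource output above: same gateway,
    // priorities 0 and 1024.
    specs := []kernelRoute{
        {gateway: "10.14.28.254", priority: 0},
        {gateway: "10.14.28.254", priority: 1024},
    }
    kernel := []kernelRoute{}

    for pass := 0; pass < 3; pass++ {
        for _, spec := range specs {
            if i := findByGateway(kernel, spec); i >= 0 {
                if kernel[i] == spec {
                    continue // already as desired
                }
                // Priority differs, so the existing route looks "wrong":
                // delete it and re-create the desired one.
                fmt.Printf("pass %d: deleting %+v\n", pass, kernel[i])
                kernel = append(kernel[:i], kernel[i+1:]...)
            }
            fmt.Printf("pass %d: created route %+v\n", pass, spec)
            kernel = append(kernel, spec)
        }
    }
    // Every pass keeps printing "created route", just like the log flood above.
}

With the lookup corrected to also take the priority into account, each spec settles on its own kernel route; a matching sketch follows the commit message below.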

smira self-assigned this Aug 9, 2023
smira added this to the v1.5 milestone Aug 9, 2023
smira added a commit to smira/talos that referenced this issue Aug 9, 2023
Fixes siderolabs#7592

The problem was a mismatch between a "primary key" (ID) of the
`RouteSpec` and the way routes are looked up in the kernel - with two
identical routes that differ only in priority, Talos would end up in an
infinite loop fighting to remove and re-add the same route, as the
priority never matches.

Signed-off-by: Andrey Smirnov <[email protected]>
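
As a rough illustration of the direction the commit message describes (the actual change is in PR #7597, which should be consulted for the real fix), the same toy model from the sketch above converges once the kernel lookup also takes the priority into account:

package main

import "fmt"

// Same toy model as before; kernelRoute is not a real Talos type.
type kernelRoute struct {
    gateway  string
    priority uint32
}

// findRoute matches on the priority as well, so the two specs map to two
// distinct kernel routes instead of fighting over a single slot.
func findRoute(kernel []kernelRoute, spec kernelRoute) int {
    for i, r := range kernel {
        if r.gateway == spec.gateway && r.priority == spec.priority {
            return i
        }
    }
    return -1
}

func main() {
    specs := []kernelRoute{
        {gateway: "10.14.28.254", priority: 0},
        {gateway: "10.14.28.254", priority: 1024},
    }
    kernel := []kernelRoute{}

    for pass := 0; pass < 3; pass++ {
        for _, spec := range specs {
            if findRoute(kernel, spec) >= 0 {
                continue // already installed, nothing to do
            }
            fmt.Printf("pass %d: created route %+v\n", pass, spec)
            kernel = append(kernel, spec)
        }
    }
    // Only two "created route" lines are printed; later passes are no-ops.
}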

lion7 commented Aug 10, 2023

I also ran into the same issue yesterday; in my case I had configured a static IP address + route using the dashboard.
Once my node was reachable, I applied a machine config using the exact same IP + route and that caused this loop.
I resolved it by emptying the network config on the dashboard (by applying an empty new config).


smira commented Aug 10, 2023

The bug will be fixed in the 1.5.0 release, thanks for reporting it!

smira added a commit to smira/talos that referenced this issue Aug 16, 2023
Fixes siderolabs#7592

The problem was a mismatch between a "primary key" (ID) of the
`RouteSpec` and the way routes are looked up in the kernel - with two
identical routes that differ only in priority, Talos would end up in an
infinite loop fighting to remove and re-add the same route, as the
priority never matches.

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit ee6d639)
github-actions bot locked as resolved and limited conversation to collaborators Jun 11, 2024