
Initial BGP sync during kube-router startup extremely slow in kubernetes v1.29 #1668

Closed
acalm opened this issue May 10, 2024 · 6 comments

@acalm

acalm commented May 10, 2024

What happened?
After the cluster was upgraded from v1.28 to v1.29, we started observing increased workload startup times because the network was unavailable for ~5 minutes. Checking the kube-router logs, we saw that the initial BGP sync takes several minutes instead of a few seconds at most. While the sync is running, no running pods can use the network until the following is reported:
time="2024-05-10T10:13:30Z" level=info msg="sync finished" Topic=Server

I0510 10:09:06.802963 1 version.go:21] Running /usr/local/bin/kube-router version v1.4.0, built on 2022-01-05T17:01:42+0000, go1.17.5
I0510 10:09:07.105518 1 metrics_controller.go:164] Starting metrics controller
I0510 10:09:07.146602 1 network_routes_controller.go:1298] Could not find annotation `kube-router.io/bgp-local-addresses` on node object so BGP will listen on node IP: 10.40.52.161 address.
I0510 10:09:07.147089 1 network_routes_controller.go:376] `subnet` in CNI conf file is empty so populating `subnet` in CNI conf file with pod CIDR assigned to the node obtained from node spec.
I0510 10:09:07.284141 1 network_routes_controller.go:242] Setting MTU of kube-bridge interface to: 8981
I0510 10:09:07.285174 1 network_routes_controller.go:273] Starting network route controller
I0510 10:09:07.288415 1 network_routes_controller.go:1024] Could not find BGP peer info for the node in the node annotations so skipping configuring peer.
I0510 10:09:07.291280 1 network_services_controller.go:344] Starting network services controller
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.89.141" Topic=Peer
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.93.68" Topic=Peer
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.55.212" Topic=Peer
...
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.93.167" Topic=Peer
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.76.147" Topic=Peer
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.81.213" Topic=Peer
I0510 10:09:07.491224 1 network_policy_controller.go:159] Starting network policy controller
I0510 10:09:07.517270 1 network_policy_controller.go:171] Starting network policy controller full sync goroutine
time="2024-05-10T10:09:14Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.68.45 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:09:14Z" level=info msg="Peer Up" Key=10.40.68.45 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:09:40Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.93.167 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:09:40Z" level=info msg="Peer Up" Key=10.40.93.167 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:09:41Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.86.87 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:09:41Z" level=info msg="Peer Up" Key=10.40.86.87 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:10:12Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.60.155 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:10:12Z" level=info msg="Peer Up" Key=10.40.60.155 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:10:37Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.56.231 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:10:37Z" level=info msg="Peer Up" Key=10.40.56.231 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:10:43Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.77.181 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:10:43Z" level=info msg="Peer Up" Key=10.40.77.181 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:10:44Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.66.117 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:10:44Z" level=info msg="Peer Up" Key=10.40.66.117 State=BGP_FSM_OPENCONFIRM Topic=Peer
...
time="2024-05-10T10:13:30Z" level=info msg="sync finished" Topic=Server

What did you expect to happen?
Initial BGP sync should complete in the same amount of time as in Kubernetes v1.28

How can we reproduce the behavior you experienced?
Steps to reproduce the behavior:

  1. Run kube-router v1.4.0 in a Kubernetes v1.29 cluster
  2. Watch the kube-router logs during startup and observe how long the initial BGP sync takes

**System Information:**

  • Kube-Router Version (kube-router --version): 1.4.0
  • Kube-Router Parameters:
/usr/local/bin/kube-router --run-router=true --run-firewall=true --run-service-proxy=true --kubeconfig=/etc/kuberouter/kuberouter.kubeconfig --metrics-port=9332 --disable-source-dest-check=false --bgp-graceful-restart --service-cluster-ip-range=10.32.0.0/16
  • Kubernetes Version (kubectl version) : v1.29.4
  • Cloud Type: AWS
  • Kubernetes Deployment Type: Kubeadm
  • Kube-Router Deployment Type: DaemonSet
  • Cluster Size: 20-200


Additional context
Two kubeadm / kube-controller-manager settings were changed during the upgrade:

  • node-cidr-mask-size: "25" (from the default /24)
  • maxPods: 64
  • Nodes run Ubuntu 20.04 with the linux-aws 5.15 kernel

We're aware v1.4.0 is pretty old, but the issue seems to be somewhere in how BGP is handled, or how peers are discovered. We've tried to pinpoint any other changes in our infrastructure but haven't been able to come up with anything other than the changed Kubernetes version.
@acalm acalm added the bug label May 10, 2024
@acalm
Author

acalm commented May 12, 2024

I don't know if it's of any concern, but running gobgp neighbor <NODEIP> in some kube-router pods shows that some neighbors had multiple OPEN messages sent and only one received, while others had just one sent and one received. I was thinking that may impact the initial peer addition.
For example:

# gobgp neighbor 10.40.91.212
...
  Message statistics:
                         Sent       Rcvd
    Opens:                 24          1
    Notifications:          0          0
    Updates:                8          8
    Keepalives:          1170       1170
    Route Refresh:          0          0
    Discarded:              0          0
    Total:               1202       1179

and

# gobgp neighbor 10.40.81.200
...
  Message statistics:
                         Sent       Rcvd
    Opens:                  1          1
    Notifications:          0          0
    Updates:                8          8
    Keepalives:          1179       1179
    Route Refresh:          0          0
    Discarded:              0          0
    Total:               1188       1188
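
In case it's useful for others, a rough way to spot this across the whole DaemonSet is to loop over the kube-router pods and print the OPEN counters per neighbor. A minimal sketch, assuming the pods run in kube-system with the usual k8s-app=kube-router label (adjust namespace/label to your deployment):

#!/usr/bin/env bash
# Print the OPEN message counters for every BGP neighbor of every kube-router pod,
# to spot peers that keep re-sending OPENs during the initial sync.
# Assumes the DaemonSet runs in kube-system with the label k8s-app=kube-router.
set -euo pipefail

for pod in $(kubectl -n kube-system get pods -l k8s-app=kube-router -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${pod} ==="
  # The first column of the 'gobgp neighbor' summary is the peer address.
  for peer in $(kubectl -n kube-system exec "${pod}" -- gobgp neighbor | awk 'NR>1 {print $1}'); do
    echo -n "${peer}  "
    kubectl -n kube-system exec "${pod}" -- gobgp neighbor "${peer}" | grep 'Opens:'
  done
done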

@aauren
Collaborator

aauren commented May 12, 2024

Unfortunately I can't think of any reason why Kubernetes version would affect BGP convergence. In the past, the only reasons why we've ever seen this type of thing are either:

  • BGP misconfiguration
  • Or more specifically BGP Graceful Restart misconfiguration

The fact that it took ~5 minutes looks suspiciously like the default BGP Graceful Restart deferral time: https://github.com/cloudnativelabs/kube-router/blob/v1.4/pkg/options/options.go#L80 (granted, that one is technically 6 minutes).
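
If that deferral is what you're hitting, the window is tunable. A rough sketch, assuming your build exposes the deferral-time flag defined in the options file linked above (double-check with kube-router --help):

# Example daemon args lowering the graceful restart deferral window from the
# default (~6 minutes) to 90 seconds; the flag name is assumed from the v1.4 options file.
/usr/local/bin/kube-router --run-router=true --run-firewall=true --run-service-proxy=true \
  --bgp-graceful-restart \
  --bgp-graceful-restart-deferral-time=90s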

As an aside, I would not run kube-router v1.4.0 against a 1.29 Kubernetes cluster. The client libraries it is built against target Kubernetes v1.22, which is well outside the two-minor-version skew that the upstream project allows. That it is able to run at all is a testament to how good Kubernetes' backwards compatibility is.

Beyond this, v1.4.0 is more than 2 years behind active development and the project won't be able to help further on this issue.

@aauren aauren closed this as completed May 12, 2024
@acalm
Author

acalm commented May 15, 2024

Thanks for your response @aauren!
Sorry for answering in a closed issue (and for creating issues for old versions), but I may have found something that could impact later versions as well!

I think this issue is actually due to changes made in Kubernetes v1.29, and it happens when you run kubelet with --cloud-provider=external: this leaves the node's status addresses list empty until the cloud provider controller populates it. I think this is the cause of these messages in the existing kube-router pods:
failed to get node object from api server: host IP unknown
From here, the NRC tries to read the address list, but it does not exist yet. I didn't capture these messages in my initial report, as I only included logs from the newly created kube-router pod (sorry).

Anyway, digging further I noticed that the new node only seemed to get added as a valid BGP neighbor once the network route controller ran a sync (default: every 5 minutes); I assume the NRC re-reads all nodes from the API when a sync starts.
After some more digging I noticed that when a new node comes up it is initially missing the status.addresses list; the list is populated a few seconds later, but by then it's too late, because the node isn't picked up until the controller runs its next sync.

I started looking into the changes made to kubelet in 1.29 and noticed kubernetes/kubernetes#121028. I have yet to confirm this, but it seems like you can get the old behavior back (not waiting for the external cloud controller to populate the address list) by passing --node-ip=$IPADDRESS to kubelet.
It looks like Calico recently ran into similar issues here: projectcalico/calico#8453
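
A quick way to check whether a freshly joined node is in this state is to inspect status.addresses directly, something like:

# With --cloud-provider=external this list stays empty until the cloud
# controller manager populates it; an empty result lines up with the
# "host IP unknown" messages above.
kubectl get node <NODE_NAME> -o jsonpath='{.status.addresses}'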

I'm going to try passing --node-ip to kubelet and see if it helps; I just wanted to report back in case someone else runs into this in later versions, as the issue may persist there as well.

edit: some clarifications, etc.

@aauren
Collaborator

aauren commented May 15, 2024

Thanks for reporting back about what you found! I think this will be helpful for future people running into this issue. Please let the thread know whether, and how, adding the node-ip parameter changes things.

The issue that we have open here #676 is specific to node annotations, but part of that work will be to watch nodes in general. Maybe we could watch for address updates as well whenever we work on that.

@acalm
Author

acalm commented May 16, 2024

I've verified that passing --node-ip=<NODE_IP_ADDRESS> to kubelet populates the status.addresses list read by the NRC here. This restores the original behavior and remediates the issues we observed in our environments: when a new node comes up, the other kube-router pods in the DaemonSet log:

level=info msg="Add a peer configuration for:10.40.95.197" Topic=Peer
level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.95.197 State=BGP_FSM_OPENSENT Topic=Peer

instead of

bgp_peers.go:303] New node received, but we were unable to add it as we were couldn't find its node IP: host IP unknown
level=info msg="Can't find configuration for a new passive connection from:10.40.92.7" Topic=Peer

In AWS it's possible to fetch the internal IP from IMDS while bootstrapping the node and pass something like --node-ip=$IPADDR via KUBELET_EXTRA_ARGS before joining. For control-plane nodes using kubeadm, the node IP can be passed through the kubeadm join configuration, for example:

# -- 8< -- 8< --
nodeRegistration:
  name: "$LOCAL_HOSTNAME"
  kubeletExtraArgs:
    cloud-provider: "$CLOUD_PROVIDER"
    cloud-config: "$CLOUD_CONFIG"
    cluster-dns: 169.254.20.10
    node-ip: "$IPADDR"
# -- 8< -- 8< --

(the above variables obviously need to be expanded prior to use)
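
For completeness, a sketch of the IMDS lookup mentioned above for worker nodes, using IMDSv2; the /etc/default/kubelet drop-in path is an assumption and depends on how kubelet is packaged on the host:

#!/usr/bin/env bash
# Fetch the instance's primary private IP from IMDSv2 and hand it to kubelet
# via KUBELET_EXTRA_ARGS, written before kubeadm join so kubelet picks it up
# on first start.
set -euo pipefail

TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
IPADDR=$(curl -sS -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  "http://169.254.169.254/latest/meta-data/local-ipv4")

# Path assumed for deb-based hosts; adjust to wherever your distro sources
# KUBELET_EXTRA_ARGS from.
echo "KUBELET_EXTRA_ARGS=--node-ip=${IPADDR}" > /etc/default/kubelet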

edit: clarifications, some example

@aauren
Collaborator

aauren commented Jun 16, 2024

This appears to be related to an ongoing upstream conversation: kubernetes/kubernetes#125348
