
Initial BGP sync during kube-router startup extremely slow in kubernetes v1.29 #1668

Closed
acalm opened this issue May 10, 2024 · 6 comments

@acalm

acalm commented May 10, 2024

What happened?
After the cluster was upgraded from v1.28 to v1.29, we started observing increased workload startup times because the network was unavailable for ~5 minutes. Checking the kube-router logs, we saw that the initial BGP sync takes several minutes instead of a few seconds at most. While the sync is running, no running pods can use the network until the following is reported:
time="2024-05-10T10:13:30Z" level=info msg="sync finished" Topic=Server

I0510 10:09:06.802963 1 version.go:21] Running /usr/local/bin/kube-router version v1.4.0, built on 2022-01-05T17:01:42+0000, go1.17.5
I0510 10:09:07.105518 1 metrics_controller.go:164] Starting metrics controller
I0510 10:09:07.146602 1 network_routes_controller.go:1298] Could not find annotation `kube-router.io/bgp-local-addresses` on node object so BGP will listen on node IP: 10.40.52.161 address.
I0510 10:09:07.147089 1 network_routes_controller.go:376] `subnet` in CNI conf file is empty so populating `subnet` in CNI conf file with pod CIDR assigned to the node obtained from node spec.
I0510 10:09:07.284141 1 network_routes_controller.go:242] Setting MTU of kube-bridge interface to: 8981
I0510 10:09:07.285174 1 network_routes_controller.go:273] Starting network route controller
I0510 10:09:07.288415 1 network_routes_controller.go:1024] Could not find BGP peer info for the node in the node annotations so skipping configuring peer.
I0510 10:09:07.291280 1 network_services_controller.go:344] Starting network services controller
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.89.141" Topic=Peer
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.93.68" Topic=Peer
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.55.212" Topic=Peer
...
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.93.167" Topic=Peer
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.76.147" Topic=Peer
time="2024-05-10T10:09:07Z" level=info msg="Add a peer configuration for:10.40.81.213" Topic=Peer
I0510 10:09:07.491224 1 network_policy_controller.go:159] Starting network policy controller
I0510 10:09:07.517270 1 network_policy_controller.go:171] Starting network policy controller full sync goroutine
time="2024-05-10T10:09:14Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.68.45 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:09:14Z" level=info msg="Peer Up" Key=10.40.68.45 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:09:40Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.93.167 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:09:40Z" level=info msg="Peer Up" Key=10.40.93.167 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:09:41Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.86.87 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:09:41Z" level=info msg="Peer Up" Key=10.40.86.87 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:10:12Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.60.155 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:10:12Z" level=info msg="Peer Up" Key=10.40.60.155 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:10:37Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.56.231 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:10:37Z" level=info msg="Peer Up" Key=10.40.56.231 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:10:43Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.77.181 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:10:43Z" level=info msg="Peer Up" Key=10.40.77.181 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-10T10:10:44Z" level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.66.117 State=BGP_FSM_OPENSENT Topic=Peer
time="2024-05-10T10:10:44Z" level=info msg="Peer Up" Key=10.40.66.117 State=BGP_FSM_OPENCONFIRM Topic=Peer
...
time="2024-05-10T10:13:30Z" level=info msg="sync finished" Topic=Server

What did you expect to happen?
Initial BGP sync should complete in the same amount of time as in Kubernetes v1.28

How can we reproduce the behavior you experienced?
Steps to reproduce the behavior:

  1. Run kube-router v1.4.0 in a Kubernetes v1.29 cluster
  2. Watch the kube-router logs during startup and observe how long the initial BGP sync takes

**System Information:**

  • Kube-Router Version (kube-router --version): 1.4.0
  • Kube-Router Parameters:
/usr/local/bin/kube-router --run-router=true --run-firewall=true --run-service-proxy=true --kubeconfig=/etc/kuberouter/kuberouter.kubeconfig --metrics-port=9332 --disable-source-dest-check=false --bgp-graceful-restart --service-cluster-ip-range=10.32.0.0/16
  • Kubernetes Version (kubectl version) : v1.29.4
  • Cloud Type: AWS
  • Kubernetes Deployment Type: Kubeadm
  • Kube-Router Deployment Type: DaemonSet
  • Cluster Size: 20-200


Additional context
Two kubeadm / kube-controller-manager settings were changed during the upgrade:

  • node-cidr-mask-size: "25" (from the default /24)
  • maxPods: 64
  • Nodes run Ubuntu 20.04 with the linux-aws 5.15 kernel

We're aware v1.4.0 is pretty old, but the issue seems to be somewhere in how BGP is handled, or how peers are discovered. We've tried to pinpoint any other changes in our infrastructure but haven't been able to come up with anything other than the changed Kubernetes version.
@acalm acalm added the bug label May 10, 2024
@acalm
Author

acalm commented May 12, 2024

I don't know if it's of any concern, but running gobgp neighbor <NODEIP> in some kube-router pods shows that some neighbors had multiple OPEN messages sent and only one received, while others had just one sent and one received. I was thinking that may impact the initial peer addition.
For example:

# gobgp neighbor 10.40.91.212
...
  Message statistics:
                         Sent       Rcvd
    Opens:                 24          1
    Notifications:          0          0
    Updates:                8          8
    Keepalives:          1170       1170
    Route Refresh:          0          0
    Discarded:              0          0
    Total:               1202       1179

and

# gobgp neighbor 10.40.81.200
...
  Message statistics:
                         Sent       Rcvd
    Opens:                  1          1
    Notifications:          0          0
    Updates:                8          8
    Keepalives:          1179       1179
    Route Refresh:          0          0
    Discarded:              0          0
    Total:               1188       1188
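
In case it's useful for others, a rough way to spot this across the whole DaemonSet is to loop over the kube-router pods and print the OPEN counters per neighbor. A minimal sketch, assuming the pods run in kube-system with the usual k8s-app=kube-router label (adjust namespace/label to your deployment):

#!/usr/bin/env bash
# Print the OPEN message counters for every BGP neighbor of every kube-router pod,
# to spot peers that keep re-sending OPENs during the initial sync.
# Assumes the DaemonSet runs in kube-system with the label k8s-app=kube-router.
set -euo pipefail

for pod in $(kubectl -n kube-system get pods -l k8s-app=kube-router -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${pod} ==="
  # The first column of the 'gobgp neighbor' summary is the peer address.
  for peer in $(kubectl -n kube-system exec "${pod}" -- gobgp neighbor | awk 'NR>1 {print $1}'); do
    echo -n "${peer}  "
    kubectl -n kube-system exec "${pod}" -- gobgp neighbor "${peer}" | grep 'Opens:'
  done
done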

@aauren
Collaborator

aauren commented May 12, 2024

Unfortunately I can't think of any reason why Kubernetes version would affect BGP convergence. In the past, the only reasons why we've ever seen this type of thing are either:

  • BGP misconfiguration
  • Or more specifically BGP Graceful Restart misconfiguration

The fact that it took ~5 minutes looks suspiciously like the default BGP Graceful Restart deferral time: https://github.com/cloudnativelabs/kube-router/blob/v1.4/pkg/options/options.go#L80 (granted, that one is technically 6 minutes).
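
If that deferral is what you're hitting, the window is tunable. A rough sketch, assuming your build exposes the deferral-time flag defined in the options file linked above (double-check with kube-router --help):

# Example daemon args lowering the graceful restart deferral window from the
# default (~6 minutes) to 90 seconds; the flag name is assumed from the v1.4 options file.
/usr/local/bin/kube-router --run-router=true --run-firewall=true --run-service-proxy=true \
  --bgp-graceful-restart \
  --bgp-graceful-restart-deferral-time=90s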

As an aside, I would not run kube-router v1.4.0 against a 1.29 Kubernetes cluster. The client libraries it is built against target Kubernetes v1.22, which is well outside the two-minor-version skew that the upstream project allows. That it is able to run at all is a testament to how good Kubernetes' backwards compatibility is.

Beyond this, v1.4.0 is more than 2 years behind active development and the project won't be able to help further on this issue.

@aauren aauren closed this as completed May 12, 2024
@acalm
Author

acalm commented May 15, 2024

Thanks for your response @aauren!
Sorry for answering in a closed issue (and for creating issues for old versions), but I may have found something that could impact later versions as well!

I think this issue is actually due to changes made in Kubernetes v1.29, and it happens when you run kubelet with --cloud-provider=external: this leaves the node's status addresses list empty until the cloud provider controller populates it. I think this is the cause of these messages in the existing kube-router pods:
failed to get node object from api server: host IP unknown
From here, the NRC tries to read the address list, but it does not exist yet. I didn't capture these messages in my initial report, as I only included logs from the newly created kube-router pod (sorry).

Anyway, digging further I noticed that the new node only seemed to get added as a valid BGP neighbor once the network route controller ran a sync (default: every 5 minutes); I assume the NRC re-reads all nodes from the API when a sync starts.
After some more digging I noticed that when a new node comes up it is initially missing the status.addresses list; the list is populated a few seconds later, but by then it's too late, because the node isn't picked up until the controller runs its next sync.

I started looking into the changes made to kubelet in 1.29 and noticed kubernetes/kubernetes#121028. I have yet to confirm this, but it seems like you can get the old behavior back (not waiting for the external cloud controller to populate the address list) by passing --node-ip=$IPADDRESS to kubelet.
It looks like Calico recently ran into similar issues here: projectcalico/calico#8453
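
A quick way to check whether a freshly joined node is in this state is to inspect status.addresses directly, something like:

# With --cloud-provider=external this list stays empty until the cloud
# controller manager populates it; an empty result lines up with the
# "host IP unknown" messages above.
kubectl get node <NODE_NAME> -o jsonpath='{.status.addresses}'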

I'm going to try passing --node-ip to kubelet and see if it helps; I just wanted to report back in case someone else runs into this in later versions, as the issue may persist there as well.

edit: some clarifications, etc.

@aauren
Collaborator

aauren commented May 15, 2024

Thanks for reporting back about what you found! I think this will be helpful for future people running into this issue. Please let the thread know whether, and how, adding the node-ip parameter changes things.

The issue that we have open here #676 is specific to node annotations, but part of that work will be to watch nodes in general. Maybe we could watch for address updates as well whenever we work on that.

@acalm
Author

acalm commented May 16, 2024

I've verified that passing --node-ip=<NODE_IP_ADDRESS> to kubelet populates the status.addresses list read by the NRC here. This restores the original behavior and remediates the issues we observed in our environments: when a new node comes up, the other kube-router pods in the DaemonSet log:

level=info msg="Add a peer configuration for:10.40.95.197" Topic=Peer
level=info msg="skipped asn negotiation: peer-as: 64512, peer-type: internal" Key=10.40.95.197 State=BGP_FSM_OPENSENT Topic=Peer

instead of

bgp_peers.go:303] New node received, but we were unable to add it as we were couldn't find its node IP: host IP unknown
level=info msg="Can't find configuration for a new passive connection from:10.40.92.7" Topic=Peer

In AWS it's possible to fetch the internal IP from IMDS while bootstrapping the node and pass something like --node-ip=$IPADDR via KUBELET_EXTRA_ARGS before joining. For control-plane nodes using kubeadm, the node IP can be passed through the kubeadm join configuration, for example:

# -- 8< -- 8< --
nodeRegistration:
  name: "$LOCAL_HOSTNAME"
  kubeletExtraArgs:
    cloud-provider: "$CLOUD_PROVIDER"
    cloud-config: "$CLOUD_CONFIG"
    cluster-dns: 169.254.20.10
    node-ip: "$IPADDR"
# -- 8< -- 8< --

(the above variables obviously need to be expanded prior to use)
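
For completeness, a sketch of the IMDS lookup mentioned above for worker nodes, using IMDSv2; the /etc/default/kubelet drop-in path is an assumption and depends on how kubelet is packaged on the host:

#!/usr/bin/env bash
# Fetch the instance's primary private IP from IMDSv2 and hand it to kubelet
# via KUBELET_EXTRA_ARGS, written before kubeadm join so kubelet picks it up
# on first start.
set -euo pipefail

TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
IPADDR=$(curl -sS -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  "http://169.254.169.254/latest/meta-data/local-ipv4")

# Path assumed for deb-based hosts; adjust to wherever your distro sources
# KUBELET_EXTRA_ARGS from.
echo "KUBELET_EXTRA_ARGS=--node-ip=${IPADDR}" > /etc/default/kubelet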

edit: clarifications, some example

@aauren
Collaborator

aauren commented Jun 16, 2024

This appears to be related to an ongoing upstream conversation: kubernetes/kubernetes#125348
