calico-node fail to run on Kubernetes 1.29 (calico-typha error) #8453
Hi @rrsela, do you have multiple network interfaces? You could configure IP auto-detection https://docs.tigera.io/calico/latest/networking/ipam/ip-autodetection |
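For an operator-managed install, IP auto-detection is configured on the Installation resource. A minimal sketch, assuming the Kubernetes-internal-IP method from the linked docs (note this method relies on the Node object actually reporting an InternalIP):

```yaml
# Sketch: select calico/node's IP via the operator's Installation resource.
# The "kubernetes: NodeInternalIP" method tells calico/node to use the
# InternalIP published on the Node object, so it only works once kubelet
# or the cloud provider has actually set that address.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      kubernetes: NodeInternalIP
```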
Hi @MichalFupso - doesn't seem to help; I tried kubernetes-internal-ip, but in any case I only have a single interface:
root@k8s-master-1:/home/ubuntu# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc fq_codel state UP group default qlen 1000
link/ether fa:16:3e:5b:d9:16 brd ff:ff:ff:ff:ff:ff
altname enp0s3
inet 192.168.3.145/23 metric 100 brd 192.168.3.255 scope global dynamic ens3
valid_lft 3455sec preferred_lft 3455sec
inet6 fe80::f816:3eff:fe5b:d916/64 scope link
valid_lft forever preferred_lft forever |
This sounds like Typha isn't reporting ready, and so calico/node isn't able to find a valid instance to connect to. Do your typha pods report ready in the k8s API? |
@caseydavenport Typha reports as ready (probably because the health check passes):
root@k8s-master-1:/home/ubuntu# kubectl get pods -n calico-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-6ff5954966-zh929 0/1 Pending 0 10m <none> <none> <none> <none>
calico-node-5f9g2 0/1 Running 0 10m <none> k8s-master-1 <none> <none>
calico-node-6ncbj 0/1 Running 0 10m <none> k8s-worker-1 <none> <none>
calico-typha-75b899457d-cj2vz 1/1 Running 0 10m <none> k8s-worker-1 <none> <none>
csi-node-driver-bgptp 2/2 Running 0 10m 10.244.230.1 k8s-worker-1 <none> <none>
csi-node-driver-zbl62 2/2 Running 0 10m 10.244.196.1 k8s-master-1 <none> <none>
root@k8s-master-1:/home/ubuntu# kubectl get events -n calico-system --field-selector involvedObject.name=calico-typha-75b899457d-cj2vz
LAST SEEN TYPE REASON OBJECT MESSAGE
10m Normal Scheduled pod/calico-typha-75b899457d-cj2vz Successfully assigned calico-system/calico-typha-75b899457d-cj2vz to k8s-worker-1
10m Normal Pulling pod/calico-typha-75b899457d-cj2vz Pulling image "docker.io/calico/typha:v3.27.0"
10m Normal Pulled pod/calico-typha-75b899457d-cj2vz Successfully pulled image "docker.io/calico/typha:v3.27.0" in 8.84s (8.84s including waiting)
10m Normal Created pod/calico-typha-75b899457d-cj2vz Created container calico-typha
10m Normal Started pod/calico-typha-75b899457d-cj2vz Started container calico-typha |
What does kubectl get endpoints -n calico-system show? |
@matthewdupre it has none:
kubectl get endpoints -n calico-system
NAME ENDPOINTS AGE
calico-typha <none> 10m |
@rrsela Yeah, that would cause this. Looking more carefully, I see kubectl is reporting no IP for Typha (and kube-controllers doesn't even have a NODE?!). Normally you'd see the node IP appearing for Typha (and calico-node), and then everything else would probably work normally. The question now is "how is Typha running without Kubernetes knowing its IP?" |
I tried running Typha in debug mode but failed to do so in operator mode... But it made me think, could this be a Tigera issue? All I can see inside its log is repeating info & error messages regarding installation reconciliation:
{"level":"info","ts":"2024-03-07T06:34:57Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-03-07T06:34:57Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-03-07T06:34:58Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"periodic-5m0s-reconcile-event"}
{"level":"info","ts":"2024-03-07T06:34:58Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"tigera-operator","Request.Name":"tigera-ca-private"}
{"level":"info","ts":"2024-03-07T06:34:58Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico-system"}
{"level":"info","ts":"2024-03-07T06:34:59Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"default"}
{"level":"info","ts":"2024-03-07T06:34:59Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"active-operator"}
{"level":"info","ts":"2024-03-07T06:34:59Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"tigera-operator","Request.Name":"node-certs"}
{"level":"info","ts":"2024-03-07T06:35:00Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"tigera-operator","Request.Name":"typha-certs"}
{"level":"info","ts":"2024-03-07T06:35:00Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-typha"}
{"level":"info","ts":"2024-03-07T06:35:01Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-node"}
{"level":"info","ts":"2024-03-07T06:35:01Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico-node"}
{"level":"info","ts":"2024-03-07T06:35:01Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-cni-plugin"}
{"level":"info","ts":"2024-03-07T06:35:02Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico-cni-plugin"}
{"level":"info","ts":"2024-03-07T06:35:02Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"csi-node-driver"}
{"level":"info","ts":"2024-03-07T06:35:02Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-kube-controllers"}
{"level":"info","ts":"2024-03-07T06:35:03Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico-kube-controllers"}
{"level":"info","ts":"2024-03-07T06:35:27Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico"}
{"level":"info","ts":"2024-03-07T06:35:27Z","logger":"controller_windows","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-03-07T06:35:27Z","logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"error","ts":"2024-03-07T06:35:27Z","logger":"controller_apiserver","msg":"Waiting for Installation to be ready","Request.Namespace":"","Request.Name":"default","reason":"ResourceNotReady","stacktrace":"github.com/tigera/operator/pkg/controller/status.(*statusManager).SetDegraded\n\t/go/src/github.com/tigera/operator/pkg/controller/status/status.go:356\ngithub.com/tigera/operator/pkg/controller/apiserver.(*ReconcileAPIServer).Reconcile\n\t/go/src/github.com/tigera/operator/pkg/controller/apiserver/apiserver_controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235"}
{"level":"info","ts":"2024-03-07T06:35:27Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"default"} |
This makes sense - there are no nodes that Kubernetes deems acceptable for scheduling pods yet, so it hasn't been assigned one (note the "Pending" state).
Yeah, this is strange, and suggests perhaps a problem with kubelet (which reports that IP address) or with the Node objects themselves - maybe check to see if kubelet logs might explain more about why that pod has no IP... |
Actually:
root@k8s-master-1:/home/ubuntu# kubectl get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master-1 Ready control-plane 6d12h v1.29.0-eks-a5ec690 <none> <none> Ubuntu 22.04.3 LTS 5.15.0-92-generic containerd://1.6.27
k8s-worker-1 Ready <none> 6d12h v1.29.0-eks-a5ec690 <none> <none> Ubuntu 22.04.3 LTS 5.15.0-92-generic containerd://1.6.27

Although I do see the node's IP as a Calico annotation:
root@k8s-master-1:/home/ubuntu# kubectl get node k8s-master-1 -oyaml
apiVersion: v1
kind: Node
metadata:
annotations:
csi.volume.kubernetes.io/nodeid: '{"csi.tigera.io":"k8s-master-1"}'
kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: "0"
projectcalico.org/IPv4Address: 192.168.2.57/23
projectcalico.org/IPv4IPIPTunnelAddr: 10.244.196.0
volumes.kubernetes.io/controller-managed-attach-detach: "true"

While on a 1.28 node I would get the node status' address attributes like these:

status:
  addresses:
  - address: XXX.XXX.XXX.XXX
    type: InternalIP
  - address: YYY
    type: InternalDNS
  - address: ZZZ
    type: Hostname

On the 1.29 node I don't see them at all... BTW nothing much in the kubelet log regarding Typha itself:
Regarding the csi-node-driver, the kubelet log shows this for
|
@rrsela I am not entirely sure why that is, but it seems likely that the missing node IPs are causing this issue with Calico. I'd recommend investigating why those IPs stopped showing up in your environment. |
@caseydavenport the thing is, the same setup works with Flannel or Cilium, but not with Calico... For example with Flannel:
root@demo-master-1:/# kubectl get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
demo-master-1 Ready control-plane 4m41s v1.29.0-eks-a5ec690 192.168.2.216 <none> Ubuntu 22.04.3 LTS 5.15.0-92-generic containerd://1.6.27
demo-worker-1 Ready <none> 4m22s v1.29.0-eks-a5ec690 192.168.2.65 <none> Ubuntu 22.04.3 LTS 5.15.0-92-generic containerd://1.6.27
root@demo-master-1:/# kubectl get node demo-master-1 -oyaml
apiVersion: v1
kind: Node
metadata:
annotations:
csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-2a80c38632a34a80abcb2a4f55b275d3"}'
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"42:42:55:8f:b6:41"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 192.168.2.216
kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
...
status:
addresses:
- address: 192.168.2.216
type: InternalIP
- address: demo-master-1.symphony.local
type: InternalDNS
- address: demo-master-1.symphony.local
type: Hostname
... |
@rrsela do your nodes show Internal IPs prior to installing a CNI plugin? I'm a little bit stumped about why this would have anything to do with Calico, since Calico doesn't actually modify the IP addresses of the nodes. |
@caseydavenport I validated that the nodes do not have internal IPs prior to the CNI installation (also checked with Flannel to be sure) - it seems this field is populated after the CNI deployment, via the CCM, which in turn depends on the CNI being functional in order to be deployed... |
Based on the above I managed to replicate the issue with vanilla Kubernetes, not just EKS-D - as long as Kubelet is defined with
Not sure how/why it works on 1.28 though - maybe some internal cloud-provider code removals? |
Interesting - does the external cloud provider CCM pod run with host networking or with pod networking? |
@caseydavenport it runs with pod networking - at least the AWS cloud controller manager is.. BTW I found the 1.29 change regarding the default node IP behaviour - until this PR was merged nodes always had IPs even prior to the cloud provider (CCM) deployment... |
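If the root cause is the node lacking an InternalIP until the CCM runs, one mitigation worth testing (an assumption on my part, not something confirmed in this thread) is passing an explicit node IP to kubelet, e.g. via kubeadm's kubeletExtraArgs; the address below is illustrative:

```yaml
# Sketch (assumption): pin the node IP on kubelet via kubeadm so the
# Node object may carry an InternalIP before the external CCM starts.
# Whether this restores the pre-1.29 behaviour when kubelet runs with
# an external cloud provider should be verified; 192.168.3.145 is just
# the example address from this thread.
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
    node-ip: "192.168.3.145"
```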
Ok, so it sounds like because of that change nodes are not getting temporary IP addresses, which prevents calico/node from talking to the apiserver via Typha. Since calico/node can't talk to Typha, pods never get networking and thus the external CCM isn't able to start, and we are in a deadlock situation. These are the options that jump out at me, none of which are super appealing or necessarily a quick fix:
|
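Given the host-networking question above, one way out of the deadlock (a sketch, assuming your external CCM is deployed as a Deployment or DaemonSet you can edit) is to run the CCM with host networking so it doesn't need pod networking to start:

```yaml
# Sketch: pod-spec fragment for the external CCM workload. With
# hostNetwork: true the CCM can start before any CNI is functional,
# publish the node addresses, and let calico/node come up afterwards.
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
```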
@rrsela I am closing it now, but feel free to reopen if you have any new information |
Expected Behavior
calico-node should run after a fresh tigera-operator on-prem deployment on Kubernetes 1.29
Current Behavior
calico-node fails to run due to readiness probe error:
calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
I can't perform kubectl logs, but looking through crictl logs on the container I see the following:
Inside the calico-typha logs I see the below warnings after the readiness healthcheck:
While normally I would expect to see something like
sync_server.go 430: Accepted from 192.168.3.156:47314 port=5473
Possible Solution
N/A
Steps to Reproduce (for bugs)
Context
Trying to deploy Calico on on-prem kubeadm-based deployment of EKS-D 1.29 using the tigera-operator per the documentation.
Same configuration works fine on EKS-D 1.28, and also when adding a 1.29 worker node to an existing 1.28 control-plane which is already running Calico...
Worth mentioning that while the official documentation does not state Calico supports Kubernetes 1.29, it seems people are using it in practice, and the documentation even references an issue with the nftables alpha feature of 1.29 (off by default).
Your Environment