
"link has incompatible addresses" after restarting Flannel k8s pod #1060

Closed
ljfranklin opened this issue Nov 3, 2018 · 12 comments · Fixed by #1401
Comments

@ljfranklin

The Flannel pod starts successfully if the flannel.1 link doesn't exist. But running kubectl delete pod kube-flannel-... leaves the flannel.1 link behind, and the subsequently created pod fails to start with the following error:

I1103 17:54:41.197308       1 main.go:475] Determining IP address of default interface
I1103 17:54:41.198443       1 main.go:488] Using interface with name eth0 and address 192.168.1.244
I1103 17:54:41.198533       1 main.go:505] Defaulting external address to interface address (192.168.1.244)
I1103 17:54:41.698456       1 kube.go:131] Waiting 10m0s for node controller to sync
I1103 17:54:41.698812       1 kube.go:294] Starting kube subnet manager
I1103 17:54:42.699051       1 kube.go:138] Node controller sync successful
I1103 17:54:42.699241       1 main.go:235] Created subnet manager: Kubernetes Subnet Manager - k8s-worker2
I1103 17:54:42.699302       1 main.go:238] Installing signal handlers
I1103 17:54:42.796179       1 main.go:353] Found network config - Backend type: vxlan
I1103 17:54:42.796884       1 vxlan.go:120] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
E1103 17:54:42.799114       1 main.go:280] Error registering network: failed to configure interface flannel.1: failed to ensure address of interface flannel.1: link has incompatible addresses. Remove additional addresses and try again. &netlink.Vxlan{LinkAttrs:netlink.LinkAttrs{Index:715, MTU:1450, TxQLen:0, Name:"flannel.1", HardwareAddr:net.HardwareAddr{0x96, 0x11, 0x68, 0xa1, 0x57, 0x2b}, Flags:0x13, RawFlags:0x11043, ParentIndex:0, MasterIndex:0, Namespace:interface {}(nil), Alias:"", Statistics:(*netlink.LinkStatistics)(0x13c1a0e4), Promisc:0, Xdp:(*netlink.LinkXdp)(0x13d673a0), EncapType:"ether", Protinfo:(*netlink.Protinfo)(nil), OperState:0x0}, VxlanId:1, VtepDevIndex:2, SrcAddr:net.IP{0xc0, 0xa8, 0x1, 0xf4}, Group:net.IP(nil), TTL:0, TOS:0, Learning:false, Proxy:false, RSC:false, L2miss:false, L3miss:false, UDPCSum:true, NoAge:false, GBP:false, Age:300, Limit:0, Port:8472, PortLow:0, PortHigh:0}
I1103 17:54:42.799393       1 main.go:333] Stopping shutdownHandler...

This is on a Raspberry Pi 3 B+, so the ARM architecture may be a factor.
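
For reference, the startup check that produces this error can be sketched roughly as follows (a simplified Go sketch of what the vxlan backend appears to do; the function name ensureV4Address is illustrative, and only the error text is taken from the log above):

// Simplified sketch of the startup address check behind the error above
// (approximates flannel v0.10's vxlan backend; names are illustrative).
package sketch

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// ensureV4Address assigns want to the link, but bails out if the link
// already carries more than one IPv4 address - which is what a leftover
// flannel.1 from a previous run can end up with.
func ensureV4Address(link netlink.Link, want netlink.Addr) error {
	existing, err := netlink.AddrList(link, netlink.FAMILY_V4)
	if err != nil {
		return err
	}
	// A single stale address from a previous lease gets replaced...
	if len(existing) == 1 && !existing[0].Equal(want) {
		if err := netlink.AddrDel(link, &existing[0]); err != nil {
			return fmt.Errorf("failed to remove address from %s: %v", link.Attrs().Name, err)
		}
		existing = nil
	}
	// ...but two or more addresses abort startup with the error seen above.
	if len(existing) > 1 {
		return fmt.Errorf("link has incompatible addresses. Remove additional addresses and try again. %#v", link)
	}
	if len(existing) == 0 {
		return netlink.AddrAdd(link, &want)
	}
	return nil
}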

Expected Behavior

Recreating the Flannel pod should succeed even if the flannel.1 link exists.

Current Behavior

The Flannel pod goes into CrashLoopBackOff after it is recreated. To allow the pod to start successfully, SSH onto the worker and run sudo ip link delete flannel.1; recreating the pod then succeeds.
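
For anyone scripting the workaround, a minimal Go sketch of that same ip link delete step, using the vishvananda/netlink library flannel itself depends on (an illustration, not part of flannel; it must run as root on the affected node):

// Programmatic equivalent of `sudo ip link delete flannel.1`.
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Look up the stale vxlan device left behind by the previous pod.
	link, err := netlink.LinkByName("flannel.1")
	if err != nil {
		log.Fatalf("lookup flannel.1: %v", err)
	}
	// Remove it so the recreated pod can configure a fresh one.
	if err := netlink.LinkDel(link); err != nil {
		log.Fatalf("delete flannel.1: %v", err)
	}
	log.Println("deleted flannel.1")
}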

Possible Solution

?

Steps to Reproduce (for bugs)

  1. Deploy k8s with flannel
  2. kubectl delete pod kube-flannel-...
  3. See that the recreated pod does not start and logs the error message above

Context

About once a week my flannel pods enter this state, possibly due to a crash or restart of the pod, and I have to manually SSH in to delete the flannel link on each affected node.

Related issue: #883

Your Environment

  • Flannel version: v0.10.0
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.2.24
  • Kubernetes version (if used): 1.12.1
  • Operating System and version: HypriotOS 1.9.0 on Raspberry Pi 3 B+
  • Link to your project (optional): n/a
@b3nw

b3nw commented Dec 18, 2018

  • Flannel version: v0.10.0
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.2.24
  • Kubernetes version (if used): 1.13.1
  • Operating System and version: Debian 9.6 on Raspberry Pi 3 B

@jmeridth

Same setup as @b3nw. The only difference is I'm using HypriotOS 1.9.0 on a Raspberry Pi 3 B+.

@mr-sour

mr-sour commented Jan 17, 2019

Got this same thing on HypriotOS 1.10.0-rc2 on a Raspberry Pi 3 B+.

@bobmhong

@mr-sour My config is the same as yours and I have the same results. I'm glad at least that sudo ip link delete flannel.1 on the failing node allows the pod to recreate successfully after deleting the failing pod.

@dippynark

Seeing the same issue with Flannel v0.11.0.

uname -a: Linux pirate1 4.19.58-v7+ #1245 SMP Fri Jul 12 17:25:51 BST 2019 armv7l GNU/Linux

@markus-seidl

Same issue from applying

https://raw.githubusercontent.com/coreos/flannel/master/Documentation/k8s-manifests/kube-flannel-rbac.yml
https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

on a small bramble (two RPi 4s). I noticed that there are two pods; after deleting the flannel.1 link, the first one starts without problems, but the second one enters a crash loop (with the same error).

Does anyone else have two "kube-flannel-ds-arm-xxxx" pods? Maybe that's the problem?

@alexandrujuncu

Still an issue with Hypriot v1.11.1 + K8s 1.16.1 + Flannel 0.11.0

@mkuchenbecker

@markus-seidl I can confirm, I have two.

pi@raspi-0:~ $ k get pods --all-namespaces
NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE
kube-system   coredns-5644d7b6d9-gtj6h          1/1     Running   0          10m
kube-system   coredns-5644d7b6d9-lx59g          1/1     Running   0          10m
kube-system   etcd-raspi-0                      1/1     Running   0          10m
kube-system   kube-apiserver-raspi-0            1/1     Running   0          10m
kube-system   kube-controller-manager-raspi-0   1/1     Running   0          10m
kube-system   kube-flannel-ds-arm-5brn7         1/1     Running   0          10m
kube-system   kube-flannel-ds-arm-c4hbv         0/1     Error     14         5m41s
kube-system   kube-proxy-htk7f                  1/1     Running   0          10m
kube-system   kube-proxy-kcql2                  1/1     Running   0          5m41s
kube-system   kube-scheduler-raspi-0            1/1     Running   0          10m

@kluzny

kluzny commented May 27, 2020

@markus-seidl @mkuchenbecker It is normal to have one instance on each node: kube-flannel runs as a DaemonSet.

kyle@noobuntu:~/Development/raspberry_patch$ kubectl get pods -o wide -n kube-system | grep flannel
kube-flannel-ds-arm-6lgx4       0/1     CrashLoopBackOff   5          4m49s   192.168.1.101   alpha   <none>           <none>
kube-flannel-ds-arm-nrrk7       0/1     CrashLoopBackOff   5          4m43s   192.168.1.102   beta    <none>           <none>
kube-flannel-ds-arm-nv8zx       0/1     CrashLoopBackOff   5          4m30s   192.168.1.104   delta   <none>           <none>
kube-flannel-ds-arm-rfkft       0/1     CrashLoopBackOff   5          4m57s   192.168.1.103   gamma   <none>           <none>

Manually deleting the link on the node and then deleting the pod, as others have suggested, seems to be the resolution.

@pikomen

pikomen commented Aug 2, 2020

Same issue on k8s v1.16.11, Flannel v0.11.0.
I have two clusters: the first on VMware and the second on Nutanix AHV.
The installation of both k8s clusters was performed by Kubespray in the same way.
On the VMware cluster I can delete Flannel pods without any problems, but on Nutanix I see CrashLoopBackOff on some nodes; the solution is to delete the flannel.1 link (just a temporary patch).

I0802 12:34:12.140814 1 vxlan.go:120] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
E0802 12:34:12.141386 1 main.go:289] Error registering network: failed to configure interface flannel.1: failed to ensure address of interface flannel.1: link has incompatible addresses. Remove additional addresses and try again. &netlink.Vxlan{LinkAttrs:netlink.LinkAttrs{Index:25, MTU:1450, TxQLen:0, Name:"flannel.1", HardwareAddr:net.HardwareAddr{0xe6, 0x96, 0xa6, 0x84, 0x4b, 0x66}, Flags:0x13, RawFlags:0x11043, ParentIndex:0, MasterIndex:0, Namespace:interface {}(nil), Alias:"", Statistics:(*netlink.LinkStatistics)(0xc4201c50f4), Promisc:0, Xdp:(*netlink.LinkXdp)(0xc42042a360), EncapType:"ether", Protinfo:(*netlink.Protinfo)(nil), OperState:0x0}, VxlanId:1, VtepDevIndex:2, SrcAddr:net.IP{0xa, 0x35, 0xa2, 0x65}, Group:net.IP(nil), TTL:0, TOS:0, Learning:false, Proxy:false, RSC:false, L2miss:false, L3miss:false, UDPCSum:true, NoAge:false, GBP:false, Age:300, Limit:0, Port:8472, PortLow:0, PortHigh:0}
I0802 12:34:12.141422 1 main.go:366] Stopping shutdownHandler...

@VUZhuangweiKang

I got the same issue. Any progress on resolving this problem?

@Letme

Letme commented Dec 4, 2020

I had the same issue when the power went down and I then tried to get the nodes back... The trick with removing the link actually helps.

jcaamano added a commit to jcaamano/flannel that referenced this issue Jan 20, 2021
Currently, flannel interface IP addresses are checked on startup when
using the vxlan and ipip backends. If multiple addresses are found, startup
fails fatally. If only one address is found and it is not the currently
leased one, it is assumed to come from a previous lease and is removed.

These criteria seem arbitrary both in how the check is done and in its
timing. It may cause failures in situations where it is not strictly
necessary, for example if the node is running a DHCP client that is
assigning link-local addresses to all interfaces. It might also fail on
unexpected flannel restarts, which are completely unrelated to
the external event that caused the unexpected modification of the
flannel interface.

This patch proposes to consider and check only IP addresses within the
flannel network, and takes the simple approach of ignoring any other IP
addresses, assuming these pose no problem for flannel operation.

A discarded but more aggressive alternative would be to remove all
addresses that are not the currently leased one.

Fixes flannel-io#1060

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
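
A simplified sketch of the approach this commit describes (illustrative only, not the exact patch from #1401; ensureAddress is a hypothetical name, and flannelNet stands for the cluster's pod network, e.g. 10.244.0.0/16):

// Sketch of the fixed check: only addresses inside the flannel network
// are considered; unrelated addresses (e.g. link-local ones added by a
// DHCP client) are ignored instead of aborting startup.
package sketch

import (
	"net"

	"github.com/vishvananda/netlink"
)

// ensureAddress assigns want to the link, removing only stale addresses
// that fall inside flannelNet and leaving all other addresses alone.
func ensureAddress(link netlink.Link, want netlink.Addr, flannelNet *net.IPNet) error {
	existing, err := netlink.AddrList(link, netlink.FAMILY_V4)
	if err != nil {
		return err
	}
	found := false
	for _, addr := range existing {
		if !flannelNet.Contains(addr.IP) {
			continue // not a flannel address: ignore rather than fail
		}
		if addr.Equal(want) {
			found = true
			continue
		}
		// A stale flannel address from a previous lease is removed.
		if err := netlink.AddrDel(link, &addr); err != nil {
			return err
		}
	}
	if !found {
		return netlink.AddrAdd(link, &want)
	}
	return nil
}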
knisbet pushed a commit to gravitational/flannel that referenced this issue Jan 25, 2022
Concern only about flannel ip addresses
(cherry picked from commit 33a2fac)
knisbet pushed a commit to gravitational/flannel that referenced this issue Jan 25, 2022
* vxlan: Generate MAC address before creating a link

systemd 242+ assigns MAC addresses to all virtual devices that don't
already have an address assigned. That resulted in systemd overriding
the MAC addresses of flannel.* interfaces. The fix that prevents systemd
from setting the address is to define a concrete MAC address when
creating the link.

Fixes: flannel-io#1155
Ref: k3s-io/k3s#4188
Signed-off-by: Michal Rostecki <[email protected]>
(cherry picked from commit 0198d5d)

* Concern only about flannel ip addresses

(cherry picked from commit 33a2fac)

* Fix flannel hang if lease expired

(cherry picked from commit 78035d0)

* subnets: move forward the cursor to skip illegal subnet

This PR fixes an issue where flannel gets an illegal subnet event while
watching leases but doesn't move the etcd cursor forward, and so
gets stuck on the same invalid event forever.

(cherry picked from commit 1a1b6f1)

* fix cherry-pick glitches and test failures

* disable udp backend tests since we don't actually have the udp backend in our fork

Co-authored-by: Michal Rostecki <[email protected]>
Co-authored-by: Jaime Caamaño Ruiz <[email protected]>
Co-authored-by: Chun Chen <[email protected]>
Co-authored-by: huangxuesen <[email protected]>
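
The MAC-address fix cherry-picked above amounts to choosing an address before the link exists, so systemd has nothing left to fill in. A minimal sketch of that idea (function and field values are illustrative, not the exact patch):

// Generate a MAC address up front so systemd 242+ does not assign its
// own to the vxlan link after creation.
package sketch

import (
	"crypto/rand"
	"fmt"
	"net"

	"github.com/vishvananda/netlink"
)

// newHardwareAddr returns a random locally-administered, unicast MAC.
func newHardwareAddr() (net.HardwareAddr, error) {
	hw := make(net.HardwareAddr, 6)
	if _, err := rand.Read(hw); err != nil {
		return nil, err
	}
	hw[0] = (hw[0] &^ 0x01) | 0x02 // clear multicast bit, set local bit
	return hw, nil
}

func createVxlanLink() error {
	hw, err := newHardwareAddr()
	if err != nil {
		return err
	}
	link := &netlink.Vxlan{
		LinkAttrs: netlink.LinkAttrs{
			Name:         "flannel.1",
			HardwareAddr: hw, // set before creation; nothing for systemd to override
		},
		VxlanId: 1,
		Port:    8472,
	}
	if err := netlink.LinkAdd(link); err != nil {
		return fmt.Errorf("failed to create vxlan link: %v", err)
	}
	return nil
}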