1 Node stopped from connecting to other nodes in Kubernetes Cluster #3392
Comments
Do you have the logs from the node with ID |
Yes, the log is below. The error started today at 2018/09/03 02:01:09.890616; I copied the log from yesterday until after the error started. Engine02 is still working; Engine01 is the one that stopped. Unfortunately I restarted it several times, so I don't know if I can still capture the log from Engine01.
|
The symptom is that two nodes in your cluster disagree on the contents of the IPAM data structure. Since there's no clue in the first log you provided as to why, I asked for the other. Absent any clues as to what happened at the time of the failure, this is the same as #3310, and I wrote about how you can clean up at #3310 (comment) |
It worked: I deleted that file and rebooted, and now everything is working. To give you more info, I found errors on some of my pods complaining about max open files; I'm not sure whether that is related or caused this. Thanks for your help, I spent around 6 hours racking my brain to work out what happened. |
@shahbour Thanks for the logs. Do you know whether
Maybe you can find the stopped container with |
Yes, engine02 was up the whole time. The error started at 02:01, as shown below:
At 13:18 I think I was restarting engine01 to check whether it would come up. Engine02 was definitely up, because all our pods were on it and the system was running. The logs below are from engine02 at 13:18:52:
|
@shahbour Could you run |
here we go
|
Sweet! Here is the culprit: Any idea why it does not have any internalIP? |
No, I don't. I noticed it a few weeks ago and tried to set it manually, but that did not work for me and I forgot about it. |
I don't see how to set it manually, so you might need to drain the node and re-deploy k8s on it. I'm working on a fix for Weave Net but it will take a while until it gets released. |
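For anyone checking their own cluster for the same condition, here is a minimal sketch that lists nodes with no `InternalIP` entry. It assumes the input is the parsed output of `kubectl get nodes -o json`, where each node carries a `status.addresses` list of `type`/`address` pairs; the function name `nodes_missing_internal_ip` is made up for this example and is not part of Weave Net or kubectl.

```python
def nodes_missing_internal_ip(node_list):
    """Return the names of nodes whose status.addresses has no InternalIP.

    node_list is assumed to be the dict parsed from `kubectl get nodes -o json`.
    """
    missing = []
    for node in node_list.get("items", []):
        # status.addresses may be absent or null on a misconfigured node
        addresses = node.get("status", {}).get("addresses") or []
        has_internal = any(
            a.get("type") == "InternalIP" and a.get("address")
            for a in addresses
        )
        if not has_internal:
            missing.append(node["metadata"]["name"])
    return missing
```

One way to use it is to pipe the kubectl JSON into a small script that calls `json.load(sys.stdin)` and passes the result to the function. An empty result means every node reports an `InternalIP`.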
I fixed the internal and external IPs:
|
Just out of curiosity, how did you set the IP addrs? |
As you said, I drained the node, then updated kubelet (it needed to be updated to 1.11.2) and restarted the node. After it came up, it was fixed. |
This morning I checked, and it seems the node is losing its IP address somehow:
|
Do not exclude k8s node without any IP addr in reclaim
Fixed in #3393 |
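To illustrate why a node without an IP address could matter for reclaim, here is a toy model based only on the fix title and the discussion above. It is not Weave Net's code: the function `peers_to_reclaim`, its parameters, and the data shapes are all hypothetical. The assumption is that reclaim compares the peers known to IPAM against the nodes Kubernetes reports, and treats any peer not found among the nodes as removed.

```python
def peers_to_reclaim(ipam_peers, k8s_nodes, skip_nodes_without_ip):
    """Return IPAM peer names that would be treated as removed.

    ipam_peers: iterable of peer names known to IPAM (hypothetical shape).
    k8s_nodes:  list of (node_name, addresses) tuples (hypothetical shape).
    skip_nodes_without_ip: True models the suspected pre-fix behaviour.
    """
    live = set()
    for name, addresses in k8s_nodes:
        if skip_nodes_without_ip and not addresses:
            # The node is dropped from the live set even though it still exists.
            continue
        live.add(name)
    # Anything IPAM knows about that is not "live" looks like a removed peer.
    return sorted(set(ipam_peers) - live)
```

In this model, a running node with an empty address list is reclaimed when `skip_nodes_without_ip` is true and left alone when it is false. That matches the symptom in this thread, where two nodes disagreed about who owned an IP range.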
In my case, only one node was unable to connect to the other nodes in the Weave cluster.
|
What you expected to happen?
What happened?
Yesterday one of the 4 nodes in my Kubernetes cluster stopped working: any traffic going to pods on other nodes stopped working, and vice versa.
How to reproduce it?
I could not find any difference in configuration, so I can't reproduce it, but the problem is still present now.
Anything else we need to know?
This is the node 192.168.70.230 that stopped talking to others with reason: Received update for IP range I own at 10.44.0.0 v4961: incoming ...
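To spot this message in a saved log, a small sketch like the one below may help. It assumes only the fragment quoted above ("Received update for IP range I own at <address> v<version>"); the rest of the log line format is unknown, so the pattern deliberately matches nothing beyond that fragment. The names `OWNED_RANGE_UPDATE` and `find_owned_range_conflicts` are made up for this example.

```python
import re

# Matches only the fragment quoted in this issue; anything after the
# version number (e.g. the "incoming ..." part) is intentionally ignored.
OWNED_RANGE_UPDATE = re.compile(
    r"Received update for IP range I own at (\S+) v(\d+)"
)

def find_owned_range_conflicts(lines):
    """Return (range_start, version) pairs for each matching log line."""
    hits = []
    for line in lines:
        match = OWNED_RANGE_UPDATE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits
```

Passing an open log file to `find_owned_range_conflicts` (a file object iterates line by line) gives the list of ranges and versions the node complained about.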
Versions:
Logs:
Network: