EKS - Pods randomly losing network connectivity and a few other problems on AWS EKS #3446
Comments
Any ideas on this one?
@jsalatiel any chance you could ssh into a Node and run |
Hi @antoninbas, I am not able to reproduce new pods randomly losing network connectivity, but I can easily replicate the crash loop just by trying to update Antrea from 1.5.0 to 1.5.1 (on a new test cluster). Maybe this relates to #3471.
Output of ip addr
This seems to be a duplicate of #3217. This issue was fixed on the main branch. If you could try deploying the latest Antrea to confirm that it resolves your issue, it would be great.
Unfortunately, it seems the fix was not back-ported to 1.5, which is why release 1.5.1 also suffers from the issue. This is a mistake on our part; we should include the patch in 1.5.2.
btw, what's the correct way to update antrea?
So I used This fixes the crashloop thanks. Do you think that has anything to do with the new pods losing network connectivity? |
I would think so. This crash would happen after any agent restart, and then there is no way to recover. After a few days it is possible that the agent is restarted for various reasons.
I think I spoke too soon. I restarted one of the nodes and it is in CrashLoopBackOff again with:
@jsalatiel Actually it's pretty silly. The YAML manifest I pointed you to and that you applied ( If you replace all instances of |
@jsalatiel were you able to confirm that this is resolved with the |
Yes. Closing it.
Describe the bug
I am trying to use Antrea to replace Calico on AWS for network policies only. In other words, I want to keep using the AWS CNI for pod IPs and use Antrea only for network policies.
Unfortunately, after a while, newly scheduled pods have no working network. The already running pods are just fine.
When that happens, the only thing that I can do to make new pods be able to work is create a new node group, drain and delete the old ones.
I tried to capture the logs when I start a pod and the pod does not get network connectivity. The log is full of:
Trying to restart antrea-agent sends it into a crash loop with:
I can't find a way to easily replicate this because it only starts to happen after a few days.
The only thing I know is that the crash loop is the same one I get whenever I try to update Antrea on EKS, which also only works after creating new node groups and deleting the old ones.
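To spot the crash loop without watching each node by hand, the pod list can be checked for containers waiting in CrashLoopBackOff. This is only a minimal sketch, not something taken from the Antrea project: it assumes the input is the JSON printed by `kubectl get pods -n kube-system -o json`, and that the agent pods have names starting with `antrea-agent`. The function name `crashlooping_pods` and the `name_prefix` parameter are hypothetical.

```python
import json


def crashlooping_pods(pod_list_json, name_prefix="antrea-agent"):
    """Return names of pods with a container waiting in CrashLoopBackOff.

    pod_list_json: text of a Kubernetes PodList in JSON form.
    name_prefix: assumed pod name prefix for the agent pods (hypothetical default).
    """
    pods = json.loads(pod_list_json).get("items", [])
    found = []
    for pod in pods:
        name = pod.get("metadata", {}).get("name", "")
        if not name.startswith(name_prefix):
            continue
        # containerStatuses may be absent while a pod is still being scheduled
        for status in pod.get("status", {}).get("containerStatuses", []):
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append(name)
                break  # one crash-looping container is enough to report the pod
    return found
```

With the `kubectl` output saved to a file, `crashlooping_pods(open("pods.json").read())` would list the affected agent pods, which at least shows which nodes are hit before deciding to replace a node group.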
Versions: