Random Client Read Timeouts to RDS After Enabling Network Policies via AWS CNI #204
This also appears to be occurring for REST calls to outbound services. We are pretty consistently seeing request timeouts (root cause: Net::ReadTimeout from the Ruby code) when making REST calls to both internal services/components within the cluster and external resources. We're trying to narrow down whether it's possible that the additional node agent now running on the host nodes is somehow "interfering" with network connections, both internally in the cluster and to external resources outside of the cluster. It seems that the initial connection is successful, but the connection is terminated (possibly an incorrect assumption) while it is open, so the server attempts to respond to the client but cannot. I should add that we also did update |
@ndrafahl it is possible this issue was fixed by #185, whose fix is going into v1.0.8, which is targeting release in mid-Feb. The TL;DR for that issue is that when multiple pod replicas were on the same node, deleting one of the replicas was inadvertently clearing the BPF pinpath for the other replica. |
Hey @jdn5126! We're actually seeing this issue without any pods being deleted within the cluster. |
I see, let's give the Network Policy maintainers a chance to comment here and see if they have any theories |
We're kind of following this thread over here that someone else brought up. But the difference for us is that we're seeing the issue in pods that have been running for some time: #186 |
@ndrafahl - You are on agent version v1.0.6; certain fixes went into v1.0.7 that addressed conntrack cleanup for long-running connections, and the issue you are describing looks similar. We are in the process of building v1.0.8 and should have the release by next week. If you would like, you can try the |
Hey @jayanthvn - is there anything we could provide from our end that would give us the "smoking gun" that we just need to upgrade to v1.0.7 (via addon version 1.16.2-eksbuild.1)? We're a bit hesitant to touch anything in this environment at this point, only because we're "kind of working" and don't want to push ourselves to "not working at all". Is there any risk in updating to 1.16.2-eksbuild.1? I should add that we can absolutely make the same change in our lower environments, but our lower environments, on the same version of the addon and node agent, do not appear to have this issue. |
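For reference, a minimal sketch of checking the currently installed add-on version and node agent image before deciding on the upgrade (the cluster name and region below are placeholders, not values from this thread):

```bash
# Placeholders: substitute your cluster name and region.
CLUSTER=my-eks-cluster
REGION=us-east-1

# Currently installed managed vpc-cni add-on version.
aws eks describe-addon \
  --cluster-name "$CLUSTER" --region "$REGION" \
  --addon-name vpc-cni \
  --query 'addon.addonVersion' --output text

# Images running in the aws-node DaemonSet (the network policy agent runs as an
# additional container in these pods).
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
```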
Quick follow-up question: would those conntrack cleanups relate to connections going out of the cluster (i.e. to RDS), connections internal to the cluster (i.e. between pods/services), and connections between containers in the same pod? |
Yes, all initial connections going out through the probes attached to the pods' veths will be logged in the local conntrack table. After the initial connection, the remaining packets should just match the local conntrack entry and egress out of the ENI. The local conntrack entry will expire once the kernel conntrack entry expires. Sure, can you share the node logs? You can run this script |
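(The script link above wasn't captured here; the sketch below assumes the standard EKS log collector script, run on the affected worker node via SSH or SSM.)

```bash
# Assumed path to the standard EKS log collector; the branch/path may differ.
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/main/log-collector-script/linux/eks-log-collector.sh

# Must run as root on the node; collects CNI/network-policy-agent logs, conntrack
# state, iptables, and kernel networking info into a tarball under /var/log.
sudo bash eks-log-collector.sh
```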
Sure thing - do you want them from the host node where the pod that ran into the RDS timeout was running? We are seeing issues internally as well, but I figured that could be a good starting point. |
Yes, sure, that would help. |
Sounds good, I sent it over with the subject "Issue aws/amazon-vpc-cni-k8s#204". Thanks for the help. |
Thanks @ndrafahl, I sent the logs you shared via support case internally to the dev team.
on your non prod cluster. |
Thanks @jayanthvn and @maiconrocha - It does appear that upgrading to v1.0.7 did solve our issues with the random read timeouts we were seeing; we are considerably more stable today. We'll be watching for the GA of the v1.0.8 release, and will go with that when it's ready. |
Hey - I hate to reopen this issue. We've seen considerably fewer issues with timeouts after upgrading to the I know one option is to test out the It sounds like
Effectively, our other option is to stop using the AWS CNI to enforce network policies, and hopefully go back to a position where we were not seeing any of the network timeouts. I know @jayanthvn you provided the following instructions on how to revert the change by:
The question I have regarding this is, it appears to have left the node agent still running in the Sorry for the long comment, I appreciate you guys having helped us out thus far. |
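(The exact revert steps referenced above weren't captured in this thread; below is a hedged sketch of what disabling enforcement through the managed add-on configuration might look like. The cluster name and region are placeholders.)

```bash
# Turn policy enforcement back off in the managed vpc-cni add-on.
aws eks update-addon \
  --cluster-name my-eks-cluster --region us-east-1 \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy": "false"}'

# Per the discussion above, the node agent container may still be present in the
# aws-node pods after this change; disabling only stops enforcement, and it does
# not touch kernel conntrack entries.
```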
@ndrafahl - Yes, we were able to repro the timeouts in v1.0.7 and it is fixed in v1.0.8. The CNI release is in progress and we should probably have it by the end of this week or early next week. Would highly recommend trying it. Regarding the instructions to disable: yes, those mentioned above are correct, and having the network policy agent enabled or disabled won't impact any conntrack entries in the kernel. |
Hey @jayanthvn - Sorry for the thousand questions I've had; I thought of a few over the weekend as we wait for the 1.0.8 release.
Thanks a ton. |
@ndrafahl - Happy to answer, please feel free to reach out. Yes, we tried with and without pod churn. Yes, please redeploy the replicas. Node replacement is not needed. Please keep us updated. The v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3 |
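A minimal sketch of the "redeploy the replicas" step suggested above (the namespace and deployment names are placeholders):

```bash
# Rolling-restart all Deployments in a namespace so every pod is re-created
# under the upgraded node agent; repeat per namespace as needed.
kubectl -n my-namespace rollout restart deployment

# Optionally watch a specific workload finish rolling.
kubectl -n my-namespace rollout status deployment/my-app
```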
Does the release of the managed VPC CNI EKS Addon typically follow suit (which would then have 1.0.8 bundled in), or will it still be a while before a release of that? |
Yep, the VPC CNI EKS Addon release is in the pipeline. This version should be available in all regions within the next 48 hours |
@ndrafahl are you still experiencing any of these issues after upgrading to Network Policy agent |
It's funny you ask, I was literally talking to a coworker about the change request we are planning in order to do this in Production. We have not yet deployed it to Production. We had the same version of the managed addon and the node agent installed in our two lower-environment EKS clusters and could not replicate the issue we are seeing in Production (very convenient). The addon popped up late Friday that there was a new version to deploy, so we went ahead and upgraded it this morning in our two lower env clusters, and are doing some testing to see if anything is broken that wasn't broken before. We then have to decide the date to deploy this to Production, and go through restarting all of our deployments in the cluster after the change. We're trying to come up with a backout plan in the event new issues arise, or the problem isn't resolved. I think our true backout plan is probably going to be to try to disable network policy enforcement via the addon. tl;dr (sorry, that was a lot to basically say...) we haven't yet deployed it in Production, but I'll be sure to follow up here when we do with the outcome. |
No worries, thank you for the update, and let us know if we can help further! |
Unfortunately, in my case, updating the vpc-cni add-on to v1.16.3-eksbuild.2 (nodeagent is running version v1.0.8-eksbuild.1) did not help. I still see connection issues for a few pods. E.g. the Keycloak pod cannot connect to the RDS DB. A pod restart sometimes helps and sometimes does not. But moving the pod to another node seems to help. |
@albertschwarzkopf Are you saying the node where the pod lands is having an effect on policy enforcement? Does the pod try to connect to the RDS DB endpoint right after boot up? And if yes, are there any retries in place? Can you share the NP and node logs with us from the problematic node - |
@achevuru today the pod started successfully after node downscaling. So let me watch it over the next few days and I will update this issue. |
@achevuru I do not know why, but it has been running without issues for 2 days. |
Hey - Just wanted to follow up on this, so the issue doesn't appear stale. We are planning to apply this upgrade during a maintenance window we have coming up. We did some additional investigating on Because of this, we delayed the upgrade to our regular maint. window so customers are aware of the change. I have a random question about the upgrade. You guys suggested restarting the replicas after the change. Originally we had planned to do this by rolling-restarting all of the deployments in the cluster, but we have to do an AMI upgrade anyway during our maintenance window. Do you guys see any issue if we instead do:
Is there any cause for concern if pods don't get immediately restarted after the upgrade, basically, and they just get restarted as the nodes get replaced? |
@ndrafahl that strategy sounds good to me, though please use VPC CNI I do not see any cause for concern as long as the pods get restarted on any node. |
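A rough sketch of the order of operations being discussed, assuming managed node groups (the cluster, node group, and add-on version strings are placeholders):

```bash
# 1. Upgrade the vpc-cni add-on to a build that bundles the fixed network policy agent.
aws eks update-addon \
  --cluster-name my-eks-cluster \
  --addon-name vpc-cni \
  --addon-version v1.16.3-eksbuild.2

# 2. Roll the node group onto the new AMI; draining replaces every node, so all
#    pods are re-created (and thus restarted) on nodes running the upgraded agent.
aws eks update-nodegroup-version \
  --cluster-name my-eks-cluster \
  --nodegroup-name my-nodegroup
```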
Cool, appreciate the reply and the heads up. |
@ndrafahl - Please let us know if you are still seeing an issue with v1.0.8 release |
Will do. Sorry for the delayed update on this. We did the upgrade in our lower environments last week with no new issues, but we also were not seeing the issues in those environments on 1.0.7. We are doing Production this week. |
We upgraded to I did see another new version was available last night when we went to do the upgrade, but we did not go to that as we hadn't yet tested that in our lower environments. Must've been released recently :) |
It appears we are still seeing sporadic timeouts between at least pods/containers within our cluster, after upgrading to We tracked down one that occurred today. Basically the traffic flow should have been: pod2 then makes its own REST call (to complete the call back to pod1): (We do proxying of calls between applications within our cluster for authentication reasons, which is why the back and forth occurs.) (We use an nginx "sidecar" container in most of our pods for TLS, which then forwards traffic on to another container in the same pod.) Unfortunately, the hop between the containers in pod1 is where the traffic never arrived. The "final" timeout was found in the nginx container's logs on pod1, corresponding to the original request:
The timeouts then cascaded back to the origin. We checked the tomcat logs for the application container in pod1, and didn't see any log for the particular POST call, so it doesn't appear that the app container in pod1 saw the traffic coming from the nginx container in the same pod. |
I did run the log-collector script on the node where that |
We saw a small handful of other client timeout errors this morning. I traced down one that was effectively the same as the one in my previous comment. Unfortunately for this one, the The traffic is, again, as follows: pod2 then makes its own REST call (to complete the call back to pod1): With this one it was more interesting - pod1 did get the call (both tomcat logged it, and the application logged it) and the amount of time it took to complete was well below 60 seconds (which would trigger a normal timeout for I traced the call backwards from there, and found the timeout occurred this time between the |
It seems like the connection between the I've also grabbed the log-collector tarball for the node where pod2 was running. |
Sorry for the spam, just trying to get information on here in the event someone has ideas on how to troubleshoot the issue. We saw yet another client timeout this evening, where it appears that the connection between the containers within the same pod was dropped while it was actively open. The traffic was the same as above, where there are basically two "connections" occurring: one from pod1 -> pod2, which is kept open while another, separate connection is made from pod2 -> pod1. I grabbed another log collection from the node where the timeout occurred between the containers on pod2. |
@ndrafahl I see that you've collected node logs. Can you share them with us @ [email protected]? Do you see
Not following this. Containers in the same pod are talking to each other via |
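One quick way to exercise the in-pod localhost path on its own (a hedged sketch; the pod, container, and port names are placeholders, and it assumes curl is available in the sidecar image):

```bash
# From inside the nginx sidecar, call the app container over localhost, which
# bypasses anything happening between pods.
kubectl -n my-namespace exec core-pod-abc123 -c nginx -- \
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 65 http://127.0.0.1:8080/healthz
```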
Hey @achevuru -
Correct - that appears to be the case anyway. Let me see if I can better explain the full chain of connections for our latest timeout, with pod names and all as I think I've been poorly explaining this. ingress-nginx-controller deployment (3 replicas) where 2 of them were involved: core deployment (3 replicas) where 1 of them was involved:
task deployment (5 replicas) where 2 of them were involved:
task-59fcb97fb6-nzcmj
There are basically two "connections" that occur as part of the chain. Connection 1 stays open until Connection 2 completes, and then everything rolls backwards. The process is as follows:
CONNECTION 1
CONNECTION 2
CONNECTION 1
I believe at this point, the connection between the nginx container and the core container on
To be honest, I'm not sure where to look in the log file for this. But I don't think there would be any DENY entries at all. Both connections were successful to their destinations, it's just that the traffic appears to have been dropped somewhere.
This makes sense, I think. But we didn't have client timeout issues until we enforced network policies via the CNI; prior to this, these issues weren't present in the cluster. Unfortunately, we haven't been able to replicate this issue in a lower environment (not sure if that's due to the amount of traffic or not), so I haven't been able to prove that disabling policy enforcement would fix the problem either. |
Definitely - I sent the collected node logs for the node where the core pod was running, where we saw the timeout between the containers within the same pod. I sent it as the subject of "Issue aws/amazon-vpc-cni-k8s#204". The original connection was at (UTC): The timeout was at (UTC): |
Ah, unfortunately I won't be able to send them via email; it looks like Amazon bounces it back as it's slightly over the size limit. Not sure if you happen to have another method to get that file to you. |
This may be coincidental or unrelated, but looking at the network-policy-agent logs for the nodes where the timeout appears to have occurred between the nginx container and the application container in the same pod, I'm seeing conntrack cleanups a few milliseconds after the connection from the I am wondering if the nginx container is seeing the client timeout to the application container in the same pod only because the connection between the ingress-nginx-controller and the pod itself has been dropped? Here are three examples (I've substituted out the IP addresses below). Example 1:
Example 2:
Example 3:
|
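For anyone trying to reproduce this correlation, a hedged sketch of pulling the same kind of entries on a node (the log path assumes the agent's default host mount, and the timestamp/CIDR filters are placeholders):

```bash
# Conntrack cleanup activity in the network policy agent log around the request time
# (~59-60 seconds before the client timeout).
sudo grep -i conntrack /var/log/aws-routed-eni/network-policy-agent.log | grep '2024-04-15T13:4'

# Kernel conntrack entries for the destination in question (placeholder CIDR).
sudo conntrack -L -d 10.0.0.0/16 2>/dev/null | head
```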
Hey - Wanted to follow up and see if there were any thoughts on the previous comment, and on it seeming like the conntrack cleanups always occur almost immediately after we see our call being made (and always ~59 seconds before our client timeout error). I wasn't sure if there was anything in those node logs that might better prove/disprove the theory. Thanks! |
@ndrafahl - Are you available on the K8s Slack? Maybe via Slack you can attach the zip file for us? Please share your Slack ID and we can ping you there. |
We're experiencing a similar network policy issue. Opened #245. |
Hey guys! Any chance you might have a timeframe for a GA release of the version that'll have a fix for the race condition in the conntrack cleanups, or are you still working on a fix? |
We have the PRs merged and are working through release testing. We are planning to have the RC image this week, and the final build will be out before the end of May. |
The fix is released with network policy agent v1.1.2 - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.2. Please test and let us know if there are any issues. |
What happened:
We had the AWS CNI running in the cluster without the Network Policies being enforced by the CNI:
{ "enableNetworkPolicy": "false" }
We enabled the network policies to be enforced during a maintenance window (only change made was
{ "enableNetworkPolicy": "true" }
), and it appeared at the time like everything was running correctly. The aws-node pods rotated out OK on the worker nodes, and the CNI addon update was marked as successful.
We noticed, though, that after we made this change, we're seeing intermittent client connection read timeouts between containers that are running in our cluster and RDS Postgres that were not occurring before the change:
The containers come up OK, connect to the RDS instance initially, and start processing work, but at some point we see the error message above in the RDS instance logs, and it appears that the query we were running from our container basically gets stuck processing.
Bouncing the pod(s) affected appears to resolve the issue temporarily. The pods are typically moved to one of our other nodes, and reconnect and begin working again for roughly 5-30 minutes.
None of the network policies being enforced by the CNI define any egress rules - we are just enforcing a small handful of ingress rules between services that exist within the cluster itself.
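For context, the policies in question are roughly of this shape - an ingress-only policy with no egress section (a hedged sketch; the names, labels, and ports are placeholders rather than the actual policies from this cluster):

```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-core-to-task      # placeholder name
  namespace: my-namespace       # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: task                 # placeholder label for the receiving service
  policyTypes:
    - Ingress                   # no Egress type, so egress stays unrestricted
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: core         # placeholder label for the allowed caller
      ports:
        - protocol: TCP
          port: 8443            # placeholder port
EOF
```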
Thank you!
Environment:
Kubernetes version (kubectl version): v1.25.16-eks-8cb36c9
OS (cat /etc/os-release): Amazon Linux 2
Kernel (uname -a): 5.10.199-190.747.amzn2.x86_64