
Random Client Read Timeouts to RDS After Enabling Network Policies via AWS CNI #204

Closed
ndrafahl opened this issue Feb 2, 2024 · 47 comments
Labels: needs investigation, question

Comments

@ndrafahl

ndrafahl commented Feb 2, 2024

What happened:
We had the AWS CNI running in the cluster without the Network Policies being enforced by the CNI:
{ "enableNetworkPolicy": "false" }

We enabled the network policies to be enforced during a maintenance window (only change made was { "enableNetworkPolicy": "true" }), and it appeared at the time like everything was running correctly. The aws-node pods rotated out OK on the worker nodes, and the CNI addon update was marked as successful.
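
(For reference, a minimal sketch of the kind of managed addon update that flips this flag - the cluster name here is a placeholder:)

aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy": "true"}'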

We noticed, though, that after we made this change we're seeing intermittent client connection read timeouts between containers running in our cluster and RDS Postgres that were not occurring before the change:

2024-02-02 01:49:02 UTC:<container_ip_redacted>(52205):<db_user_redacted>@<db_redacted>:[17204]:LOG:  could not receive data from client: Connection timed out
2024-02-02 01:49:02 UTC:<container_id_redacted>(52205):<db_user_redacted>@<db_redacted>:[17204]:LOG:  unexpected EOF on client connection with an open transaction

The containers come up OK, connect to the RDS instance initially, and start processing work, but at some point we see the error message above in the RDS instance logs, and it appears that the query we were running from our container basically gets stuck processing.

Bouncing the pod(s) affected appears to resolve the issue temporarily. The pods are typically moved to one of our other nodes, and reconnect and begin working again for roughly 5-30 minutes.

None of the network policies being enforced by the CNI define any egress rules - we are just enforcing a small handful of ingress rules between services that exist within the cluster itself.
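
(For illustration only - the enforced policies are simple ingress-only rules of roughly this shape; the names, labels, and port below are placeholders, not our actual policies:)

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api   # placeholder name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: api                  # placeholder label
  policyTypes:
    - Ingress                   # note: no egress rules are defined
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend     # placeholder label
      ports:
        - protocol: TCP
          port: 8080
EOF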

Thank you!

Environment:

  • Kubernetes version (use kubectl version): v1.25.16-eks-8cb36c9
  • CNI Version: v1.15.4-eksbuild.1
  • Network Policy Agent Version: v1.0.6-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 5.10.199-190.747.amzn2.x86_64
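
(In case it helps with reproduction, these versions can be gathered with something along these lines; the grep pattern is just a convenience:)

kubectl version --short                                                 # Kubernetes server version
kubectl -n kube-system describe daemonset aws-node | grep -i 'image:'   # CNI and network policy agent images
kubectl get nodes -o wide                                               # OS image and kernel per node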
@ndrafahl added the needs investigation and question labels on Feb 2, 2024
@ndrafahl
Author

ndrafahl commented Feb 5, 2024

This also appears to be occurring for REST calls to outbound services.

We are pretty consistently seeing request timeouts (Root Cause: Net::ReadTimeout: Net::ReadTimeout from the Ruby code) when making REST calls to both internal services/components within the cluster and to external resources.

We're trying to narrow down if it's possible that the additional node agent that is now running on the host nodes is somehow "interfering" with network connections both internally in the cluster, and to external resources outside of the cluster. It seems that the initial connection is successful, but the connection is terminated (possibly an incorrect assumption) while the connection is open, and so the server attempts to respond to the client but cannot.

I should add that we also updated kube-proxy to the latest supported 1.25 version before we began enforcing the network policies via the CNI.

@jdn5126 jdn5126 transferred this issue from aws/amazon-vpc-cni-k8s Feb 5, 2024
@jdn5126
Contributor

jdn5126 commented Feb 5, 2024

@ndrafahl it is possible this is the issue fixed by #185, which ships in v1.0.8, targeting release in mid-February.

The TL;DR for that issue is that when multiple pod replicas were on the same node, deleting one of the replicas was inadvertently clearing the BPF pinpath for the other replica.

@ndrafahl
Author

ndrafahl commented Feb 5, 2024

Hey @jdn5126!

We're actually seeing this issue without any pods being deleted within the cluster.

@jdn5126
Contributor

jdn5126 commented Feb 5, 2024

Hey @jdn5126!

We're actually seeing this issue without any pods being deleted within the cluster.

I see, let's give the Network Policy maintainers a chance to comment here and see if they have any theories

@ndrafahl
Author

ndrafahl commented Feb 5, 2024

We're kind of following the thread that someone else brought up over here: #186. The difference for us is that we're seeing the issue in pods that have been running for some time.

@jayanthvn
Contributor

jayanthvn commented Feb 5, 2024

@ndrafahl - You are on agent version v1.0.6; certain fixes went into v1.0.7 that fixed conntrack cleanup for long-running connections, and the issue you are mentioning looks similar. We are in the process of building v1.0.8 and should have the release by next week. If you would like, you can try the v1.0.8-rc3 release candidate image.

@ndrafahl
Author

ndrafahl commented Feb 5, 2024

Hey @jayanthvn - is there anything we could provide from our end that would serve as a "smoking gun" showing that we just need to upgrade to v1.0.7 (via addon version 1.16.2-eksbuild.1)?

We're a bit hesitant to touch anything in this environment at this point, only because we're "kind of working" and don't want to push ourselves to "not working at all".

Is there any risk in updating to 1.16.2-eksbuild.1?

I should add we can absolutely do the same change in our lower environments, but our lower environments, on the same version of the addon and node agent, do not appear to have this issue.

@ndrafahl
Author

ndrafahl commented Feb 5, 2024

Quick follow-up question: would those conntrack cleanups relate to connections going out of the cluster (e.g. to RDS), connections internal to the cluster (e.g. between pods/services), and connections between containers in the same pod?

@jayanthvn
Contributor

Yes, all initial connections going out through the probes attached to the pod's veth will be logged in the local conntrack table. After the initial connection, the remaining packets should just match the local conntrack entry and egress out of the ENI. The local conntrack entry will expire once the kernel conntrack entry expires.

Sure, can you share the node logs? You can run sudo bash /opt/cni/bin/aws-cni-support.sh on the node that is initiating the connection and we can take a look. You can mail them to [email protected]
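
(Side note: a hedged way to eyeball the kernel-side conntrack entries on the node, assuming the conntrack-tools package is available on the AMI; the IP and port are placeholders for the RDS endpoint:)

sudo conntrack -L -p tcp --dport 5432 -d 10.0.0.50   # list kernel conntrack entries towards the DB endpoint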

@ndrafahl
Author

ndrafahl commented Feb 5, 2024

Sure thing - do you want it from the host node where the pod that ran into the RDS timeout was running? We are seeing issues internally as well, but I figured that could be a good starting point.

@jayanthvn
Contributor

Yes, sure, that would help.

@ndrafahl
Author

ndrafahl commented Feb 5, 2024

Sounds good, I sent it over with the subject "Issue aws/amazon-vpc-cni-k8s#204". Thanks for the help.

@maiconrocha

Thanks @ndrafahl, I sent the logs you shared via the support case internally to the dev team. They confirmed the incorrect conntrack cleanups identified in vpc-cni-np-agent v1.0.6 apply to any initial connection going through the probes (either ingress or egress) attached to the veth; it doesn't differentiate between pod-to-pod, pod-to-service, or pod-to-external. So the upgrade you did to v1.0.7 of the vpc-cni-np-agent should fix the issue. If you would like, you can try the v1.0.8-rc3 release candidate image by updating the image to

602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.8-rc3

on your non-prod cluster. v1.0.8-rc3 has a few optimizations and enhancements.
The team is working on the v1.0.8 release, which should be out by next week.
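
(A hedged sketch of how that image could be swapped in on a non-prod cluster, assuming the node agent container in the aws-node DaemonSet is named aws-eks-nodeagent:)

kubectl -n kube-system set image daemonset/aws-node \
  aws-eks-nodeagent=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.8-rc3
kubectl -n kube-system rollout status daemonset/aws-node   # wait for the aws-node pods to roll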

@ndrafahl
Author

ndrafahl commented Feb 6, 2024

Thanks @jayanthvn and @maiconrocha -

It does appear that upgrading to v1.0.7 solved our issues with the random read timeouts we were seeing; we are considerably more stable today.

We'll be watching for the GA of the v1.0.8 release, and will go with that when it's ready.

@ndrafahl ndrafahl closed this as completed Feb 6, 2024
@ndrafahl
Author

ndrafahl commented Feb 14, 2024

Hey -

I hate to reopen this issue. We've seen considerably fewer issues with timeouts after upgrading to the 1.0.7 version of the network policy agent, but we're still seeing sporadic timeouts with that version.

I know one option is to test out the 1.0.8-rc3 image specifically, but we're a bit hesitant to try out a release candidate image in our Production environment, and it's been difficult to replicate the same issue in a non-production environment that does not have the same overall usage as production.

It sounds like 1.0.8 should have additional fixes for the conntrack cleanup issue we were hitting in 1.0.6, which was mostly fixed in 1.0.7. Is that the case, and if so:

  • Is it possible to see, before making the change, whether our remaining issues would be solved by going to 1.0.8 when it releases? Is there anything in the logs we could provide to show that's the case?
  • If so, I know it was targeted to release this week. Is that still the planned release window?

Effectively, our other option is to stop using the AWS CNI to enforce network policies and hopefully get back to a state where we were not seeing any of the network timeouts. I know @jayanthvn provided the following instructions on how to revert the change (a rough sketch of the corresponding commands follows the list):

  • Deleting all Network Policies in the Cluster
  • Disabling the controller by updating the amazon-vpc-cni configmap
  • Setting enableNetworkPolicy to false in the managed addon's additional configuration options.
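
(A rough sketch of those steps - the ConfigMap key name is my assumption and the cluster name is a placeholder, so please verify against the addon docs before running:)

kubectl delete networkpolicy --all --all-namespaces             # 1. remove all NetworkPolicies
kubectl -n kube-system patch configmap amazon-vpc-cni --type merge \
  -p '{"data":{"enable-network-policy-controller":"false"}}'    # 2. assumed key name
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy": "false"}'     # 3. disable enforcement in the managed addon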

The question I have regarding this is that, in my testing in another environment, this appears to leave the node agent still running in the aws-node daemonset. Would that still actually stop the netpol agent from doing anything with conntrack entries on the host machine, assuming that's still the culprit?

Sorry for the long comment, I appreciate you guys having helped us out thus far.

@ndrafahl ndrafahl reopened this Feb 14, 2024
@jayanthvn
Contributor

@ndrafahl - Yes, we were able to repro the timeouts in v1.0.7, and it is fixed in v1.0.8. The CNI release is in progress and we should probably have it by the end of this week or early next week. I would highly recommend trying it.

Regarding the instructions to disable: yes, those mentioned above are correct, and having the network policy agent enabled or disabled won't impact any conntrack entries in the kernel.

@ndrafahl
Author

Hey @jayanthvn -

Sorry for the thousand questions I've had - I thought of a few more over the weekend as we wait for the 1.0.8 release.

  1. Were your repros in clusters that have no pod churn? I know there's a sporadic issue I have seen people hit (I believe in 1.0.7) where, if pods from the same deployment exist on the same node and one is removed, it can cause issues with the pod that's left (probably paraphrasing a bit here). We have no pod churn in our cluster, so I was just curious.
  2. Is there anything we would need to do after the 1.0.8 upgrade, e.g. pod restarts or node replacements?
  3. If we had to roll back, is it largely safe to roll back to the previous version of the node agent?

Thanks a ton.

@jayanthvn
Contributor

@ndrafahl - Happy to answer. Please feel free to reach out....

Yes, we tried with and without pod churn. Yes, please redeploy the replicas; node replacement is not needed. Please keep us updated. The v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3
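
(For reference, a minimal way to redeploy the replicas after the upgrade - the namespace is a placeholder:)

kubectl -n my-namespace rollout restart deployment   # restarts every deployment in the namespace
kubectl -n my-namespace get pods -w                  # watch the new replicas come up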

@ndrafahl
Author

Does the release of the managed VPC CNI EKS Addon typically follow suit (which would then have 1.0.8 bundled in), or will it still be a while before that release?

@jdn5126
Contributor

jdn5126 commented Feb 20, 2024

Does the release of the managed VPC CNI EKS Addon typically follow suit (which would then have 1.0.8 bundled in), or will that still be a bit for a release of that?

Yep, the VPC CNI EKS Addon release is in the pipeline. This version should be available in all regions within the next 48 hours
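
(A hedged way to check when the new addon version has landed in a given region:)

aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.25 \
  --query 'addons[].addonVersions[].addonVersion' --output text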

@jdn5126
Contributor

jdn5126 commented Feb 26, 2024

@ndrafahl are you still experiencing any of these issues after upgrading to Network Policy agent v1.0.8?

@ndrafahl
Author

@ndrafahl are you still experiencing any of these issues after upgrading to Network Policy agent v1.0.8?

It's funny you ask - I was literally just talking to a coworker about the change request we are planning in order to do this in Production.

We have not yet deployed it to Production. We had the same version of the managed addon, and the node agent, installed in our two lower environment EKS clusters and could not replicate the issue we are seeing in Production (very convenient).

A new addon version popped up late Friday, so we went ahead and upgraded it this morning in our two lower env clusters, and we are doing some testing to see if anything is broken that wasn't broken before.

We then have to decide on a date to deploy this to Production, and go through restarting all of our deployments in the cluster after the change. We're trying to come up with a backout plan in the event new issues arise or the problem isn't resolved. I think our true backout plan is probably going to be to disable network policy enforcement via the addon.

tl;dr (sorry, that was a lot to basically say...) we haven't yet deployed it in Production, but I'll be sure to follow up here when we do with the outcome.

@jdn5126
Contributor

jdn5126 commented Feb 26, 2024

No worries, thank you for the update, and let us know if we can help further!

@albertschwarzkopf

Unfortunately, in my case, updating the vpc-cni add-on to v1.16.3-eksbuild.2 (the node agent is running version v1.0.8-eksbuild.1) did not help. I still see connection issues for a few pods.

E.g. the Keycloak pod cannot connect to the RDS DB. A pod restart sometimes helps and sometimes does not, but moving the pod to another node seems to help.

@achevuru
Contributor

@albertschwarzkopf Are you saying the node where the pod lands has an effect on policy enforcement? Does the pod try to connect to the RDS DB endpoint right after boot up, and if so, are there any retries in place? Can you share the NP and node logs from the problematic node with us - [email protected]

@albertschwarzkopf

@achevuru today the pod started successfully after node downscaling. Let me watch it over the next few days and I will update this issue.

@albertschwarzkopf

@achevuru I do not know why, but it has been running without issues for 2 days now.

@ndrafahl
Author

ndrafahl commented Mar 6, 2024

Hey -

Just wanted to follow up on this, so the issue doesn't appear stale.

We are planning to apply this upgrade during a maintenance window we have coming up.

We did some additional investigating on v1.0.7 and found that some of the timeouts we were seeing do not appear to be directly correlated with the enforcement of the network policies via the netpol agent. It looks instead like a call we're making out of the cluster is randomly taking longer than 60 seconds, and so the nginx controller we're using internally in the cluster is "timing out" that connection after 60 seconds due to its configuration.

Because of this, we delayed the upgrade to our regular maintenance window so customers are aware of the change.

I have a random question about the upgrade. You suggested restarting the replicas after the change. Originally we had planned to do this by rolling-restarting all of the deployments in the cluster, but we have to do an AMI upgrade anyway during our maintenance window. Do you see any issue if we instead do the following (a rough sketch is included below):

  1. Upgrade the CNI to 1.16.3
  2. Upgrade the AMIs to the latest, which will cause all of the nodes to be drained and replaced, effectively causing all replicas to be restarted.

Is there any cause for concern if pods don't get immediately restarted after the upgrade, basically, and they just get restarted as the nodes get replaced?
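
(A rough sketch of those two steps, assuming managed node groups; the cluster, node group, and version strings are placeholders and the exact eksbuild suffix varies:)

aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
  --addon-version v1.16.4-eksbuild.1                 # step 1: upgrade the VPC CNI addon
aws eks update-nodegroup-version --cluster-name my-cluster \
  --nodegroup-name my-nodegroup                      # step 2: roll nodes onto the latest AMI (drains and replaces them)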

@jdn5126
Contributor

jdn5126 commented Mar 6, 2024

@ndrafahl that strategy sounds good to me, though please use VPC CNI v1.16.4 as v1.16.3 had a regression that could lead to high CPU consumption in some edge cases.

I do not see any cause for concern as long as the pods get restarted on any node.

@ndrafahl
Author

ndrafahl commented Mar 6, 2024

Cool, appreciate the reply and the heads up.

@sjastis

sjastis commented Mar 19, 2024

@ndrafahl - Please let us know if you are still seeing an issue with v1.0.8 release

@ndrafahl
Author

@ndrafahl - Please let us know if you are still seeing an issue with v1.0.8 release

Will do.

Sorry for the delayed update on this.

We did the upgrade in our lower environments last week, with no new issues but we also were not seeing the issues in those environments on 1.0.7.

We are doing Production this week.

@ndrafahl
Author

We upgraded to v1.16.4 last night - I'll keep this open for a bit longer and then follow up and close it out after we've confirmed the timeouts have disappeared.

I did see another new version was available last night when we went to do the upgrade, but we did not go to that as we hadn't yet tested that in our lower environments. Must've been released recently :)

@ndrafahl
Author

ndrafahl commented Mar 26, 2024

It appears we are still seeing sporadic timeouts between at least pods/containers within our cluster, after upgrading to v1.16.4.

We tracked down one that occurred today. Basically the traffic flow should have been:
pod1 [ORIGIN] -> REST call -> ingress-nginx-controller -> pod2 (nginx container) -> pod2 (app container)

pod2 then makes its own REST call (to complete the call back to pod1):
pod2 -> REST call -> ingress-nginx-controller -> pod1 (nginx container) -> pod1 (app container)

(We do proxying of calls between applications within our cluster for authentication reasons, which is why the back and forth occurs).

(We use an nginx "sidecar" container in most of our pods for TLS, which then forwards traffic on to another container in the same pod).

Unfortunately, the hop between the containers in pod1 is where the traffic never arrived. The "final" timeout was found in the nginx container's logs on pod1, corresponding to the original request:

2024/03/26 15:54:12 [error] 18#18: *24930 upstream timed out (110: Operation timed out) while reading response header from upstream, client: NGINX_CONTROLLER_IP, server: cluster.local, request: "POST /a/path HTTP/1.1", upstream: "http://127.0.0.1:8080/app/api/v2/a/path", host: "subdomain.domain.io"
NGINX_CONTROLLER_IP - - [26/Mar/2024:15:54:12 +0000] "POST /a/path HTTP/1.1" 504 167 "-" "Ruby"

The timeouts then cascaded back to the origin.

We checked the tomcat logs for the application container in pod1, and didn't see any log for the particular POST call, so it doesn't appear that the app container in pod1 saw the traffic coming from the nginx container in the same pod.

@ndrafahl
Author

I did run the log-collector script on the node where that pod1 was running on, to see if anything in the logs for our timestamp would tell us if it's the CNI that is impacting the traffic between containers on the same pod. Not sure if that may be helpful to provide.

@ndrafahl
Author

ndrafahl commented Mar 27, 2024

We saw a small handful of other client timeout errors this morning. I traced down one that was effectively the same as the one in my previous comment.

Unfortunately for this one, the ingress-nginx-controller container logs had already rotated out and they weren't stored externally, so that's the one piece of the puzzle I don't have.

The traffic is again, as follows:
pod1 [ORIGIN] -> REST call -> ingress-nginx-controller -> pod2 (nginx container) -> pod2 (app container)

pod2 then makes its own REST call (to complete the call back to pod1):
pod2 -> REST call -> ingress-nginx-controller -> pod1 (nginx container) -> pod1 (app container)

With this one it was more interesting - pod1 did get the call (both tomcat logged it, and the application logged it) and the amount of time it took to complete was well below 60 seconds (which would trigger a normal timeout for nginx).

I traced the call backwards from there, and found the timeout occurred this time between the nginx sidecar container in pod2 and the application container in pod2:

2024/03/27 08:00:00 [error] 25#25: *304040 upstream timed out (110: Operation timed out) while reading response header from upstream, client: INGRESS_NGINX_CONTROLLER_IP, server: cluster.local, request: "POST /a/path HTTP/1.1", upstream: "http://127.0.0.1:8080/a/path", host: "subdomain.domain.io"

It seems like the connection between the nginx sidecar container and the application container in the same pod may have been dropped while it was still open. Nothing within the application container suggests it saw a client timeout connecting back to pod1.

I've also grabbed the log-collector tarball for the node where pod2 was running.

@ndrafahl
Author

Sorry for the spam, just trying to get information on here in the event someone has ideas on how to troubleshoot the issue.

We saw yet another issue this evening of another client timeout, where it appears that the connection between the containers within the same pod was dropped while the connection was actively open.

The traffic was the same as above, where there are basically two "connections" occurring. One from pod1 -> pod2. This is kept open while another, separate, connection is made from pod2 -> pod1.

I grabbed another log collection from the node where the timeout occurred between the containers on pod2.

@achevuru
Contributor

@ndrafahl I see that you've collected node logs. Can you share them with us @ [email protected]? Do you see DENY entries in the network policy agent logs?

We saw yet another issue this evening of another client timeout, where it appears that the connection between the containers within the same pod was dropped while the connection was actively open.

Not following this. Containers in the same pod are probably talking to each other via localhost? Packets shouldn't even leave the pod network namespace in this case, and network policy shouldn't come into play for this use case at all.

@ndrafahl
Author

ndrafahl commented Mar 28, 2024

Hey @achevuru -

Not following this. Containers in the same pod are talking to each other via localhost probably? packets shouldn't even leave the pod network namespace in this case? Network policy shouldn't come in to play for this use case at all

Correct - that appears to be the case, anyway. Let me see if I can better explain the full chain of connections for our latest timeout, with pod names and all, as I think I've been explaining this poorly.

ingress-nginx-controller deployment (3 replicas) where 2 of them were involved:
ingress-nginx-controller-65c7b57b6d-5fmv8
ingress-nginx-controller-65c7b57b6d-n2sxh

core deployment (3 replicas) where 1 of them was involved:
core-5f76cc4c55-6gvrb

  • nginx container (to terminate SSL and forward to localhost in the same pod)
  • core container (application)

task deployment (5 replicas) where 2 of them were involved:
task-59fcb97fb6-zc7st

  • nginx container (to terminate SSL and forward to localhost in the same pod)
  • task container (application)

task-59fcb97fb6-nzcmj

  • same as above

There are basically two "connections" that occur as part of the chain. Connection 1 stays open until Connection 2 completes, and then everything rolls backwards.

The process is as follows:

CONNECTION 1

  1. task-59fcb97fb6-zc7st makes a REST call to https://subdomain.domain.com/app/components/...
  2. DNS record for this call needs to go to an ingress-nginx-controller pod for routing based on nginx ingresses. It makes it to ingress-nginx-controller-65c7b57b6d-5fmv8
2024/03/28 03:04:23 [warn] 438#438: *21137250 [lua] cookie.lua:146: bake(): SameSite value must be 'Strict', 'Lax' or 'None' while connecting to upstream, client: TASK_POD_IP, server: subdomain.domain.com, request: "POST /app/components/... HTTP/1.1", host: "subdomain.domain.com"
  3. The ingress-nginx-controller pod determines it needs to go to the core deployment, and makes it to core-5f76cc4c55-6gvrb. We don't initially see the call in the nginx sidecar until it times out, but here's the timeout:
2024/03/28 03:05:23 [error] 24#24: *534004 upstream timed out (110: Operation timed out) while reading response header from upstream, client: NGINX_CONTROLLER_IP, server: cluster.local, request: "POST /app/components/... HTTP/1.1", upstream: "http://127.0.0.1:8080/SUBDOMAIN/app/components/task/...", host: "subdomain.domain.io"

NGINX_CONTROLLER_IP - user [28/Mar/2024:03:05:23 +0000] "POST /app/components/task/... HTTP/1.1" 504 167 "-" "Ruby"
  4. The nginx sidecar container forwards the call on to the core container in the same pod.
  5. The core container then makes another REST call, which will take us on to another destination. Connection 1 remains open here, waiting for Connection 2 to complete.

CONNECTION 2

  1. core-5f76cc4c55-6gvrb makes a REST call to https://subdomain.domain.com/task/app/... to get information so it can reply back to the call on Connection 1.
  2. DNS record for this call needs to go to an ingress-nginx-controller pod for routing based on nginx ingresses. It makes it to ingress-nginx-controller-65c7b57b6d-n2sxh
2024/03/28 03:04:23 [warn] 307#307: *21155396 [lua] cookie.lua:146: bake(): SameSite value must be 'Strict', 'Lax' or 'None' while connecting to upstream, client: CORE_POD_IP, server: subdomain.domain.io, request: "POST /task/app/... HTTP/1.1", host: "subdomain.domain.io"
  3. The ingress-nginx-controller pod determines it needs to go to the task deployment, and makes it to task-59fcb97fb6-nzcmj. The connection here is a 200 from the nginx sidecar container logs, as it completes in less than a second.
INGRESS_NGINX_CONTROLLER_IP - - [28/Mar/2024:03:04:23 +0000] "POST /task/app/... HTTP/1.1" 200 147 "-" "Ruby"
  4. And in the ingress-nginx-controller pod, we see that Connection 2 was successful and responded back in less than a second:
CORE_POD_IP - - [28/Mar/2024:03:04:23 +0000] "POST /task/app/... HTTP/1.1" 200 142 "-" "Ruby" 1059 0.109 [task-deployment] [] TASK_POD_IP:443 147 0.109 200 18681ceec65d3ae2a1a27bed9b4fe3e0

CONNECTION 1 (continued)

  6. At this point, the core pod has everything it needs to respond. There is no timeout seen from this application's perspective, which makes sense - Connection 2 never timed out.
  7. The application completes its work and "replies back" on Connection 1.
  8. Regardless, we see the timeout from the nginx container in the core pod for the connection:
2024/03/28 03:05:23 [error] 24#24: *534004 upstream timed out (110: Operation timed out) while reading response header from upstream, client: NGINX_CONTROLLER_IP, server: cluster.local, request: "POST /app/components/... HTTP/1.1", upstream: "http://127.0.0.1:8080/SUBDOMAIN/app/components/task/...", host: "subdomain.domain.io"

NGINX_CONTROLLER_IP - user [28/Mar/2024:03:05:23 +0000] "POST /app/components/task/... HTTP/1.1" 504 167 "-" "Ruby"
  9. We then see the timeout in the ingress-nginx-controller logs for Connection 1 as well, a minute later:
2024/03/28 03:05:23 [error] 438#438: *21137250 upstream timed out (110: Operation timed out) while reading response header from upstream, client: NGINX_CONTROLLER_IP, server: subdomain.domain.com, request: "POST /app/components/... HTTP/1.1", upstream: "https://CORE_POD_IP:443/app/components/...", host: "subdomain.domain.com"

TASK_POD_IP - user [28/Mar/2024:03:05:23 +0000] "POST /app/components/... HTTP/1.1" 504 104800 "-" "Ruby" 396 60.002 [upstream-default-backend] [] CORE_POD_IP:443 : INGRESS_DEFAULT_BACKEND_POD_IP:80 0 : 104800 60.000 : 0.002 504 : 504 18681ceec65d3ae2a1a27bed9b4fe3e0

I believe at this point, the connection between the nginx container and the core container on core-5f76cc4c55-6gvrb has been closed/dropped/etc. before Connection 2 completed. There's nothing to suggest that Connection 2 timed out, and it was less than a second in execution time, so there was plenty of time for it to respond.

Do you see DENY entries in the network policy agent logs?

To be honest, I'm not sure where to look in the log file for this. But I don't think there would be any DENY entries at all. Both connections were successful to their destinations, it's just that the traffic appears to have been dropped somewhere.
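
(In case it's useful: assuming the agent writes its flow verdicts to /var/log/aws-routed-eni/network-policy-agent.log on the node, something like this should surface any DENY entries:)

sudo grep -i 'deny' /var/log/aws-routed-eni/network-policy-agent.log | tail -n 50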

Network policy shouldn't come in to play for this use case at all

This makes sense, I think. But we didn't have client timeout issues until we enforced network policies via the CNI. Prior to this, these issues weren't present in the cluster. Unfortunately we haven't been able to replicate this issue in a lower environment, not sure if that's due to the amount of traffic or not, so I haven't been able to prove either that disabling policy enforcement would fix the problem.

Can you share them with us @ [email protected]?

Definitely - I sent the collected node logs for the node where the core pod was running, where we saw the timeout between the containers within the same pod. I sent it as the subject of "Issue aws/amazon-vpc-cni-k8s#204".

The original connection was at (UTC):
2024-03-28T03:04:23

The timeout was at (UTC):
2024-03-28T03:05:23

@ndrafahl
Author

Definitely - I sent the collected node logs for the node where the core pod was running, where we saw the timeout between the containers within the same pod. I sent it as the subject of "Issue aws/amazon-vpc-cni-k8s#204".

Ah, unfortunately I won't be able to send them via email - it looks like Amazon bounces it back as it's slightly over the size limit. Not sure if you happen to have another method for me to get you that file.

@ndrafahl
Author

ndrafahl commented Mar 29, 2024

This may be coincidental or unrelated, but looking at the network-policy-agent logs for the nodes where the timeout appears to have occurred between the nginx container and the application container in the same pod, I'm seeing conntrack cleanups a few milliseconds after the connection from the ingress-nginx-controller to the pod is created. It appears these cleanups occur every 5 minutes.

I am wondering if the nginx container is seeing the client timeout to the application container in the same pod only because the connection between the ingress-nginx-controller and the pod itself has been dropped?

Here's three examples (I've substituted out the IP addresses below).


Example 1:
Time of call between ingress-nginx-controller and pod IP: 2024-03-27T07:59:00,140
Source IP of ingress-nginx-controller: 192.1.1.1
Destination IP of pod where the timeout occurs between containers in pod: 192.2.1.1
Conntrack cleanup entries:

{"level":"info","ts":"2024-03-27T07:59:00.204Z","logger":"ebpf-client","caller":"wait/backoff.go:227","msg":"Conntrack cleanup","Entry - ":"Expired/Delete Conntrack Key : Source IP - 192.1.1.1 Source port - 34176 Dest IP - 192.2.1.1 Dest port - 443 Protocol - 6 Owner IP - 192.2.1.1"}
< There are multiple cleanups here for the same source and destination IP >

Example 2:
Time of call between ingress-nginx-controller and pod IP: 2024-03-26T15:53:12,751
Source IP of ingress-nginx-controller: 192.3.1.1
Destination IP of pod where the timeout occurs between containers in pod: 192.4.1.1
Conntrack cleanup entries:

{"level":"info","ts":"2024-03-26T15:53:12.840Z","logger":"ebpf-client","caller":"wait/backoff.go:227","msg":"Conntrack cleanup","Entry - ":"Expired/Delete Conntrack Key : Source IP - 192.3.1.1 Source port - 48212 Dest IP - 192.4.1.1 Dest port - 443 Protocol - 6 Owner IP - 192.4.1.1"}
< There are multiple cleanups here for the same source and destination IP >

Example 3:
Time of call between ingress-nginx-controller and pod IP: 2024-03-28T03:04:23,067
Source IP of ingress-nginx-controller: 192.1.1.1
Destination IP of pod where the timeout occurs between containers in pod: 192.5.1.1
Conntrack cleanup entries:

{"level":"info","ts":"2024-03-28T03:04:23.217Z","logger":"ebpf-client","caller":"wait/backoff.go:227","msg":"Conntrack cleanup","Entry - ":"Expired/Delete Conntrack Key : Source IP - 192.1.1.1 Source port - 41908 Dest IP - 192.5.1.1 Dest port - 443 Protocol - 6 Owner IP - 192.5.1.1"}
< There are multiple cleanups here for the same source and destination IP >
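
(For anyone trying to correlate entries like these on their own nodes - a hedged sketch, assuming the agent log lives at /var/log/aws-routed-eni/network-policy-agent.log and using the placeholder IPs from Example 1:)

sudo grep 'Conntrack cleanup' /var/log/aws-routed-eni/network-policy-agent.log \
  | grep 'Source IP - 192.1.1.1' | grep 'Dest IP - 192.2.1.1'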

@ndrafahl
Author

ndrafahl commented Apr 2, 2024

Hey -

Wanted to follow up and see if there were any thoughts on the previous comment, and on the fact that the conntrack cleanups always appear to occur almost immediately after we see our call being made (and always ~59 seconds before our client timeout error).

I wasn't sure if there was anything in those node logs that might better prove/disprove the theory.

Thanks!

@jayanthvn
Contributor

@ndrafahl - Are you available on the K8s Slack? Maybe via Slack you could send us the zip file. Please share your Slack ID and we can ping you there.

@younsl
Contributor

younsl commented Apr 11, 2024

We're experiencing a similar network policy issue. Opened #245.

@ndrafahl
Author

Hey guys!

Any chance you might have a timeframe for a GA release of the version that'll have a fix for the conntrack cleanup race condition, or are you still working on a fix?

@jayanthvn
Contributor

We have the PRs merged and are working through release testing. We are planning to have the RC image this week, and the final build will be out before the end of May.

@jayanthvn
Contributor

The fix is released with network policy agent v1.1.2 - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.2. Please test and let us know if there are any issues.
