
Node HTTP outgoing request fails randomly #1470

Closed
aviramha opened this issue May 25, 2023 · 12 comments
Labels: bug (Something isn't working), retest

Comments

aviramha (Member) commented May 25, 2023

I believe this is an fd leak / resource leak issue, since it happens randomly. We also had a flaky e2e test with many outgoing HTTP requests, so it might be relevant.

Happens in a Linux container on a Mac M1 with Node.

eyalb181 added the bug (Something isn't working) label on May 25, 2023
aviramha (Member Author) commented:

Relevant:
#757

t4lz (Member) commented Jun 7, 2023

When running the test from @infiniteregrets's branch on Mac, but changing the test to send all the requests 30 times in a loop, at some point the agent sends a ConnectTimedOut back to the layer and the application gets an ENETUNREACH. Is this the bug reported here, or is this another, unrelated issue? (it does not happen when running that test app without mirrord)
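For context, the looped variant is roughly this (a minimal sketch, not the actual test code; `makeRequests()` is assumed to be the batch-sending helper from the e2e script mentioned later in this thread):

```js
// Sketch of the modified test: send the whole batch of requests 30 times in a row.
// makeRequests() is assumed to send one batch of outgoing HTTP requests and
// resolve once all of their responses (or errors) have been handled.
for (let i = 0; i < 30; i++) {
  await makeRequests();
}
```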

aviramha (Member Author) commented Jun 7, 2023

> When running the test from @infiniteregrets's branch on Mac, but changing the test to send all the requests 30 times in a loop, at some point the agent sends a ConnectTimedOut back to the layer and the application gets an ENETUNREACH. Is this the bug reported here, or is this another, unrelated issue? (it does not happen when running that test app without mirrord)

I don't think we managed to reproduce the issue locally. It only happens in the E2E tests (after the fix); before the fix we had another issue (or the same one?) that occurred locally.

t4lz (Member) commented Jun 7, 2023

But what was the error? Was it ENETUNREACH?

aviramha (Member Author) commented Jun 7, 2023

> But what was the error? Was it ENETUNREACH?

Before the fix, locally? It was connection reset.

t4lz (Member) commented Jun 7, 2023

Do we have the exact node error? Is it the same issue as #564?

aviramha (Member Author) commented Jun 7, 2023

Seems so, yes.

aviramha (Member Author) commented:

Does anyone know if this still happens, or how to check whether it still happens? I believe the internal proxy refactor should've solved it.

infiniteregrets (Contributor) commented:

I have stress-tested this on main locally and it does not happen. I will give it a try in CI with an e2e test; if that passes, I will add an integration test. Related to #1484.

gememma (Member) commented Sep 10, 2024

Looks like this passes with 30 loops of makeRequests() in CI now (https://github.com/metalbear-co/mirrord/actions/runs/10794249050/job/29938137749?pr=2748) but fails locally on more than about 10 loops - feels like it could be a workload problem where the sheer number of requests is increasing the latency?
Weirdly, sometimes test_outgoing_traffic_many_requests_disabled (running the test without mirrord) also fails with ECONNRESET, which makes me think this isn't a mirrord problem.

EDIT: After a few more runs I'm not convinced there is any pattern in the failures (other than more loops failing more often).
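For anyone reproducing this, the errors above surface on the client request object in Node; a minimal sketch of how to log them (the target URL is just a placeholder):

```js
import http from "node:http";

// Minimal sketch: make one outgoing request and log the error code/message if it
// fails. ECONNRESET shows up in err.code; "socket hang up" shows up in err.message
// (also with code ECONNRESET) when the connection dies before a response arrives.
const req = http.get("http://example.com/", (res) => {
  res.resume(); // drain the body so the socket is freed
  console.log("status:", res.statusCode);
});

req.on("error", (err) => {
  console.error("request failed:", err.code, "-", err.message);
});
```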

gememma (Member) commented Sep 12, 2024

A short(ish) summary of the investigation:

Methods attempted:

  • in CI, both with and without mirrord (the tests test_outgoing_traffic_many_requests_enabled/_disabled respectively)
  • locally on a Mac M1, macOS 14.6.1, via the tests above and by manually running the app with mirrord, as well as running Node without mirrord

All of the above used the script in tests/node-e2e/outgoing/test_outgoing_traffic_many_requests.mjs, changing the number of loops to make more requests.

Observations:

  • the error that occurs (socket hang up) is a timeout issue - the client has waited too long for a response since sending the request and force-closes the socket
  • mirrord does not emit any interesting logs that suggest unexpected errors or weird behaviour, even when using mirrord-console
  • in Node, all requests are sent before any responses are received, both with and without mirrord
  • I believe this reveals that the issue is caused by all the requests (the critical mass locally with mirrord was around 240 requests sent one after the other) competing for CPU time and choking each other out - mirrord can only do so many things at once
  • running locally without mirrord succeeds until around 2400 requests, at which point the same error occurs - any computer can only do so many things at once
  • running a similar script in Python did not encounter the same issue, probably because the script blocked on receiving the response from each request before making the next
  • in Node, the issue is completely fixed if you add a second of sleep() between each batch of 12 requests (see the sketch after this list)
  • sending requests to localhost in a cluster does not cause the same issue (I assume because it's so much faster that it never reaches the point of serious CPU time contention)
  • when running in CI, sending enough requests causes both test_outgoing_traffic_many_requests_enabled/_disabled to fail, just like running the script locally
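The batching workaround from the observations above looks roughly like this (a sketch, assuming a hypothetical sendRequest() helper that fires a single outgoing request; the batch size of 12 and the one-second pause match what worked locally):

```js
// Sketch of the workaround: send requests in batches of 12 and sleep for a
// second between batches so responses can be processed before the next burst.
// sendRequest(i) is a hypothetical helper that performs one outgoing HTTP request.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const TOTAL_REQUESTS = 240; // roughly where the failures started locally
const BATCH_SIZE = 12;

for (let start = 0; start < TOTAL_REQUESTS; start += BATCH_SIZE) {
  const batch = [];
  for (let i = start; i < Math.min(start + BATCH_SIZE, TOTAL_REQUESTS); i++) {
    batch.push(sendRequest(i));
  }
  await Promise.all(batch); // wait for this batch's responses
  await sleep(1000);        // one second of breathing room between batches
}
```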

eyalb181 (Member) commented:

Closing this as it seems it's not a bug on our end, just mirrord exacerbating timeouts with its (negligible) added latency, which is a known issue.

eyalb181 closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 12, 2024