IPC channel stops delivering messages to cluster workers #9706
Comments
What's the ulimit value for open file descriptors? It could be you're hitting that value. Does increasing it help? See: http://superuser.com/a/303058
@santigimeno I'm able to repro with the hard and soft file descriptor limits set to 1M.
@davidvetrano yes, I could reproduce the issue on … What I have observed is that at some point the … /cc @bnoordhuis
I lose IPC communication on Linux too, having 41 workers and rather heavy DB access on each. No error is given on the console.
Update: After testing again, v6.1 does in fact have the bug. Please disregard.
Yeah, I have also reproduced it in …
@santigimeno Should this remain open? /cc @bnoordhuis, @cjihrig, @mcollina
@santigimeno #13235 fixed this, didn't it? That PR is on track for v6.x and it seems reasonable to me to also target v4.x since it's a rather insidious bug.
I think so. I can still reproduce it with current master on …
@bnoordhuis sorry I hadn't read your comment before answering...
I had forgotten about this one, but it certainly looks like it could have been solved by #13235. However, from a quick check it doesn't look like it's solved, so it may be a different issue.
It is possible that `recvmsg()` may return an error on ancillary data reception when receiving a `NODE_HANDLE` message (for example `MSG_CTRUNC`). This would end up, if the handle type was `net.Socket`, on a `message` event with a non-null but invalid `sendHandle`. To improve the situation, send a `NODE_HANDLE_NACK` that'll cause the sending process to retransmit the message again. In case the same message is retransmitted 3 times without success, close the handle and print a warning.

Fixes: #9706
PR-URL: #13235
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
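The retry logic described above is easiest to see in isolation. The following is a hypothetical, self-contained simulation of that NACK/retransmit idea, not Node's actual internals: an unreliable "handle delivery" is modelled by a function that randomly fails, and the sender retries up to three times before giving up and warning.

```js
'use strict';
// Simulation of the NACK/retransmit scheme described in the commit message.
// tryDeliverHandle() stands in for recvmsg() occasionally truncating the
// ancillary data (MSG_CTRUNC), which leaves the receiver with an unusable
// sendHandle and makes it answer with a NODE_HANDLE_NACK.
const MAX_RETRANSMISSIONS = 3;

function tryDeliverHandle() {
  return Math.random() > 0.5; // true = handle arrived intact
}

function sendWithRetry(label, attempt = 1) {
  if (tryDeliverHandle()) {
    console.log(`${label}: delivered on attempt ${attempt}`);
    return;
  }
  if (attempt >= MAX_RETRANSMISSIONS) {
    // After three failed retransmissions the sender closes the handle
    // and emits a warning instead of retrying forever.
    console.warn(`${label}: giving up after ${attempt} attempts, closing handle`);
    return;
  }
  sendWithRetry(label, attempt + 1); // receiver NACKed, retransmit
}

sendWithRetry('socket #1');
```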
I've landed the fix in v6.x-staging in 5160d3d
To be clear: this is still an issue, #13235 didn't fix it.
I believe I'm running into this issue while working on parallelizing a build process (creating a pool of workers and distributing hundreds of tasks among them). That's on Node 8.8.1.
Hello, I was hoping this was fixed in Node 8, but alas it seems to be even worse now. Using the script provided above it's very easy to reproduce, and it's alarmingly easy to see bad effects when you increase the worker count from 2 to something like 8. I was able to see the immediate effects when using a larger cluster. Here are logs from the script right after ab was started:
@pitaj @rooftopsparrow What platforms are you observing the issue on?
This was tested on macOS 10.13.2, Node v8.9.4, and the latest Express.
@santigimeno I tested it on win32, Node 8.8.1.
@rooftopsparrow can you check using this PR that includes [email protected]? libuv/libuv@e6168df might have improved things for …
Yes, I can try that when I get a moment. I've also done a little more exploring to figure out what is going on with these missing (or really, really delayed) pings and will hopefully have a better idea of what is going on later.
@santigimeno I'm currently building that PR on my Windows 10 x64 system here, then I'll test it.
@pitaj the fix I was thinking about was specifically for …
I've been running the test for several hours on …
It appears to me that if you send enough messages through the IPC channel, it will completely lock up the main process. I'm running Mocha tests and the timeout just doesn't happen, nor does a hard-coded timeout. This seems to occur at exactly 1000 messages per child process (8000 handles total). Edit: it's been running for 17 hours now and still hasn't concluded.
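For concreteness, a minimal sketch of that scenario (hypothetical, not the commenter's actual Mocha test): fork eight workers, push 1000 handle-carrying IPC messages to each, and count the acknowledgements that come back. If the channel stalls, the per-worker counts never reach the target.

```js
'use strict';
// Hypothetical repro sketch: many handle-carrying IPC messages per worker.
const cluster = require('cluster');
const net = require('net');

const WORKERS = 8;
const MESSAGES_PER_WORKER = 1000;

if (cluster.isMaster) {
  // A listening server whose handle is attached to every IPC message.
  const server = net.createServer().listen(0, () => {
    for (let i = 0; i < WORKERS; i++) {
      const worker = cluster.fork();
      let acked = 0;
      worker.on('message', () => {
        if (++acked === MESSAGES_PER_WORKER) {
          console.log(`worker ${worker.id}: all ${acked} messages acknowledged`);
        }
      });
      for (let n = 0; n < MESSAGES_PER_WORKER; n++) {
        worker.send({ seq: n }, server); // each message carries a handle
      }
    }
  });
} else {
  process.on('message', (msg, handle) => {
    if (handle && typeof handle.close === 'function') handle.close();
    process.send({ ack: msg.seq });
  });
}
```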
ping @pitaj - is there a conclusion on this?
It seems like it eventually concludes if given enough time... but it might be exponential time, as very small changes in message number (like 990 vs 1000) result in very large changes in the amount of time required.
@pitaj - thanks for the quick response, let me investigate. Does the same test case in the OP apply as such, or is there a more refined version?
Thanks @pitaj - that was vital info. So it turns out to be a performance issue as opposed to a functional one. In my recreate, when I profiled Node, I see these patterns in the workers: out of a total of 48569 samples, … were spent in these obscure routines, which I assume belong to Darwin's TCP layer. On macOS the clearing up of closed sockets seems to slow down in-flight socket operations in the TCP space; this is speculation based on the debugging in the connected issue, with no documented evidence to support it. However, the circumstances match, and the much lower frequency of the issue on Linux also supports this theory. Proof: if I remove TCP from the picture by removing the Express code and instead send the IPC message ('fromEndpoint') on a 100ms interval, I don't see missing pings, and the behavior is consistent across platforms (a sketch of that variant follows). Related to nodejs/help#1179, and the relevant debug data is in nodejs/help#1179 (comment). I would suggest to see if there are …
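Below is a minimal sketch of that TCP-free variant, assuming the setup hinted at above: no HTTP/Express at all, just the master pinging each worker over the IPC channel every 100 ms and flagging replies that fall behind. The message names ('ping', 'fromEndpoint') and worker count are illustrative.

```js
'use strict';
// TCP-free IPC ping sketch: master <-> worker messages only, no HTTP.
const cluster = require('cluster');

if (cluster.isMaster) {
  for (let i = 0; i < 4; i++) cluster.fork();

  for (const id in cluster.workers) {
    const worker = cluster.workers[id];
    let sent = 0;
    let received = 0;
    worker.on('message', (msg) => {
      if (msg.cmd === 'fromEndpoint') received++;
    });
    setInterval(() => {
      worker.send({ cmd: 'ping', seq: sent++ });
      // If replies lag far behind, the channel has stopped delivering.
      if (sent - received > 10) {
        console.warn(`worker ${id}: ${sent - received} pings outstanding`);
      }
    }, 100);
  }
} else {
  process.on('message', (msg) => {
    if (msg.cmd === 'ping') process.send({ cmd: 'fromEndpoint', seq: msg.seq });
  });
}
```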
@gireeshpunathil I'm on Windows, and it seems like your comment is mostly about Linux?
@pitaj, thanks. I was following the code and the platform from the original postings. Are you using the same code on Windows, or something different? If so, please pass it on. Also, what is the observation - similar to macOS, same as macOS, or different? I too tested on Windows, and I got a surprising result (with certain tunings to the original test, we get a complete hang!). We need to separate that issue from this one, so let me hear from you.
Looked at the Windows hang, and understood the reason too. Every time a client connects, 25 messages (…) are sent. So depending on the number of concurrent requests, performance can really vary, and the dependency between the requests and the latency is exponential (as you already observed earlier):
There is no Windows-specific issue observed here from a Node.js perspective, other than potential differences in system configuration / resources. So my original proposal on horizontal scaling stands.
Not a Node.js bug, closing. Exponential stress in the TCP layer causes the process to slow down; the suggestion is to share work between multiple hosts.
@gireeshpunathil I have some repro code that doesn't use any TCP AFAIK, unless the IPC channel uses TCP itself. I can throw that up on a gist later today.
Thanks @pitaj for the repro. Turns out that the Windows issue (complete hang) is unrelated to the originally posted issue (slow response) on macOS and family. I am able to reproduce the hang. Looking at multiple dumps, I see that the main thread of different processes (including the master) are engaged in:
node.exe!uv_pipe_write_impl(uv_loop_s * loop, uv_write_s * req, uv_pipe_s * handle, const uv_buf_t * bufs, unsigned int nbufs, uv_stream_s * send_handle, void(*)(uv_write_s *, int) cb) Line 1347 C
node.exe!uv_write(uv_write_s * req, uv_stream_s * handle, const uv_buf_t * bufs, unsigned int nbufs, void(*)(uv_write_s *, int) cb) Line 139 C
node.exe!node::LibuvStreamWrap::DoWrite(node::WriteWrap * req_wrap, uv_buf_t * bufs, unsigned __int64 count, uv_stream_s * send_handle) Line 345 C++
node.exe!node::StreamBase::Write(uv_buf_t * bufs, unsigned __int64 count, uv_stream_s * send_handle, v8::Local<v8::Object> req_wrap_obj) Line 222 C++
node.exe!node::StreamBase::WriteString<1>(const v8::FunctionCallbackInfo<v8::Value> & args) Line 300 C++
node.exe!node::StreamBase::JSMethod<node::LibuvStreamWrap,&node::StreamBase::WriteString<1> >(const v8::FunctionCallbackInfo<v8::Value> & args) Line 408 C++
node.exe!v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo * handler) Line 30 C++
node.exe!v8::internal::`anonymous namespace'::HandleApiCallHelper<0>(v8::internal::Isolate * isolate, v8::internal::Handle<v8::internal::HeapObject> new_target, v8::internal::Handle<v8::internal::HeapObject> fun_data, v8::internal::Handle<v8::internal::FunctionTemplateInfo> receiver, v8::internal::Handle<v8::internal::Object> args, v8::internal::BuiltinArguments) Line 110 C++
node.exe!v8::internal::Builtin_Impl_HandleApiCall(v8::internal::BuiltinArguments args, v8::internal::Isolate * isolate) Line 138 C++
node.exe!v8::internal::Builtin_HandleApiCall(int args_length, v8::internal::Object * * args_object, v8::internal::Isolate * isolate) Line 126 C++
[External Code]
This is a known issue with libuv where multiple parties attempt to write to the same pipe, from either side, under rare situations. #7657 posted this originally, and libuv/libuv#1843 fixed it recently. It will be some time before Node.js consumes it.
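For illustration only, here is a hypothetical pattern that exercises the situation described (both ends of the IPC pipe writing heavily at the same time); it is not a guaranteed reproduction of the libuv deadlock.

```js
'use strict';
// Both ends of the IPC channel write bursts of messages concurrently.
const cluster = require('cluster');

if (cluster.isMaster) {
  const worker = cluster.fork();
  worker.on('message', () => {});            // drain whatever arrives
  setInterval(() => {
    for (let i = 0; i < 100; i++) worker.send({ from: 'master', i });
  }, 10);
} else {
  process.on('message', () => {});
  setInterval(() => {
    for (let i = 0; i < 100; i++) process.send({ from: 'worker', i });
  }, 10);
}
```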
Notable changes:
- Building via cmake is now supported. PR-URL: libuv/libuv#1850
- Stricter checks have been added to prevent watching the same file descriptor multiple times. PR-URL: libuv/libuv#1851 Refs: #3604
- An IPC deadlock on Windows has been fixed. PR-URL: libuv/libuv#1843 Fixes: #9706 Fixes: #7657
- uv_fs_lchown() has been added. PR-URL: libuv/libuv#1826 Refs: #19868
- uv_fs_copyfile() sets errno on error. PR-URL: libuv/libuv#1881 Fixes: #21329
- uv_fs_fchmod() supports -A files on Windows. PR-URL: libuv/libuv#1819 Refs: #12803

PR-URL: #21466
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Notable changes:
- Building via cmake is now supported. PR-URL: libuv/libuv#1850
- Stricter checks have been added to prevent watching the same file descriptor multiple times. PR-URL: libuv/libuv#1851 Refs: #3604
- An IPC deadlock on Windows has been fixed. PR-URL: libuv/libuv#1843 Fixes: #9706 Fixes: #7657
- uv_fs_lchown() has been added. PR-URL: libuv/libuv#1826 Refs: #19868
- uv_fs_copyfile() sets errno on error. PR-URL: libuv/libuv#1881 Fixes: #21329
- uv_fs_fchmod() supports -A files on Windows. PR-URL: libuv/libuv#1819 Refs: #12803

Backport-PR-URL: #24103
PR-URL: #21466
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: James M Snell <[email protected]>
When many IPC messages are sent between the master process and cluster workers, IPC channels to workers stop delivering messages. I have not been able to restore working functionality of the workers, and so they must be killed to resolve the issue. Since IPC has stopped working, simply using `Worker.destroy()` does not work, since the method will wait for the `disconnect` event, which never arrives (because of this issue). I am able to repro on OS X by running the following script:
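The original script was not captured in this page; the following is a minimal sketch in the same spirit (worker count, port, and message names are placeholders): workers serve HTTP while the master exchanges periodic IPC messages with them and reports when replies stop arriving. Driving the HTTP port with a load tool such as ApacheBench then surfaces the stalled channel.

```js
'use strict';
// Sketch of the described repro: HTTP load on workers + periodic IPC pings.
const cluster = require('cluster');
const http = require('http');

const WORKERS = 4;

if (cluster.isMaster) {
  for (let i = 0; i < WORKERS; i++) cluster.fork();

  for (const id in cluster.workers) {
    const worker = cluster.workers[id];
    let lastReply = Date.now();
    worker.on('message', () => { lastReply = Date.now(); });
    setInterval(() => {
      worker.send({ cmd: 'ping' });
      if (Date.now() - lastReply > 5000) {
        console.warn(`worker ${id}: no IPC reply for 5 seconds`);
      }
    }, 100);
  }
} else {
  process.on('message', (msg) => {
    if (msg.cmd === 'ping') process.send({ cmd: 'pong' });
  });
  http.createServer((req, res) => res.end('ok')).listen(3000);
}
```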
and using ApacheBench to place the server under load as follows:
I see the following, for example:
As I alluded to earlier, I have seen an issue on Linux which I believe is related, but I have so far been unable to repro using this technique there.