
test: fix test-async-wrap-getasyncid flakyness #14329

Conversation

misterdjules

The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and run its connection callback wrapped by common.mustCall.

However, that is not necessarily the case, and as a result the test was
flaky.

With this change, the server is closed only once its connection callback
has been called. This guarantees that the server keeps accepting
connections until it has actually accepted the single connection that
the test requires, which makes the test reliable.

Fixes: #13020

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included (not relevant in this case)
  • documentation is changed or added (not relevant in this case)
  • commit message follows commit guidelines
Affected core subsystem(s)

test

@misterdjules
Author

As mentioned in #13020:

parallel/test-async-wrap-getasyncid has been running with the change mentioned above successfully for > 12 hours without any failure, on the same SmartOS VM on which I was able to consistently reproduce the problem in < 1 hour.

and:

FWIW, I ran parallel/test-async-wrap-getasyncid with the changes in #14329 for > 48 hours without a single failure on SmartOS.

@geek
Member

geek commented Jul 17, 2017

LGTM

@refack
Contributor

refack commented Jul 17, 2017

/cc @bnoordhuis, @indutny, @nodejs/streams
I was thinking about this since @misterdjules made the observation that:

The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and ran its connection callback wrapped by common.mustCall.

However, it is not necessarily the case, and as a result the test was
flaky.

Isn't that a reasonable assumption?
The assumption was even more narrow — that the connection event will fire on the server if shutdown has completed on the client.

@refack
Contributor

refack commented Jul 17, 2017

To be more precise:
that the connection event has fired on the server if shutdown has completed on the client, or that it will fire even if server.close() was called.

[addition]
IMHO it's an implication of connection-oriented communication that, if a client's messages have been "ACK+FIN"ed, something on the server side has seen the message.

[2nd addition]
The message is 'hi'.repeat(100000)

@misterdjules
Author

@refack

IMHO it's an implication of connection-oriented communication that, if a client's messages have been "ACK+FIN"ed, something on the server side has seen the message.

That "something" is the kernel, not necessarily any user process. The kernel can acknowledge connections that have not been "accepted" yet. It can use receive buffers to buffer any incoming data until that data is read by any user process. These buffers can have their size changed by user processes, but their default size can be larger than the size of the message sent in that test, thus allowing the client to perform the full connect/send/close chain of operations without a user process needing to accept the connection and read from it.

Does that clarify things?

@refack
Contributor

refack commented Jul 21, 2017

That "something" is the kernel, not necessarily any user process. The kernel can acknowledge connections that have not been "accepted" yet.

I assumed it is the kernel that currently acknowledges the connection, thus fulfilling the obligations of TCP. What I'm asking is: is that enough? Can't we give this guarantee at the application layer as well?

Re buffer sizes, we choose a "large" message ('hi'.repeat(100000) = 2MB) so we'd trigger as much asynchronicity as possible.

So my question still stands:

Isn't it a bug that a client can INIT => SEND 2MB => FIN+ACK without the server (at the application layer, i.e. Node) ever triggering the connection event?

@misterdjules
Author

misterdjules commented Jul 21, 2017

@refack

What I'm asking is: is that enough? Can't we give this guarantee at the application layer as well?

The client and the server are operating asynchronously from each other.

There is no guarantee from the OS that the fd polling mechanism will identify the server socket as readable (or "acceptable") on the next poll after a successful "connect" call from the client.

The same can be said for any other interaction between the client and the server.

So if the client doesn't wait for data from the server and destroys the socket that the server is listening on when it's done, then there can be no guarantee that the server will be able to perform any operation on that socket before it's closed, including accepting a connection and thus emitting a 'connection' event.

Re buffer sizes, we choose a "large" message ('hi'.repeat(100000) = 2MB) so we'd trigger as much asynchronicity as possible.

Unless I'm missing something, it seems to me that the test sends 200KB of data, not 2MB. It is quite possible that increasing the amount of data sent would reduce the number of test failures, but there would still be a chance of failures. The test would still rely on luck to succeed and would still be incorrect.
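The arithmetic is easy to check: each 'hi' is two one-byte characters, so the payload is 200,000 bytes, roughly 195 KiB:

```javascript
// The payload used by the test: 100,000 repetitions of a 2-byte string.
const payload = 'hi'.repeat(100000);
console.log(Buffer.byteLength(payload)); // 200000 (bytes), i.e. ~195 KiB, not 2 MB
```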

@refack
Contributor

refack commented Jul 21, 2017

There is no guarantee from the OS that the fd polling mechanism will identify the server socket as readable (or "acceptable") on the next poll after a successful "connect" call from the client.

Unless I'm missing something, it seems to me that the test sends 200KB of data, not 2MB.

@misterdjules, funny thing, googling this brought up nodejs/node-v0.x-archive#15443 (comment)

You are right, it's just 200K, but I would still imagine that should fill the receive buffer...


I think I understand my wrong assumption: we are not using the net API, but are directly testing the different async operations on the handle (i.e. the socket). We are not waiting for the 'close' event but are shutting down the socket as soon as it finishes pushing the data to the OS. So we are already breaking the contract with the OS on the client side.

So, I have no objections to landing this.

@misterdjules
Author

You are right it's just 200K, but still I would imagine that should fill the receive buffer...

The buffer is larger than 200K by default on SmartOS; its size is 1MB:

[root@e31d952d-043b-4188-e3fa-9ae120820740 ~]# ipadm show-prop -p recv_buf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   recv_buf              rw   1048576      --           1048576      2048-1048576
[root@e31d952d-043b-4188-e3fa-9ae120820740 ~]# 

It seems that on OSX (at least on the version I'm using) the TCP receive buffer might be 512KiB:

$ sysctl -a  | grep tcp | grep buf    
net.inet.tcp.doautorcvbuf: 1
net.inet.tcp.autorcvbufincshift: 3
net.inet.tcp.autorcvbufmax: 524288
net.inet.tcp.doautosndbuf: 1
net.inet.tcp.autosndbufinc: 8192
net.inet.tcp.autosndbufmax: 524288
net.smb.fs.tcpsndbuf: 4194304
net.smb.fs.tcprcvbuf: 4194304
$

which could explain why #13020 was also reproduced on OSX.

Thank you for all the questions and the review, I'll land this PR asap now.

@misterdjules
Author

New CI test after CI platform failure: https://ci.nodejs.org/job/node-test-pull-request/9335/.
New stress test: https://ci.nodejs.org/job/node-stress-single-test/1346/.

@misterdjules
Author

The job failures seem to be unrelated to the actual change, and the stress test has completed ~7500 runs so far on SmartOS without any failure, so I'll land this change shortly.

@misterdjules
Author

Landed in 2da1af0.

misterdjules pushed a commit that referenced this pull request Jul 25, 2017
The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and ran its connection callback wrapped by common.mustCall.

However, it is not necessarily the case, and as a result the test was
flaky.

With this change, the server is closed only once the server's connection
callback has been called, which guarantees that the server can accept
connections until it actually accepted the single connection that this
test requires to accept, making the test not flaky.

PR-URL: #14329
Fixes: #13020
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Refael Ackermann <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Wyatt Preul <[email protected]>
addaleax pushed a commit that referenced this pull request Jul 27, 2017
The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and ran its connection callback wrapped by common.mustCall.

However, it is not necessarily the case, and as a result the test was
flaky.

With this change, the server is closed only once the server's connection
callback has been called, which guarantees that the server can accept
connections until it actually accepted the single connection that this
test requires to accept, making the test not flaky.

PR-URL: #14329
Fixes: #13020
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Refael Ackermann <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Wyatt Preul <[email protected]>
@addaleax addaleax mentioned this pull request Aug 2, 2017
Labels
test Issues and PRs related to the tests.

Successfully merging this pull request may close these issues.

test: investigate flakiness of parallel/test-async-wrap-getasyncid
9 participants