
test: fix test-async-wrap-getasyncid flakyness #14329

Conversation

misterdjules

The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and run its connection callback wrapped by common.mustCall.

However, that is not necessarily the case, and as a result the test was
flaky.

With this change, the server is closed only once its connection callback
has been called. This guarantees that the server keeps accepting
connections until it has actually accepted the single connection that
the test requires, which makes the test reliable.

Fixes: #13020

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included (not relevant in this case)
  • documentation is changed or added (not relevant in this case)
  • commit message follows commit guidelines
Affected core subsystem(s)

test

@misterdjules
Author

As mentioned in #13020:

parallel/test-async-wrap-getasyncid has been running with the change mentioned above successfully for > 12 hours without any failure, on the same SmartOS VM on which I was able to consistently reproduce the problem in < 1 hour.

and:

FWIW, I ran parallel/test-async-wrap-getasyncid with the changes in #14329 for > 48 hours without a single failure on SmartOS.

@geek
Member

geek commented Jul 17, 2017

LGTM

@refack
Contributor

refack commented Jul 17, 2017

/cc @bnoordhuis, @indutny, @nodejs/streams
I was thinking about this since @misterdjules made the observation that:

The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and ran its connection callback wrapped by common.mustCall.

However, it is not necessarily the case, and as a result the test was
flaky.

Isn't that a reasonable assumption?
The assumption was even more narrow — that the connection event will fire on the server if shutdown has completed on the client.

@refack
Contributor

refack commented Jul 17, 2017

To be more precise:
that the connection event has fired on the server if shutdown has completed on the client, or that it will fire even if server.close() was called.

[addition]
IMHO it's an implication of connection-oriented communication that, if a client's messages have been "ACK+FIN"ed, something on the server side has seen the message.

[2nd addition]
The message is 'hi'.repeat(100000)

@misterdjules
Author

@refack

IMHO it's an implication of connection-oriented communication that, if a client's messages have been "ACK+FIN"ed, something on the server side has seen the message.

That "something" is the kernel, not necessarily any user process. The kernel can acknowledge connections that have not been "accepted" yet. It can use receive buffers to buffer any incoming data until that data is read by any user process. These buffers can have their size changed by user processes, but their default size can be larger than the size of the message sent in that test, thus allowing the client to perform the full connect/send/close chain of operations without a user process needing to accept the connection and read from it.

Does that clarify things?

@refack
Contributor

refack commented Jul 21, 2017

That "something" is the kernel, not necessarily any user process. The kernel can acknowledge connections that have not been "accepted" yet.

I assumed it is the kernel that currently acknowledges the connection, thus fulfilling the obligations of TCP. What I'm asking is: is that enough? Can't we give this guarantee at the application layer as well?

Re buffer sizes, we choose a "large" message ('hi'.repeat(100000) = 2MB) so we'd trigger as much asynchronicity as possible.

So my question still stands:

Isn't it a bug that a client can INIT => SEND 2MB => FIN+ACK without the server (at the application layer, i.e. Node) ever triggering the connection event?

@misterdjules
Author

misterdjules commented Jul 21, 2017

@refack

What I'm asking is: is that enough? Can't we give this guarantee at the application layer as well?

The client and the server are operating asynchronously from each other.

There is no guarantee from the OS that the fd polling mechanism will identify the server socket as readable (or "acceptable") on the next poll after a successful "connect" call from the client.

The same can be said for any other interaction between the client and the server.

So if the client doesn't wait for data from the server and destroys the socket that the server is listening on when it's done, then there can be no guarantee that the server will be able to perform any operation on that socket before it's closed, including accepting a connection and thus emitting a 'connection' event.

Re buffer sizes, we choose a "large" message ('hi'.repeat(100000) = 2MB) so we'd trigger as much asynchronicity as possible.

Unless I'm missing something, it seems to me that the test sends 200KB of data, not 2MB. It is quite possible that increasing the amount of data sent would reduce the number of test failures, but there would still be a chance of failures. The test would still rely on luck to succeed and would still be incorrect.
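The arithmetic is easy to check: each 'hi' is two one-byte characters, so the payload is 200,000 bytes, roughly 195 KiB:

```javascript
// The payload used by the test: 100,000 repetitions of a 2-byte string.
const payload = 'hi'.repeat(100000);
console.log(Buffer.byteLength(payload)); // 200000 (bytes), i.e. ~195 KiB, not 2 MB
```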

@refack
Contributor

refack commented Jul 21, 2017

There is no guarantee from the OS that the fd polling mechanism will identify the server socket as readable (or "acceptable") on the next poll after a successful "connect" call from the client.

Unless I'm missing something, it seems to me that the test sends 200KB of data, not 2MB.

@misterdjules, funny thing, googling this brought up nodejs/node-v0.x-archive#15443 (comment)

You are right, it's just 200K, but I would still imagine that should fill the receive buffer...


I think I understand my wrong assumption: we are not using the net API, but are directly testing the different async operations on the handle (i.e. the socket). We are not waiting for the 'close' event but are shutting down the socket as soon as it finishes pushing the data to the OS. So we are already breaking the contract with the OS on the client side.

So, I have no objections to landing this.

@misterdjules
Author

You are right it's just 200K, but still I would imagine that should fill the receive buffer...

The buffer is larger than 200K by default on SmartOS; its size is 1MB:

[root@e31d952d-043b-4188-e3fa-9ae120820740 ~]# ipadm show-prop -p recv_buf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   recv_buf              rw   1048576      --           1048576      2048-1048576
[root@e31d952d-043b-4188-e3fa-9ae120820740 ~]# 

It seems that on OSX (at least on the version I'm using) the TCP receive buffer might be 512KiB:

$ sysctl -a  | grep tcp | grep buf    
net.inet.tcp.doautorcvbuf: 1
net.inet.tcp.autorcvbufincshift: 3
net.inet.tcp.autorcvbufmax: 524288
net.inet.tcp.doautosndbuf: 1
net.inet.tcp.autosndbufinc: 8192
net.inet.tcp.autosndbufmax: 524288
net.smb.fs.tcpsndbuf: 4194304
net.smb.fs.tcprcvbuf: 4194304
$

which could explain why #13020 was also reproduced on OSX.

Thank you for all the questions and the review, I'll land this PR asap now.

@misterdjules
Author

New CI test after CI platform failure: https://ci.nodejs.org/job/node-test-pull-request/9335/.
New stress test: https://ci.nodejs.org/job/node-stress-single-test/1346/.

@misterdjules
Author

The job failures seem to be unrelated to the actual change, and the stress test has completed ~7500 runs so far on SmartOS without any failure, so I'll land this change shortly.

@misterdjules
Author

Landed in 2da1af0.

misterdjules pushed a commit that referenced this pull request Jul 25, 2017
The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and ran its connection callback wrapped by common.mustCall.

However, it is not necessarily the case, and as a result the test was
flaky.

With this change, the server is closed only once the server's connection
callback has been called, which guarantees that the server can accept
connections until it actually accepted the single connection that this
test requires to accept, making the test not flaky.

PR-URL: #14329
Fixes: #13020
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Refael Ackermann <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Wyatt Preul <[email protected]>
addaleax pushed a commit that referenced this pull request Jul 27, 2017
The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and ran its connection callback wrapped by common.mustCall.

However, it is not necessarily the case, and as a result the test was
flaky.

With this change, the server is closed only once the server's connection
callback has been called, which guarantees that the server can accept
connections until it actually accepted the single connection that this
test requires to accept, making the test not flaky.

PR-URL: #14329
Fixes: #13020
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Refael Ackermann <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Wyatt Preul <[email protected]>
@addaleax addaleax mentioned this pull request Aug 2, 2017
Labels
test Issues and PRs related to the tests.

Successfully merging this pull request may close these issues.

test: investigate flakiness of parallel/test-async-wrap-getasyncid
9 participants