Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent test-tls-inception failure on windows #5386

Closed
gibfahn opened this issue Feb 23, 2016 · 33 comments
Closed

Intermittent test-tls-inception failure on windows #5386

gibfahn opened this issue Feb 23, 2016 · 33 comments
Labels
test Issues and PRs related to the tests. tls Issues and PRs related to the tls subsystem. windows Issues and PRs related to the Windows platform.

Comments

@gibfahn
Copy link
Member

gibfahn commented Feb 23, 2016

I'm getting this error on v4.3.1 while running the parallel test suite, it seems to happen about 30% of the time.

1..1
not ok 1 test-tls-inception.js
# 
# assert.js:89
#   throw new assert.AssertionError({
#   ^
# AssertionError: 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA == <Buffer 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 
#     at TLSSocket.<anonymous> (C:\gib\tmp\node\test\parallel\test-tls-inception.js:64:14)
#     at TLSSocket.<anonymous> (C:\gib\tmp\node\test\common.js:395:15)
#     at emitNone (events.js:72:20)
#     at TLSSocket.emit (events.js:166:7)
#     at endReadableNT (_stream_readable.js:905:12)
#     at nextTickCallbackWith2Args (node.js:437:9)
#     at process._tickCallback (node.js:351:17)
  ---
  duration_ms: 0.422
  ...

  • Version: v4.3.1
  • Platform: Windows 7 64 bit
  • Command: python tools/test.py --mode=release --progress=tap parallel/test-tls-inception

Let me know if I forgot to include any info!

@mhdawson mhdawson added the test Issues and PRs related to the tests. label Feb 23, 2016
@mhdawson
Copy link
Member

@santigimeno an FYI

@mscdex mscdex added the tls Issues and PRs related to the tls subsystem. label Feb 23, 2016
@santigimeno
Copy link
Member

@gibm can you add a console.log with the length of buf before the assertion at https://github.com/nodejs/node/blob/master/test/parallel/test-tls-inception.js#L64. Thanks

@gibfahn
Copy link
Member Author

gibfahn commented Feb 23, 2016

I added this on line 64,
console.log("Buffer length: " + buf.length)

And got this:
Buffer length: 376832

@santigimeno
Copy link
Member

Thanks. I'm trying to reproduce it here, in the meantime, and this is a blind guess, but replacing socket.destroy with socket.end at https://github.com/nodejs/node/blob/master/test/parallel/test-tls-inception.js#L36 helps something?

@gibfahn
Copy link
Member Author

gibfahn commented Feb 23, 2016

That change does make a difference, the test now fails more intermittently (about 20% of the time), and now fails with:

1..1
not ok 1 test-tls-inception.js
# events.js:141
#       throw er; // Unhandled 'error' event
#       ^
#
# Error: read ECONNRESET
#     at exports._errnoException (util.js:870:11)
#     at TLSWrap.onread (net.js:544:26)
# Buffer length: 400000
  ---
  duration_ms: 0.428

@santigimeno
Copy link
Member

@gibm I can't reproduce the original error :(. Taking the initial version of the test, if you reduce the size of the buffer from 400000 to 300000, does it make any difference? (With this change, the test still reproduces consistently the failure it was trying to fix originally)

@gibfahn
Copy link
Member Author

gibfahn commented Feb 24, 2016

@santigimeno Interesting that it isn't failing for you. So if you build v4.3.1 on a Windows 7 64bit machine and run python tools/test.py --mode=release --progress=tap parallel/test-tls-inception, say ten times, it never fails? It's failing pretty consistently for me.

Reducing the size of the buffer doesn't seem to make a difference either with socket.end() or socket.destroy(). I get the same errors.

@santigimeno
Copy link
Member

@gibm Sorry for not pointing out before but I have no Windows to test on. I tested on Jessie 64 and OS X.

Looking at the commit that changed the buffer size: d00b9fc, I changed it because it made the error that was trying to fix reproducible in a consistent way in OS X #4195 (comment). Maybe we could revert to the original value of 4000 where the error also happened, though rarely, and probably fixes the Windows problem. Can you check that?

It's interesting to note that this test hasn't been failing on the CI since the change was included, maybe we could run a stress test job in the CI once is back online to see if it can be reproducible.

@gibfahn
Copy link
Member Author

gibfahn commented Feb 24, 2016

@santigimeno Oh I see, I haven't noticed any failures on any other platforms either! I've just tried running it 100 times successively for each buffer length. For length 4000 it didn't fail at all, for length 40,000 it failed five times (out of 100). For 400,000 it seems to fail about 50% of the time.

It'd be good to try to work out what the cause of the problem is, especially if 4,000 makes the test less useful for OSX.

Adding socket.on('error',function(err){ console.error(err)}) on https://github.com/nodejs/node/blob/master/test/parallel/test-tls-inception.js#L38 does stop the test from failing, which is odd, as I would have thought an ECONNRESET error would cause the assert to fail.

@santigimeno
Copy link
Member

@gibm Yes that's odd. Maybe there's no real solution to the ECONNRESET error as it looks like the test is racey as it is now because both ends try close the connection and adding an error listener is the proper solution.

@santigimeno
Copy link
Member

Maybe asking @indutny about this, as he was the creator of the test?

@MylesBorins
Copy link
Contributor

try piping the error events from your various emitters to dev/null that will likely stop everything from exploding

edit: example --> https://github.com/nodejs/node/blob/master/test/parallel/test-debugger-util-regression.js#L55-L62

@gibfahn
Copy link
Member Author

gibfahn commented Feb 26, 2016

@thealphanerd Handling the error does make the test pass, however if there is an error, doesn't that mean that the test should be failing? I know the assert still passes which suggests otherwise, but just catching the error and then ignoring it feels like a bad idea.

socket.on('error', function (err) { console.error(err); });

@MylesBorins
Copy link
Contributor

@gibm I totally agree. I think the theory that @Trott and I had was that this was a deeper bug in the Event Emitter. You can see this pattern taking place in other places of the code base.

Perhaps we can do some digging and find out where the source of this is.

@Trott
Copy link
Member

Trott commented Feb 26, 2016

@thealphanerd You may be right about it being a bug in EventEmitter. I've always operated under the theory that the mysterious ECONNRESET issue we see in tests on Windows was either something particular to the Windows network stack or libuv on Windows.

But I don't actually know. It's true that we've seen this in tests more than once and a deeper/better assessment of what's going on would be good.

@MylesBorins
Copy link
Contributor

we need to make a consistently reproduceable test... which is sooooo hard

@Trott
Copy link
Member

Trott commented Feb 26, 2016

/cc @nodejs/platform-windows

@gibfahn
Copy link
Member Author

gibfahn commented Feb 29, 2016

@thealphanerd well this test does fail pretty consistently without the error handler. It doesn't fail every time, but I have yet to have it run successfully more than twice in a row on Windows 7 64bit.

@joaocgreis
Copy link
Member

I couldn't make this test fail on Windows 2012. I will try Windows 2008.

@gibm What type of machine are you getting these failures on? Virtual machine? How many processors? Do you have free memory or is the machine under stress? I'll try to create similar conditions locally.

@gibfahn
Copy link
Member Author

gibfahn commented Apr 6, 2016

@joaocgreis I'm pretty sure I was running this on my Windows 7 laptop locally (dual core i7 Broadwell, 16Gb RAM). I'll check again tomorrow though. It wouldn't be under much load, and there should be plenty of free RAM. Interesting that you couldn't get it to fail, I'll try it on a win 2012 VM.

Were you testing Node v4?

@joaocgreis
Copy link
Member

Node compiled from tag v4.3.1. I was actually using the default of 32 bits in vcbuild, I'll try using x64.

@MylesBorins
Copy link
Contributor

@joaocgreis would you be able to test on v4.4.2?

@joaocgreis
Copy link
Member

I tried all combinations of Windows {2008, 2012}, node {4.3.1, 4.4.2} and {x86, x64}. Ran the test 100 times for each, no failures.

@gibm could this be something specific of your machine? Could you try to run it on a different computer?

@gibfahn
Copy link
Member Author

gibfahn commented Apr 11, 2016

I have now retested on my Win 7 laptop, a Win 7 VM, and a Win 2012 VM (all 64 bit).

On my laptop, if I download the Node 4.4.2 binary, put it in node/Release (using the git v4.4.2 tag), and run python tools/test.py --mode=release --progress=tap parallel/test-tls-inception, It fails over half the time with the error in my initial post. It's the same with v4.3.2.

If I build v4.4.2 myself with vcbuild (Visual Studio 2015), then I only get about 1 in 200 errors (I ran it ~2000 times).

If I download the Node v4.4.2 binary on the Win 7 VM, I get the problem about once in 200 runs.

If I download the Node v4.4.2 binary on the Win 2012 VM, I get the problem very occasionally (I only saw one failure in several thousand runs). It also only takes 0.2s (the others took about 0.5s).

Machine Node version Test time % failed
Win 7 Laptop v4.4.2 download 0.5s 60%
Win 7 Laptop v4.4.2 vcbuild 0.5s 1%
Win 7 VM v4.4.2 download 0.5s 0.5%
Win 2012 VM v4.4.2 download 0.2s 0.05%

I also noticed that my manual build was about 8.5 MB, while the one I downloaded was about 13.9 MB. Not sure if this is relevant.

So it does seem like there is a specific problem with my machine, but I am still seeing failures elsewhere. @joaocgreis could you try running this test with the downloaded binary on a W7 box, and maybe running it a few hundred times? I'll try and test this on some more machines.

If I'm using the wrong options somewhere, then please let me know!

@MylesBorins
Copy link
Contributor

/cc @nodejs/lts @jasnell

@joaocgreis
Copy link
Member

v4.4.2 x64, downloaded .exe:

  • Win 2012 R2: No failures in 25000 runs
  • Win 7: > 10%
  • Win 2008 R2: < 1%

(All VMs, all test times are about 0.2s)

Now that I have a repro I can investigate.

@gibfahn
Copy link
Member Author

gibfahn commented Apr 12, 2016

@joaocgreis Tested on my Windows 10 laptop and didn't see any failures in 2000 runs.

@joaocgreis
Copy link
Member

@thealphanerd this is not LTS specific, I get the same failures on Windows 7 with 5.10.1.

@Trott
Copy link
Member

Trott commented Jul 1, 2016

Just got it on master branch (with one other test changed): https://ci.nodejs.org/job/node-test-binary-windows/2719/RUN_SUBSET=2,VS_VERSION=vs2013,label=win2008r2/console

not ok 245 parallel/test-tls-inception
# 
# assert.js:89
#   throw new assert.AssertionError({
#   ^
# AssertionError: 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA == <Buffer 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 
#     at TLSSocket.<anonymous> (C:\workspace\node-test-binary-windows\RUN_SUBSET\2\VS_VERSION\vs2013\label\win2008r2\test\parallel\test-tls-inception.js:64:14)
#     at TLSSocket.<anonymous> (C:\workspace\node-test-binary-windows\RUN_SUBSET\2\VS_VERSION\vs2013\label\win2008r2\test\common.js:407:15)
#     at emitNone (events.js:91:20)
#     at TLSSocket.emit (events.js:185:7)
#     at endReadableNT (_stream_readable.js:973:12)
#     at _combinedTickCallback (internal/process/next_tick.js:74:11)
#     at process._tickCallback (internal/process/next_tick.js:98:9)
  ---
  duration_ms: 0.403

@Trott Trott added the windows Issues and PRs related to the Windows platform. label Jul 1, 2016
@Trott
Copy link
Member

Trott commented Oct 29, 2016

Still seeing this. Here's one from today:

https://ci.nodejs.org/job/node-test-binary-windows/4614/RUN_SUBSET=2,VS_VERSION=vs2015,label=win2008r2/console

not ok 261 parallel/test-tls-inception
# 
# assert.js:85
#   throw new assert.AssertionError({
#   ^
# AssertionError: 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA == <Buffer 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 
#     at TLSSocket.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\2\VS_VERSION\vs2015\label\win2008r2\test\parallel\test-tls-inception.js:64:14)
#     at TLSSocket.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\2\VS_VERSION\vs2015\label\win2008r2\test\common.js:421:15)
#     at emitNone (events.js:91:20)
#     at TLSSocket.emit (events.js:185:7)
#     at endReadableNT (_stream_readable.js:974:12)
#     at _combinedTickCallback (internal/process/next_tick.js:74:11)
#     at process._tickCallback (internal/process/next_tick.js:98:9)
  ---
  duration_ms: 0.399

@gibfahn
Copy link
Member Author

gibfahn commented Mar 26, 2017

Now that I have a repro I can investigate.

ping @joaocgreis

I closed the PR to fix this as I suspect it would just be papering over the issue for this specific test case.

@joaocgreis
Copy link
Member

Still haven't given up on this one. This might be related to other ECONNRESET errors on Windows.

@Trott
Copy link
Member

Trott commented Jul 31, 2017

Have not seen this one for a long time. Feel free to re-open if you disagree, but I'm inclined to close. I'm just tidying up and not acting on a super-strong opinion or anything like that.

@Trott Trott closed this as completed Jul 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Issues and PRs related to the tests. tls Issues and PRs related to the tls subsystem. windows Issues and PRs related to the Windows platform.
Projects
None yet
Development

No branches or pull requests

7 participants