Segmentation Fault with BlueBird 3.6/3.7 and Node 10 #1618
Comments
This looks like a Node.js core bug in v10 with async_hooks most likely. Thanks a lot for the repro! |
@benjamingr Yup, agree that it's probably |
@addaleax anything else a report should contain? |
Datapoint: Looks like this bug was not present in Node 10.1, but was/is present in Node 10.2. It's not present in Node 11.0. _tfw you wonder if you want to spend the time to hook up git bisect with a node build._
@astormnewrelic yes - I am on holiday at the moment - but if you can git-bisect this on Node with your repro and open an issue in the Node repo that would be helpful :] |
Thanks @benjamingr -- I'll see what I can do w/r/t a bisect -- although I've never done it on the node source tree before with a long C++ build in there, so we'll see how things go :). If anyone reading has known science for setting something like that up, links/info would be appreciated.
Just more talking out loud to myself -- people on holiday should not respond ;) I'm going to give the following script a try with We'll see what falls over in the (PST/Portland) morning :)
|
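For anyone setting up a similar bisect: the script referred to above is not preserved in this thread, so the following is only a minimal sketch of a test helper that `git bisect run` could call. The file name `bisect-check.js`, the function names, and the choice to treat only `SIGSEGV` as "bad" are all assumptions made for illustration. What is not an assumption is the exit-code contract of `git bisect run`: 0 means good, 1 to 127 (except 125) means bad, 125 means skip, and anything else aborts the bisect, which is why a raw shell segfault status (139) should be translated rather than passed through.

```javascript
// bisect-check.js: hypothetical helper for `git bisect run`
// (a sketch, not the script used in this thread).
const { spawnSync } = require('child_process');

const GOOD = 0;   // git bisect: this commit is good
const BAD = 1;    // git bisect: this commit is bad
const SKIP = 125; // git bisect: this commit cannot be tested

// Translate one spawnSync result into a bisect verdict.
function verdict(result) {
  if (result.error) return SKIP;               // binary missing or not executable
  if (result.signal === 'SIGSEGV') return BAD; // the crash being hunted
  return GOOD;                                 // assumption: other outcomes count as good
}

// The crash is intermittent, so run the repro several times per commit
// and stop at the first run that is not good.
function checkCommit(nodeBinary, script, attempts) {
  for (let i = 0; i < attempts; i++) {
    const v = verdict(spawnSync(nodeBinary, [script], { stdio: 'ignore' }));
    if (v !== GOOD) return v;
  }
  return GOOD;
}

// Hypothetical wiring, after the build step has produced ./node:
//   process.exitCode = checkCommit('./node', 'repro.js', 100);
```

Because the fault is flaky, too few attempts per commit can mark a bad commit as good and send the bisect in the wrong direction; a later comment in this thread reports the crash usually appearing within 5 runs and never needing more than about 100, which is where the 100 in the wiring comment comes from.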
@benjamingr I think a full core dump might be nice for a report. Also, just curious, what makes you think this is related to async_hooks? (Also, I can’t reproduce locally on x64 Linux.) |
OK -- the bisect seems to be implicating this commit as the one that introduced this segmentation fault. nodejs/node@56530f0 i.e. a
@addaleax Nice to meet you! Not speaking for @benjamingr, but on our side we suspected
@addaleax Also! I intend to open an issue for this over in the nodejs GitHub. I'll dockerize the reproduction (in addition to the already reported macOS, we ran our git bisect and saw the segmentation fault on an Amazon Linux 2 AMI 64-bit (x86)), and I'll do a debug build and get a core dump for the issue as well. Would all that be enough for y'all to go on?
I've also experienced segfaults with Bluebird 3.6 and 3.7 + Node 10. However, I couldn't isolate the problem. Don't know if it helps, but here is the link to our Travis with a giant core dump:
@astormnewrelic Are you sure? That commit doesn’t seem suitable to cause any kind of trouble at the first look, it’s basically only renaming a symbol to a string… Like, is there any chance that the flakiness in the test made it look like that was the first commit that caused trouble?
I can’t speak for other core developers, but I personally strongly prefer a reproduction of the kind that you already have over a containerized one – that makes debugging a lot easier. But I’m also not developing on mac.
@AndreMaz Since you seem to be running Linux – is there any chance you could run the program under valgrind ( |
Hey @addaleax my Linux machine is at my work so I'll do it tomorrow |
As sure as I can be. I had the bisect make a backup of each node binary as it built things. I just ran the |
@astormnewrelic I think that leaves two main options:
Either way, it might be worth trying to run something like valgrind (i.e. memory checkers) on this – I’ve heard valgrind itself doesn’t run too well on macOS, but it’s worth a shot, and maybe there are similar tools that I’m not aware of? |
Hey @addaleax this is what I get when I run Log:
|
@addaleax do you want me to run valgrind with some specific options? |
@addaleax you might be right regarding this being unrelated to async_hooks The changes also include 60ef7a0#diff-d1f5ad3087cc94d3dc6651e3265219b7R59 which calls @AndreMaz can you see if removing the |
@addaleax running
Full log here full-log.txt
@benjamingr will try it now |
@benjamingr after removing the |
@AndreMaz can you please try reproducing this with bluebird 3.7.1 ? Talking to @petkaantonov refresh is removed there |
@benjamingr yeah, updating to v3.7.1 seems to solve the issue. |
@benjamingr That still shouldn’t be an issue, right? Like, if calling Sadly, those verbose valgrind warnings correspond to a known issue with OpenSSL that has been fixed since (and that is also known to not cause crashes)… |
Right, this still looks like a bug in core and I am still not sure why it happens. Why would OpenSSL even be related to @AndreMaz thanks for the update and for the thorough debugging! It is very appreciated. |
Just to be clear, the OpenSSL issue really causes nothing besides valgrind warnings. It’s unrelated.
@addaleax ah, that makes sense. Thanks! I was really confused about why the two would be related :] |
Hey all -- got knocked down by a sinus infection so had to take a bit of a break from this. The repo at https://github.com/astormnewrelic/repro-bluebird-segfault now has a Dockerfile that can reproduce the problem I was seeing in Amazon Linux -- README includes instructions for reproducing. Next I intend to work on getting that core dump from a debug build.
And I got the core dumps working -- I had to Core dump attached. This is from a debug build of node 10.16.3 built on Amazon Linux, running the program from my reproduction repo, and the program crashed with a segmentation fault. I'll try to get a bug report submitted to node core sometime this week, but figured I'd share what I knew now in case it helps.
Update: The custom build that generated this core dump. http://18.222.178.114/node.zip
@astormnewrelic When working with custom builds, the binary used for causing the crash is just as much needed for debugging as the core dump itself :) |
@addaleax Thank you! I've added a link to the post above that should point at the binary I used to generate this core dump. If you wouldn't mind answering a naive question -- what does having the original binary allow you to do with the core dump that you wouldn't be able to do otherwise? I've only ever worked with gdb or lldb locally, and a lot of it's firmly in the "not sure what that's really doing, seems like magic" column to me :)
@astormnewrelic As far as I know, the core dump doesn’t contain symbols, which are necessary to figure out what code actually corresponds to which functions in the original binary. I might be wrong, but for me, running |
From a first look, it seems like the libuv timer heap has one entry, but that refers to a null pointer; that seems wrong, but it’s not obvious to me what led to that condition being true. Also, weirdly enough, I couldn’t reproduce the issue even with your debug build locally – does it just take a really long time for that to happen? (I’ll keep it running overnight 😄) And it’s basically a clean build of the v10.16.3 tag of the Node.js repo, right? |
@addaleax the segmentation fault usually happens within 5 runs of the program -- and I've never seen it take more than 100 or so, so overnight might be overkill. Out of "black box reproduction" curiosity -- what distro/version are you running things on where you're not seeing the segmentation fault? Re: clean build of the v10.16.3 tag of the Node.js repo -- yes, that's correct. |
@astorm I’m running Ubuntu 18.04, with Linux 4.15.0-65-generic on x64… and there’s nothing really special about my setup, I think |
@addaleax One more confusing datapoint -- I tried switching that Docker container from Amazon Linux to Ubuntu 18.04 and the crashes still happened. So -- all very confusing.
### 6.1.0 (2019-11-05):
- Native-metrics module is defaulted to disabled in serverless mode
- New env var, NEW_RELIC_NATIVE_METRICS_ENABLED, was added to enable/disable the native-metrics module
* Added a test for querying poolCluster.of()
* Removed unused bootstrap test code.
* Increased timeout for test to reduce flickers on Node 12.
* Changed file modification to leverage for test. This triggers the watcher in a reasonable amount of time much more consistently.
* Added module to agent for auto-include on install.
* Allow splitting of application name using semicolons in the env var
* Don't test Bluebird 3.7 on Node v10 until they fix [the segfault issue](petkaantonov/bluebird#1618)
* Instrument for mysql2
* Add HTTP method to segment attributes for external requests
- Updates the such that it uses verbose output, will exit on first error code, and will refuse to proceed with LibreSSL (which can't generate certs)
- Adds a sub-command to that will allow developers to quickly remove generated ssl/cert files and regenerate (useful if switching between platforms via containers/docker and certs need to be regenerated)
TL;DR -- We're seeing an intermittent segmentation fault with BlueBird in one of our tap-based test suites, and have a reproduction over here.
3.6 and 3.7
NodeJS version 10 on MacOS 10 -- it does not appear to happen on Node 12. Other platforms not tested.
It does not happen with BlueBird 3.5.
Details
One of the tests in our test suite has been failing randomly with a segmentation fault recently. We found some time to track it down, and it seems like it started happening with BlueBird 3.6 (maybe the async_hooks stuff?). The segmentation fault doesn't happen on every run -- if you check out this repository, we've set up a small "run until fail" reproduction case. The code in the repro is significantly stripped down from the code in our actual test suite.
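For readers without access to the linked repository, a "run until fail" loop can be sketched in a few lines of Node. This is not the code from the reproduction repo; the function name `runUntilFail`, its parameters, and the returned object shape are assumptions chosen for illustration. It re-runs a command until the child exits with a non-zero status or is killed by a signal (a segmentation fault shows up as `signal === 'SIGSEGV'`).

```javascript
// Hypothetical "run until fail" loop (a sketch, not the repro repo's code).
const childProcess = require('child_process');

// Re-run `command` until it fails or `maxRuns` is reached.
// Returns which run failed, with its exit status and terminating signal.
function runUntilFail(command, args, maxRuns) {
  for (let run = 1; run <= maxRuns; run++) {
    const { status, signal, error } = childProcess.spawnSync(command, args, {
      stdio: 'inherit', // let the child's own output reach the terminal
    });
    if (error) throw error; // the command could not be started at all
    if (signal !== null || status !== 0) {
      return { failed: true, run, status, signal };
    }
  }
  return { failed: false, run: maxRuns, status: 0, signal: null };
}

// Hypothetical usage, with test.js standing in for the flaky test file:
//   console.log(runUntilFail(process.execPath, ['test.js'], 100));
```

Spawning a fresh process per run matters here: a segfault kills the whole process, so the loop has to live outside the program under test.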
It's a very "spooky action at a distance" style bug -- sometimes the test runs fine, sometimes it fails with something like the following
We also captured the crash in lldb with a debug build of node 10.
Full backtrace from same follows