Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation Fault with BlueBird 3.6/3.7 and Node 10 #1618

Open
astormnewrelic opened this issue Oct 12, 2019 · 34 comments
Open

Segmentation Fault with BlueBird 3.6/3.7 and Node 10 #1618

astormnewrelic opened this issue Oct 12, 2019 · 34 comments

Comments

@astormnewrelic
Copy link

TL;DR -- We're seeing an intermittent segmentation fault with BlueBird in one of our tap based test suites, and have a reproduction over here.

  1. What version of bluebird is the issue happening on?

3.6 and 3.7

  1. What platform and version? (For example Node.js 0.12 or Google Chrome 32)

NodeJS version 10 on MacOS 10 -- it does not appear to happen on Node 12. Other plaforms not tested.

  1. Did this issue happen with earlier version of bluebird?

It does not happen with BlueBird 3.5.

Details

One of the tests in our test suite has been failing randomly with a segmentation fault recently. We found some time to track it down, and it seems like it started happening with BlueBird 3.6 (maybe the async_hooks stuff?). The segmentation fault doesn't happen on every run -- if you checkout this repository we've setup a small "run until fail" reproduction case. The code in the repro is significantly stripped down form the code in our actual test suite.

It's very "spooky action at a distance" style bug -- sometimes the test runs fine, sometimes it fails with something like the following

TAP version 13
# Subtest: bluebird
    # Subtest: new Promise() throw
        1..0
    ok 1 - new Promise() throw # time=4.55ms

./run-until-fail.sh: line 3: 36728 Segmentation fault: 11  $@

We also captured the crash in lldb with a debug build of node 10.

(lldb) r methods.tap.js
Process 36969 launched: '/Users/astorm/bin/node-debug' (x86_64)
TAP version 13
# Subtest: bluebird
    # Subtest: new Promise() throw
        1..0
    ok 1 - new Promise() throw # time=63.303ms

Process 36969 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000101cb25de node-debug`heap_insert(heap=0x00000001035094d8, newnode=0x0000000105c11498, less_than=(node-debug`timer_less_than at timer.c:39)) at heap-inl.h:140
   137 	
   138 	  /* Insert the new node. */
   139 	  newnode->parent = *parent;
-> 140 	  *child = newnode;
   141 	  heap->nelts += 1;
   142 	
   143 	  /* Walk up the tree and check at each node if the heap property holds.
Target 0: (node-debug) stopped.

Full backtrace from same follows

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000101cb25de node-debug`heap_insert(heap=0x00000001035094d8, newnode=0x0000000105c0ba48, less_than=(node-debug`timer_less_than at timer.c:39)) at heap-inl.h:140
    frame #1: 0x0000000101cb23b4 node-debug`uv_timer_start(handle=0x0000000105c0b9e0, cb=(node-debug`node::(anonymous namespace)::TimerWrap::OnTimeout(uv_timer_s*) at timer_wrap.cc:134), timeout=1, repeat=0) at timer.c:93
    frame #2: 0x00000001002f2af1 node-debug`node::(anonymous namespace)::TimerWrap::Start(args=0x00007ffeefbfc670) at timer_wrap.cc:120
    frame #3: 0x00000001007f4e21 node-debug`v8::internal::FunctionCallbackArguments::Call(this=0x00007ffeefbfc7e0, handler=0x00003c3547afc889) at api-arguments-inl.h:94
    frame #4: 0x00000001007f2aae node-debug`v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(isolate=0x0000000106004c00, function=Handle<v8::internal::HeapObject> @ 0x00007ffeefbfc8f0, new_target=Handle<v8::internal::HeapObject> @ 0x00007ffeefbfc8e8, fun_data=Handle<v8::internal::FunctionTemplateInfo> @ 0x00007ffeefbfc8e0, receiver=Handle<v8::internal::Object> @ 0x00007ffeefbfc8d8, args=BuiltinArguments @ 0x00007ffeefbfc910) at builtins-api.cc:109
    frame #5: 0x00000001007f111a node-debug`v8::internal::Builtin_Impl_HandleApiCall(args=BuiltinArguments @ 0x00007ffeefbfca40, isolate=0x0000000106004c00) at builtins-api.cc:139
    frame #6: 0x00000001007f0c4d node-debug`v8::internal::Builtin_HandleApiCall(args_length=6, args_object=0x00007ffeefbfcb28, isolate=0x0000000106004c00) at builtins-api.cc:127
    frame #7: 0x000022036e6c1d64
    frame #8: 0x000022036e5a0ab6
    frame #9: 0x000022036e59336f
    frame #10: 0x000022036e719b42
    frame #11: 0x000022036e5a0ab6
    frame #12: 0x000022036e58f303
    frame #13: 0x000022036e5a0ab6
    frame #14: 0x000022036e5a0ab6
    frame #15: 0x000022036e58f303
    frame #16: 0x000022036e5a0ab6
    frame #17: 0x000022036e5a0ab6
    frame #18: 0x000022036e5a0ab6
    frame #19: 0x000022036e5a0ab6
    frame #20: 0x000022036e5a0ab6
    frame #21: 0x000022036e5a0ab6
    frame #22: 0x000022036e58f303
    frame #23: 0x000022036e5a0ab6
    frame #24: 0x000022036e5a0ab6
    frame #25: 0x000022036e5a0ab6
    frame #26: 0x000022036e58f303
    frame #27: 0x000022036e5a0ab6
    frame #28: 0x000022036e5a0ab6
    frame #29: 0x000022036e5a0ab6
    frame #30: 0x000022036e58f303
    frame #31: 0x000022036e5a0ab6
    frame #32: 0x000022036e5a0ab6
    frame #33: 0x000022036e5a0ab6
    frame #34: 0x000022036e5a0ab6
    frame #35: 0x000022036e5a0ab6
    frame #36: 0x000022036e58f303
    frame #37: 0x000022036e5a0ab6
    frame #38: 0x000022036e5a0ab6
    frame #39: 0x000022036e5a0ab6
    frame #40: 0x000022036e58f303
    frame #41: 0x000022036e5a0ab6
    frame #42: 0x000022036e5a0ab6
    frame #43: 0x000022036e5a0ab6
    frame #44: 0x000022036e5a0ab6
    frame #45: 0x000022036e5a0ab6
    frame #46: 0x000022036e58f303
    frame #47: 0x000022036e5a0ab6
    frame #48: 0x000022036e5a0ab6
    frame #49: 0x000022036e597015
    frame #50: 0x000022036e58cfa1
    frame #51: 0x00000001010f11bd node-debug`v8::internal::GeneratedCode<v8::internal::Object*, v8::internal::Object*, v8::internal::Object*, v8::internal::Object*, int, v8::internal::Object***>::Call(this=0x00007ffeefbfdf88, args=0x00003c35e57826f1, args=0x00003c35bea07191, args=0x00003c353ee9f7a1, args=1, args=0x00007ffeefbfe740) at simulator.h:113
    frame #52: 0x00000001010eebe6 node-debug`v8::internal::(anonymous namespace)::Invoke(isolate=0x0000000106004c00, is_construct=false, target=Handle<v8::internal::Object> @ 0x00007ffeefbfe0a0, receiver=Handle<v8::internal::Object> @ 0x00007ffeefbfe098, argc=1, args=0x00007ffeefbfe740, new_target=Handle<v8::internal::Object> @ 0x00007ffeefbfe090, message_handling=kReport, execution_target=kCallable) at execution.cc:155
    frame #53: 0x00000001010ee439 node-debug`v8::internal::(anonymous namespace)::CallInternal(isolate=0x0000000106004c00, callable=Handle<v8::internal::Object> @ 0x00007ffeefbfe170, receiver=Handle<v8::internal::Object> @ 0x00007ffeefbfe168, argc=1, argv=0x00007ffeefbfe740, message_handling=kReport, target=kCallable) at execution.cc:191
    frame #54: 0x00000001010ee2cd node-debug`v8::internal::Execution::Call(isolate=0x0000000106004c00, callable=Handle<v8::internal::Object> @ 0x00007ffeefbfe1d0, receiver=Handle<v8::internal::Object> @ 0x00007ffeefbfe1c8, argc=1, argv=0x00007ffeefbfe740) at execution.cc:202
    frame #55: 0x0000000100683828 node-debug`v8::Function::Call(this=0x0000000107050640, context=(val_ = 0x000000010704f700), recv=(val_ = 0x0000000106021c40), argc=1, argv=0x00007ffeefbfe740) at api.cc:5206
    frame #56: 0x00000001000cbd0a node-debug`node::InternalMakeCallback(env=0x00007ffeefbfe9f8, recv=(val_ = 0x0000000106021c40), callback=(val_ = 0x0000000107050640), argc=1, argv=0x00007ffeefbfe740, asyncContext=(async_id = 35, trigger_async_id = 31)) at node.cc:672
    frame #57: 0x00000001000158b9 node-debug`node::AsyncWrap::MakeCallback(this=0x0000000105f020f0, cb=(val_ = 0x0000000107050640), argc=1, argv=0x00007ffeefbfe740) at async_wrap.cc:683
    frame #58: 0x00000001002f320c node-debug`node::(anonymous namespace)::TimerWrap::OnTimeout(handle=0x0000000105f02140) at timer_wrap.cc:143
    frame #59: 0x0000000101cb2c4f node-debug`uv__run_timers(loop=0x0000000103509220) at timer.c:174
    frame #60: 0x0000000101cba014 node-debug`uv_run(loop=0x0000000103509220, mode=UV_RUN_DEFAULT) at core.c:355
    frame #61: 0x00000001000e1bbe node-debug`node::Start(isolate=0x0000000106004c00, isolate_data=0x000000010601e200, args=size=2, exec_args=size=0) at node.cc:2895
    frame #62: 0x00000001000dcb59 node-debug`node::Start(event_loop=0x0000000103509220, args=size=2, exec_args=size=0) at node.cc:2994
    frame #63: 0x00000001000dbf4f node-debug`node::Start(argc=2, argv=0x0000000105f00000) at node.cc:3053
    frame #64: 0x0000000101f9015e node-debug`main(argc=2, argv=0x00007ffeefbff888) at node_main.cc:124
    frame #65: 0x0000000100001034 node-debug`start + 52
@benjamingr
Copy link
Collaborator

This looks like a Node.js core bug in v10 with async_hooks most likely. Thanks a lot for the repro!

@astormnewrelic
Copy link
Author

@benjamingr Yup, agree that it's probably async_hooks related -- we're trying to narrow it down before reporting over there. Do you happen to know if there's a good summary explanation of how BlueBird is using features from async_hooks? (short of digging into BlueBird's guts and figuring it out? :) )

@benjamingr
Copy link
Collaborator

@addaleax anything else a report should contain?

@astormnewrelic
Copy link
Author

Datapoint: Looks like this bug was not present in Node 10.1, but was/is present in Node 10.2. It's not present in Node 11.0.

_tfw you wonder if you want to spend the time to hookup git bisect with a node build. _

@benjamingr
Copy link
Collaborator

@astormnewrelic yes - I am on holiday at the moment - but if you can git-bisect this on Node with your repro and open an issue in the Node repo that would be helpful :]

@astormnewrelic
Copy link
Author

astormnewrelic commented Oct 15, 2019

Thanks @benjamingr -- I'll see what I can do w/r/t to a bisect -- although I've never done it on the node source tree before with a long C++ build in there, so we'll see how things go :). If anyone reading has known science for setting something like that up links/info would be appreciated.

@astormnewrelic
Copy link
Author

Just more talking out loud to myself -- people on holiday should not respond ;)

I'm going to give the following script a try with git bisect run between v10.2 and v10.1. I've got things screened off in a medium EC2 instance where builds are taking around an hour, and we're in roughly 7 steps territory between bisects.

We'll see what falls over in the (PST/Portland) morning :)

make clean || exit 125
./configure || exit 125
make -j4 || exit 125

cp out/Release/node ~/binaries/`git rev-parse HEAD`

for i in `seq 100`; do
  ./out/Release/node /home/ec2-user/repro-bluebird-segfault/methods.tap.js
  if [ $? -ne 0 ]
  then
    exit 1
  fi
done

@addaleax
Copy link

@benjamingr I think a full core dump might be nice for a report. Also, just curious, what makes you think this is related to async_hooks? (Also, I can’t reproduce locally on x64 Linux.)

@astormnewrelic
Copy link
Author

OK -- the bisect seems to be implicating this commit as the one that introduced this segmentation fault. nodejs/node@56530f0

i.e. a out/Release/node binary built at 41e1dc09dedc592b5c6dded64b18bc0fad9b8211 doesn't segmentation fault, but 56530f084499dba826f93a7b09b8d04ec4cc5df0 does.

@addaleax Nice to meet you! Not speaking for @benjamingr, but on our side we suspected async_hooks because this segmentation fault started in BlueBird 3.6, and 3.6 is when BlueBird started using AsyncResources (_dereq_("async_hooks").AsyncResource). Someone more familiar with BlueBird might be able to speak better to the specifics

@addaleax Also! I intend to open an issue for this over in the nodejs GitHub. I'll dockerize the reproduction (in addition to the already reported MacOS, we ran our git bisect and saw the segmentation fault on an Amazon Linux 2 AMI 64-bit (x86)), and I'll do a debug build and get a core dump for the issues as well. Would all that be enough for you'all to go on?

@AndreMaz
Copy link

AndreMaz commented Oct 16, 2019

I've also experienced segfaults with Bluebird 3.6 and 3.7 + Node 10. However, I couldn't isolate the problem. Don't know it helps but here is the link to our Travis with a giant core dump:
https://travis-ci.org/moleculerjs/moleculer/jobs/594295906
and a few screenshots of different errors that we've got while trying to figure out the source of the problem. In the end we've reverted to v3.5.5

1

2

3

4

5

@addaleax
Copy link

OK -- the bisect seems to be implicating this commit as the one that introduced this segmentation fault. nodejs/node@56530f0

@astormnewrelic Are you sure? That commit doesn’t seem suitable to cause any kind of trouble at the first look, it’s basically only renaming a symbol to a string… Like, is there any chance that the flakiness in the test made it look like that was the first commit that caused trouble?

@addaleax Also! I intend to open an issue for this over in the nodejs GitHub. I'll dockerize the reproduction (in addition to the already reported MacOS, we ran our git bisect and saw the segmentation fault on an Amazon Linux 2 AMI 64-bit (x86)), and I'll do a debug build and get a core dump for the issues as well. Would all that be enough for you'all to go on?

I can’t speak for other core developers, but I personally strongly prefer a reproduction of the kind that you already have over a containerized one – that makes debugging a lot easier. But I’m also not developing on mac.

I've also experienced segfaults with Bluebird 3.6 and 3.7 + Node 10. However, I couldn't isolate the problem.

@AndreMaz Since you seem to be running Linux – is there any chance you could run the program under valgrind (valgrind node path/to/script.js) and report back? Those reports are usually tremendously helpful for memory corruption issues, and your screenshots make it seem like that’s the case there.

@AndreMaz
Copy link

Hey @addaleax my Linux machine is at my work so I'll do it tomorrow

@astormnewrelic
Copy link
Author

Are you sure? That commit doesn’t seem suitable to cause any kind of trouble at the first look, it’s basically only renaming a symbol to a string… Like, is there any chance that the flakiness in the test made it look like that was the first commit that caused trouble?

As sure as I can be. I had the bisect make a backup of each node binary as it built things. I just ran the 41e1dc09dedc592b5c6dded64b18bc0fad9b8211 binary 1000 times and it didn't segfault, and the 56530f084499dba826f93a7b09b8d04ec4cc5df0 binary crashed almost immediately. I'll rerun the bisect using those two hashes and see what it produces.

@addaleax
Copy link

@astormnewrelic I think that leaves two main options:

  • This isn’t related to async hooks, and something else happened between bluebird 3.5 and 3.6 that affected this, or (and I think this is more likely)
  • This is something that depends on subtle differences in memory layout, and isn’t related to the substance of the commit that bisecting points to, but rather the fact that something changed.

Either way, it might be worth trying to run something like valgrind (i.e. memory checkers) on this – I’ve heard valgrind itself doesn’t run too well on macOS, but it’s worth a shot, and maybe there are similar tools that I’m not aware of?

@AndreMaz
Copy link

Hey @addaleax this is what I get when I run valgrind --log-file="log.txt" --leak-check=summary -v jest test/unit/middlewares/action-hook.spec.js

Log:

==18115== Memcheck, a memory error detector
==18115== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==18115== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==18115== Command: /usr/bin/jest test/unit/middlewares/action-hook.spec.js
==18115== Parent PID: 3762
==18115== 
--18115-- 
--18115-- Valgrind options:
--18115--    --log-file=log.txt
--18115--    --leak-check=summary
--18115--    -v
--18115-- Contents of /proc/version:
--18115--   Linux version 5.0.0-31-generic (buildd@lgw01-amd64-046) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #33~18.04.1-Ubuntu SMP Tue Oct 1 10:20:39 UTC 2019
--18115-- 
--18115-- Arch and hwcaps: AMD64, LittleEndian, amd64-cx16-lzcnt-rdtscp-sse3-avx-avx2
--18115-- Page sizes: currently 4096, max supported 4096
--18115-- Valgrind library directory: /usr/lib/valgrind
--18115-- Reading syms from /usr/bin/env
--18115--    object doesn't have a symbol table
--18115-- Reading syms from /lib/x86_64-linux-gnu/ld-2.27.so
--18115--   Considering /lib/x86_64-linux-gnu/ld-2.27.so ..
--18115--   .. CRC mismatch (computed 1b7c895e wanted 2943108a)
--18115--   Considering /usr/lib/debug/lib/x86_64-linux-gnu/ld-2.27.so ..
--18115--   .. CRC is valid
--18115-- Reading syms from /usr/lib/valgrind/memcheck-amd64-linux
--18115--   Considering /usr/lib/valgrind/memcheck-amd64-linux ..
--18115--   .. CRC mismatch (computed c25f395c wanted 0a9602a8)
--18115--    object doesn't have a symbol table
--18115--    object doesn't have a dynamic symbol table
--18115-- Scheduler: using generic scheduler lock implementation.
--18115-- Reading suppressions file: /usr/lib/valgrind/default.supp
==18115== embedded gdbserver: reading from /tmp/vgdb-pipe-from-vgdb-to-18115-by-andmaz-on-???
==18115== embedded gdbserver: writing to   /tmp/vgdb-pipe-to-vgdb-from-18115-by-andmaz-on-???
==18115== embedded gdbserver: shared mem   /tmp/vgdb-pipe-shared-mem-vgdb-18115-by-andmaz-on-???
==18115== 
==18115== TO CONTROL THIS PROCESS USING vgdb (which you probably
==18115== don't want to do, unless you know exactly what you're doing,
==18115== or are doing some strange experiment):
==18115==   /usr/lib/valgrind/../../bin/vgdb --pid=18115 ...command...
==18115== 
==18115== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==18115==   /path/to/gdb /usr/bin/jest
==18115== and then give GDB the following command
==18115==   target remote | /usr/lib/valgrind/../../bin/vgdb --pid=18115
==18115== --pid is optional if only one valgrind process is running
==18115== 
--18115-- REDIR: 0x401f2f0 (ld-linux-x86-64.so.2:strlen) redirected to 0x58060901 (???)
--18115-- REDIR: 0x401f0d0 (ld-linux-x86-64.so.2:index) redirected to 0x5806091b (???)
--18115-- Reading syms from /usr/lib/valgrind/vgpreload_core-amd64-linux.so
--18115--   Considering /usr/lib/valgrind/vgpreload_core-amd64-linux.so ..
--18115--   .. CRC mismatch (computed 4b63d83e wanted 670599e6)
--18115--    object doesn't have a symbol table
--18115-- Reading syms from /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so
--18115--   Considering /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so ..
--18115--   .. CRC mismatch (computed a4b37bee wanted 8ad4dc94)
--18115--    object doesn't have a symbol table
==18115== WARNING: new redirection conflicts with existing -- ignoring it
--18115--     old: 0x0401f2f0 (strlen              ) R-> (0000.0) 0x58060901 ???
--18115--     new: 0x0401f2f0 (strlen              ) R-> (2007.0) 0x04c32db0 strlen
--18115-- REDIR: 0x401d360 (ld-linux-x86-64.so.2:strcmp) redirected to 0x4c33ee0 (strcmp)
--18115-- REDIR: 0x401f830 (ld-linux-x86-64.so.2:mempcpy) redirected to 0x4c374f0 (mempcpy)
--18115-- Reading syms from /lib/x86_64-linux-gnu/libc-2.27.so
--18115--   Considering /lib/x86_64-linux-gnu/libc-2.27.so ..
--18115--   .. CRC mismatch (computed b1c74187 wanted 042cc048)
--18115--   Considering /usr/lib/debug/lib/x86_64-linux-gnu/libc-2.27.so ..
--18115--   .. CRC is valid
--18115-- REDIR: 0x4edac70 (libc.so.6:memmove) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9d40 (libc.so.6:strncpy) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edaf50 (libc.so.6:strcasecmp) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9790 (libc.so.6:strcat) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9d70 (libc.so.6:rindex) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edc7c0 (libc.so.6:rawmemchr) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edade0 (libc.so.6:mempcpy) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edac10 (libc.so.6:bcmp) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9d00 (libc.so.6:strncmp) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9800 (libc.so.6:strcmp) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edad40 (libc.so.6:memset) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ef80f0 (libc.so.6:wcschr) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9ca0 (libc.so.6:strnlen) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9870 (libc.so.6:strcspn) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edafa0 (libc.so.6:strncasecmp) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9840 (libc.so.6:strcpy) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edb0e0 (libc.so.6:memcpy@@GLIBC_2.14) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9da0 (libc.so.6:strpbrk) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed97c0 (libc.so.6:index) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ed9c70 (libc.so.6:strlen) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ee46c0 (libc.so.6:memrchr) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edaff0 (libc.so.6:strcasecmp_l) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edabe0 (libc.so.6:memchr) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4ef8eb0 (libc.so.6:wcslen) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4eda050 (libc.so.6:strspn) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edaf20 (libc.so.6:stpncpy) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edaef0 (libc.so.6:stpcpy) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edc7f0 (libc.so.6:strchrnul) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4edb040 (libc.so.6:strncasecmp_l) redirected to 0x4a2a6e0 (_vgnU_ifunc_wrapper)
--18115-- REDIR: 0x4fca3c0 (libc.so.6:__strrchr_avx2) redirected to 0x4c32730 (rindex)
--18115-- REDIR: 0x4fa5d60 (libc.so.6:__strcmp_ssse3) redirected to 0x4c33da0 (strcmp)
--18115-- REDIR: 0x4fca590 (libc.so.6:__strlen_avx2) redirected to 0x4c32cf0 (strlen)
--18115-- REDIR: 0x4fc1510 (libc.so.6:__strncmp_sse42) redirected to 0x4c33570 (__strncmp_sse42)
--18115-- REDIR: 0x4fc9fa0 (libc.so.6:__strchr_avx2) redirected to 0x4c32950 (index)
--18115-- REDIR: 0x4fc6420 (libc.so.6:__memchr_avx2) redirected to 0x4c33f80 (memchr)
--18115-- REDIR: 0x4fca1d0 (libc.so.6:__strchrnul_avx2) redirected to 0x4c37020 (strchrnul)
--18115-- REDIR: 0x4ed3070 (libc.so.6:malloc) redirected to 0x4c2faa0 (malloc)
--18115-- REDIR: 0x4fcaab0 (libc.so.6:__mempcpy_avx_unaligned_erms) redirected to 0x4c37130 (mempcpy)
--18115-- REDIR: 0x4fcaad0 (libc.so.6:__memcpy_avx_unaligned_erms) redirected to 0x4c366e0 (memmove)
--18115-- REDIR: 0x4ed3950 (libc.so.6:free) redirected to 0x4c30cd0 (free)
--18115-- REDIR: 0x4fbbc80 (libc.so.6:__stpcpy_ssse3) redirected to 0x4c35f60 (stpcpy)
--18115-- REDIR: 0x4fca720 (libc.so.6:__strnlen_avx2) redirected to 0x4c32c90 (strnlen)

Terminal:
image

@AndreMaz
Copy link

@addaleax do you want me to run valgrind with some specific options?

@benjamingr
Copy link
Collaborator

@addaleax you might be right regarding this being unrelated to async_hooks

The changes also include 60ef7a0#diff-d1f5ad3087cc94d3dc6651e3265219b7R59 which calls timer.refresh()

@AndreMaz can you see if removing the timer.refresh() call there fixes the issue?

@AndreMaz
Copy link

@addaleax runnning valgrind --tool=memcheck --log-file="full-log.txt" --leak-check=summary -v node /usr/bin/jest test/unit/middlewares/action-hook.spec.js I get more verbose with things like:

==18415== Conditional jump or move depends on uninitialised value(s)
==18415==    at 0x14CD96A: drbg_ctr_generate (in /usr/bin/node)
==18415==    by 0x14D1A81: RAND_DRBG_generate (in /usr/bin/node)
==18415==    by 0x14D2971: rand_drbg_get_entropy (in /usr/bin/node)
==18415==    by 0x14D0043: RAND_DRBG_instantiate (in /usr/bin/node)
==18415==    by 0x14D037D: drbg_setup (in /usr/bin/node)
==18415==    by 0x14D1209: drbg_bytes (in /usr/bin/node)
==18415==    by 0xA10F3A: node::crypto::EntropySource(unsigned char*, unsigned long) (in /usr/bin/node)
==18415==    by 0x17AA892: v8::base::RandomNumberGenerator::RandomNumberGenerator() (in /usr/bin/node)
==18415==    by 0x17A9A40: v8::base::CallOnceImpl(long*, std::function<void ()>) (in /usr/bin/node)
==18415==    by 0x17AC2DE: v8::base::OS::GetRandomMmapAddr() (in /usr/bin/node)
==18415==    by 0xAF83C6: v8::internal::GetRandomMmapAddr() (in /usr/bin/node)
==18415==    by 0xF0422E: v8::internal::Heap::SetUp() (in /usr/bin/node)

Full log here full-log.txt

@AndreMaz can you see if removing the timer.refresh() call there fixes the issue?

@benjamingr will try it now

@AndreMaz
Copy link

@benjamingr after removing the timer.refresh(); I got zero segfaults 😄 Everything seems to be working properly

@benjamingr
Copy link
Collaborator

@AndreMaz can you please try reproducing this with bluebird 3.7.1 ?

Talking to @petkaantonov refresh is removed there

@AndreMaz
Copy link

@benjamingr yeah, updating to v3.7.1 seems to solve the issue.

No segfaults on my local machine and Travis is also happy
image

@addaleax
Copy link

@benjamingr That still shouldn’t be an issue, right? Like, if calling timer.refresh() makes the process crash, that’s not “just” a bluebird issue?

Sadly, those verbose valgrind warnings correspond to a known issue with OpenSSL that has been fixed since (and that is also known to not cause crashes)…

@benjamingr
Copy link
Collaborator

@benjamingr That still shouldn’t be an issue, right? Like, if calling timer.refresh() makes the process crash, that’s not “just” a bluebird issue?

Right, this still looks like a bug in core and I am still not sure why it happens. Why would OpenSSL even be related to timer.refresh() causing a segfault?

@AndreMaz thanks for the update and for the thorough debugging! It is very appreciated.

@addaleax
Copy link

Right, this still looks like a bug in core and I am still not sure why it happens. Why would OpenSSL even be related to timer.refresh() causing a segfault?

Just to be clear, the OpenSSL issue does really only cause nothing besides valgrind warnings. It’s unrelated.

@benjamingr
Copy link
Collaborator

@addaleax ah, that makes sense. Thanks! I was really confused about why the two would be related :]

@astormnewrelic
Copy link
Author

Hey all -- got knocked down by a sinus infection so had to take a bit of a break from this. The repo at https://github.com/astormnewrelic/repro-bluebird-segfault now has a dockerfile that can reproduce the problem I was seeing in Amazon Linux -- README includes instructions for reproducting. Next I intend to work on getting that core dump from a debug build.

@astormnewrelic
Copy link
Author

astormnewrelic commented Oct 19, 2019

And I got the core dumps working -- I had to ulimit -c unlimited before making the program crash which (apparently?) tells the OS how big the core dumps are allowed to get.

Core dump attached. This is from a debug build of node 10.16.3 built on Amazon Linux, running the program from my reproduction repo, and the program crashed with a segmentation fault. I'll try to get a bug report submitted to node core sometimes this week, but figured I'd share what I knew now in case it helps.

core-dump.zip

Update: The custom build that generated this core dump. http://18.222.178.114/node.zip

@addaleax
Copy link

@astormnewrelic When working with custom builds, the binary used for causing the crash is just as much needed for debugging as the core dump itself :)

@astormnewrelic
Copy link
Author

@addaleax Thank you! I've added a link to the post above that should point at the binary I used to generate this core dump. I you wouldn't mind answering a naive question -- what does having the original binary allow you to do with the core dump that you wouldn't be able to do otherwise? I've only every worked with gdb or lldb locally, and a lot of it's firmly in the "not sure what that's really doing seems like magic" column to me :)

@addaleax
Copy link

@astormnewrelic As far as I know, the core dump doesn’t contain symbols, which are necessary to figure out what code actually corresponds to which functions in the original binary.

I might be wrong, but for me, running bt inside lldb -c core.30607 gives only the raw function pointers, whereas the backtrace for lldb build/node/out/Debug/node -c core.30607 gives the actual corresponding function names (and line numbers) 🙂

@addaleax
Copy link

From a first look, it seems like the libuv timer heap has one entry, but that refers to a null pointer; that seems wrong, but it’s not obvious to me what led to that condition being true.

Also, weirdly enough, I couldn’t reproduce the issue even with your debug build locally – does it just take a really long time for that to happen? (I’ll keep it running overnight 😄) And it’s basically a clean build of the v10.16.3 tag of the Node.js repo, right?

@astorm
Copy link

astorm commented Oct 20, 2019

@addaleax the segmentation fault usually happens within 5 runs of the program -- and I've never seen it take more than 100 or so, so overnight might be overkill. Out of "black box reproduction" curiosity -- what distro/version are you running things on where you're not seeing the segmentation fault?

Re: clean build of the v10.16.3 tag of the Node.js repo -- yes, that's correct. make clean;./configure --debug; make on the v10.16.3 tag.

@addaleax
Copy link

@astorm I’m running Ubuntu 18.04, with Linux 4.15.0-65-generic on x64… and there’s nothing really special about my setup, I think

@astorm
Copy link

astorm commented Oct 20, 2019

@addaleax One more confusing datapoint -- I tried switching up from Amazon Linux to on that docker container to Ubuntu 18.04 and the crashes still happened. So -- all very confusing.

michaelgoin added a commit to newrelic/node-newrelic that referenced this issue Nov 5, 2019
### 6.1.0 (2019-11-05):

- Native-metrics module is defaulted to disabled in serverless mode
  - New env var, NEW_RELIC_NATIVE_METRICS_ENABLED, was added to enable/disable the
  native-metrics module

* Added a test for querying poolCluster.of()

* Removed unused  bootstrap test code.

* Increased timeout for  test to reduce flickers on Node 12.

* Changed file modification to leverage  for  test. This
  triggers the watcher in a reasonable amount of time much more consistently.

* Added  module to agent for auto-include on install.

* Allow splitting of application name using semicolons in the env var

* Don't test Bluebird 3.7 on Node v10 until they fix [the segfault
  issue](petkaantonov/bluebird#1618)

* Instrument  for mysql2

* Add HTTP method to segment attributes for external requests

- Updates the  such that it uses verbose output, will exit on first
  error code, and will refuse to proceed with LibreSSL (which can't generate certs)

  - Adds a �[H�[2J sub-command to  that will allow developers to quickly
  remove generated ssl/cert files and regenerate (useful is switch between platforms
  via containers/docker and certs needs to be regenerated)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants