node does not abort at the right time when using --abort-on-uncaught-exception #3035

misterdjules · 2015-09-24T01:21:27Z

nodejs/master's head does not abort at the right time when using --abort-on-uncaught-exception. Here's a reproduction of the problem on SmartOS:

$ git rev-parse --short HEAD
02448c6
[root@dev ~/node-1]# make
make -C out BUILDTYPE=Release V=1
make[1]: Entering directory '/root/node-1/out'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/root/node-1/out'
ln -fs out/Release/node node
$ ln -sf `pwd`/out/Release/node /opt/local/bin/node
[root@dev ~/node-1]# node --version
v5.0.0-pre
$ node --abort-on-uncaught-exception -e 'setTimeout(function () { function boom() { throw new Error("foo") } boom(); }, 10);'
[eval]:1
setTimeout(function () { function boom() { throw new Error("foo") } boom(); }, 10);
                                           ^

Error: foo
    at boom ([eval]:1:50)
    at null._onTimeout ([eval]:1:69)
    at Timer.listOnTimeout (timers.js:89:15)
Abort (core dumped)
$ mdb /var/cores/core.node.8212 
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> ::load /root/mdb_v8/build/amd64/mdb_v8.so
mdb_v8 version: 1.0.0 (dev)
V8 version: 4.5.103.33
Autoconfigured V8 support from target
C++ symbol demangling enabled
> ::jsstack
native: libc.so.1`_lwp_kill+0xa
native: libc.so.1`raise+0x20
native: libc.so.1`abort+0x98
native: node::FatalException+0xd7
native: v8::internal::MessageHandler::ReportMessage+0x1ee
native: v8::internal::Isolate::ReportPendingMessages+0x234
native: v8::internal::Execution::Call+0x4a3
native: v8::Function::Call+0xff
native: v8::Function::Call+0x41
native: node::AsyncWrap::MakeCallback+0x23a
native: node::TimerWrap::OnTimeout+0x96
native: uv__run_timers+0x7d
native: uv_run+0x35a
native: node::Start+0x538
native: _start+0x6c
>

Note that the call stack doesn't contain any JavaScript frame, and indicates that node aborted in node::FatalException, not when the error was thrown.

This is a problem because users of post-mortem debuggers and --abort-on-uncaught-exception need core dumps that are generated when the exception is thrown, and those core dumps need to have the frame that throws the error in the call stack. Otherwise, it becomes much more difficult, if not impossible, to determine the root cause of the problem.

In the example above, there's no way to know that the error was thrown by the function named boom; we just know that one timer's callback threw.

This regression was introduced by #922, which made V8 ignore --abort-on-uncaught-exception. An attempt at fixing that regression was made with #2776, but instead of letting V8 abort in Isolate::Throw (when it actually throws the error and when all the relevant frames are active on the stack), it aborts in node::FatalException as shown above.
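To illustrate why the abort point matters, here is a minimal sketch in plain JavaScript. It is only an analogy for what a core file captures, and the `tracked` wrapper and `activeFrames` array are hypothetical bookkeeping, not anything node or V8 provides. It records which functions are still active at the moment of the throw and at the moment a top-level handler finally sees the error.

```javascript
'use strict';

// Hypothetical bookkeeping: names of the functions currently executing.
const activeFrames = [];

// Wraps fn so that its name is pushed on entry and popped on exit,
// including when fn exits by throwing.
function tracked(name, fn) {
  return function (...args) {
    activeFrames.push(name);
    try {
      return fn.apply(this, args);
    } finally {
      activeFrames.pop();
    }
  };
}

let framesAtThrow = null;
let framesAtHandler = null;

const boom = tracked('boom', function () {
  // Snapshot taken right before the throw: every relevant frame is active.
  framesAtThrow = activeFrames.slice();
  throw new Error('foo');
});

const onTimeout = tracked('onTimeout', function () {
  boom();
});

try {
  onTimeout();
} catch (err) {
  // Stand-in for a top-level handler such as node::FatalException:
  // by now the throwing frames have already been unwound.
  framesAtHandler = activeFrames.slice();
}

console.log('active at throw:', framesAtThrow);
console.log('active at handler:', framesAtHandler);
```

A core dump taken at the first snapshot would still contain boom and its caller, while one taken at the second would not, which matches the ::jsstack output above.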

I will submit two different PRs that fix this issue in two different ways so that we can discuss the pros and cons of the two different approaches I came up with.

/cc @nodejs/post-mortem

This PR fixes 0af4c9e so that node aborts at the right time when throwing an error and using --abort-on-uncaught-exception.

Basically, it wraps most node internal callbacks with: if (!domain || domain.emittingTopLevelError) runCallback(); else { try { runCallback(); } catch (err) { process._fatalException(err); } } so that V8 can abort properly in Isolate::Throw if --abort-on-uncaught-exception was passed on the command line, and domain can handle the error if one is active and not already in the top level domain's error handler.

It also reverts 921f2de partially: node::FatalException does not abort anymore because at that time, it's already too late.

It adds process._forceTickDone, which is really a hack to allow test-next-tick-error-spin.js to pass. It's here to basically avoid an infinite recursion when throwing in a domain from a nextTick callback, and queuing the same callback on the next tick from the domain's error handler.

This change is an alternative approach to nodejs#3036 for fixing nodejs#3035. Fixes nodejs#3035.
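The wrapping described above can be sketched as a standalone function. This is not the code from the PR: `invokeCallback` and its `fatalException` parameter are hypothetical names introduced here, with `fatalException` standing in for process._fatalException and a plain object standing in for the domain.

```javascript
'use strict';

// Hypothetical helper: runs a callback bare when no domain can handle the
// error, and inside a try/catch otherwise.
function invokeCallback(domain, runCallback, fatalException) {
  if (!domain || domain.emittingTopLevelError) {
    // No try/catch here, so a throw stays uncaught at the point where it
    // happens. Per the description above, this is what lets V8 abort in
    // Isolate::Throw.
    runCallback();
  } else {
    try {
      runCallback();
    } catch (err) {
      // An active domain gets a chance to handle the error.
      fatalException(err);
    }
  }
}

const handled = [];
const fatal = (err) => handled.push(err.message);

// Active domain: the error is routed to the handler.
invokeCallback({ emittingTopLevelError: false }, () => {
  throw new Error('in domain');
}, fatal);

// No domain: the error propagates to the caller untouched.
let propagated = null;
try {
  invokeCallback(null, () => {
    throw new Error('no domain');
  }, fatal);
} catch (err) {
  propagated = err.message;
}

console.log(handled, propagated);
```

The outer try/catch in the second call exists only so the sketch can observe the propagated error; in the scenario the PR targets there would be no enclosing handler at all.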

misterdjules · 2015-09-24T02:56:23Z

I implemented two different approaches with #3036 and #3038 to solve this problem. They're both drafts of PRs, and I didn't pay too much attention to details. I'm mainly looking for feedback on what would be the preferred approach.

I personally have a preference for #3036, because I find it less intrusive and it's been tested with node v0.10. However, I would welcome any other opinion/feedback, and there may be other valid approaches.

yunong · 2015-09-24T06:35:25Z

Just to weigh in here -- this is critical for us at Netflix and our production stack -- and will potentially prevent us from moving to 4.x until this is fixed. Thanks for all the hard work @misterdjules and we'd love to see this integrated soon!

evanlucas · 2015-09-24T13:12:46Z

Thanks @misterdjules! I feel like #3036 is a much cleaner way of doing it and it doesn't require messing with timers and repl.

Raynos · 2015-09-25T21:01:09Z

👍

It's important that we have the right stack. --abort-on-uncaught-exception is critical for debugging "Maximum call stack size exceeded".

We need the stack in the core file to have the entire call stack that we blew up.
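A small sketch of why the error object alone is often not enough in the stack overflow case: V8 limits how many frames it records in `err.stack` (controlled by `Error.stackTraceLimit`), so the recorded trace is far shorter than the recursion that actually blew up. The function names below are made up for illustration.

```javascript
'use strict';

let depth = 0;

// Unbounded recursion, used only to trigger a stack overflow.
function recurse() {
  depth++;
  recurse();
}

let caught = null;
try {
  recurse();
} catch (err) {
  caught = err;
}

// Count the "at ..." lines that made it into the recorded stack trace.
const recordedFrames = caught.stack
  .split('\n')
  .filter((line) => line.trim().startsWith('at '));

console.log('error type:', caught.name);
console.log('calls made before overflow:', depth);
console.log('frames recorded in err.stack:', recordedFrames.length);
```

A core file generated at the throw site does not have this limit, which is why the abort needs to happen while the full stack is still in place.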

misterdjules · 2015-09-29T18:52:00Z

If the two approaches described in #3036 and #3038 are not viable, a third option is to revert 0af4c9e and 921f2de, and to display a warning when domains and --abort-on-uncaught-exception are used at the same time.

This would bring us back to the point where domains and --abort-on-uncaught-exception work independently, but not when used together in some use cases.
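As a rough sketch of what such a warning check could look like from JavaScript: `process.execArgv` is the real array of node options passed on the command line, but `shouldWarn`, `maybeWarn`, and the `domainInUse` parameter are hypothetical, and how node would actually detect domain usage is left open here.

```javascript
'use strict';

const ABORT_FLAG = '--abort-on-uncaught-exception';

// Hypothetical predicate: true when the flag is present and domains are used.
function shouldWarn(execArgv, domainInUse) {
  return Boolean(domainInUse) && execArgv.includes(ABORT_FLAG);
}

// Hypothetical helper: emits the warning through the supplied logger and
// reports whether it did.
function maybeWarn(execArgv, domainInUse, log) {
  if (!shouldWarn(execArgv, domainInUse)) {
    return false;
  }
  log('Warning: domains and ' + ABORT_FLAG +
      ' are both in use; core files may not contain the throwing frame.');
  return true;
}

// Example call with the current process's options. The domainInUse value
// is hard-coded because this sketch does not load the domain module.
maybeWarn(process.execArgv, false, console.warn);
```

The wording of the message is a placeholder; the thread does not specify one.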

Raynos · 2015-09-29T22:48:21Z

If a domain catches an exception and then determines there is no active domain and rethrows it into the global uncaught you now have a new stack trace that is not the exception you want to debug with a corefile.

Is it possible to use domains and --abort-on-uncaught-exception together in a useful fashion for stack debugging?

From what I understand whenever you have domains enabled the stacktrace will always be a rethrow instead of the stack trace you actually want.

That being said; if all you care about is heap analysis instead of stack analysis then domains + --abort-on-uncaught-exception can work together.

misterdjules · 2015-09-30T06:36:20Z

If a domain catches an exception and then determines there is no active domain and rethrows it into the global uncaught you now have a new stack trace that is not the exception you want to debug with a corefile.

Just to make sure we're on the same page, and because wording can be tricky when discussing these topics, if you mean that given the following code:

var domain = require('domain');
var d = domain.create();

d.on('error', function onError(err) {
  throw new Error('boom');
});

d.run(function someFunction() {
  throw new Error('original error');
});

running it with --abort-on-uncaught-exception and examining the stack from the resulting core file will not give you someFunction as an active stack frame, that is correct.

Is it possible to use domains and --abort-on-uncaught-exception together in a useful fashion for stack debugging?

Not with the above-mentioned use case as far as I know.

From what I understand whenever you have domains enabled the stacktrace will always be a rethrow instead of the stack trace you actually want.

Yes, if an error is thrown from the top-level domain's error handler, the stack trace won't contain the frame where the original error was thrown.
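The same effect can be seen in plain JavaScript without the domain module. This is only an analogy using `err.stack` rather than a core file, with a try/catch standing in for the domain and the function names borrowed from the example above.

```javascript
'use strict';

function someFunction() {
  throw new Error('original error');
}

// Stand-in for the domain's error handler, which throws a new error.
function onError(err) {
  throw new Error('boom');
}

let finalError = null;
try {
  try {
    someFunction();
  } catch (err) {
    // someFunction's frame is already gone when the handler runs.
    onError(err);
  }
} catch (err) {
  finalError = err;
}

console.log(finalError.stack);
```

The printed stack mentions onError but not someFunction: the error that ends up uncaught was created after the original frames were unwound.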

That being said; if all you care about is heap analysis instead of stack analysis then domains + --abort-on-uncaught-exception can work together.

Exactly. In general, using --abort-on-uncaught-exception should not break domains. However, it's acceptable that using domains will change the execution flow in such a way that makes post-mortem debugging less practical when investigating core files generated from processes that crashed due to uncaught exceptions. That's what domains do and it's documented. Outputting a warning to the console might be a good idea when they're used together regardless of the approach taken to fix this issue.

This issue is about two things:

1) Fixing --abort-on-uncaught-exception, which is completely broken right now, whether or not domains are used.
2) In doing so, fixing the unacceptable issues of using domains with --abort-on-uncaught-exception that the previous changes that broke 1) tried to fix. These issues were originally filed as node-v0.x-archive#8631 ("Domain error handler not preventing exception from bubbling up when --abort-on-uncaught-exception used") and node-v0.x-archive#8630 ("Process exits with incorrect message when throwing error in top-level domain's error handler").

Raynos · 2015-09-30T07:54:36Z

Sounds great. We are on the same page.

misterdjules · 2015-10-05T20:32:38Z

#3036 was updated now that its V8-related changes landed upstream (see https://codereview.chromium.org/1375933003/).

Revert 0af4c9e, parts of 921f2de and port nodejs/node-v0.x-archive#25835 from v0.12 to master so that node aborts at the right time when an error is thrown and --abort-on-uncaught-exception is used. Fixes nodejs#3035.

misterdjules · 2015-10-06T01:24:05Z

Fixed by 49dec1a and 77a10ed.

Revert 0af4c9e, parts of 921f2de and port nodejs/node-v0.x-archive#25835 from v0.12 to master so that node aborts at the right time when an error is thrown and --abort-on-uncaught-exception is used.

Fixes #3035.
PR: #3036
PR-URL: #3036
Reviewed-By: Ben Noordhuis

misterdjules added the post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. label Sep 24, 2015

misterdjules mentioned this issue Sep 24, 2015

src: fix --abort-on-uncaught-exception #3036

Closed

misterdjules mentioned this issue Sep 24, 2015

src: fix abort-on-uncaught-exception #3038

Closed

misterdjules mentioned this issue Sep 24, 2015

domains: port fix abort on uncaught to v0.12 nodejs/node-v0.x-archive#25835

Closed

misterdjules mentioned this issue Sep 29, 2015

Discussion: LTS & v5 release planning #3000

Closed

8 tasks

misterdjules mentioned this issue Sep 29, 2015

LTS Meeting 2015-10-05 nodejs/Release#43

Closed

misterdjules closed this as completed in 77a10ed Oct 6, 2015

mhdawson mentioned this issue Oct 7, 2015

test-domain-with-abort-on-uncaught-exception.js fails on PPC platforms #3239

Closed
