-
Notifications
You must be signed in to change notification settings - Fork 20
Responsibly aborting in node #18
Comments
The biggest concern I have (from that thread) is with Aborting on every uncaught exception (like I'd like to share that I have similar experience to @Raynos - both in my code base and when I did contract work for a Node project of a large company. I've seen "throw on abort" cause denial of service attacks in the real world . I think we can also agree that letting exceptions "escape" and running Note that unlike that, promise rejections play nice with the call-stack and will run cleanup code at every phase of the chain meaning that a promise rejection doesn't lead to undefined state in the server. It can be handled at the process level without leaking resources - using patterns similar to Go's However, promise rejections currently have a problematic post mortem story for people who debug with heap-dumps because the error is rejected after cleanup has been performed and the stack has been unwound. So while you get a nice stack trace and avoid indeterminate state - you don't get a nice heap dump when the error occurred. This problem can possibly be amplified when async/await lands and more people use promises. Note that it's not specific to promises, basically what we need isn't promises per-se it's async pieces of code that run a I'd like to see how we can both have cleanup and a nice heap dump. Tricks like "the first rejection causes a heap dump and the the following contain regular behavior can be considered.
|
Why is that? To avoid the disk from filling up? I'd say that's the job of abrtd or systemd or whatever you use to collect and manage core dumps. If it's because the time it takes to dump core is a service outage, I think you can solve that by calling fork() before abort(). |
Because the time to restart a node server is typically a few seconds. Let's say it's 8 for a specific instance and I have two servers with 4 cores each running two instances of node (so 8 server workers total). This means that I have to send the server 1 request per second to that particular path and I can single handedly take it down. |
Disk filling up does not matter. What matters is downtime. I do not believe you can fork before abort as you would be forking the undefined behaviour process into a new undefined behaviour process. The issue is that process startup is generally not cheap. At least 2s ranging up to 5s. If your service does 2000 qps per process and any request can cause a crash then you just lost 4k requests. If you have retries (you should) then the retry storm can bring your entire deployment to its knees. Now imagine that your not running a stateless service. You might have to load data at startup. You might have to gossip. You might have to replicate. You might have to migrate. You might have to do leadership election. I've seen bad cascading failures and they generally happen at >100 uncaughts per second. Even if you have a fleet of workers available you need to be able to pre-spawn workers. Note this is a valid solution; we should document any solution that achieves good uptime under an "uncaught storm". The kind of bug with every request. This is generally a bug created by new code and fixed by rollback. However rollback takes time and there is no reason to be down for 5 minutes. |
Why would that matter if you're going to call abort() immediately afterwards anyway? To clarify, I had the impression you wanted to truck on after an error (possibly using |
Can I disable --abort-on-uncaught-exception after forking ? That would actually have the correct semantics and should in theory take less time then a restart. In fact if we could have the exact same error go to process.on(uncaughtException) in the forked process that would be beautiful. I rather like this idea as its a truly abort once approach. |
Note to self: mobile GitHub puts the close button in a bad place. |
I don't see why not. Just call |
Sounds like a good technical solution. Do you think this is something worth recommending ? Should we recommend to never continue on uncaughts ? What is a responsible default approach to recommend for production ? Alternatively we can also recommend "here are some possible options; figure it out yourself!" |
I put those big scary warnings in the documentation for a reason. :-)
That's what I would do. If the pros and cons are clearly spelled out, people can make their own informed decision. |
@Raynos @bnoordhuis I agree with the idea of spelling out various options, pros, and cons, and maybe even providing concrete references if companies are willing to do that. I think there are strongly divergent opinions on these things and many of these come down to policy or value choices that may legitimately differ across teams. |
Currently the nodejs documentation says "resuming from uncaughtException is a bad idea" when in reality it's a necessary evil for production at scale. I think this is a documentation bug and I would like to get second opinions of post-mortem users about "do you ever continue on uncaught because availability at all costs" ? If there is a better solution for availability then continuing then that's great too. I don't know about it. I want us to update node documentation with current best practices around abort(), post-mortem debugging and high availability. I can respect if there are different opinions and no consensus, in that case we should just document two or three of them. I want to make sure I don't accidentally recommend nuanced approaches without explaining nuance. I suspect saying "just use --abort-on-uncaught-exception and mdb. It's the best" glosses over a lot of nuance and can lead to less experienced customers having a bad day. |
@RyanOS You asked what others in this group do about I agree that "just use --abort-on-uncaught-exception and mdb" is not nuanced enough. There's a lot more nuance in the way we wrote our initial guidelines around this: I think the language that's actually in the documentation now isn't so bad: I'd support softening it to say that "It's usually not possible to reason about exceptions thrown from arbitrary points in your program, so it's usually not safe to resume normal operation after |
@davepacheo Just for clarification; we don't treat uncaughts lightly at all.
We have a manual restart procedure because spin crashing causes fleet wide outages. I think everyone agrees that we need an operator to check the error and deal with it. I just want to get second opinions on how many people automate the restarting and how they do so safely. |
Using |
Setting the |
@misterdjules |
@bmeck
So you have to return something that can be implicitly converted to a boolean value from within that callback.
I'm not sure I understand what you mean. Do you mean "would returning a |
The suggestion from @Raynos to implement a mechanism for limiting the number of aborts on exception to prevent DoS'ing a service is really interested. That said, I'm not particularly in favour of bloating core by adding an First, the parameter would bake in too many assumptions:
Second, it should be pretty easy to implement a process manager which simply restarts the node process without the Thoughts? |
That seems a reasonable solution for the "necessary evil for production at scale" choice above. If the DOS concern is the time taken to write the core dump, there is the fork+abort solution above (continue doing the core dump in a separate process), or you could have the process manager set |
Closing due to inactivity. If this work still needs to be tracked, please open a new issue over in https://github.com/nodejs/diagnostics. |
Based on a conversation with @benjamingr ( groundwater/nodejs-symposiums#5 (comment) ).
We recommend aborting and core dumps as the debugging best practice. However we don't document how to do so responsibly to ensure uptime and availability.
People object to exiting and restarting as its a potential DDOS attack vector, I've brushed this off in the past by saying "it's a bug. Just get paged and fix this bug", however I no longer think this can be ignored for a few reasons
I 100% believe a core dump is the only way to debug that 1 in a 100 edgecase. We all speak positively about how amazing post mortem debugging is but there is little sharing around how to do so responsibly.
Here is a straw man suggestion and I would love feedback:
We could implement --abort-on-uncaught-exception-once=filepath flag.
This functionality could be implemented outside of node core. If it is we have to clearly in the node core documentation, instruct people how to use --abort-on-uncaught-exception safely.
Because the process will not abort for 24hr the application developer must use process.on(uncaughtException) to avoid exiting and to be able to degrade to partial availability.
I would love to hear some feedback on how to best responsibly use post mortem debugging techniques
The text was updated successfully, but these errors were encountered: