
Responsibly aborting in node #18

Closed
Raynos opened this issue Feb 10, 2016 · 21 comments

Comments

@Raynos

Raynos commented Feb 10, 2016

Based on a conversation with @benjamingr ( groundwater/nodejs-symposiums#5 (comment) ).

We recommend aborting and core dumps as the debugging best practice. However, we don't document how to do so responsibly in a way that ensures uptime and availability.

People object to exiting and restarting as it's a potential DDoS attack vector. I've brushed this off in the past by saying "it's a bug, just get paged and fix the bug", but I no longer think this can be ignored, for a few reasons:

  • I've seen accidental internal DDoS due to cascading failures.
  • I've seen how even 10 minutes of outage can lose millions in revenue.
  • I've even seen accidental DDoS caused simply by an embedded gossip library with a corrupting bug.
  • I've seen a fleet of servers destroyed by a single user uuid combined with round-robin load balancing.

I 100% believe a core dump is the only way to debug that 1-in-100 edge case. We all speak positively about how amazing post-mortem debugging is, but there is little sharing around how to do it responsibly.

Here is a straw man suggestion and I would love feedback:

  • a programmer error is a bug and needs human attention to verify any undefined state. We recommend setting up alerts or email notifications and verifying the impact of the programmer error.
  • running an available service is important; you cannot just abort without consequence. You need a technique to avoid repeated aborting.

We could implement an --abort-on-uncaught-exception-once=filepath flag (a sketch of the check logic follows the list below):

  • it would check the file at filepath; if it exists, read the value of the file as a Unix timestamp (second granularity).
  • if the timestamp is less than 24 hours in the past, then instead of aborting, create an error and pass it to process.on(uncaughtException).
  • if the file does not exist, or the timestamp is more than 24 hours in the past, write the current Unix timestamp (second granularity) to the file and then continue to abort.
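Roughly, the check could look like this in userland (the stamp-file path, the shouldAbortOnce name, and the 24-hour window are illustrative placeholders, not an existing node flag):

```js
// Sketch of the proposed check, implemented outside of core.
const fs = require('fs');

const STAMP_FILE = '/var/tmp/last-abort.stamp'; // illustrative path
const WINDOW_SECONDS = 24 * 60 * 60;

// Returns true if we should abort now (and record the time), or false if we
// aborted recently and should fall back to uncaughtException handling.
function shouldAbortOnce() {
  const now = Math.floor(Date.now() / 1000); // Unix timestamp, second granularity
  let last = 0;
  try {
    last = parseInt(fs.readFileSync(STAMP_FILE, 'utf8'), 10) || 0;
  } catch (e) {
    // File missing or unreadable: treat as "never aborted before".
  }
  if (now - last < WINDOW_SECONDS) {
    return false;
  }
  fs.writeFileSync(STAMP_FILE, String(now));
  return true;
}

module.exports = { shouldAbortOnce };
```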

This functionality could be implemented outside of node core. If it is, we have to clearly instruct people in the node core documentation how to use --abort-on-uncaught-exception safely.

Because the process will not abort again for 24 hours, the application developer must use process.on(uncaughtException) to avoid exiting and to be able to degrade to partial availability.
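For example, the handler could look something like this (assuming the shouldAbortOnce() helper from the sketch above; how you actually degrade is application-specific):

```js
// Sketch: dump core on the first uncaught exception in the window,
// otherwise stay up, log, alert, and degrade rather than exit.
const { shouldAbortOnce } = require('./abort-once'); // hypothetical module from the sketch above

process.on('uncaughtException', (err) => {
  if (shouldAbortOnce()) {
    process.abort(); // terminates the process and produces a core dump
  }
  // We aborted recently: keep serving, but in a degraded mode.
  console.error('uncaught exception (not aborting):', err);
  // e.g. stop accepting new work on the affected path and page an operator.
});
```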

I would love to hear some feedback on how best to use post-mortem debugging techniques responsibly.

@benjamingr
Member

The biggest concern I have (from that thread) is with abort on throw.

Aborting on every uncaught exception (like a JSON.parse failure or accessing a property of undefined) effectively means anyone who finds the smallest "TypeError: cannot read property foo of undefined" in the code base can single-handedly cause a denial-of-service attack on the servers by sending a low volume of requests.

I'd like to share that I have similar experience to @Raynos - both in my own code base and when I did contract work on a Node project for a large company. I've seen "abort on throw" cause denial-of-service attacks in the real world.


I think we can also agree that letting exceptions "escape" and running process.on("uncaughtException", () => ...) leaves the server in an indeterminate state, which can be very bad and cause it - for example - to hold connections to the database and cascade the failure on.

Note that unlike that, promise rejections play nice with the call stack and will run cleanup code at every phase of the chain, meaning that a promise rejection doesn't leave the server in an undefined state. It can be handled at the process level without leaking resources - using patterns similar to Go's defer keyword and C++'s RAII.
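For example (a sketch only; the db pool and its acquire()/release() API are hypothetical), the cleanup runs whether the work resolves or rejects, so the connection is not leaked:

```js
// Sketch: release the connection on both success and failure, then
// re-throw so the rejection can still be handled (or dumped) upstream.
function withConnection(db, work) {
  return db.acquire().then((conn) => {
    return work(conn).then(
      (result) => { conn.release(); return result; },
      (err) => { conn.release(); throw err; }
    );
  });
}
```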

However, promise rejections currently have a problematic post-mortem story for people who debug with heap dumps, because the rejection is observed after cleanup has been performed and the stack has been unwound. So while you get a nice stack trace and avoid indeterminate state, you don't get a nice heap dump from the moment the error occurred. This problem may be amplified when async/await lands and more people use promises.

Note that this isn't specific to promises; what we need isn't promises per se, it's async pieces of code that run finally logic for resource cleanup so the code is always in a determined, clean state. Promises just happen to be able to do that, but it's not a capability tied to promises; core could just as easily introduce a mechanism without them.

I'd like to see how we can have both cleanup and a nice heap dump. Tricks like "the first rejection causes a heap dump and the following ones continue with regular behavior" can be considered.

--abort-on-uncaught-exception-once is a good idea, but it sounds problematic because it leaves the code in an undetermined state; it would help if we had a cleanup mechanism for when a callback finishes, or something similar. --abort-on-first-rejection-once might be interesting if the heap dump issue can be fixed, which would mean we get the best of both worlds.

@bnoordhuis
Member

running an available service is important; you cannot just abort without consequence. You need a technique to avoid repeated aborting.

Why is that? To avoid the disk from filling up? I'd say that's the job of abrtd or systemd or whatever you use to collect and manage core dumps.

If it's because the time it takes to dump core is a service outage, I think you can solve that by calling fork() before abort().

@benjamingr
Member

Why is that? To avoid the disk from filling up? I'd say that's the job of abrtd or systemd or whatever you use to collect and manage core dumps.

Because the time to restart a node server is typically a few seconds. Let's say it's 8 seconds for a specific instance, and I have two servers with 4 cores each running two instances of node (so 8 server workers total). This means I only have to send the server 1 request per second to that particular path and I can single-handedly take it down.

@Raynos
Author

Raynos commented Feb 10, 2016

@bnoordhuis

Disk filling up does not matter. What matters is downtime.

I do not believe you can fork before abort, as you would be forking the undefined-behaviour process into a new undefined-behaviour process.

The issue is that process startup is generally not cheap: at least 2s, ranging up to 5s.

If your service does 2000 qps per process and any request can cause a crash, then you just lost 4k requests. If you have retries (you should), then the retry storm can bring your entire deployment to its knees.

Now imagine that you're not running a stateless service. You might have to load data at startup. You might have to gossip. You might have to replicate. You might have to migrate. You might have to do leadership election.

I've seen bad cascading failures and they generally happen at >100 uncaughts per second.

Even if you have a fleet of workers available, you need to be able to pre-spawn workers. Note this is a valid solution; we should document any solution that achieves good uptime under an "uncaught storm" - the kind of bug that is triggered by every request. This is generally a bug introduced by new code and fixed by rollback. However, rollback takes time and there is no reason to be down for 5 minutes.
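For example, one way to keep spare capacity warm is the cluster module; a minimal sketch (the worker count and the ./server entry point are assumptions):

```js
// Sketch: over-provision workers so losing one to an abort doesn't drop
// capacity while its replacement pays the multi-second startup cost.
const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  const desired = os.cpus().length + 1; // one extra worker as headroom
  for (let i = 0; i < desired; i++) cluster.fork();

  cluster.on('exit', (worker, code, signal) => {
    console.error('worker %d exited (%s); spawning a replacement',
                  worker.process.pid, signal || code);
    cluster.fork();
  });
} else {
  require('./server'); // hypothetical module that starts the HTTP server
}
```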

@Raynos Raynos closed this as completed Feb 10, 2016
@Raynos Raynos reopened this Feb 10, 2016
@bnoordhuis
Member

I do not believe you can fork before abort as you would be forking the undefined behaviour process into a new undefined behaviour process.

Why would that matter if you're going to call abort() immediately afterwards anyway? To clarify, I had the impression you wanted to truck on after an error (possibly using process.on('uncaughtException') and may $DEITY have mercy on your soul in that case) but still get a core dump; fork + abort (in C++) will accomplish that.

@Raynos
Author

Raynos commented Feb 10, 2016

@bnoordhuis

Can I disable --abort-on-uncaught-exception after forking?

That would actually have the correct semantics and should in theory take less time than a restart. In fact, if we could have the exact same error go to process.on(uncaughtException) in the forked process, that would be beautiful.

I rather like this idea as it's a truly abort-once approach.

@Raynos Raynos closed this as completed Feb 10, 2016
@Raynos Raynos reopened this Feb 10, 2016
@Raynos
Author

Raynos commented Feb 10, 2016

Note to self: mobile GitHub puts the close button in a bad place.

@bnoordhuis
Member

Can I disable --abort-on-uncaught-exception after forking?

I don't see why not. Just call isolate->SetAbortOnUncaughtExceptionCallback(nullptr) afterwards in the original process.

@Raynos
Author

Raynos commented Feb 10, 2016

@bnoordhuis

Sounds like a good technical solution.

Do you think this is something worth recommending? Should we recommend to never continue on uncaughts? What is a responsible default approach to recommend for production?

Alternatively we can also recommend "here are some possible options; figure it out yourself!"

@bnoordhuis
Member

Should we recommend to never continue on uncaughts?

I put those big scary warnings in the documentation for a reason. :-)

Alternatively we can also recommend "here are some possible options; figure it out yourself!"

That's what I would do. If the pros and cons are clearly spelled out, people can make their own informed decision.

@davepacheco

@Raynos @bnoordhuis I agree with the idea of spelling out various options, pros, and cons, and maybe even providing concrete references if companies are willing to do that. I think there are strongly divergent opinions on these things and many of these come down to policy or value choices that may legitimately differ across teams.

@Raynos
Author

Raynos commented Feb 11, 2016

Currently the nodejs documentation says "resuming from uncaughtException is a bad idea" when in reality it's a necessary evil for production at scale.

I think this is a documentation bug, and I would like to get second opinions from post-mortem users: "do you ever continue on uncaught because of availability at all costs?"

If there is a better solution for availability than continuing, then that's great too. I don't know about it.

I want us to update node documentation with current best practices around abort(), post-mortem debugging and high availability.

I can respect if there are different opinions and no consensus, in that case we should just document two or three of them.

I want to make sure I don't accidentally recommend nuanced approaches without explaining nuance. I suspect saying "just use --abort-on-uncaught-exception and mdb. It's the best" glosses over a lot of nuance and can lead to less experienced customers having a bad day.

@davepacheco

@Raynos You asked what others in this group do about uncaughtException: I hate making generalizations, but I don't think there's a case in the services we're building at Joyent where we'd ever make use of an 'uncaughtException' handler. I take at face value your experience that it's improved availability, but I also believe that many Node users fall into the trap of believing they can increase availability using such a handler when they're actually making it worse.

For historical reasons, we do have some services that use it, and in those cases the uncaughtException handlers have cost us significantly in availability. The most famous case was a service that threw an exception while processing a database response. Because of the exception, the database connection remained open with database locks held, causing other queries to block. If the process had aborted, it would have taken out a few concurrent queries and the clients could have retried anyway. Instead, we were down for several minutes until an operator was paged to diagnose the problem and restart the service. This is not the only case where an uncaughtException handler resulted in more downtime than if the process had crashed.

I agree that "just use --abort-on-uncaught-exception and mdb" is not nuanced enough. There's a lot more nuance in the way we wrote our initial guidelines around this:
https://www.joyent.com/developers/node/design/errors

I think the language that's actually in the documentation now isn't so bad:
https://nodejs.org/dist/latest-v4.x/docs/api/process.html#process_event_uncaughtexception

I'd support softening it to say that "It's usually not possible to reason about exceptions thrown from arbitrary points in your program, so it's usually not safe to resume normal operation after uncaughtException" (emphasis on usually). I wouldn't soften it much more than that, though. Having talked with a lot of people who used to think they could handle arbitrary errors and now believe that's not possible, I really think your case is the exception (no pun intended). Advanced users can always decide it will work for them, but they will at least have been warned to think carefully about it.

@Raynos
Author

Raynos commented Feb 11, 2016

@davepacheco

Just for clarification: we don't treat uncaughts lightly at all.

  • page the on-call engineer.
  • the on-call engineer monitors the application and logs.
  • the on-call engineer restarts the process.

We have a manual restart procedure because spin-crashing causes fleet-wide outages.

I think everyone agrees that we need an operator to check the error and deal with it.

I just want to get second opinions on how many people automate the restarting and how they do so safely.

@bnoordhuis
Member

Currently the nodejs documentation says "resuming from uncaughtException is a bad idea" when in reality it's a necessary evil for production at scale.

Using 'uncaughtException' voids the warranty so yeah, it may be the lesser of two evils but it's still a bad idea. There have been numerous bug reports where (for example) DNS queries stopped working after continuing from an uncaught exception.

@misterdjules

@Raynos @bnoordhuis

Can I disable --abort-on-uncaught-exception after forking?

I don't see why not. Just call isolate->SetAbortOnUncaughtExceptionCallback(nullptr) afterwards in the original process.

Setting the AbortOnUncaughtExceptionCallback to nullptr would still make the process abort. One would need to set up an AbortOnUncaughtExceptionCallback that returns false instead.

@bmeck
Member

bmeck commented Feb 11, 2016

@misterdjules false or falsey? Would Function.prototype work?

@misterdjules

@bmeck falsey, but the type of AbortOnUncaughtExceptionCallback is:

typedef bool (*AbortOnUncaughtExceptionCallback)(Isolate*);

So you have to return something that can be implicitly converted to a boolean value from within that callback.

Would Function.prototype work?

I'm not sure I understand what you mean. Do you mean "would returning Function.prototype from an AbortOnUncaughtExceptionCallback prevent the process from aborting?"

@mpareja

mpareja commented Oct 14, 2016

The suggestion from @Raynos to implement a mechanism for limiting the number of aborts on exception, to prevent DoS'ing a service, is really interesting. That said, I'm not particularly in favour of bloating core by adding an --abort-on-uncaught-exception-once parameter.

First, the parameter would bake in too many assumptions:

  • the number of core dumps to take (1)
  • how long to wait between core dumps
  • all the core dumps are for the same exception

Second, it should be pretty easy to implement a process manager which simply restarts the node process without the --abort-on-uncaught-exception parameter on subsequent errors. I'm not sure why we'd add to node core something that could be implemented outside of core. One could make this a simple CLI that wraps the launch of the node process. Then you'd have a lot more flexibility around when to enable core dumps, how long to wait in between, and so on. For instance, consider only disabling aborts after detecting 2 exceptions with similar fingerprints (message, stack, etc.).
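A rough sketch of what such a wrapper could look like (the entry point, thresholds, and the "crashed quickly" heuristic are all illustrative assumptions):

```js
// Sketch: run the app with --abort-on-uncaught-exception, and relaunch
// without the flag once we've already collected enough core dumps for
// what looks like the same repeating crash.
const { spawn } = require('child_process');

const APP = './server.js';         // hypothetical entry point
const QUICK_CRASH_MS = 10 * 1000;  // exits faster than this count as repeats
const MAX_DUMPS = 2;               // stop aborting after this many quick crashes

let quickCrashes = 0;

function launch() {
  const args = quickCrashes < MAX_DUMPS
    ? ['--abort-on-uncaught-exception', APP]
    : [APP];
  const started = Date.now();
  const child = spawn(process.execPath, args, { stdio: 'inherit' });

  child.on('exit', (code, signal) => {
    if (Date.now() - started < QUICK_CRASH_MS) {
      quickCrashes++;
    } else {
      quickCrashes = 0; // it ran for a while; treat the next crash as fresh
    }
    console.error('child exited (%s); quick crashes so far: %d',
                  signal || code, quickCrashes);
    launch(); // always restart; only the abort flag changes
  });
}

launch();
```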

Thoughts?

@rnchamberlain
Contributor

That seems like a reasonable solution for the "necessary evil for production at scale" choice above. If the DoS concern is the time taken to write the core dump, there is the fork+abort solution above (continue writing the core dump in a separate process), or you could have the process manager set ulimit -c 0 as part of the recovery action, then restart with the recommended --abort-on-uncaught-exception parameter still set. Subsequent exceptions will still abort, but without further core dumps.

@richardlau
Member

Closing due to inactivity. If this work still needs to be tracked, please open a new issue over in https://github.com/nodejs/diagnostics.
