Skip to content
This repository has been archived by the owner on Mar 25, 2018. It is now read-only.

error docs / guide #82

Closed
stevemao opened this issue Feb 9, 2016 · 20 comments
Closed

error docs / guide #82

stevemao opened this issue Feb 9, 2016 · 20 comments

Comments

@stevemao
Copy link
Contributor

stevemao commented Feb 9, 2016

I have been working with streams and I had a lot of trouble handling errors (as you can see I've raised a lot of issues). A lot of error handling guides are scattered across GitHub issues, tweets, blog posts and even joyent website. I want to improve docs/api/errors page. This is probably not on the roadmap, but it's definitely related some of them. It might help to make the direction of stream more clear. It will cover nodejs/node#4491

This is for error handling in general, but a bit more focused on error handling in stream.

@chrisdickinson
Copy link
Contributor

/cc'ing @davepacheco, who I understand is the author of the Joyent error document from which much of the content of the Node errors document is derived, to advise.

@stevemao
Copy link
Contributor Author

stevemao commented Feb 9, 2016

I know people would start arguing (from my research you have fundamentally different views). I have panicked in other threads so I try not to complain :)
Unfortunately this is pretty much the only "official" docs we have :(
I'll write something anyways just to kick it off.

@chrisdickinson
Copy link
Contributor

@stevemao Ah — our views aren't fundamentally different. I think @davepacheco would be an excellent person to consult in order to improve the doc. My first draft of the existing error doc was derived from his original Joyent doc in an attempt to catalogue the common error codes, and his input would be super valuable for the future direction of the document.

@stevemao
Copy link
Contributor Author

stevemao commented Feb 9, 2016

Here's another related issue: groundwater/nodejs-symposiums#5

@davepacheco
Copy link

I'd heard there was some interest in integrating some of the content from the Joyent error handling guide into the Node core docs. I think it would be great to see that happen. That document came out of a design discussion with several community members, and we spent a lot of time trying to make sure that document provided appropriate context as well as a holistic discussion, and also that it matched guidelines used in other communities.

Since you mentioned streams: I'd also be really interested to see some of the details around errors be better incorporated into other core API docs. There are some conventions people use, like: when 'error' is emitted from any EventEmitter, the object will usually emit no more events [except maybe 'close'] and cannot be used for anything else. This could be documented under EventEmitter or Streams. It would also be great if more core APIs documented exactly what operational errors they can experience. A lot of times this will include system-specific errors, in which case even a clause like "...and any system errors that can be returned by fork(2) on current system" would help.

@stevemao
Copy link
Contributor Author

stevemao commented Apr 5, 2016

I'd heard there was some interest in integrating some of the content from the Joyent error handling guide into the Node core docs.

It'll be a very good start :) instead of a guide, it should be a rule because if streams handle error differently, they can't work together well. People should be enforced and educated to follow the same convention.

@benjamingr
Copy link
Member

There is a fundamental disagreement inside Core about how errors should be handled. I for example disagree with the "recommended" error handling in that blog post and find it extremely irresponsible and risky to run servers doing it. Others feel the same way about the flow I use.

I'm fine with a solution that lets both sides express their views.

@stevemao
Copy link
Contributor Author

stevemao commented Apr 7, 2016

There is a fundamental disagreement inside Core about how errors should be handled.

That's what I'm saying. I met @rvagg in person and expressed the feeling. He's pretty confident that there will be a solution soon :) Current situation is a bit unfortunate.

@eljefedelrodeodeljefe
Copy link
Contributor

@benjamingr what exactly is your take on it? Since I find the article very educated, presenting options though a little biased (I like restify, sorry).

@bcantrill what was your opinion about this? Likely the article?

Concerning streams: Can't something as simple as the following just go into the docs into the stream.md? Basically saying: handle stream like so, and for broader discussion about errors go to the errors guide?

var a = createStream();
a.on('error', function(e){handleError(e)})
.pipe(b)
.on('error', function(e){handleError(e)})
.pipe(c)
.on('error', function(e){handleError(e)});

// or 

var a = createReadableStream()
var b = anotherTypeOfStream()
var c = createWriteStream()

a.on('error', handler)
b.on('error', handler)
c.on('error', handler)

a.pipe(b).pipe(c)

function handler (err) { console.log(err) }

taken from http://stackoverflow.com/questions/21771220/error-handling-with-node-js-streams

@bcantrill
Copy link

Yes, my opinion on this was to show agreement with the article, as it very much reflects our experiences running critical node.js-based services in production. I'll apologize for a bit of a drive-by in terms of thumbs-downing @benjamingr's comment, but I take (strong) issue with the characterization of the article as "extremely irresponsible" and "risky" -- especially without any more specific detail. As a practical matter, I assume @benjamingr is denigrating the recommendation that fatal programmer errors be fatal to the program -- from which I can only infer a naïveté with respect to how highly reliable systems actually get that way.

@eljefedelrodeodeljefe
Copy link
Contributor

Calling for this to be on the agenda of the next docs-wg meeting with a priority.

@stevemao
Copy link
Contributor Author

I don't disagree what @benjamingr said. In fact, from the tweets & github issues I have linked across related issues, a lot of people are doing that. I guess there are a few things need to be done before we could reach an agreement here. Promise and whatwg streams API for example, don't comply to Joyent error document.

@benjamingr
Copy link
Member

You can thumbs down me all you want @bcantrill I don't mind. What I do mind, is terms like "naïveté" when describing someone else's experiences - that is unacceptable in a discussion here or anywhere else in the Node org. I don't take personal offense in case that concerns you and of course you're welcome to disagree with me respectfully as much as you'd like. Let's please keep the discussion respectful.

About error handling: If you want to do your error handling in a non fault tolerant manner where a single place a consumer finds a place to pass invalid JSON crashes your server causes a denial of service on your whole server farm - and where you have to run your server cluster after uncaughtException is emitted to maintain semi-reasonable availability - be my guest. I'm not making this up - check out nodejs/post-mortem#18 yourself. This is in fact what the Joyent document practically suggests we do.

I realize that we probably won't reach an agreement about this - and that I'll have a hard time convincing people that programmer errors and operational errors do not model to synchronous and asynchronous errors at all (see nodejs/promises#10 for some examples).

That's fine. I'm not the one arguing for node to take an opinionated strategy here and I'm the one arguing that we maintain the current status quo where an error handling strategy is not imposed on users.

Of course, there are many parts in the Joyent article that I strongly agree with like always using Error objects for errors, using .name property, the concept of programmer and operational errors (but not how to deal with everything) and wrapping errors so they're clearer, and I'm fine with these in the docs.

I'll do my best to participate in the WG meeting and argue my opinion.

@benjamingr
Copy link
Member

@eljefedelrodeodeljefe it's not as much the fact I disagree with it (see my comment above) it's more the fact I'm hardly alone, and I can name at least another collaborator or two who disagree with several things from that article. Even if I suddenly change my opinion and everyone agrees with it in nodejs/cods it'll have to reach the CTC to be actually merged in Joyent/node because it won't reach consensus among collaborators - and I doubt the CTC will advocate for a strong change from the status quo where there is disagreement.

That is not to say I am against error handling docs in general - I think there are lots of uncontested things everyone agrees about that the docs can and should cover like:

  • Only using Error objects for errors because those have stack traces.
  • Naming errors and using meaningful errors.
  • The Node .code error style convention.
  • Decorating errors caused by implementation detail.
  • The difference itself between programmer and operational errors.
  • Functions should propagate errors asynchronously or synchronously but never both ways - Zalgo.

And so on - there are many things everyone agrees on and we can focus on those.

I'll do my best to participate in the WG meeting and argue my opinion.

@davepacheco
Copy link

If you want to do your error handling in a non fault tolerant manner where a single place a consumer finds a place to pass invalid JSON crashes your server causes a denial of service on your whole server farm - and where you have to run your server cluster after uncaughtException is emitted to maintain semi-reasonable availability - be my guest. I'm not making this up - check out nodejs/post-mortem#18 yourself. This is in fact what the Joyent document practically suggests we do.

For what it's worth, I think this is a misrepresentation (or misunderstanding) of the Joyent Error Handling document. Yes, it advocates not handling programmer errors, and that means that if a server fails to validate input, it may crash. But it also explains that clients must handle the operational error represented by that server crash, which mitigates the impact. The document goes further, arguing that programmer errors cannot be handled in general. It provides six cases where a developer might think they can handle a programmer error, but where the end result of avoiding the crash is significantly worse than the crash would have been. I've personally dealt with several service outages that were literally caused by the mere presence of uncaughtException handlers, where a crash would have been much less impacting on end users.

This is not novel advice. There's a reason that most C programs do not install SIGSEGV handlers and why it's usually considered poor practice to catch Throwable in Java. I recently learned of an earlier reference to programmer errors and the idea that they should not be handled in the Haskell wiki. JavaScript isn't those languages, of course, but I believe these principles are fundamental to programming, at least as we know it today.

I realize that we probably won't reach an agreement about this - and that I'll have a hard time convincing people that programmer errors and operational errors do not model to synchronous and asynchronous errors at all (see nodejs/promises#10 for some examples).

Again, it sounds like you've misunderstood the position. There certainly are synchronous operational errors. The document talks about them explicitly when it says "the next most common case is an operational error in a synchronous function like JSON.parse." It's concerning to me that your firm arguments against the original document seem to misunderstand basic points about it.

@Knighton910
Copy link

+1

Calling for this to be on the agenda of the next docs-wg meeting with a priority.

@benjamingr
Copy link
Member

@davepacheco

For what it's worth, I think this is a misrepresentation (or misunderstanding) of the Joyent Error Handling document. Yes, it advocates not handling programmer errors, and that means that if a server fails to validate input, it may crash. But it also explains that clients must handle the operational error represented by that server crash, which mit

Well, it says:

*The best way to recover from programmer errors is to crash immediately. *

And then

The only downside to crashing on programmer errors is that connected clients may be temporarily disrupted

Then:

rather than trying to avoid crashing in cases where the code is obviously wrong. The best way to debug these problems is to configure Node to dump core on an uncaught exception.

Now let's say I'm a small company learning Node and reading the docs for the first time. All I have is a service serving REST for my mobile app on top of Mongo. Let's say I'm exceptionally overpowered and I have a two quad core machines running Node behind nginx and I have a load balancer and I'm taking core dumps on every error. Let's say there is a single client sending an invalid JSON that by programmer error isn't checked and is parsed in the server.

The client has retry logic because the phone isn't always online so it sends the request until it is fulfilled, let's say once every 200 ms. The server takes about 4 seconds to load with all the dependencies (reasonable for Node). 5 requests per second * 4 seconds means 20 workers just to have any availability - unfortunately for our 16 worker (8 per quad core server) setup - we get a denial of service for our extremely overpowered setup.

Except that wasn't a startup, it happened to a big 10K+ employee tech company and I was called in in the middle of the night to clean the mess. It was hardly an isolated incident.

It's true that you then mention that it turns to an operational error for the end user - but that's hardly enough given all the "crash fast in production" advice the document just gave its readers.

@davepacheco
Copy link

@benjamingr Thanks for that context. That's very informative.

The situation I've run into a few times is similar, but where the uncaught exception was thrown after the Node program had already made a database query that took locks inside the database. Someone had added an uncaughtException handler to the program to avoid precisely the problem you described. But in the face of that exception, the code that would release the locks never ran, so the locks remained held with no way the code would ever release them. In this case, rather than making the server vulnerable to the DOS that you described, a single bad request ended up corking up most other requests on the system. When we restarted the service, the database socket was closed and the database released the locks. If the program had crashed instead of limping along, we would have had a few failed requests instead of several minutes of hung (and subsequently failed) requests.

Clearly, it's possible for this to go very badly no matter which approach you pick. I think it's always possible for there to be devastating bugs that reliably take out a server, and insufficient input validation is a major source of them.

People need to make the appropriate choices for their own use-cases, and these two examples help demonstrate the perils of either approach going wrong. I still believe the crash-on-programmer-error approach is a better default not just because the program is by definition in an undefined state in the face of programmer error, but because this approach turns implicit, non-fatal failure (which is usually very hard to debug) into explicit, fatal failure (which is generally easy to debug), making it more likely you can eliminate the problem altogether the first time it happens.

If people want to drop these ideas from official docs, I understand. My main interest is helping out if I can to help integrate that content and to make sure that the position in that guide is accurately represented.

@benjamingr
Copy link
Member

If the program had crashed instead of limping along, we would have had a few failed requests instead of several minutes of hung (and subsequently failed) requests.

That's a very helpful example, thanks. I 100% agree that in this case you must throw and restart the server - in fact, we do this ourselves in Promise.using in bluebird - if you fail to dispose a resource you have to crash fast. I think official error docs must mention this fact in them. This is however not a general programmer error - it is both specific to resources and can be an operational error.

I still believe the crash-on-programmer-error approach is a better default not just because the program is by definition in an undefined state in the face of programmer error, but because this approach turns implicit, non-fatal failure (which is usually very hard to debug) into explicit, fatal failure (which is generally easy to debug), making it more likely you can eliminate the problem altogether the first time it happens.

I'm not sure why we can't display both problems to readers and then recommend users handle their errors with care. That is my primary objection to the advice in the Joyent error handling document - it also contains some very useful advice and if we can amend it to show both cases.

Note that personally my optimal work flow is kind of strange but I think it has a nice balance:

  • I use promises, that's probably objectionable and debatable on its own.
  • Workers start by aborting on unhandled rejections (throws).
  • After one worker has aborted, following workers log but not abort.
  • Resources are always handled through disposers, all resources always contain cleanup logic enforced by RAII like behavior - if a resource fails to clean and leaks it throws and the worker restarts.

Not perfect - but working on it :)

@Trott
Copy link
Member

Trott commented Mar 13, 2018

Closing as this repository is dormant and likely to be archived soon. If this is still an issue, feel free to open it as an issue on the main node repository.

@Trott Trott closed this as completed Mar 13, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants