This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Better error handling in daemon + gateway #1325

Closed
victorb opened this issue Apr 24, 2018 · 38 comments
Labels
awesome endeavour exp/wizard Extensive knowledge (implications, ramifications) required status/ready Ready to be worked

Comments

@victorb
Member

victorb commented Apr 24, 2018

Today, if there is an error in a stream or any other error within the daemon or the gateway, the entire process crashes. While not a problem for testing, it's a problem when deploying and trying to run a long-running daemon (for example, for the js-ipfs daemons).

I raised this before in #1243

Basically, we should be able to gracefully handle errors down the stack without having the daemon and gateway crash.

The current issue is the following error, which happens every ~10 minutes:

/usr/src/app/node_modules/pull-pair/index.js:11
      throw new Error('already piped')
      ^

Error: already piped
    at sink (/usr/src/app/node_modules/pull-pair/index.js:11:13)
    at consume (/usr/src/app/node_modules/pull-defer/sink.js:7:17)
    at consume (/usr/src/app/node_modules/pull-defer/sink.js:7:17)
    at Connection.consume (/usr/src/app/node_modules/pull-defer/sink.js:7:17)
    at pull (/usr/src/app/node_modules/pull-stream/pull.js:43:9)
    at Dialer.handle (/usr/src/app/node_modules/multistream-select/src/dialer/index.js:47:5)
    at attemptMuxerUpgrade (/usr/src/app/node_modules/libp2p-switch/src/dial.js:164:10)
    at gotWarmedUpConn (/usr/src/app/node_modules/libp2p-switch/src/dial.js:55:7)
    at Switch.dial (/usr/src/app/node_modules/libp2p-switch/src/dial.js:41:9)
    at _getPeerInfo (/usr/src/app/node_modules/libp2p/src/index.js:238:19)

I'm not saying that we should (or should not) fix this particular error, but rather that when this error (and similar ones) happens, the daemon and gateway should not die but continue running.

@victorb
Member Author

victorb commented Apr 25, 2018

@fsdiogo as you're the current OKR owner for this, I've assigned this to you

@victorb
Member Author

victorb commented Apr 25, 2018

Another error that happens directly after starting daemon right now:

Initializing daemon...
Using wrtc for webrtc support
/usr/src/app/src/cli/commands/daemon.js:40
        throw err
        ^

Error: websocket error
    at WS.Transport.onError (/usr/src/app/node_modules/engine.io-client/lib/transport.js:64:13)
    at WebSocket.ws.onerror (/usr/src/app/node_modules/engine.io-client/lib/transports/websocket.js:150:10)
    at WebSocket.onError (/usr/src/app/node_modules/engine.io-client/node_modules/ws/lib/EventTarget.js:109:16)
    at WebSocket.emit (events.js:160:13)
    at WebSocket.finalize (/usr/src/app/node_modules/engine.io-client/node_modules/ws/lib/WebSocket.js:182:41)
    at ClientRequest._req.on (/usr/src/app/node_modules/engine.io-client/node_modules/ws/lib/WebSocket.js:653:12)
    at ClientRequest.emit (events.js:160:13)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:539:21)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:112:17)
    at TLSSocket.socketOnData (_http_client.js:439:20)

@fsdiogo
Contributor

fsdiogo commented May 10, 2018

@victorbjelkholm I can't seem to reproduce the error in your last comment. Can you share the steps or configs you have that produce the error?

@victorb
Member Author

victorb commented May 14, 2018

@fsdiogo this issue (#1325) is not about a specific error but rather about having an error handler for uncaught exceptions from within the daemon. Currently, any error that happens down the stack (in js-bitswap, for example) crashes the daemon, which is why it's so unstable right now. We need to make it so the daemon doesn't crash completely and need a restart just because of an error in one stream or one operation.

@fsdiogo
Contributor

fsdiogo commented May 14, 2018

Alright, so it's best to start off on the daemon itself, i.e. try to reproduce an error and make it so that the daemon doesn't crash?

@victorb
Member Author

victorb commented May 14, 2018

@fsdiogo I think just adding if (Math.random() > 0.5) throw new Error('Test error') at various levels of the stack and testing manually will let you discover one or two places where we can catch the errors appropriately.
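
For example, a minimal sketch of that idea - fetchBlock here is a made-up stand-in for any internal call deep in the stack, not an actual js-ipfs function:

// Hypothetical fault injection: fail roughly half of the calls.
function fetchBlock (cid, callback) {
  if (Math.random() > 0.5) {
    return callback(new Error('Test error'))
  }
  callback(null, { cid })
}

// A caller that handles the error instead of letting it crash the daemon.
fetchBlock('QmHash', (err, block) => {
  if (err) {
    console.error('fetchBlock failed, daemon keeps running:', err.message)
    return
  }
  console.log('got block', block.cid)
})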

@fsdiogo
Contributor

fsdiogo commented May 16, 2018

I've been thinking about a better approach to this problem. Instead of a band-aid approach, we should really understand how we're dealing with errors and have a robust solution for propagating them.

We are using async throughout the project, but check this:

/*
  With async the stack trace gets lost: it only tells the line
  where we throw the error, but it came from one of the calls
  in the series, we don't have a way to know from which one.
 */

const async = require('async')

async.series([
  (cb) => cb(null, 'async x'),
  (cb) => cb('async error'),
  (cb) => cb(null, 'async xxx'),
], (err, res) => {
  if (err) {
    throw new Error(err)
  }
})

/*
> node async.js
  Users/diogo/async-promises/async.js:15
  	throw new Error(err);
  	^

  Error: async error
    at async.series (/Users/diogo/async-promises/async.js:15:10)
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:3874:9
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:473:16
    at iterateeCallback (/Users/diogo/async-promises/node_modules/async/dist/async.js:976:17)
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:958:16
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:3871:13
    at async.series (/Users/diogo/async-promises/async.js:11:11)
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:3866:24
    at replenish (/Users/diogo/async-promises/node_modules/async/dist/async.js:998:17)
    at iterateeCallback (/Users/diogo/async-promises/node_modules/async/dist/async.js:983:17)
*/

A solution for this is Promises:

/*
  With promises the stack trace is intact, being possible
  to observe all the lines and modules that lead to the error.
 */

new Promise((resolve, reject) => {
  setTimeout(() => resolve('promise x'), 100)
})
.then((res) => res + 'x')
.then((res) => randomFn())
.then((res) => res + 'xxx')
.catch((err) => console.log(err))

const randomFn = () => {
  throw new Error('randomFn error')
}

/*
> node promises.js
  Error: randomFn error
    at randomFn (/Users/diogo/async-promises/promises.js:14:9)
    at Promise.then.then (/Users/diogo/async-promises/promises.js:9:16)
    at <anonymous>
*/

Promises will vastly improve our debugging capability and therefore help us handle future errors. Moreover, we'd standardise the way we handle errors and improve the codebase by handling errors that are currently being lost along the way.

PR #1335 by @hugomrdias is a good example of that.

I had already opened the issue #1311 to start using promises, but I think this KR is the perfect case of why we should really do that. It's not a matter of callbacks vs promises, but of what works vs what doesn't.

I've been thinking of possible solutions but this is the best way that I've found, I've talked with @vasco-santos and @hugomrdias and they agree with this approach.

What do you guys think? @victorbjelkholm @diasdavid

@fsdiogo
Copy link
Contributor

fsdiogo commented May 16, 2018

@dryajov want to chime in as you've dealt with this too?

@dryajov
Copy link
Member

dryajov commented May 16, 2018

@fsdiogo so the issue here is not really with async or Promise. The reason the stack is lost in your async example is that you are instantiating the Error object inside the series done callback:

async.series([
  (cb) => cb(null, 'async x'),
  (cb) => cb('async error'),
  (cb) => cb(null, 'async xxx'),
], (err, res) => {
  if (err) {
    throw new Error(err) // --> error instantiated here
  }
})

The Error object will capture the stack at the place of instantiation, so if we change your example to:

const async = require('async')

async.series([
  (cb) => cb(null, 'async x'),
  (cb) => cb(new Error('async error')),  // --> error instantiated here
  (cb) => cb(null, 'async xxx'),
], (err, res) => {
  if (err) {
    throw err
  }
})

We get the correct stacktrace:

/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/test.js:11
    throw err
    ^

Error: async error
    at async.series (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/test.js:7:14)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:3866:24
    at replenish (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:998:17)
    at iterateeCallback (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:983:17)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:958:16
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:3871:13
    at async.series (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/test.js:6:11)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:3866:24
    at replenish (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:998:17)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:1002:9

Note that in your promise example you do instantiate the error at the place of rejection, so the stack comes up correctly as well:

/*
  With promises the stack trace is intact, being possible
  to observe all the lines and modules that lead to the error.
 */

new Promise((resolve, reject) => {
  setTimeout(() => resolve('promise x'), 100)
})
.then((res) => res + 'x')
.then((res) => randomFn())
.then((res) => res + 'xxx')
.catch((err) => console.log(err))

const randomFn = () => {
  throw new Error('randomFn error')
}

/*
> node promises.js
  Error: randomFn error
    at randomFn (/Users/diogo/async-promises/promises.js:14:9)
    at Promise.then.then (/Users/diogo/async-promises/promises.js:9:16)
    at <anonymous>
*/

@hugomrdias
Member

@dryajov sure, you're correct, but the problem isn't where you create the errors. It's propagation: with async, error instances are just passed around (or teleported, as I call it). With promises, errors are automatically re-thrown, and stack traces follow the code path with a lot less noise from async internals.

Not having full stack traces is really painful because we have to manually search the code base, commenting things out or whatever, until we find the line we need. In contrast, with complete stacks we can cmd+click and BOOM, the editor jumps to the line we need: perfect developer experience and full productivity.

@dryajov
Member

dryajov commented May 16, 2018

Aside from my comment above, this is still obviously a legitimate issue overall. Promises do help with that somewhat, but IMO they are not the final solution for this particular issue, and I would argue that bringing them up here as an end-all solution for this problem might end up doing more harm than good.

I thought about it for some time and I think some of the issues we have boil down to mixing error handling styles. We rely heavily on callback-style async constructs in our code base, which occasionally get mixed with try/catch somewhere deep in the codebase or a third-party module; add EventEmitter error event types and Promises on top of that and we get a pretty wild compound of things that can and do go wrong.

IMO, in order to handle errors correctly we need to:

  • find all the places where we're not handling errors appropriately and fix them
  • Standardise error handling across our code base
    • look for throws and return them as callback errors where possible
    • ditto for promises
  • EventEmitter errors should also be appropriately handled - non-fatal (error) type errors should be emitted as warnings/user errors, since error is a special case and throws if there are no listeners attached
  • Finally, have a global error handler that catches unhandled errors and terminates the process gracefully (a sketch follows below)
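
A minimal sketch of that last point, assuming a plain Node.js entry point (stopDaemon is a hypothetical shutdown helper, not an existing js-ipfs function):

// Last-resort handlers: log the error, then shut down gracefully.
process.on('uncaughtException', (err) => {
  console.error('Uncaught exception:', err)
  stopDaemon(() => process.exit(1))
})

process.on('unhandledRejection', (err) => {
  console.error('Unhandled rejection:', err)
  stopDaemon(() => process.exit(1))
})

function stopDaemon (done) {
  // close the HTTP API, gateway and libp2p connections here, then call done()
  done()
}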

@hugomrdias
Member

hugomrdias commented May 16, 2018

The stuff below is also tricky because the callback might not be called; in some places we even have setImmediate(self.emit('error', err)) to work around that.

const done = (err, res) => {
  if (err) {
    self.emit('error', err)
    return callback(err)
  }

  self.state.initialized()
  self.emit('init')
  callback(null, res)
}

What do you think about removing event emitters? Moving stuff like boot() inside the IPFS constructor to an instance method or a static factory method, where we can return a promise or use callbacks.
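
Roughly something like this - a sketch only, the names are illustrative and not the actual js-ipfs API:

class IPFS {
  constructor (options) {
    this._options = options
  }

  // Static factory: boot errors surface as a rejected promise
  // instead of an 'error' event that may have no listener attached.
  static async create (options) {
    const node = new IPFS(options)
    await node._boot() // hypothetical internal boot step
    return node
  }

  async _boot () {
    // init the repo, start libp2p, etc.
  }
}

// Usage: errors propagate to the caller, no EventEmitter needed.
IPFS.create({ repo: '/tmp/ipfs' })
  .then((node) => console.log('node ready'))
  .catch((err) => console.error('boot failed:', err))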

@dryajov
Member

dryajov commented May 16, 2018

Ah, yeah - setImmediate(self.emit('error', err)) is mostly there because emit is sync, and the handler might actually never return, which breaks the flow, hence the setImmediate to force it async. I think the same thing would happen with promises too?


Yeah, I think removing EventEmitter can be a good thing. We use it in js-ipfs (js-libp2p might still be appropriate?) primarily because we want to propagate errors during instantiation, so a factory method seems like the right way to go.

@hugomrdias
Member

Yes, it's true for promises too; that's why I mentioned removing event emitters. IMO this should be one of the first things to do, then all the stuff you said, and finally add promises to help with stacks, debugging, backtracking, cancelables, retries/timeouts, etc.

@fsdiogo
Contributor

fsdiogo commented May 17, 2018

Thanks for the input @dryajov and @hugomrdias!

  • Find where we're not handling errors appropriately and fix them
  • Standardise error handling across our code base
    • look for throws and return them as callback errors where possible
    • same for promises
  • Have a global error handler that catches unhandled errors and terminates the process gracefully
  • Remove EventEmitter and move boot to a factory method (requires a breaking change)

@Mr0grog
Contributor

Mr0grog commented May 17, 2018

So, I realize that removing event emitting can simplify things (no argument from me that any critical error during startup should translate to an error in the callback/rejection in a promise), but we should keep in mind that the events are part of the public API. Various users may be relying on them. They at least deserve notice of the removal well before it happens, and ideally a release or two where the events still work but log a warning when used.

(That said, event emitters should all be wrapped in setImmediate, nextTick, or try/catch to ensure that a listener cannot break the emitting code.)
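
For example, a minimal sketch of the deferred-emit variant (safeEmitError is a made-up helper, not existing code):

const EventEmitter = require('events')

// Defer the emit so a throwing 'error' listener cannot interrupt
// the code that is currently running.
function safeEmitError (emitter, err) {
  setImmediate(() => emitter.emit('error', err))
}

const self = new EventEmitter()
self.on('error', () => { throw new Error('listener bug') })

safeEmitError(self, new Error('boom'))
console.log('this line still runs; the listener error surfaces on a later tick')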

@dryajov
Member

dryajov commented May 17, 2018

@Mr0grog agree, this change should be treated as a breaking change and possibly a major version bump (0.3.x?).

@daviddias
Member

@fsdiogo before refactoring the code, make sure to have tests that check for the situations where the daemon errors out today. Try things like opening a connection and writing trash at different levels of the dialing (transport -> secio -> stream muxing -> protocols).

@fsdiogo
Contributor

fsdiogo commented May 18, 2018

Yeah @Mr0grog @dryajov, it will be a breaking change and apps using ipfs will probably have to be refactored, so that's why in its first iteration we should offer both versions. I've opened a small WIP PR with that: #1359.

But as @diasdavid pointed out, I shall start with a different task. I've updated #1325 (comment) with the tasks order.

@alanshaw
Member

I have a couple more suggestions for your consideration:

  • When dealing with Node.js streams, use pump whenever possible to ensure errors are caught and handled
  • Add context to errors and allow our tests to assert on error messages we control in our modules by using explain-error (see the sketch after this list)
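
A rough sketch of both suggestions together (the file names and messages are made up for illustration):

const fs = require('fs')
const pump = require('pump')
const explain = require('explain-error')

// pump wires the streams together and tears all of them down
// if any one of them errors, calling the final callback exactly once.
pump(
  fs.createReadStream('./input.txt'),
  fs.createWriteStream('./output.txt'),
  (err) => {
    if (err) {
      // explain-error wraps the low-level error with context we control,
      // which our tests can then assert on.
      throw explain(err, 'failed to copy input.txt to output.txt')
    }
    console.log('copy finished')
  }
)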

To address @victorbjelkholm's original request:

In Node.js the best practice on uncaught exceptions is to crash the process. We need to track down and fix any issues that cause the process to crash. In Node.js you can't meaningfully reason about the state of the system in the case of an unhandled exception so you need to restart the process to avoid leaking memory, holding open connections unnecessarily, corrupting shared state etc. etc.

It's worth having a read of https://www.joyent.com/node-js/production/design/errors - relevant section:

The best way to recover from programmer errors is to crash immediately.

The only downside to crashing on programmer errors is that connected clients may be temporarily disrupted, but remember:

  • By definition, these errors are always bugs. We're not talking about legitimate system or network failures, but actual bugs in the program. They should be rare in production, and the top priority has to be to debug and fix them.
  • For all the cases described above (and many more), the requests in flight are not necessarily going to complete successfully anyway. They may complete successfully, they may crash the server again, they may complete incorrectly in obvious ways, or they may complete wrongly in very subtle ways that are very hard to debug.
  • In a reliable distributed system, clients must be able to deal with server failure by reconnecting and retrying requests. Network and system failure are a reality, whether or not the Node.js program itself is allowed to crash.
  • If your production program is crashing so often that these disconnections are a problem, then the real problem is that the server is so buggy, not that it crashes in the case of a bug.

If disconnecting clients is a frequent problem because a server crashes so often, you should focus on the bugs that cause the service to crash — and make those exceptional — rather than trying to avoid crashing in cases where the code is obviously wrong.

@alanshaw
Member

I'm not clear on the benefits of removing EventEmitter. It's used for boot, but IPFS also emits init, start and stop events in addition to ready and error (note that error is used for startup failures as well as boot failures). People are probably using all of those.

To avoid crashing the process on an error event you can check and emit the error event only if there are listeners for it.
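
Something like this sketch, rather than the actual js-ipfs code:

const EventEmitter = require('events')

// Emit 'error' only when a listener is attached; otherwise log it,
// so an unhandled 'error' event cannot crash the process.
function emitErrorIfListened (emitter, err) {
  if (emitter.listenerCount('error') > 0) {
    emitter.emit('error', err)
  } else {
    console.warn('ignored error (no listeners attached):', err.message)
  }
}

const node = new EventEmitter()
emitErrorIfListened(node, new Error('boom')) // only logged, the process keeps running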

I get the point about handlers throwing and not allowing the callback to be called, but those are unhandled exceptions - you don't really want to catch them (what would you do? it's not your code that err'd), and if you put them on the next tick then it's just going to be an unhandled exception there too. Our code should be doing any cleanup and/or running critical code paths before emit is called.

BTW I'm +1 on all the other points that @fsdiogo has listed.

@Mr0grog
Contributor

Mr0grog commented May 25, 2018

When dealing with Node.js streams, use pump whenever possible

👍 pump and pumpify are both excellent and hugely useful little tools for streams. It’s a lot of work to make sure you do handle errors right without them.

In Node.js you can't meaningfully reason about the state of the system in the case of an unhandled exception so you need to restart the process to avoid leaking memory, holding open connections unnecessarily, corrupting shared state etc. etc.

I guess I read this issue more as “there are a lot of errors we could be catching and handling so as not to leave ourselves in a bad state, so we don’t need to crash the daemon as often.” Because otherwise I totally agree that simply ignoring any old error is a recipe for really confusing and possibly more dangerous disasters, like data loss, sometime later in the process’s lifetime.

I get the point about handlers throwing and not allowing the callback to be called, but they are unhandled exceptions - you don't really want to catch them (what would you do? - it's not your code that err'd)

In any other large system I’ve worked on, we’ve generally followed the logic that, if you catch them, you do your clean-up and then re-throw (or just use a finally without a catch). The point isn’t to stop the error; it’s to stop an external error from breaking your internal state. (e.g. you want to be as recoverable as possible if the external error is caught and recovered in some other way, like with a global error handler.)
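
In other words, something like this generic sketch (openConnections and processRequest are made-up placeholders, not js-ipfs code):

const openConnections = new Set()

// Stand-in for external code that may throw; it is not ours to handle.
function processRequest (conn) {
  throw new Error('external failure')
}

// Clean up our own state, but let the external error keep propagating.
function handleRequest (conn) {
  openConnections.add(conn)
  try {
    processRequest(conn)
  } finally {
    // runs whether or not processRequest threw, so internal state stays consistent
    openConnections.delete(conn)
  }
}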

and if you put them on the next tick then it's just going to be an unhandled exception there too.

Exactly! Same as above; that’s the point. It’s not your error, so you shouldn’t be preventing/stopping it. You should only be preventing it from breaking your internal state because, again, it’s not your error.

Our code should be doing any cleanup and/or running critical code paths before emit is called.

Wherever possible, 👍 ! It’s just not always reasonable/possible. Sometimes it’s important that the ordering of the event come before some other code (e.g. before[action]/will[action] events, which I know we don’t have here, but in the abstract, it still applies).

@dryajov
Member

dryajov commented May 25, 2018

👍 pump and pumpify are both excellent and hugely useful little tools for streams. It’s a lot of work to make sure you do handle errors right without them.

I would actually abstain from using node streams at all and instead use pull-stream (converting the node stream to a pull-stream with stream-to-pull-stream); the parts of our codebase that are still using node streams are most likely legacy leftovers that need to be reworked anyway.
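
For example, a rough sketch of that conversion (the input file is just a placeholder):

const fs = require('fs')
const pull = require('pull-stream')
const toPull = require('stream-to-pull-stream')

// Wrap a Node.js readable stream as a pull-stream source, then consume it
// with a sink that receives any error in a single callback.
pull(
  toPull.source(fs.createReadStream('./input.txt')),
  pull.collect((err, chunks) => {
    if (err) {
      return console.error('stream failed:', err)
    }
    console.log('read', Buffer.concat(chunks).length, 'bytes')
  })
)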

@Mr0grog
Contributor

Mr0grog commented May 25, 2018

I would actually abstain of using node stream at all and instead use pull-stream

Yeah, you’re probably right for the case of all the stuff here. I think pump/pumpify are good light-weight tools for making built-in streams sane, while pull-stream is a heavier-weight replacement for built-in streams that already solves most of those problems :)

@dryajov
Member

dryajov commented May 25, 2018

Also, we made the decision to use pull-stream exclusively a while back. Plus, although full-featured, pull-streams are still lightweight compared to node streams in general.

@fsdiogo fsdiogo added the exp/wizard Extensive knowledge (implications, ramifications) required label May 29, 2018
@fsdiogo fsdiogo added help wanted Seeking public contribution on this issue awesome endeavour labels May 29, 2018
@alanshaw
Member

@dryajov point taken! The sentiment was to ensure stream errors do not cause the daemon to crash so as long as we're ensuring that I'm happy 😄

@daviddias daviddias added the status/ready Ready to be worked label May 30, 2018
@daviddias
Member

@fsdiogo during the last js-ipfs weekly, there were some questions about how to approach this. Do you feel it would be good to sync?

@fsdiogo
Contributor

fsdiogo commented May 30, 2018

Well, I was going to follow these steps, if you feel that's the way to go, awesome! If not, we should sync 🙂

@daviddias
Member

What is the approach to achieve this:

[image]

@fsdiogo
Contributor

fsdiogo commented May 30, 2018

@daviddias
Member

Ok, I see, so that is where we diverge. You should first of all create the tests that cause the crashes.

@fsdiogo
Contributor

fsdiogo commented May 30, 2018

You want that approach so we don't start changing the code blindly, right? I get it.

I just think that way the process will be a lot slower, no?

@daviddias
Member

@fsdiogo we are not optimizing for shipping fast here, the target should be correctness and ensuring that the same problems don't come back to haunt us.

@dryajov
Member

dryajov commented May 30, 2018

Agree, we should have a test for every failure we run into 👍

@daviddias daviddias mentioned this issue Jun 7, 2018
@daviddias
Member

@fsdiogo a good source of problems to solve is the list of bugs -- https://github.com/ipfs/js-ipfs/labels/bug --. Creating tests that reproduce those will help you nail down where the problems are coming from.

@alanshaw, @pgte and @satazor have a vast amount of experience in making large Node.js services robust. They can give a ton of tips on how to move forward the best way possible, reach out!

@fsdiogo
Contributor

fsdiogo commented Jun 8, 2018

Alright, thanks @diasdavid! I'll reach out then!

@daviddias
Member

No more "already piped" with 0.31 - #1458

@daviddias
Member

Duplicate/Same as #1406. Let's continue there
