Skip to content
This repository has been archived by the owner on Aug 23, 2019. It is now read-only.

Make switch a state machine #278

Merged
merged 27 commits into from
Oct 19, 2018
Merged

Make switch a state machine #278

merged 27 commits into from
Oct 19, 2018

Conversation

jacobheun
Copy link
Contributor

@jacobheun jacobheun commented Sep 26, 2018

Getting this pushed up for visibility while I work on cleanup and add some additional tests.

Goal

This PR updates libp2p-switch to be a state machine, including its connections #24.

Updates

  • Switch will now emit start, stop, and error events. Callback legacy support is included.
  • Incoming connections will now ensure crypto is done before other muxing. Resolves #239 .
  • Outgoing and Incoming connections are now state machines. Rather than being internally, automatically driven, they are driven by dialer.js and handler.js respectively. This allows finer control over transitions and helps provide clarity around when and where errors occur.
  • Adds in some error standardization via the err-codes module
  • Fixed readme links

TODOs

  • There is redundant logic with the new Connection FSMs that needs to be cleaned up.
  • Add more, thorough tests for each of the connection FSMs.
  • Ensure incoming connections are being properly reused
  • General TODO cleanup
  • Add class-is module and remove instanceof logic, as it's unreliable across package versions.

Thoughts on Future Additions

  • As error codes are added throughout the system, have the FSMs handle fatal and non fatal errors appropriately.
  • Add hooks for new connections to allow modules like the connection manager, to make intelligent decisions about those connections (dial backoffs, etc).

References

Required by libp2p/js-libp2p#257

@ghost ghost assigned jacobheun Sep 26, 2018
@ghost ghost added the in progress label Sep 26, 2018
@jacobheun
Copy link
Contributor Author

@diasdavid @pgte @vasco-santos I've still got some cleanup to do per my notes in the PR overview, but I'd love to get some feedback on the overall implementation of the FSM. There may be some considerations I've missed that we could roll into this update.

Copy link
Contributor

@pgte pgte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with some questions / comments.

if (circuitTried) {
this.emit('error', new Error(`No available transports to dial peer ${this.theirB58Id}!`))
return this._state('disconnect')
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These last two conditions can dictate the end of the dialing attempt. As a client of the switch API, I'd like to now the reason why a dial failed. "No available transport" or "Circuit not enabled" doesn't tell me much of why the other transports failed.

As a remedy, I suggest that we keep track of the failing reasons and make them somehow available to the switch user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an alternative, emit a warning-like event like "connection attempt failed", containing an error object as payload.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the warning option as that would provide the user with more information about the various attempts, even if connections succeed on a later transport.

With the current structure the switch would need to emit the events. Perhaps the payload should also include the peerInfo so users understand where it came from. Something like:

_switch.emit('error:connection_attempt_failed', {
  peerInfo: theirPeerInfo,
  error: ...
})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This dialing is triggered by a switch.dial(), which is callback-based..
Having the switch emit these events independently of the dial() call means that, for the API user, it will be hard to correlate a call with these warnings.
Usually I solve this by either returning an even stream or an event emitter. In this case, I think that something like a dialFSM would be something useful to consider...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Currently Libp2p doesn't return anything from its dial methods so returning an event emitter there from a Switch dialFSM would enable us to keep backwards compatibility more easily while we transition the system.

For js-ipfs we'd also need to make sure to return that emitter from swarm.connect().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pgte I've added a dialsFSM method to the switch, that can be used instead of dial. Instead of a connection, it will return the fsm. I've added some tests for it to show the events and functionality it has https://github.com/libp2p/js-libp2p-switch/pull/278/files#diff-080e21482eea9ffdd7ee511eb71d3522.

The main things to note:
The raw connection is still returned in the callback passed to dialFSM

  1. Users can listen on error events to get fatal messages
  2. Users can listen on error:connection_attempt_failed events to get a list of errors for failed transport dials, these are not fatal to the connection.
  3. Users can listen on close events to know when the specific connection is closed.
  4. Users can call fsm.close() to close that specific connection.

Are there other events or methods that would be helpful to have? We can always wait to add other things if there aren't higher priority nice to haves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacobheun That's great! Other than successful connection attempt (which you can infer from having the "connected" callback called), I don't see any other so far.

src/connection/index.js Show resolved Hide resolved
src/dialer.js Outdated
connection.on('connected', () => connection.protect())
connection.on('private', () => connection.encrypt())
connection.on('encrypted', () => connection.upgrade())
connection.on('muxed', () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can all these events occur more than once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they should only occur once, I'll update these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@vasco-santos vasco-santos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me! Left just a few minor comments

src/dialer.js Outdated
const peerInfo = getPeerInfo(peer, _switch._peerBook)
const b58Id = peerInfo.id.toB58String()

log(`${_switch._peerInfo.id.toB58String().slice(0, 8)} dial request to ${b58Id.slice(0, 8)} with protocol ${protocol}`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verify if protocol exists and if not add unknown instead of allowing undefined

src/index.js Outdated
log('The switch has stopped')
this.emit('stopped')
})
this.state.on('error', (err) => this.emit('error', err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aiming to ease debuggability, I think we should log the error here as well.

src/index.js Outdated
@@ -167,7 +167,7 @@ class Switch extends EE {

/**
* If a muxed Connection exists for the given peer, it will be closed
* and its reference on the Switch will be removed.
* and its reference on the Switch and will be removed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo? I think it was correct before 🤔

src/index.js Show resolved Hide resolved
src/index.js Outdated

conn.muxer.end((err) => {
// If OK things are fine, and someone just shut down
if (err && err.message !== 'Fatal error: OK') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log the error?

this._dialWithTimeout(transport, addr, (err, conn) => {
if (err) {
log('work:error')
log(`${transport.constructor.name}:work:error`, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use log.error here

@jacobheun
Copy link
Contributor Author

After reviewing the crypto logic in switch as it currently is, we're not actually supporting multiple crypto modules on dial, only on listen. This needs to be corrected, but after doing some small changes in this branch I think it should wait until this is done as it's a significant change. That will enable us to review those changes more easily to ensure they're done properly.

Also, since secio is currently our only crypto library in libp2p this isn't a critical need.

- [`switch.stop(callback)`](#swarmclosecallback)
- [`switch.connection`](#connection)
- [`switch.dial(peer, protocol, callback)`](#switchdialpeer-protocol-callback)
- [`switch.dialFSM(peer, protocol, callback)`](#switchdialfsmpeer-protocol-callback)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better just mimic js-libp2p and do dial and dialProtocol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diasdavid are you suggesting to add dialProtocol in addition to the new dialFSM or as a naming replacement? I think having dial handle calls with and without a protocol is reasonable overloading. Replacing it wouldn't be good as it has a different callback footprint to give users access to the ConnectionFSM for finer control/handling.

I've exposed the same dialFSM method in the libp2p branch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to block this PR for this. What I mean is that given that libp2p (the module) is just an extension to the libp2p-switch (the dialing machine) that adds it other features, it would be cool if they both expose the same dialing/hangup APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Ideally I'd like to deprecate dialProtocol in libp2p and just use dial. Having both feel unnecessary. I'll propose the change in an issue there and if dialProtocol ends up staying we can add that in here.

@jacobheun jacobheun changed the title [WIP] Feat/state machine Make switch a state machine Oct 18, 2018
@jacobheun
Copy link
Contributor Author

@vasco-santos this should be ready for final review. I have this branch integrated in libp2p/js-libp2p#257, which should also be ready for review.

Copy link
Member

@vasco-santos vasco-santos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacobheun all in all looks good to me 🙂

Left just minor notes

README.md Outdated
- [`switch.unhandle(protocol)`](#switchunhandleprotocol)
- [`switch.start(callback)`](#switchstartcallback)
- [`switch.stop(callback)`](#switchstopcallback)
- [`switch.connection`](#switchconnection)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please alphabetize the API? I believe this helps when looking for documentation.

src/index.js Outdated
* Issues a start on the Switch state.
*
* @param {function} callback deprecated: Listening for the `error` and `start` events are recommended
* @fires Switch#started
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fires Switch#started ?

src/index.js Outdated
* A listener that will turn off all running services and listeners
*
* @private
* @fires Switch#error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@jacobheun
Copy link
Contributor Author

@vasco-santos I updated the order in the readme and fixed the improper usage of jsdocs @fires on individual methods and updated those events on the Switch class itself.

@jacobheun jacobheun merged commit 7dcabdd into master Oct 19, 2018
@ghost ghost removed the in progress label Oct 19, 2018
@jacobheun jacobheun deleted the feat/state-machine branch October 19, 2018 11:32
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants