
protocols/dcutr: Don't retry when incoming direct connection succeeded #2607

Open
mxinden opened this issue Apr 6, 2022 · 7 comments
Labels
bug, difficulty:moderate, getting-started, help wanted

Comments

@mxinden
Member

mxinden commented Apr 6, 2022

As of today, we retry the DCUtR process when the outgoing direct connection upgrade fails.

In the case where the incoming direct connection upgrade succeeded, there is no need to retry the process.

We should check direct_connections to see whether an (incoming) direct connection succeeded in the meantime.

} => {
    let peer_id =
        peer_id.expect("Peer of `Prototype::DirectConnection` is always known.");
    if attempt < MAX_NUMBER_OF_UPGRADE_ATTEMPTS {
        self.queued_actions.push_back(ActionBuilder::Connect {
            peer_id,
            handler: NotifyHandler::One(relayed_connection_id),
            attempt: attempt + 1,
        });
    } else {
        self.queued_actions.extend([
            NetworkBehaviourAction::NotifyHandler {
                peer_id,
                handler: NotifyHandler::One(relayed_connection_id),
                event: Either::Left(
                    handler::relayed::Command::UpgradeFinishedDontKeepAlive,
                ),
            }
            .into(),
            NetworkBehaviourAction::GenerateEvent(
                Event::DirectConnectionUpgradeFailed {
                    remote_peer_id: peer_id,
                    error: UpgradeError::Dial,
                },
            )
            .into(),
        ]);
    }
}
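
A minimal sketch of the guard this issue asks for, placed at the top of the dial-failure handling shown above. The surrounding match arm and the early return are assumptions; direct_connections is the field quoted further down in this thread.

// Assumed sketch: if an (incoming) direct connection to the peer already
// exists, skip the retry and don't report an upgrade failure.
if self.direct_connections.contains_key(&peer_id) {
    return;
}

if attempt < MAX_NUMBER_OF_UPGRADE_ATTEMPTS {
    // ... retry as above ...
} else {
    // ... emit Event::DirectConnectionUpgradeFailed as above ...
}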

@mxinden added the bug, difficulty:moderate, help wanted, and getting-started labels on Apr 6, 2022
@esotericbyte
Contributor

esotericbyte commented Apr 7, 2022

I have skimmed: https://pdos.csail.mit.edu/papers/p2pnat.pdf
Is the current behavior inefficient, or does it lead to more connection failures because the peers pursue different connections?
Is timing of the connections and interplay between the network, queued events, and futures executor a relevant concern?

Implement direct_connections check:
If there is a direct connection, stop trying to connect and just drop the effort; don't inject any events, because the direct connection would have generated its own success event.
If there is not, continue with a retry of the process.
Finish reading the library code used by the dcutr behavior.
Test the code and evaluate the solution
Based on testing consider instrumenting with metrics or tracing.

Code reading questions:
The check needs to be made before the upgrade process from a relay connection to direct connection is restarted.
Precisely where should the check go?

Is this just a single code branch?
Keep in mind that Rust resolves some of the logic at compile time, even if the code is futures code.
Are the events created by GenerateEvent handled as futures tasks, or via a separate event queue?
If separate, when they are handled do they create tasks and thus get placed on a secondary queue?
What happens if both connections succeed? (I believe direct_connections is a map from IDs to connections, so it would contain more than one connection on each end for their respective IDs.)
If the incoming direct connection upgrade is in progress, how many success cases may still be missed even if direct_connections is checked? (Tasks are executed in parallel.)
What is the latency of the various queues? What applications use this code that might be analyzed to understand behavior under various loads? Under various loads and contexts, do these queues actually cause significant delays, or are the delays from internal queues negligible? If unclear, look for a diagram of network messages combined with event and event-queue activity, and if there is none, make one.

@mxinden
Member Author

mxinden commented Apr 16, 2022

Is the current behavior inefficient, or does it lead to more connection failures because the peers pursue different connections?

Just inefficient.

Is timing of the connections and interplay between the network, queued events, and futures executor a relevant concern?

Probably relevant, though we will be collecting more data soon to back up this point. See https://github.com/dennis-tra/punchr.

If there is a direct connection, stop trying to connect and just drop the effort

Yes, that sounds right.

Precisely where should the check go?

inject_dial_failure should work.

If separate, when they are handled do they create tasks and thus get placed on a secondary queue?

There is a single task running all NetworkBehaviour and Swarm logic plus one task per connection.

What happens if both connections succeed? (I believe direct_connections is a map from IDs to connections, so it would contain more than one connection on each end for their respective IDs.)

For now, one connection would eventually idle and thus be closed. In the long run we would need libp2p/specs#328.

What is the latency of the various queues? What applications use this code that might be analyzed to understand behavior under various loads? Under various loads and contexts, do these queues actually cause significant delays, or are the delays from internal queues negligible? If unclear, look for a diagram of network messages combined with event and event-queue activity, and if there is none, make one.

Unfortunately I don't have data on this. Based on intuition and past experience I would guess it is negligible given that network latencies are an order of magnitude higher.

@normanade

Actually, the direct_connections hashmap cannot be reached from outside the Behaviour struct, making the pub struct rather useless.

pub struct Behaviour {
    /// Queue of actions to return when polled.
    queued_actions: VecDeque<ActionBuilder>,

    /// All direct (non-relayed) connections.
    direct_connections: HashMap<PeerId, HashSet<ConnectionId>>,
}

@thomaseizinger
Contributor

Actually, the direct_connections hashmap cannot be reached from outside the Behaviour struct

No need to access it from the outside. The behaviour gets notified about new incoming connections!
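
For illustration, a rough sketch (not the library's exact code) of how the behaviour could keep direct_connections up to date from those notifications, using the inject_connection_established callback of that era's NetworkBehaviour trait; the exact signature and the is_relayed() check are assumptions based on the API at the time.

// Sketch: record every new non-relayed connection so that the dial-failure
// handling above can tell whether a direct connection already exists.
fn inject_connection_established(
    &mut self,
    peer_id: &PeerId,
    connection_id: &ConnectionId,
    connected_point: &ConnectedPoint,
    _failed_addresses: Option<&Vec<Multiaddr>>,
    _other_established: usize,
) {
    if !connected_point.is_relayed() {
        self.direct_connections
            .entry(*peer_id)
            .or_default()
            .insert(*connection_id);
    }
}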

@esotericbyte
Contributor

esotericbyte commented Nov 29, 2022

Looking back, I am puzzled about how to debug multi-node issues like this in libp2p, or in an app where network infrastructure such as routers is involved. I probably should have asked about it sooner. Are there docs on how to set up automated testing, or is it necessary to recruit machines?

@mxinden
Member Author

mxinden commented Dec 6, 2022

@esotericbyte unfortunately we don't have automated testing in place here today. You can follow progress in libp2p/test-plans#21.

@thomaseizinger
Contributor

Update: We now have automated tests for hole-punching!
