
feat(gossipsub): use signed peer records on handle prune #3579

Closed
wants to merge 2 commits

Conversation

@dapplion commented Mar 10, 2023

Description

Gossipsub's PeerExchange functionality suggests peers to connect to when pruning a node from the mesh. Currently, we cannot implement sending signed peer records to other nodes because we don't have a peer store (yet). This patch implements handling incoming signed peer records from other nodes in case we get pruned from the mesh.
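Roughly, the receive path this adds looks like the sketch below. This is a non-authoritative sketch: the signed_peer_record field on PeerInfo is what this PR introduces, and the names follow the diff fragments quoted later in the review rather than a final API.

// Sketch: called when we get pruned and the PRUNE carries peer-exchange info.
fn px_connect(&mut self, mut px: Vec<PeerInfo>, n: usize) {
    // Only consider entries that actually identify a peer.
    px.retain(|p| p.peer_id.is_some());
    // Dial at most `n` of them (the real code picks them at random).
    px.truncate(n);

    for p in px {
        let peer_id = p.peer_id.expect("filtered above");

        // Addresses come from the peer's signed peer record, if one was attached.
        let addresses = p
            .signed_peer_record
            .map(|record| record.addresses().to_vec())
            .unwrap_or_default();

        // Mark as a PX peer and dial it with those addresses.
        self.px_peers.insert(peer_id);
        self.events.push_back(NetworkBehaviourAction::Dial {
            opts: DialOpts::peer_id(peer_id).addresses(addresses).build(),
        });
    }
}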

Related #2398.

Notes

N/A

Links to any relevant issues

Open Questions

N/A

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
    • Could not find any docs affected, please indicate otherwise
  • I have added tests that prove my fix is effective or that my feature works
    • A test covers dialing peers from signed_peer_records, but does not assert that the addresses are correct; see

// TODO: How to extract addresses from DialOpts to assert them in the test?
NetworkBehaviourAction::Dial { opts } => opts.get_peer_id(),

  • A changelog entry has been made in the appropriate crates
    • Where should I append the entry? Add a new # 0.45.0 section?

@dapplion changed the title from "Use signed peer records on handle prune" to "Use signed peer records on gossipsub handle prune" Mar 10, 2023
@dapplion changed the title from "Use signed peer records on gossipsub handle prune" to "feat(gossipsub) use signed peer records on gossipsub handle prune" Mar 10, 2023
@dapplion changed the title from "feat(gossipsub) use signed peer records on gossipsub handle prune" to "feat(gossipsub) use signed peer records on handle prune" Mar 10, 2023
@thomaseizinger changed the title from "feat(gossipsub) use signed peer records on handle prune" to "feat(gossipsub): use signed peer records on handle prune" Mar 11, 2023
@thomaseizinger left a comment

Thanks! This is definitely an improvement by itself.

Comment on lines +1907 to +1910
// No duplicates
assert_eq!(dials_set.len(), config.prune_peers());

// all dial peers must be in px
assert!(dials_set.is_subset(&px_peer_ids.iter().copied().collect::<HashSet<_>>()));
Contributor

I think it would be good to assert the presence of the addresses too!

Author

Any suggestion on how to retrieve the addresses from the DialOpts struct?

// TODO: How to extract addresses from DialOpts to assert them in the test?
NetworkBehaviourAction::Dial { opts } => opts.get_peer_id(),

Or should I make this fn public?

pub(crate) fn get_addresses(&self) -> Vec<Multiaddr> {
self.addresses.clone()
}
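For illustration, assuming get_addresses were made public (hypothetical; today it is pub(crate)), the test could collect the peer ID and the addresses of every dial together and compare the addresses against the ones carried in the signed peer records:

// Hypothetical sketch: relies on a public `DialOpts::get_addresses`, which does not exist yet.
// `gs.events` is the behaviour's internal event queue inspected by the existing unit tests.
let dials: Vec<(Option<PeerId>, Vec<Multiaddr>)> = gs
    .events
    .iter()
    .filter_map(|event| match event {
        NetworkBehaviourAction::Dial { opts } => Some((opts.get_peer_id(), opts.get_addresses())),
        _ => None,
    })
    .collect();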

Contributor

I am not really a fan of the style of the tests in gossipsub anyway. We should only be testing behaviours by running them in a Swarm and not call these lifecycle functions manually.

We now have a new crate, libp2p-swarm-test which makes it easy to set up swarms for testing.

How about we do the following:

  • Implement "sending" of signed peer records by keeping a hashmap of records in the behaviour where we perform a lookup upon prune
  • At the moment, we don't have a way of adding records to this hashmap so in reality, there won't ever be any
  • Within the test, we can add a record (by accessing the field directly)
  • We can then write a test using libp2p-swarm-test that ends up sending a Prune to a node which will then contain the SignedPeerRecord
  • We can then assert that the Swarm establishes a connection to the node specified in the prune message

What do you think?
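For the first bullet, a minimal sketch of the "sending" side could look like the following. The peer_records map and the signed_peer_record field are hypothetical names for illustration; PeerRecord is the existing libp2p-core type.

// Hypothetical addition to the behaviour (names are illustrative only).
pub struct Behaviour {
    // ... existing fields ...
    /// Signed records we can hand out in PX. There is no way to populate this from
    /// the outside yet, so in practice it stays empty; a test can fill it directly.
    peer_records: HashMap<PeerId, PeerRecord>,
}

impl Behaviour {
    // When building the PX list for a PRUNE, attach the signed record if we have one.
    fn px_for_prune(&self, candidates: &[PeerId]) -> Vec<PeerInfo> {
        candidates
            .iter()
            .map(|peer_id| PeerInfo {
                peer_id: Some(*peer_id),
                signed_peer_record: self.peer_records.get(peer_id).cloned(),
            })
            .collect()
    }
}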

Contributor

going this way seems reasonable to me

Author

That makes sense, though: if all tests in this package are written in a consistent style, should this PR stick to that style, and then refactor them to use swarm-test in another PR?

Contributor

That makes sense, though: if all tests in this package are written in a consistent style, should this PR stick to that style, and then refactor them to use swarm-test in another PR?

Eventually yes but that is a massive effort. I don't mind mixing styles in a transition period if we get a better, more feature-oriented test in exchange.

@thomaseizinger

Regarding the changelog: given that this is backwards-compatible, can you add an entry to the 0.44.1 section?

@dapplion

cargo-semver-checks CI step errors with

    Finished dev [unoptimized + debuginfo] target(s) in 38.99s
     Parsing libp2p-core v0.39.0 (baseline, cached)
Error: version mismatch between current (v24) and baseline (v23) format versions
Error: Process completed with exit code 1.

Any pointers on what could be the root issue? https://github.com/libp2p/rust-libp2p/actions/runs/4380836480/jobs/7668318386#step:14:409

@thomaseizinger

cargo-semver-checks CI step errors with

    Finished dev [unoptimized + debuginfo] target(s) in 38.99s
     Parsing libp2p-core v0.39.0 (baseline, cached)
Error: version mismatch between current (v24) and baseline (v23) format versions
Error: Process completed with exit code 1.

Any pointers on what could be the root issue? libp2p/rust-libp2p/actions/runs/4380836480/jobs/7668318386#step:14:409

Uhm this is likely because the caches are stale. I'll delete them.


mergify bot commented Mar 12, 2023

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏


mxinden commented Mar 16, 2023

Let me know if you need a review from my end. Also adding @divagant-martian and @AgeManning for visibility and, if time permits, a review.

@AgeManning left a comment

Yeah nice. Looks good. Just a small comment

// mark as px peer
self.px_peers.insert(peer_id);

// dial peer
self.events.push_back(NetworkBehaviourAction::Dial {
- opts: DialOpts::peer_id(peer_id).build(),
+ opts: DialOpts::peer_id(peer_id).addresses(addresses).build(),
Contributor

I know this should have been done earlier, probably on line 1617.

I think we probably don't want to try to connect to all the peers; we probably only want to try to get up to mesh_n_high. I'd suggest we set n on line 1617 to be the min of self.config.prune_peers() and self.mesh.get(&topic_hash).map(|x| x.len()).unwrap_or(0), or something like that. Which may mean calculating this in handle_prune(), because we don't have the topic hash here.

Contributor

Agree with the needed change in general.

Note that prune_peers, according to our docs, configures the number of peers we send to others, not the number of peers we connect to. Either the code or the docs need to be adjusted. But going by the current docs, this should probably not be used here (or in handle_prune).

It's also not clear to me if handling the prune is the best place to connect to those peers. We already graft peers in the heartbeat when below mesh_n_low. We might want to connect to these peers to accelerate the process, or we might want to prioritize already connected peers since they are somewhat more trusted. In any case, both mesh_n_high and mesh_n_low should probably be used to select the number of peers from this set we want to try.

Contributor

PeerExchange is meant to give you more peers in case you are only connected to a few, right?

Perhaps it would be better to just add those to some kind of list and let the grafting logic upon heartbeats do the connecting?

For the case where we don't have many peers, that would otherwise "fail" in that we can't connect to as many peers as we'd like.

Author

Those are good points, and the path forward is a matter of opinion. Maintainers and @divagant-martian @AgeManning, could you comment on what makes more sense?

IMO connecting eagerly instead of accumulating keeps it simpler, and limits it only to what the mesh needs. For reference, the Go implementation eagerly schedules all peer exchanges onto the connect channel (with a max cap of MaxPendingConnections).

Contributor

I've not touched gossipsub much so I'll leave it to @mxinden and @AgeManning to make a decision here.

For reference, the Go implementation eagerly schedules all peer exchanges onto the connect channel (with a max cap of MaxPendingConnections).

That is an interesting data point.

Contributor

For the sake of unblocking this PR we can go with eager connection, since as Age mentioned, this needs to be explicitly turned on.

Author

@AgeManning Should do_px be used so that PX is not consumed by default? Or should we add a new config param like use_px to split provisioning and consumption of PX?

Contributor

I think we should have a more explicit config. I think we should opt for enable_peer_exchange(), which enables both sending and receiving. I don't think users really need granularity over one or the other (we can add it later if it's requested).

I'd suggest renaming do_px to enable_peer_exchange, and having it off by default.

Contributor

@dapplion to achieve this you technically need to change nothing; this is already the case. Age's suggestion is a nice-to-have, but it would change the public API, so the version should be updated accordingly. I'm fine with doing or not doing this change.

Contributor

We can do it in a non-breaking way by adding enable_peer_exchange on Config and deprecating the do_px function. Internally, the field can be renamed without issues.
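A sketch of what that could look like; the struct is trimmed down to the relevant flag and the method bodies are illustrative only, not the real gossipsub Config/ConfigBuilder:

#[derive(Default)]
pub struct Config {
    enable_peer_exchange: bool,
}

#[derive(Default)]
pub struct ConfigBuilder {
    config: Config,
}

impl ConfigBuilder {
    /// Enables peer exchange: send PX peers when pruning and dial PX peers we receive.
    pub fn enable_peer_exchange(&mut self) -> &mut Self {
        self.config.enable_peer_exchange = true;
        self
    }

    /// Kept for backwards compatibility; forwards to `enable_peer_exchange`.
    #[deprecated(note = "use `ConfigBuilder::enable_peer_exchange` instead")]
    pub fn do_px(&mut self) -> &mut Self {
        self.enable_peer_exchange()
    }
}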

@divagant-martian left a comment

partial review since it's getting late in my timezone

@divagant-martian left a comment

Except for a risk of panic and test updates, I think this is close to ready.

Comment on lines 1602 to 1608
let max_px_to_connect = self.config.mesh_n_low()
- self.mesh.get(&topic_hash).map(|x| x.len()).unwrap_or(0);
@divagant-martian Mar 29, 2023

This is likely to panic in debug builds: the subtraction underflows when the mesh already holds mesh_n_low or more peers.

Also, I think you want mesh_n_high instead of mesh_n_low? Both are fine depending on what we want; @AgeManning suggested the high one, which would be interpreted as capacity rather than need.
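One underflow-safe way to write it, assuming mesh_n_high ends up being the chosen bound (a sketch against the names visible in the diff, not the final code):

// Saturating subtraction avoids the debug-build underflow panic when the mesh
// already holds mesh_n_high (or more) peers; in that case no PX peers are dialed.
let max_px_to_connect = self
    .config
    .mesh_n_high()
    .saturating_sub(self.mesh.get(&topic_hash).map(|peers| peers.len()).unwrap_or(0));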

//TODO: Once signed records are spec'd: Can we use peerInfo without any IDs if they have a
// signed peer record?
px.retain(|p| p.peer_id.is_some());
fn px_connect(&mut self, mut px: Vec<PeerInfo>, n: usize) {
if px.len() > n {
// only use at most prune_peers many random peers
Contributor

another line no longer in sync with code

.collect();

// Exactly config.prune_peers() many random peers should be dialled
assert_eq!(dials.len(), config.prune_peers());
Contributor

likely to fail?

@dapplion

Rebased to latest master to fix conflicts. Since the discussion is getting long I'll link the remaining points:


mergify bot commented May 1, 2023

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

2 similar comments

mergify bot commented May 4, 2023

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏


mergify bot commented Nov 27, 2023

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

@dapplion

Closing PR for now since I have no capacity to address it. If there's a real need for this feature I'm happy to prioritise it again, but from offline discussions it seems like it is not the case. Sorry for polluting your PR inbox for a while!

@dapplion dapplion closed this Nov 28, 2023