Peer Management: Connection and Disconnection #914

fryorcraken · 2022-08-31T02:35:54Z

Planned start date:
Due date:

Summary

In a browser environment, connectivity can be lost.

When that happens, the application can no longer receive or send messages.

Strategies need to be designed to:

- Detect a loss of connection
- Run automated actions when resuming a connection
  - Auto reconnect to remote peer
  - Auto trigger some protocols (e.g. Waku Store, see "Retrieve Store Messages when resuming connectivity" #252)
  - Renew Filter subscription
  - Wait for peer to be in the relay mesh before considering the action live (e.g. waitForRemotePeer)
- Cope with lack of connectivity when trying to use a protocol: retry strategy, feedback to the app, and guidelines on how to handle it
  - Querying messages using Waku Store (e.g. "Peer not found despite using waitForRemotePeer" #913)
  - Trying to send a message with light push
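The "retry strategy, feedback to the app" item could look roughly like the sketch below. It is only an illustration: `backoffDelay` and `withRetry` are hypothetical helper names, not part of js-waku, and the default delays are arbitrary assumptions.

```typescript
// Hypothetical helpers for illustration; nothing here is js-waku API.

/** Delay before retry number `attempt` (0-based): doubles each time, capped at `maxMs`. */
function backoffDelay(attempt: number, baseMs = 500, maxMs = 10000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

/**
 * Runs `action` up to `maxAttempts` times. Each failure is reported through
 * `onFailure` so the application gets feedback; the last error is rethrown.
 * `sleep` is injectable so callers (and tests) can control waiting.
 */
async function withRetry<T>(
  action: () => Promise<T>,
  maxAttempts = 3,
  onFailure: (attempt: number, error: unknown) => void = () => {},
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await action();
    } catch (error) {
      lastError = error;
      onFailure(attempt, error);
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts - 1) await sleep(backoffDelay(attempt));
    }
  }
  throw lastError;
}
```

A store query or light push send would be passed in as `action`; whether such a wrapper belongs in the protocols or in a helper library is exactly the open question of this issue.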

Proposed Solutions

The result of this issue would be a mix of:

- Documentation to guide developers in handling failure
- Strategy embedded in the protocols (e.g. retry)
- Helper library that can provide utilities to developers (e.g. auto store query, auto filter renewal)

At this stage it is not clear what should go in waku core and what should be a library helper.

Acceptance Criteria

- Upon start-up, the app dials a reasonable set of peers from peer discovery methods (no persistence, only dial 1 bootstrap peer)
- Upon start-up, if some peers are not reachable, the app attempts a fallback method (try 2nd bootstrap peer)
- Following the above rules, when disconnected from a remote peer, the app attempts to reconnect
- Clear guidelines for developers to know when the node is connected or disconnected (see "Advanced docs for js-waku", docs.waku.org#104 (comment))

Tasks

Tasks Moved to Out of Scope

Notes

libp2p/js-libp2p#744

RAID (Risks, Assumptions, Issues and Dependencies)


From #919 (include wait_for_remote_peer in exports map): This module will just consume generic Waku and Waku Relay interfaces, so we already want to extract it. It is also one opinionated way to handle connection management; other ways might come with #914.


fryorcraken · 2022-12-13T05:08:38Z

Need to re-groom this issue to take peer discovery into account. Some draft notes:

    // createLightNode => 2 light push, 2 filter, 1 store
    // dns discovery: emits "bootstrap" nodes
    // catch "bootstrap" node -> peer exchange? connect to it
    // How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR
    // Already connected to a "bootstrap" node? new node, park it.
    // random order? probably at dns discovery level, because we know when we are "done".
    // Now we receive a `peer-exchange` node
    // -> Do we need this node? What capabilities does it have?
    // -> What nodes are we currently connected to?
    // -> connect, discard or save for later (add to peer store)
    // -> Do a filter query: make it lazy, no wait for remote peer
    // Peer management module: good defaults and configurable/swappable

danisharora099 · 2023-01-04T19:08:04Z

Since this issue is a considerable overhaul, I'd like to be sure of my interpretation of the same.

I think the biggest one is:

* detect a loss of connection

For this, my interpretation is that libp2p's connection manager doesn't emit an event when a peer disconnects/streams are closed (on peer:disconnect) (ref: status-im/status-web#288 (comment)) and that this will either be something we'll be handling ourselves (similar to status-im/status-web#288 or xmtp/xmtp-js#128) or an upstream change to libp2p (cont of libp2p/js-libp2p#939 / libp2p/js-libp2p#744)
Please let me know if that's a fair interpretation.

Secondly, an explanation of how the following fits with the initially described scope of the PR would also be greatly appreciated.

Need to re-groom this issue to take peer discovery into account. Some draft notes:

    // createLightNode => 2 light push, 2 filter, 1 store
    // dns discovery: emits "bootstrap" nodes
    // catch "bootstrap" node -> peer exchange? connect to it
    // How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR
    // Already connected to a "bootstrap" node? new node, park it.
    // random order? probably at dns discovery level, because we know when we are "done".
    // Now we receive a `peer-exchange` node
    // -> Do we need this node? What capabilities does it have?
    // -> What nodes are we currently connected to?
    // -> connect, discard or save for later (add to peer store)
    // -> Do a filter query: make it lazy, no wait for remote peer
    // Peer management module: good defaults and configurable/swappable

Other than this, appreciate that the issue has been written very descriptively and will have a great impact!

cc @fryorcraken @felicio

fryorcraken · 2023-01-05T23:56:06Z

As mentioned, this is a considerable overhaul and would need to be tackled iteratively. Each iteration can increase the functionality/complexity of the manager while taking on board feedback from dogfooding.

I suggest focusing the first iterations on using the peer discovery protocols dns discovery and peer exchange with the following logic:

1. when using createLightNode, apply default values of wanting 2 light push nodes, 2 filter nodes and 1 store node (user can override values).
2. DNS discovery bootstrap method is used, it discovers and emits nodes that are tagged as bootstrap
3. Connection Manager processes the bootstrap node
   i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)
   ii. Use peer exchange to discover more nodes
4. Connection Manager discards/saves for later other bootstrap nodes
5. peer exchange discovers new nodes (using a bootstrap node), tags them as peer-exchange
6. peer-exchange node is discovered/emitted, connect to it (that's our second light push/filter node)
7. connect to more peer-exchange nodes to fulfill requirements set in (1) if needed.
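To make the "requirements set in (1)" concrete, here is a minimal sketch of how a manager might work out what is still missing. The type and function names (`Protocol`, `Requirements`, `missingPeers`) are assumptions for illustration only, not js-waku API; only the default values come from the proposal above.

```typescript
// Hypothetical types for illustration; not part of js-waku.
type Protocol = "lightpush" | "filter" | "store";
type Requirements = Record<Protocol, number>;

// Defaults from the proposal: 2 light push, 2 filter, 1 store (user can override).
const DEFAULT_REQUIREMENTS: Requirements = { lightpush: 2, filter: 2, store: 1 };

/**
 * Given the protocols supported by each currently connected peer,
 * returns how many more peers are still needed per protocol.
 */
function missingPeers(
  connected: ReadonlyArray<ReadonlyArray<Protocol>>,
  wanted: Partial<Requirements> = {},
): Requirements {
  const missing: Requirements = { ...DEFAULT_REQUIREMENTS, ...wanted };
  for (const protocols of connected) {
    for (const protocol of protocols) {
      // One connected peer satisfies one slot for each protocol it supports.
      missing[protocol] = Math.max(0, missing[protocol] - 1);
    }
  }
  return missing;
}
```

With one peer serving all three protocols and a second serving only filter, this reports that one more light push peer is needed, which is when step 7 would apply.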

(3) introduces an issue: the first node discovered by DNS discovery will always be the same for a given ENR tree. Hence we may need to introduce some pseudo-random shuffling logic. This should probably be done in the connection manager.
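One common way to do such shuffling is a Fisher-Yates shuffle over the discovered bootstrap nodes before picking the first one. The sketch below is a generic illustration, not existing js-waku code; the function name `shuffled` and the injectable `random` parameter are assumptions.

```typescript
/**
 * Returns a shuffled copy of `items` using the Fisher-Yates algorithm.
 * `random` must return a number in [0, 1); it is injectable for testing.
 */
function shuffled<T>(items: readonly T[], random: () => number = Math.random): T[] {
  const result = items.slice();
  for (let i = result.length - 1; i > 0; i--) {
    // Pick an index in [0, i] and swap it into position i.
    const j = Math.floor(random() * (i + 1));
    const tmp = result[i];
    result[i] = result[j];
    result[j] = tmp;
  }
  return result;
}
```

The input array is left untouched, so the original discovery order stays available if it is needed elsewhere.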

The final result above can already be split in several tasks/PRs.

Then, we can add error resilience on top of it. I'd recommend the following order:

1. Fail to dial a node: handle this and try another node (easy as all located in connection manager)
2. protocol failed on a node (i.e. light push query failed): disconnect the node and try another node. This is harder because the protocol needs to feed the failure back to the manager, which can then handle the disconnection/connection. The protocol should also know whether the error is worth a disconnection. This is not really feasible now (no differentiation in errors) but may be with the waku store protocol overhaul.
3. Node disconnects: as you mentioned, we need feedback from libp2p for that, so it may be more difficult.
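The first item (fail to dial, try another node) can be sketched as below. The dial function is injected rather than taken from libp2p, so this makes no claim about libp2p's API; `dialFirstReachable` and `onDialError` are hypothetical names.

```typescript
// Hypothetical helper; the real dial call would be supplied by the caller.
type DialFn<P> = (peer: P) => Promise<void>;

/**
 * Dials candidates in order and returns the first peer that connects.
 * Failures are reported through `onDialError`; returns undefined if all fail.
 */
async function dialFirstReachable<P>(
  candidates: readonly P[],
  dial: DialFn<P>,
  onDialError: (peer: P, error: unknown) => void = () => {},
): Promise<P | undefined> {
  for (const peer of candidates) {
    try {
      await dial(peer);
      return peer;
    } catch (error) {
      // Dial failed: record it and fall through to the next candidate.
      onDialError(peer, error);
    }
  }
  return undefined;
}
```

Items 2 and 3 need more than this: the protocol or libp2p has to report the failure to the manager before any fallback can be chosen.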

danisharora099 · 2023-01-09T17:48:34Z

Connection Manager processes the bootstrap node

We do two things with the peer, is that what we mean by "process" here?

Tag the peer within the PeerStore
Dial

i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)

why do we want to “later remove this logic”?

Connection Manager discards/saves for later other bootstrap nodes

How does the ConnectionManager do that? Do we mean the PeerStore/ AddressManager?
(later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)

peer exchange discovers new nodes (using a bootstrap node), tags them as peer-exchange

peer-exchange node is discovered/emitted, connect to it (that's our second light push/filter node)

“that’s our second node” — this step is skipped if the initial node requirements are met, correct?
the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)

Fail to dial a node: handle this and try another node (easy as all located in connection manager)

how can we reproduce a non-dialable node?
in the case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with autoDial:true (which is used by default), all nodes are tried to connect to)

also, a lot of the process you describe, from my understanding, takes place implicitly by libp2p
for eg:

discovery, tagging, dial attempts, etc are all handled implicitly

I'm not entirely sure what you mean by manually handling the above-described process

we might not need to manually handle/connect to discovered peers as libp2p automatically dials to discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager) — we should just focus on discovering the most number of peers (ref: #1117)

Please let me know if I'm missing something

fryorcraken · 2023-01-10T03:21:26Z

Connection Manager processes the bootstrap node

We do two things with the peer, is that what we mean by "process" here?
* Tag the peer within the PeerStore

The tagging is done by the discovery service:

js-waku/packages/peer-exchange/src/waku_peer_exchange_discovery.ts

Line 92 in 0b08320

await this.components.peerStore.tagPeer(

* Dial

The dialling is done by the connection manager:

js-waku/packages/core/src/lib/waku.ts

Line 140 in 0b08320

libp2p.dial(peerId).catch((err) => {

We will have to make this trivial logic more complex in the future and have a proper ConnectionManager class/module.

By "process" I mean deciding whether to dial and dialing.

i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)
* why do we want to “later remove this logic”?

Because if all js-waku nodes connect to bootstrap nodes to be served (store/light push/filter), then the waku network will be centralized, because it would only use the bootstrap nodes.
It would also mean that the bootstrap nodes would need to be high-performing nodes to serve all js-waku nodes.

Instead, I propose that the bootstrap node only be used for bootstrapping, i.e. getting access to the network, meaning it is only used for discv5 and peer-exchange.

Connection Manager discards/saves for later other bootstrap nodes
* How does the `ConnectionManager` do that? Do we mean the `PeerStore`/  `AddressManager`?

I don't know. The node will be tagged as bootstrap and stored in the peerStore, so I guess that's good enough.
Or maybe we'll have to also store the nodes in the ConnectionManager. We'll have to decide at implementation time if the peerStore gives us enough.

The ConnectionManager is the new module that handles most of the logic described here.

* (later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)
peer exchange discovers new nodes (using a bootstrap node), tags them as peer-exchange

peer-exchange node is discovered/emitted, connect to it (that's our second light push/filter node)
* “that’s our second node” — this step is skipped if the initial node requirements are met, correct?

Yes. For the rest of the logic I assume the proposed default requirements are applied.

* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)

We should not automatically connect to every node we discover, which is why we need a Connection Manager and why we need to implement this logic.

Fail to dial a node: handle this and try another node (easy as all located in connection manager)
* how can we reproduce a non-dialable node?

Pass an invalid multiaddr.

* in the case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with `autoDial:true` (which is used by default), all nodes are tried to connect to)

We'll have to review the retry logic in detail when it's time to implement.

also, a lot of the process you describe, from my understanding, takes place implicitly by libp2p for eg:
* discovery, tagging, dial attempts, etc are all handled implicitly

We should not dial every node we discover. This is why custom connection management is needed.

I'm not entirely sure what you mean by manually handling the above-described process

we might not need to manually handle/connect to discovered peers as libp2p automatically dials to discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager) — we should just focus on discovering the most number of peers (ref: #1117)

Please let me know if I'm missing something

danisharora099 · 2023-01-10T08:24:46Z

By "process" I mean deciding whether to dial and dialing.

How are we deducing whether to dial or not?
If there's a peer available other than bootstrap, we give it more preference and dial to it instead of the bootstrap peer - correct? (to increase decentralisation)

Because if all js-waku nodes connect to bootstrap nodes to be served (store/light push/filter), then the waku network will be centralized, because it would only use the bootstrap nodes. It would also mean that the bootstrap nodes would need to be high-performing nodes to serve all js-waku nodes.

That makes sense! We should prioritize nodes found by discovery mechanisms other than bootstrapping. Noted.

* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)
We should not automatically connect to every node we discover, which is why we need a Connection Manager and why we need to implement this logic.

We should not dial every node we discover. This is why custom connection management is needed.

My interpretation so far:
Our connection priority should be something along the lines of:

connect to only 1 bootstrap peer (as little as possible to avoid centralisation on bootstrap nodes)
find more peers using other discovery mechanisms, via the bootstrap peer
connect to all new peers found until requirements are fulfilled

is that a fair interpretation?
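The priority described above could be sketched as a selection function like the one below. This is only an illustration under stated assumptions: the `DiscoveredPeer` shape, the tag names as a TypeScript union, and `selectPeersToDial` are hypothetical, and "requirements" is simplified to a single count of non-bootstrap peers.

```typescript
// Hypothetical shapes for illustration; not js-waku or libp2p types.
type DiscoveryTag = "bootstrap" | "peer-exchange";
interface DiscoveredPeer {
  id: string;
  tag: DiscoveryTag;
}

/**
 * Picks which discovered peers to dial: at most `maxBootstrap` bootstrap peers
 * in total, then other peers until `wantedServicePeers` non-bootstrap peers
 * are connected. Extra peers are skipped (parked), not dialed.
 */
function selectPeersToDial(
  discovered: readonly DiscoveredPeer[],
  connected: readonly DiscoveredPeer[],
  wantedServicePeers: number,
  maxBootstrap = 1,
): DiscoveredPeer[] {
  const known = new Set(connected.map((p) => p.id));
  let bootstrapCount = connected.filter((p) => p.tag === "bootstrap").length;
  let serviceCount = connected.length - bootstrapCount;
  const toDial: DiscoveredPeer[] = [];
  for (const peer of discovered) {
    if (known.has(peer.id)) continue; // already connected or already selected
    if (peer.tag === "bootstrap") {
      if (bootstrapCount >= maxBootstrap) continue; // avoid centralising on bootstrap nodes
      bootstrapCount++;
    } else {
      if (serviceCount >= wantedServicePeers) continue; // requirements already fulfilled
      serviceCount++;
    }
    known.add(peer.id);
    toDial.push(peer);
  }
  return toDial;
}
```

Running this on every discovery event would replace dialing every discovered peer with dialing only those that fill an open slot.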

danisharora099 · 2023-01-10T08:32:28Z

Also, as concluded from #1117 (comment), libp2p by default autodials.

Considering the scope of this PR, it's best to default autoDial to false, since we don't want to connect to all discovered bootstrap nodes immediately.

fryorcraken · 2023-01-11T22:12:20Z

My interpretation so far: Our connection priority should be something along the lines of:
* connect to only 1 bootstrap peer (as little as possible to avoid centralisation on bootstrap nodes)

* find more peers using other discovery mechanisms, via the bootstrap peer

* connect to all new peers found until requirements are fulfilled
is that a fair interpretation?

Yes.

fryorcraken · 2023-01-31T05:39:04Z

As discussed with @danisharora099, the first step was actually a focus on connections to discovered nodes. Need to update the description to record this first step.

Next, would be exploratory work to understand what information we can get when a node is disconnected.

fryorcraken · 2023-06-29T10:37:08Z

Blocked by #1412

danisharora099 · 2023-10-05T12:10:11Z

referring from #1412,

upon further investigation and taking into context previous findings, it's safe to conclude:

- connection:close: monitors a permanent connection close between the local & remote node (which should usually be triggered along with peer:disconnect for cases where we only have one connection with the remote node)

- peer:disconnect: monitors permanent disconnections with peers (this implies that the underlying connection(s) have been permanently closed, and the only way to communicate with this peer again is to open a new connection/reconnect)

- pings will fail when there are temporary network degradations or reachability issues. This does not mean that the underlying connection has been closed.
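The relationship between the two events can be modelled with a small counter per peer: a connection closing only amounts to a peer-level disconnect when it was the last open connection. The class below is a self-contained sketch of that bookkeeping, not libp2p code; `ConnectionTracker` and its method names are hypothetical, and wiring it to the actual libp2p events is left to the caller.

```typescript
/**
 * Tracks open connections per peer so that "one connection closed" can be
 * told apart from "peer fully disconnected". Hypothetical helper.
 */
class ConnectionTracker {
  private readonly openByPeer = new Map<string, number>();

  /** Call when a connection to `peerId` opens. */
  onConnectionOpen(peerId: string): void {
    this.openByPeer.set(peerId, (this.openByPeer.get(peerId) ?? 0) + 1);
  }

  /**
   * Call when a connection to `peerId` closes.
   * Returns true only if this was the peer's last open connection,
   * i.e. reaching the peer again requires a new dial.
   */
  onConnectionClose(peerId: string): boolean {
    const current = this.openByPeer.get(peerId);
    if (current === undefined) return false; // unknown peer: nothing to close
    if (current > 1) {
      this.openByPeer.set(peerId, current - 1);
      return false;
    }
    this.openByPeer.delete(peerId);
    return true;
  }

  /** True while at least one connection to `peerId` is open. */
  isConnected(peerId: string): boolean {
    return this.openByPeer.has(peerId);
  }
}
```

A failed ping would deliberately not touch this state, since, as noted above, it does not mean the underlying connection has been closed.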

to address part of the acceptance criteria from this milestone,
the principal question remains: how do we understand if the disconnection with the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)

as we concluded with #1403 (comment), js-waku redials automatically after the 10 mins mark, which is not a conscious disconnection and requires reconnection so perhaps the above question is enforced by default already?

cc @fryorcraken

danisharora099 · 2023-10-09T08:38:02Z

Weekly Update

achieved: investigated & closed "(re)investigate gauging peer disconnections with js-waku" #1412
next: look into addressing deliberate vs accidental disconnections

danisharora099 · 2023-10-10T10:30:31Z

the principal question remains: how do we understand if the disconnection with the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)

for this, we want to simply attempt reconnections

For library consumers,

For Filter,

- we should open a subscription, and also send recurring SUBSCRIBER_PING to the node to ensure we're maintaining active subscriptions
  - if we don't have an active subscription, we should reinitiate a subscription with the node
- in case of reconnection, SUBSCRIBER_PING should be used to check for an active subscription on the service node's end
- write explicit tests to cover this hypothetical as much as possible (see "Advanced docs for js-waku", docs.waku.org#104 (comment))
- add something like "manage your filter subscriptions" to the advanced js-waku docs (see "Advanced docs for js-waku", docs.waku.org#104 (comment))
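The ping-then-resubscribe flow for Filter can be sketched as follows. The `FilterSubscriptionHandle` interface is an assumption made for this example (it is not the js-waku subscription type); it only captures the two operations the flow needs.

```typescript
// Hypothetical interface; stands in for whatever the Filter subscription exposes.
interface FilterSubscriptionHandle {
  /** Resolves if the service node still holds our subscription, rejects otherwise. */
  ping(): Promise<void>;
  /** Re-creates the subscription on the service node. */
  resubscribe(): Promise<void>;
}

type SubscriptionCheck = "active" | "resubscribed";

/**
 * Pings the service node; if the ping fails, reinitiates the subscription.
 * Errors from `resubscribe` propagate to the caller.
 */
async function ensureActiveSubscription(
  subscription: FilterSubscriptionHandle,
): Promise<SubscriptionCheck> {
  try {
    await subscription.ping();
    return "active";
  } catch {
    await subscription.resubscribe();
    return "resubscribed";
  }
}
```

A library consumer could call this on a recurring timer and again right after a reconnection, which covers both cases listed above.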

danisharora099 · 2023-10-13T06:33:25Z

Weekly Update

achieved: reached a conclusion tackling deliberate vs accidental disconnections, PRs opened to handle Filter subscriptions on disconnection/reconnections, iterative fixes on addressing multiple dial attempts for same peer, fixes around keep alive pings
next: getting reviews & merging these PRs which should enable us to close this epic 🥳

danisharora099 · 2023-10-23T06:24:30Z

Weekly Update

achieved: The Connection and Disconnection Peer Management epic has been closed

fryorcraken added track:restricted-run Restricted run track (Secure Messaging/Waku Product), e.g. filter, WebRTC track:production labels Aug 31, 2022

fryorcraken mentioned this issue Sep 1, 2022

include wait_for_remote_peer in exports map #919

Merged

fryorcraken changed the title ~~Connection Management~~ Peer Management Dec 13, 2022

danisharora099 self-assigned this Jan 3, 2023

fryorcraken mentioned this issue Jan 11, 2023

explicit usage of libp2p.dial()/libp2p's autodial #1117

Closed

fryorcraken mentioned this issue Jan 18, 2023

feat: introduce ConnectionManager #1118

Closed

fryorcraken mentioned this issue Jan 27, 2023

feat!: ConnectionManager and KeepAliveManager #1135

Merged

fryorcraken changed the title ~~Peer Management~~ [Epic] Peer Management Apr 6, 2023

fryorcraken added the milestone Tracks a subteam milestone label Apr 6, 2023

fryorcraken changed the title ~~[Epic] Peer Management~~ [Milestone] Peer Management Apr 6, 2023

fryorcraken mentioned this issue Jul 18, 2023

[Milestone] Restricted-run (light node) protocols are production ready waku-org/pm#25

Closed

4 tasks

Ivansete-status modified the milestone: Developer Ready Jul 24, 2023

danisharora099 added the blocked This issue is blocked by some other work label Jul 27, 2023

fryorcraken removed the track:production label Jul 31, 2023

fryorcraken mentioned this issue Aug 8, 2023

[Connection Manager] Improve fallback mechanism when remote peer rejects connection #1326

Closed

5 tasks

fryorcraken changed the title ~~[Milestone] Peer Management~~ [Milestone] Peer Management: Connection and Disconnection Aug 8, 2023

This was referenced Aug 8, 2023

feat: SDK for redundant usage of filter/lightpush #1463

Closed

Peer Management: Automated actions upon reconnection #1464

Open

multiple connections opened for the same peer #1459

Closed

danisharora099 removed the blocked This issue is blocked by some other work label Aug 10, 2023

fryorcraken changed the title ~~[Milestone] Peer Management: Connection and Disconnection~~ [Epic] Peer Management: Connection and Disconnection Aug 24, 2023

fryorcraken added epic Tracks a yearly team epic (only for waku-org/pm repo) E:2023-light-protocols and removed milestone Tracks a subteam milestone E:2023-light-protocols labels Aug 24, 2023

fryorcraken mentioned this issue Aug 29, 2023

[Milestone] Peer management strategy for relay and light nodes are defined and implemented waku-org/pm#33

Closed

5 tasks

chaitanyaprem mentioned this issue Aug 29, 2023

feat: SDK: Reliable Message Subscription API for lightClient protocols waku-org/go-waku#693

Closed

danisharora099 mentioned this issue Sep 6, 2023

Peer randomly disconnects when new peers join #1442

Closed

fryorcraken added E:2.1: Production testing of existing protocols See https://github.com/waku-org/pm/issues/49 for details and removed E:2023-peer-mgmt epic Tracks a yearly team epic (only for waku-org/pm repo) labels Sep 8, 2023

fryorcraken changed the title ~~[Epic] Peer Management: Connection and Disconnection~~ Peer Management: Connection and Disconnection Sep 8, 2023

fryorcraken mentioned this issue Sep 13, 2023

[Epic] 2.1: Production testing of existing protocols waku-org/pm#49

Closed

5 tasks

fryorcraken mentioned this issue Oct 9, 2023

Filter subscribe on 2 different nwaku nodes doesn't work reliable #1606

Closed

danisharora099 mentioned this issue Oct 10, 2023

feat: enable pinging connected peers by default #1647

Merged

This was referenced Oct 11, 2023

pingKeepAlive throws uncaught exception #1646

Closed

chore: add a test that uses ping to check filter subscription #1656

Merged

Advanced docs for js-waku waku-org/docs.waku.org#104

Closed

danisharora099 closed this as completed Oct 23, 2023
