
Prevent opening another substream right after the connection is closed #1563

Open

wants to merge 1 commit into base: master
Conversation

altonen (Contributor) commented Sep 14, 2023

Back-off wasn't applied to locally closed or rejected connections, which could cause ProtocolController to reconnect to a node right after the connection to it had been closed. The remote might not have noticed that the connection was closed, so the immediate reconnection would leave the remote side's substream unusable because it was no longer writable.

Fixes paritytech/substrate#13778
Relates to #1499
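For context, a simplified sketch of the intent is shown below; it is not the exact diff. The state and helper names follow the Notifications snippets quoted later in this thread, and, per the review discussion, the touched paths are disconnect_peer_inner() and peerset_report_disconnect().

    // Sketch only: when the local node closes the substream, record a
    // randomized back-off so ProtocolController cannot reconnect to the
    // peer immediately after the close.
    let ban_dur = Uniform::new(5, 10).sample(&mut rand::thread_rng());
    let delay_id = self.next_delay_id;
    self.next_delay_id.0 += 1;
    self.delays.push(
        async move {
            futures_timer::Delay::new(Duration::from_secs(ban_dur)).await;
            (delay_id, peer_id, set_id)
        }
        .boxed(),
    );
    *entry.get_mut() = PeerState::Disabled {
        connections,
        backoff_until: Some(Instant::now() + Duration::from_secs(ban_dur)),
    };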

@altonen altonen requested review from dmitry-markin, rphmeier and a team September 14, 2023 10:49
@altonen altonen added the T0-node This PR/Issue is related to the topic “node”. label Sep 14, 2023
dmitry-markin (Contributor) commented Sep 14, 2023

TBH, I'm not sure that this is going to fix the issue with the collation protocol — it doesn't matter if the connection was reinitiated 0 seconds or 10 seconds after it was closed, unless the remote writes something into the substream.

On the other hand, delaying a connection for 5-10 seconds might introduce unneeded latency in case the connection was closed intentionally or due to an error (not sure how this is going to play with 6-second block times in case a connection needs to be re-established).

I would prefer we correctly detect the closed substream on the remote, instead of introducing a delay here.

dmitry-markin (Contributor):

Fixes paritytech/substrate#13778

It seems the PR does not fix the referenced issue, because it only applies the back-off to locally-disconnected connections, not to connections rejected by the remote.

altonen (Contributor, Author) commented Sep 14, 2023

This is a fix for paritytech/substrate#13778, but it relates to #1499.

The PR description is wrong. There is already a back-off for OpenResultErr, but not for the SyncingEngine "rejection" where it just disconnects the node. Comparing this change against the logs in paritytech/substrate#13778, it seems to address the correct state transition; could you recheck?

This change is orthogonal to detecting a closed substream, and I fail to understand your concerns regarding the delay: the back-off mechanism is already used in Notifications, and we need to introduce a back-off either way, as indicated in paritytech/substrate#13778. Could you perhaps elaborate?

dmitry-markin (Contributor) commented Sep 14, 2023

<...> There is already a back-off for OpenResultErr, but not for the SyncingEngine "rejection" where it just disconnects the node. Comparing this change against the logs in paritytech/substrate#13778, it seems to address the correct state transition; could you recheck?

Back-offs of this PR are added to disconnect_peer_inner() (called from disconnect_peer()) and peerset_report_disconnect(). The first is triggered if the outer API decides to disconnect the peer, and the second when ProtocolController disconnects the peer, e.g., if the peer is removed from reserved peers on a reserved-only peer set. In both cases the local node initiates the disconnect, and this is the reason why this PR (as it is now) doesn't fix paritytech/substrate#13778.

It looks like the right place to insert a back-off if a substream was closed by the remote (i.e., rejected by SyncingEngine) is in on_connection_handler_event() here:

*entry.into_mut() =
    PeerState::Disabled { connections, backoff_until: None };

It seems the back-off is already in place when handling FromSwarm::ConnectionClosed in on_swarm_event() (when all the substreams are closed because the entire connection to a peer is closed), at least for the Enabled peer state. But better double-check; maybe we need to handle other peer states in addition to Enabled.
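For illustration only, reusing that existing back-off pattern at the location above might look roughly like the sketch below. This is a sketch against the snippets quoted in this thread, not a verified change to Notifications; whether a delay future must also be scheduled here, as in the ConnectionClosed path, would need checking against the rest of the state machine.

    // Sketch only: apply the randomized back-off used by the
    // ConnectionClosed path so that ProtocolController cannot
    // reconnect immediately after the remote closes the substream.
    let ban_dur = Uniform::new(5, 10).sample(&mut rand::thread_rng());
    *entry.into_mut() = PeerState::Disabled {
        connections,
        backoff_until: Some(Instant::now() + Duration::from_secs(ban_dur)),
    };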

This change is orthogonal to detecting a closed substream, and I fail to understand your concerns regarding the delay: the back-off mechanism is already used in Notifications, and we need to introduce a back-off either way, as indicated in paritytech/substrate#13778. Could you perhaps elaborate?

I was just afraid that there might have been legitimate reasons to close a substream, and then open it again, e.g. after one block (hence 6 s). Or there could be behavior like the one we saw with async backing where substreams were closed then opened again. This comment mostly applies to the original version of this PR at the moment of writing where the back-off was applied if the local node had closed the substream. Applying a back-off if the remote has closed a substream seems less risky.

altonen (Contributor, Author) commented Sep 14, 2023

Back-offs of this PR are added to disconnect_peer_inner() (called from disconnect_peer()) and peerset_report_disconnect(). The first is triggered if the outer API decides to disconnect the peer, and the second when ProtocolController disconnects the peer, e.g., if the peer is removed from reserved peers on a reserved-only peer set. In both cases the local node initiates the disconnect, and this is the reason why this PR (as it is now) doesn't fix paritytech/substrate#13778.

I don't believe this analysis to be correct, or at least it doesn't cover the full issue as I see it. The reason for inbound spam/rapid reconnections was the local node disconnecting a remote node and then immediately reconnecting. If it was the remote that disconnected the local node, there would anyway be a delay in detecting it since the closed connection would be noticed only when the local node tried to send data. The issue is that there is no back-off applied to a reconnection if the local node was the one that disconnected the peer. This is what paritytech/substrate#13778 is about, as shown in the logs here: paritytech/substrate#13778 (comment).

I was just afraid that there might have been legitimate reasons to close a substream, and then open it again, e.g. after one block (hence 6 s). Or there could be behavior like the one we saw with async backing where substreams were closed then opened again. This comment mostly applies to the original version of this PR at the moment of writing where the back-off was applied if the local node has closed the substream. Applying a back-off if the remote has closed a substream seems less risky.

I'm not convinced that what the async backing changes were doing was legitimate in terms of peer connections, as evidenced by the peerset warning about removing an unknown reserved peer. Although you could argue that chilling outbound connections should be limited to normal connections and not apply to reserved peers. That requires tweaking ProtocolController instead of Notifications but if that is preferable, I can close this PR.

dmitry-markin (Contributor) commented Sep 14, 2023

If it was the remote that disconnected the local node, there would anyway be a delay in detecting it since the closed connection would be noticed only when the local node tried to send data.

Yeah, makes sense. But maybe on the block announcement substream it's enough to advertise a new block to detect that the substream is closed?

I'm probably missing the complete picture of the issue described in paritytech/substrate#13778 (comment). Do you understand what causes the local node to disconnect the remote and not vice versa?

dmitry-markin (Contributor):

Although you could argue that chilling outbound connections should be limited to normal connections and not apply to reserved peers. That requires tweaking ProtocolController instead of Notifications but if that is preferable, I can close this PR.

I'm not actively advocating for reserved peers being special in respect to connection chilling. Mostly I'm just afraid that adding back-offs to outbound connections might break something.

altonen (Contributor, Author) commented Sep 14, 2023

Yeah, makes sense. But maybe on the block announcement substream it's enough to advertise a new block to detect that the substream is closed?

I think we should still try to detect closed substreams, but it's hard to say which of these behaviours are undocumented implementation details and which are bugs. As perverse as it is, detecting a closed substream causes its own set of issues, but we probably just have to bite the bullet.

Do you understand what causes the local node to disconnect the remote and not vice versa?

In the testing code I was working with, I just issued disconnect_peer(); there was no reason behind it other than testing what happens when a connection is closed.

I'm not actively advocating for reserved peers being special in respect to connection chilling. Mostly I'm just afraid that adding back-offs to outbound connections might break something.

I would argue rapid reconnections were behind the issues we saw earlier this year with NetworkEventStream, which were fixed in the most hackish way possible in paritytech/substrate#13725. There's no way to prove it anymore, but if the spam was severe enough, all of those NotificationStreamOpened/NotificationStreamClosed events would've been relayed to all listening protocols and, as a result, NetworkEventStream would get choked.

dmitry-markin (Contributor) left a review comment:

I'm still a little bit hesitant to insert delays if the local node was the one who closed the substream, but otherwise the logic of backing-off seems correct.

dmitry-markin (Contributor) commented Sep 18, 2023

I went through the Notifications banning code again, and I don't understand why it's implemented the way it is. At least I see the following issues.

  1. The first issue: a connection to a disabled peer with an expired ban is delayed again (see the sketch at the end of this comment).
    In peerset_report_connect() the disabled peer with a non-expired ban is handled here:

    PeerState::Disabled { connections, backoff_until: Some(ref backoff) }

    The next match arm therefore handles the case of the expired ban:

    PeerState::Disabled { mut connections, backoff_until } => {

    Nevertheless, it's assumed that backoff_until can still be Some here:

    let timer_deadline = {
        let base = now + Duration::from_secs(5);
        if let Some(backoff_until) = backoff_until {
            cmp::max(base, backoff_until)
        } else {
            base
        }
    };

    This is a logic error on its own, but more importantly the base timer deadline is installed and delays the connection, even though the ban has already expired once. (It's effectively introduced again.)

  2. The second issue: when the remote closes the entire connection, the peer is banned, while if it closes only a substream, it is not.
    When the remote closes the entire connection, the ban is introduced here:

    let ban_dur = Uniform::new(5, 10).sample(&mut rand::thread_rng());
    let delay_id = self.next_delay_id;
    self.next_delay_id.0 += 1;
    let delay = futures_timer::Delay::new(Duration::from_secs(ban_dur));
    self.delays.push(
        async move {
            delay.await;
            (delay_id, peer_id, set_id)
        }
        .boxed(),
    );
    *entry.get_mut() = PeerState::Backoff {
        timer: delay_id,
        timer_deadline: Instant::now() + Duration::from_secs(ban_dur),
    };

    At the same time, if the remote just closes a substream, no ban is applied here:

    *entry.into_mut() =
        PeerState::Disabled { connections, backoff_until: None };

I would put more thought into it before fixing the first issue, but it seems the second issue might lead to connection spam when the remote closes only a substream and not the entire connection. @altonen could you check my reasoning, please?
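For illustration, one way the expired-ban arm could avoid re-introducing the delay (the first issue above) might be the following. This is only a sketch against the quoted snippet, reusing its local variables (now, backoff_until), and not the actual Notifications code:

    // Sketch: if this arm is really only reached once any ban has expired,
    // the 5-second base delay does not need to be installed again.
    let timer_deadline = match backoff_until {
        // Defensive: honour a back-off that somehow still lies in the future.
        Some(deadline) if deadline > now => deadline,
        // Ban already expired (or never set): connect without an extra delay.
        _ => now,
    };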

altonen (Contributor, Author) commented Sep 18, 2023

@dmitry-markin

I remember we discussed this in some other issue (I can't recall which), but I think we should consider moving all banning logic to ProtocolController, since it should be the one dealing with substreams. Right now it's really dumb and just keeps opening substreams even if the dial failed, the peer is banned, or whatever other state there might be where opening doesn't make sense. I'm slightly concerned about the race conditions that are already present in the ProtocolController <-> Notifications communication, but I think sc-network should be moved in a direction where ProtocolController/PeerStore both have more important roles.

The first part of your analysis looks correct but I'm not sure which spam exactly you're referring to in the second point. Spam caused by the remote or local node? This PR attempted to fix the issue where the local node disconnects a peer and immediately opens the substream again, but you're correct that it looks like there is no back-off applied if the remote closed the substream. I think the spam is caused because there is no back-off on new outbound substreams.

dmitry-markin (Contributor):

but I'm not sure which spam exactly you're referring to in the second point. Spam caused by the remote or local node?

I'm referring to spam caused by the remote node reopening the substream after being rejected. I'm not sure if we still have this issue after all the refactorings of the sync protocol's substream validation.

I think the spam is caused because there is no back-off on new outbound substreams.

Are you referring to the case when the remote node closes the substream and then opens it again immediately? I still don't understand why the remote node would close the substream in the first place. Apart from the collation protocol, where it should be fixed now (and where introducing the delays might affect collators' operation).

altonen (Contributor, Author) commented Sep 19, 2023

Are you referring to the case when the remote node closes the substream and then opens it again immediately? I still don't understand why the remote node would close the substream in the first place. Apart from the collation protocol, where it should be fixed now (and where introducing the delays might affect collators' operation).

The issue we had with syncing spamming substreams arose because SyncingEngine had accepted more than --in-peers peers, so ProtocolController tried to fill the free outbound slots, and even if the peer was accepted by ProtocolController, it was rejected (disconnected) by SyncingEngine. I believe this caused the spam (also back when syncing was still in Protocol), but that was fixed recently with another hack.

bkchr (Member) left a review comment:

Approving to move forward with this.

bkchr (Member) commented Jul 17, 2024

@dmitry-markin @lexnv do we still need this?

dmitry-markin (Contributor):

@dmitry-markin @lexnv do we still need this?

I was originally skeptical regarding this PR. Maybe we don't need it, but if the issue it tries to fix is still there, we might need to fix it in another way. I'll have a look.

Labels: T0-node (This PR/Issue is related to the topic “node”.)

Successfully merging this pull request may close these issues: Rejected nodes keep reconnecting without delay