
Prevent opening another substream right after the connection is closed #1563

Open

wants to merge 1 commit into base: master
Conversation

altonen (Contributor) commented Sep 14, 2023

Back-off wasn't applied to locally closed or rejected connections, which could cause ProtocolController to reconnect to a node right after the connection to it had been closed. The remote might not have noticed that the connection was closed, so the immediate reconnection would leave the remote side's substream unusable because it was no longer writable.

Fixes paritytech/substrate#13778
Relates to #1499
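For context, a simplified sketch of the intent is shown below; it is not the exact diff. The state and helper names follow the Notifications snippets quoted later in this thread, and, per the review discussion, the touched paths are disconnect_peer_inner() and peerset_report_disconnect().

    // Sketch only: when the local node closes the substream, record a
    // randomized back-off so ProtocolController cannot reconnect to the
    // peer immediately after the close.
    let ban_dur = Uniform::new(5, 10).sample(&mut rand::thread_rng());
    let delay_id = self.next_delay_id;
    self.next_delay_id.0 += 1;
    self.delays.push(
        async move {
            futures_timer::Delay::new(Duration::from_secs(ban_dur)).await;
            (delay_id, peer_id, set_id)
        }
        .boxed(),
    );
    *entry.get_mut() = PeerState::Disabled {
        connections,
        backoff_until: Some(Instant::now() + Duration::from_secs(ban_dur)),
    };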

@altonen altonen requested review from dmitry-markin, rphmeier and a team September 14, 2023 10:49
@altonen altonen added the T0-node This PR/Issue is related to the topic “node”. label Sep 14, 2023
dmitry-markin (Contributor) commented Sep 14, 2023

TBH, I'm not sure that this is going to fix the issue with the collation protocol — it doesn't matter if the connection was reinitiated 0 seconds or 10 seconds after it was closed, unless the remote writes something into the substream.

On the other hand, delaying a connection for 5-10 seconds might introduce unneeded latency in case the connection was closed intentionally or due to an error (not sure how this is going to play with 6-second block times in case a connection needs to be re-established).

I would prefer we correctly detect the closed substream on the remote, instead of introducing a delay here.

dmitry-markin (Contributor):

Fixes paritytech/substrate#13778

It seems the PR does not fix the referenced issue, because it only applies the back-off to locally-disconnected connections, not to connections rejected by the remote.

altonen (Contributor, Author) commented Sep 14, 2023

This is a fix for paritytech/substrate#13778, but it relates to #1499.

The PR description is wrong. There is already a back-off for OpenResultErr, but not for the SyncingEngine "rejection" where it just disconnects the node. Comparing this change against the logs in paritytech/substrate#13778, it seems to address the correct state transition; could you recheck?

This change is orthogonal to detecting a closed substream, and I fail to understand your concerns regarding the delay: the back-off mechanism is already used in Notifications, and we need to introduce a back-off either way, as indicated in paritytech/substrate#13778. Could you perhaps elaborate?

dmitry-markin (Contributor) commented Sep 14, 2023

<...> There is already a back-off for OpenResultErr, but not for the SyncingEngine "rejection" where it just disconnects the node. Comparing this change against the logs in paritytech/substrate#13778, it seems to address the correct state transition; could you recheck?

Back-offs of this PR are added to disconnect_peer_inner() (called from disconnect_peer()) and peerset_report_disconnect(). The first is triggered if the outer API decides to disconnect the peer, and the second when ProtocolController disconnects the peer, e.g., if the peer is removed from reserved peers on a reserved-only peer set. In both cases the local node initiates the disconnect, and this is the reason why this PR (as it is now) doesn't fix paritytech/substrate#13778.

It looks like the right place to insert a back-off if a substream was closed by the remote (i.e., rejected by SyncingEngine) is in on_connection_handler_event() here:

*entry.into_mut() =
    PeerState::Disabled { connections, backoff_until: None };

It seems the back-off is already in place when handling FromSwarm::ConnectionClosed in on_swarm_event() (when all the substreams are closed because the entire connection to a peer is closed), at least for the Enabled peer state. But better double-check; maybe we need to handle other peer states in addition to Enabled.
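For illustration only, reusing that existing back-off pattern at the location above might look roughly like the sketch below. This is a sketch against the snippets quoted in this thread, not a verified change to Notifications; whether a delay future must also be scheduled here, as in the ConnectionClosed path, would need checking against the rest of the state machine.

    // Sketch only: apply the randomized back-off used by the
    // ConnectionClosed path so that ProtocolController cannot
    // reconnect immediately after the remote closes the substream.
    let ban_dur = Uniform::new(5, 10).sample(&mut rand::thread_rng());
    *entry.into_mut() = PeerState::Disabled {
        connections,
        backoff_until: Some(Instant::now() + Duration::from_secs(ban_dur)),
    };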

This change is orthogonal to detecting a closed substream, and I fail to understand your concerns regarding the delay: the back-off mechanism is already used in Notifications, and we need to introduce a back-off either way, as indicated in paritytech/substrate#13778. Could you perhaps elaborate?

I was just afraid that there might have been legitimate reasons to close a substream, and then open it again, e.g. after one block (hence 6 s). Or there could be behavior like the one we saw with async backing where substreams were closed then opened again. This comment mostly applies to the original version of this PR at the moment of writing where the back-off was applied if the local node had closed the substream. Applying a back-off if the remote has closed a substream seems less risky.

altonen (Contributor, Author) commented Sep 14, 2023

Back-offs of this PR are added to disconnect_peer_inner() (called from disconnect_peer()) and peerset_report_disconnect(). The first is triggered if the outer API decides to disconnect the peer, and the second when ProtocolController disconnects the peer, e.g., if the peer is removed from reserved peers on a reserved-only peer set. In both cases the local node initiates the disconnect, and this is the reason why this PR (as it is now) doesn't fix paritytech/substrate#13778.

I don't believe this analysis to be correct, or at least it doesn't cover the full issue as I see it. The reason for inbound spam/rapid reconnections was the local node disconnecting a remote node and then immediately reconnecting. If it was the remote that disconnected the local node, there would anyway be a delay in detecting it since the closed connection would be noticed only when the local node tried to send data. The issue is that there is no back-off applied to a reconnection if the local node was the one that disconnected the peer. This is what paritytech/substrate#13778 is about, as shown in the logs here: paritytech/substrate#13778 (comment).

I was just afraid that there might have been legitimate reasons to close a substream, and then open it again, e.g. after one block (hence 6 s). Or there could be behavior like the one we saw with async backing where substreams were closed then opened again. This comment mostly applies to the original version of this PR at the moment of writing where the back-off was applied if the local node has closed the substream. Applying a back-off if the remote has closed a substream seems less risky.

I'm not convinced that what the async backing changes were doing was legitimate in terms of peer connections, as evidenced by the peerset warning about removing an unknown reserved peer. Although you could argue that chilling outbound connections should be limited to normal connections and not apply to reserved peers. That requires tweaking ProtocolController instead of Notifications but if that is preferable, I can close this PR.

dmitry-markin (Contributor) commented Sep 14, 2023

If it was the remote that disconnected the local node, there would anyway be a delay in detecting it since the closed connection would be noticed only when the local node tried to send data.

Yeah, makes sense. But maybe on the block announcement substream it's enough to advertise a new block to detect that the substream is closed?

I'm probably missing the complete picture of the issue described in paritytech/substrate#13778 (comment). Do you understand what causes the local node to disconnect the remote and not vice versa?

dmitry-markin (Contributor):

Although you could argue that chilling outbound connections should be limited to normal connections and not apply to reserved peers. That requires tweaking ProtocolController instead of Notifications but if that is preferable, I can close this PR.

I'm not actively advocating for reserved peers being special in respect to connection chilling. Mostly I'm just afraid that adding back-offs to outbound connections might break something.

altonen (Contributor, Author) commented Sep 14, 2023

Yeah, makes sense. But maybe on the block announcement substream it's enough to advertise a new block to detect that the substream is closed?

I think we should still try to detect closed substreams, but it's hard to say which of these behaviours are undocumented implementation details and which are bugs. As perverse as it is, detecting a closed substream causes its own set of issues, but we probably just have to bite the bullet.

Do you understand what causes the local node to disconnect the remote and not vice versa?

In the testing code I was working with, I just issued disconnect_peer(); there was no reason behind it other than testing what happens when a connection is closed.

I'm not actively advocating for reserved peers being special in respect to connection chilling. Mostly I'm just afraid that adding back-offs to outbound connections might break something.

I would argue rapid reconnections were behind the issues we saw earlier this year with NetworkEventStream, which were fixed in the most hackish way possible in paritytech/substrate#13725. There's no way to prove it anymore, but if the spam was severe enough, all of those NotificationStreamOpened/NotificationStreamClosed events would've been relayed to all listening protocols and, as a result, NetworkEventStream would get choked.

dmitry-markin (Contributor) left a review comment:

I'm still a little bit hesitant to insert delays if the local node was the one who closed the substream, but otherwise the logic of backing-off seems correct.

dmitry-markin (Contributor) commented Sep 18, 2023

I went through the Notifications banning code again, and I don't understand why it's implemented the way it is. At least I see the following issues.

  1. The first issue: a connection to a disabled peer with an expired ban is delayed again (see the sketch at the end of this comment).
    In peerset_report_connect() the disabled peer with a non-expired ban is handled here:

    PeerState::Disabled { connections, backoff_until: Some(ref backoff) }

    The next match arm therefore handles the case of the expired ban:

    PeerState::Disabled { mut connections, backoff_until } => {

    Nevertheless, it's assumed that backoff_until can still be Some here:

    let timer_deadline = {
        let base = now + Duration::from_secs(5);
        if let Some(backoff_until) = backoff_until {
            cmp::max(base, backoff_until)
        } else {
            base
        }
    };

    This is a logic error on its own, but more importantly the base timer deadline is installed and delays the connection, even though the ban has already expired once. (It's effectively introduced again.)

  2. The second issue: when the remote closes the entire connection, the peer is banned, while if it closes only a substream, it is not.
    When the remote closes the entire connection, the ban is introduced here:

    let ban_dur = Uniform::new(5, 10).sample(&mut rand::thread_rng());
    let delay_id = self.next_delay_id;
    self.next_delay_id.0 += 1;
    let delay = futures_timer::Delay::new(Duration::from_secs(ban_dur));
    self.delays.push(
        async move {
            delay.await;
            (delay_id, peer_id, set_id)
        }
        .boxed(),
    );
    *entry.get_mut() = PeerState::Backoff {
        timer: delay_id,
        timer_deadline: Instant::now() + Duration::from_secs(ban_dur),
    };

    At the same time, if the remote just closes a substream, no ban is applied here:

    *entry.into_mut() =
        PeerState::Disabled { connections, backoff_until: None };

I would put more thought into it before fixing the first issue, but it seems the second issue might lead to connection spam when the remote closes only a substream and not the entire connection. @altonen could you check my reasoning, please?
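For illustration, one way the expired-ban arm could avoid re-introducing the delay (the first issue above) might be the following. This is only a sketch against the quoted snippet, reusing its local variables (now, backoff_until), and not the actual Notifications code:

    // Sketch: if this arm is really only reached once any ban has expired,
    // the 5-second base delay does not need to be installed again.
    let timer_deadline = match backoff_until {
        // Defensive: honour a back-off that somehow still lies in the future.
        Some(deadline) if deadline > now => deadline,
        // Ban already expired (or never set): connect without an extra delay.
        _ => now,
    };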

altonen (Contributor, Author) commented Sep 18, 2023

@dmitry-markin

I remember we discussed this in some other issue (I can't recall which), but I think we should consider moving all banning logic to ProtocolController, since it should be the one dealing with substreams. Right now it's really dumb and just keeps opening substreams even if the dial failed, the peer is banned, or whatever other state there might be where opening doesn't make sense. I'm slightly concerned about the race conditions that are already present in the ProtocolController <-> Notifications communication, but I think sc-network should be moved in a direction where ProtocolController/PeerStore both have more important roles.

The first part of your analysis looks correct but I'm not sure which spam exactly you're referring to in the second point. Spam caused by the remote or local node? This PR attempted to fix the issue where the local node disconnects a peer and immediately opens the substream again, but you're correct that it looks like there is no back-off applied if the remote closed the substream. I think the spam is caused because there is no back-off on new outbound substreams.

dmitry-markin (Contributor):

but I'm not sure which spam exactly you're referring to in the second point. Spam caused by the remote or local node?

I'm referring to spam caused by the remote node reopening the substream after being rejected. I'm not sure if we still have this issue after all the refactorings of the sync protocol's substream validation.

I think the spam is caused because there is no back-off on new outbound substreams.

Are you referring to the case when the remote node closes the substream and then opens it again immediately? I still don't understand why the remote node would close the substream in the first place. Apart from the collation protocol, where it should be fixed now (and where introducing the delays might affect collators' operation).

altonen (Contributor, Author) commented Sep 19, 2023

Are you referring to the case when the remote node closes the substream and then opens it again immediately? I still don't understand why the remote node would close the substream in the first place. Apart from the collation protocol, where it should be fixed now (and where introducing the delays might affect collators' operation).

The issue we had with syncing spamming substreams arose because SyncingEngine had accepted more than --in-peers peers, so ProtocolController tried to fill the free outbound slots, and even if the peer was accepted by ProtocolController, it was rejected (disconnected) by SyncingEngine. I believe this caused the spam (also back when syncing was still in Protocol), but that was fixed recently with another hack.

bkchr (Member) left a review comment:

Approving to move forward with this.

bkchr (Member) commented Jul 17, 2024

@dmitry-markin @lexnv do we still need this?

dmitry-markin (Contributor):

@dmitry-markin @lexnv do we still need this?

I was originally skeptical regarding this PR. Maybe we don't need it, but if the issue it tries to fix is still there, we might need to fix it in another way. I'll have a look.

Labels: T0-node (This PR/Issue is related to the topic “node”.)

Successfully merging this pull request may close these issues: Rejected nodes keep reconnecting without delay