This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

shutdown devp2p connections #9711

Closed
wants to merge 2 commits into from

Conversation

debris
Collaborator

@debris debris commented Oct 7, 2018

fixes #9656

@debris debris added A0-pleasereview 🤓 Pull request needs code review. B1-patch-beta 🕷 M4-core ⛓ Core client code / Rust. B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. labels Oct 7, 2018
Collaborator Author

@debris debris left a comment


The changes made in this pull request should solve the issue described in #9656 and possibly several other issues related to parity sync. Unfortunately there are no unit tests for host.rs, and I believe that in its current state it is almost untestable. Please test it manually before merging; in the meantime I'll create an issue to refactor this code.

@@ -551,7 +551,7 @@ impl SyncHandler {
			}
			Err(()) => {
				trace!(target: "sync", "{}: Got bad snapshot chunk", peer_id);
-				io.disconnect_peer(peer_id);
+				io.disable_peer(peer_id);
Collaborator Author


Unrelated issue: if a chunk is bad, we should disable the peer (disconnect + mark as bad).
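For readers outside the sync code, the distinction being made is that disconnecting only drops the current session, while disabling also remembers the peer as bad so we do not go back to it. A minimal sketch of those semantics, using hypothetical names rather than the actual SyncIo API:

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the sync IO context; names are illustrative only.
struct PeerBook {
    connected: HashSet<usize>,
    banned: HashSet<usize>,
}

impl PeerBook {
    /// Drop the session but allow reconnecting to the peer later.
    fn disconnect_peer(&mut self, peer_id: usize) {
        self.connected.remove(&peer_id);
    }

    /// Drop the session *and* remember the peer as bad, so it is not
    /// asked for snapshot chunks again.
    fn disable_peer(&mut self, peer_id: usize) {
        self.connected.remove(&peer_id);
        self.banned.insert(peer_id);
    }
}

fn main() {
    let mut peers = PeerBook { connected: HashSet::from([1, 2]), banned: HashSet::new() };
    peers.disconnect_peer(2);
    peers.disable_peer(1);
    assert!(peers.banned.contains(&1));
}
```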

Collaborator


Are we sure this is OK? How are the chunks requested? What if the peer just sent a chunk from a different snapshot and we don't differentiate that at the request level?

Contributor


We can't request 2 different snapshots anyway, so I guess this should be OK.

}

impl GenericSocket for TcpStream {
	fn shutdown(&self) -> io::Result<()> {
		self.shutdown(Shutdown::Both)
Collaborator Author


We never shut down the TCP stream.
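The standard library call behind this is TcpStream::shutdown(Shutdown::Both), which closes both halves of the socket. A small self-contained sketch in the same spirit as the trait method above (the trait name and the address are illustrative):

```rust
use std::io;
use std::net::{Shutdown, TcpStream};

/// Illustrative trait mirroring the shape of the GenericSocket change above.
trait SocketShutdown {
    fn shutdown(&self) -> io::Result<()>;
}

impl SocketShutdown for TcpStream {
    fn shutdown(&self) -> io::Result<()> {
        // Close both the read and the write half instead of leaving the
        // socket half-open after the session is dropped.
        TcpStream::shutdown(self, Shutdown::Both)
    }
}

fn main() -> io::Result<()> {
    // Arbitrary local address for the sketch; 30303 is the usual devp2p port.
    let stream = TcpStream::connect("127.0.0.1:30303")?;
    SocketShutdown::shutdown(&stream)?;
    Ok(())
}
```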

	}

-	fn kill_connection(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>, remote: bool) {
+	fn kill_connection(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>) {
Collaborator Author


I removed the remote flag, as I believe it was confusing and often used incorrectly.

e.g.

  • fn stop was calling it for all nodes with remote set to true and, because of that, we were calling note_failure for all of them
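A toy model of the problem being described (not the actual stop() code): with a single remote flag, a local, orderly shutdown ends up recording a failure against every connected peer.

```rust
// Modelled old behaviour: remote == true implied "note a failure for this node".
fn kill_connection(token: usize, remote: bool) {
    if remote {
        println!("note_failure for token {}", token); // blames the peer
    }
    println!("tearing down connection {}", token);
}

// stop() passed remote = true for every node, so all of them were
// penalised even though nothing went wrong on their side.
fn stop(tokens: &[usize]) {
    for &t in tokens {
        kill_connection(t, true);
    }
}

fn main() {
    stop(&[1, 2, 3]);
}
```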

Collaborator


So now we are deregistering the stream every time we call kill_connection. Previously it was done only if the session was done. It seems that deregister will remove the entry from self.sessions, but only when the session is expired - otherwise we will leak the value in the HashMap forever.

The code is really fragile here, and I think we should be careful with removing stuff that just "seems used incorrectly", especially when it has been running for a couple of years now and we are not really sure if there are any bugs in that code (i.e. is this strictly related to TCP streams hanging in the CLOSE_WAIT state?).

Collaborator


Second issue: so now we don't note_failure on any of the previous kill_connection(_, _, true) call sites - to avoid breaking stuff we should still review all call sites of that function and check whether note_failure should be there.

Contributor


Yeah, seems strange to remove remote just because it was wrongly used. Maybe just rename the variable? Or have a deregister bool and a note_failure one.
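A sketch of that alternative (signature and effects reduced to logging; the real function lives in util/network-devp2p/src/host.rs and takes an IoContext): the caller states explicitly whether to deregister and whether to blame the peer.

```rust
// Illustrative only: separate flags instead of one overloaded `remote` bool.
fn kill_connection(token: usize, deregister: bool, note_failure: bool) {
    // ... expire the session and notify protocol handlers ...
    if note_failure {
        // Record the failure so the node table can deprioritise this peer.
        println!("note_failure({})", token);
    }
    if deregister {
        // Remove the stream from the event loop and drop the session entry.
        println!("deregister_stream({})", token);
    }
}

fn main() {
    // Ping timeout: the remote misbehaved, note the failure and tear down.
    kill_connection(42, true, true);
    // Local shutdown: tear down without blaming the peer.
    kill_connection(43, true, false);
}
```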

Collaborator Author


> So now we are deregistering the stream every time we call kill_connection. Previously it was done only if the session was done. It seems that deregister will remove the entry from self.sessions, but only when the session is expired - otherwise we will leak the value in the HashMap forever.

@tomusdrw the leak is happening now. The session is always set to expired and removed from the handlers, but the value is never deregistered if remote = false, because s.done() returns true as long as the socket is writeable.

In this PR the leak is not happening, because after setting expired to true and unregistering the handlers, we always call deregister.

https://github.com/paritytech/parity-ethereum/blob/05550429c7dc0e154710d171bb432e68d000e34d/util/network-devp2p/src/host.rs#L917-L926

> The code is really fragile here, and I think we should be careful with removing stuff that just "seems used incorrectly", especially when it has been running for a couple of years now and we are not really sure if there are any bugs in that code (i.e. is this strictly related to TCP streams hanging in the CLOSE_WAIT state?).

I believe that the CLOSE_WAIT state is just a symptom of a bigger problem. After we call kill_connection, sessions leak in memory while already being disconnected from the handlers. Remote nodes close the connection to our unresponsive socket, but because there are no handlers connected, we are stuck in the CLOSE_WAIT state.

		for p in to_disconnect {
			let reserved = self.reserved_nodes.read();
			if let Some(h) = self.handlers.read().get(&p) {
				h.disconnected(&NetworkContext::new(io, p, expired_session.clone(), self.sessions.clone(), &reserved), &token);
			}
		}
		if deregister {
			io.deregister_stream(token).unwrap_or_else(|e| debug!("Error deregistering stream: {:?}", e));
Collaborator Author


deregister set to false led to an incorrect application state: a connection that was supposed to be killed was never killed and kept occupying an entry in the sessions hashmap.
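A minimal, self-contained model of the leak discussed in this thread (hypothetical types, not the host.rs code): if tearing a session down only marks it expired and skips deregistration, the entry stays in the shared map for the life of the process.

```rust
use std::collections::HashMap;

struct Session {
    expired: bool,
}

fn main() {
    let mut sessions: HashMap<usize, Session> = HashMap::new();
    sessions.insert(7, Session { expired: false });

    // Old behaviour (modelled): mark the session expired but never deregister it.
    if let Some(s) = sessions.get_mut(&7) {
        s.expired = true;
    }
    // The entry is still there and keeps occupying a peer slot.
    assert_eq!(sessions.len(), 1);

    // Behaviour after this PR (modelled): deregistration always runs, and it
    // drops entries whose session is expired.
    sessions.retain(|_, s| !s.expired);
    assert!(sessions.is_empty());
}
```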

Collaborator

@tomusdrw tomusdrw left a comment


couple of issues

@@ -996,7 +983,7 @@ impl IoHandler<NetworkIoMessage> for Host {
	fn stream_hup(&self, io: &IoContext<NetworkIoMessage>, stream: StreamToken) {
		trace!(target: "network", "Hup: {}", stream);
		match stream {
-			FIRST_SESSION ... LAST_SESSION => self.connection_closed(stream, io),
+			FIRST_SESSION ... LAST_SESSION => self.kill_connection(stream, io),
Collaborator


Would make sense to note_failure here imho.

@@ -813,7 +808,7 @@ impl Host {
		}

		if kill {
-			self.kill_connection(token, io, true);
+			self.kill_connection(token, io);
Collaborator


Should we note_failure?

Collaborator Author


In some cases of kill = true we had already noted the failure, while in others we had not. I'll fix it.

@@ -908,13 +903,11 @@ impl Host {

	fn connection_timeout(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>) {
		trace!(target: "network", "Connection timeout: {}", token);
-		self.kill_connection(token, io, true)
+		self.kill_connection(token, io)
Collaborator


IMHO we should note_failure
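A sketch of what the reviewers are asking for at these call sites, assuming a note_failure helper is available on the host (method bodies are stand-ins, not the real host.rs code): record the failure explicitly before tearing the connection down whenever the remote is at fault.

```rust
// Stand-in Host with the real side effects reduced to logging.
struct Host;

impl Host {
    fn note_failure(&self, token: usize) {
        // In the real code this would mark the node as failed in the node table.
        println!("note_failure({})", token);
    }

    fn kill_connection(&self, token: usize) {
        println!("kill_connection({})", token);
    }

    fn connection_timeout(&self, token: usize) {
        // A timeout is the remote's fault, so note the failure before tearing down.
        self.note_failure(token);
        self.kill_connection(token);
    }
}

fn main() {
    Host.connection_timeout(9);
}
```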

This was referenced Oct 8, 2018
@5chdn 5chdn added this to the 2.2 milestone Oct 9, 2018
Collaborator Author

@debris debris left a comment


I agree that the logic here is extremely fragile. Apart from the leaked sessions, I believe that we often call note_failure when we should not. Because the changes in this pull request are quite complex and controversial, I'll try to prepare tests proving that they work.

@5chdn 5chdn added A5-grumble 🔥 Pull request has minor issues that must be addressed before merging. and removed A0-pleasereview 🤓 Pull request needs code review. labels Oct 15, 2018
Contributor

@andresilva andresilva left a comment


Overall LGTM, but there are a couple of places where we need to note_failure.

@@ -532,7 +532,7 @@ impl Host {
		}
		for p in to_kill {
			trace!(target: "network", "Ping timeout: {}", p);
-			self.kill_connection(p, io, true);
+			self.kill_connection(p, io);
Contributor


Should we note_failure here as well?

@5chdn 5chdn added the A3-stale 🍃 Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Oct 26, 2018
@5chdn
Contributor

5chdn commented Oct 26, 2018

Are you still working on this, @debris?

@5chdn 5chdn modified the milestones: 2.2, 2.3 Oct 29, 2018
This was referenced Nov 1, 2018
@5chdn 5chdn closed this Nov 25, 2018
@soc1c soc1c deleted the shutdown-devp2p-connections branch March 20, 2019 11:31
Labels
A3-stale 🍃 Pull request did not receive any updates in a long time. No review needed at this stage. Close it.
A5-grumble 🔥 Pull request has minor issues that must be addressed before merging.
B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch.
M4-core ⛓ Core client code / Rust.
Development

Successfully merging this pull request may close these issues.

TCP connections hang in CLOSE_WAIT, count as valid peers, syncing stops
5 participants