Conversation
Changes made in this pull request should solve the issue described in #9656 and possibly several other issues related to parity sync. Unfortunately there are no unit tests for host.rs, and I believe that in its current state it is almost untestable. Please test it manually before merging; in the meantime I'll create an issue to refactor this code.
@@ -551,7 +551,7 @@ impl SyncHandler {
	}
	Err(()) => {
		trace!(target: "sync", "{}: Got bad snapshot chunk", peer_id);
-		io.disconnect_peer(peer_id);
+		io.disable_peer(peer_id);
Unrelated issue: if the chunk is bad, we should disable the peer (disconnect + mark as bad).
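For context, a minimal sketch of the distinction being drawn here, using a hypothetical peer registry rather than the real `SyncIo`/`Host` API: disconnecting only drops the connection, while disabling also marks the node as bad so we do not reconnect to it.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical registry; illustrative only, not the actual sync code.
struct Peers {
    connected: HashMap<usize, String>, // peer_id -> node identifier
    bad: HashSet<String>,              // nodes we refuse to reconnect to
}

impl Peers {
    /// Drop the connection, but allow reconnecting later.
    fn disconnect_peer(&mut self, peer_id: usize) {
        self.connected.remove(&peer_id);
    }

    /// Drop the connection *and* remember the node as bad, so a peer
    /// that fed us a bad snapshot chunk is not picked again.
    fn disable_peer(&mut self, peer_id: usize) {
        if let Some(node) = self.connected.remove(&peer_id) {
            self.bad.insert(node);
        }
    }
}

fn main() {
    let mut peers = Peers { connected: HashMap::new(), bad: HashSet::new() };
    peers.connected.insert(1, "node-a".to_string());
    peers.connected.insert(2, "node-b".to_string());

    peers.disconnect_peer(2); // may come back later
    peers.disable_peer(1);    // never picked again
    assert!(peers.bad.contains("node-a"));
}
```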
Are we sure this is ok? How are the chunks requested? What if the peer just sent a chunk from a different snapshot and we don't differentiate that on the request level?
We can't request 2 different snapshots anyway, so I guess this should be OK.
}

impl GenericSocket for TcpStream {
	fn shutdown(&self) -> io::Result<()> {
		self.shutdown(Shutdown::Both)
We never shut down the TCP stream.
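To illustrate what the added `shutdown` does at the socket level, here is a small standalone sketch using the standard library's `TcpStream` (nothing parity-specific): shutting down both halves sends a FIN, so the peer sees EOF and the kernel can complete the TCP teardown instead of leaving the connection half-open.

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};

fn main() -> std::io::Result<()> {
    // Local listener just to have something to connect to.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    let mut client = TcpStream::connect(addr)?;
    let (mut server_side, _) = listener.accept()?;

    client.write_all(b"ping")?;
    let mut buf = [0u8; 4];
    server_side.read_exact(&mut buf)?;

    // Explicitly shut down both the read and write halves.
    // The peer observes EOF and the TCP teardown can complete.
    client.shutdown(Shutdown::Both)?;

    assert_eq!(server_side.read(&mut buf)?, 0); // EOF after shutdown
    Ok(())
}
```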
}

-	fn kill_connection(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>, remote: bool) {
+	fn kill_connection(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>) {
I removed the `remote` flag as I believe it was confusing and often used incorrectly. E.g. `fn stop` was calling it for all nodes with `remote` set to `true`, and because of that we were calling `note_failure` for all of them.
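A simplified sketch of the misuse pattern described above (hypothetical names, not the actual host.rs code): the `remote` flag effectively meant "record a failure", so a bulk shutdown path ended up penalising every node.

```rust
// Hypothetical, simplified model of the old call pattern.
fn note_failure(node: &str) {
    println!("marking {node} as failed");
}

fn kill_connection_old(node: &str, remote: bool) {
    // ... tear down the connection ...
    if remote {
        // The flag really means "the remote side caused this",
        // but callers pass `true` as a generic default.
        note_failure(node);
    }
}

fn stop_old(nodes: &[&str]) {
    for node in nodes {
        // Local, orderly shutdown - yet every node gets a failure recorded.
        kill_connection_old(node, true);
    }
}

fn main() {
    stop_old(&["node-a", "node-b"]);
}
```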
So now we are deregistering the stream every time we call `kill_connection`. Previously it was done only if the session was done. It seems that `deregister` will remove the entry from `self.sessions`, but only when the session is expired - otherwise we will leak the value in the `HashMap` forever.

The code is really fragile here, and I think we should be careful with removing stuff that just "seems used incorrectly", especially when it has been running for a couple of years now and we are not really sure if there are any bugs in that code (i.e. is this strictly related to TCP streams hanging in the WAIT state?).
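A minimal model of the leak being described, under the assumption stated here that the `self.sessions` entry is only reclaimed when an expired session is deregistered; the types are illustrative, not the real host.rs ones.

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct Session {
    expired: bool,
}

struct Host {
    sessions: HashMap<usize, Session>, // token -> session
}

impl Host {
    // The entry is only removed when the session is already expired.
    fn deregister_stream(&mut self, token: usize) {
        let remove = self.sessions.get(&token).map_or(false, |s| s.expired);
        if remove {
            self.sessions.remove(&token);
        }
    }
}

fn main() {
    let mut host = Host { sessions: HashMap::new() };
    host.sessions.insert(7, Session { expired: false });

    // If the teardown path never marks the session expired before
    // deregistering, the entry stays in the map forever.
    host.deregister_stream(7);
    assert_eq!(host.sessions.len(), 1); // leaked

    // The flow this PR aims for: mark expired, then always deregister.
    host.sessions.get_mut(&7).unwrap().expired = true;
    host.deregister_stream(7);
    assert!(host.sessions.is_empty());
}
```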
Second issue: so now we don't `note_failure` on any of the previous `kill_connection(_, _, true)` calls - to avoid breaking stuff we should still review all call sites of that function to check whether `note_failure` should be there.
Yeah, it seems strange to remove `remote` just because it was wrongly used. Maybe just rename the variable? Or have a `deregister` bool and a `note_failure` one.
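A sketch of the alternative being suggested (hypothetical signature, not what the PR ends up doing): make both effects explicit at the call site instead of overloading a single `remote` flag.

```rust
// Hypothetical: spell out both effects instead of one ambiguous `remote: bool`.
fn kill_connection(token: usize, deregister: bool, note_failure: bool) {
    if note_failure {
        println!("recording failure for token {token}");
    }
    // ... tear down the session ...
    if deregister {
        println!("deregistering stream {token}");
    }
}

fn main() {
    // Ping timeout: the peer misbehaved, so note the failure and clean up fully.
    kill_connection(42, true, true);
    // Orderly local shutdown: clean up, but do not penalise the node.
    kill_connection(43, true, false);
}
```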
> So now we are deregistering the stream every time we call `kill_connection`. Previously it was done only if the session was done. Seems that deregister will remove the entry from self.sessions, but only when the session is expired - otherwise we will leak the value in the HashMap forever.

@tomusdrw the leak is happening now. The session is always set to expired and removed from the handlers, but the value is never deregistered if `remote = false`, because `s.done()` returns true as long as the socket is writable.

In this PR the leak is not happening, because after setting expired to true and unregistering the handlers, we always call deregister.

> The code is really fragile here, and I think we should be careful with removing stuff that just "seems used incorrectly", especially when it has been running for a couple of years now and we are not really sure if there are any bugs in that code (i.e. is this strictly related to TCP streams hanging in the WAIT state?).

I believe that the `CLOSE_WAIT` state is just a symptom of a bigger problem. After we call `kill_connection`, sessions leak in memory and they are already disconnected from the handlers. Remote nodes close the connection to our unresponsive socket, but because there are no handlers connected, we are stuck in the `CLOSE_WAIT` state.
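To make the CLOSE_WAIT symptom concrete (plain std networking, nothing parity-specific): once the remote side closes, our socket sits in CLOSE_WAIT until we close or shut down our end, which is exactly what a leaked, never-deregistered session prevents.

```rust
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread::sleep;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    let client = TcpStream::connect(addr)?;
    let (server_side, _) = listener.accept()?;

    // The remote (client) closes its end; a FIN arrives at our socket.
    client.shutdown(Shutdown::Both)?;

    // While `server_side` is kept alive without being closed,
    // `ss -tan state close-wait` will list this socket in CLOSE_WAIT -
    // the situation a leaked, never-torn-down session creates.
    sleep(Duration::from_secs(10));

    drop(server_side); // closing our side finally releases the socket
    Ok(())
}
```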
for p in to_disconnect {
	let reserved = self.reserved_nodes.read();
	if let Some(h) = self.handlers.read().get(&p) {
		h.disconnected(&NetworkContext::new(io, p, expired_session.clone(), self.sessions.clone(), &reserved), &token);
	}
}
if deregister {
	io.deregister_stream(token).unwrap_or_else(|e| debug!("Error deregistering stream: {:?}", e));
`deregister` set to false led to an incorrect state of the application: a connection that was supposed to be killed was never killed and kept occupying an entry in the sessions hashmap.
A couple of issues:
@@ -996,7 +983,7 @@ impl IoHandler<NetworkIoMessage> for Host {
	fn stream_hup(&self, io: &IoContext<NetworkIoMessage>, stream: StreamToken) {
		trace!(target: "network", "Hup: {}", stream);
		match stream {
-			FIRST_SESSION ... LAST_SESSION => self.connection_closed(stream, io),
+			FIRST_SESSION ... LAST_SESSION => self.kill_connection(stream, io),
Would make sense to `note_failure` here imho.
@@ -813,7 +808,7 @@ impl Host {
	}

	if kill {
-		self.kill_connection(token, io, true);
+		self.kill_connection(token, io);
Should we `note_failure`?
In some cases of `kill = true` we have already noted the failure, while in others we have not. I'll fix it.
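A sketch of the kind of fix being promised here, with hypothetical helper names: with the flag gone, each call site decides explicitly whether a failure should be recorded before the connection is killed.

```rust
// Hypothetical shape of the call sites after the flag removal:
// note_failure is invoked explicitly where the peer is actually at fault,
// and plain kill_connection is used for orderly teardown.

fn note_failure(token: usize) {
    println!("noting failure for {token}");
}

fn kill_connection(token: usize) {
    println!("killing connection {token}");
}

fn on_ping_timeout(token: usize) {
    // The peer stopped responding: that is a failure worth remembering.
    note_failure(token);
    kill_connection(token);
}

fn on_local_shutdown(token: usize) {
    // We are the ones closing: no failure should be recorded.
    kill_connection(token);
}

fn main() {
    on_ping_timeout(1);
    on_local_shutdown(2);
}
```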
@@ -908,13 +903,11 @@ impl Host {

	fn connection_timeout(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>) {
		trace!(target: "network", "Connection timeout: {}", token);
-		self.kill_connection(token, io, true)
+		self.kill_connection(token, io)
IMHO we should `note_failure` here.
I agree that the logic here is extremely fragile. Apart from the leaked sessions, I believe that we often call `note_failure` when we should not. Because the changes in this pull request are quite complex and controversial, I'll try to prepare tests proving that they work.
Overall LGTM, but there's a couple of places where we need to `note_failure`.
@@ -532,7 +532,7 @@ impl Host {
	}
	for p in to_kill {
		trace!(target: "network", "Ping timeout: {}", p);
-		self.kill_connection(p, io, true);
+		self.kill_connection(p, io);
Should we `note_failure` here as well?
Are you still working on this @debris?
fixes #9656