connectd: avoid use-after-free upon multiple reconnections by a peer #5300
Conversation
`peer_reconnected` was freeing a `struct peer_reconnected` instance while a pointer to that instance was registered to be passed as an argument to the `retry_peer_connected` callback function. This caused a use-after-free crash when `retry_peer_connected` attempted to reparent the instance to the temporary context.

Instead, never have `peer_reconnected` free a `struct peer_reconnected` instance, and only ever allow such an instance to be freed after the `retry_peer_connected` callback has finished with it. To ensure that the instance is freed even if the connection is closed before the callback can be invoked, parent the instance to the connection rather than to the daemon.

Absent the need to free `struct peer_reconnected` instances outside of the `retry_peer_connected` callback, there is no use for the `reconnected` hashtable, so remove it as well.

See: ElementsProject#5282 (comment)

Fixes: ElementsProject#5282
Fixes: ElementsProject#5284
Changelog-Fixed: connectd no longer crashes when peers reconnect.
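As a rough illustration of the ownership change described above, here is a minimal ccan/tal sketch, not the actual connectd source: the struct fields, the simplified signatures, the `node_id` stand-in, and the `resume_peer_connected()` continuation are assumptions made for the example. The reconnect state is parented to the stalled connection, so hanging up frees it automatically, and the callback only steals it onto the temporary context for the duration of the retry.

```c
/* Sketch only, NOT the actual connectd code.  struct daemon, node_id and
 * tmpctx stand in for the real Core Lightning types/globals. */
#include <ccan/io/io.h>
#include <ccan/tal/tal.h>

struct daemon;                           /* connectd's daemon state (assumed) */
struct node_id { unsigned char k[33]; }; /* simplified stand-in for the real type */
extern tal_t *tmpctx;                    /* per-event temporary context */

/* Hypothetical continuation back into the normal "peer connected" path. */
struct io_plan *resume_peer_connected(struct io_conn *conn,
				      struct daemon *daemon,
				      const struct node_id *id);

struct peer_reconnected {
	struct daemon *daemon;
	struct node_id id;
	/* ...saved handshake state... */
};

static struct io_plan *retry_peer_connected(struct io_conn *conn,
					    struct peer_reconnected *pr)
{
	/* Hand the instance to the temporary context: it lives exactly until
	 * this retry completes, then tal frees it.  Nothing else ever frees
	 * it, so no other pointer to it can go stale. */
	tal_steal(tmpctx, pr);
	return resume_peer_connected(conn, pr->daemon, &pr->id);
}

static struct io_plan *peer_reconnected(struct io_conn *conn,
					struct daemon *daemon,
					const struct node_id *id)
{
	/* Parent the instance to the connection, not to the daemon: if the
	 * peer hangs up before the retry runs, freeing conn frees pr too,
	 * so no hashtable is needed to track pending reconnects. */
	struct peer_reconnected *pr = tal(conn, struct peer_reconnected);
	pr->daemon = daemon;
	pr->id = *id;

	/* Sleep until io_wake() is called on this key (i.e. once the old
	 * active connection has been torn down), then retry. */
	return io_wait(conn, pr, retry_peer_connected, pr);
}
```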
This looks to be obviated by #5261.
Despite my effort to avoid leaking memory, my …
Does anyone know if I can undo the database schema changes (manually) and successfully go back to running v0.10.2? Aside from the resource leaks, v0.11.1 is costing me several channels a day due to "internal error" closures. 😢
Re: memleak. I looked at my node, and the immediate workaround is common/gossip_rcvd_filter.c line 78: change … The question of why we're not cleaning this cache correctly will take some diag.
(For future reference: attaching a debugger to the process and running …)
@rustyrussell: Okay, I'll give that a shot, but I don't see how even 10000 gossip entries in a cache would be consuming 2.5 GiB of memory. That would be an average of 262 KiB per gossip entry. (Obviously that ignores all other users of memory, but presumably the leaked memory comprises the vast majority of the used memory.)
It might not be the only source, but it's definitely the memory hog on my node (it's per-peer, with two caches per peer).
@rustyrussell: #5312 completely solves it for me.
I believe this reintroduces the "multiple connections" bug, where we can have multiple reconnections, and a stale one eventually gets restored (which is what the patch which introduced this htable was avoiding). One alternative is to have us only try reconnecting once: if we reconnect and it doesn't work, simply free it (by closing the connection). (I was hoping to get the more significant rewrite in, but it's stalled on a heap of timing changes which broke CI)
@rustyrussell: Let me see if I understand the logic. There can be multiple connections with a given peer but only one "active" connection with any given peer at any given instant. The first duplicate connection from a peer to arrive will cause the current active connection with that peer to be terminated. While that is occurring, a second duplicate connection from the same peer can arrive. Both duplicate connections will be stalled, waiting for the active connection to be dropped. Once the active connection is dropped, then the first duplicate connection will become the active connection. I think the intention at this point is that this new active connection will then immediately be dropped, and subsequently the second duplicate connection will become the active connection. This effectively forms a FIFO queue, where only the last (i.e., the most recently arrived) duplicate connection is allowed to survive as the active connection.

I am not confident about the timing of the message that is sent to the master daemon to tell it to drop the active connection for the peer. I have a spidey sense that something like the following could occur:
It seems like the intent of the htable was to discard the first duplicate connection (which has not yet become the active connection) when the second duplicate connection arrives, but the code wasn't actually doing that, as it was not actually closing that connection but was merely deallocating the structure that the I/O callback function would later reference.
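For illustration, a hedged reconstruction of the pattern being described (not the actual pre-patch source; `reconnected_htable_get()` is an assumed helper name): the earlier duplicate's struct is freed while the already-registered callback still holds a pointer to it, and the earlier connection itself is never closed.

```c
/* Illustrative reconstruction only; reconnected_htable_get() is an assumed
 * helper name, not the actual connectd API. */
struct peer_reconnected *old = reconnected_htable_get(&daemon->reconnected, id);
if (old) {
	/* The struct is freed here, but retry_peer_connected() was already
	 * registered with a pointer to it, so once the old active connection
	 * is finally dropped that callback runs on freed memory: the
	 * use-after-free.  Note that nothing here closes the earlier
	 * duplicate connection itself. */
	tal_free(old);
}
```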
So far, so good.
Well, it's not clear which will actually get in next, and which will go back to waiting. The intent of the code is that there can only ever be one waiting: if there's one already waiting, take its place.
Possibly. The rewrite eschews all this for a smoother flow of "kill current connection, tell lightningd it's dead, then tell it about new connection like normal".
Indeed! A quick fix would have been to keep the conn pointer in the struct, and free that...
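A rough sketch of that suggested quick fix, assuming ordinary ccan/tal and ccan/io semantics (the field names and the surrounding lookup are illustrative, not the actual code): keep the stalled connection in the struct, and when a newer duplicate supersedes it, tear down that connection rather than just freeing the struct, so the pending callback can never fire on freed memory.

```c
/* Sketch of the quick fix suggested above; names and context are assumed. */
struct peer_reconnected {
	struct io_conn *conn;   /* the stalled duplicate connection */
	struct daemon *daemon;
	struct node_id id;
	/* ... */
};

/* When a newer duplicate from the same peer arrives: */
struct peer_reconnected *old = reconnected_htable_get(&daemon->reconnected, id);
if (old) {
	/* Freeing the connection closes it and discards its pending I/O,
	 * so retry_peer_connected() is never invoked for the stale entry;
	 * only then is the struct itself released. */
	tal_free(old->conn);
	tal_free(old);
}
```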
Happens occasionally when running `tests/test_connection.py::test_mutual_reconnect_race` (which is too flaky to add without more fixes):

```
lightningd: lightningd/peer_control.c:1252: peer_active: Assertion `!channel->owner' failed.
lightningd: FATAL SIGNAL 6 (version v0.11.0.1-38-g4f167da)
0x5594a41f8f45 send_backtrace common/daemon.c:33
0x5594a41f8fef crashdump common/daemon.c:46
0x7f7cb585c08f ??? /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x7f7cb585c00b __GI_raise ../sysdeps/unix/sysv/linux/raise.c:51
0x7f7cb583b858 __GI_abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79
0x7f7cb583b728 __assert_fail_base /build/glibc-SzIz7B/glibc-2.31/assert/assert.c:92
0x7f7cb584cfd5 __GI___assert_fail /build/glibc-SzIz7B/glibc-2.31/assert/assert.c:101
0x5594a41b45ca peer_active lightningd/peer_control.c:1252
0x5594a418794c connectd_msg lightningd/connect_control.c:457
0x5594a41cd457 sd_msg_read lightningd/subd.c:556
0x5594a41ccbe5 read_fds lightningd/subd.c:357
0x5594a4269fc2 next_plan ccan/ccan/io/io.c:59
0x5594a426abca do_plan ccan/ccan/io/io.c:407
0x5594a426ac0c io_ready ccan/ccan/io/io.c:417
0x5594a426ceff io_loop ccan/ccan/io/poll.c:453
0x5594a41930d9 io_loop_with_timers lightningd/io_loop_with_timers.c:22
0x5594a4199293 main lightningd/lightningd.c:1181
0x7f7cb583d082 __libc_start_main ../csu/libc-start.c:308
0x5594a416e15d ??? ???:0
0xffffffffffffffff ??? ???:0
```

Signed-off-by: Rusty Russell <[email protected]>
This was fixed in 1c495ca ("connectd: fix accidental handling of old reconnections.") and then reverted by the rework in "connectd: avoid use-after-free upon multiple reconnections by a peer". The latter made the race much less likely, since we cleaned up the reconnecting struct once the connection was hung up by the remote node, but it's still theoretically possible. Signed-off-by: Rusty Russell <[email protected]>
I added two minimal fixes to this, which resulted from stress-testing (the test unfortunately hangs frequently, since without other fixes the connect command can race). This is because I want to do an 11.2 release with this fix and the memleak fix... basically, the "whitslack" point release.
This may also fix #5299.