connectd: avoid use-after-free upon multiple reconnections by a peer #5300
Conversation
`peer_reconnected` was freeing a `struct peer_reconnected` instance while a pointer to that instance was registered to be passed as an argument to the `retry_peer_connected` callback function. This caused a use-after-free crash when `retry_peer_connected` attempted to reparent the instance to the temporary context.

Instead, never have `peer_reconnected` free a `struct peer_reconnected` instance, and only ever allow such an instance to be freed after the `retry_peer_connected` callback has finished with it. To ensure that the instance is freed even if the connection is closed before the callback can be invoked, parent the instance to the connection rather than to the daemon.

Absent the need to free `struct peer_reconnected` instances outside of the `retry_peer_connected` callback, there is no use for the `reconnected` hashtable, so remove it as well.

See: ElementsProject#5282 (comment)

Fixes: ElementsProject#5282
Fixes: ElementsProject#5284
Changelog-Fixed: connectd no longer crashes when peers reconnect.
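As a rough illustration of the ownership change described above, here is a minimal ccan/tal sketch, not the actual connectd source: the struct fields, the simplified signatures, the `node_id` stand-in, and the `resume_peer_connected()` continuation are assumptions made for the example. The reconnect state is parented to the stalled connection, so hanging up frees it automatically, and the callback only steals it onto the temporary context for the duration of the retry.

```c
/* Sketch only, NOT the actual connectd code.  struct daemon, node_id and
 * tmpctx stand in for the real Core Lightning types/globals. */
#include <ccan/io/io.h>
#include <ccan/tal/tal.h>

struct daemon;                           /* connectd's daemon state (assumed) */
struct node_id { unsigned char k[33]; }; /* simplified stand-in for the real type */
extern tal_t *tmpctx;                    /* per-event temporary context */

/* Hypothetical continuation back into the normal "peer connected" path. */
struct io_plan *resume_peer_connected(struct io_conn *conn,
				      struct daemon *daemon,
				      const struct node_id *id);

struct peer_reconnected {
	struct daemon *daemon;
	struct node_id id;
	/* ...saved handshake state... */
};

static struct io_plan *retry_peer_connected(struct io_conn *conn,
					    struct peer_reconnected *pr)
{
	/* Hand the instance to the temporary context: it lives exactly until
	 * this retry completes, then tal frees it.  Nothing else ever frees
	 * it, so no other pointer to it can go stale. */
	tal_steal(tmpctx, pr);
	return resume_peer_connected(conn, pr->daemon, &pr->id);
}

static struct io_plan *peer_reconnected(struct io_conn *conn,
					struct daemon *daemon,
					const struct node_id *id)
{
	/* Parent the instance to the connection, not to the daemon: if the
	 * peer hangs up before the retry runs, freeing conn frees pr too,
	 * so no hashtable is needed to track pending reconnects. */
	struct peer_reconnected *pr = tal(conn, struct peer_reconnected);
	pr->daemon = daemon;
	pr->id = *id;

	/* Sleep until io_wake() is called on this key (i.e. once the old
	 * active connection has been torn down), then retry. */
	return io_wait(conn, pr, retry_peer_connected, pr);
}
```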
This looks to be obviated by #5261.
Despite my effort to avoid leaking memory, my …
Does anyone know if I can undo the database schema changes (manually) and successfully go back to running v0.10.2? Aside from the resource leaks, v0.11.1 is costing me several channels a day due to "internal error" closures. 😢
Re: memleak. I looked at my node, and the immediate workaround is common/gossip_rcvd_filter.c line 78: change … The question of why we're not cleaning this cache correctly will take some diag.
(For future reference: attaching a debugger to the process and running …)
@rustyrussell: Okay, I'll give that a shot, but I don't see how even 10000 gossip entries in a cache would be consuming 2.5 GiB of memory. That would be an average of 262 KiB per gossip entry. (Obviously that ignores all other users of memory, but presumably the leaked memory comprises the vast majority of the used memory.)
It might not be the only source, but it's definitely the memory hog on my node (it's per-peer, with two caches per peer).
@rustyrussell: #5312 completely solves it for me.
I believe this reintroduces the "multiple connections" bug, where we can have multiple reconnections, and a stale one eventually gets restored (which is what the patch which introduced this htable was avoiding). One alternative is to have us only try reconnecting once: if we reconnect and it doesn't work, simply free it (by closing the connection). (I was hoping to get the more significant rewrite in, but it's stalled on a heap of timing changes which broke CI)
@rustyrussell: Let me see if I understand the logic. There can be multiple connections with a given peer but only one "active" connection with any given peer at any given instant. The first duplicate connection from a peer to arrive will cause the current active connection with that peer to be terminated. While that is occurring, a second duplicate connection from the same peer can arrive. Both duplicate connections will be stalled, waiting for the active connection to be dropped. Once the active connection is dropped, then the first duplicate connection will become the active connection. I think the intention at this point is that this new active connection will then immediately be dropped, and subsequently the second duplicate connection will become the active connection. This effectively forms a FIFO queue, where only the last (i.e., the most recently arrived) duplicate connection is allowed to survive as the active connection.

I am not confident about the timing of the message that is sent to the master daemon to tell it to drop the active connection for the peer. I have a spidey sense that something like the following could occur:
It seems like the intent of the htable was to discard the first duplicate connection (which has not yet become the active connection) when the second duplicate connection arrives, but the code wasn't actually doing that, as it was not actually closing that connection but was merely deallocating the structure that the I/O callback function would later reference.
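For illustration, a hedged reconstruction of the pattern being described (not the actual pre-patch source; `reconnected_htable_get()` is an assumed helper name): the earlier duplicate's struct is freed while the already-registered callback still holds a pointer to it, and the earlier connection itself is never closed.

```c
/* Illustrative reconstruction only; reconnected_htable_get() is an assumed
 * helper name, not the actual connectd API. */
struct peer_reconnected *old = reconnected_htable_get(&daemon->reconnected, id);
if (old) {
	/* The struct is freed here, but retry_peer_connected() was already
	 * registered with a pointer to it, so once the old active connection
	 * is finally dropped that callback runs on freed memory: the
	 * use-after-free.  Note that nothing here closes the earlier
	 * duplicate connection itself. */
	tal_free(old);
}
```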
So far, so good.
Well, it's not clear which will actually get in next, and which will go back to waiting. The intent of the code is that there can only ever be one waiting: if there's one already waiting, take its place.
Possibly. The rewrite eschews all this for a smoother flow of "kill current connection, tell lightningd it's dead, then tell it about new connection like normal".
Indeed! A quick fix would have been to keep the conn pointer in the struct, and free that...
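A rough sketch of that suggested quick fix, assuming ordinary ccan/tal and ccan/io semantics (the field names and the surrounding lookup are illustrative, not the actual code): keep the stalled connection in the struct, and when a newer duplicate supersedes it, tear down that connection rather than just freeing the struct, so the pending callback can never fire on freed memory.

```c
/* Sketch of the quick fix suggested above; names and context are assumed. */
struct peer_reconnected {
	struct io_conn *conn;   /* the stalled duplicate connection */
	struct daemon *daemon;
	struct node_id id;
	/* ... */
};

/* When a newer duplicate from the same peer arrives: */
struct peer_reconnected *old = reconnected_htable_get(&daemon->reconnected, id);
if (old) {
	/* Freeing the connection closes it and discards its pending I/O,
	 * so retry_peer_connected() is never invoked for the stale entry;
	 * only then is the struct itself released. */
	tal_free(old->conn);
	tal_free(old);
}
```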
Happens occasionally when running `tests/test_connection.py::test_mutual_reconnect_race` (which is too flaky to add without more fixes):

```
lightningd: lightningd/peer_control.c:1252: peer_active: Assertion `!channel->owner' failed.
lightningd: FATAL SIGNAL 6 (version v0.11.0.1-38-g4f167da)
0x5594a41f8f45 send_backtrace common/daemon.c:33
0x5594a41f8fef crashdump common/daemon.c:46
0x7f7cb585c08f ??? /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x7f7cb585c00b __GI_raise ../sysdeps/unix/sysv/linux/raise.c:51
0x7f7cb583b858 __GI_abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79
0x7f7cb583b728 __assert_fail_base /build/glibc-SzIz7B/glibc-2.31/assert/assert.c:92
0x7f7cb584cfd5 __GI___assert_fail /build/glibc-SzIz7B/glibc-2.31/assert/assert.c:101
0x5594a41b45ca peer_active lightningd/peer_control.c:1252
0x5594a418794c connectd_msg lightningd/connect_control.c:457
0x5594a41cd457 sd_msg_read lightningd/subd.c:556
0x5594a41ccbe5 read_fds lightningd/subd.c:357
0x5594a4269fc2 next_plan ccan/ccan/io/io.c:59
0x5594a426abca do_plan ccan/ccan/io/io.c:407
0x5594a426ac0c io_ready ccan/ccan/io/io.c:417
0x5594a426ceff io_loop ccan/ccan/io/poll.c:453
0x5594a41930d9 io_loop_with_timers lightningd/io_loop_with_timers.c:22
0x5594a4199293 main lightningd/lightningd.c:1181
0x7f7cb583d082 __libc_start_main ../csu/libc-start.c:308
0x5594a416e15d ??? ???:0
0xffffffffffffffff ??? ???:0
```

Signed-off-by: Rusty Russell <[email protected]>
This was fixed in 1c495ca ("connectd: fix accidental handling of old reconnections.") and then reverted by the rework in "connectd: avoid use-after-free upon multiple reconnections by a peer". The latter made the race much less likely, since we cleaned up the reconnecting struct once the connection was hung up by the remote node, but it's still theoretically possible. Signed-off-by: Rusty Russell <[email protected]>
I added two minimal fixes to this, which resulted from stress-testing (the test unfortunately hangs frequently, since without other fixes the connect command can race). This is because I want to do an 11.2 release with this fix and the memleak fix... basically, the "whitslack" point release.
This may also fix #5299.