Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connectd+: Flake/race fix for new channels #5601

Merged
merged 1 commit into from
Sep 18, 2022

Conversation

ddustin
Copy link
Collaborator

@ddustin ddustin commented Sep 16, 2022

  1. dualopen has fd to connectd
  2. channeld needs to take over
  3. dualopen passes fd that leads to a connectd over for channeld to use
  4. lightningd must receive the fd transfer request and process it
  5. dualopen shuts down and closes everything it owns

4 & 5 end up in a race. If 5 happens before 4, channeld ends up with an invalid fd for connectd — leaving it in a position to not receive messages.

Lingering for a second makes 4 win the race. Since the daemon is closing anyway, waiting for a second should be alright.

@rustyrussell
Copy link
Contributor

Oh, good catch! I'm not entirely surprised that exiting closes pending fds we've sent that haven't been read yet.

(You didn't describe what the user-visible result of this is, though, which would be useful! And a Changelog-Fixed line).

This might be fine for 0.12.1 (depending on what this actually fixes!) but I think we want to add something more robust in master. I think lightningd could simply close the fd to the subd after receiving the passed fds: it could wait for that and terminate.

channeld/channeld.c Outdated Show resolved Hide resolved
@ddustin
Copy link
Collaborator Author

ddustin commented Sep 17, 2022

You didn't describe what the user-visible result of this is, though, which would be useful!

On my machine (Intel Mac 12.3.1 [21E258]) this script would (sometimes) trigger the problem

source contrib/start_regtest.sh
start_ln
fund_nodes
INVOICE=$(l2-cli invoice 10000 test test | jq -r ".bolt11")
l1-cli pay $INVOICE

Resulting in a user output that showed this error:

Waiting for lightning node funds... found.
Funding channel from node 1 to node 2. Waiting for confirmation... done.
{
   "code": 210,
   "message": "Ran out of routes to try after 1 attempt: see `paystatus`",
   "attempts": [
      {
         "status": "failed",
         "failreason": "Failing a partial payment due to a failed RPC call: {\"code\":204,\"message\":\"failed: WIRE_TEMPORARY_CHANNEL_FAILURE (First peer not ready)\",\"data\":{\"erring_index\":0,\"failcode\":4103,\"failcodename\":\"WIRE_TEMPORARY_CHANNEL_FAILURE\",\"erring_node\":\"03f8eaa056c5c9d390f183457cae5abde8af6cce2adb873b9dc83b44b4a2a81872\",\"erring_channel\":\"103x1x1\",\"erring_direction\":1}}",
         "partid": 0,
         "amount_msat": 10000
      }
   ]
}

Periodically on new channel creation (immediately after hand off to channeld) one side would never receive WIRE_ANNOUNCEMENT_SIGNATURES even though the other side did send it. Resulting in a log like this on the failing side:

2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: peer_out WIRE_ANNOUNCEMENT_SIGNATURES
2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: billboard: Funding transaction locked. Waiting for their announcement signatures.
2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: billboard: Funding transaction locked. Waiting for their announcement signatures.
2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: billboard: Funding transaction locked. Waiting for their announcement signatures.
2022-09-14T22:20:16.504Z INFO    03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: Peer connection lost
2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: Status closed, but not exited. Killing
2022-09-14T22:20:16.505Z INFO    03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-chan#1: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)

The other side would both send & receive WIRE_ANNOUNCEMENT_SIGNATURES. This implies (for the first node) the socket was in a strange state where messages could be sent across it yet not received. The call to read() would return 0 implying a clean shutdown of the connection. At first I thought this was between the two nodes but eventually realized it was the inter-daemon socket from channeld <-> connectd.

And a Changelog-Fixed line

Whoops, added it!

This might be fine for 0.12.1 (depending on what this actually fixes!) but I think we want to add something more robust in master. I think lightningd could simply close the fd to the subd after receiving the passed fds: it could wait for that and terminate.

Ah I was exploring ways to ack the passed fd and this sounds like a nice simple one. Had the same idea we could push this for 0.12.1 and build out a more robust solution after.

The problem is relatively reliably created on mac but my hope is it fix obscure flakiness issues for others. It has that sort of feel to it that it might.

1) dualopen has fd to connectd
2) channeld needs to take over
3) dualopen passes fd that leads to a connectd over for channeld to use
4) lightningd must receive the fd transfer request and process
5) dualopen shuts down and closes everything it owns

4 & 5 end up in a race. If 5 happens before 4, channeld ends up with an invalid fd for connectd — leaving it in a position to not receive messages.

Lingering for a second makes 4 win the race. Since the daemon is closing anyway, waiting for a second should be alright.

Changelog-Fixed: Fixed a condition for newly created channels that could trigger a need for reconnect.
@rustyrussell rustyrussell added this to the v0.12.1 milestone Sep 18, 2022
@rustyrussell rustyrussell added the voodoo Suspect user, issue or code has ancient curse label Sep 18, 2022
@rustyrussell
Copy link
Contributor

Pushed trivial whitespace fix, tagged for 0.12.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
voodoo Suspect user, issue or code has ancient curse
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants