Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matchmaker eventually stops accepting connections after being deployed #428

Closed
MaxCWhitehead opened this issue Aug 2, 2024 · 4 comments · Fixed by n0-computer/iroh#2782 or #487
Closed
Assignees

Comments

@MaxCWhitehead
Copy link
Collaborator

MaxCWhitehead commented Aug 2, 2024

I've had test matchmaker deployed for ~4 days, Jumpy now times out when trying to connect. I think I've hit this before and typically restarting the container seems to resolve, going to try to debug before restarting.

2024-08-02T03:36:26.643321Z  INFO bones_framework::networking::online: Starting search for online game with player count 2
2024-08-02T03:36:26.643897Z  INFO bones_framework::networking::online: Connecting to online matchmaker
2024-08-02T03:36:26.974294Z  INFO magic_ep{me=lgui4qsh7nmonau5}:magicsock{me=lgui4qsh7nmonau5}:actor: iroh_net::magicsock: home is now relay https://use1-1.relay.iroh.network./, was None
2024-08-02T03:36:26.974384Z  INFO pkarr_publish{me=lgui4qsh7nmonau5}: iroh_net::discovery::pkarr: Publish node info to pkarr relay_url=Some("https://use1-1.relay.iroh.network./")
2024-08-02T03:36:26.974468Z  INFO magic_ep{me=lgui4qsh7nmonau5}:magicsock{me=lgui4qsh7nmonau5}:relay-actor: iroh_net::magicsock::relay_actor: adding connection to relay: https://use1-1.relay.iroh.network./ for home-keep-alive
2024-08-02T03:36:26.998492Z  INFO discovery{me=lgui4qsh7nmonau5 node=e7ubfz7cuqcx4xpr}:add_node_addr{me=lgui4qsh7nmonau5}:add_node_addr{node=e7ubfz7cuqcx4xpr}: iroh_net::magicsock::node_map: inserting new node in NodeMap node=e7ubfz7cuqcx4xpr relay_url=Some(RelayUrl("https://use1-1.relay.iroh.network./")) source=discovery
2024-08-02T03:36:26.999192Z  INFO poll_send{me=lgui4qsh7nmonau5}:get_send_addrs{node=e7ubfz7cuqcx4xpr}: iroh_net::magicsock::node_map::node_state: new connection type typ=relay
2024-08-02T03:36:37.000977Z  WARN bones_framework::networking::online: Online Game Search failed: failed connecting to provider

Caused by:
    timed out
@MaxCWhitehead MaxCWhitehead self-assigned this Aug 18, 2024
@MaxCWhitehead
Copy link
Collaborator Author

Haven't identified the issue yet here with additional logging - setup a client that attempts connection to mathcmaker every 5 seconds, and monitoring with tokio-console. Hoping to pin down exact timestamp that it breaks to see if anything more subtle happens in logs. Otherwise maybe will see a change in the async state that points towards an issue.

@flub
Copy link

flub commented Oct 3, 2024

Thanks for your patience on this! And many thanks for the 1G of logfile you sent us. It allowed me to figure out what was going on and if I have it right this should be the fix: n0-computer/iroh#2782

It makes sense that this happens only with your dev matchmaker: because it depends on not much traffic happening. A matchmaker that constantly has remote nodes connecting and disconnecting would be much less likely to run into this.

@MaxCWhitehead
Copy link
Collaborator Author

MaxCWhitehead commented Oct 4, 2024

@flub

Yeah that checks out - I was probing it repeatedly for a while to see if could pin point a timestamp of exact failure, and during this test it never failed. Definitely points to reproducing when idle.

Thank you very much for digging into this and doing autopsy on the logs. Nice detective work!

Will circle back in the future once this is merged + we deploy it and can close the loop on confirming the fix.

github-merge-queue bot pushed a commit to n0-computer/iroh that referenced this issue Oct 4, 2024
## Description

When the connection to the relay server fails the read channel will
return a read error. At this point the ActiveRelay actor will passively
wait until it has been asked to send something again before it will
re-establish a connection.

However if the local node has no reason to send anything to the relay
server, the connection is never re-established. This is problematic when
the relay has remote nodes trying to send to this node. This doubly
problematic when the connection is to the home relay: the node just sits
there thinking everything is healty and quiet, but no traffic is
reaching it.

In a node with active traffic this doesn't really show up, since a send
will be triggered quickly for an active connection and the connection
with the relay server would be re-established.

The start of the ActiveRelay run loop is the right place for this. A
read error triggers the loop to go round, logs a read error already and
then re-estagblishes the connection.

This does not keep the relay connection open forever. The mechanism that
is cleans up
unused connections to relay servers will still function correctly since
this only takes
the time something was last sent to a relay server into account. As long
as a connection
with a remote node exists there will be a DISCO ping between the two
nodes over the relay
path, so the connection is correctly kept alive. The home relay is
exempted from the
relay connection cleanup so is also kept connected, leaving this node
available to be
contacted via the relay server. Which is the entire point of this
bugfix.

The relay_client.is_connected() call sends a message to the relay Client
actor, and relay_client.connect() does that again. Taking the shortcut
to only call .connect() however is not better because the logging
becomes messier. In the common case there is one roundrip-message to the
relay Client actor and this would not improve anyway. The two messages
for the case where a reconnect is needed does not occur commonly.

## Breaking Changes

None

## Notes & open questions

Fixes fishfolk/bones#428

It is rather difficult to test though.

This targets #2781 as base.

## Change checklist

- [x] Self-review.
- ~~[ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.~~
- ~~[ ] Tests if relevant.~~
- ~~[ ] All breaking changes documented.~~
rklaehn pushed a commit to n0-computer/iroh-blobs that referenced this issue Oct 22, 2024
## Description

When the connection to the relay server fails the read channel will
return a read error. At this point the ActiveRelay actor will passively
wait until it has been asked to send something again before it will
re-establish a connection.

However if the local node has no reason to send anything to the relay
server, the connection is never re-established. This is problematic when
the relay has remote nodes trying to send to this node. This doubly
problematic when the connection is to the home relay: the node just sits
there thinking everything is healty and quiet, but no traffic is
reaching it.

In a node with active traffic this doesn't really show up, since a send
will be triggered quickly for an active connection and the connection
with the relay server would be re-established.

The start of the ActiveRelay run loop is the right place for this. A
read error triggers the loop to go round, logs a read error already and
then re-estagblishes the connection.

This does not keep the relay connection open forever. The mechanism that
is cleans up
unused connections to relay servers will still function correctly since
this only takes
the time something was last sent to a relay server into account. As long
as a connection
with a remote node exists there will be a DISCO ping between the two
nodes over the relay
path, so the connection is correctly kept alive. The home relay is
exempted from the
relay connection cleanup so is also kept connected, leaving this node
available to be
contacted via the relay server. Which is the entire point of this
bugfix.

The relay_client.is_connected() call sends a message to the relay Client
actor, and relay_client.connect() does that again. Taking the shortcut
to only call .connect() however is not better because the logging
becomes messier. In the common case there is one roundrip-message to the
relay Client actor and this would not improve anyway. The two messages
for the case where a reconnect is needed does not occur commonly.

## Breaking Changes

None

## Notes & open questions

Fixes fishfolk/bones#428

It is rather difficult to test though.

This targets #2781 as base.

## Change checklist

- [x] Self-review.
- ~~[ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.~~
- ~~[ ] Tests if relevant.~~
- ~~[ ] All breaking changes documented.~~
github-merge-queue bot pushed a commit that referenced this issue Oct 26, 2024
@MaxCWhitehead
Copy link
Collaborator Author

Updated bones version to 0.27 - thanks again @flub!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants