Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebSocket immediate disconnect on vasil-dev network #230

Closed
1 of 4 tasks
rhyslbw opened this issue Jun 23, 2022 · 21 comments
Closed
1 of 4 tasks

WebSocket immediate disconnect on vasil-dev network #230

rhyslbw opened this issue Jun 23, 2022 · 21 comments
Labels
bug Something isn't working won't fix / upstream issue An issue which is either not covered by Ogmios, or requires changes upstream first.

Comments

@rhyslbw
Copy link
Member

rhyslbw commented Jun 23, 2022

What Git revision are you using?

e3b53aa

input-output-hk/cardano-configurations@77ad26a

What operating system are you using, and which version?

  • Linux / Ubuntu
  • Linux / Other
  • OSX
  • Windows

Describe what the problem is?

Connections made to the Ogmios server on vasil-dev result in an immediate WebSocket close:

Server logs
cardano-node-ogmios_1  | {"severity":"Error","timestamp":"2022-06-23T15:17:04.942155567Z","thread":"49","message":{"WebSocket":{"exception":"MuxError MuxBearerClosed \"<socket: 18> closed when reading data, waiting on next header True\"","tag":"WebSocketUnknownException"}},"version":"v5.5.0"}
cardano-node-ogmios_1  | {"severity":"Info","timestamp":"2022-06-23T15:17:04.942280371Z","thread":"49","message":{"WebSocket":{"tag":"WebSocketConnectionEnded","userAgent":"User-Agent unknown"}},"version":"v5.5.0"}
To reproduce
  1. Start an instance of cardano-node-ogmios in the vasil-dev network.
  2. Change
    const context = await createInteractionContext(console.error, () => {}, { connection })
    to: const context = await createInteractionContext(console.error, (code) => { log(code) }, { connection }) (not sure why I didn't log the disconnect in the initial impl) 😶
  3. In clients/TypeScript run yarn repl:start --port __
  4. Observe the disconnect code 1006

What should be the expected behavior?

Repeat steps 1,2, and 4 with a testnet instance to observe correct behaviour of no disconnect, and the ability to issue commands.

@rhyslbw rhyslbw added the bug Something isn't working label Jun 23, 2022
@KtorZ

This comment was marked as off-topic.

@KtorZ

This comment was marked as off-topic.

@KtorZ

This comment was marked as off-topic.

@KtorZ
Copy link
Member

KtorZ commented Jun 26, 2022

I've tried on the testnet and mainnet and as far as I can tell, this behavior doesn't occur there. I wonder if this has anything to do with the issues regarding other peers (namely, the one using an invalid network magic and the one that is propagating bad blocks).

@rhyslbw

This comment was marked as off-topic.

@KtorZ
Copy link
Member

KtorZ commented Jun 27, 2022

Also, I just recalled: #208

Could it be that vasil-dev has p2p enabled ?

@catch-21

This comment was marked as off-topic.

@KtorZ

This comment was marked as off-topic.

@catch-21

This comment was marked as off-topic.

@KtorZ

This comment was marked as off-topic.

@KtorZ
Copy link
Member

KtorZ commented Jun 27, 2022

Trying to run the node with p2p disabled, I can confirmed that the behavior isn't present anymore. I can keep the repl open and do chain-sync just fine (or at least, for more than 3s and up until my current node's state). Though, I haven't yet figured what the topology configuration should be in non-p2p mode on vasil-dev, so my node is just hanging there not syncing.

This is the case also with 1.35.0. Running with EnableP2P causes the client connection to be terminated after 3s 🤷 ... And there's no trace on the node's side (even after enabling more logs on the local chain sync). I am.. puzzled.

@KtorZ
Copy link
Member

KtorZ commented Jun 27, 2022

What also puzzles me is that, there's also an internal client connection that Ogmios uses for the health heartbeat. This one is basically a chain-sync, and it remains up and running without problem 🤔

edit: So I've tried to artificially remove the health heartbeat and, it didn't solve the issue. However, I went a bit deeper in the rabbit hole and remove the second client connection made with each websocket (because each websocket actually starts two client connections, one for most protocols, and a side-one used to fetch protocols parameters & the like when submitting transactions or evaluating units). And this, seems to be the root cause of the issue. If I remove the second connection (which isn't needed for chain-sync) then the connection remains open (and even with the health heartbeat).

Now entering the realm of hypothesis: the second connection is mostly idle all-the-time, with the 'agency' on the client. So I think that the second connection is being terminated because it's idle, and thus, causes the first one to also terminates and the websocket to terminates in consequence. Thus, the exception we see "MuxBearerClosed" actually comes from us terminating the first connection, because the second is maybe terminated for some reason. I'll try to confirm that.

@KtorZ
Copy link
Member

KtorZ commented Jun 27, 2022

Hypothesis above ☝️ confirmed, so the second client (which is basically just idling waiting for a request) is indeed terminated automatically after ~3s somehow, causing the other to fail. If I simply add an error handling on that second client and forever restart it, then the rest seems fine.

So, we have a "quickfix" / mitigation, which is good. But it doesn't quite explain why the second client is terminated and why this only occurs when p2p is enabled (I'd expect p2p to be completely orthogonal to the local client protocols).

@KtorZ
Copy link
Member

KtorZ commented Jun 27, 2022

It seems that the problem comes with the (second) client connecting but not actually sending any message; causing the node to probably clean up the connection? I've slightly change the second client such that it sends a dummy message when connecting, and then awaits for requests and it seems to go through, at least a little longer (I can do a full chain-sync and the connection remains open until the end).

I assume that this is perhaps something configurable and now makes me wonder.. will the node automatically terminates the connection should there be no client activity for X minutes or hours. That'd be pretty annoying 😬 ... Need to ask Marcin.

KtorZ added a commit that referenced this issue Jun 27, 2022
  It seems like, if we just go and idle right after opening the
  connection, the connection gets closed automatically after 3s.

  So, this commits tries to circumvent the issue by making sure that
  at least one message is sent when the connection is established,
  AND THEN (and only then), we await for client requests (which may
  never come if the client does not evaluate execution units).

  See more of the discussion in: #230.

  At this stage, I don't know if this really solves the problem or, if
  it simply postpones it to later.
@KtorZ
Copy link
Member

KtorZ commented Jun 27, 2022

@rhyslbw The commit above does partially fix the problem but it raises also a few questions. I say partially because, the same behavior can still be observed if the client does not send any request within a short enough time window (which seems to be ~2s). For example:

client.on('open', () => {
  setTimeout(() => {
    client.send(requestNext);
  }, 2000);
});

This fails for me and gives me an error 1006 connection closed. But, anything below that seems to work fine. And it seems to only be a problem for the first message. So, if I send a first request promptly, I can wait a lot longer between messages afterwards (I've been up to one minute without problem).

Again, this is a bit odd and I hope configurable so I can probably set that to infinity 😬

@KtorZ
Copy link
Member

KtorZ commented Jun 27, 2022

Marcin confirms that this is a strange behavior and, to quote him:

In P2P we have a timeout which shuts down a connection if nothing happens after one negotiated a connection. My hypothesis is that this timeout is also set for node-to-client connection, but it should only be set for node-to-node connections.

https://github.com/input-output-hk/ouroboros-network/blob/master/ouroboros-network/src/Ouroboros/Network/Diffusion/P2P.hs?plain=1#L258-L259

If my hypothesis is right, your solution will actually solve the problem (although the right thing is to remove the timeout from node-to-client protocol).

Looks like we're going to need a 1.35.1 😶

@KtorZ KtorZ added the won't fix / upstream issue An issue which is either not covered by Ogmios, or requires changes upstream first. label Jun 27, 2022
@rhyslbw
Copy link
Member Author

rhyslbw commented Jun 28, 2022

Thanks for the detailed investigation notes @KtorZ. I'm attempting to see if we can work around this by joining the network without enabling p2p, but am stuck since I'm not sure what the relay address is. Have you considered this as a workaround?

@KtorZ
Copy link
Member

KtorZ commented Jun 28, 2022

Yes but, I am not even sure that there are relays available...

@rhyslbw
Copy link
Member Author

rhyslbw commented Jun 28, 2022

Looks to be working for me with this topology, and p2p disabled in the node config.

{
  "Producers": [
    {
      "addr": "vasil-dev-node.world.dev.cardano.org",
      "port": 30001,
      "valency": 1
    }
  ]
}

@KtorZ
Copy link
Member

KtorZ commented Jun 28, 2022

IntersectMBO/ouroboros-network#3844 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working won't fix / upstream issue An issue which is either not covered by Ogmios, or requires changes upstream first.
Projects
None yet
Development

No branches or pull requests

3 participants