WebSocket immediate disconnect on vasil-dev network #230

rhyslbw · 2022-06-23T15:18:29Z

What Git revision are you using?

e3b53aa

input-output-hk/cardano-configurations@77ad26a

What operating system are you using, and which version?

Linux / Ubuntu
Linux / Other
OSX
Windows

Describe what the problem is?

Connections made to the Ogmios server on vasil-dev result in an immediate WebSocket close:

Server logs

cardano-node-ogmios_1  | {"severity":"Error","timestamp":"2022-06-23T15:17:04.942155567Z","thread":"49","message":{"WebSocket":{"exception":"MuxError MuxBearerClosed \"<socket: 18> closed when reading data, waiting on next header True\"","tag":"WebSocketUnknownException"}},"version":"v5.5.0"}
cardano-node-ogmios_1  | {"severity":"Info","timestamp":"2022-06-23T15:17:04.942280371Z","thread":"49","message":{"WebSocket":{"tag":"WebSocketConnectionEnded","userAgent":"User-Agent unknown"}},"version":"v5.5.0"}

To reproduce

Start an instance of cardano-node-ogmios in the vasil-dev network.
Change

ogmios/clients/TypeScript/packages/repl/src/index.ts

Line 29 in e3b53aa

const context = await createInteractionContext(console.error, () => {}, { connection })

to: const context = await createInteractionContext(console.error, (code) => { log(code) }, { connection }) (not sure why I didn't log the disconnect in the initial impl) 😶
In clients/TypeScript run yarn repl:start --port __
Observe the disconnect code 1006

What should be the expected behavior?

Repeat steps 1,2, and 4 with a testnet instance to observe correct behaviour of no disconnect, and the ability to issue commands.

The text was updated successfully, but these errors were encountered:

KtorZ · 2022-06-26T15:08:59Z

I've tried on the testnet and mainnet and as far as I can tell, this behavior doesn't occur there. I wonder if this has anything to do with the issues regarding other peers (namely, the one using an invalid network magic and the one that is propagating bad blocks).

KtorZ · 2022-06-27T08:07:13Z

Also, I just recalled: #208

Could it be that vasil-dev has p2p enabled ?

KtorZ · 2022-06-27T19:10:05Z

Trying to run the node with p2p disabled, I can confirmed that the behavior isn't present anymore. I can keep the repl open and do chain-sync just fine (or at least, for more than 3s and up until my current node's state). Though, I haven't yet figured what the topology configuration should be in non-p2p mode on vasil-dev, so my node is just hanging there not syncing.

This is the case also with 1.35.0. Running with EnableP2P causes the client connection to be terminated after 3s 🤷 ... And there's no trace on the node's side (even after enabling more logs on the local chain sync). I am.. puzzled.

KtorZ · 2022-06-27T19:11:43Z

What also puzzles me is that, there's also an internal client connection that Ogmios uses for the health heartbeat. This one is basically a chain-sync, and it remains up and running without problem 🤔

edit: So I've tried to artificially remove the health heartbeat and, it didn't solve the issue. However, I went a bit deeper in the rabbit hole and remove the second client connection made with each websocket (because each websocket actually starts two client connections, one for most protocols, and a side-one used to fetch protocols parameters & the like when submitting transactions or evaluating units). And this, seems to be the root cause of the issue. If I remove the second connection (which isn't needed for chain-sync) then the connection remains open (and even with the health heartbeat).

Now entering the realm of hypothesis: the second connection is mostly idle all-the-time, with the 'agency' on the client. So I think that the second connection is being terminated because it's idle, and thus, causes the first one to also terminates and the websocket to terminates in consequence. Thus, the exception we see "MuxBearerClosed" actually comes from us terminating the first connection, because the second is maybe terminated for some reason. I'll try to confirm that.

KtorZ · 2022-06-27T19:41:49Z

Hypothesis above ☝️ confirmed, so the second client (which is basically just idling waiting for a request) is indeed terminated automatically after ~3s somehow, causing the other to fail. If I simply add an error handling on that second client and forever restart it, then the rest seems fine.

So, we have a "quickfix" / mitigation, which is good. But it doesn't quite explain why the second client is terminated and why this only occurs when p2p is enabled (I'd expect p2p to be completely orthogonal to the local client protocols).

KtorZ · 2022-06-27T19:54:13Z

It seems that the problem comes with the (second) client connecting but not actually sending any message; causing the node to probably clean up the connection? I've slightly change the second client such that it sends a dummy message when connecting, and then awaits for requests and it seems to go through, at least a little longer (I can do a full chain-sync and the connection remains open until the end).

I assume that this is perhaps something configurable and now makes me wonder.. will the node automatically terminates the connection should there be no client activity for X minutes or hours. That'd be pretty annoying 😬 ... Need to ask Marcin.

It seems like, if we just go and idle right after opening the connection, the connection gets closed automatically after 3s. So, this commits tries to circumvent the issue by making sure that at least one message is sent when the connection is established, AND THEN (and only then), we await for client requests (which may never come if the client does not evaluate execution units). See more of the discussion in: #230. At this stage, I don't know if this really solves the problem or, if it simply postpones it to later.

KtorZ · 2022-06-27T20:15:14Z

@rhyslbw The commit above does partially fix the problem but it raises also a few questions. I say partially because, the same behavior can still be observed if the client does not send any request within a short enough time window (which seems to be ~2s). For example:

client.on('open', () => {
  setTimeout(() => {
    client.send(requestNext);
  }, 2000);
});

This fails for me and gives me an error 1006 connection closed. But, anything below that seems to work fine. And it seems to only be a problem for the first message. So, if I send a first request promptly, I can wait a lot longer between messages afterwards (I've been up to one minute without problem).

Again, this is a bit odd and I hope configurable so I can probably set that to infinity 😬

KtorZ · 2022-06-27T20:27:23Z

Marcin confirms that this is a strange behavior and, to quote him:

In P2P we have a timeout which shuts down a connection if nothing happens after one negotiated a connection. My hypothesis is that this timeout is also set for node-to-client connection, but it should only be set for node-to-node connections.

https://github.com/input-output-hk/ouroboros-network/blob/master/ouroboros-network/src/Ouroboros/Network/Diffusion/P2P.hs?plain=1#L258-L259

If my hypothesis is right, your solution will actually solve the problem (although the right thing is to remove the timeout from node-to-client protocol).

Looks like we're going to need a 1.35.1 😶

rhyslbw · 2022-06-28T05:04:26Z

Thanks for the detailed investigation notes @KtorZ. I'm attempting to see if we can work around this by joining the network without enabling p2p, but am stuck since I'm not sure what the relay address is. Have you considered this as a workaround?

KtorZ · 2022-06-28T06:41:41Z

Yes but, I am not even sure that there are relays available...

rhyslbw · 2022-06-28T08:20:47Z

https://github.com/input-output-hk/cardano-world/blob/d6bdf6024d94bd48c47e00c9316c25eecae26735/nix/cardano/environments/default.nix#L167-L186

rhyslbw · 2022-06-28T08:28:43Z

Looks to be working for me with this topology, and p2p disabled in the node config.

{
  "Producers": [
    {
      "addr": "vasil-dev-node.world.dev.cardano.org",
      "port": 30001,
      "valency": 1
    }
  ]
}

KtorZ · 2022-06-28T19:09:06Z

IntersectMBO/ouroboros-network#3844 🎉

rhyslbw added the bug Something isn't working label Jun 23, 2022

This comment was marked as off-topic.

Sign in to view

KtorZ mentioned this issue Jun 27, 2022

Cardano Node 1.34.1 p2p enabled. #208

Closed

This comment was marked as off-topic.

Sign in to view

KtorZ added the won't fix / upstream issue An issue which is either not covered by Ogmios, or requires changes upstream first. label Jun 27, 2022

coot mentioned this issue Jun 28, 2022

Change local_PROTOCOL_IDLE_TIMEOUT IntersectMBO/ouroboros-network#3843

Closed

KtorZ closed this as completed Jun 29, 2022

CarlosLopezDeLara mentioned this issue Jul 11, 2022

NODE VNEXT SCOPE IntersectMBO/cardano-node#4144

Open

40 tasks

KtorZ mentioned this issue Aug 29, 2022

Websocket unexpectedly closes on pre-prod, preview or vasil-dev #253

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WebSocket immediate disconnect on vasil-dev network #230

WebSocket immediate disconnect on vasil-dev network #230

rhyslbw commented Jun 23, 2022 •

edited

Loading

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

KtorZ commented Jun 26, 2022

This comment was marked as off-topic.

KtorZ commented Jun 27, 2022

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

KtorZ commented Jun 27, 2022

KtorZ commented Jun 27, 2022 •

edited

Loading

KtorZ commented Jun 27, 2022 •

edited

Loading

KtorZ commented Jun 27, 2022

KtorZ commented Jun 27, 2022

KtorZ commented Jun 27, 2022 •

edited

Loading

rhyslbw commented Jun 28, 2022

KtorZ commented Jun 28, 2022

rhyslbw commented Jun 28, 2022

rhyslbw commented Jun 28, 2022

KtorZ commented Jun 28, 2022

WebSocket immediate disconnect on vasil-dev network #230

WebSocket immediate disconnect on vasil-dev network #230

Comments

rhyslbw commented Jun 23, 2022 • edited Loading

What Git revision are you using?

What operating system are you using, and which version?

Describe what the problem is?

Server logs

To reproduce

What should be the expected behavior?

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

KtorZ commented Jun 26, 2022

This comment was marked as off-topic.

KtorZ commented Jun 27, 2022

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

KtorZ commented Jun 27, 2022

KtorZ commented Jun 27, 2022 • edited Loading

KtorZ commented Jun 27, 2022 • edited Loading

KtorZ commented Jun 27, 2022

KtorZ commented Jun 27, 2022

KtorZ commented Jun 27, 2022 • edited Loading

rhyslbw commented Jun 28, 2022

KtorZ commented Jun 28, 2022

rhyslbw commented Jun 28, 2022

rhyslbw commented Jun 28, 2022

KtorZ commented Jun 28, 2022

rhyslbw commented Jun 23, 2022 •

edited

Loading

KtorZ commented Jun 27, 2022 •

edited

Loading

KtorZ commented Jun 27, 2022 •

edited

Loading

KtorZ commented Jun 27, 2022 •

edited

Loading