WebSocket immediate disconnect on vasil-dev network #230
I've tried on the testnet and mainnet and, as far as I can tell, this behavior doesn't occur there. I wonder if this has anything to do with the issues regarding other peers (namely, the one using an invalid network magic and the one that is propagating bad blocks).
Also, I just recalled #208. Could it be that vasil-dev has p2p enabled?
Running the node with p2p disabled, I can confirm that the behavior isn't present anymore. I can keep the repl open and do chain-sync just fine (or at least, for more than 3s and up until my current node's state). Though, I haven't yet figured out what the topology configuration should be in non-p2p mode on This is the case also with
What also puzzles me is that there's also an internal client connection that Ogmios uses for the health heartbeat. This one is basically a chain-sync, and it remains up and running without problem 🤔

edit: So I've tried to artificially remove the health heartbeat, and it didn't solve the issue. However, I went a bit deeper down the rabbit hole and removed the second client connection made with each websocket (because each websocket actually starts two client connections: one for most protocols, and a side one used to fetch protocol parameters and the like when submitting transactions or evaluating execution units). And this seems to be the root cause of the issue. If I remove the second connection (which isn't needed for chain-sync), then the connection remains open (even with the health heartbeat).

Now entering the realm of hypothesis: the second connection is mostly idle all the time, with the 'agency' on the client. So I think that the second connection is being terminated because it's idle, which causes the first one to also terminate and, in consequence, the websocket to terminate. Thus, the "MuxBearerClosed" exception we see actually comes from us terminating the first connection, because the second one was terminated for some reason. I'll try to confirm that.
Hypothesis above ☝️ confirmed: the second client (which is basically just idling, waiting for a request) is indeed terminated automatically after ~3s somehow, causing the other to fail. If I simply add error handling on that second client and restart it forever, then the rest seems fine. So, we have a "quickfix" / mitigation, which is good. But it doesn't quite explain why the second client is terminated and why this only occurs when p2p is enabled (I'd expect p2p to be completely orthogonal to the local client protocols).
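To illustrate the "restart forever" mitigation described above, here is a minimal TypeScript sketch. It is not the actual Ogmios change (the server is written in Haskell); `superviseForever`, `runClient`, `shouldStop` and `onRestart` are hypothetical names introduced only for illustration, and `runClient` stands in for whatever opens the idle secondary connection.

```typescript
// Hypothetical stand-in for the idle secondary client: the promise settles
// (resolves or rejects) when the underlying connection ends.
type Client = () => Promise<void>;

// Keep restarting the client whenever it ends, until `shouldStop` says so.
// Returns the number of restarts that were performed.
async function superviseForever(
  runClient: Client,
  shouldStop: () => boolean,
  onRestart: (err: unknown) => void = () => {},
  delayMs = 100,
): Promise<number> {
  let restarts = 0;
  while (!shouldStop()) {
    try {
      await runClient();
    } catch (err) {
      // e.g. the connection was closed by the node while idle
      onRestart(err);
    }
    if (shouldStop()) break;
    restarts += 1;
    // Small pause so a permanently failing client does not spin the CPU.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return restarts;
}
```

The key design point is that a failure of the secondary client is contained in the loop instead of propagating to the primary connection and the websocket.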
It seems that the problem comes from the (second) client connecting but not actually sending any message, which probably causes the node to clean up the connection. I've slightly changed the second client so that it sends a dummy message when connecting and then awaits requests, and it seems to go through, at least a little longer (I can do a full chain-sync and the connection remains open until the end). I assume that this is perhaps something configurable, and it now makes me wonder: will the node automatically terminate the connection if there is no client activity for X minutes or hours? That'd be pretty annoying 😬 ... Need to ask Marcin.
It seems that if we just go and idle right after opening the connection, the connection gets closed automatically after 3s. So, this commit tries to circumvent the issue by making sure that at least one message is sent when the connection is established, AND THEN (and only then) we await client requests (which may never come if the client does not evaluate execution units). See more of the discussion in #230. At this stage, I don't know if this really solves the problem or if it simply postpones it.
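The ordering described in that commit message (send one message first, only then wait for requests) can be sketched as follows. This is an illustration in TypeScript, not the Haskell code from the commit; `NodeConnection`, `runSecondaryClient`, `nextRequest` and the `warmUp` payload are all hypothetical.

```typescript
// Hypothetical minimal view of a connection to the node.
interface NodeConnection {
  send(message: string): void;
}

// Send a warm-up message as soon as the connection is established, and only
// then start waiting for real requests. `nextRequest` resolves with the next
// request to forward, or with null when the client is done.
async function runSecondaryClient(
  connection: NodeConnection,
  nextRequest: () => Promise<string | null>,
  warmUp: string,
): Promise<void> {
  connection.send(warmUp); // avoids idling right after connecting
  for (;;) {
    const request = await nextRequest(); // may wait for a long time
    if (request === null) return;
    connection.send(request);
  }
}
```

As noted in the thread, this only guarantees activity right after connecting; it does nothing about long idle periods later on.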
@rhyslbw The commit above partially fixes the problem, but it also raises a few questions. I say partially because the same behavior can still be observed if the client does not send any request within a short enough time window (which seems to be ~2s). For example:

client.on('open', () => {
  setTimeout(() => {
    client.send(requestNext);
  }, 2000);
});

This fails for me and gives me an error. Again, this is a bit odd, and I hope it is configurable so that I can set it to infinity 😬
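For comparison with the delayed example above, a client can send its first request directly in the `open` handler, which stays inside the ~2s window observed in this thread. The sketch below assumes only a socket with `on('open', ...)` and `send(...)`; `SocketLike` and `sendOnOpen` are hypothetical names, and the request payload is left to the caller. Whether this avoids the disconnect in every case is not established here.

```typescript
// Hypothetical minimal view of a WebSocket client.
interface SocketLike {
  on(event: "open", listener: () => void): void;
  send(data: string): void;
}

// Register a handler that sends the first request as soon as the socket
// opens, without any artificial delay.
function sendOnOpen(client: SocketLike, firstRequest: string): void {
  client.on("open", () => {
    client.send(firstRequest);
  });
}
```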
Marcin confirms that this is a strange behavior and, to quote him:
Looks like we're going to need a
Thanks for the detailed investigation notes @KtorZ. I'm attempting to see if we can work around this by joining the network without enabling p2p, but am stuck since I'm not sure what the relay address is. Have you considered this as a workaround?
Yes, but I am not even sure that there are relays available...
Looks to be working for me with this topology, and p2p disabled in the node config.

{
  "Producers": [
    {
      "addr": "vasil-dev-node.world.dev.cardano.org",
      "port": 30001,
      "valency": 1
    }
  ]
}
What Git revision are you using?
e3b53aa
input-output-hk/cardano-configurations@77ad26a
What operating system are you using, and which version?
Describe what the problem is?
Connections made to the Ogmios server on vasil-dev result in an immediate WebSocket close:
Server logs
To reproduce
cardano-node-ogmios in the vasil-dev network.
ogmios/clients/TypeScript/packages/repl/src/index.ts, line 29 in e3b53aa:
const context = await createInteractionContext(console.error, (code) => { log(code) }, { connection })
(not sure why I didn't log the disconnect in the initial impl) 😶
clients/TypeScript
run yarn repl:start --port __
1006
What should be the expected behavior?
Repeat steps 1,2, and 4 with a testnet instance to observe correct behaviour of no disconnect, and the ability to issue commands.
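For readers unfamiliar with the 1006 code logged in the reproduction steps: in RFC 6455 it is a reserved value meaning the connection closed abnormally, without a Close frame being received. A small helper like the one below (a sketch; `describeCloseCode` is a hypothetical name, not part of the Ogmios client) can make the logged code easier to read. Only a few well-known codes are covered.

```typescript
// Map a few well-known WebSocket close codes (RFC 6455, section 7.4.1)
// to short descriptions. Unknown codes fall through to a generic label.
function describeCloseCode(code: number): string {
  switch (code) {
    case 1000:
      return "normal closure";
    case 1001:
      return "going away";
    case 1006:
      // Reserved: never sent on the wire; reported locally when the
      // connection drops without a Close frame.
      return "abnormal closure (no close frame received)";
    default:
      return `unrecognized close code ${code}`;
  }
}
```

It could be used in the close handler passed to `createInteractionContext`, for example `(code) => { log(describeCloseCode(code)) }`.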