Waku V2 prod fleet WebSockify connectivity issues #19
Hi @jakubgs. I've summarised some of the problems I see on the prod fleet in the issue description.
See #20 (comment) for deployment notes.
Also, yeah, separate problems require separate issues. But whatever.
I purged the SQLite peer databases with echo 'DELETE FROM Peer;' | sudo sqlite3 data/store.sqlite3, then re-ran Ansible for all the prod hosts, and the connections now look correct:
{
"jsonrpc": "2.0",
"id": 1,
"result": [
{
"multiaddr": "/ip4/34.121.100.108/tcp/30303/p2p/16Uiu2HAmVkKntsECaYfefR1V2yCR79CegLATuTPE6B9TxgxBiiiA",
"protocol": "/vac/waku/relay/2.0.0-beta2",
"connected": true
},
{
"multiaddr": "/ip4/8.210.222.231/tcp/30303/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD",
"protocol": "/vac/waku/relay/2.0.0-beta2",
"connected": true
},
{
"multiaddr": "/ip4/188.166.135.145/tcp/9000/p2p/16Uiu2HAmUsvgMBF7yhzFMumEcccjWfg2EPCSXtE33Tbs8jNQj2bS",
"protocol": "/vac/waku/relay/2.0.0-beta2",
"connected": true
}
]
}
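For reference, the peer list above can be pulled from the node's admin API. A minimal sketch, assuming the nwaku JSON-RPC server is enabled and listening on localhost:8545; the method name and port are assumptions, check the node's RPC configuration:

```bash
# Query the node's admin API for known peers and their connection state.
# Assumes the JSON-RPC server listens on 127.0.0.1:8545 and the admin API is enabled.
curl -s -X POST http://127.0.0.1:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"get_waku_v2_admin_v1_peers","params":[]}' | jq .
```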
There are certainly some logs indicating connectivity issues:
But errors like that are usually caused by clients, not the server.
There is no proxy in place; the container exposes port 443 directly:
So there is no reverse proxy tuning we could possibly do.
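A quick way to confirm nothing sits in front of the container; a sketch assuming Docker is used and the container name contains "websockify" (the name filter is an assumption):

```bash
# Show which container publishes port 443 and confirm no other process listens on it.
# Adjust the name filter to match the actual deployment.
docker ps --filter 'name=websockify' --format '{{.Names}}\t{{.Ports}}'
sudo ss -tlnp | grep ':443'
```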
@jm-clius your report about WebSocket connectivity issues is very vague. As far as I can tell there is nothing wrong with the setup. You have to give me a way to reproduce this issue before I can start debugging it.
Thanks for the fixes, @jakubgs. I will monitor on our side, but things definitely look better.
Yeah, I added the …
I'm witnessing connectivity issues on the js-waku side: waku-org/js-waku#185
@D4nte is this still relevant?
All good on my side, so this comment #19 (comment) can now be ignored.
I'm closing this then. Feel free to reopen if this shows up again.
We've seen some new issues with Websockify today. One thing I've noticed is a lot of processes in the containers:
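One way to check that; a sketch assuming a single Docker container whose name contains "websockify" (the name is an assumption):

```bash
# Count processes in the websockify container. docker top uses the host's ps,
# so nothing needs to be installed in the container; the first line is a header.
# A steadily growing count usually means handlers are not reaped after clients drop.
docker top "$(docker ps -qf name=websockify)" | wc -l
```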
And we can see that the websocket healthcheck is failing:
On some hosts it just takes a really long time to respond.
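A simple probe for that kind of slowness; a sketch assuming websockify terminates TLS on port 443 of the host (the hostname below is a placeholder):

```bash
# Time a WebSocket handshake against websockify on port 443.
HOST=node-01.do-ams3.wakuv2.prod   # placeholder, substitute the real public hostname
curl -skN --max-time 10 "https://${HOST}/" \
  -H 'Connection: Upgrade' \
  -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Version: 13' \
  -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
  -o /dev/null -w 'status %{http_code}, first byte after %{time_starttransfer}s\n'
# A healthy endpoint returns status 101 with a sub-second first-byte time;
# curl then holds the upgraded connection open until --max-time expires.
```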
Here are some errors we can find in the logs:
I'm wondering if the processes aren't stuck in some weird state after a connection breaks off uncleanly.
It's definitely not a file descriptor limit:
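How that can be verified; a sketch assuming the websockify process is visible from the host:

```bash
# Compare the websockify process's open file descriptors against its limit.
PID=$(pgrep -f websockify | head -1)        # first matching PID; adjust if several run
echo "open fds: $(sudo ls /proc/"$PID"/fd | wc -l)"
sudo grep 'Max open files' "/proc/$PID/limits"
```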
Although the RPC call shows only 13 of them:
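The mismatch between the libp2p connection count and the RPC peer list can be quantified like this; same assumptions about the admin RPC endpoint as in the sketch above:

```bash
# Count how many peers the admin API reports, and how many of those are connected.
curl -s -X POST http://127.0.0.1:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"get_waku_v2_admin_v1_peers","params":[]}' \
  | jq '{total: (.result | length), connected: ([.result[] | select(.connected)] | length)}'
```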
Ok, so this time it's due to overuse of resources. I continued the discussion here: https://discord.com/channels/864066763682218004/865467749923684403/887493985256497172 Happy to move it to GitHub :) @jakubgs @jm-clius @oskarth what are your thoughts on the matter re resource expenditure (see my Discord message)?
No it's not. The issue is a peer limit set at 50.
Indeed, the libp2p max number of connections is set to 50:
This does not make sense. The RPC call should only show the peers that were explicitly added using RPC.
@jm-clius I have no idea, but look at the IPs of the peers other than the first 3, which are just the cluster ones. The IPs are either from …
I've purged the …
@jm-clius Does the 50-peer limit make sense? Should we increase this limit?
The limit here is a maximum of 50 simultaneous connections (I suspect we can store many more peers), which I think makes sense. (@Menduist, have you found that there is a better practical upper limit for connections for, say, Nimbus?) The solution is to have clients connect to other nodes than just the cluster. If this is difficult to get done in a short time, we can increase the connection limit, but I'd prefer this to be a temporary measure unless we have a good understanding of what impact more connections will have.
You can have as many peers as you want, but that will generally cost more bandwidth & CPU. The only "surprising" upper limit is that an ISP's router can crash or overheat if you have too many connections, so for software running at home you should keep the peer count to a minimum. For beefy servers it's not an issue.
Ok, interesting! Thanks, @Menduist. Perhaps we can then go up quite a bit from the current default. @D4nte I think we should make this a configurable CLI parameter and run the cluster nodes with a much higher connection limit (probably …
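Purely as an illustration of what that could look like once such a flag exists; the flag name and value below are assumptions, not (at the time of this thread) an existing wakunode2 option:

```bash
# Hypothetical invocation: run the cluster nodes with a raised connection cap.
# --max-connections is an assumed flag name; check the wakunode2 --help output
# once the option is actually implemented.
./build/wakunode2 --max-connections=150
```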
Issues are being reported as we have a hackathon at the moment. I can see a number of exceptions in Elastic:
According to novnc/websockify#439 this happens with a "large amount of binary data", which could fit our scenario depending on the messages being sent/received. The fix has been released in websockify 0.10.0; we currently use 0.9.0. Can we please upgrade websockify? @jakubgs @arthurk As we have a hackathon in progress, any chance this could be done now?
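How the running version can be confirmed before and after the upgrade; a sketch assuming a pip-installed websockify inside a Docker container named "websockify" (both are assumptions about this fleet's deployment):

```bash
# Check which websockify version the container is actually running.
docker exec websockify pip3 show websockify | grep -i '^Version'
# If it is pip-installed rather than pinned via an image tag in Ansible,
# the upgrade itself (inside the container or the image build) is just:
pip3 install --upgrade 'websockify>=0.10.0'
```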
Apparently `0.9.0` has an issue with transferring a "large amount of binary data". This results in an exception like this:

```
handler exception: 'str' object has no attribute 'decode'
```

Issues:

* novnc/websockify#439
* #19

Signed-off-by: Jakub Sokołowski <[email protected]>
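For context on that traceback: the error itself is simply what Python 3 raises when .decode() is called on data that is already a str rather than bytes. A one-liner reproducing the same exception:

```bash
# In Python 3 only bytes have .decode(); calling it on a str raises exactly
# the error seen in the websockify 0.9.0 logs.
python3 -c '"frame data".decode()'
# AttributeError: 'str' object has no attribute 'decode'
```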
I've deployed the new version. So far we see no issues, but will continue to monitor.
Thank you for that. The …
Nice. Feel free to close this issue if you consider this resolved. Though as I said, native WebSocket support in the node would be much easier to manage and support.
Happy to close. @jm-clius, happy too?
Happy to close. I'll go ahead. :)
I've noticed the following possible issues with the Waku v2 prod fleet:

Logging/Kibana issues (Resolved)

- No logs from prod nodes on Kibana since ~11:13 UTC on the 14th of June. (In fact, the only Waku v2 service that's still logging seems to be the one on host node-01.gc-us-central1-a.wakuv2.test.)
- node-01.do-ams3.wakuv2.prod disappeared even earlier from Kibana (9th of June).

Connectivity to node-01.do-ams3.wakuv2.prod (Resolved)

1. prod nodes do not seem to be (stably) connected to node-01.do-ams3.wakuv2.prod. From available logs, prod nodes seemed to be trying to reach the peerId for node-01.do-ams3.wakuv2.prod at the wire address for node-01.ac-cn-hongkong-c.wakuv2.prod (/ip4/8.210.222.231/tcp/30303). Perhaps some earlier inconsistency in the connect.yml script? Since these get persisted, we may need to drop the Peer table on each of the prod nodes before running the connect script again.
2. node-01.do-ams3.wakuv2.prod was not updated with the other two nodes after deploying to prod. This is based on the other two nodes complaining that it does not support the same protocols as them. (Though this could just be a side-effect of issue (1) above.)

Possible websockify issues

- websockify (on all hosts) logs frequent errors (e.g. "Connection reset", "no shared cipher", "bad syntax", "Method not allowed") on available logs; a quick way to count these is sketched below.
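A quick way to gauge how frequent those websockify errors are; a sketch assuming the container is named "websockify" (the name and the exact log phrasing are assumptions based on the strings quoted above):

```bash
# Count the recurring websockify error patterns in the container logs.
docker logs websockify 2>&1 | \
  grep -ciE 'Connection reset|no shared cipher|bad syntax|Method not allowed'
```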