
Connection dropped/timed out when connecting to deployed agent #414

Closed
emmacasolin opened this issue Jul 13, 2022 · 19 comments · Fixed by #445
Labels
bug Something isn't working r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@emmacasolin
Contributor

Describe the bug

Occasionally, when attempting to connect to a deployed agent, we see connection timeouts. Sometimes, this also results in a connection being dropped (and shutting down both agents if the connection is attempted during pk agent start with the deployed agent specified as a seed node). A specific occurrence of both agents shutting down ended with this error on the local agent:

{"type":"ErrorPolykeyRemote","data":{"message":"","timestamp":"2022-07-13T06:34:55.448Z","data":{},"cause":{"type":"ErrorPolykeyRemote","data":{"message":"","timestamp":"2022-07-13T06:34:55.409Z","data":{},"cause":{"type":"ErrorConnectionStartTimeout","data":{"message":"","timestamp":"2022-07-13T06:34:55.363Z","data":{},"description":"Connection start timed out","exitCode":68}},"description":"Remote error from RPC call","exitCode":68,"metadata":{"nodeId":"v5neee032adia1v5u8lj483hhb5bp45ejb8knlmtbuei00lvupudg","host":"54.206.24.79","port":1314,"command":"nodesHolePunchMessageSend"}}},"stack":"ErrorPolykeyRemote\n    at toError (/home/emma/Projects/js-polykey/src/grpc/utils/utils.ts:203:29)\n    at Object.callback (/home/emma/Projects/js-polykey/src/grpc/utils/utils.ts:381:17)\n    at Object.onReceiveStatus (/home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/client.ts:351:26)\n    at /home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/call-stream.ts:206:27\n    at Object.onReceiveStatus (/home/emma/Projects/js-polykey/src/grpc/utils/FlowCountInterceptor.ts:57:13)\n    at InterceptingListenerImpl.onReceiveStatus (/home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/call-stream.ts:202:19)\n    at Object.onReceiveStatus (/home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/client-interceptors.ts:462:34)\n    at Object.onReceiveStatus (/home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/client-interceptors.ts:424:48)\n    at /home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/call-stream.ts:330:24\n    at processTicksAndRejections (node:internal/process/task_queues:78:11)","description":"Remote error from RPC call","exitCode":68,"metadata":{"nodeId":"v0ntccuvju21203s05i11jd9o3d35ae08d29csledv3ntpbjjatcg","host":"54.253.9.3","port":1314,"command":"nodesHolePunchMessageSend"}}}

And this series of errors/warnings on the deployed agent:

INFO:NodeConnectionManager:Getting connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:existing entry found for vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:withConnF calling function with connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
WARN:ConnectionReverse 129.94.8.25:31843:Reverse Error: ErrorConnectionTimeout
INFO:ConnectionReverse 129.94.8.25:31843:Stopping Connection Reverse
WARN:ConnectionForward 129.94.8.25:31843:Forward Error: ErrorConnectionTimeout
INFO:ConnectionForward 129.94.8.25:31843:Stopping Connection Forward
INFO:clientFactory:Destroying GRPCClientAgent connected to 129.94.8.25:31843
INFO:NodeConnection 129.94.8.25:31843:Destroying NodeConnection
INFO:NodeConnection 129.94.8.25:31843:Destroyed NodeConnection
INFO:clientFactory:Destroyed GRPCClientAgent connected to 129.94.8.25:31843
{
  "type": "ErrorGRPCClientCall",
  "data": {
    "message": "14 UNAVAILABLE: Connection dropped",
    "timestamp": "2022-07-13T06:35:15.212Z",
    "data": {
      "code": 14,
      "details": "Connection dropped",
      "metadata": {}
    },
    "stack": "ErrorGRPCClientCall: 14 UNAVAILABLE: Connection dropped\n    at toError (/lib/node_modules/@matrixai/polykey/dist/grpc/utils/utils.js:174:24)\n    at Object.callback (/lib/node_modules/@matrixai/polykey/dist/grpc/utils/utils.js:331:25)\n    at Object.onReceiveStatus (/lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/client.js:189:36)\n    at /lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35\n    at Object.onReceiveStatus (/lib/node_modules/@matrixai/polykey/dist/grpc/utils/FlowCountInterceptor.js:80:29)\n    at InterceptingListenerImpl.onReceiveStatus (/lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)\n    at Object.onReceiveStatus (/lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)\n    at Object.onReceiveStatus (/lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)\n    at /lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78\n    at processTicksAndRejections (node:internal/process/task_queues:78:11)",
    "description": "Generic call error",
    "exitCode": 69
  }
}

To Reproduce

The exact cause of the error is still unknown; however, it may be reproducible by connecting to a deployed agent as a seed node:

  npm run polykey -- agent start --seednodes="[deployedAgentNodeId]@[deployedAgentHost]:[deployedAgentPort]" --verbose --format json

Expected behavior

Even if a connection times out, this shouldn't kill an agent, especially not the seed node. The seed node should be able to ignore a failed connection, even if it was the one that caused the failure, since we need seed nodes to be stable.

@emmacasolin
Contributor Author

Interestingly, I'm having a lot less difficulty connecting to deployed agents on my home wifi compared to uni wifi, and I haven't had the connection dropped like it was yesterday. This may just be our agents not being able to punch through stronger NATs, so this issue may be relevant when it comes to working on #383

@CMCDragonkai
Member

The testnet isn't behind any NAT. It has public IP addresses, so when connecting from a local agent to the testnet there isn't any punch-through. As soon as your local agent initiates a message, your local NAT would allow reverse packets from the same IP+port.

@emmacasolin
Contributor Author

emmacasolin commented Jul 14, 2022

The testnet isn't behind any NAT. It has public IP addresses, so when connecting from a local agent to the testnet there isn't any punch-through. As soon as your local agent initiates a message, your local NAT would allow reverse packets from the same IP+port.

Yes, I meant in the case of the testnet trying to connect back to me. If the conntrack entry has expired then the testnet would not be able to reach me, if that's indeed what it's attempting to do. These connection re-attempts (coming from the testnet) have been continuing for 20+ hours after my local agent was killed, probably longer, but that's the longest any of the agents showing this behaviour have been active.

@CMCDragonkai
Member

CMCDragonkai commented Jul 14, 2022

Ah yes, that can happen.

Now that I think about it. What is the reason for the testnet attempting to connect back? It has something to do with DHT queries I believe, but in that case, one must expect failures to reconnect in case your local node goes offline. Is the connection attempt solely at the proxy level (proxy connection), or at the grpc connection level (node connection)?

Are there other reasons? @tegefaulkes

For punching back to your local node, you have to have #383 done too. Plus your node is pretty transient, since it's mobile on the Uni wifi network and the Uni network itself would have a NAT. Note that #383 isn't solely about 2 levels of NATs, but really any number of levels of NATs.

Tailscale/tailnet is a good source of inspiration for how to address some of this.

@CMCDragonkai
Member

Furthermore, if reconnection attempts fail for any arbitrary node on the network, these may make very noisy logs, because there would be lots of failed network connection attempts in the future as the network scales. I reckon structured logging will need to be used to help filter out these logs.

@emmacasolin
Contributor Author

Now that I think about it. What is the reason for the testnet attempting to connect back? It has something to do with DHT queries I believe, but in that case, one must expect failures to reconnect in case your local node goes offline. Is the connection attempt solely at the proxy level (proxy connection), or at the grpc connection level (node connection)?

While my agent is online, the connection attempts are at the grpc connection/node connection level, but once it goes offline I believe it's only at the proxy level (since the only logs are coming from ConnectionForward).

@tegefaulkes
Contributor

AFAIK the only times a node will automatically try to connect to another node are:

  • initial start-up discovery with syncNodeGraph.
  • the discovery queue doing its thing.
  • the refresh buckets queue, initially right after startup and then for every bucket every hour. This will generate a lot of noise since it will do a find node operation for 255 buckets every hour. In a small network we will only find the same nodes for each operation. We still need to add some way to cancel refresh bucket operations if we're not going to find any new nodes.
  • the ping node queue, which should only happen in the background if a bucket is full, so it shouldn't be doing anything here. And I think pingNode only opens a proxy connection and not a full nodeConnection.

If the nodes are continuously trying to connect over and over again then I'd check:

  1. That the queues are not the culprit. I don't know if all of them avoid duplicates when adding to the queue; it should be easy to check if you add some debug logging on the agents for the queues (see the sketch after this list).
  2. That connections which fail to connect are actually timing out and giving up.
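
As a rough illustration of point 1, a queue add could log and skip duplicates before enqueueing. This is a minimal sketch only; the class and method names are hypothetical, not Polykey's actual queue implementation:

  // Hypothetical sketch: log and skip duplicate node IDs before enqueueing,
  // which would quickly show whether a queue keeps re-adding the same node.
  type NodeId = string;

  class DedupQueue {
    protected pending: Set<NodeId> = new Set();
    protected queue: Array<NodeId> = [];

    public push(nodeId: NodeId): void {
      if (this.pending.has(nodeId)) {
        console.debug(`skipping duplicate enqueue for ${nodeId}`);
        return;
      }
      this.pending.add(nodeId);
      this.queue.push(nodeId);
      console.debug(`enqueued ${nodeId}, queue length is ${this.queue.length}`);
    }

    public shift(): NodeId | undefined {
      const nodeId = this.queue.shift();
      if (nodeId != null) this.pending.delete(nodeId);
      return nodeId;
    }
  }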

@CMCDragonkai
Member

This problem is similar to #418. This may also be an unhandled promise rejection.

I believe our PK agent should have an unhandled promise rejection handler registered at the top level. If it catches one, it should report the error appropriately and also exit the program. We should have this done inside bin/polykey-agent.ts.

At the moment there is no unhandled promise rejection handler.

We can add this in to aid in our debugging.

At the same time, we could also address: #307.

See: https://nodejs.org/api/process.html#event-unhandledrejection

We should have 2 new exceptions at the top level:

ErrorPolykeyUnhandledRejection
ErrorPolykeyAsynchronousDeadlock

These 2 indicate software bugs, so they should use sysexits.SOFTWARE as the sysexit code.

Actually, I just remembered bin/utils/ExitHandlers. That is already being registered for signals, but not for these problems. So the implementation of these handlers should be done in ExitHandlers.
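
A minimal sketch of what registering such a handler could look like, assuming the proposed ErrorPolykeyUnhandledRejection class and sysexits.SOFTWARE being the conventional value 70; the stand-in class below is illustrative, not the project's actual error base:

  import process from 'process';

  // Stand-in for the proposed exception; the real class would extend the
  // project's error base and use sysexits.SOFTWARE as its exit code.
  class ErrorPolykeyUnhandledRejection extends Error {
    public readonly name: string = 'ErrorPolykeyUnhandledRejection';
    public readonly exitCode: number = 70; // sysexits.SOFTWARE
  }

  process.on('unhandledRejection', (reason) => {
    const error = new ErrorPolykeyUnhandledRejection(
      `Unhandled promise rejection: ${String(reason)}`,
    );
    // Report the wrapped error and exit, since this indicates a software bug
    process.stderr.write(`${error.name}: ${error.message}\n`);
    process.exit(error.exitCode);
  });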

@CMCDragonkai
Member

I don't have a solution for this bug yet. @tegefaulkes can add the handlers, and let's see if we can reproduce this AFTER we have solved the other bugs #418 and #415.

@tegefaulkes
Contributor

I just looked: src/bin/utils/ExitHandlers.ts:111 already registers handlers for unhandledRejection and uncaughtException. The handler prints the error to stderr and exits the process, so this is already being handled.

Right now it will set the process exit code to the error's code if it's an ErrorPolykey; otherwise the code is set to 255 before exiting. It doesn't wrap non-ErrorPolykey errors into anything else, and there doesn't seem to be a need for that.
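
For illustration, the described behaviour amounts to roughly the following (a simplified sketch, not the actual ExitHandlers code; the ErrorPolykey stand-in just shows the exitCode property being used):

  // Minimal stand-in for the ErrorPolykey base class from src/errors.ts,
  // which carries an exitCode property.
  class ErrorPolykey extends Error {
    public exitCode: number = 1;
  }

  // Simplified sketch of the handler: print the error, then prefer the
  // error's own exit code when it's an ErrorPolykey, otherwise use 255.
  function errorHandler(error: unknown): void {
    process.stderr.write(`${String(error)}\n`);
    process.exitCode = error instanceof ErrorPolykey ? error.exitCode : 255;
    process.exit();
  }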

With #414 (comment) already addressed, I don't know what else to do for this issue.

@tegefaulkes
Contributor

I could resolve #307 for detecting promise deadlocks.

@CMCDragonkai
Member

Yes, the ExitHandlers class has an errorHandler property that is used for both unhandledRejection and uncaughtException.

However, we want to change this to rethrow a special set of exceptions: #414 (comment)

That way it is clear to us that these are unrecoverable errors. In fact, the presence of these indicates a software bug.

@CMCDragonkai
Member

Since they are all unrecoverable errors, we can use sysexits.SOFTWARE as the software bug code.

Regardless of whether it is an instance of ErrorPolykey or not, it should be wrapped up.

ErrorUnhandledRejection
ErrorUncaughtException
ErrorAsynchronousDeadlock

We should not use the ErrorPolykey prefix here. These are actually not about PK at all; they are Node.js runtime problems.

I suggest putting all of these exceptions into the ExitHandlers class itself.

@tegefaulkes
Contributor

We don't want these errors to extend ErrorPolykey, so we need a different error for them to extend. Currently, ErrorPolykey is the base class we use to create all of our errors, and it's the class that adds the exitCode and description properties.

I'm going to rename ErrorPolykey to GenericError (or whatever is a better name) and have ErrorPolykey and the other errors above extend that.

@tegefaulkes
Contributor

I'll hold off on that change for now until I decide the best action.

  1. Would ErrorGeneric be in its own file, separate from ErrorPolykey.ts?
  2. Would ErrorCLI (ErrorBin) extend ErrorPolykey or ErrorGeneric?

@CMCDragonkai
Member

CMCDragonkai commented Sep 20, 2022

Ok, well in that case it should be ErrorBinUnhandledRejection, ErrorBinUncaughtException, and ErrorBinAsynchronousDeadlock.

And have the base error in bin/errors be ErrorBin.

@CMCDragonkai
Member

Consider this:

ErrorBin

ErrorBinUncaughtException extends ErrorBin
ErrorBinUnhandledRejection extends ErrorBin
ErrorBinAsynchronousDeadlock extends ErrorBin

ErrorCLI extends ErrorBin

ErrorCLI...
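
A minimal sketch of that hierarchy, assuming plain Error as the base for brevity (the real classes would extend the project's error base) and sysexits.SOFTWARE being 70:

  const SOFTWARE = 70; // sysexits.SOFTWARE, used for software-bug errors

  class ErrorBin extends Error {}

  class ErrorBinUncaughtException extends ErrorBin {
    public readonly exitCode: number = SOFTWARE;
  }

  class ErrorBinUnhandledRejection extends ErrorBin {
    public readonly exitCode: number = SOFTWARE;
  }

  class ErrorBinAsynchronousDeadlock extends ErrorBin {
    public readonly exitCode: number = SOFTWARE;
  }

  class ErrorCLI extends ErrorBin {}

  // The existing CLI errors (ErrorCLI...) would then extend ErrorCLI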

@CMCDragonkai
Member

Change the 2 ErrorNode* in src/bin/errors to also be ErrorCLI....

Also, in src/errors.ts, what are these 2:

  class ErrorInvalidId<T> extends ErrorPolykey<T> {}
  
  class ErrorInvalidConfigEnvironment<T> extends ErrorPolykey<T> {}

@CMCDragonkai
Member

Nothing is using those last 2 errors. They must have been forgotten and can be removed.

tegefaulkes added a commit that referenced this issue Sep 20, 2022
Un-recoverable errors include `ErrorBinUncaughtException`, `ErrorBinUnhandledRejection` and `ErrorBinAsynchronousDeadlock`.

#414
tegefaulkes added a commit that referenced this issue Sep 21, 2022
Un-recoverable errors include `ErrorBinUncaughtException`, `ErrorBinUnhandledRejection` and `ErrorBinAsynchronousDeadlock`.

#414
tegefaulkes added a commit that referenced this issue Sep 21, 2022
Un-recoverable errors include `ErrorBinUncaughtException`, `ErrorBinUnhandledRejection` and `ErrorBinAsynchronousDeadlock`.

#414
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 10, 2023