
Connection dropped/timed out when connecting to deployed agent #414

Closed
emmacasolin opened this issue Jul 13, 2022 · 19 comments · Fixed by #445
Labels
bug Something isn't working r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@emmacasolin
Contributor

Describe the bug

Occasionally, when attempting to connect to a deployed agent, we see connection timeouts. Sometimes, this also results in a connection being dropped (and shutting down both agents if the connection is attempted during pk agent start with the deployed agent specified as a seed node). A specific occurrence of both agents shutting down ended with this error on the local agent:

{"type":"ErrorPolykeyRemote","data":{"message":"","timestamp":"2022-07-13T06:34:55.448Z","data":{},"cause":{"type":"ErrorPolykeyRemote","data":{"message":"","timestamp":"2022-07-13T06:34:55.409Z","data":{},"cause":{"type":"ErrorConnectionStartTimeout","data":{"message":"","timestamp":"2022-07-13T06:34:55.363Z","data":{},"description":"Connection start timed out","exitCode":68}},"description":"Remote error from RPC call","exitCode":68,"metadata":{"nodeId":"v5neee032adia1v5u8lj483hhb5bp45ejb8knlmtbuei00lvupudg","host":"54.206.24.79","port":1314,"command":"nodesHolePunchMessageSend"}}},"stack":"ErrorPolykeyRemote\n    at toError (/home/emma/Projects/js-polykey/src/grpc/utils/utils.ts:203:29)\n    at Object.callback (/home/emma/Projects/js-polykey/src/grpc/utils/utils.ts:381:17)\n    at Object.onReceiveStatus (/home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/client.ts:351:26)\n    at /home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/call-stream.ts:206:27\n    at Object.onReceiveStatus (/home/emma/Projects/js-polykey/src/grpc/utils/FlowCountInterceptor.ts:57:13)\n    at InterceptingListenerImpl.onReceiveStatus (/home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/call-stream.ts:202:19)\n    at Object.onReceiveStatus (/home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/client-interceptors.ts:462:34)\n    at Object.onReceiveStatus (/home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/client-interceptors.ts:424:48)\n    at /home/emma/Projects/js-polykey/node_modules/@grpc/grpc-js/src/call-stream.ts:330:24\n    at processTicksAndRejections (node:internal/process/task_queues:78:11)","description":"Remote error from RPC call","exitCode":68,"metadata":{"nodeId":"v0ntccuvju21203s05i11jd9o3d35ae08d29csledv3ntpbjjatcg","host":"54.253.9.3","port":1314,"command":"nodesHolePunchMessageSend"}}}

And this series of errors/warnings on the deployed agent:

INFO:NodeConnectionManager:Getting connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:existing entry found for vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:withConnF calling function with connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
WARN:ConnectionReverse 129.94.8.25:31843:Reverse Error: ErrorConnectionTimeout
INFO:ConnectionReverse 129.94.8.25:31843:Stopping Connection Reverse
WARN:ConnectionForward 129.94.8.25:31843:Forward Error: ErrorConnectionTimeout
INFO:ConnectionForward 129.94.8.25:31843:Stopping Connection Forward
INFO:clientFactory:Destroying GRPCClientAgent connected to 129.94.8.25:31843
INFO:NodeConnection 129.94.8.25:31843:Destroying NodeConnection
INFO:NodeConnection 129.94.8.25:31843:Destroyed NodeConnection
INFO:clientFactory:Destroyed GRPCClientAgent connected to 129.94.8.25:31843
{
  "type": "ErrorGRPCClientCall",
  "data": {
    "message": "14 UNAVAILABLE: Connection dropped",
    "timestamp": "2022-07-13T06:35:15.212Z",
    "data": {
      "code": 14,
      "details": "Connection dropped",
      "metadata": {}
    },
    "stack": "ErrorGRPCClientCall: 14 UNAVAILABLE: Connection dropped\n    at toError (/lib/node_modules/@matrixai/polykey/dist/grpc/utils/utils.js:174:24)\n    at Object.callback (/lib/node_modules/@matrixai/polykey/dist/grpc/utils/utils.js:331:25)\n    at Object.onReceiveStatus (/lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/client.js:189:36)\n    at /lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35\n    at Object.onReceiveStatus (/lib/node_modules/@matrixai/polykey/dist/grpc/utils/FlowCountInterceptor.js:80:29)\n    at InterceptingListenerImpl.onReceiveStatus (/lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)\n    at Object.onReceiveStatus (/lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)\n    at Object.onReceiveStatus (/lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)\n    at /lib/node_modules/@matrixai/polykey/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78\n    at processTicksAndRejections (node:internal/process/task_queues:78:11)",
    "description": "Generic call error",
    "exitCode": 69
  }
}

To Reproduce

The exact cause of the error is still unknown; however, it may be reproducible by connecting to a deployed agent as a seed node:

  npm run polykey -- agent start --seednodes="[deployedAgentNodeId]@[deployedAgentHost]:[deployedAgentPort]" --verbose --format json

Expected behavior

Even if a connection times out, this shouldn't kill an agent, especially not the seed node. The seed node should be able to ignore a failed connection, even if it was the one that caused the failure, since we need seed nodes to be stable.

@emmacasolin
Contributor Author

Interestingly, I'm having a lot less difficulty connecting to deployed agents on my home wifi compared to uni wifi, and I haven't had the connection dropped like it was yesterday. This may just be our agents not being able to punch through stronger NATs, so this issue may be relevant when it comes to working on #383

@CMCDragonkai
Member

The testnet isn't behind any NAT. It has public IP addresses, so when connecting from a local agent to the testnet there isn't any punch-through. As soon as your local agent initiates a message, your local NAT would allow reverse packets from the same IP+port.

@emmacasolin
Contributor Author

emmacasolin commented Jul 14, 2022

The testnet isn't behind any NAT. It has public IP addresses, so when connecting from a local agent to the testnet there isn't any punch-through. As soon as your local agent initiates a message, your local NAT would allow reverse packets from the same IP+port.

Yes, I meant in the case of the testnet trying to connect back to me. If the conntrack entry has expired then the testnet would not be able to reach me, if that's indeed what it's attempting to do. These connection re-attempts (coming from the testnet) have been continuing for 20+ hours after my local agent was killed, probably longer, but that's the longest any of the agents showing this behaviour have been active.

@CMCDragonkai
Member

CMCDragonkai commented Jul 14, 2022

Ah yes, that can happen.

Now that I think about it. What is the reason for the testnet attempting to connect back? It has something to do with DHT queries I believe, but in that case, one must expect failures to reconnect in case your local node goes offline. Is the connection attempt solely at the proxy level (proxy connection), or at the grpc connection level (node connection)?

Are there other reasons? @tegefaulkes

For punching back to your local node, you have to have #383 done too. Plus your node is pretty transient, since it's mobile on the Uni wifi network and the Uni network itself would have a NAT. Note that #383 isn't solely about 2 levels of NATs, but really any number of levels of NATs.

Tailscale/tailnet is a good source of inspiration for how to address some of this.

@CMCDragonkai
Member

Furthermore, if reconnection attempts fail for any arbitrary node on the network, these may make very noisy logs, because there would be lots of failed network connection attempts in the future as the network scales. I reckon structured logging will need to be used to help filter out these logs.

@emmacasolin
Contributor Author

Now that I think about it. What is the reason for the testnet attempting to connect back? It has something to do with DHT queries I believe, but in that case, one must expect failures to reconnect in case your local node goes offline. Is the connection attempt solely at the proxy level (proxy connection), or at the grpc connection level (node connection)?

While my agent is online, the connection attempts are at the grpc connection/node connection level, but once it goes offline I believe it's only at the proxy level (since the only logs are coming from ConnectionForward).

@tegefaulkes
Contributor

AFAIK the only times a node will automatically try to connect to another node are:

  • initial start-up discovery with syncNodeGraph.
  • the discovery queue doing its thing.
  • the refresh buckets queue, initially right after startup and then for every bucket every hour. This will generate a lot of noise since it will do a find node operation for 255 buckets every hour. In a small network we will only find the same nodes for each operation. We still need to add some way to cancel refresh bucket operations if we're not going to find any new nodes.
  • the ping node queue, which should only happen in the background if a bucket is full, so it shouldn't be doing anything here. And I think pingNode only opens a proxy connection and not a full nodeConnection.

If the nodes are continuously trying to connect over and over again then I'd check:

  1. That the queues are not the culprit. I don't know if all of them avoid duplicates when adding to the queue; it should be easy to check if you add some debug logging on the agents for the queues (see the sketch after this list).
  2. That connections which fail to connect are actually timing out and giving up.
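
As a rough illustration of point 1, a queue add could log and skip duplicates before enqueueing. This is a minimal sketch only; the class and method names are hypothetical, not Polykey's actual queue implementation:

  // Hypothetical sketch: log and skip duplicate node IDs before enqueueing,
  // which would quickly show whether a queue keeps re-adding the same node.
  type NodeId = string;

  class DedupQueue {
    protected pending: Set<NodeId> = new Set();
    protected queue: Array<NodeId> = [];

    public push(nodeId: NodeId): void {
      if (this.pending.has(nodeId)) {
        console.debug(`skipping duplicate enqueue for ${nodeId}`);
        return;
      }
      this.pending.add(nodeId);
      this.queue.push(nodeId);
      console.debug(`enqueued ${nodeId}, queue length is ${this.queue.length}`);
    }

    public shift(): NodeId | undefined {
      const nodeId = this.queue.shift();
      if (nodeId != null) this.pending.delete(nodeId);
      return nodeId;
    }
  }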

@CMCDragonkai
Member

This problem is similar to #418. This may also be an unhandled promise rejection.

I believe our PK agent should have an unhandled promise rejection handler registered at the top level. If it catches one, it should report the error appropriately and also exit the program. We should have this done inside bin/polykey-agent.ts.

At the moment there is no unhandled promise rejection handler.

We can add this in to aid in our debugging.

At the same time, we could also address: #307.

See: https://nodejs.org/api/process.html#event-unhandledrejection

We should have 2 new exceptions at the top level:

ErrorPolykeyUnhandledRejection
ErrorPolykeyAsynchronousDeadlock

These 2 indicate software bugs, so they should use sysexits.SOFTWARE as the sysexit code.

Actually, I just remembered bin/utils/ExitHandlers. That is already being registered for signals, but not for these problems. So the implementation of these handlers should be done in ExitHandlers.
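
A minimal sketch of what registering such a handler could look like, assuming the proposed ErrorPolykeyUnhandledRejection class and sysexits.SOFTWARE being the conventional value 70; the stand-in class below is illustrative, not the project's actual error base:

  import process from 'process';

  // Stand-in for the proposed exception; the real class would extend the
  // project's error base and use sysexits.SOFTWARE as its exit code.
  class ErrorPolykeyUnhandledRejection extends Error {
    public readonly name: string = 'ErrorPolykeyUnhandledRejection';
    public readonly exitCode: number = 70; // sysexits.SOFTWARE
  }

  process.on('unhandledRejection', (reason) => {
    const error = new ErrorPolykeyUnhandledRejection(
      `Unhandled promise rejection: ${String(reason)}`,
    );
    // Report the wrapped error and exit, since this indicates a software bug
    process.stderr.write(`${error.name}: ${error.message}\n`);
    process.exit(error.exitCode);
  });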

@CMCDragonkai
Member

I don't have a solution for this bug yet. @tegefaulkes can add the handlers, and let's see if we can reproduce this AFTER we have solved the other bugs #418 and #415.

@tegefaulkes
Contributor

I just looked: src/bin/utils/ExitHandlers.ts:111 already registers handlers for unhandledRejection and uncaughtException. The handler prints the error to stderr and exits the process, so this is already being handled.

Right now it will set the process exit code to the error's code if it's an ErrorPolykey; otherwise the code is set to 255 before exiting. It doesn't wrap non-ErrorPolykey errors into anything else, and there doesn't seem to be a need for that.
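
For illustration, the described behaviour amounts to roughly the following (a simplified sketch, not the actual ExitHandlers code; the ErrorPolykey stand-in just shows the exitCode property being used):

  // Minimal stand-in for the ErrorPolykey base class from src/errors.ts,
  // which carries an exitCode property.
  class ErrorPolykey extends Error {
    public exitCode: number = 1;
  }

  // Simplified sketch of the handler: print the error, then prefer the
  // error's own exit code when it's an ErrorPolykey, otherwise use 255.
  function errorHandler(error: unknown): void {
    process.stderr.write(`${String(error)}\n`);
    process.exitCode = error instanceof ErrorPolykey ? error.exitCode : 255;
    process.exit();
  }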

With #414 (comment) already addressed, I don't know what else to do for this issue.

@tegefaulkes
Contributor

I could resolve #307 for detecting promise deadlocks.

@CMCDragonkai
Member

Yes, the ExitHandlers class has an errorHandler property that is used for both unhandledRejection and uncaughtException.

However, we want to change this to rethrow a special set of exceptions: #414 (comment)

That way it is clear to us that these are unrecoverable errors. In fact, the presence of these indicates a software bug.

@CMCDragonkai
Member

Since they are all unrecoverable errors, we can use sysexits.SOFTWARE as the software bug code.

Regardless of whether it is an instance of ErrorPolykey or not, it should be wrapped up.

ErrorUnhandledRejection
ErrorUncaughtException
ErrorAsynchronousDeadlock

We should not use the ErrorPolykey prefix here. These are actually not about PK at all; they are Node.js runtime problems.

I suggest putting all of these exceptions into the ExitHandlers class itself.

@tegefaulkes
Contributor

We don't want these errors to extend ErrorPolykey, so we need a different error for them to extend. Currently, ErrorPolykey is the base class we use to create all of our errors, and it's the class that adds the exitCode and description properties.

I'm going to rename ErrorPolykey to GenericError (or whatever is a better name) and have ErrorPolykey and the other errors above extend that.

@tegefaulkes
Contributor

I'll hold off on that change for now until I decide the best action.

  1. Would ErrorGeneric be in its own file, separate from ErrorPolykey.ts?
  2. Would ErrorCLI (ErrorBin) extend ErrorPolykey or ErrorGeneric?

@CMCDragonkai
Member

CMCDragonkai commented Sep 20, 2022

Ok, well in that case it should be ErrorBinUnhandledRejection, ErrorBinUncaughtException, and ErrorBinAsynchronousDeadlock.

And have the base error in bin/errors be ErrorBin.

@CMCDragonkai
Member

Consider this:

ErrorBin

ErrorBinUncaughtException extends ErrorBin
ErrorBinUnhandledRejection extends ErrorBin
ErrorBinAsynchronousDeadlock extends ErrorBin

ErrorCLI extends ErrorBin

ErrorCLI...
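
A minimal sketch of that hierarchy, assuming plain Error as the base for brevity (the real classes would extend the project's error base) and sysexits.SOFTWARE being 70:

  const SOFTWARE = 70; // sysexits.SOFTWARE, used for software-bug errors

  class ErrorBin extends Error {}

  class ErrorBinUncaughtException extends ErrorBin {
    public readonly exitCode: number = SOFTWARE;
  }

  class ErrorBinUnhandledRejection extends ErrorBin {
    public readonly exitCode: number = SOFTWARE;
  }

  class ErrorBinAsynchronousDeadlock extends ErrorBin {
    public readonly exitCode: number = SOFTWARE;
  }

  class ErrorCLI extends ErrorBin {}

  // The existing CLI errors (ErrorCLI...) would then extend ErrorCLI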

@CMCDragonkai
Member

Change the 2 ErrorNode* in src/bin/errors to also be ErrorCLI....

Also, in src/errors.ts, what are these 2:

  class ErrorInvalidId<T> extends ErrorPolykey<T> {}
  
  class ErrorInvalidConfigEnvironment<T> extends ErrorPolykey<T> {}

@CMCDragonkai
Member

Nothing is using those last 2 errors. They must have been forgotten and can be removed.

tegefaulkes added a commit that referenced this issue Sep 20, 2022
Un-recoverable errors include `ErrorBinUncaughtException`, `ErrorBinUnhandledRejection` and `ErrorBinAsynchronousDeadlock`.

#414
tegefaulkes added a commit that referenced this issue Sep 21, 2022
Un-recoverable errors include `ErrorBinUncaughtException`, `ErrorBinUnhandledRejection` and `ErrorBinAsynchronousDeadlock`.

#414
tegefaulkes added a commit that referenced this issue Sep 21, 2022
Un-recoverable errors include `ErrorBinUncaughtException`, `ErrorBinUnhandledRejection` and `ErrorBinAsynchronousDeadlock`.

#414
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 10, 2023