[web3.js] Infinite recursion caused by _updateSubscriptions #1106

Closed
Disperito opened this issue Jun 24, 2022 · 32 comments · Fixed by elpheria/rpc-websockets#138 or #1180

@Disperito

Disperito commented Jun 24, 2022

Problem

I'm using version 1.44.1 of @solana/web3.js to develop an application that makes high-frequency calls.
The network connection between my machine and the Solana network doesn't seem very stable: it often throws errors about transaction confirmation timeouts.
In this situation, it suddenly starts throwing errors rapidly and endlessly.

Here is the key log:

signatureSubscribe error for argument [
  '5R9WXhnHtz5umHHh9LpXXz9CSXjucTbrUPCJV2iQAkUgBX4jQ4UcQYRyta3LbBFZwuMGcKMaqCPiPkyuPT2JPKSr',
  { commitment: 'processed' }
] WebSocket is not open: readyState 0 (CONNECTING)

As you can see, the connection is frantically trying to set up subscriptions for two transaction signatures.

Then I found that the problem is caused in the function _updateSubscriptions() in connection.ts.

Proposed Solution

Maybe your team should fix the error-handling logic. Here's my humble opinion: if a subscription fails too many times (whatever the reason), it should be removed from the subscriptions list directly.
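
For illustration, a minimal sketch of that idea in TypeScript. MAX_FAILURES, TrackedSubscription, and the two callbacks are hypothetical names for this sketch, not part of web3.js:

const MAX_FAILURES = 5;

type TrackedSubscription = {
  method: string;
  args: unknown[];
  failureCount: number;
};

async function trySubscribe(
  sub: TrackedSubscription,
  call: (method: string, args: unknown[]) => Promise<number>,
  removeSubscription: (sub: TrackedSubscription) => void,
): Promise<void> {
  try {
    await call(sub.method, sub.args);
    sub.failureCount = 0; // success: reset the counter
  } catch (e) {
    sub.failureCount += 1;
    if (sub.failureCount >= MAX_FAILURES) {
      // Too many failures: drop the subscription instead of retrying forever.
      removeSubscription(sub);
      console.error(`${sub.method} dropped after ${MAX_FAILURES} failures`, e);
    }
    // Otherwise leave it pending so a later reconnect can retry it.
  }
}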

@Disperito Disperito changed the title Infinite loop caused by _updateSubscriptions Infinite recursion caused by _updateSubscriptions Jun 24, 2022
@Disperito Disperito changed the title Infinite recursion caused by _updateSubscriptions [web3.js] Infinite recursion caused by _updateSubscriptions Jun 27, 2022
@steveluscher
Collaborator

Thanks for this report!

Other than spamming the error console, does this interfere with the operation of your app?

@Disperito
Author

Disperito commented Jul 13, 2022

Thanks for this report!

Other than spamming the error console, does this interfere with the operation of your app?

Of course. It caused infinite recursion and drove CPU usage to 100%, so my app could not run normally. @steveluscher

@gallynaut

I was also seeing this issue on v1.47.3 and reverted to v1.43.5 and it went away.

@steveluscher
Collaborator

Totally, I get it, @Disperito. Unbounded recursion is super bad news, no matter which way you slice it. What I was trying to get at is that there are two kinds of infinite recursion.

One leads to the JavaScript thread literally locking, and freezing the browser:

while (true) {
  // Browser's dead.
}

The other results in code running as often as it can, and high CPU usage, but doesn't lock the thread; other code can continue to run in the period between yields.

while (true) {
  // Between the time this yields and resumes, other work can get done on the JS thread.
  await someAsyncWork();
}

In any case, we need a general solution for this.

@steveluscher
Collaborator

Can I ask you a question? Can you count the number of subscriptions that you're making to the RPC in total? I've seen situations where RPCs limit the number of open subscriptions you may hold. When you exceed a certain number, they murder your connection abruptly. See solana-labs/solana#25322 (comment).

@Disperito
Author

Disperito commented Jul 16, 2022

Can I ask you a question? Can you count the number of subscriptions that you're making to the RPC in total?

Sorry, I have no definite answer. But I reviewed the logs from around when this error occurred, and I think there were fewer than 10 subscriptions active at the same time, possibly even fewer.

I don't think a large number of subscriptions is the root cause of the bug. It's true that many subscriptions can lead to a poor connection with RPCs, but the point is: your library should recognize this situation and handle it specially, such as by throwing an exception or actively unsubscribing.

@Jac0xb
Contributor

Jac0xb commented Sep 9, 2022

Still getting this error with @solana/[email protected], @steveluscher. It spammed logs until the server was killed.

signatureSubscribe error for argument [
  '{SOME_SIGNATURE}',
  { commitment: 'confirmed' }
] socket not ready

@steveluscher
Collaborator

@Jac0xb If you have a reliable repro of this, can you debug it a bit to see if this is the root cause? solana-labs/solana#25578

@Jac0xb
Contributor

Jac0xb commented Sep 9, 2022

Not sure of the repro steps, but it happens often with sendConfirmation and possibly long-lived connection objects. Hadn't seen it before. It happened in Node.js. Looks like there is an onSignature call in there.

@steveluscher
Collaborator

To anyone who's still experiencing this problem, can you try to upgrade to an experimental release, by running yarn add @solana/web3.js@pr-29195? That's a package built from PR solana-labs/solana#29195.

Let me know if that fixes anything.

@steveluscher
Collaborator

@Jac0xb, @Disperito, @gallynaut do you have time to give this another try, if you still have a repro? See the message above.

@steveluscher steveluscher transferred this issue from solana-labs/solana Feb 1, 2023
@grabbou

grabbou commented Feb 12, 2023

I am seeing this on the latest 1.73.2. Anything I can do to help resolve this? It's random and happens frequently after some transactions are made.

signatureSubscribe error for argument ["...", {"commitment": "confirmed"}] INVALID_STATE_ERR

Update: after some digging, the error is caused on this line:

const serverSubscriptionId: ServerSubscriptionId =
  (await this._rpcWebSocket.call(method, args)) as number;
this._setSubscription(hash, {

There's definitely something strange going on here, because upon debugging uncaught exceptions, this._rpcWebSocket.socket has readyState: 0, but that somehow doesn't cause the conditional in _call to reject the promise.

It looks like this warning from the PR description might still be happening:

All of this can lead to corruption of the internal state of the RPC; it can think that it's connected, able to send messages, and is owed responses, when it is not.

@grabbou

grabbou commented Feb 12, 2023

I have left some additional comments on the PR solana-labs/solana#29195 to continue the discussion. I hope that with my unfortunately stable reproduction we can get this fixed.

@grabbou

grabbou commented Feb 12, 2023

Update: this might be a false positive, but after reading this issue again, one thing from the OP caught my immediate attention. We both execute a series of transactions, one after another. In my case, I have an array, and for each item in the array I execute 3 transactions. I will eventually refactor to a single one (and batch those calls), but I think that could be highly related.

I added a setTimeout of 15 seconds between each call (just to stay on the safe side) and my script executed successfully. Are we dealing with a race condition here?

Update: unfortunately, it errored again, but I have a feeling the error happens less frequently.
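
For reference, the delay workaround looks roughly like this; sleep, items, and processItem are hypothetical stand-ins for the per-item transactions described above:

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function run(
  items: unknown[],
  processItem: (item: unknown) => Promise<void>,
) {
  for (const item of items) {
    await processItem(item); // e.g. the 3 transactions per array item
    await sleep(15_000); // 15 seconds of breathing room between calls
  }
}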

@grabbou

grabbou commented Feb 13, 2023

Update: I have managed to find a fix and posted in the PR solana-labs/solana#29195 (comment). Hope we can land this quick!

@dfl-zeke

I'm having the same issue. Application is getting rekt by subscriptions.

@steveluscher
Collaborator

@dfl-zeke, do you have a repro that you can share? I'm happy to check out a repository and reproduce the problem over here.

@dfl-zeke

@dfl-zeke, do you have a repro that you can share? I'm happy to check out a repository and reproduce the problem over here.

I don't have one, but I can suggest how to reproduce it. Subscribe to 5K random accounts via WebSocket and you will surely get this error. I was testing how many subscriptions I could handle, but was unable to finish the test: sometimes it starts throwing errors at 2K subs, sometimes at 14K, so it's kind of random. I even added sleeps between chunks of subscriptions and it did not help at all; same error. If you try to subscribe to all of them at the same time, it starts throwing this error immediately. I even tried multiple socket connections, but still got the error.
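
A minimal sketch of that repro might look like this; the endpoint and counts are placeholders, and it assumes the standard Connection.onAccountChange API from @solana/web3.js:

import { Connection, Keypair } from '@solana/web3.js';

// Placeholder endpoint; substitute the RPC you are testing against.
const connection = new Connection('https://api.devnet.solana.com', 'confirmed');

for (let i = 0; i < 5_000; i++) {
  // Random keypairs stand in for the "5K random accounts".
  const randomAccount = Keypair.generate().publicKey;
  connection.onAccountChange(randomAccount, () => {
    // No notification is expected for a random account; the point is only
    // to hold many open subscriptions on a single socket.
  });
  if (i % 500 === 0) console.log(`subscribed: ${i}`);
}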

Also, while I'm here and you might see this: I would suggest adding retry limits on the WebSocket (because right now it always retries reconnecting on failure) and a way to turn ping on/off; that would make this package much more comfortable to work with.

@steveluscher
Collaborator

Subscribe to 5K random accounts via Websocket and you will get this error surely.

I see! Can you drum me up a quick repro of exactly that and post it on GitHub, @dfl-zeke? Also, let me know exactly which RPC URL you were able to reproduce the problem with because, as noted here, different RPCs have different subscription limits, and take different actions when you overflow that limit. I want to avoid a situation where I try to stand up a repro, can't trigger the bug, and have to come back to you for more information¹.

Footnotes

  1. Believe me, I've tried to create a repro before, to no avail!

@dfl-zeke

I see! Can you drum me up a quick repro of exactly that and post it on GitHub, @dfl-zeke?

Will try to create one. I'm using an rpcpool endpoint, so I can't share it with you publicly. Also, if I were being limited, I should get a 429 (which I did get sometimes, to be honest). I'll share a QuickNode one with a free-tier subscription (I have the same problem with them too).

@dfl-zeke

I see! Can you drum me up a quick repro of exactly that and post it on GitHub, @dfl-zeke?

Here you go: https://github.com/dfl-zeke/solana-web3js-subs. If you have any questions, ping me.

@steveluscher
Collaborator

How long do you have to leave this running before you see the problem, @dfl-zeke? I'm at index 8300 and haven't seen it yet.

@steveluscher
Collaborator

Nevermind. There it is.

@steveluscher
Collaborator

Here's what I discovered about your repro:

First, you're subscribing to the same RPC node 100 times, and fanning out subscriptions to each one of those connections. RPCs really hate when you do this, and will start to limit you and/or drop your connection (with a 1006 ‘abnormal closure’ code) real fast. When I limited this to one RPC node, or even 10, I was able to make all ~20K subscriptions with no problem.

Now, let's talk about the actual problem that results when the RPC does unceremoniously drop you. This code correctly throws an exception when a message heads for a socket that's actually in a bad (non-OPEN) state:

if (readyState === 1 /* WebSocket.OPEN */) {
  return super.call(...args);
}
return Promise.reject(
  new Error(
    'Tried to call a JSON-RPC method `' +
      args[0] +
      '` but the socket was not `CONNECTING` or `OPEN` (`readyState` was ' +
      readyState +
      ')',
  ),
);

And this code is catching it in the subscriptions state machine:

} catch (e) {
  if (e instanceof Error) {
    console.error(
      `${method} error for argument`,
      args,
      e.message,
    );
  }
  if (!isCurrentConnectionStillActive()) {
    return;
  }
  // TODO: Maybe add an 'errored' state or a retry limit?
  this._setSubscription(hash, {
    ...subscription,
    state: 'pending',
  });
  await this._updateSubscriptions();
}

Now, when the socket goes down like this, two things happen: its readyState changes, and its close event fires. Problem is, sometimes you'll detect one before the other.

  1. If the close event fires before we detect a change in the readyState, we're good. The onClose handler will increment the _rpcWebsocketGeneration and the state machine will abandon processing.
  2. If the readyState is detected not to be OPEN before the close event fires, we're in trouble. The exception will be caught, but the state machine will immediately try to set the subscription up again. The fact that it tries to set the subscription back up is not, in itself, the problem. It's that it tries to set it up synchronously. The synchronous nature of the loop blocks the onClose handler from being able to work. Because the onClose handler never fires, the state machine never becomes aware that the socket is dead. That's the problem.

Some possible solutions:

  1. Make the state machine yield the main thread on exceptions. This breathing room will allow asynchronous events from the socket to be handled, righting the internal state of the state machine. Not great. Adds asynchronicity to an already complex system.
  2. Enable the state machine to infer, from the error, that the socket has died, and reset its own socket-related state.

I have a few half-baked implementations of this on the go. Give me some time to find the best one.
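
For illustration, option 2 might look roughly like this; handleSubscribeError and onSocketDead are hypothetical names for this sketch, not web3.js internals:

// The idea: treat a readyState-style error like a close event, so the
// subscription state machine stops retrying against a dead socket.
function handleSubscribeError(e: unknown, onSocketDead: () => void): void {
  const message = e instanceof Error ? e.message : '';
  if (/readyState|socket not ready|not `CONNECTING` or `OPEN`/.test(message)) {
    // Invalidate the current connection generation, exactly as the onClose
    // handler would, so in-flight retries abandon their work.
    onSocketDead();
  }
}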

@steveluscher
Collaborator

I'm pretty sure that elpheria/rpc-websockets#138 fixes the source of corruption of the subscriptions state machine here.

@dfl-zeke

@steveluscher I understand that 20K subscriptions is not a normal thing to have, but I sometimes get it with 500 subscriptions, and it's annoying. If I have too many subs, then okay, I can understand the host limiting me and returning an error on the call, but it should not crash the program.

@steveluscher
Collaborator

Yeah, I doubt it has anything to do with the number of subscriptions. It probably has everything to do with which box the load balancer shunts you to, and how many open subscriptions it currently has.

@grabbou

grabbou commented Feb 23, 2023

To me, it happens pretty much every time I wait for transaction results. I think it has to do with React Native and the way it handles WebSockets. I still haven't had time to provide a reproduction because, for me, the workaround I pasted here temporarily fixes the problem (but it introduces other side effects, so it's not really production-ready, lol).

I will try to provide a reproduction soon, but maybe that helps too. I see you also posted another PR, so I will wait till it's shipped to double-check.

Meanwhile, @steveluscher, how do you feel about solana-labs/solana#29195 in general? If that didn't fix the original error, do you still consider it a valid improvement, or would you say it's something that could be reverted? I left some comments here: solana-labs/solana#29195 (comment)

@github-actions
Contributor

🎉 This issue has been resolved in version 1.73.3 🎉

The release is available on npm.

Your semantic-release bot 📦🚀

@steveluscher
Collaborator

Try @solana/[email protected].

@steveluscher
Collaborator

steveluscher commented Feb 24, 2023

@dfl-zeke, I upgraded your repro repo to 1.73.3 and can no longer kill it.

@github-actions
Contributor

github-actions bot commented Mar 3, 2023

Because there has been no activity on this issue for 7 days since it was closed, it has been automatically locked. Please open a new issue if it requires a follow up.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 3, 2023