Fixes issue with node losing connection to BigTable and never regaining it #26217

Closed
PeaStew wants to merge 9 commits

Conversation

@PeaStew commented Jun 25, 2022

Fixes #20336 (comment)

Problem

BigTable connection lost and never regained

Summary of Changes

Set refresh_active to false in subsequent calls after waiting 2 seconds

Fixes #20336

@mergify bot added the community (Community contribution) label Jun 25, 2022
@mergify bot requested a review from a team June 25, 2022 11:47
@PeaStew (Author) commented Jun 25, 2022

@CriesofCarrots Can you please check this when you get a chance? It solves a significant issue for RPC providers.

@t-nelson (Contributor) left a comment

can you elaborate on the root cause and how this change fixes it?

seems like at best it allows multiple outstanding refresh requests, which could stomp on each other

return;
}
warn!("Token ready to be refreshed");
Contributor

nit: warn is far too chatty for these new log messages. if they need to stay, please demote to debug

@PeaStew (Author) Jun 25, 2022

> can you elaborate on the root cause and how this change fixes it?
>
> seems like at best it allows multiple outstanding refresh requests, which could stomp on each other

The root cause appears to be that the request to get a new token sometimes neither succeeds nor fails and effectively runs forever (or is garbage collected), and as a result never reverts the refresh_active bool flag to false. This may indeed be a memory leak (processes continuing to build up with no end), but I'm not a Rust dev so I can't tell. Certainly, if the GoAuth call were failing, the failure should be caught, and there are no error messages in the logs.

The current code, in the second test (after checking the token's elapsed time), falls through to renew the token and at the same time switches refresh_active to true (.compare_and_swap(false, true, Ordering::Relaxed)). But because the renewal call never completes, the flag never gets set back to false as it should later in the call.

I first commented out both tests and determined that even requesting a new token on every single call on a live production RPC returned very quickly. Obviously this would not be ideal, but it told me that a successful auth request is very quick (less than 2 seconds, and I am being conservative here).

My fix changes the behaviour and I think it is quite sane. It doesn't affect any other successful calls, because if the token is refreshed by a parallel call, the token is updated and refresh_active is set back to false anyway (second link). All my change does is say: if refresh_active is true, wait for 2 seconds (by which time the refresh should have completed), set the value to false for the next call, and return without attempting to renew the token. This means that if the renew call has failed, it will not block the next renew call.
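
In rough code terms, the shape of the change is something like this (a minimal sketch with illustrative names; `compare_exchange` stands in for the deprecated `compare_and_swap`, and this is not the exact diff in this PR):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

pub struct AccessToken {
    refresh_active: AtomicBool,
    // ... token, credentials, and so on in the real struct
}

impl AccessToken {
    pub async fn refresh(&self) {
        // Try to claim the refresh gate.
        if self
            .refresh_active
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_err()
        {
            // Another refresh is (or appears to be) in flight. Give it a
            // couple of seconds to finish, then clear the flag so a hung
            // refresh cannot block every future attempt, and return
            // without trying to renew the token ourselves.
            tokio::time::sleep(Duration::from_secs(2)).await;
            self.refresh_active.store(false, Ordering::SeqCst);
            return;
        }

        // ... perform the token request and store the new token here ...

        // Normal path: release the gate once the refresh completes.
        self.refresh_active.store(false, Ordering::SeqCst);
    }
}
```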

As I wrote in the comments, I am sure there are better approaches, but this issue has been there since September at least and nobody had fixed it; my method does indeed fix it. One better approach would be to make the renew request time out by itself: if there is no response, success or failure, within 2 seconds, it should gracefully exit and set refresh_active to false. But that is a lot more code and, as I said, I am not a Rust dev.

I would also like to note that this is a very serious issue with a very serious result: it requires restarting the node to get a new token. I discussed this with the triton.one guys, and they had set up monitoring so that as soon as a node is unable to get info from BigTable, they restart it. Obviously this could lead to a cascading issue if BigTable were actually down, as all the nodes would be continuously restarting themselves. I therefore think a quick fix, even if not ideal, is warranted.

Feel free to make a different fix for it; I have applied this fix to 25 active production RPC nodes for Ankr and it is working just fine.

Contributor

> The root cause appears to be that the request to get a new token sometimes neither succeeds nor fails and effectively runs forever (or is garbage collected), and as a result never reverts the refresh_active bool flag to false.

This is not a correct statement.
Even if the request to get a new token fails, the flag will still be set to false, because we do not unwrap/expect the result; we handle Ok/Err with a match.
I also do not think that the process continues effectively forever, and as you tested, a new token was returned very quickly each time.

The code looks completely OK. The only thing that can happen is multiple concurrent updates, but that should be fine and can be solved with the Ordering::SeqCst memory ordering.

--
I do not understand why getting a new token gets stuck 😐
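
For reference, the pattern being described is essentially the following (a sketch only; `request_new_token` is a hypothetical stand-in for the real token request, not a function in the crate):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;
use std::time::Instant;

// Hypothetical stand-in for the real token request.
async fn request_new_token() -> Result<String, String> {
    Ok("fresh-token".to_string())
}

async fn refresh(token: &RwLock<(String, Instant)>, refresh_active: &AtomicBool) {
    match request_new_token().await {
        Ok(new_token) => {
            // Success: store the fresh token and when it was fetched.
            *token.write().unwrap() = (new_token, Instant::now());
        }
        Err(err) => {
            // Failure: the request still returned, so we only log it.
            eprintln!("error refreshing access token: {err}");
        }
    }
    // Reached after either arm above: the flag only stays `true` if the
    // awaited request never returns at all.
    refresh_active.store(false, Ordering::Relaxed);
}
```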

PeaStew (Author)

>> The root cause appears to be that the request to get a new token sometimes neither succeeds nor fails and effectively runs forever (or is garbage collected), and as a result never reverts the refresh_active bool flag to false.
>
> This is not a correct statement. Even if the request to get a new token fails, the flag will still be set to false, because we do not unwrap/expect the result; we handle Ok/Err with a match. I also do not think that the process continues effectively forever, and as you tested, a new token was returned very quickly each time.
>
> The code looks completely OK. The only thing that can happen is multiple concurrent updates, but that should be fine and can be solved with the Ordering::SeqCst memory ordering.
>
> I do not understand why getting a new token gets stuck 😐

You are declaring this without testing it though, right? Because in checking the logging, the code is not getting past the second test. That indicates that the value is continuously true.

@PeaStew (Author) Jul 8, 2022

Also, in testing, the token DOES NOT get returned quickly every time; it can fail continuously for up to 34 seconds in a row. This is only visible if you add my logging code so you can see it failing.

@fanatid (Contributor) Jul 8, 2022

@PeaStew can you try this patch: fanatid@171c9e3? It's on top of 1.10.28, but if you need help porting it to another version I can do that.
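
The idea of the patch, roughly (a sketch only; the `request_new_token` helper and the 5-second budget are placeholders, not what the linked commit actually contains):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;
use tokio::time::timeout;

// Placeholder for the real token request.
async fn request_new_token() -> Result<String, String> {
    Ok("fresh-token".to_string())
}

async fn refresh_with_timeout(refresh_active: &AtomicBool) {
    match timeout(Duration::from_secs(5), request_new_token()).await {
        // The request finished in time and succeeded.
        Ok(Ok(_token)) => { /* store the new token */ }
        // The request finished in time but failed.
        Ok(Err(err)) => eprintln!("token refresh failed: {err}"),
        // The request did not finish in time; give up instead of hanging.
        Err(_elapsed) => eprintln!("token refresh timed out"),
    }
    // Released on every path, so a hung request can no longer wedge the
    // refresh gate permanently.
    refresh_active.store(false, Ordering::SeqCst);
}
```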

Contributor

Hey @PeaStew, did you have a chance to test my patch?

Contributor

@fanatid, thanks for the patch, it looks pretty reasonable. I'm thinking of asking one of the foundation RPC nodes to try it out, since we haven't heard anything from PeaStew.

@fanatid (Contributor) Jul 24, 2022

I've updated the code a little, so there is no unwrap now: fanatid@32afd20

@artburkart (Contributor) Oct 14, 2022

Is this patch still good to try? We experience this same issue multiple times per day, so we could probably report back.

> I've updated the code a little, so there is no unwrap now: fanatid@32afd20

storage-bigtable/src/access_token.rs (resolved)
storage-bigtable/src/access_token.rs (outdated, resolved)
PeaStew and others added 3 commits June 25, 2022 20:36
error is: within `impl futures::Future<Output = [async output]>`, the trait `std::marker::Send` is not implemented for `std::sync::RwLockReadGuard<'_, (Token, std::time::Instant)>`
note: future is not `Send` as this value is used across an await
@PeaStew (Author) commented Jul 6, 2022

So I don't know if this is being looked at. I can say I have been patching every version up to 1.10.29 with this code, and the nodes are stable, not experiencing any sort of failures or falling behind blockheight because of pauses in the execution thread. The patch works, even if it may not be up to the elegance required.

@stale bot commented Jul 31, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale bot added the stale ([bot only] results in auto-close after a week) label Jul 31, 2022
@stale bot commented Aug 13, 2022

This stale pull request has been automatically closed. Thank you for your contributions.

@PeaStew (Author) commented Sep 4, 2023

Your fixes still don't work; mine does. Enjoy.

@fanatid (Contributor) commented Nov 2, 2023

Hi @PeaStew, can you please post your actual patch? Did the changes with tokio::time::timeout not work for you at all, or did they work and then stop working with the latest releases?

@fanatid (Contributor) commented Nov 23, 2023

@PeaStew, just FYI: I submitted PR #34213 with a fix.

Labels: community (Community contribution), stale ([bot only] results in auto-close after a week)

5 participants