Improve transaction-level retry logic #879
Comments
Is there any chance that any of the known bugs this issue is intended to address could be the cause of an exception like this? I run a somewhat complicated function inside of an …
Any update on this, or on whether it's even potentially related to the "The referenced transaction has expired or is no longer valid." error @breathe mentioned? I have also been getting this error sometimes. In my case I am using transactions inside an AWS Lambda Python container, and it seems like it can cause the container to freeze indefinitely rather than restart.
I'm also watching to see if this issue's fix could address the "The referenced transaction has expired or is no longer valid." exception. Currently, I'm seeing this exception at a rate of about 1 in 150 executions in production. I haven't been able to reproduce it reliably or eliminate it by pinning a specific version.
Is the "The referenced transaction has expired or is no longer valid." error a recent one? Or has it been going on for a while? It could be related to some of the retry problems listed here, but it's hard to say for sure. It sounds like its a transient issue? If it's possible to create a reproductive sample, that would be helpful. I'll try to dig into it some more soon |
We've only been building with Firebase for a few months, and we've seen this issue since we started running longer jobs (up to a few hours) which update either 1 or 2 documents inside a transaction every n seconds (…)
It's definitely transient but regularly occurs ... We have some very long-running deterministic CI tasks (2-3 hours) which publish information about their progress as they run. The task updates 1 or 2 documents at most every 30 seconds using the same transactional _update_task_output function -- and the issue was happening regularly ... Of the two documents, one is owned by a task and only ever written by the owning task, and the other is written once by up to maybe 5 different tasks (when they finish) -- so there is concurrency and it's important that the updates are serialized -- but it's a pretty low-conflict domain. It took a while to find, but the workaround that has worked for me looks like this ... with basic setup like this:
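The setup snippet from this comment did not survive; below is a minimal sketch of what such a setup could look like, assuming a plain google-cloud-firestore client and the transactional helper named in the comment (_update_task_output). The field names and document layout are illustrative assumptions, not the commenter's actual code.

```python
# Minimal sketch only -- field names and document layout are assumptions.
from google.cloud import firestore

client = firestore.Client()


@firestore.transactional
def _update_task_output(transaction, task_ref, summary_ref, progress):
    # Firestore requires all reads in a transaction to happen before writes.
    snapshot = task_ref.get(transaction=transaction)
    history = (snapshot.to_dict() or {}).get("history", [])

    transaction.set(
        task_ref,
        {
            "progress": progress,
            "history": history + [progress],
            "updated": firestore.SERVER_TIMESTAMP,
        },
        merge=True,
    )
    if summary_ref is not None:
        # Written once per task when it finishes, per the description above.
        transaction.set(summary_ref, {"last_progress": progress}, merge=True)
```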
and call-site logic like this (with the hacky workaround I found included) ...
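The call-site snippet is also missing from the capture. The following is a hedged reconstruction based on the description in the rest of this comment (retry the transaction, and recreate the client on failure); the retry count, backoff, and exception matching are guesses rather than the commenter's actual code.

```python
import time

from google.api_core import exceptions as gapi_exceptions
from google.cloud import firestore


def update_task_output_with_retries(task_path, summary_path, progress, max_attempts=5):
    """Hacky workaround sketch: retry the transaction, and recreate the client
    (and therefore its underlying gRPC channel) when the 'referenced transaction
    has expired or is no longer valid' InvalidArgument error shows up."""
    global client
    last_error = None
    for attempt in range(max_attempts):
        try:
            transaction = client.transaction()
            task_ref = client.document(task_path)
            summary_ref = client.document(summary_path) if summary_path else None
            return _update_task_output(transaction, task_ref, summary_ref, progress)
        except gapi_exceptions.InvalidArgument as exc:
            # Retrying alone was reported as insufficient; resetting the client
            # before retrying is the part that made the workaround stick.
            last_error = exc
            client = firestore.Client()
            time.sleep(2 ** attempt)
    raise last_error
```

Driving the decorated _update_task_output with a fresh Transaction object on each attempt matches how the transactional decorator is normally used; the client reset is the extra step this workaround adds.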
The trashy retry logic above evolved in place as I tried to find a workaround for this problem ... I can state for certain that just retrying the transactions after this failure is not sufficient. The workaround only started working once I added the client reset. It feels to me like the client itself ends up in some bad state. Since this happens for me only with long-running jobs, my suspicion is that there could maybe be something related to auth refresh -- or some borked state or timeout within the grpc layer that causes the channel to fail to recover from some tcp/http connection state scenario when the underlying socket is alive long enough ...
Thanks for sharing the workaround and the insights, that's very helpful. It's surprising that resetting the client would be necessary -- I thought the state would be contained in the Transaction object/the backend. I'm not immediately seeing anything in the client itself that would cause this... Could you try turning on verbose grpc logs by setting …? I did some searching and came across this. Is it possible your transaction code could be doing something that closes the transaction early, and then tries to access the same transaction object later? If you're able to share some of the code in _update_task_output, that might help get to the bottom of it. Also, Datastore lists these limitations for transactions: …
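The exact flag the maintainer suggested above was lost from the capture. As a general note (an assumption, not necessarily the setting being asked for), verbose gRPC logging can be enabled through environment variables set before grpc and the Firestore client are imported:

```python
import os

# Generic gRPC debug logging; set these before grpc / the Firestore client
# is imported so the gRPC core picks them up.
os.environ["GRPC_VERBOSITY"] = "DEBUG"
os.environ["GRPC_TRACE"] = "http,call_error,connectivity_state"

from google.cloud import firestore  # noqa: E402  (imported after the env vars on purpose)

client = firestore.Client()
```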
Yeah, the InvalidArgument is a gRPC error returned by the backend. It is expected to bubble up when there is a server error that's not caught by the app.
The server is returning us a valid 400 Bad Request error, so I don't think it would be related to auth or anything. It seems like the transaction being sent is just invalid. Assuming your code isn't doing anything like committing to the same transaction twice, the error is likely in the transaction lifecycle code somewhere.
When a transaction fails, the client retries the entire transaction, but some bugs have been identified in the retry logic.
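For context, here is a minimal sketch (illustrative, not taken from the issue) of the transaction-level retry being discussed: the transactional decorator re-runs the whole decorated function on each attempt, up to the transaction's max_attempts.

```python
from google.cloud import firestore

client = firestore.Client()


@firestore.transactional
def increment_counter(transaction, ref):
    # On a conflict at commit time, this entire function is re-executed.
    snapshot = ref.get(transaction=transaction)
    count = (snapshot.to_dict() or {}).get("count", 0)
    transaction.set(ref, {"count": count + 1}, merge=True)


# max_attempts bounds how many times the whole transaction is retried.
transaction = client.transaction(max_attempts=5)
increment_counter(transaction, client.document("counters/jobs"))
```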