
Improve transaction-level retry logic #879

Open
daniel-sanche opened this issue Mar 9, 2024 · 6 comments

Assignees: daniel-sanche
Labels: api: firestore, priority: p3, type: bug
Milestone: Retry Audit

@daniel-sanche
Contributor

daniel-sanche commented Mar 9, 2024

When a transaction fails, the client retries the entire transaction. But some bugs have been identified in the retry logic (a rough sketch of the intended behavior follows the list):

  • only checks for ABORTED. Should also retry on CANCELLED, UNKNOWN, DEADLINE_EXCEEDED, INTERNAL, UNAVAILABLE, UNAUTHENTICATED, RESOURCE_EXHAUSTED
  • backoff is not supported
  • transaction calls rollback on failure, even if BeginTransaction was never called
  • the first transaction id should be used for each retry; currently the latest attempt's id is always used
  • RESOURCE_EXHAUSTED errors should jump to the max backoff value immediately
  • gapic-level retries should be limited when called in the context of a transaction (BeginTransaction, Commit, BatchGetDocuments, RunQuery, RunAggregationQuery, etc.)
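
As a rough illustration only (not the library's actual API), a transaction-level retry loop along these lines could look like the sketch below. The callables begin_fn, body_fn, commit_fn, and rollback_fn are hypothetical stand-ins for the BeginTransaction / user-code / Commit / Rollback steps, and the backoff parameters are made up:

import random
import time

from google.api_core import exceptions

# Status codes that should trigger a transaction-level retry, per the list above
RETRYABLE = (
    exceptions.Aborted,
    exceptions.Cancelled,
    exceptions.Unknown,
    exceptions.DeadlineExceeded,
    exceptions.InternalServerError,
    exceptions.ServiceUnavailable,
    exceptions.Unauthenticated,
    exceptions.ResourceExhausted,
)

def run_transaction_with_retry(begin_fn, body_fn, commit_fn, rollback_fn,
                               max_attempts=5, initial=1.0, multiplier=2.0, maximum=60.0):
    first_txn_id = None
    delay = initial
    for attempt in range(max_attempts):
        # reuse the *first* attempt's transaction id on every retry
        txn_id = begin_fn(retry_of=first_txn_id)
        if first_txn_id is None:
            first_txn_id = txn_id
        try:
            body_fn(txn_id)
            return commit_fn(txn_id)
        except RETRYABLE as exc:
            # only roll back here, after BeginTransaction has actually succeeded
            rollback_fn(txn_id)
            if attempt == max_attempts - 1:
                raise
            if isinstance(exc, exceptions.ResourceExhausted):
                delay = maximum  # jump straight to the max backoff
            time.sleep(delay * random.uniform(0.5, 1.0))  # jittered backoff
            delay = min(delay * multiplier, maximum)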
@daniel-sanche added the type: bug and priority: p3 labels on Mar 9, 2024
@daniel-sanche self-assigned this on Mar 9, 2024
@product-auto-label bot added the api: firestore label on Mar 9, 2024
@daniel-sanche added this to the Retry Audit milestone on Mar 9, 2024
@breathe

breathe commented Mar 12, 2024

Is there any chance that any of the known bugs this issue is intended to address could be the cause of an exception like this?

Traceback (most recent call last):
  File "/app/.venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 72, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/app/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "The referenced transaction has expired or is no longer valid."
	debug_error_string = "UNKNOWN:Error received from peer ipv4:142.250.152.95:443 {created_time:"2024-03-12T05:00:53.48562254+00:00", grpc_status:3, grpc_message:"The referenced transaction has expired or is no longer valid."}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/zephyrus/cli.py", line 326, in analyze
    analysis.run()
  File "/app/zephyrus/lib/analysis.py", line 1520, in run
    update_enlighten_pbar(
  File "/app/zephyrus/lib/cloud/progress_bar.py", line 31, in update_enlighten_pbar
    _update_enlighten_pbar(
  File "/app/zephyrus/lib/cloud/progress_bar.py", line 109, in _update_enlighten_pbar
    update_task_output(
  File "/app/zephyrus/lib/cloud/firebase_transactions.py", line 388, in update_task_output
    raise e
  File "/app/zephyrus/lib/cloud/firebase_transactions.py", line 381, in update_task_output
    old_task_output_status, next_task_output_status = _update_task_output(
  File "/app/.venv/lib/python3.10/site-packages/google/cloud/firestore_v1/transaction.py", line 273, in __call__
    transaction._commit()
  File "/app/.venv/lib/python3.10/site-packages/google/cloud/firestore_v1/transaction.py", line 144, in _commit
    commit_response = _commit_with_retry(self._client, self._write_pbs, self._id)
  File "/app/.venv/lib/python3.10/site-packages/google/cloud/firestore_v1/transaction.py", line 339, in _commit_with_retry
    return client._firestore_api.commit(
  File "/app/.venv/lib/python3.10/site-packages/google/cloud/firestore_v1/services/firestore/client.py", line 1131, in commit
    response = rpc(
  File "/app/.venv/lib/python3.10/site-packages/google/api_core/gapic_v1/method.py", line 113, in __call__
    return wrapped_func(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/google/api_core/retry.py", line 349, in retry_wrapped_func
    return retry_target(
  File "/app/.venv/lib/python3.10/site-packages/google/api_core/retry.py", line 191, in retry_target
    return target()
  File "/app/.venv/lib/python3.10/site-packages/google/api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 74, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InvalidArgument: 400 The referenced transaction has expired or is no longer valid.

I run a somewhat complicated function inside an @transactional-wrapped method that is the core of a co-operative task system ... I'm getting the above error where it's not expected -- conflicts should be relatively rare, and I'm not sure how the transaction execution could be blocked long enough for the transaction to expire (120s is the default, I believe, which I haven't changed)

@wwhuang

wwhuang commented May 15, 2024

Any update on this, or on whether it's potentially related to the "The referenced transaction has expired or is no longer valid." error @breathe mentioned? I've also been getting this error sometimes. In my case I'm using transactions inside an AWS Lambda Python container, and it seems like it can cause the container to freeze indefinitely rather than restart.

@dblake10

I'm also watching to see if this issue's fix could address the "The referenced transaction has expired or is no longer valid." exception. Currently, I'm seeing an occurrence of this exception at a rate of about 1 in 150 executions in production. I haven't been able to reproduce the error reliably or eliminate the error by pinning a specific version.

@daniel-sanche
Contributor Author

Is the "The referenced transaction has expired or is no longer valid." error a recent one? Or has it been going on for a while?

It could be related to some of the retry problems listed here, but it's hard to say for sure. It sounds like it's a transient issue? If it's possible to create a reproducible sample, that would be helpful. I'll try to dig into it some more soon

@breathe

breathe commented May 22, 2024

Is the "The referenced transaction has expired or is no longer valid." error a recent one? Or has it been going on for a while?

We've only been building with Firebase for a few months, and we've seen this issue since we started running longer jobs (up to a few hours) which update either 1 or 2 documents inside a transaction every n seconds (n has increased to 30 seconds but used to be 10)

> It could be related to some of the retry problems listed here, but it's hard to say for sure. It sounds like it's a transient issue? If it's possible to create a reproducible sample, that would be helpful. I'll try to dig into it some more soon

It's definitely transient but occurs regularly ... We have some very long-running deterministic CI tasks (2-3 hours) which publish information about their progress as they run. The task updates 1 or 2 documents at most every 30 seconds using the same transactional _update_task_output function -- and the issue was happening regularly ... Of the two documents, one is owned by a task and only ever written by the owning task, and the other is written once by up to maybe 5 different tasks (when they finish) -- so there is concurrency and it's important that the updates are serialized, but it's a pretty low-conflict domain.

It took a while to find, but the workaround that has worked for me looks like this ...

with basic setup like this (this lives in the fb_queries module referenced from the call site below):

# Assumed import for this setup (not shown in the original snippet):
from google.cloud import firestore as firestore_api

client = firestore_api.Client(project=...)

def reset_client():
    # Tear down the existing client (and its grpc channel) and build a fresh one
    global client
    client.close()
    client = firestore_api.Client(project=...)

@firestore_api.transactional
def _update_task_output(
    transaction: firestore_api.Transaction,
    ...
): ...

and call-site logic like this (with the hacky workaround I found included) ...

    # Assumes random, time, and google.api_core.exceptions are imported, and that
    # fb_queries is the module holding the client / reset_client defined above.
    failure_count = 0
    max_timeout_failures = 3

    while True:
        try:
            transaction = fb_queries.client.transaction()
            old_task_output_status, next_task_output_status = _update_task_output(
                transaction,
                task_output_id,
                task_update_data,
            )
            break
        except google.api_core.exceptions.InvalidArgument as e:
            failure_count += 1
            if failure_count == max_timeout_failures:
                logger.error(
                    f"firebase transaction failed with timeout -- failed {failure_count} times -- reached max_timeout_failures {max_timeout_failures}",
                    task_update_data=task_update_data,
                )
                raise e
            else:
                fb_queries.reset_client()
                sleep_time = random.randrange(2, 2 + failure_count**2)
                logger.info(
                    f"firebase transaction failed, failure_count: {failure_count}, reset client, retrying after {sleep_time} seconds",
                    task_update_data=task_update_data,
                )
                time.sleep(sleep_time)

        except Exception as e:
            logger.exception("error when updating task output transactionally")
            raise e

The trashy retry logic above evolved in place as I tried to find a workaround for this problem ...

I can state for certain that just retrying the transaction after this failure is not sufficient. The workaround only started working once I added the fb_queries.reset_client() call. If I try to re-run the transaction using the same client instance after the transaction fails, the same exception is raised again immediately and the retry logic accomplishes nothing -- but when I swap out the client instance like above, the retry works ...

It feels to me like the google.api_core.exceptions.InvalidArgument shouldn't be bubbling up to this context -- I think it's likely something bubbling up unexpectedly from the http/grpc channel that shouldn't.

Since this happens for me only with long-running jobs, my suspicion is that there could be something related to auth refresh -- or some borked state or timeout within the grpc layer that causes the channel to fail to recover from some tcp/http connection-state scenario when the underlying socket has been alive long enough ...

@daniel-sanche
Contributor Author

daniel-sanche commented May 24, 2024

Thanks for sharing the workaround and the insights -- that's very helpful

It's surprising that resetting the client would be necessary; I thought the state would be contained in the Transaction object / the backend. I'm not immediately seeing anything in the client itself that would cause this...

Could you try enabling verbose grpc logs by setting GRPC_VERBOSITY=INFO and GRPC_TRACE=all? I'm mostly interested in the transaction fields of the BeginTransaction, Commit, and Rollback RPCs
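
For reference, one way to turn those on from Python is to set the variables before grpc is imported (a small sketch; exporting them in the shell before launching the job works just as well):

import os

# These must be set before grpc / google.cloud.firestore are imported,
# since gRPC reads them at initialization time.
os.environ["GRPC_VERBOSITY"] = "INFO"
os.environ["GRPC_TRACE"] = "all"

from google.cloud import firestore  # imported after the env vars on purpose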

--

I did some searching and came across this. Is it possible your transaction code could be doing something that closes the transaction early, and then tries to access the same transaction object later? If you're able to share some of the code in _update_task_output, that might help get to the bottom of it

Also, Datastore lists these limitations for transactions: "Transactions expire after 270 seconds or if idle for 60 seconds." I'm not seeing this listed anywhere in Firestore (Edit: same limits), but is it possible the transaction could block and end up sitting idle past those limits?

> It feels to me like the google.api_core.exceptions.InvalidArgument shouldn't be bubbling up to this context -- I think it's likely something bubbling up unexpectedly from the http/grpc channel that shouldn't.

Yeah, the InvalidArgument is a grpc error returned by the backend. It is expected to bubble up when there is a server error that's not caught by the app

> Since this happens for me only with long-running jobs, my suspicion is that there could be something related to auth refresh -- or some borked state or timeout within the grpc layer that causes the channel to fail to recover from some tcp/http connection-state scenario when the underlying socket has been alive long enough

The server is returning us a valid 400 bad request error, so I don't think it would be related to auth. It seems like the transaction being sent is just invalid. Assuming your code isn't doing anything like committing to the same transaction twice, the error is likely in the transaction lifecycle code somewhere
