
gRPC _Rendezvous w/ 'ABORTED' status propagated from 'Session.run_in_transaction' #3562

Closed
dhermes opened this issue Jun 28, 2017 · 31 comments
Labels
api: spanner (Issues related to the Spanner API.), grpc, priority: p1 (Important issue which blocks shipping the next release. Will be fixed prior to next release.), release blocking (Required feature/issue must be fixed prior to next release.)

Comments

@dhermes
Contributor

dhermes commented Jun 28, 2017

https://circleci.com/gh/GoogleCloudPlatform/google-cloud-python/2281

@tseaver Can you triage / update the issue title to be more specific?

@dhermes dhermes added the api: spanner and flaky labels Jun 28, 2017
@tseaver tseaver changed the title from "Flaky Spanner system test" to "Unexpected 'ABORTED' error propagated from 'Session.run_in_transaction'" Jun 28, 2017
@tseaver
Contributor

tseaver commented Jun 28, 2017

The code for Session.run_in_transaction expects that the ABORTED status will be propagated as an instance of google.gax.errors.GaxError; somehow, the underlying GAPIC code is raising a gRPC _Rendezvous instead.
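
For illustration, the retry expectation looks roughly like this sketch (paraphrased, not the actual library source; the session/transaction method names are simplified, and GaxError is assumed to expose the underlying gRPC error as its cause attribute):

import grpc
from google.gax.errors import GaxError

def run_in_transaction_sketch(session, unit_of_work, *args, **kwargs):
    # Simplified sketch of the expected control flow, not the shipped code.
    while True:
        transaction = session.transaction()
        try:
            result = unit_of_work(transaction, *args, **kwargs)
            transaction.commit()
            return result
        except GaxError as exc:
            cause = exc.cause  # underlying grpc.RpcError (_Rendezvous), assumed attribute
            if cause is not None and cause.code() == grpc.StatusCode.ABORTED:
                continue  # aborted transaction: retry with a fresh one
            raise

The failure reported here is that the raw _Rendezvous escapes instead of a GaxError, so it never reaches that except clause and propagates to the caller.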

@tseaver tseaver changed the title from "Unexpected 'ABORTED' error propagated from 'Session.run_in_transaction'" to "gRPC _Rendezvous w/ 'ABORTED' status propagated from 'Session.run_in_transaction'" Jun 28, 2017
@tseaver
Contributor

tseaver commented Jun 28, 2017

@lukesneeringer Do we need to be regenerating the GAPIC code?

@volokluev

Hi, it is very annoying to have a whole bunch of exceptions thrown in my application whenever a transaction is aborted. Is there any workaround for this issue at the moment?

@tseaver
Contributor

tseaver commented Jul 25, 2017

@vladkluev I'm hopeful we will have a fix pushed out shortly (this bug blocks PR #3663).

@tseaver
Contributor

tseaver commented Jul 25, 2017

So, the issue is that the streaming_read and execute_streaming_sql calls are propagating the _Rendezvous error rather than wrapping it in a GaxError, as the commit call does.

@bjwatson bjwatson added the priority: p0 label Jul 25, 2017
@bjwatson

bjwatson commented Jul 25, 2017

@tseaver I'm discussing this with @lukesneeringer offline.

@lukesneeringer
Contributor

Repeating this back to ensure I understand: Basically this is a situation where one call wraps the errors while others do not? Is this a configuration difference? Something else?

@tseaver
Contributor

tseaver commented Jul 25, 2017

@lukesneeringer The new system test in #3663 provokes it routinely: the gRPC ABORTED return triggered by streaming_read is not wrapped, while the other system tests (which are meant to trigger it in the commit call) do get it wrapped.

@lukesneeringer
Contributor

Okay, so it is literally non-deterministic then?

@tseaver
Contributor

tseaver commented Jul 27, 2017

It fails repeatably on #3663. My suspicion is that the GAX/GAPIC streaming support is involved (our read and execute_sql methods actually call the streaming APIs).

@volokluev

It is non-deterministic. On our production systems it happens maybe 1 in 15 times. I temporarily patched it by catching the error and retrying. My question is: what is the behavior that is supposed to happen here? If the transaction is non-deterministically aborted due to a "transient" error, is the user of the API the one who is supposed to deal with that?
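
A caller-side workaround along these lines is possible (illustrative only, not the actual patch; the helper name and parameters are made up):

import time

import grpc

def run_with_manual_retry(session, unit_of_work, attempts=5, base_delay=0.5):
    # Illustrative workaround: retry when the unwrapped ABORTED error escapes
    # from session.run_in_transaction instead of being retried internally.
    for attempt in range(attempts):
        try:
            return session.run_in_transaction(unit_of_work)
        except grpc.RpcError as exc:
            # The escaping _Rendezvous is both a grpc.RpcError and a grpc.Call,
            # so .code() is available here.
            if exc.code() != grpc.StatusCode.ABORTED or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # simple exponential backoff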

@tseaver
Contributor

tseaver commented Jul 27, 2017

@vladkluev The ABORTED error is supposed to propagate as a GaxError, but is somehow escaping as the lower-level grpc._Rendezvous. The run_in_transaction code handles retrying correctly when GaxError is raised.

@volokluev

Ah gotcha

tseaver added a commit that referenced this issue Jul 28, 2017
@bjwatson

@eoogbe can you fix this issue in GAPIC/GAX code next week? I'll be OOO Monday and Tuesday, but I think @tseaver can help you understand what's needed here.

This is blocking us from finishing a system test (#3663) required for the Spanner Beta.

@evaogbe
Contributor

evaogbe commented Aug 3, 2017

It looks like the streaming gRPC methods return a _Rendezvous on error, while the unary ones raise a _Rendezvous. The difference seems to have something to do with _Rendezvous being an iterator, but to be honest I don't understand the gRPC code well enough to know what the fix is right now.
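
A minimal illustration of the difference (stub and method names are hypothetical): a unary call raises the error inside the invocation itself, while a response-streaming call returns an iterator and the error only surfaces once the caller iterates it:

import grpc

def call_unary(stub, request):
    # Any RPC failure surfaces right here, inside whatever wraps the invocation.
    return stub.UnaryMethod(request)

def call_streaming(stub, request):
    # The invocation itself returns a response iterator without raising...
    responses = stub.StreamingMethod(request)
    # ...and a failed RPC raises grpc.RpcError only during iteration, which a
    # wrapper around the invocation alone never sees.
    return list(responses)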

@bjwatson

bjwatson commented Aug 3, 2017

Thanks for the analysis, @eoogbe. @nathanielmanistaatgoogle might be able to help you with understanding the gRPC code, or can point you to someone who can.

@nathanielmanistaatgoogle

@eoogbe: the response-streaming methods return an object that is both a grpc.Call and an iterator that emits the response messages of the RPC. If the RPC fails, attempts to draw another response message from the object (presumably during iteration, but next(my_response_iterator_object) would work just as well) raise an exception that is both a grpc.RpcError and a grpc.Call.

Under the hood, as you've discovered, it's just one object (a _Rendezvous instance), but that's expected to change in a future release, so be wary of writing code that relies on it.

Does your code look something like

import logging

import grpc

my_response_iterator_call = my_stub.MyResponseStreamingRpcMethod(
    my_request_or_request_iterator)
try:
    for my_response in my_response_iterator_call:
        do_thing(my_response)
except grpc.RpcError as my_rpc_error_call:
    logging.exception(
        'RPC terminated abnormally with code %s and details %s!',
        my_rpc_error_call.code(), my_rpc_error_call.details())
else:
    logging.info('RPC completed successfully!')

? Should it look something like that?

@evaogbe
Contributor

evaogbe commented Aug 4, 2017

The code that wraps the RpcError in a GaxError is here. It is intended to work on both unary and streaming functions, so it can't assume that the return value is an iterator.

@nathanielmanistaatgoogle

@eoogbe: it's also not safe to assume that the only problems that can ever happen during an RPC happen during RPC invocation. It's perfectly possible, for example, for a response-streaming RPC to stream one hundred responses and then time out so that the client sees an exception raised from the one-hundred-and-first call to next(my_response_iterator). Or consider another RPC that has streamed fifty responses (and those responses have arrived at the client and been emitted from the response-iterator object) before the user's wireless router falls off the shelf and into the bathtub?

As I assess it, _catch_errors is not adequate for what you're trying to do. Consider splitting it into _catch_errors_on_invocation and _catch_errors_during_iteration? Or maybe author some other utility function that, given an iterator, returns an exception-translating wrapper around the given iterator? While it may not be the case that _catch_errors knows whether the RPC invocation is response-unary or response-streaming, something at some level of abstraction must, and I suppose that will be a place where at least part of your fix will have to be made.
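
As a sketch of that second suggestion (the function name is hypothetical and the GaxError constructor signature is assumed, so this is not actual gax-python API), a generator can translate errors raised during iteration the same way _catch_errors translates errors raised on invocation:

import grpc

from google.gax import errors

def _translate_errors_during_iteration(response_iterator):
    # Pass responses through; re-raise an in-flight RPC failure as a GaxError.
    try:
        for response in response_iterator:
            yield response
    except grpc.RpcError as exc:
        raise errors.GaxError('RPC failed during iteration', cause=exc)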

@theacodes
Contributor

#3738 will give us the fundamental building blocks to map exceptions correctly, but @nathanielmanistaatgoogle is right, we need to take special care with unary-streaming methods. I'll address that in the PR that adds gRPC-specific helpers to google.api.core.

@bjwatson

bjwatson commented Aug 7, 2017

@jonparrott Is there anything that @eoogbe needs to do in https://github.com/googleapis/gax-python to fix this issue, or does #3738 provide everything we need?

@bjwatson bjwatson added the priority: p1 and release blocking labels and removed the priority: p0 label Aug 7, 2017
@theacodes
Contributor

theacodes commented Aug 8, 2017 via email

@bjwatson

bjwatson commented Aug 8, 2017

Fixing this issue that's blocking Python Spanner Beta has the greatest urgency, so it's good to hear that #3738 is sufficient to fix it.

That said, we'll need to fix this for GAPIC-only libraries. @jonparrott can you show @eoogbe what needs to be done to port your work to GAX, so that she's also familiar with the fix?

@theacodes
Contributor

Following up internally.

@bjwatson

bjwatson commented Aug 8, 2017

Thanks for the update @jonparrott. I've responded internally.

What needs to be done to close this specific issue for Spanner? Will you do it or should @tseaver?

@theacodes
Contributor

It's up to @tseaver (see here), but I'm happy to help.

@bjwatson

bjwatson commented Aug 8, 2017

Thanks, Jon.

tseaver added a commit that referenced this issue Aug 9, 2017
@tseaver
Contributor

tseaver commented Aug 9, 2017

I will be merging the workaround in #3663 for spanner tonight.

@bjwatson

bjwatson commented Aug 9, 2017

Thanks @tseaver!

@tseaver
Contributor

tseaver commented Aug 10, 2017

Via #3663.
