Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Bubble up errors from HARouting unless using StandbyFallbackException #7238

Merged

Conversation

AlanConfluent
Copy link
Member

@AlanConfluent AlanConfluent commented Mar 16, 2021

Description

HARouting has previously treated all Exceptions the same and used them as reason to retry with standbys. The issue is that certain errors are deterministic and often caused by some form of user error or code bug. Not only is there no use in retrying something that will have the same effect on a standby, but if we treat it similar to a transient error and print the error message Unable to execute pull query: "%s". Exhausted standby hosts to try., it's rather confusing since trying again won't fix the issue.

The goal of this PR is to surface any bug by requiring any standby retries to explicitly use StandbyFallbackException and by bubbling up other exceptions by default. In particular, any network error gets wrapped in a StandbyFallbackException and everything else, including local lookup errors, is surfaced.

Fixes #6799

Testing done

I rewrote much of HARoutingTest since it wasn't well tested before and stopped at the routing step. It's now tested all the way down to either the network call or execution at the physical plan layer.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@AlanConfluent AlanConfluent requested a review from a team as a code owner March 16, 2021 16:59
Copy link
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a first pass on non-testing code. I'm wondering if we are going to capture/swallow the new type of exceptions anyways, could we just let the executeOrRouteQuery to return something for the caller that captures standby fallback instead of the exception throwing path?

pullQueryMetrics
.ifPresent(queryExecutorMetrics -> queryExecutorMetrics.recordRemoteRequests(1));
forwardTo(node, locations, statement, serviceContext, pullQueryQueue, rowFactory,
outputSchema);
} catch (StandbyFallbackException e) {
LOG.error("Standby Fallback: Error forwarding query {} to node {} with exception {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to log it as error really, since we are about to retry on the standby right?

Also if we intend to expose this to users, they may not really understand what standby fallback means. What about just "Failed to execute query {} on node {}, will try to switch to execute on the standby query state which may result in stale results"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that error is maybe too severe. It's still not perfect operation, so I think warn is most appropriate.

This error is only meant to show up in the logs and not be surfaced via the CLI, but I agree it's a bit unclear. I changed it to "Error executing query {} locally at node {}. Falling back to standby state which may return stale results"

} catch (StandbyFallbackException e) {
LOG.error("Standby Fallback: Error executing query {} locally at node {} with exception",
statement.getStatementText(), node, e.getCause());
throw e;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

} catch (StandbyFallbackException e) {
LOG.error("Standby Fallback: Error forwarding query {} to node {} with exception {}",
statement.getStatementText(), node, e.getCause());
throw e;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since on the caller we are going to swallow the StandbyFallbackException and retry on standbys, how about letting executeOrRouteQuery to return a result (for now, just boolean is okay; in general, maybe a enum can do more things) where the caller can check on the futures, instead of capturing and opening the exception envelope to check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, good suggestion. I just created an enum to make it a little easier to read.

Copy link
Member

@cprasad1 cprasad1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@guozhangwang
Copy link
Contributor

LGTM!

@AlanConfluent AlanConfluent merged commit ec12516 into confluentinc:master Mar 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pull Queries: HARouting does not show correct exception
3 participants