Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Allow scalable push queries to handle rebalances #7988

Merged
merged 8 commits into from
Aug 18, 2021

Conversation

AlanConfluent
Copy link
Member

Description

Currently, we don't handle rebalances explicitly in SPQ. If a new node joins the cluster or is removed from the cluster, we don't do anything. In this PR, it attempts to start a new request to the node or expects the node to drop out of the cluster and tries to keep the connections ongoing, ensuring now data is missed.

Testing done

Ran unit tests.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@AlanConfluent AlanConfluent requested a review from a team as a code owner August 11, 2021 16:43
Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skimmed the PR, it makes sense to me - though PushQueryRouting is getting a little too large for comfort

serviceContext, pushPhysicalPlan, statement, hosts, outputSchema,
transientQueryQueue, pushConnectionsHandle, false);
// Only check for new nodes if this is the source node
if (!pushRoutingOptions.getIsSkipForwardRequest()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a violation of demeter's law - it assumes that !getIsSkipForwardRequest implies that this is the source node, which requires knowledge about how forwarding works. can we rename that method to something more meaningful?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason I copied that name, which I agree is pretty not great. I changed it to getHasBeenForwarded() which is pretty clear what it means and hopefully shouldn't seem like a violation of encapsulation.

result.close();
result.updateStatus(RoutingResultStatus.COMPLETE);
});
LOG.info("Host {} completed request.", node);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have any identifier for what request it completed? Otherwise I'm not sure how helpful this message will be for debugging

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I added pushPhysicalPlan.getQueryId() to most of the logging calls to make it a bit easier to understand, though we don't log every request because it could result in a lot of logging. They probably happen less often than pull queries, but we want them to be that lightweight.

I still think there's some value having semi exceptional or uncommon things in the logs to at least know something even if we don't know how to tie it to an exact query. This usually should only happen if a node is shut down which doesn't happen often.

private static Function<ScalablePushRegistry, Set<KsqlNode>> createLoadingCache() {
final LoadingCache<ScalablePushRegistry, Set<KsqlNode>> cache = CacheBuilder.newBuilder()
.maximumSize(40)
.expireAfterWrite(2000, TimeUnit.MILLISECONDS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this meant to be CLUSTER_CHECK_INTERVAL?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it probably should be. We want them to expire at least that often. I'll actually create a constant for this.

public enum RoutingResultStatus {
SUCCESS
IN_PROGRESS,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: javadocs for each of these could help :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added a bunch of javadoc.

@agavra agavra requested a review from a team August 11, 2021 23:31
@nateab
Copy link
Member

nateab commented Aug 12, 2021

I skimmed the PR, it makes sense to me - though PushQueryRouting is getting a little too large for comfort

For my understanding, when does something get too large for comfort? When it's logic is trying to encompass too much and is better served by breaking it up more modularly? I just ask because on first glance PushRouting is only around 400 lines which doesn't seem that large compared to a lot of our files.

@agavra
Copy link
Contributor

agavra commented Aug 12, 2021

when does something get too large for comfort

@nateab My barometer is usually when I have to pull up the IDE to effectively review the code, it's getting too large. There isn't a science to this, unfortunately, and everyone has different opinions.

Copy link
Member

@nateab nateab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! just left some comments

@AlanConfluent AlanConfluent merged commit b3dbed3 into confluentinc:master Aug 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants