New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

feat: Allow scalable push queries to handle rebalances #7988

Merged

AlanConfluent merged 8 commits into confluentinc:master from AlanConfluent:spq_rebalance

Aug 18, 2021

Member

AlanConfluent commented Aug 11, 2021

Description

Currently, we don't handle rebalances explicitly in SPQ. If a new node joins the cluster or is removed from the cluster, we don't do anything. In this PR, it attempts to start a new request to the node or expects the node to drop out of the cluster and tries to keep the connections ongoing, ensuring now data is missed.

Testing done

Ran unit tests.

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

AlanConfluent requested a review from a team as a code owner

August 11, 2021 16:43

nateab reviewed

View reviewed changes

ksqldb-engine/src/main/java/io/confluent/ksql/physical/scalablepush/PushRouting.java Outdated Show resolved Hide resolved

nateab reviewed

View reviewed changes

ksqldb-engine/src/main/java/io/confluent/ksql/physical/scalablepush/PushRouting.java Outdated Show resolved Hide resolved

agavra reviewed

View reviewed changes

Contributor

agavra left a comment

I skimmed the PR, it makes sense to me - though PushQueryRouting is getting a little too large for comfort

ksqldb-engine/src/main/java/io/confluent/ksql/physical/scalablepush/PushRouting.java Outdated

+                      serviceContext, pushPhysicalPlan, statement, hosts, outputSchema,
+                      transientQueryQueue, pushConnectionsHandle, false);
+                  // Only check for new nodes if this is the source node
+                  if (!pushRoutingOptions.getIsSkipForwardRequest()) {

Contributor

agavra Aug 11, 2021

this seems like a violation of demeter's law - it assumes that !getIsSkipForwardRequest implies that this is the source node, which requires knowledge about how forwarding works. can we rename that method to something more meaningful?

Member Author

AlanConfluent Aug 13, 2021

For some reason I copied that name, which I agree is pretty not great. I changed it to getHasBeenForwarded() which is pretty clear what it means and hopefully shouldn't seem like a violation of encapsulation.

ksqldb-engine/src/main/java/io/confluent/ksql/physical/scalablepush/PushRouting.java Outdated

+                              result.close();
+                              result.updateStatus(RoutingResultStatus.COMPLETE);
+                            });
+                        LOG.info("Host {} completed request.", node);

Contributor

agavra Aug 11, 2021

do we have any identifier for what request it completed? Otherwise I'm not sure how helpful this message will be for debugging

Member Author

AlanConfluent Aug 13, 2021

Yes, I added pushPhysicalPlan.getQueryId() to most of the logging calls to make it a bit easier to understand, though we don't log every request because it could result in a lot of logging. They probably happen less often than pull queries, but we want them to be that lightweight.

I still think there's some value having semi exceptional or uncommon things in the logs to at least know something even if we don't know how to tie it to an exact query. This usually should only happen if a node is shut down which doesn't happen often.

ksqldb-engine/src/main/java/io/confluent/ksql/physical/scalablepush/PushRouting.java Outdated

+                private static Function<ScalablePushRegistry, Set<KsqlNode>> createLoadingCache() {
+                  final LoadingCache<ScalablePushRegistry, Set<KsqlNode>> cache = CacheBuilder.newBuilder()
+                      .maximumSize(40)
+                      .expireAfterWrite(2000, TimeUnit.MILLISECONDS)

Contributor

agavra Aug 11, 2021

is this meant to be CLUSTER_CHECK_INTERVAL?

Member Author

AlanConfluent Aug 13, 2021

Yeah, it probably should be. We want them to expire at least that often. I'll actually create a constant for this.

ksqldb-engine/src/main/java/io/confluent/ksql/physical/scalablepush/PushRouting.java

                 public enum RoutingResultStatus {
-                  SUCCESS
+                  IN_PROGRESS,

Contributor

agavra Aug 11, 2021

nit: javadocs for each of these could help :)

Member Author

AlanConfluent Aug 13, 2021

Done, added a bunch of javadoc.

agavra requested a review from a team

August 11, 2021 23:31

Member

nateab commented Aug 12, 2021 •

edited

Loading

I skimmed the PR, it makes sense to me - though PushQueryRouting is getting a little too large for comfort

For my understanding, when does something get too large for comfort? When it's logic is trying to encompass too much and is better served by breaking it up more modularly? I just ask because on first glance PushRouting is only around 400 lines which doesn't seem that large compared to a lot of our files.

Contributor

agavra commented Aug 12, 2021

when does something get too large for comfort

@nateab My barometer is usually when I have to pull up the IDE to effectively review the code, it's getting too large. There isn't a science to this, unfortunately, and everyone has different opinions.

nateab approved these changes

View reviewed changes

Member

nateab left a comment

LGTM! just left some comments

AlanConfluent force-pushed the spq_rebalance branch from c508a16 to 69daf4d Compare

August 17, 2021 21:47

AlanConfluent added 8 commits

August 18, 2021 09:00


          feat: Allow scalable push queries to handle rebalances

b7888b3


          Gets tests passing

9d5bf8a


          Remove print statements and lint everythign

de34baf


          Feedback

44db5fc


          Fix spotbugs error

e356045


          Findbugs warning

70cafef


          Allows truly dynamic connecting

ab3c9ad


          Fix tests from bad merge

183aa20

AlanConfluent force-pushed the spq_rebalance branch from 69daf4d to 183aa20 Compare

August 18, 2021 16:36

AlanConfluent merged commit b3dbed3 into confluentinc:master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet