chore: support multi-col pull queries #6796

agavra · 2020-12-17T05:11:38Z

Description

Support pull queries on aggregations that aggregate over multiple key columns by requiring that all keys that make up the primary key are selected in the WHERE clause as a conjunction of equality expressions (e.g. WHERE K1=1 AND K2=2).
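For readers following along, the rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `MultiKeyLookup`, `Equality`, and `extractKey` are hypothetical names, and the WHERE clause is assumed to be already parsed into a flat list of `column = value` terms.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

final class MultiKeyLookup {

  /** One `column = value` term of the WHERE clause (hypothetical type). */
  static final class Equality {
    final String column;
    final Object value;

    Equality(String column, Object value) {
      this.column = column;
      this.value = value;
    }
  }

  /**
   * Builds the composite key from a conjunction of equality terms.
   * keyColumns lists the primary-key columns in schema order.
   */
  static Object[] extractKey(List<String> keyColumns, List<Equality> conjunction) {
    Object[] key = new Object[keyColumns.size()];
    BitSet seen = new BitSet(keyColumns.size());
    for (Equality eq : conjunction) {
      int idx = keyColumns.indexOf(eq.column);
      if (idx < 0) {
        throw new IllegalArgumentException("Not a key column: " + eq.column);
      }
      if (seen.get(idx)) {
        throw new IllegalArgumentException("Duplicate bound on key column: " + eq.column);
      }
      key[idx] = eq.value;
      seen.set(idx);
    }
    // Every key column must be bound exactly once by an equality.
    if (seen.cardinality() != keyColumns.size()) {
      throw new IllegalArgumentException("WHERE clause must bind every key column with '='");
    }
    return key;
  }

  public static void main(String[] args) {
    // WHERE K2 = 2 AND K1 = 1: term order does not matter, schema order does.
    Object[] key = extractKey(
        Arrays.asList("K1", "K2"),
        Arrays.asList(new Equality("K2", 2), new Equality("K1", 1)));
    System.out.println(Arrays.toString(key));  // prints [1, 2]
  }
}
```

A clause such as WHERE K1=1 alone would be rejected by this sketch, because K2 is never bound.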

"In" queries are not yet supported; there's some thinking to be done about how we can support them. (cc @AlanConfluent: perhaps we could do something like WHERE K1 IN (1, 2) AND K2 IN (3, 4), which would find the keys that make up the cross-product of those fields: {1, 3}, {1, 4}, {2, 3}, {2, 4}.)

Alternatively, we could consider supporting OR in order to get multiple keys (e.g. WHERE (K1 = 1 AND K2 = 2) OR (K1 = 3 AND K2 = 4)).
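To make the cross-product reading of IN concrete, here is a minimal sketch. It only shows how the candidate keys would be enumerated; `InCrossProduct` and `expand` are hypothetical names, and nothing like this is implemented in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InCrossProduct {

  /** Expands one list of candidate values per key column into every composite key. */
  static List<List<Object>> expand(List<List<Object>> valuesPerKeyColumn) {
    List<List<Object>> keys = new ArrayList<>();
    keys.add(new ArrayList<>());  // start from a single empty prefix
    for (List<Object> candidates : valuesPerKeyColumn) {
      List<List<Object>> next = new ArrayList<>();
      for (List<Object> prefix : keys) {
        for (Object candidate : candidates) {
          List<Object> extended = new ArrayList<>(prefix);
          extended.add(candidate);
          next.add(extended);
        }
      }
      keys = next;
    }
    return keys;
  }

  public static void main(String[] args) {
    // WHERE K1 IN (1, 2) AND K2 IN (3, 4)
    List<List<Object>> keys = expand(Arrays.asList(
        Arrays.<Object>asList(1, 2),
        Arrays.<Object>asList(3, 4)));
    System.out.println(keys);  // prints [[1, 3], [1, 4], [2, 3], [2, 4]]
  }
}
```

The number of lookups is the product of the list sizes, so this only stays cheap for small IN lists.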

Testing done

Unit testing
QTT testing
e2e testing (in progress)

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

vcrfxia

Awesome! Some suggestions for additional test coverage but LGTM otherwise.

vcrfxia · 2020-12-17T18:09:19Z

...test/resources/rest-query-validation-tests/pull-queries-against-materialized-aggregates.json

+          {"header":{"schema":"`ID1` STRING KEY, `ID2` INTEGER KEY, `COUNT` BIGINT"}},
+          {"row":{"columns":["11", 10, 1]}}
+        ]}
+      ]


Add negative tests to check that sane errors are thrown if not all primary keys are selected, or if a multi-column key schema is used with IN rather than equality? Might also be good to add a test for a windowed aggregate with multiple keys.

The former test is covered by the unit tests for WhereInfo. Each RQTT test adds a fairly large time overhead, so I'm judicious about what I add here. If you feel it's important, I can add it.

Might also be good to add a test for a windowed aggregate with multiple keys.

I can do that

ksqldb-engine/src/test/java/io/confluent/ksql/physical/pull/operators/WhereInfoTest.java

ksqldb-engine/src/main/java/io/confluent/ksql/physical/pull/operators/WhereInfo.java

docs/developer-guide/ksqldb-reference/select-pull-query.md

docs/concepts/queries/pull.md

docs/developer-guide/ksqldb-reference/select-pull-query.md

JimGalasyn

LGTM, with a couple of suggestions.

vpapavas · 2020-12-17T18:42:54Z

Great job @agavra ! LGTM!

I think going with the WHERE K1 IN (1, 2) AND K2 in (3,4) is a good idea. Once I am done with the refactoring, we will have support for OR in the WHERE clause so I wouldn't worry about adding it now just for the IN predicate.

vpapavas

LGTM

ksqldb-engine/src/main/java/io/confluent/ksql/physical/pull/operators/WhereInfo.java

AlanConfluent · 2020-12-17T21:07:44Z

ksqldb-engine/src/main/java/io/confluent/ksql/physical/pull/operators/WhereInfo.java

-      throw invalidWhereClauseException("Bound on '" + keyColumn.text()
-          + "' must currently be '='", windowed);
+    final Object[] keyContents = new Object[schema.key().size()];
+    final BitSet seenKeys = new BitSet(schema.key().size());


I'm not super familiar with BitSet, but is this the number of unique bits or the highest bit value you have to account for?

And are key column indexes guaranteed to be the first 0...N indexes or can they skip around?

I'm just asking to be sure this has the capacity and because I'm curious to learn the answers. :-)

/**
 * Creates a bit set whose initial size is large enough to explicitly
 * represent bits with indices in the range {@code 0} through
 * {@code nbits-1}. All bits are initially {@code false}.
 *
 * @param  nbits the initial size of the bit set
 * @throws NegativeArraySizeException if the specified initial size
 *         is negative
 */
public BitSet(int nbits) {

A bitset is essentially boolean[] but with one bit per boolean - I'm not entirely sure what the difference is between "number of unique bits or the highest bit value you have to account for".

And are key column indexes guaranteed to be the first 0...N indexes or can they skip around?

When you take it from schema.key(), the columns are indexed in a separate namespace from the values. So there is a 0-indexed key and a 0-indexed value.

Ah, OK, that makes sense. I didn't realize that. If they weren't compacted like that, the index could be greater than the number of keys (hence my mentioning unique keys vs. highest index), but that's not the case here.
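As a small aside on the capacity question, the sketch below shows standard `java.util.BitSet` behaviour: the constructor argument is only an initial size, and setting a higher index grows the set rather than failing. `BitSetCapacity` and `allKeysSeen` are hypothetical names used for illustration.

```java
import java.util.BitSet;

final class BitSetCapacity {

  /** True when every key index in 0..keyCount-1 has been marked as seen. */
  static boolean allKeysSeen(BitSet seenKeys, int keyCount) {
    // Assumes only indexes in 0..keyCount-1 are ever set.
    return seenKeys.cardinality() == keyCount;
  }

  public static void main(String[] args) {
    int keyCount = 2;
    BitSet seenKeys = new BitSet(keyCount);  // nbits is an initial size, not a hard limit
    seenKeys.set(0);
    System.out.println(allKeysSeen(seenKeys, keyCount));  // prints false
    seenKeys.set(1);
    System.out.println(allKeysSeen(seenKeys, keyCount));  // prints true

    // Setting an index beyond the initial size grows the set automatically.
    seenKeys.set(100);
    System.out.println(seenKeys.length());  // prints 101 (highest set bit + 1)
  }
}
```

So even if key indexes were not compacted to 0...N-1, the BitSet itself would not overflow; the sizing only matters for the companion `Object[]`.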

AlanConfluent · 2020-12-17T21:42:26Z

"In" queries are not yet supported; there's some thinking to be done about how we can support them. (cc @AlanConfluent: perhaps we could do something like WHERE K1 IN (1, 2) AND K2 IN (3, 4), which would find the keys that make up the cross-product of those fields: {1, 3}, {1, 4}, {2, 3}, {2, 4}.)

I agree that this is a reasonable way to go in the near term. It might require you to look up combinations you're not really interested in, which is inefficient, but for small sets of keys it's probably not a big deal.

On the other hand, @vpapavas's change to cover the general WHERE clause only works on an existing data source. If the data source is a full table scan, that's fine, but we still want to be able to extract keys so that we can identify whether we can use a key-lookup data source (i.e. an index) instead. Someone will still need to add that logic to cover the efficient OR key-lookup case (WHERE (K1 = 1 AND K2 = 2) OR (K1 = 3 AND K2 = 4)). It seems to me that this wouldn't be wasted work (if you wanted to tackle that), but maybe @vpapavas doesn't want you two to step on each other's toes too much in this area while it's being refactored.
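A rough sketch of what that key extraction for the OR case could look like, assuming each OR branch has already been reduced to a map of its equality terms. `OrKeyExtraction`, `extractKeys`, and `branch` are hypothetical names, and the fall-back-to-scan behaviour is an assumption made for illustration, not something this PR implements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class OrKeyExtraction {

  /**
   * Returns one composite key per OR branch, or empty if any branch fails to
   * bind every key column (a caller would then fall back to a table scan).
   * Simplification: terms on non-key columns are ignored here.
   */
  static Optional<List<List<Object>>> extractKeys(
      List<String> keyColumns, List<Map<String, Object>> branches) {
    List<List<Object>> keys = new ArrayList<>();
    for (Map<String, Object> branch : branches) {
      List<Object> key = new ArrayList<>();
      for (String column : keyColumns) {
        if (!branch.containsKey(column)) {
          return Optional.empty();
        }
        key.add(branch.get(column));
      }
      keys.add(key);
    }
    return Optional.of(keys);
  }

  /** Builds one branch from alternating column names and values. */
  static Map<String, Object> branch(Object... columnValuePairs) {
    Map<String, Object> terms = new LinkedHashMap<>();
    for (int i = 0; i + 1 < columnValuePairs.length; i += 2) {
      terms.put((String) columnValuePairs[i], columnValuePairs[i + 1]);
    }
    return terms;
  }

  public static void main(String[] args) {
    // WHERE (K1 = 1 AND K2 = 2) OR (K1 = 3 AND K2 = 4)
    Optional<List<List<Object>>> keys = extractKeys(
        Arrays.asList("K1", "K2"),
        Arrays.asList(branch("K1", 1, "K2", 2), branch("K1", 3, "K2", 4)));
    System.out.println(keys.isPresent());  // prints true
    System.out.println(keys.get());        // prints [[1, 2], [3, 4]]
  }
}
```

Unlike the IN cross-product, this looks up exactly the keys the query names and no unwanted combinations.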

agavra · 2020-12-17T22:03:24Z

Thanks for the input, everyone! For now I'll keep the functionality as it is, and we can address the IN use case as a follow-up sometime in the future.

agavra marked this pull request as ready for review December 17, 2020 16:37

agavra requested review from a team and JimGalasyn as code owners December 17, 2020 16:37

vcrfxia approved these changes Dec 17, 2020

JimGalasyn reviewed Dec 17, 2020

vcrfxia mentioned this pull request Dec 17, 2020

feat: ungate support for multi-column GROUP BY #6786

Merged

2 tasks

JimGalasyn reviewed Dec 17, 2020

JimGalasyn approved these changes Dec 17, 2020

vpapavas approved these changes Dec 17, 2020

AlanConfluent approved these changes Dec 17, 2020

agavra added 2 commits December 17, 2020 14:55

chore: support multi-col pull queries

81c8e71

docs: add docs

106f49f

agavra force-pushed the multi_col_pull branch 2 times, most recently from 424065c to e6025c7 Compare December 17, 2020 22:56

chore: review comments

85a5b3a

agavra force-pushed the multi_col_pull branch from e6025c7 to 85a5b3a Compare December 17, 2020 23:01

agavra merged commit 63dcfda into confluentinc:master Dec 18, 2020

agavra deleted the multi_col_pull branch December 18, 2020 00:29
