-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: support multi-col pull queries #6796
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Some suggestions for additional test coverage but LGTM otherwise.
{"header":{"schema":"`ID1` STRING KEY, `ID2` INTEGER KEY, `COUNT` BIGINT"}}, | ||
{"row":{"columns":["11", 10, 1]}} | ||
]} | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add negative tests to check that sane errors are thrown if not all primary keys are selected, or if a multi-column key schema is used with IN
rather than equality? Might also be good to add a test for a windowed aggregate with multiple keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the former test is in the unit testing of WhereInfo
. I feel that each RQTT test adds somewhat large time overhead, so I'm a bit judicious about what to add here. If you feel that it's important I can add it in.
Might also be good to add a test for a windowed aggregate with multiple keys.
I can do that
ksqldb-engine/src/test/java/io/confluent/ksql/physical/pull/operators/WhereInfoTest.java
Outdated
Show resolved
Hide resolved
ksqldb-engine/src/main/java/io/confluent/ksql/physical/pull/operators/WhereInfo.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with a couple of suggestions.
Great job @agavra ! LGTM! I think going with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ksqldb-engine/src/main/java/io/confluent/ksql/physical/pull/operators/WhereInfo.java
Outdated
Show resolved
Hide resolved
throw invalidWhereClauseException("Bound on '" + keyColumn.text() | ||
+ "' must currently be '='", windowed); | ||
final Object[] keyContents = new Object[schema.key().size()]; | ||
final BitSet seenKeys = new BitSet(schema.key().size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super familiar with BitSet, but is this the number of unique bits or the highest bit value you have to account for?
And are key column indexes guaranteed to be the first 0...N indexes or can they skip around?
I'm just asking to to be sure this has the capacity and because I'm curious to learn the answers. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/**
* Creates a bit set whose initial size is large enough to explicitly
* represent bits with indices in the range {@code 0} through
* {@code nbits-1}. All bits are initially {@code false}.
*
* @param nbits the initial size of the bit set
* @throws NegativeArraySizeException if the specified initial size
* is negative
*/
public BitSet(int nbits) {
A bitset is essentially boolean[]
but with one bit per boolean - I'm not entirely sure what the difference is between "number of unique bits or the highest bit value you have to account for".
And are key column indexes guaranteed to be the first 0...N indexes or can they skip around?
When you take it from schema.key()
they are indexed as a sepearate namespace from the values. So there is a 0-indexed key and a 0-indexed value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. that makes sense. I didn't realize that. If they weren't compacted like that, the index could be greater than the number of keys (and hence my mentioning unique keys vs highest index), but that's not the case here.
I agree that this is a reasonable way to go in the near term. It might require you to lookup combinations you're not really interested in, creating inefficiency, but for small sets of keys, it's probably not a big deal. On the other hand, @vpapavas 's change to cover the general where clause only works on an existing data source. If the data source is a full table scan, that's fine, but we really want to still be able to extract keys so that we can identify whether we can do a key lookup data source (i.e. index) instead. Someone will still need to add that logic to cover the efficient OR key lookup case ( |
Thanks for the inputs everyone! At the moment, I'll keep the functionality to this and we can address the |
424065c
to
e6025c7
Compare
e6025c7
to
85a5b3a
Compare
Description
Support pull queries on aggregations that aggregate over multiple key columns by requiring that all keys that make up the primary key are selected in the
WHERE
clause as a conjunction of equality expressions (e.g.WHERE K1=1 AND K2=2
)."In" queries are not yet supported, there's some thinking to be done about how we can support that (cc @AlanConfluent perhaps we could do something like
WHERE K1 IN (1, 2) AND K2 in (3,4)
and this would find keys that make up the cross-product of those fields{1,3}, {1, 4}, {2, 3}, {2, 4}
)Alternatively, we could consider supporting
OR
in order to get multiple keys (e.g.WHERE (K1 = 1 AND K2 =2) OR (K1 = 3 AND K2 = 4
)Testing done
Reviewer checklist