chore: support multi-col pull queries #6796

Merged (3 commits) on Dec 18, 2020

agavra
Contributor

@agavra agavra commented Dec 17, 2020

Description

Support pull queries on aggregations over multiple key columns by requiring that every column in the primary key is bound in the WHERE clause as a conjunction of equality expressions (e.g. WHERE K1 = 1 AND K2 = 2).

IN queries are not yet supported; there's some thinking to be done about how we can support them (cc @AlanConfluent). Perhaps we could do something like WHERE K1 IN (1, 2) AND K2 IN (3, 4), which would find the keys that make up the cross-product of those fields: {1, 3}, {1, 4}, {2, 3}, {2, 4}.

Alternatively, we could consider supporting OR in order to get multiple keys (e.g. WHERE (K1 = 1 AND K2 = 2) OR (K1 = 3 AND K2 = 4)).
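
To make the cross-product idea concrete, here is a minimal Java sketch of how one IN list per key column could be expanded into full keys. The class `KeyCrossProduct` and its `expand` method are hypothetical names used only for illustration; they are not part of this PR or of ksqlDB.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of ksqlDB. */
final class KeyCrossProduct {

  private KeyCrossProduct() { }

  /**
   * Expands one IN list per key column into every full key combination.
   * inLists.get(i) holds the candidate values for the i-th key column.
   */
  static List<List<Object>> expand(final List<? extends List<?>> inLists) {
    // Start with a single empty key and extend it one column at a time.
    List<List<Object>> keys = new ArrayList<>();
    keys.add(new ArrayList<>());

    for (final List<?> candidates : inLists) {
      final List<List<Object>> extended = new ArrayList<>();
      for (final List<Object> prefix : keys) {
        for (final Object value : candidates) {
          final List<Object> key = new ArrayList<>(prefix);
          key.add(value);
          extended.add(key);
        }
      }
      keys = extended;
    }
    return keys;
  }

  public static void main(final String[] args) {
    // Models WHERE K1 IN (1, 2) AND K2 IN (3, 4)
    System.out.println(expand(List.of(List.of(1, 2), List.of(3, 4))));
    // prints [[1, 3], [1, 4], [2, 3], [2, 4]]
  }
}
```

The number of lookups is the product of the IN list sizes, so this approach is probably only reasonable for small lists.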

Testing done

  • Unit testing
  • QTT testing
  • e2e testing (in progress)

Reviewer checklist

  • Ensure docs are updated if necessary (e.g., if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@agavra agavra marked this pull request as ready for review December 17, 2020 16:37
@agavra agavra requested review from a team and JimGalasyn as code owners December 17, 2020 16:37
Contributor

@vcrfxia vcrfxia left a comment

Awesome! Some suggestions for additional test coverage but LGTM otherwise.

{"header":{"schema":"`ID1` STRING KEY, `ID2` INTEGER KEY, `COUNT` BIGINT"}},
{"row":{"columns":["11", 10, 1]}}
]}
]
Contributor

Add negative tests to check that sane errors are thrown if not all primary keys are selected, or if a multi-column key schema is used with IN rather than equality? Might also be good to add a test for a windowed aggregate with multiple keys.

Contributor Author

@agavra agavra Dec 17, 2020

The former case is covered by the unit tests for WhereInfo. Each RQTT test adds a fairly large time overhead, so I try to be judicious about what I add here. If you feel that it's important, I can add it.

Might also be good to add a test for a windowed aggregate with multiple keys.

I can do that

docs/developer-guide/ksqldb-reference/select-pull-query.md (outdated, resolved)
docs/concepts/queries/pull.md (outdated, resolved)
Member

@JimGalasyn JimGalasyn left a comment

LGTM, with a couple of suggestions.

@vpapavas
Member

Great job, @agavra! LGTM!

I think going with WHERE K1 IN (1, 2) AND K2 IN (3, 4) is a good idea. Once I am done with the refactoring, we will have support for OR in the WHERE clause, so I wouldn't worry about adding it now just for the IN predicate.

Member

@vpapavas vpapavas left a comment

LGTM

throw invalidWhereClauseException("Bound on '" + keyColumn.text()
+ "' must currently be '='", windowed);
final Object[] keyContents = new Object[schema.key().size()];
final BitSet seenKeys = new BitSet(schema.key().size());
Member

I'm not super familiar with BitSet, but is this the number of unique bits or the highest bit value you have to account for?

And are key column indexes guaranteed to be the first 0...N indexes or can they skip around?

I'm just asking to be sure this has the capacity and because I'm curious to learn the answers. :-)

Contributor Author

@agavra agavra Dec 17, 2020

/**
     * Creates a bit set whose initial size is large enough to explicitly
     * represent bits with indices in the range {@code 0} through
     * {@code nbits-1}. All bits are initially {@code false}.
     *
     * @param  nbits the initial size of the bit set
     * @throws NegativeArraySizeException if the specified initial size
     *         is negative
     */
    public BitSet(int nbits) {

A BitSet is essentially a boolean[] that uses one bit per boolean. I'm not entirely sure what the difference is between "number of unique bits or the highest bit value you have to account for".

And are key column indexes guaranteed to be the first 0...N indexes or can they skip around?

When you take them from schema.key(), the key columns are indexed in a separate namespace from the value columns, so there is a 0-indexed list of keys and a 0-indexed list of values.
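
As a rough illustration of why `new BitSet(schema.key().size())` has enough capacity, here is a minimal Java sketch of tracking which key columns have been bound. It is a simplified stand-in, not the code in this PR: `KeyEqualityBinder`, `bind`, and the use of a plain name-to-value map for the equality comparisons are all assumptions made for the example.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the key-extraction step; not the ksqlDB implementation. */
final class KeyEqualityBinder {

  private KeyEqualityBinder() { }

  /**
   * Builds the full key from `column = value` comparisons.
   *
   * @param keyColumns key column names in schema order, so indexes run 0..N-1
   * @param equalities the equality comparisons found in the WHERE clause
   */
  static Object[] bind(final List<String> keyColumns, final Map<String, ?> equalities) {
    final Object[] keyContents = new Object[keyColumns.size()];
    // One bit per key column; bit i is set once key column i has been bound.
    final BitSet seenKeys = new BitSet(keyColumns.size());

    for (final Map.Entry<String, ?> eq : equalities.entrySet()) {
      final int idx = keyColumns.indexOf(eq.getKey());
      if (idx < 0) {
        throw new IllegalArgumentException("Not a key column: " + eq.getKey());
      }
      keyContents[idx] = eq.getValue();
      seenKeys.set(idx);
    }

    // cardinality() counts the set bits, so it must equal the number of key columns.
    if (seenKeys.cardinality() != keyColumns.size()) {
      final int missing = seenKeys.nextClearBit(0);
      throw new IllegalArgumentException(
          "Missing equality bound on key column: " + keyColumns.get(missing));
    }
    return keyContents;
  }
}
```

Because key indexes are dense (0 through N-1), the highest bit ever set is N-1, which is exactly what a `BitSet` created with `nbits = N` can represent without growing.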

Member

Ah, OK, that makes sense. I didn't realize that. If they weren't compacted like that, the index could be greater than the number of keys (hence my mention of unique keys vs. highest index), but that's not the case here.

@AlanConfluent
Member

AlanConfluent commented Dec 17, 2020

IN queries are not yet supported; there's some thinking to be done about how we can support them (cc @AlanConfluent). Perhaps we could do something like WHERE K1 IN (1, 2) AND K2 IN (3, 4), which would find the keys that make up the cross-product of those fields: {1, 3}, {1, 4}, {2, 3}, {2, 4}.

I agree that this is a reasonable way to go in the near term. It might require you to look up combinations you're not really interested in, which is inefficient, but for small sets of keys it's probably not a big deal.

On the other hand, @vpapavas's change to cover the general WHERE clause only works on an existing data source. If the data source is a full table scan, that's fine, but we still want to be able to extract keys so that we can identify whether we can use a key lookup data source (i.e. an index) instead. Someone will still need to add that logic to cover the efficient OR key lookup case (WHERE (K1 = 1 AND K2 = 2) OR (K1 = 3 AND K2 = 4)). It seems to me that this wouldn't be wasted work (if you wanted to tackle it), but maybe @vpapavas doesn't want you two to step on each other's toes too much in this area while it's being refactored.
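
For comparison with the cross-product approach, here is a minimal Java sketch of what key extraction for the OR case might look like. It assumes the WHERE clause has already been split into disjuncts, each modeled as a map from column name to value; `DisjunctionKeys` and `extract` are hypothetical names, not existing ksqlDB classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of ksqlDB. */
final class DisjunctionKeys {

  private DisjunctionKeys() { }

  /**
   * Turns each disjunct (a conjunction of `column = value` comparisons)
   * into one full key, in key-column order.
   */
  static List<List<Object>> extract(
      final List<String> keyColumns,
      final List<? extends Map<String, ?>> disjuncts) {
    final List<List<Object>> keys = new ArrayList<>();
    for (final Map<String, ?> conjunction : disjuncts) {
      final List<Object> key = new ArrayList<>();
      for (final String column : keyColumns) {
        // Every disjunct must bind every key column, or no key lookup is possible.
        if (!conjunction.containsKey(column)) {
          throw new IllegalArgumentException("Disjunct does not bind key column: " + column);
        }
        key.add(conjunction.get(column));
      }
      keys.add(key);
    }
    return keys;
  }
}
```

With this shape, WHERE (K1 = 1 AND K2 = 2) OR (K1 = 3 AND K2 = 4) yields exactly two key lookups, whereas the IN form over the same values would look up four.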

@agavra
Contributor Author

agavra commented Dec 17, 2020

Thanks for the input, everyone! For now, I'll keep the functionality scoped to this PR, and we can address the IN use case as a follow-up sometime in the future.

@agavra agavra force-pushed the multi_col_pull branch 2 times, most recently from 424065c to e6025c7 Compare December 17, 2020 22:56
@agavra agavra merged commit 63dcfda into confluentinc:master Dec 18, 2020
@agavra agavra deleted the multi_col_pull branch December 18, 2020 00:29

5 participants