Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support subscript and nested functions in grouping queries #5998

Merged
merged 7 commits into from
Aug 21, 2020

Conversation

vpapavas
Copy link
Member

Description

Fixes #5906 and #5967

Testing done

Added test cases in group-by.json

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks mostly good to me - thanks @vpapavas! A few inline comments and more test requests 😂

@@ -359,5 +401,14 @@ public Void visitQualifiedColumnReference(
) {
throw new UnsupportedOperationException("Should of been converted to unqualified");
}

@Override
public Void visitSubscriptExpression(final SubscriptExpression node, final Void context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any other expressions that we should be looking that can match exactly other than function calls and subscript expressions? i.e. what about arithmetic expressions (SELECT AS_VALUE(a + b) AS foo GROUP BY a + b)

I think it makes sense to implement this in visitExpression instead of in each individual expression type. We would only "override" this behavior when we need to (e.g. inside visitFuncitonCall)

Copy link
Member Author

@vpapavas vpapavas Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to check for arithmetic expressions.

Yes, this is a good point in the sense that we need to figure out which expressions need special handling. However, there is no visitExpression in TraversalExpressionVisitor unless I misunderstand what you meant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agavra The approach of putting the "common" code in process doesn't work. Reason being is that if the parent expression exists in the group by e.g. (a+b), it will set the flag foundExpressionInGroupBy to true and will continue visiting the children, a and b. If a child doesn't exist in the group by, it will set the flag to false. All the rest of the children then (b in this example), will fail and will wrongly get added to the nonAggParams.
TLDR: the flag must be set to false only after the entire tree of the expression has been visited. And that is only possible to do in the specific method for each expression

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at my newest comment, I think we have that problem anyway. I still think we should be able to do this in the common process if you take the suggestion I have there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agavra
Copy link
Contributor

agavra commented Aug 13, 2020

Also don't forget to check in the historical plans:

[2020-08-13T01:41:47.476Z] [ERROR]   PlannedTestsUpToDateTest.shouldHaveLatestPlans:87 Current query plan differs from latest for: group-by - subscript in group-by and select. 
18:41:47  location: file:///home/jenkins/workspace/entinc_Contributors_ksql_PR-5998/ksqldb-functional-tests/src/test/resources/query-validation-tests/group-by.json:366
18:41:47  Please re-generate QTT plans by running temporarily commenting out the @Ignore andthen running PlannedTestGeneratorTest.manuallyGeneratePlans().
18:41:47  See `ksqldb-functional-tests/README.md` for more info.
18:41:47  Expected: no saved plan
18:41:47       but: was <io.confluent.ksql.test.planned.TestCasePlan@3b7d8d2e>

@big-andy-coates
Copy link
Contributor

Look at PlannedTestGeneratorTest for generating the plans.

@big-andy-coates
Copy link
Contributor

Hey @vpapavas,

There's another potential issue around this you may have fixed, or which is worth looking at at the same time.

The thread is in community slack: https://confluentcommunity.slack.com/archives/C6UJNMY67/p1594325261431500

However, the short version is that this doesn't currently work:

CREATE TABLE my_agg_z_t AS
    SELECT   
    id as id_key,
    number as number_key,
    CAST(id as STRING)+'|+|'+CAST(number as STRING) AS crk,
    AS_VALUE(CAST(id as STRING)) as id,
    AS_VALUE(CAST(number as STRING)) as number,
    AVG(value) AS someavg FROM blah_s
    WINDOW TUMBLING (size 1 hour)
    GROUP BY CAST(id as STRING)+'|+|'+CAST(number as STRING)
    EMIT CHANGES;

Resulting in the error:

Non-aggregate SELECT expression(s) not part of GROUP BY: ID, NUMBER, AS_VALUE(CAST(ID AS STRING)), AS_VALUE(CAST(NUMBER AS STRING))
Either add the column to the GROUP BY or remove it from the SELECT.

I think this PR fixes some of this, but maybe not all. It would be good to add a test case that copies the key columns, i.e. the grouping expressions, into the value using AS_VALUE, where the grouping expressions are not simple columns names, as is being done above.

@vpapavas
Copy link
Member Author

vpapavas commented Aug 18, 2020

Hey @big-andy-coates ,

the query you posted shouldn't work, i.e., it is correct that it fails. One can only use grouping columns in non-aggregate functions. Namely, the grouping column in this example is CAST(id as STRING)+'|+|'+CAST(number as STRING). So, only this entire expression can appear in the AS_VALUE function like so AS_VALUE(CAST(id as STRING)+'|+|'+CAST(number as STRING)).

For the query to work as is one workaround is for the grouping columns to contain additionally id and number like so:

CREATE TABLE my_agg_z_t AS
    SELECT   
    id as id_key,
    number as number_key,
    CAST(id as STRING)+'|+|'+CAST(number as STRING) AS crk,
    AS_VALUE(CAST(id as STRING)) as id,
    AS_VALUE(CAST(number as STRING)) as number,
    AVG(value) AS someavg FROM blah_s
    WINDOW TUMBLING (size 1 hour)
    GROUP BY id,  number, CAST(id as STRING)+'|+|'+CAST(number as STRING)
    EMIT CHANGES;

and this query with my changes works.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost lgtm, for what it's worth I think your code is much easier to follow than what it used to be :D

@@ -359,5 +401,14 @@ public Void visitQualifiedColumnReference(
) {
throw new UnsupportedOperationException("Should of been converted to unqualified");
}

@Override
public Void visitSubscriptExpression(final SubscriptExpression node, final Void context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at my newest comment, I think we have that problem anyway. I still think we should be able to do this in the common process if you take the suggestion I have there.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Minor comments leftover

* then the UnqualifiedColumnReference is added to the schema.
*
* For validation, the visitor checks that:
* 1) expressions in non-aggregate functions are part of the grouping clause,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* 1) expressions in non-aggregate functions are part of the grouping clause,
* 1) expressions not in aggregate functions are part of the grouping clause,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually correct what I wrote :P It will throw if expressions used in aggregate functions are not part of the group by

Copy link
Contributor

@agavra agavra Aug 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will throw if expressions used in aggregate functions are not part of the group by

now I'm more confused 😂 expressions in aggregate functions must not be part of the group by - SELECT id, SUM(foo) FROM s GROUP BY id works (foo is an aggregate function parameter not in group by)

What you had originally written is true, but does not cover everything. It says expressions in UDFs (not UDAFs) must be part of the grouping clause (true). But it is also true that even if they're not in a UDF, as long as they're not in an aggregate function, they must be in the grouping clause:

--- valid
SELECT non_agg_fun(foo), COUNT(*) from s GROUP BY foo; -- (expression in non-aggregate function)
SELECT foo, COUNT(*) from s GROUP BY foo;  -- (expression not in aggregate function)

-- invalid
SELECT foo, COUN(T*) from s GROUP BY bar; -- (expression not in aggregate function, and not in grouping)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I meant expressions in non-aggregate functions must be part of group-by. So much confusion :) So, the comments say that the validator will throw if any of the conditions does not hold

Copy link
Member Author

@vpapavas vpapavas Aug 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah wait, I think I now get what you are suggesting. You are suggesting to make the sentence more general. OK, will make the change

@vpapavas vpapavas merged commit 8d383db into confluentinc:master Aug 21, 2020
sarwarbhuiyan pushed a commit to sarwarbhuiyan/ksql that referenced this pull request Sep 4, 2020
…fluentinc#5998)

* fixed handling subscript and nested functions

* remove intermediate topics from test file

* addressed comments, handle struct and arithmetic, added plans

* fixed window bounds error

* adding historic plans

* addressed Almog's comments

* minor fixes
vpapavas added a commit to vpapavas/ksql that referenced this pull request Nov 23, 2020
…fluentinc#5998)

* fixed handling subscript and nested functions

* remove intermediate topics from test file

* addressed comments, handle struct and arithmetic, added plans

* fixed window bounds error

* adding historic plans

* addressed Almog's comments

* minor fixes
@vpapavas vpapavas deleted the groupby-bug branch January 26, 2022 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Subscript Expressions in GROUP BY not valid as non-aggregate SELECT function arguments
3 participants