Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GROUP-BY prioritizes input columns in case of ambiguity #9228

Merged
merged 3 commits into from
Feb 16, 2024

Conversation

jonahgao
Copy link
Member

Which issue does this PR close?

Closes #9162.

Rationale for this change

When a column referenced by group-by exists both in the select list and the input, the one from the input should be given priority. In issue 9162, there are two references with the same name: one is an unqualified "t.a," and the other is a qualified t.a.

This is the practice of many databases, including PostgreSQL, Oracle, MySQL, Duckdb, etc.
In the PostgreSQL documentation, there is an explanation about it.

An expression used inside a grouping_element can be an input column name, or the name or ordinal number of an output column (SELECT list item), or an arbitrary expression formed from input-column values. In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name.

What changes are included in this PR?

Prioritize searching the schema of the base plan when generating GROUP BY expressions.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Feb 14, 2024

# The column name referenced by HAVING is ambiguous
query I
SELECT 0 AS "t.a" FROM t HAVING MAX(t.a) = 0;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generation of the HAVING expression uses the same schema as GROUP BY.
The changes in this PR will also keep the result of this statement consistent with PostgreSQL.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jonahgao -- this change makes sense to me

I believe we have seen issues related to this in InfluxDB as well.

I verified that this PR makes DataFusion consistent with postgres

postgres=# CREATE TABLE t(a BIGINT);
CREATE TABLE
postgres=# INSERT INTO t  VALUES (1), (2), (3);
INSERT 0 3
postgres=# SELECT 0 as "t.a" FROM t GROUP BY t.a;
 t.a
-----
   0
   0
   0
(3 rows)

postgres=# SELECT 0 AS "t.a" FROM t HAVING MAX(t.a) = 0;
 t.a
-----
(0 rows)

postgres=#

I also verified on main that DataFusion takes the values from the select list

andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ datafusion-cli
DataFusion CLI v35.0.0
❯ CREATE TABLE t(a BIGINT) AS VALUES(1), (2), (3);
0 rows in set. Query took 0.011 seconds.

❯ SELECT 0 as "t.a" FROM t GROUP BY t.a;
+-----+
| t.a |
+-----+
| 0   |
+-----+
1 row in set. Query took 0.007 seconds.

❯ SELECT 0 AS "t.a" FROM t HAVING MAX(t.a) = 0;
+-----+
| t.a |
+-----+
| 0   |
+-----+
1 row in set. Query took 0.004 seconds.

datafusion/sqllogictest/test_files/aggregate.slt Outdated Show resolved Hide resolved
datafusion/sqllogictest/test_files/aggregate.slt Outdated Show resolved Hide resolved
@jonahgao
Copy link
Member Author

Thank you @alamb for reviewing and for the suggestions.

@alamb alamb merged commit 40353fe into apache:main Feb 16, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 16, 2024

Thanks again @jonahgao

@jonahgao jonahgao deleted the group-by branch February 17, 2024 04:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Probably incorrect result when the column referenced in GROUP BY is ambiguous
2 participants