Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35026][SQL] Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS #32201

Closed
wants to merge 13 commits into from

Conversation

AngersZhuuuu
Copy link
Contributor

What changes were proposed in this pull request?

PG and Oracle both support use CUBE/ROLLUP/GROUPING SETS in GROUPING SETS's grouping set as a sugar syntax.
image

In this PR, we support it in Spark SQL too

Why are the changes needed?

Keep consistent with PG and oracle

Does this PR introduce any user-facing change?

User can write grouping analytics like

SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));

How was this patch tested?

Added Test

@github-actions
Copy link

Test build #754479169 for PR 32201 at commit f18045a.

@github-actions github-actions bot added the SQL label Apr 16, 2021
@AngersZhuuuu
Copy link
Contributor Author

FYI @cloud-fan @maropu

@SparkQA
Copy link

SparkQA commented Apr 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42038/

@SparkQA
Copy link

SparkQA commented Apr 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42038/

@SparkQA
Copy link

SparkQA commented Apr 16, 2021

Test build #137463 has finished for PR 32201 at commit f18045a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

| expression
;

nestedGroupingSets
: GROUPING SETS '(' nestedGroupingSet (',' nestedGroupingSet)* ')'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have (2 or more)-nested grouping sets? GROUPING SETS(GROUPING SETS(GROUPING SETS(...? I think we need tests for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have (2 or more)-nested grouping sets? GROUPING SETS(GROUPING SETS(GROUPING SETS(...? I think we need tests for it.

Good catch..Just tested, PG support nested grouping sets.

@@ -80,3 +80,8 @@ SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, CUBE(a, b), GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, CUBE(a, b), ROLLUP(a, b), GROUPING SETS((a, b), (a), ());

-- Support use CUBE/ROLLUP/GROUPING SETS in GROUPING SETS
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is related to this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is related to this PR?

Hmmm what to show

SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));

is equal to

SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, this test is the same with one in L79?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, this test is the same with one in L79?

Remove this line

@cloud-fan
Copy link
Contributor

let's update the SQL doc as well

@AngersZhuuuu
Copy link
Contributor Author

let's update the SQL doc as well

Done

@github-actions github-actions bot added the DOCS label Apr 19, 2021
`CUBE|ROLLUP` is just a syntax sugar for `GROUPING SETS`, please refer to the sections above for
how to translate `CUBE|ROLLUP` to `GROUPING SETS`. `group_expression` can be treated as a single-group
`GROUPING SETS` under this context. For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate
`GROUPING SETS` under this context. `GROUPING SETS` also support nested grouping anlytics, Spark can support
use `CUBE|ROLLUP|GROUPING SETS` as `GROUPING SETS`'s `grouping_set`. For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GROUPING SETS also support nested grouping anlytics, Spark can support use CUBE|ROLLUP|GROUPING SETS as GROUPING SETS's grouping_set.

=>

GROUPING SETS can have nested GROUPING SETS, CUBE, ROLLUP clauses, e.g. GROUPING SETS(ROLLUP(X, Y), ...)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about current?

@@ -95,12 +95,13 @@ aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_ex
(product, warehouse, location), (warehouse), (product), (warehouse, product), ())`.
The N elements of a `CUBE` specification results in 2^N `GROUPING SETS`.

* **Mixed Grouping Analytics**
* **Mixed/Nested Grouping Analytics**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel Nested Grouping Analytics looks confusing because GROUPING SETS only accepts the nested cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel Nested Grouping Analytics looks confusing because GROUPING SETS only accepts the nested cases.

So let's make these two case separated?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay. on second thought, **Mixed/Nested Grouping Analytics** looks okay (I couldn't come up with a better title...)

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42143/

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42143/

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42150/

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42150/

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42153/

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42153/

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42152/

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42152/

nestedGroupingSet
: groupingAnalytics
| '(' (expression (',' expression)*)? ')'
| expression
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nestedGroupingSet
    : groupingAnalytics
    | groupingSet
    ;

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, nice suggestion

| GROUPING SETS '(' nestedGroupingSet (',' nestedGroupingSet)* ')'
;

nestedGroupingSet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about nestedGroupingSet -> groupingElement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@SparkQA
Copy link

SparkQA commented Apr 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42192/

@SparkQA
Copy link

SparkQA commented Apr 20, 2021

Test build #137664 has finished for PR 32201 at commit 73478ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

ping @cloud-fan

@@ -113,7 +116,18 @@ aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_ex
(warehouse, location),
(warehouse, size),
(warehouse))`.


`GROUP BY warehouse, GROUPING SETS((env), GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ()))`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give an explanation on the behavior instead of examples only? We have a good explaination for the mixed case:

For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s.

Can we do the same for nested case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about current?

`CUBE|ROLLUP` is just a syntax sugar for `GROUPING SETS`, please refer to the sections above for
how to translate `CUBE|ROLLUP` to `GROUPING SETS`. `group_expression` can be treated as a single-group
`GROUPING SETS` under this context. For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For example,
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For nested `GROUPING SETS` in the `GROUP BY` clause,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nested GROUPING SETS in the GROUPING SETS clause

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done



`GROUP BY GROUPING SETS(GROUPING SETS(warehouse, product), GROUPING SETS(location, size))` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product), (location, size))`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, so nested grouping set is just syntax sugar for specifying a grouping set?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

GROUPING SETS(GROUPING SETS((warehouse), (warehouse, product)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, so nested grouping set is just syntax sugar for specifying a grouping set?

It seems yes, the benefit is we can write CUBE/ROLLUP in GROUPING SETS.
since GROUP BY GROUPING SETS((A, B), (A), (), (A, C)) can write as GROUPING BY(ROLLUP(A, B), (A, C))

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42297/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42297/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42312/

`CUBE|ROLLUP` is just a syntax sugar for `GROUPING SETS`, please refer to the sections above for
how to translate `CUBE|ROLLUP` to `GROUPING SETS`. `group_expression` can be treated as a single-group
`GROUPING SETS` under this context. For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For example,
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For nested `GROUPING SETS` in the `GROUPING SETS` clause,
we generate a top-level `GROUING SETS` by expand selected grouping sets of nested group elements. For example,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

..., we simply take its grouping sets and strip it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems your previous example was wrong.

GROUPING SETS(GROUPING SETS(a, b), GROUPING SETS(c, d)) should be GROUPING SETS(a, b, c, d), instead of GROUPING SETS((a, b), (c, d))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems your previous example was wrong.

GROUPING SETS(GROUPING SETS(a, b), GROUPING SETS(c, d)) should be GROUPING SETS(a, b, c, d), instead of GROUPING SETS((a, b), (c, d))

Yea, previous example I miss (), current is correct.



`GROUP BY GROUPING SETS(GROUPING SETS((warehouse), (warehouse, product)))` is equivalent to
`GROUP BY GROUPING SETS((warehouse), (warehouse, product))`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove the extra space after GROUP BY

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -113,7 +117,10 @@ aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_ex
(warehouse, location),
(warehouse, size),
(warehouse))`.


`GROUP BY GROUPING SETS(GROUPING SETS((warehouse), (warehouse, product)))` is equivalent to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use 2 nested GROUPING SETS in the example

GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137769 has finished for PR 32201 at commit 866a373.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42323/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42323/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137783 has finished for PR 32201 at commit cdbe1c7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

retest this please

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in b22d54a Apr 22, 2021
@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137794 has finished for PR 32201 at commit 3b97a8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42335/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42335/

xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
…NG SETS

### What changes were proposed in this pull request?
PG and Oracle both support use CUBE/ROLLUP/GROUPING SETS in GROUPING SETS's grouping set as a sugar syntax.
![image](https://user-images.githubusercontent.com/46485123/114975588-139a1180-9eb7-11eb-8f53-498c1db934e0.png)

In this PR, we support it in Spark SQL too

### Why are the changes needed?
Keep consistent with PG and oracle

### Does this PR introduce _any_ user-facing change?
User can write grouping analytics like
```
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));
```

### How was this patch tested?
Added Test

Closes apache#32201 from AngersZhuuuu/SPARK-35026.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants