Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32816][SQL][3.0] Fix analyzer bug when aggregating multiple distinct DECIMAL columns #30053

Closed

Conversation

linhongliu-db
Copy link
Contributor

What changes were proposed in this pull request?

This PR fixes a conflict between RewriteDistinctAggregates and DecimalAggregates.
In some cases, DecimalAggregates will wrap the decimal column to UnscaledValue using
different rules for different aggregates.

This means, same distinct column with different aggregates will change to different distinct columns
after DecimalAggregates. For example:
avg(distinct decimal_col), sum(distinct decimal_col) may change to
avg(distinct UnscaledValue(decimal_col)), sum(distinct decimal_col)

We assume after RewriteDistinctAggregates, there will be at most one distinct column in aggregates,
but DecimalAggregates breaks this assumption. To fix this, we have to switch the order of these two
rules.

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

no

How was this patch tested?

added test cases

Authored-by: Linhong Liu [email protected]
Signed-off-by: Wenchen Fan [email protected]
(cherry picked from commit 40ef5c9)

@SparkQA
Copy link

SparkQA commented Oct 15, 2020

Test build #129826 has finished for PR 30053 at commit b043f94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34429/

@SparkQA
Copy link

SparkQA commented Oct 15, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34429/

@linhongliu-db
Copy link
Contributor Author

This PR depends on #30052 to pass tests

…t DECIMAL columns

This PR fixes a conflict between `RewriteDistinctAggregates` and `DecimalAggregates`.
In some cases, `DecimalAggregates` will wrap the decimal column to `UnscaledValue` using
different rules for different aggregates.

This means, same distinct column with different aggregates will change to different distinct columns
after `DecimalAggregates`. For example:
`avg(distinct decimal_col), sum(distinct decimal_col)` may change to
`avg(distinct UnscaledValue(decimal_col)), sum(distinct decimal_col)`

We assume after `RewriteDistinctAggregates`, there will be at most one distinct column in aggregates,
but `DecimalAggregates` breaks this assumption. To fix this, we have to switch the order of these two
rules.

bug fix

no

added test cases

Closes apache#29673 from linhongliu-db/SPARK-32816.

Authored-by: Linhong Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 40ef5c9)
@SparkQA
Copy link

SparkQA commented Oct 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34481/

@SparkQA
Copy link

SparkQA commented Oct 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34481/

@SparkQA
Copy link

SparkQA commented Oct 16, 2020

Test build #129876 has finished for PR 30053 at commit 2634588.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@linhongliu-db
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Oct 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34486/

@SparkQA
Copy link

SparkQA commented Oct 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34486/

@SparkQA
Copy link

SparkQA commented Oct 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34487/

@SparkQA
Copy link

SparkQA commented Oct 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34487/

@SparkQA
Copy link

SparkQA commented Oct 16, 2020

Test build #129881 has finished for PR 30053 at commit 2634588.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 16, 2020

Test build #129882 has finished for PR 30053 at commit 96a0706.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@linhongliu-db
Copy link
Contributor Author

@cloud-fan This PR depends on another 2 fixes in the master branch. Should we cherry-pick them?
#29026
#27627

@cloud-fan
Copy link
Contributor

ah, that two are hard to backport as they change streaming state store. I guess we can't backport this fix either.

@linhongliu-db
Copy link
Contributor Author

got it. Let's close it for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants