
[SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value #29125

Closed

Conversation

cloud-fan (Contributor)

partially backport #29026
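
For context, here is a hedged sketch of the contract the title describes; it is not the patch itself (that lives in #29026). The idea: when the Decimal being written no longer fits the field's declared precision, the field should be set to null rather than left holding bytes that getDecimal cannot read back. The helper name below is hypothetical, and it assumes a row whose decimal slot was already laid out by a writer, as aggregation buffers are.

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.Decimal

// Hypothetical helper illustrating the intended contract; not the code changed by this PR.
def setDecimalOrNull(row: UnsafeRow, ordinal: Int, value: Decimal, precision: Int): Unit = {
  if (value == null || value.precision > precision) {
    row.setNullAt(ordinal)                      // missing or overflowed: mark the field null
  } else {
    row.setDecimal(ordinal, value, precision)   // value fits the declared precision
  }
}
```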

@cloud-fan changed the title [SPARK-32018][SQL] UnsafeRow.setDecimal should set null with overflowed value → [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value (Jul 15, 2020)
@cloud-fan (Contributor, Author)

cc @dongjoon-hyun @viirya

@dongjoon-hyun (Member)

Thank you for pinging me, @cloud-fan .

@dongjoon-hyun (Member)

Retest this please.

1 similar comment
@dongjoon-hyun (Member)

Retest this please.

@viirya (Member)

viirya commented Jul 15, 2020

Jenkins doesn't seem to be working on this one, but GitHub Actions passed.

Oh, this is for 3.0, and GitHub Actions runs only for master.

@viirya (Member) left a comment

No need to fix Sum.scala?

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125906 has finished for PR 29125 at commit 4518513.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125916 has finished for PR 29125 at commit 4518513.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

> No need to fix Sum.scala?

That sum fix is in master only. I don't know if we can backport it as it breaks the streaming state store.

@cloud-fan (Contributor, Author)

retest this please

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125962 has finished for PR 29125 at commit 7268c05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125965 has finished for PR 29125 at commit 7268c05.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @cloud-fan .
Merged to branch-3.0. (Jenkins passed here #29125 (comment))

dongjoon-hyun pushed a commit that referenced this pull request Jul 16, 2020
…rflowed value

partially backport #29026

Closes #29125 from cloud-fan/backport.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun (Member)

Could you make a backport PR against branch-2.4, since SPARK-32018 is reported on 2.x too? This partial patch looks safe to have.

cloud-fan added a commit to cloud-fan/spark that referenced this pull request Jul 17, 2020
…rflowed value

partially backport apache#29026

Closes apache#29125 from cloud-fan/backport.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jul 17, 2020
…rflowed value

backport #29125

Closes #29141 from cloud-fan/backport.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@skambha (Contributor)

skambha commented Jul 31, 2020

@cloud-fan, I noticed this backport only now. This change is farther-reaching in its impact than it appears: callers of UnsafeRow.getDecimal that would previously have thrown an exception will now get null back.

For example, a caller like the sum aggregate will need changes to account for this. Cases where sum used to throw an error on overflow will now return incorrect results. The sum-overflow tests added to DataFrameSuite in master can be used to reproduce this.

Since this PR is closed, I will add a comment to the JIRA as well.

@cloud-fan (Contributor, Author)

@skambha the sum shouldn't fail without ANSI mode; this PR fixes it.

It's indeed a bug that we can write an overflowed decimal to UnsafeRow but can't read it back. The sum is also buggy, but we can't backport that fix due to streaming-compatibility reasons.
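
For reference, a minimal sketch (not the Spark internals) of the write/read asymmetry described here, using a value that needs precision 39 in a decimal(38, 18) slot; the precision check it trips is the same one UnsafeRow.getDecimal hits in the stack traces later in this thread.

```scala
import org.apache.spark.sql.types.Decimal

// 1.2 * 10^20: what a decimal(38, 18) sum of twelve copies of 10^19 would produce.
val overflowed = Decimal(BigDecimal("12" + "0" * 19))  // fine as a standalone Decimal value

// Rebuilding it as decimal(38, 18), as a reader of that column type must:
Decimal(overflowed.toBigDecimal, 38, 18)
// java.lang.ArithmeticException: Decimal precision 39 exceeds max precision 38
```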

@skambha (Contributor)

skambha commented Aug 3, 2020

@cloud-fan,
In some overflow scenarios:

  • With just this backport change, incorrect results are now returned to the user.
  • Before this change, the user would see an error.

The test cases in DataFrameSuite show these scenarios. Here is an example taken from there that I tried on Spark 3.0.1, with and without this change; you can see the incorrect-result behavior.

This backport by itself causes more scenarios to return incorrect results to the user.

  1. With this backport change, incorrect results:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
Type in expressions to have them evaluated.
Type :help for more information.

scala>  val decStr = "1" + "0" * 19
decStr: String = 10000000000000000000

scala>            val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint] 
    
scala>             val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
     |               lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]
   

scala> d5.show(false)
+---------------------------------------+
|sumd                                   |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+

  2. With this change, incorrect results with ANSI mode enabled as well:
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.ansi.enabled","true")

scala> val decStr = "1" + "0" * 19
decStr: String = 10000000000000000000

scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala>  val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
     |      |               lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]

scala> d5.show(false)
+---------------------------------------+
|sumd                                   |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+

Without this change, the same test throws an error in both cases, ANSI enabled and not.

scala> val decStr = "1" + "0" * 19
decStr: String = 10000000000000000000

scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
     |          lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]

scala> d5.show(false)
20/08/03 11:15:05 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)/ 2]
java.lang.ArithmeticException: Decimal precision 39 exceeds max precision 38
        at org.apache.spark.sql.types.Decimal.set(Decimal.scala:122)
        at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
        at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
        at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:393)

Without this change, the ANSI-enabled scenario also throws an error.

scala> spark.conf.set("spark.sql.ansi.enabled","true")

scala> val decStr = "1" + "0" * 19
decStr: String = 10000000000000000000

scala>  val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]

val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
     |        lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]

scala> d5.show(false)
20/08/03 11:18:08 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.ArithmeticException: Decimal precision 39 exceeds max precision 38
        at org.apache.spark.sql.types.Decimal.set(Decimal.scala:122)
        at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
        at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
        at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:393)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregate_sum_0$(generated.java:41)


@cloud-fan (Contributor, Author)

It's "one bug hides another bug". I don't think the right choice is to leave the bug there. UnsafeRow is a very fundamental component and I think it's better to fix all the bugs we know. Aggregate is not the only place that uses UnsafeRow. It can even be used by external data sources.

If we think the decimal sum overflow is serious enough, we should consider backporting the actual fix, and evaluate the streaming backward compatibility impact.

@skambha (Contributor)

skambha commented Aug 4, 2020

  • The sum operation is very common and heavily used by users.
  • Returning incorrect results silently is serious, as there is no way for users to know that their query returned incorrect results. Earlier the user would get an error and could possibly increase the precision and rerun the query; now they will not even know the results are incorrect unless they verify manually (which may not even be possible for large data). With this backport we are exposing more cases that will return incorrect results.

The Spark website states: “Note that, data correctness/data loss bugs are very serious. Make sure the corresponding bug report JIRA ticket is labeled as correctness or data-loss. If the bug report doesn’t get enough attention, please send an email to [email protected], to draw more attentions.”

Incorrect results/data correctness bugs are very serious.

As already discussed, yes, UnsafeRow has far-reaching impact and this change has unsafe side effects. In my opinion we should not backport just this change to the v3.0.x and v2.4.x lines, especially in a point release, and expose wrong results to users for a common operation like sum.

So my vote would be to not have this UnsafeRow-only change in v3.0.x and v2.x.x.

@cloud-fan Regarding your question on backporting the sum change, I think the issue was the streaming backward-compatibility impact, which blocked that change from going in. I am not that familiar with the streaming backward-compatibility implications.

@cloud-fan (Contributor, Author)

@skambha you will still hit the sum bug when you disable whole-stage codegen (or fall back from it because the generated code exceeds 64KB), right?
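
For reference, one quick way to check that claim (the config key below is the standard Spark SQL whole-stage codegen switch, not something quoted from this thread):

```scala
// Turn off whole-stage codegen and re-run the d5 repro from the earlier comments.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
d5.show(false)
```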

We are not introducing a new correctness bug. It's an existing bug and the backport makes it more visible.

We've added a mechanism in the master branch to check the streaming state store backward compatibility. If we want to backport the actual fix, we need to backport this mechanism as well. I think that's too many things to backport.

How about this: we force ANSI behavior for the decimal sum, so that the behavior is the same without fixing the UnsafeRow bug? It's not an ideal fix, but it should be safer to backport. @skambha what do you think? Can you help do it?

@skambha (Contributor)

skambha commented Aug 5, 2020

> How about this: we force ANSI behavior for the decimal sum, so that the behavior is the same without fixing the UnsafeRow bug? It's not an ideal fix, but it should be safer to backport. @skambha what do you think? Can you help do it?

I'm not sure I understand correctly, so can you clarify? The reason I ask: currently, the v3.0 Sum already has an ANSI check in its evaluateExpression, and forcing that to true will not give us much. We will still run into the problems I mentioned a few comments earlier.

--
@cloud-fan, just to confirm that we are in agreement: the first step is to revert this backport. Can you confirm, please?
Yes, I can submit a PR to do this UnsafeRow revert for the v3.0.x and v2.x.x lines.

@cloud-fan (Contributor, Author)

I don't agree to revert the UnsafeRow bug fix. As I said, UnsafeRow is very fundamental and we can't tolerate any bugs in it.

I agree that the sum decimal bug becomes more visible with the UnsafeRow fix, and I see 2 options (reverting the UnsafeRow fix is not an option to me):

  1. Backport the actual fix. This brings backward-compatibility issues for the streaming state store.
  2. Fail if overflow happens, regardless of the ANSI flag (see the sketch after this list). This is not ideal, but at least it's better than 3.0.0/2.x, where we fail on overflow when whole-stage codegen is on and return a wrong answer otherwise.
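
A minimal sketch of what option 2 amounts to at the Decimal level; this is an illustration, not the actual patch, and it only uses Decimal's changePrecision check:

```scala
import org.apache.spark.sql.types.Decimal

// The overflowed sum from the repro above: twelve copies of 10^19 need 39 digits at scale 18.
val overflowed = Decimal(BigDecimal("12" + "0" * 19))

// changePrecision reports whether the value fits decimal(38, 18); here it returns false.
val fits = overflowed.changePrecision(38, 18)

// Today's non-ANSI convention maps overflow to null (the wrong-result risk for sum);
// option 2 would always raise instead, independent of spark.sql.ansi.enabled.
val result: Decimal =
  if (fits) overflowed
  else throw new ArithmeticException("sum result does not fit decimal(38, 18)")
```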

@skambha (Contributor)

skambha commented Aug 6, 2020

IIUC, the solutions you mention were also discussed earlier and were not accepted by you. If you do not want to revert this backport, then I hope you agree it is critical to fix this so users do not run into the incorrectness issue. Please feel free to go ahead with the option you prefer.

I have expressed the issues; I will summarize them below and also put them in the JIRA.

The important issue is that we should not return incorrect results. In general, it is not good practice to backport a change to a stable branch and cause more queries to return incorrect results.

Just to reiterate:

  1. This PR, which backported the UnsafeRow fix, causes queries to return incorrect results. This is for the v2.4.x and v3.0.x lines. This change by itself has unsafe side effects and results in incorrect results being returned.
  2. It does not matter whether whole-stage codegen is on or off, or ANSI is on or off; you will get more queries returning incorrect results:

scala> val decStr = "1" + "0" * 19
decStr: String = 10000000000000000000

scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala>  val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]

scala> d5.show(false)   <----  INCORRECT RESULTS RETURNED
+---------------------------------------+
|sumd                                   |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+

  3. Incorrect results are very serious, and it is not good for Spark users to run into them for a common operation like sum.

@cloud-fan (Contributor, Author)

OK let me clarify a few things:

  1. I agree with you that making Spark more likely to return incorrect results is not acceptable.
  2. I hope you understand that having a bug in the very fundamental UnsafeRow is not acceptable either.

I'll ask someone to implement the ANSI behavior for decimal sum in 3.0 and 2.4, so that it fails instead of returning wrong results.

@dongjoon-hyun (Member)

dongjoon-hyun commented Aug 8, 2020

cc @ScrapCodes, since he is the release manager for Apache Spark 2.4.7.

cloud-fan pushed a commit that referenced this pull request Aug 13, 2020
…erflow of sum aggregation

### What changes were proposed in this pull request?

This is a followup of #29125
In branch 3.0:
1. For hash aggregation, before #29125 there was a runtime exception on decimal overflow of sum aggregation; after #29125, there can be a wrong result.
2. For sort aggregation, with or without #29125, there can be a wrong result on decimal overflow.

In the master branch (the future 3.1 release), the problem doesn't exist, since #27627 added a flag that marks whether overflow happened in the aggregation buffer. However, the aggregation buffer is written into streaming checkpoints, so we can't change the aggregation buffer to resolve the issue.

As there is no easy way in branch-3.0 to return null or throw an exception on overflow according to `spark.sql.ansi.enabled`, we have to make a choice here: always throw an exception on decimal value overflow of sum aggregation.

### Why are the changes needed?

Avoid returning wrong result in decimal value sum aggregation.

### Does this PR introduce _any_ user-facing change?

Yes, there is now always an exception on decimal value overflow of sum aggregation, instead of a possible wrong result.

### How was this patch tested?

Unit test case

Closes #29404 from gengliangwang/fixSum.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Aug 17, 2020
### What changes were proposed in this pull request?

Revert the SPARK-32018-related changes in branch-3.0: #29125 and #29404

### Why are the changes needed?

#29404 was made to fix the correctness regression introduced by #29125. However, the resulting decimal overflow behavior is strange in non-ANSI mode:
1. from 3.0.0 to 3.0.1: decimal overflow would throw exceptions instead of returning null
2. from 3.0.1 to 3.1.0: decimal overflow would return null instead of throwing exceptions

So this PR proposes to revert both #29404 and #29125, so that Spark returns null on decimal overflow in both Spark 3.0.0 and Spark 3.0.1.

### Does this PR introduce _any_ user-facing change?

Yes, Spark will return null on decimal overflow in Spark 3.0.1.

### How was this patch tested?

Unit tests

Closes #29450 from gengliangwang/revertDecimalOverflow.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>