
[SPARK-35207][SQL] Normalize hash function behavior with negative zero (floating point types) #32496

Closed

Conversation

planga82 (Contributor)

What changes were proposed in this pull request?

Generally, we would expect that x = y => hash(x) = hash(y). However, +0.0 and -0.0 hash to different values for floating point types.

scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as double))").show
+-------------------------+--------------------------+
|hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
+-------------------------+--------------------------+
|              -1670924195|                -853646085|
+-------------------------+--------------------------+
scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as double)").show
+--------------------------------------------+
|(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
+--------------------------------------------+
|                                        true|
+--------------------------------------------+

Here is an extract from IEEE 754:

The two zeros are distinguishable arithmetically only by either division-by-zero (producing appropriately signed infinities) or else by the CopySign function recommended by IEEE 754/854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases.

From this, I deduce that the hash function must produce the same result for 0 and -0.
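
For illustration, here is a minimal Scala sketch of the normalization idea (the helper name normalizeZero is illustrative only, not the PR's actual code):

  // -0.0 and 0.0 compare equal under IEEE 754, but their bit patterns differ,
  // so hashing the raw bits yields different results unless we normalize first.
  def normalizeZero(d: Double): Double = if (d == -0.0d) 0.0d else d

  val bits1 = java.lang.Double.doubleToLongBits(normalizeZero(0.0d))
  val bits2 = java.lang.Double.doubleToLongBits(normalizeZero(-0.0d))
  assert(bits1 == bits2) // both are 0L after normalization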

Why are the changes needed?

It is a correctness issue

Does this PR introduce any user-facing change?

This change only affects the hash function when applied to the -0.0 value for float and double types.
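
For illustration (expected behavior after this change, mirroring the example above):

 spark.sql("select hash(cast('0.0' as double)) == hash(cast('-0.0' as double))").show
 // expected to return true once -0.0 is normalized before hashing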

How was this patch tested?

Unit testing and manual testing

@github-actions github-actions bot added the SQL label May 11, 2021
Clean solution
@planga82 planga82 force-pushed the feature/spark35207_hashnegativezero branch from 28f85ae to f86fba6 on May 11, 2021 00:44
genHashInt(s"Float.floatToIntBits(0.0f)", result) +
"}else{" +
genHashInt(s"Float.floatToIntBits($input)", result) +
"}"
Member:

nit style:

  protected def genHashFloat(input: String, result: String): String = {
    s"""
       |if(Float.floatToIntBits($input) == Float.floatToIntBits(-0.0f)) {
       |  ${genHashInt(s"Float.floatToIntBits(0.0f)", result)}
       |} else {
       |  ${genHashInt(s"Float.floatToIntBits($input)", result)}
       |}
     """.stripMargin
  }

genHashLong(s"Double.doubleToLongBits(0.0d)", result) +
"}else{" +
genHashLong(s"Double.doubleToLongBits($input)", result) +
"}"
Member:

ditto
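
(For reference, the analogous style for the double case would look like the sketch below; this mirrors the suggestion above and is not a literal quote from the review.)

  protected def genHashDouble(input: String, result: String): String = {
    s"""
       |if(Double.doubleToLongBits($input) == Double.doubleToLongBits(-0.0d)) {
       |  ${genHashLong(s"Double.doubleToLongBits(0.0d)", result)}
       |} else {
       |  ${genHashLong(s"Double.doubleToLongBits($input)", result)}
       |}
     """.stripMargin
  }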

@maropu (Member) commented May 12, 2021

ok to test

@maropu (Member) commented May 12, 2021

Could you re-invoke the GA tests? It seems some weird errors happened.

@SparkQA commented May 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42968/

@SparkQA commented May 12, 2021

Test build #138447 has finished for PR 32496 at commit f86fba6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42995/

@SparkQA commented May 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42995/

@planga82 (Contributor, Author)

Hi @maropu, thanks for your comments. I looked at the weird problems with the GA tests and I think they were caused by the name of the branch: "feature/spark35207_hashnegativezero" triggers the problem. I ran the tests on the same commit but on another branch named "spark35207_hashnegativezero" and they are running without this problem (https://github.com/planga82/sparkFork/actions/runs/836970758).

@maropu (Member) commented May 12, 2021

Yeah, it seems @ueshin is working on it in #32524.

@maropu (Member) left a comment:

This is a bug, but it seems this is a long-standing behaviour for the hash functions. So, it's better to describe this behaviour change in the migration guide so that users can notice it easily? cc: @cloud-fan @dongjoon-hyun @viirya

genHashInt(s"Float.floatToIntBits($input)", result)
protected def genHashFloat(input: String, result: String): String = {
s"""
|if(Float.floatToIntBits($input) == Float.floatToIntBits(-0.0f)) {
Member:

Why do we need to use floatToIntBits here? $input == -0.0f instead?

@cloud-fan (Contributor) commented May 13, 2021:

+1, $input == 0.0f should be good enough.
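
(For context, not part of the original thread: under IEEE 754 comparison, -0.0f and 0.0f compare equal, so either literal works in the check, while their bit patterns differ. A quick Scala REPL check:)

scala> 0.0f == -0.0f
res0: Boolean = true

scala> java.lang.Float.floatToIntBits(0.0f) == java.lang.Float.floatToIntBits(-0.0f)
res1: Boolean = false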

genHashLong(s"Double.doubleToLongBits($input)", result)
protected def genHashDouble(input: String, result: String): String = {
s"""
|if(Double.doubleToLongBits($input) == Double.doubleToLongBits(-0.0d)) {
Member:

ditto

assert(XxHash64(Seq(exprs1), 42).eval() == XxHash64(Seq(exprs2), 42).eval())
assert(HiveHash(Seq(exprs1)).eval() == HiveHash(Seq(exprs2)).eval())
}
checkResult(Literal.create(0D, DoubleType), Literal.create(-0D, DoubleType))
Member:

Please use checkEvaluation instead.

Member:

Could you add float tests here, too?

Contributor Author:

Oh! Thanks, I meant to put float instead of long.

@@ -654,4 +654,30 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession
}
}
}

test("SPARK-35207: Compute hash consistent between -0.0 and 0.0 doubles with Codegen") {
Member:

I think we don't need to add tests here (it's okay just to add tests in HashExprSuite).

Contributor:

+1, if you use checkEvaluation, both codegen and interpreted are checked.
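
(A sketch of what such a test could look like, using checkEvaluation so that both the codegen and interpreted paths are exercised; the expression set below is illustrative, not necessarily the PR's final code:)

  test("SPARK-35207: Compute hash consistent between -0.0 and 0.0") {
    def checkResult(exprs1: Expression, exprs2: Expression): Unit = {
      checkEvaluation(Murmur3Hash(Seq(exprs1), 42), Murmur3Hash(Seq(exprs2), 42).eval())
      checkEvaluation(XxHash64(Seq(exprs1), 42), XxHash64(Seq(exprs2), 42).eval())
      checkEvaluation(HiveHash(Seq(exprs1)), HiveHash(Seq(exprs2)).eval())
    }
    checkResult(Literal.create(-0D, DoubleType), Literal.create(0D, DoubleType))
    checkResult(Literal.create(-0F, FloatType), Literal.create(0F, FloatType))
  }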

Contributor Author:

very useful function!

@SparkQA commented May 13, 2021

Test build #138474 has finished for PR 32496 at commit 641629d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@planga82 (Contributor, Author)

If you think that, even though it is a bug, we should document it in the migration guide, I can take care of that in this PR.

@dongjoon-hyun (Member)

Thank you for pinging me, @maropu.

protected def genHashFloat(input: String, result: String): String = {
s"""
|if($input == -0.0f) {
| ${genHashInt(s"Float.floatToIntBits(0.0f)", result)}
Member:

Although this carries the semantics, shall we simply use "0" instead of s"Float.floatToIntBits(0.0f)"? We may add a comment for the semantics instead.

jshell> Float.floatToIntBits(0.0f)
$1 ==> 0
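
(A sketch of what the simplified generator might look like after applying this suggestion; this is illustrative, not the merged code:)

  protected def genHashFloat(input: String, result: String): String = {
    s"""
       |if($input == -0.0f) {
       |  // 0.0f and -0.0f should hash to the same value; Float.floatToIntBits(0.0f) == 0
       |  ${genHashInt("0", result)}
       |} else {
       |  ${genHashInt(s"Float.floatToIntBits($input)", result)}
       |}
     """.stripMargin
  }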

Contributor Author:

You are right, I have tested it to be sure.
Murmur3HashFunction.hashInt(java.lang.Float.floatToIntBits(0.0f), 42) == Murmur3HashFunction.hashInt(0, 42)
It is simpler.

protected def genHashDouble(input: String, result: String): String = {
s"""
|if($input == -0.0d) {
| ${genHashLong(s"Double.doubleToLongBits(0.0d)", result)}
Member:

ditto. We had better use the simplest constant here instead of s"Double.doubleToLongBits(0.0d)".

Member:

In this case, 0L?
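
(For context, mirroring the jshell check above for the double case; not part of the original thread:)

scala> java.lang.Double.doubleToLongBits(0.0d)
res0: Long = 0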

Contributor Author:

The same as the previous point, thanks

@SparkQA commented May 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43046/

@SparkQA commented May 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43046/

@viirya (Member) left a comment:

Looks ok. We should also update the migration guide. Also, please be more specific in the PR title as this is for float/double type only?

@viirya (Member) left a comment:

How about other hash functions, e.g., md5?

@github-actions github-actions bot added the DOCS label May 14, 2021
@planga82 planga82 changed the title [SPARK-35207][SQL] Normalize hash function behavior with negative zero [SPARK-35207][SQL] Normalize hash function behavior with negative zero (floating point types) May 14, 2021
@planga82 (Contributor, Author)

> Looks ok. We should also update the migration guide. Also, please be more specific in the PR title as this is for float/double type only?

It applies only to floating point types; I have updated the title.

> How about other hash functions, e.g., md5?

The other functions require binary types; anyway, I have tested them and they do not reproduce the problem.

 spark.sql("select md5(bin(cast('0.0' as double))) == md5(bin(cast('-0.0' as double)))").show
 spark.sql("select sha2(bin(cast('0.0' as double)),224) == sha2(bin(cast('-0.0' as double)),224)").show
 spark.sql("select sha1(bin(cast('0.0' as double))) == sha1(bin(cast('-0.0' as double)))").show
 spark.sql("select crc32(bin(cast('0.0' as double))) == crc32(bin(cast('-0.0' as double)))").show

Thanks!

@SparkQA commented May 14, 2021

Test build #138525 has finished for PR 32496 at commit 1de9c3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43053/

@SparkQA commented May 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43053/

@srowen (Member) commented May 14, 2021

I recall we had some subtle problem where 0 and -0 should not be considered equal, but I don't think it's relevant here.

@cloud-fan (Contributor) commented May 14, 2021

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9ea55fe May 14, 2021
@SparkQA commented May 14, 2021

Test build #138535 has finished for PR 32496 at commit c123599.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
