
casting double to string does not match Spark #4204

Closed
HaoYang670 opened this issue Nov 24, 2021 · 7 comments · Fixed by #9470
Labels
bug Something isn't working documentation Improvements or additions to documentation

Comments

@HaoYang670
Collaborator

Describe the bug
I tried to cast 5.0e-10 to string. On Spark 3.2, I got "5.0E-10"; on spark-rapids I got "5.0e-10"

Steps/Code to reproduce bug
Spark result:

scala> val schema = StructType(Array(StructField("a", DoubleType)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))

scala> val data0 = Seq(Row(5e-10))
data0: Seq[org.apache.spark.sql.Row] = List([5.0E-10])

scala> val df0 = spark.createDataFrame(spark.sparkContext.parallelize(data0), schema)
df0: org.apache.spark.sql.DataFrame = [a: double]

scala> df0.show
+-------+
|      a|
+-------+
|5.0E-10|
+-------+

spark-rapids result:

scala> val schema = StructType(Array(StructField("a", DoubleType)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))

scala> val data0 = Seq(Row(5e-10))
data0: Seq[org.apache.spark.sql.Row] = List([5.0E-10])

scala> val df0 = spark.createDataFrame(spark.sparkContext.parallelize(data0), schema)
df0: org.apache.spark.sql.DataFrame = [a: double]

scala> df0.sqlContext.setConf("spark.rapids.sql.castFloatToString.enabled", "true")

scala> df0.show
21/11/24 16:29:40 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(a#1 as string) AS a#4 will run on GPU
      *Expression <Cast> cast(a#1 as string) will run on GPU
    !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found
      @Expression <AttributeReference> a#1 could run on GPU

+-------+                                                                       
|      a|
+-------+
|5.0e-10|
+-------+

Expected behavior
I would expect spark-rapids to also produce "5.0E-10".

Environment details (please complete the following information)
Spark 3.2.0
rapids 22.02.0
cudf 22.02.0
using spark-shell on my desktop
setConf("spark.rapids.sql.castFloatToString.enabled", "true")

Additional context
This issue is related to #4028.

@HaoYang670 HaoYang670 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 24, 2021
@jlowe jlowe changed the title get different result from spark and spark-rapids when cast double to string casting double to string does not match Spark Nov 24, 2021
@jlowe
Member

jlowe commented Nov 24, 2021

Note that this behavior is expected and documented in the current release. See the spark.rapids.sql.castFloatToString.enabled documentation which states that the result does not always match Spark. That is why this behavior is not enabled by default and the user must explicitly enable it once they are sure it will not affect their application.

@HaoYang670
Collaborator Author

I agree about the precision differences between CPU and GPU. Apart from that, though, the spark-rapids result contains a lowercase "e" when the exponent is negative, whereas Spark always uses an uppercase "E".
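For reference, Spark's CPU cast goes through Java's Double.toString, which always writes the exponent marker as an uppercase 'E' regardless of the exponent's sign:

```scala
// Java/Scala Double.toString always uses an uppercase 'E' for the exponent,
// even when the exponent is negative.
val s = (5e-10).toString
println(s) // prints "5.0E-10"
```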

@revans2
Collaborator

revans2 commented Nov 29, 2021

Yup, this is a bug in the code where we clean things up: it looks like we only match "e+" and do not also match "e" by itself.

val replaceExponent = withResource(Scalar.fromString("e+")) { cudfExponent =>
  withResource(Scalar.fromString("E")) { sparkExponent =>
    cudfCast.stringReplace(cudfExponent, sparkExponent)
  }
}

We should be as consistent as we can.
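As a plain-Scala sketch of that cleanup (the real code operates on cudf string columns via stringReplace, so this is only illustrative), covering both the "e+" and the bare "e" cases would look like:

```scala
// Sketch only: cudf renders exponents as "e+NN" / "e-NN", while Java/Spark
// uses a bare uppercase "E" with no '+' sign. Replacing only "e+" misses
// negative exponents like "5.0e-10", so the bare "e" must be handled too.
// The "e+" replacement must run first so the '+' is dropped along with it.
def normalizeExponent(s: String): String =
  s.replace("e+", "E").replace("e", "E")

println(normalizeExponent("5.0e-10")) // prints "5.0E-10"
println(normalizeExponent("1.0e+10")) // prints "1.0E10"
```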

@Salonijain27 Salonijain27 added documentation Improvements or additions to documentation and removed ? - Needs Triage Need team to review and classify labels Nov 30, 2021
@HaoYang670
Collaborator Author

I am a little curious about the way we test this.
Here is a hypothetical example, just for fun: if we cast the float 10000.0 to string, and the CPU version produced "10000.0" while the GPU version produced "1E4", the test would pass. But are "10000.0" and "1E4" really equal?

def compareStringifiedFloats(expected: String, actual: String): Boolean = {
  // handle exact matches first
  if (expected == actual) {
    return true
  }
  // need to split into mantissa and exponent
  def parse(s: String): (Double, Int) = s match {
    case s if s == "Inf" => (Double.PositiveInfinity, 0)
    case s if s == "-Inf" => (Double.NegativeInfinity, 0)
    case s if s.contains('E') =>
      val parts = s.split('E')
      (parts.head.toDouble, parts(1).toInt)
    case _ =>
      (s.toDouble, 0)
  }
  val (expectedMantissa, expectedExponent) = parse(expected)
  val (actualMantissa, actualExponent) = parse(actual)
  if (expectedExponent == actualExponent) {
    // mantissas need to be within tolerance
    compare(expectedMantissa, actualMantissa, 0.00001)
  } else {
    // whole numbers need to be within tolerance
    compare(expected.toDouble, actual.toDouble, 0.00001)
  }
}

@jlowe
Member

jlowe commented Dec 1, 2021

It depends on your definition of "equal." The purpose of that test is to verify that if someone tried to turn the string back into a float, it would be "close enough" to the Spark CPU version. It's not intending to check if we produce the exact same string as Spark, as we already know we don't simply because of precision errors. That's one of many reasons why this feature is disabled by default.
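Under that definition, the hypothetical "10000.0" vs "1E4" pair would indeed pass: the parsed exponents differ (0 vs 4), so the whole parsed numbers are compared, and both parse to the same double. A minimal sketch, assuming compare is a relative-tolerance helper (its implementation is not shown in the snippet above):

```scala
// Assumed relative-tolerance helper; the real compare() is not shown above.
def compare(expected: Double, actual: Double, tol: Double): Boolean =
  expected == actual || math.abs(expected - actual) / math.abs(expected) <= tol

// Exponents differ (0 vs 4), so the whole parsed values are compared:
val equalEnough = compare("10000.0".toDouble, "1E4".toDouble, 0.00001)
println(equalEnough) // prints "true"
```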

@jlowe
Member

jlowe commented Dec 2, 2021

Updated the documentation to clarify that more than just precision can differ in the resulting string. This is unlikely to be fixed until we add a custom kernel for casting floating point to string that is compatible with Java/Spark, which would remove the need for the castFloatToString config entirely.

@jlowe jlowe removed their assignment Dec 2, 2021
This was referenced Sep 20, 2023
@razajafri
Collaborator

Here is another example that @andygrove found while testing ToPrettyString; it turns out this is a problem in 23.08 and possibly prior versions.

val df = Seq(9223372036854775807L, -9223372036854775808L).toDF("a").repartition(2)
val df2 = df.withColumn("b", expr("cast(a as float)")).withColumn("c", expr("cast(a as double)"))
spark.conf.set("spark.rapids.sql.enabled", true)
df2.show

+--------------------+---------------+---------------+
|                   a|              b|              c|
+--------------------+---------------+---------------+
| 9223372036854775807| 9.223372037E18| 9.223372037E18|
|-9223372036854775808|-9.223372037E18|-9.223372037E18|
+--------------------+---------------+---------------+

spark.conf.set("spark.rapids.sql.enabled", false)
df2.show

+--------------------+------------+--------------------+
|                   a|           b|                   c|
+--------------------+------------+--------------------+
| 9223372036854775807| 9.223372E18|9.223372036854776E18|
|-9223372036854775808|-9.223372E18|-9.22337203685477...|
+--------------------+------------+--------------------+
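For comparison, the CPU strings above are exactly what Java's Float.toString and Double.toString produce when Long.MaxValue is narrowed to each type:

```scala
// Java's toString for Long.MaxValue narrowed to float and double; these
// match the CPU values in columns b and c of the table above.
val b = Long.MaxValue.toFloat.toString
val c = Long.MaxValue.toDouble.toString
println(b) // prints "9.223372E18"
println(c) // prints "9.223372036854776E18"
```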
