
casting double to string does not match Spark #4204

Closed
HaoYang670 opened this issue Nov 24, 2021 · 7 comments · Fixed by #9470
Labels
bug Something isn't working documentation Improvements or additions to documentation

Comments

@HaoYang670
Collaborator

Describe the bug
I tried to cast 5.0e-10 to string. On Spark 3.2, I got "5.0E-10"; on spark-rapids I got "5.0e-10"

Steps/Code to reproduce bug
Spark result:

scala> val schema = StructType(Array(StructField("a", DoubleType)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))

scala> val data0 = Seq(Row(5e-10))
data0: Seq[org.apache.spark.sql.Row] = List([5.0E-10])

scala> val df0 = spark.createDataFrame(spark.sparkContext.parallelize(data0), schema)
df0: org.apache.spark.sql.DataFrame = [a: double]

scala> df0.show
+-------+
|      a|
+-------+
|5.0E-10|
+-------+

spark-rapids result:

scala> val schema = StructType(Array(StructField("a", DoubleType)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))

scala> val data0 = Seq(Row(5e-10))
data0: Seq[org.apache.spark.sql.Row] = List([5.0E-10])

scala> val df0 = spark.createDataFrame(spark.sparkContext.parallelize(data0), schema)
df0: org.apache.spark.sql.DataFrame = [a: double]

scala> df0.sqlContext.setConf("spark.rapids.sql.castFloatToString.enabled", "true")

scala> df0.show
21/11/24 16:29:40 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(a#1 as string) AS a#4 will run on GPU
      *Expression <Cast> cast(a#1 as string) will run on GPU
    !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found
      @Expression <AttributeReference> a#1 could run on GPU

+-------+                                                                       
|      a|
+-------+
|5.0e-10|
+-------+

Expected behavior
I would expect spark-rapids to also produce "5.0E-10".

Environment details (please complete the following information)
Spark 3.2.0
rapids 22.02.0
cudf 22.02.0
using spark-shell on my desktop
setConf("spark.rapids.sql.castFloatToString.enabled", "true")

Additional context
This issue is related to #4028.

@HaoYang670 HaoYang670 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 24, 2021
@jlowe jlowe changed the title get different result from spark and spark-rapids when cast double to string casting double to string does not match Spark Nov 24, 2021
@jlowe
Member

jlowe commented Nov 24, 2021

Note that this behavior is expected and documented in the current release. See the spark.rapids.sql.castFloatToString.enabled documentation which states that the result does not always match Spark. That is why this behavior is not enabled by default and the user must explicitly enable it once they are sure it will not affect their application.

@HaoYang670
Collaborator Author

I agree about the precision differences between CPU and GPU. Apart from that, though, the spark-rapids result contains a lowercase "e" when the exponent is negative, whereas Spark always uses an uppercase "E".
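For reference, Spark's CPU cast goes through Java's Double.toString, which always writes the exponent marker as an uppercase 'E' regardless of the exponent's sign:

```scala
// Java/Scala Double.toString always uses an uppercase 'E' for the exponent,
// even when the exponent is negative.
val s = (5e-10).toString
println(s) // prints "5.0E-10"
```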

@revans2
Collaborator

revans2 commented Nov 29, 2021

Yup, this is a bug in the code where we clean things up: it looks like we only match "e+" and do not also match "e" by itself.

val replaceExponent = withResource(Scalar.fromString("e+")) { cudfExponent =>
  withResource(Scalar.fromString("E")) { sparkExponent =>
    cudfCast.stringReplace(cudfExponent, sparkExponent)
  }
}

We should be as consistent as we can.
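As a plain-Scala sketch of that cleanup (the real code operates on cudf string columns via stringReplace, so this is only illustrative), covering both the "e+" and the bare "e" cases would look like:

```scala
// Sketch only: cudf renders exponents as "e+NN" / "e-NN", while Java/Spark
// uses a bare uppercase "E" with no '+' sign. Replacing only "e+" misses
// negative exponents like "5.0e-10", so the bare "e" must be handled too.
// The "e+" replacement must run first so the '+' is dropped along with it.
def normalizeExponent(s: String): String =
  s.replace("e+", "E").replace("e", "E")

println(normalizeExponent("5.0e-10")) // prints "5.0E-10"
println(normalizeExponent("1.0e+10")) // prints "1.0E10"
```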

@Salonijain27 Salonijain27 added documentation Improvements or additions to documentation and removed ? - Needs Triage Need team to review and classify labels Nov 30, 2021
@HaoYang670
Collaborator Author

I am a little curious about the way we test this.
Here is a hypothetical example, just for fun: if we cast the float 10000.0 to string, and the CPU version produced "10000.0" while the GPU version produced "1E4", the test would pass. But are "10000.0" and "1E4" really equal?

def compareStringifiedFloats(expected: String, actual: String): Boolean = {
  // handle exact matches first
  if (expected == actual) {
    return true
  }
  // need to split into mantissa and exponent
  def parse(s: String): (Double, Int) = s match {
    case s if s == "Inf" => (Double.PositiveInfinity, 0)
    case s if s == "-Inf" => (Double.NegativeInfinity, 0)
    case s if s.contains('E') =>
      val parts = s.split('E')
      (parts.head.toDouble, parts(1).toInt)
    case _ =>
      (s.toDouble, 0)
  }
  val (expectedMantissa, expectedExponent) = parse(expected)
  val (actualMantissa, actualExponent) = parse(actual)
  if (expectedExponent == actualExponent) {
    // mantissas need to be within tolerance
    compare(expectedMantissa, actualMantissa, 0.00001)
  } else {
    // whole numbers need to be within tolerance
    compare(expected.toDouble, actual.toDouble, 0.00001)
  }
}

@jlowe
Member

jlowe commented Dec 1, 2021

It depends on your definition of "equal." The purpose of that test is to verify that if someone tried to turn the string back into a float, it would be "close enough" to the Spark CPU version. It's not intending to check if we produce the exact same string as Spark, as we already know we don't simply because of precision errors. That's one of many reasons why this feature is disabled by default.
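Under that definition, the hypothetical "10000.0" vs "1E4" pair would indeed pass: the parsed exponents differ (0 vs 4), so the whole parsed numbers are compared, and both parse to the same double. A minimal sketch, assuming compare is a relative-tolerance helper (its implementation is not shown in the snippet above):

```scala
// Assumed relative-tolerance helper; the real compare() is not shown above.
def compare(expected: Double, actual: Double, tol: Double): Boolean =
  expected == actual || math.abs(expected - actual) / math.abs(expected) <= tol

// Exponents differ (0 vs 4), so the whole parsed values are compared:
val equalEnough = compare("10000.0".toDouble, "1E4".toDouble, 0.00001)
println(equalEnough) // prints "true"
```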

@jlowe
Member

jlowe commented Dec 2, 2021

Updated the documentation to clarify that more than just precision can differ in the resulting string. This is unlikely to be fixed until we add a custom kernel for casting floating point to string that is compatible with Java/Spark, which would remove the need for the castFloatToString config entirely.

@jlowe jlowe removed their assignment Dec 2, 2021
This was referenced Sep 20, 2023
@razajafri
Collaborator

Here is another example that @andygrove found while testing ToPrettyString; it turns out this is a problem in 23.08 and possibly prior versions.

val df = Seq(9223372036854775807L, -9223372036854775808L).toDF("a").repartition(2)
val df2 = df.withColumn("b", expr("cast(a as float)")).withColumn("c", expr("cast(a as double)"))
spark.conf.set("spark.rapids.sql.enabled", true)
df2.show

+--------------------+---------------+---------------+
|                   a|              b|              c|
+--------------------+---------------+---------------+
| 9223372036854775807| 9.223372037E18| 9.223372037E18|
|-9223372036854775808|-9.223372037E18|-9.223372037E18|
+--------------------+---------------+---------------+

spark.conf.set("spark.rapids.sql.enabled", false)
df2.show

+--------------------+------------+--------------------+
|                   a|           b|                   c|
+--------------------+------------+--------------------+
| 9223372036854775807| 9.223372E18|9.223372036854776E18|
|-9223372036854775808|-9.223372E18|-9.22337203685477...|
+--------------------+------------+--------------------+
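For comparison, the CPU strings above are exactly what Java's Float.toString and Double.toString produce when Long.MaxValue is narrowed to each type:

```scala
// Java's toString for Long.MaxValue narrowed to float and double; these
// match the CPU values in columns b and c of the table above.
val b = Long.MaxValue.toFloat.toString
val c = Long.MaxValue.toDouble.toString
println(b) // prints "9.223372E18"
println(c) // prints "9.223372036854776E18"
```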
