[BUG] Casting FLOAT64 to DECIMAL(12,7) produces different rows from Apache Spark CPU #9682
Attached herewith is a zipped Parquet file with 102 rows. Taking the window functions out of the equation, one sees that running the following produces different results:

```
// On Spark CPU:
scala> spark.read.parquet("/tmp/decimals_avg.parquet").select(expr("avg(c)")).show
+------------+
|      avg(c)|
+------------+
|3527.6195313|
+------------+

// On the plugin:
+------------+
|      avg(c)|
+------------+
|3527.6195312|
+------------+
```

The behaviour seems to be consistent on Spark.
I have filed rapidsai/cudf#14507 to track the CUDF side of this. I was able to repro this on CUDF by writing the input out and casting it there directly.
A couple of other findings. I tried querying:

```sql
SELECT sum(c), count(c), sum(c)/count(c), avg(c),
       CAST(avg(c) AS DECIMAL(12,8)),
       CAST(sum(c)/count(c) AS DECIMAL(12,7))
FROM foobar
```

On CPU, those results tally up. On GPU, they do not.
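The result tables from that query were not preserved above, but the scale sensitivity can be sketched with plain `java.math.BigDecimal`. The value `3527.61953125` is taken from the repro later in this thread; using it here as the stand-in for the average is an assumption:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ScaleSensitivity {
    public static void main(String[] args) {
        // BigDecimal(double) exposes the exact binary value of the double,
        // which is slightly below the decimal literal 3527.61953125.
        BigDecimal exact = new BigDecimal(3527.61953125);

        // At scale 8 the dropped tail is 9999..., so HALF_UP restores the literal.
        System.out.println(exact.setScale(8, RoundingMode.HALF_UP)); // 3527.61953125

        // At scale 7 the dropped tail is 4999..., so HALF_UP rounds down.
        System.out.println(exact.setScale(7, RoundingMode.HALF_UP)); // 3527.6195312
    }
}
```

This is why a cast to `DECIMAL(12,8)` can agree between CPU and GPU while the cast to `DECIMAL(12,7)` diverges in the last digit.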
There were some red herrings in investigating this bug. First off, I have closed the CUDF bug (rapidsai/cudf#14507) I raised for this. CUDF is not at fault; it consistently truncates additional decimal digits. It looked like this might have to do with how `SELECT avg(c) FROM foobar;` is planned (`c` is a `DECIMAL(8,3)`); the execution plan indicates that the average is rewritten in terms of double arithmetic.
Here's the simplest repro for the problem:

```scala
Seq(3527.61953125).toDF("d").repartition(1)
  .selectExpr("d", "CAST(d AS DECIMAL(12,7)) AS as_decimal_12_7")
  .show
```

On CPU:

```
+-------------+---------------+
|            d|as_decimal_12_7|
+-------------+---------------+
|3527.61953125|   3527.6195313|
+-------------+---------------+
```

On GPU:

```
+-----------+---------------+
|          d|as_decimal_12_7|
+-----------+---------------+
|3527.619531|   3527.6195312|
+-----------+---------------+
```

(Ignore how `d` itself is displayed differently on CPU and GPU.) All mention of window functions, aggregation, etc. is incidental; the problem is in the cast of `FLOAT64` to `DECIMAL(12,7)` itself.
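The differing display of `d` has the same root cause as the cast itself: the literal `3527.61953125` is not exactly representable as a 64-bit double. A quick way to see the exact stored value in plain Java (a sketch, not plugin code):

```java
import java.math.BigDecimal;

public class ExactDouble {
    public static void main(String[] args) {
        // BigDecimal(double) exposes the exact binary value stored in the
        // double, which is slightly below the decimal literal.
        BigDecimal exact = new BigDecimal(3527.61953125);
        System.out.println(exact); // 3527.619531249999909...

        // Java's toString gives the shortest decimal that round-trips back
        // to the same double, which here is the original literal.
        System.out.println(Double.toString(3527.61953125)); // 3527.61953125
    }
}
```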
This is a performance optimization in Spark that is only supposed to happen when the value would not be impacted by potential floating-point issues, i.e. when the precision is less than 15. A 15-digit value requires 50 bits to store, and a double has 52 bits in its significand, so the result should be exactly correct without any possibility of error: https://en.wikipedia.org/wiki/Double-precision_floating-point_format

So if we are getting the wrong answer back, then the problem is somewhere in the computation that the average was replaced with.
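The safety bound above can be checked directly: a 15-decimal-digit unscaled value fits in a double's significand, while a 16-digit value can exceed 2^53, where consecutive longs are no longer representable (a sketch in plain Java; the specific values are illustrative):

```java
public class PrecisionBound {
    public static void main(String[] args) {
        // 15 decimal digits (< 2^53): the round trip through double is exact.
        long fifteen = 999_999_999_999_999L;
        System.out.println((long) (double) fifteen == fifteen); // true

        // 16 decimal digits: 9999999999999999 lies above 2^53, where doubles
        // are spaced 2 apart, so it rounds to the nearest even multiple, 1e16.
        long sixteen = 9_999_999_999_999_999L;
        System.out.println((long) (double) sixteen); // 10000000000000000
    }
}
```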
I am remembering more now. Converting a double to a decimal value has problems because Spark does it by going from a double to a string to a decimal. This is inherent in how Scala does it in its BigDecimal class, and it even happens behind the scenes through a bit of implicit-method magic. But going from a Double to a String we cannot match what Java does. It is not standard, which is why float-to-string casts sit behind a config in the plugin.
So Java does odd things when interpreting floating-point values compared to the rest of the world. It tries to fix the problem that some decimal values cannot be represented exactly as floating-point values:

https://docs.oracle.com/javase/8/docs/api/java/lang/Double.html#toString-double-
https://docs.oracle.com/javase/8/docs/api/java/lang/Double.html#valueOf-java.lang.String-

It is self-consistent, but it is not standard. The number we are trying to convert is one of those that cannot be accurately represented as a double: https://binaryconvert.com/result_double.html?decimal=051053050055046054049057053051049050053

So technically the Spark performance optimization is wrong in the general case, but how Java/Scala convert doubles to Strings, and in turn to decimal values, "fixes" it.

That leaves two options for fixing the problem ourselves: we either undo the optimization and just do the average on Decimal values, or we find a way to replicate what Java is doing. Neither is simple. In the case of a Window operation it is not that hard to undo the optimization, because it is self-contained in a single exec; we can pattern match to find the `UnscaledValue(e)` being manipulated. But for a hash aggregation or a reduction it gets much harder, especially if the optimization later went through other transformations related to distinct/etc., where it could be really hard to detect and undo. We might be able to just find the final Divide by a constant followed by a cast to a Decimal and try to rewrite that part, since we get the rest of it right; that might be the simplest way to make this work. Matching the Java code is really difficult because it is GPL-licensed, so we cannot copy it, or even read it and try to apply it. I think detecting that divide-plus-cast case is where we should start.
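The divergence between the two conversion paths can be reproduced in plain Java (a sketch; the value is the one from the repro, and `HALF_UP` at scale 7 stands in for the cast to `DECIMAL(12,7)`):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class TwoCastPaths {
    public static void main(String[] args) {
        double d = 3527.61953125;

        // CPU path: double -> String -> BigDecimal. Double.toString gives the
        // shortest round-trip decimal, "3527.61953125", whose 8th fractional
        // digit is 5, so HALF_UP at scale 7 rounds up.
        BigDecimal viaString =
            new BigDecimal(Double.toString(d)).setScale(7, RoundingMode.HALF_UP);
        System.out.println(viaString); // 3527.6195313

        // GPU-like path: round the exact binary value of the double, which is
        // slightly below the literal, so HALF_UP at scale 7 rounds down.
        BigDecimal viaBinary =
            new BigDecimal(d).setScale(7, RoundingMode.HALF_UP);
        System.out.println(viaBinary); // 3527.6195312
    }
}
```

This is the whole bug in miniature: both answers are defensible roundings of the same double; they just round different decimal renderings of it.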
But before we do any of that, I want to be sure that we know what the original input long was before the divide happened, and what the double was that we were dividing. I am assuming it was the value from the repro above.
Not exactly. The result of the average (of the unscaled decimals) was the double `3527.61953125`, which is the value being cast to `DECIMAL(12,7)`.
I can confirm where the difference arises. The plugin's cast currently does this:

```scala
// Approach to minimize difference between CPUCast and GPUCast:
// step 1. cast input to FLOAT64 (if necessary)
// step 2. cast FLOAT64 to container DECIMAL (which keeps one more digit for rounding)
// step 3. perform HALF_UP rounding on container DECIMAL
val checkedInput = withResource(input.castTo(DType.FLOAT64)) { double =>
  val roundedDouble = double.round(dt.scale, cudf.RoundMode.HALF_UP)
  withResource(roundedDouble) { rounded =>
    // ...
  }
}
```

The second step (i.e. after ensuring the input is `FLOAT64`) produces:

```
3527.61953125 -> 3527.6195312
```

The (final) CPU output for this row is `3527.6195313`.
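The container-DECIMAL idea described in the comments above can be mimicked with `BigDecimal` to show why it still lands on `3527.6195312`: the extra guard digit of the double's binary value is 4, not 5. This is a sketch under that assumption, not the plugin's actual code:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ContainerDecimal {
    public static void main(String[] args) {
        // Exact binary value of the FLOAT64 input: 3527.6195312499999...
        BigDecimal exact = new BigDecimal(3527.61953125);

        // Step 2: container DECIMAL keeping one extra digit (scale 8, truncated,
        // since CUDF consistently truncates additional decimal digits).
        BigDecimal container = exact.setScale(8, RoundingMode.DOWN);
        System.out.println(container); // 3527.61953124

        // Step 3: HALF_UP to the target scale 7; the guard digit is 4,
        // so the result rounds down, unlike the CPU's 3527.6195313.
        System.out.println(container.setScale(7, RoundingMode.HALF_UP)); // 3527.6195312
    }
}
```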
I've relinquished ownership on this bug. I'm not actively working on this one.
Since we have an almost-matching float-to-string kernel in JNI, does that mean we can also almost-match float-to-decimal easily, by following Spark's float => string => decimal path?
@thirtiseven Yes, that is one possibility, but again it is only an almost match. It is up to management to decide how close is good enough.
Update:
This bug was originally filed with the title: `[BUG] test_window_aggs_for_rows fails with DATAGEN_SEED=1698940723`.
It has since been established that the problem lies neither in window functions nor in aggregations. The problem is with casting FLOAT64 to DECIMAL, which produces rounding errors.
Repro:

```scala
Seq(3527.61953125).toDF("d").repartition(1)
  .selectExpr("d", "CAST(d AS DECIMAL(12,7)) AS as_decimal_12_7")
  .show
```

This produces different results on CPU and GPU.

On CPU:

```
+-------------+---------------+
|            d|as_decimal_12_7|
+-------------+---------------+
|3527.61953125|   3527.6195313|
+-------------+---------------+
```

On GPU:

```
+-----------+---------------+
|          d|as_decimal_12_7|
+-----------+---------------+
|3527.619531|   3527.6195312|
+-----------+---------------+
```
The old description continues below:
`test_window_aggs_for_rows` fails with `DATAGEN_SEED=1698940723`.

Repro: