
[BUG] test_hash_agg_with_nan_keys failed with a DATAGEN_SEED=1702335559 #10026

Closed
revans2 opened this issue Dec 12, 2023 · 8 comments · Fixed by #10148
Labels
bug Something isn't working

Comments

@revans2
Collaborator

revans2 commented Dec 12, 2023

Describe the bug

FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_agg_with_nan_keys[[('a', RepeatSeq(Double)), ('b', Float), ('c', Long)]][DATAGEN_SEED=1702335559, INJECT_OOM, IGNORE_ORDER, APPROXIMATE_FLOAT] - AssertionError: GPU and CPU float values are different [17, 'sum_of_bees']

It was running against Spark 3.2.4.

I tried to reproduce this locally and I could not. The only explanation I have is that the GPU did the sum in a non-deterministic order and we somehow ended up with a number that was larger than approximate float could handle. The key (a) on line 17 was
-3.1261173254359074e-295 and the values were 1.7718319043726909e+25 vs 1.7756097975589866e+25.

Not sure what more to do except dig in a little deeper to see what might have caused the result by looking at the values used by the SUM.
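As background for why a different summation order can change the answer: IEEE-754 floating-point addition is not associative, so regrouping the very same operands can produce a different rounded result. A minimal Python sketch (my illustration, not taken from the test itself):

```python
# Floating-point addition is not associative: regrouping the same
# operands can change the rounded result.
a, b, c = 1e16, 1.0, 1.0

left = (a + b) + c   # each 1.0 is absorbed into 1e16 (ulp there is 2.0)
right = a + (b + c)  # 2.0 is large enough to survive the add

print(left, right, left == right)  # 1e+16 1.0000000000000002e+16 False
```

With a group-by hash aggregate, the per-group operand order depends on partitioning and kernel scheduling, so runs can legitimately round differently.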

@jbrennan333
Collaborator

When I ran this in a loop 100 times, I got it to fail just once:

[gw1] [100%] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_agg_with_nan_keys[[('a', RepeatSeq(Double)), ('b', Float), ('c', Long)]][DATAGEN_SEED=1702335559, INJECT_OOM, IGNORE_ORDER, APPROXIMATE_FLOAT]

cpu = 1.7718319043726909e+25, gpu = 1.7726637916678512e+25
path = [17, 'sum_of_bees']

-Row(a=-3.1261173254359074e-295, count_stars=21, count_parameterless=0, count_bees=20, sum_of_bees=1.7718319043726909e+25, max_seas=7098372324280358315, min_seas=-9170459960093196357, count_distinct_cees=20, average_seas=4.216639238479068e+17)
+Row(a=-3.1261173254359074e-295, count_stars=21, count_parameterless=0, count_bees=20, sum_of_bees=1.7726637916678512e+25, max_seas=7098372324280358315, min_seas=-9170459960093196357, count_distinct_cees=20, average_seas=4.216639238479067e+17)

Need to capture the contents of the b columns.

@jbrennan333
Collaborator

input.json

I captured the input data so that I could use it to experiment.

@jbrennan333
Collaborator

jbrennan333 commented Dec 20, 2023

I read in the data with:

import org.apache.spark.sql.types._
val schema = StructType(StructField("a", DoubleType, true) :: StructField("b", FloatType, true) :: StructField("c", LongType, true) :: Nil)
val cdf = spark.read.schema(schema).json("/opt/data/jimb/debug-input.json")

Then

cdf.createOrReplaceTempView("input")
val sumCpu = spark.sql("select a, sum(b) as sum_of_bees from input group by a")

I ran this on cpu and gpu and compared the problematic row (with a=-3.1261173254359074e-295).

(unsorted)
	GPU:
		|-3.1261173254359074E-295|1.776076418134077E25|
		|-3.1261173254359074E-295|1.7756164845547208E25 |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		
	CPU:
		|-3.1261173254359074E-295|1.7714274095956874E25|
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |

You can see that the GPU results have some variability, while the CPU is consistent.
I wondered if this is caused by differences in the order of the sums due to the hash aggregation.
So I tried sorting on columns a and b:

Sort by a, b columns before doing aggregate

	GPU:
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		
	CPU:
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		|-3.1261173254359074E-295|1.7756097975589866E25 |

This produces consistent results.

Sort by a column only

	GPU:
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		
	CPU:
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		
Sort by a column only, descending

	GPU:
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
	CPU:
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |

Consistent, but different. Let's try sorting by b only:

Sort by b column only

	GPU:
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
	CPU:
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |

Sort by b column only, descending

	GPU:
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
	CPU:
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |

Sorting by column b gives us consistent results either way.

@jbrennan333
Collaborator

So my theory is that the non-deterministic ordering from the hash aggregate is what causes the results to vary on the GPU, and every once in a while the difference is large enough to fall outside the approximate float comparison.

@jbrennan333
Collaborator

@revans2 any thoughts on this? Should we adjust the approximate_float threshold for this test, or do we need to investigate further?

@revans2
Collaborator Author

revans2 commented Dec 28, 2023

@jbrennan333 that is my theory too. What I wanted to do was to take the input data for this particular key and see if we can figure out what the order would need to be to make this problem happen. I don't know how many values there are, but hopefully the set is not so big that the number of possible orderings makes this impractical.

If we can prove that is what happened, then we really only have a few choices.

  1. Pretend the problem does not exist.
  2. Acknowledge the problem exists and try to help qualify it, especially if it is only ever going to be bad for very specific operators in very specific situations.
  3. Fix the problem.

Option 1 feels horrible in general and we should not do it. Option 3 is great, but I don't know how to do that without slowing down the GPU SUM operation massively, and even then it would only get the same answer as Spark if the inputs are in the same order, which Spark never guarantees. So we are left with option 2, which means we should try to understand how big a difference is possible in different situations, document them, and ideally update the tests so that we reduce the impact of making approximate_float less strict. Happily, approximate_float can take parameters to control the comparison. The args are fed directly into https://docs.pytest.org/en/4.6.x/reference.html#pytest-approx
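To put numbers on the tolerance involved (my illustration, using math.isclose as a stand-in for pytest.approx's relative comparison, which defaults to rel=1e-6): the failing CPU/GPU pair differs by roughly 5e-4 relative, so it fails the default tolerance but would pass a looser one:

```python
import math

# CPU and GPU sums from the reported failure at path [17, 'sum_of_bees']
cpu = 1.7718319043726909e+25
gpu = 1.7726637916678512e+25

rel_diff = abs(cpu - gpu) / abs(cpu)
print(rel_diff)  # roughly 4.7e-4

# math.isclose stands in for pytest.approx's relative check here.
print(math.isclose(cpu, gpu, rel_tol=1e-6))  # False: outside the default tolerance
print(math.isclose(cpu, gpu, rel_tol=1e-3))  # True: a looser tolerance would pass
```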

@jbrennan333
Collaborator

One thing I noticed is that this particular test, test_hash_agg_with_nan_keys, doesn't actually need to be exposed to this issue at all.

It is run with two datagens, _grpkey_floats_with_nan_zero_grouping_keys and _grpkey_doubles_with_nan_zero_grouping_keys.

_grpkey_floats_with_nan_zero_grouping_keys = [
    ('a', RepeatSeqGen(FloatGen(nullable=(True, 10.0), special_cases=_nan_zero_float_special_cases), length=50)),
    ('b', IntegerGen(nullable=(True, 10.0))),
    ('c', LongGen())]
_grpkey_doubles_with_nan_zero_grouping_keys = [
    ('a', RepeatSeqGen(DoubleGen(nullable=(True, 10.0), special_cases=_nan_zero_double_special_cases), length=50)),
    ('b', FloatGen(nullable=(True, 10.0))),
    ('c', LongGen())]

The first uses an integer b column. The second uses a float b column, and it is the summing of this column that is causing the differences. This particular test is more about the keys in column a, so we could fix it by changing b to Integer in _grpkey_doubles_with_nan_zero_grouping_keys.

That datagen is used in one other test, test_count_distinct_with_nan_floats, but that test could use _grpkey_floats_with_nulls_and_nans. In fact, it should because the one it is currently using does not have nans in the float column that it is counting.

There is definitely an issue with differences when summing floating point columns, but I think that is effectively the same issue as reported in #9822 for test_hash_reduction_sum. I propose we address that issue there, and fix the problem with test_hash_agg_with_nan_keys and test_count_distinct_with_nan_floats by adjusting the datagens used.

I will add some additional findings in a separate comment.

@jbrennan333
Collaborator

I have been running some tests on the set of floating point values that I captured from the test input when run with the given seed value. There are only 21 values.

+--------------+
|b             |
+--------------+
|348.6652      |
|null          |
|8.2052361E17  |
|-9.615448E24  |
|3.4028235E38  |
|8.3188725E21  |
|-1.86747077E11|
|2.2641698E-5  |
|1.8981274E-15 |
|-8.9130315E23 |
|8.241963E-4   |
|-3.4028235E38 |
|4.86208753E14 |
|2.825404E25   |
|9.7615254E-4  |
|3.074054E-20  |
|1.870257E-19  |
|-1.1672826E-14|
|-8.2195184E-17|
|5.5520855E20  |
|1.751082E-24  |
+--------------+

I uploaded a parquet file with these values:
sum-test-vals.parquet.gz

I made a sorted and reverse sorted version of this, and then ran sums on them on cpu and gpu. (SELECT sum(b) as sum from data)

These are the results:

SUM CPU: 1.7714274095956874e+25
SUM GPU: 1.7718874252274046e+25
SUM CPU/GPU APPROX EQ: False

SORTED SUM CPU: 1.7718319043726909e+25
SORTED SUM GPU: 1.7756097975589866e+25
SORTED SUM APPROX EQ: False

REVERSE SORTED SUM CPU: 1.7718319043726909e+25
REVERSE SORTED SUM GPU: 1.7718319043726909e+25
REVERSE SORTED APPROX EQ: True
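The order sensitivity can be shown with plain Python doubles. Spark widens the float column to double before summing, so a left-to-right Python sum models a single-partition CPU fold; the values below are my transcription of the 20 non-null values from the accumulation tables in this comment:

```python
# The 20 non-null float values (widened to double), in the unsorted
# order they appear in the input.
vals = [
    348.6651916503906, 8.2052360892841984e17, -9.615447782308683e24,
    3.4028234663852886e38, 8.318872465393406e21, -1.86747076608e11,
    2.264169779664371e-5, 1.8981273546887127e-15, -8.913031508548466e23,
    8.241963223554194e-4, -3.4028234663852886e38, 4.862087528448e14,
    2.8254040876688576e25, 9.761525434441864e-4, 3.074053895370231e-20,
    1.8702570451930148e-19, -1.16728263678729e-14, -8.219518384649575e-17,
    5.552085471368388e20, 1.7510820801886195e-24,
]

# Left-to-right double addition, like a sequential single-partition CPU sum.
unsorted_sum = sum(vals)
sorted_sum = sum(sorted(vals))
print(unsorted_sum, sorted_sum)  # the two orders round to different results
```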

I then ran a cumulative sum (SELECT b, sum(b) over (rows between unbounded preceding and current row) as cumulative from data) to see if I could catch where the difference is introduced:

UNSORTED ACCUMULATIONS:
+----------------------+----------------------+----------------------+
|b                     |cpu_accum             |gpu_accum             |
+----------------------+----------------------+----------------------+
|348.6651916503906     |348.6651916503906     |348.6651916503906     |
|null                  |348.6651916503906     |348.6651916503906     |
|8.2052360892841984E17 |8.2052360892842022E17 |8.2052360892842022E17 |
|-9.615447782308683E24 |-9.615446961785074E24 |-9.615446961785074E24 |
|3.4028234663852886E38 |3.4028234663851923E38 |3.4028234663851923E38 |
|8.318872465393406E21  |3.4028234663851923E38 |3.4028234663851923E38 |
|-1.86747076608E11     |3.4028234663851923E38 |3.4028234663851923E38 |
|2.264169779664371E-5  |3.4028234663851923E38 |3.4028234663851923E38 |
|1.8981273546887127E-15|3.4028234663851923E38 |3.4028234663851923E38 |
|-8.913031508548466E23 |3.4028234663851832E38 |3.4028234663851832E38 |
|8.241963223554194E-4  |3.4028234663851832E38 |3.4028234663851832E38 |
|-3.4028234663852886E38|-1.0540321989765048E25|-1.0540321989765048E25|
|4.862087528448E14     |-1.0540321989278838E25|-1.0540321989278838E25|
|2.8254040876688576E25 |1.7713718887409737E25 |1.7713718887409737E25 |
|9.761525434441864E-4  |1.7713718887409737E25 |1.7718319043726909E25 |
|3.074053895370231E-20 |1.7713718887409737E25 |1.7718319043726909E25 |
|1.8702570451930148E-19|1.7713718887409737E25 |1.7718319043726909E25 |
|-1.16728263678729E-14 |1.7713718887409737E25 |1.7718319043726909E25 |
|-8.219518384649575E-17|1.7713718887409737E25 |1.7718319043726909E25 |
|5.552085471368388E20  |1.7714274095956874E25 |1.7718874252274046E25 |
|1.7510820801886195E-24|1.7714274095956874E25 |1.7718874252274046E25 |
+----------------------+----------------------+----------------------+
SORTED ACCUMULATIONS:
+----------------------+----------------------+----------------------+
|b                     |cpu_accum             |gpu_accum             |
+----------------------+----------------------+----------------------+
|null                  |null                  |null                  |
|-3.4028234663852886E38|-3.4028234663852886E38|-3.4028234663852886E38|
|-9.615447782308683E24 |-3.402823466385385E38 |-3.402823466385385E38 |
|-8.913031508548466E23 |-3.402823466385394E38 |-3.402823466385394E38 |
|-1.86747076608E11     |-3.402823466385394E38 |-3.402823466385394E38 |
|-1.16728263678729E-14 |-3.402823466385394E38 |-3.402823466385394E38 |
|-8.219518384649575E-17|-3.402823466385394E38 |-3.402823466385394E38 |
|1.7510820801886195E-24|-3.402823466385394E38 |-3.402823466385394E38 |
|3.074053895370231E-20 |-3.402823466385394E38 |-3.402823466385394E38 |
|1.8702570451930148E-19|-3.402823466385394E38 |-3.402823466385394E38 |
|1.8981273546887127E-15|-3.402823466385394E38 |-3.402823466385394E38 |
|2.264169779664371E-5  |-3.402823466385394E38 |-3.402823466385394E38 |
|8.241963223554194E-4  |-3.402823466385394E38 |-3.402823466385394E38 |
|9.761525434441864E-4  |-3.402823466385394E38 |-3.402823466385394E38 |
|348.6651916503906     |-3.402823466385394E38 |-3.402823466385394E38 |
|4.862087528448E14     |-3.402823466385394E38 |-3.402823466385394E38 |
|8.2052360892841984E17 |-3.402823466385394E38 |-3.402823466385394E38 |
|5.552085471368388E20  |-3.402823466385394E38 |-3.402823466385394E38 |
|8.318872465393406E21  |-3.402823466385394E38 |-3.402823466385394E38 |
|2.8254040876688576E25 |-3.4028234663851114E38|-3.4028234663851114E38|
|3.4028234663852886E38 |1.7718319043726909E25 |1.7718319043726909E25 |
+----------------------+----------------------+----------------------+
NOTE:  I don't understand why I get a different final result on GPU with
the accumulated values than I got from just doing the sum:
SORTED SUM GPU: 1.7756097975589866e+25

REVERSE SORTED ACCUMULATIONS:
+----------------------+---------------------+---------------------+
|b                     |cpu_accum            |gpu_accum            |
+----------------------+---------------------+---------------------+
|3.4028234663852886E38 |3.4028234663852886E38|3.4028234663852886E38|
|2.8254040876688576E25 |3.402823466385571E38 |3.402823466385571E38 |
|8.318872465393406E21  |3.402823466385571E38 |3.402823466385571E38 |
|5.552085471368388E20  |3.402823466385571E38 |3.402823466385571E38 |
|8.2052360892841984E17 |3.402823466385571E38 |3.402823466385571E38 |
|4.862087528448E14     |3.402823466385571E38 |3.402823466385571E38 |
|348.6651916503906     |3.402823466385571E38 |3.402823466385571E38 |
|9.761525434441864E-4  |3.402823466385571E38 |3.402823466385571E38 |
|8.241963223554194E-4  |3.402823466385571E38 |3.402823466385571E38 |
|2.264169779664371E-5  |3.402823466385571E38 |3.402823466385571E38 |
|1.8981273546887127E-15|3.402823466385571E38 |3.402823466385571E38 |
|1.8702570451930148E-19|3.402823466385571E38 |3.402823466385571E38 |
|3.074053895370231E-20 |3.402823466385571E38 |3.402823466385571E38 |
|1.7510820801886195E-24|3.402823466385571E38 |3.402823466385571E38 |
|-8.219518384649575E-17|3.402823466385571E38 |3.402823466385571E38 |
|-1.16728263678729E-14 |3.402823466385571E38 |3.402823466385571E38 |
|-1.86747076608E11     |3.402823466385571E38 |3.402823466385571E38 |
|-8.913031508548466E23 |3.402823466385562E38 |3.402823466385562E38 |
|-9.615447782308683E24 |3.4028234663854658E38|3.4028234663854658E38|
|-3.4028234663852886E38|1.7718319043726909E25|1.7718319043726909E25|
|null                  |1.7718319043726909E25|1.7718319043726909E25|
+----------------------+---------------------+---------------------+

This sequence from the unsorted accumulations is interesting:

|2.8254040876688576E25 |1.7713718887409737E25 |1.7713718887409737E25 |
|9.761525434441864E-4  |1.7713718887409737E25 |1.7718319043726909E25 |

I don't understand why the GPU gets a different value here. I tried just summing the two values 1.7713718887409737E25 and 9.761525434441864E-4, but I can't reproduce the difference; I get the same result on CPU and GPU with just those two values.
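One plausible explanation (speculation on my part): the GPU computes the sum as a parallel, tree-shaped reduction rather than a left-to-right fold, so the divergence comes from how operands are grouped, not from any single two-value addition. That would explain why a two-value repro fails. A sketch with a hypothetical pairwise grouping:

```python
def pairwise_sum(xs):
    """Tree-shaped reduction, loosely modeling a parallel GPU sum.

    The exact grouping the GPU uses is an assumption here; the point
    is that any regrouping can change the rounded result.
    """
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

vals = [1e16, 1.0, 1.0, 1.0]

sequential = sum(vals)         # each 1.0 is absorbed in turn: 1e16
pairwise = pairwise_sum(vals)  # (1e16 + 1.0) + (1.0 + 1.0) == 1e16 + 2.0
print(sequential == pairwise)  # False
```

So the same multiset of inputs, summed sequentially and as a tree, disagrees, while any individual pair of adds taken in isolation matches on both devices.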
