
[BUG] test_hash_agg_with_nan_keys failed with a DATAGEN_SEED=1702335559 #10026

Closed
revans2 opened this issue Dec 12, 2023 · 8 comments · Fixed by #10148
Labels
bug Something isn't working

Comments

@revans2
Collaborator

revans2 commented Dec 12, 2023

Describe the bug

FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_agg_with_nan_keys[[('a', RepeatSeq(Double)), ('b', Float), ('c', Long)]][DATAGEN_SEED=1702335559, INJECT_OOM, IGNORE_ORDER, APPROXIMATE_FLOAT] - AssertionError: GPU and CPU float values are different [17, 'sum_of_bees']

It was running against Spark 3.2.4.

I tried to reproduce this locally and I could not. The only explanation I have is that the GPU did the sum in a non-deterministic order and we somehow ended up with a number that was larger than approximate float could handle. The key (a) on line 17 was
-3.1261173254359074e-295 and the values were 1.7718319043726909e+25 vs 1.7756097975589866e+25.

Not sure what more to do except dig in a little deeper to see what might have caused the result by looking at the values used by the SUM.
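As background for why a different summation order can change the answer: IEEE-754 floating-point addition is not associative, so regrouping the very same operands can produce a different rounded result. A minimal Python sketch (my illustration, not taken from the test itself):

```python
# Floating-point addition is not associative: regrouping the same
# operands can change the rounded result.
a, b, c = 1e16, 1.0, 1.0

left = (a + b) + c   # each 1.0 is absorbed into 1e16 (ulp there is 2.0)
right = a + (b + c)  # 2.0 is large enough to survive the add

print(left, right, left == right)  # 1e+16 1.0000000000000002e+16 False
```

With a group-by hash aggregate, the per-group operand order depends on partitioning and kernel scheduling, so runs can legitimately round differently.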

@jbrennan333
Collaborator

When I ran this in a loop 100 times, I got it to fail just once:

[gw1] [100%] FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_agg_with_nan_keys[[('a', RepeatSeq(Double)), ('b', Float), ('c', Long)]][DATAGEN_SEED=1702335559, INJECT_OOM, IGNORE_ORDER, APPROXIMATE_FLOAT]

cpu = 1.7718319043726909e+25, gpu = 1.7726637916678512e+25
path = [17, 'sum_of_bees']

-Row(a=-3.1261173254359074e-295, count_stars=21, count_parameterless=0, count_bees=20, sum_of_bees=1.7718319043726909e+25, max_seas=7098372324280358315, min_seas=-9170459960093196357, count_distinct_cees=20, average_seas=4.216639238479068e+17)
+Row(a=-3.1261173254359074e-295, count_stars=21, count_parameterless=0, count_bees=20, sum_of_bees=1.7726637916678512e+25, max_seas=7098372324280358315, min_seas=-9170459960093196357, count_distinct_cees=20, average_seas=4.216639238479067e+17)

Need to capture the contents of the b columns.

@jbrennan333
Collaborator

input.json

I captured the input data so that I could use it to experiment.

@jbrennan333
Collaborator

jbrennan333 commented Dec 20, 2023

I read in the data with:

import org.apache.spark.sql.types._
val schema = StructType(StructField("a", DoubleType, true) :: StructField("b", FloatType, true) :: StructField("c", LongType, true) :: Nil)
val cdf = spark.read.schema(schema).json("/opt/data/jimb/debug-input.json")

Then

cdf.createOrReplaceTempView("input")
val sumCpu = spark.sql("select a, sum(b) as sum_of_bees from input group by a")

I ran this on cpu and gpu and compared the problematic row (with a=-3.1261173254359074e-295).

(unsorted)
	GPU:
		|-3.1261173254359074E-295|1.776076418134077E25|
		|-3.1261173254359074E-295|1.7756164845547208E25 |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		
	CPU:
		|-3.1261173254359074E-295|1.7714274095956874E25|
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |

You can see that the GPU results have some variability, while the CPU is consistent.
I wondered if this is caused by differences in the order of the sums due to the hash aggregation.
So I tried sorting on columns a and b:

Sort by a, b columns before doing aggregate

	GPU:
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		
	CPU:
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		|-3.1261173254359074E-295|1.7756097975589866E25 |
		|-3.1261173254359074E-295|1.7756097975589866E25 |

This produces consistent results.

Sort by a column only

	GPU:
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		
	CPU:
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		
Sort by a column only, descending

	GPU:
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
		|-3.1261173254359074E-295|1.776076500186438E25  |
	CPU:
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |
		|-3.1261173254359074E-295|1.7714274095956874E25 |

Consistent, but different. Let's try sorting by b only:

Sort by b column only

	GPU:
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
	CPU:
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |

Sort by b column only, descending

	GPU:
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
	CPU:
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |
		|-3.1261173254359074E-295|1.7718319043726909E25 |

Sorting by column b gives us consistent results either way.

@jbrennan333
Collaborator

So my theory is that the non-deterministic ordering from the hash aggregate is what causes the results to vary on the GPU, and every once in a while the difference is large enough to fall outside the approximate float comparison.

@jbrennan333
Collaborator

@revans2 any thoughts on this? Should we adjust the approximate_float threshold for this test, or do we need to investigate further?

@revans2
Collaborator Author

revans2 commented Dec 28, 2023

@jbrennan333 that is my theory too. What I wanted to do was to take the input data for this particular key and see if we can figure out what the order would need to be to make this problem happen. I don't know how many values there are, but hopefully the set is not so big that the number of possible orderings makes this impractical.

If we can prove that is what happened, then we really only have a few choices.

  1. Pretend the problem does not exist.
  2. Acknowledge the problem exists and try to help qualify it, especially if it is only ever going to be bad for very specific operators in very specific situations.
  3. Fix the problem.

Option 1 feels horrible in general and we should not do it. Option 3 is great, but I don't know how to do that without slowing down the GPU SUM operation massively, and even then it would only get the same answer as Spark if the inputs are in the same order, which Spark never guarantees. So we are left with option 2, which means we should try to understand how big a difference is possible in different situations, document them, and ideally update the tests so that we reduce the impact of making approximate_float less strict. Happily, approximate_float can take parameters to control the comparison. The args are fed directly into https://docs.pytest.org/en/4.6.x/reference.html#pytest-approx
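To put numbers on the tolerance involved (my illustration, using math.isclose as a stand-in for pytest.approx's relative comparison, which defaults to rel=1e-6): the failing CPU/GPU pair differs by roughly 5e-4 relative, so it fails the default tolerance but would pass a looser one:

```python
import math

# CPU and GPU sums from the reported failure at path [17, 'sum_of_bees']
cpu = 1.7718319043726909e+25
gpu = 1.7726637916678512e+25

rel_diff = abs(cpu - gpu) / abs(cpu)
print(rel_diff)  # roughly 4.7e-4

# math.isclose stands in for pytest.approx's relative check here.
print(math.isclose(cpu, gpu, rel_tol=1e-6))  # False: outside the default tolerance
print(math.isclose(cpu, gpu, rel_tol=1e-3))  # True: a looser tolerance would pass
```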

@jbrennan333
Collaborator

One thing I noticed is that this particular test, test_hash_agg_with_nan_keys, doesn't actually need to be exposed to this issue at all.

It is run with two datagens, _grpkey_floats_with_nan_zero_grouping_keys and _grpkey_doubles_with_nan_zero_grouping_keys.

_grpkey_floats_with_nan_zero_grouping_keys = [
    ('a', RepeatSeqGen(FloatGen(nullable=(True, 10.0), special_cases=_nan_zero_float_special_cases), length=50)),
    ('b', IntegerGen(nullable=(True, 10.0))),
    ('c', LongGen())]
_grpkey_doubles_with_nan_zero_grouping_keys = [
    ('a', RepeatSeqGen(DoubleGen(nullable=(True, 10.0), special_cases=_nan_zero_double_special_cases), length=50)),
    ('b', FloatGen(nullable=(True, 10.0))),
    ('c', LongGen())]

The first uses an integer b column. The second uses a float b column, and it is the summing of this column that is causing the differences. This particular test is more about the keys in column a, so we could fix it by changing b to Integer in _grpkey_doubles_with_nan_zero_grouping_keys.

That datagen is used in one other test, test_count_distinct_with_nan_floats, but that test could use _grpkey_floats_with_nulls_and_nans. In fact, it should because the one it is currently using does not have nans in the float column that it is counting.

There is definitely an issue with differences when summing floating point columns, but I think that is effectively the same issue as reported in #9822 for test_hash_reduction_sum. I propose we address that issue there, and fix the problem with test_hash_agg_with_nan_keys and test_count_distinct_with_nan_floats by adjusting the datagens used.

I will add some additional findings in a separate comment.

@jbrennan333
Collaborator

I have been running some tests on the set of floating point values that I captured from the test input when run with the given seed value. There are only 21 values.

+--------------+
|b             |
+--------------+
|348.6652      |
|null          |
|8.2052361E17  |
|-9.615448E24  |
|3.4028235E38  |
|8.3188725E21  |
|-1.86747077E11|
|2.2641698E-5  |
|1.8981274E-15 |
|-8.9130315E23 |
|8.241963E-4   |
|-3.4028235E38 |
|4.86208753E14 |
|2.825404E25   |
|9.7615254E-4  |
|3.074054E-20  |
|1.870257E-19  |
|-1.1672826E-14|
|-8.2195184E-17|
|5.5520855E20  |
|1.751082E-24  |
+--------------+

I uploaded a parquet file with these values:
sum-test-vals.parquet.gz

I made a sorted and reverse sorted version of this, and then ran sums on them on cpu and gpu. (SELECT sum(b) as sum from data)

These are the results:

SUM CPU: 1.7714274095956874e+25
SUM GPU: 1.7718874252274046e+25
SUM CPU/GPU APPROX EQ: False

SORTED SUM CPU: 1.7718319043726909e+25
SORTED SUM GPU: 1.7756097975589866e+25
SORTED SUM APPROX EQ: False

REVERSE SORTED SUM CPU: 1.7718319043726909e+25
REVERSE SORTED SUM GPU: 1.7718319043726909e+25
REVERSE SORTED APPROX EQ: True
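The order sensitivity can be shown with plain Python doubles. Spark widens the float column to double before summing, so a left-to-right Python sum models a single-partition CPU fold; the values below are my transcription of the 20 non-null values from the accumulation tables in this comment:

```python
# The 20 non-null float values (widened to double), in the unsorted
# order they appear in the input.
vals = [
    348.6651916503906, 8.2052360892841984e17, -9.615447782308683e24,
    3.4028234663852886e38, 8.318872465393406e21, -1.86747076608e11,
    2.264169779664371e-5, 1.8981273546887127e-15, -8.913031508548466e23,
    8.241963223554194e-4, -3.4028234663852886e38, 4.862087528448e14,
    2.8254040876688576e25, 9.761525434441864e-4, 3.074053895370231e-20,
    1.8702570451930148e-19, -1.16728263678729e-14, -8.219518384649575e-17,
    5.552085471368388e20, 1.7510820801886195e-24,
]

# Left-to-right double addition, like a sequential single-partition CPU sum.
unsorted_sum = sum(vals)
sorted_sum = sum(sorted(vals))
print(unsorted_sum, sorted_sum)  # the two orders round to different results
```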

I then ran a cumulative sum (SELECT b, sum(b) over (rows between unbounded preceding and current row) as cumulative from data) to see if I could catch where the difference is introduced:

UNSORTED ACCUMULATIONS:
+----------------------+----------------------+----------------------+
|b                     |cpu_accum             |gpu_accum             |
+----------------------+----------------------+----------------------+
|348.6651916503906     |348.6651916503906     |348.6651916503906     |
|null                  |348.6651916503906     |348.6651916503906     |
|8.2052360892841984E17 |8.2052360892842022E17 |8.2052360892842022E17 |
|-9.615447782308683E24 |-9.615446961785074E24 |-9.615446961785074E24 |
|3.4028234663852886E38 |3.4028234663851923E38 |3.4028234663851923E38 |
|8.318872465393406E21  |3.4028234663851923E38 |3.4028234663851923E38 |
|-1.86747076608E11     |3.4028234663851923E38 |3.4028234663851923E38 |
|2.264169779664371E-5  |3.4028234663851923E38 |3.4028234663851923E38 |
|1.8981273546887127E-15|3.4028234663851923E38 |3.4028234663851923E38 |
|-8.913031508548466E23 |3.4028234663851832E38 |3.4028234663851832E38 |
|8.241963223554194E-4  |3.4028234663851832E38 |3.4028234663851832E38 |
|-3.4028234663852886E38|-1.0540321989765048E25|-1.0540321989765048E25|
|4.862087528448E14     |-1.0540321989278838E25|-1.0540321989278838E25|
|2.8254040876688576E25 |1.7713718887409737E25 |1.7713718887409737E25 |
|9.761525434441864E-4  |1.7713718887409737E25 |1.7718319043726909E25 |
|3.074053895370231E-20 |1.7713718887409737E25 |1.7718319043726909E25 |
|1.8702570451930148E-19|1.7713718887409737E25 |1.7718319043726909E25 |
|-1.16728263678729E-14 |1.7713718887409737E25 |1.7718319043726909E25 |
|-8.219518384649575E-17|1.7713718887409737E25 |1.7718319043726909E25 |
|5.552085471368388E20  |1.7714274095956874E25 |1.7718874252274046E25 |
|1.7510820801886195E-24|1.7714274095956874E25 |1.7718874252274046E25 |
+----------------------+----------------------+----------------------+
SORTED ACCUMULATIONS:
+----------------------+----------------------+----------------------+
|b                     |cpu_accum             |gpu_accum             |
+----------------------+----------------------+----------------------+
|null                  |null                  |null                  |
|-3.4028234663852886E38|-3.4028234663852886E38|-3.4028234663852886E38|
|-9.615447782308683E24 |-3.402823466385385E38 |-3.402823466385385E38 |
|-8.913031508548466E23 |-3.402823466385394E38 |-3.402823466385394E38 |
|-1.86747076608E11     |-3.402823466385394E38 |-3.402823466385394E38 |
|-1.16728263678729E-14 |-3.402823466385394E38 |-3.402823466385394E38 |
|-8.219518384649575E-17|-3.402823466385394E38 |-3.402823466385394E38 |
|1.7510820801886195E-24|-3.402823466385394E38 |-3.402823466385394E38 |
|3.074053895370231E-20 |-3.402823466385394E38 |-3.402823466385394E38 |
|1.8702570451930148E-19|-3.402823466385394E38 |-3.402823466385394E38 |
|1.8981273546887127E-15|-3.402823466385394E38 |-3.402823466385394E38 |
|2.264169779664371E-5  |-3.402823466385394E38 |-3.402823466385394E38 |
|8.241963223554194E-4  |-3.402823466385394E38 |-3.402823466385394E38 |
|9.761525434441864E-4  |-3.402823466385394E38 |-3.402823466385394E38 |
|348.6651916503906     |-3.402823466385394E38 |-3.402823466385394E38 |
|4.862087528448E14     |-3.402823466385394E38 |-3.402823466385394E38 |
|8.2052360892841984E17 |-3.402823466385394E38 |-3.402823466385394E38 |
|5.552085471368388E20  |-3.402823466385394E38 |-3.402823466385394E38 |
|8.318872465393406E21  |-3.402823466385394E38 |-3.402823466385394E38 |
|2.8254040876688576E25 |-3.4028234663851114E38|-3.4028234663851114E38|
|3.4028234663852886E38 |1.7718319043726909E25 |1.7718319043726909E25 |
+----------------------+----------------------+----------------------+
NOTE:  I don't understand why I get a different final result on GPU with
the accumulated values than I got from just doing the sum:
SORTED SUM GPU: 1.7756097975589866e+25

REVERSE SORTED ACCUMULATIONS:
+----------------------+---------------------+---------------------+
|b                     |cpu_accum            |gpu_accum            |
+----------------------+---------------------+---------------------+
|3.4028234663852886E38 |3.4028234663852886E38|3.4028234663852886E38|
|2.8254040876688576E25 |3.402823466385571E38 |3.402823466385571E38 |
|8.318872465393406E21  |3.402823466385571E38 |3.402823466385571E38 |
|5.552085471368388E20  |3.402823466385571E38 |3.402823466385571E38 |
|8.2052360892841984E17 |3.402823466385571E38 |3.402823466385571E38 |
|4.862087528448E14     |3.402823466385571E38 |3.402823466385571E38 |
|348.6651916503906     |3.402823466385571E38 |3.402823466385571E38 |
|9.761525434441864E-4  |3.402823466385571E38 |3.402823466385571E38 |
|8.241963223554194E-4  |3.402823466385571E38 |3.402823466385571E38 |
|2.264169779664371E-5  |3.402823466385571E38 |3.402823466385571E38 |
|1.8981273546887127E-15|3.402823466385571E38 |3.402823466385571E38 |
|1.8702570451930148E-19|3.402823466385571E38 |3.402823466385571E38 |
|3.074053895370231E-20 |3.402823466385571E38 |3.402823466385571E38 |
|1.7510820801886195E-24|3.402823466385571E38 |3.402823466385571E38 |
|-8.219518384649575E-17|3.402823466385571E38 |3.402823466385571E38 |
|-1.16728263678729E-14 |3.402823466385571E38 |3.402823466385571E38 |
|-1.86747076608E11     |3.402823466385571E38 |3.402823466385571E38 |
|-8.913031508548466E23 |3.402823466385562E38 |3.402823466385562E38 |
|-9.615447782308683E24 |3.4028234663854658E38|3.4028234663854658E38|
|-3.4028234663852886E38|1.7718319043726909E25|1.7718319043726909E25|
|null                  |1.7718319043726909E25|1.7718319043726909E25|
+----------------------+---------------------+---------------------+

This sequence from the unsorted accumulations is interesting:

|2.8254040876688576E25 |1.7713718887409737E25 |1.7713718887409737E25 |
|9.761525434441864E-4  |1.7713718887409737E25 |1.7718319043726909E25 |

I don't understand why the GPU gets a different value here. I tried just summing the two values 1.7713718887409737E25 and 9.761525434441864E-4, but I can't reproduce the difference; I get the same result on CPU and GPU with just those two values.
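One plausible explanation (speculation on my part): the GPU computes the sum as a parallel, tree-shaped reduction rather than a left-to-right fold, so the divergence comes from how operands are grouped, not from any single two-value addition. That would explain why a two-value repro fails. A sketch with a hypothetical pairwise grouping:

```python
def pairwise_sum(xs):
    """Tree-shaped reduction, loosely modeling a parallel GPU sum.

    The exact grouping the GPU uses is an assumption here; the point
    is that any regrouping can change the rounded result.
    """
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

vals = [1e16, 1.0, 1.0, 1.0]

sequential = sum(vals)         # each 1.0 is absorbed in turn: 1e16
pairwise = pairwise_sum(vals)  # (1e16 + 1.0) + (1.0 + 1.0) == 1e16 + 2.0
print(sequential == pairwise)  # False
```

So the same multiset of inputs, summed sequentially and as a tree, disagrees, while any individual pair of adds taken in isolation matches on both devices.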
