[BUG] ast_test.py::test_X[(String, True)][DATAGEN_SEED=1700205785] failed #9771

Closed
pxLi opened this issue Nov 17, 2023 · 7 comments · Fixed by #9809
Labels: bug (Something isn't working), test (Only impacts tests)

Comments

pxLi (Collaborator) commented Nov 17, 2023

Describe the bug
The nightly build failed at commit 30c3df3.
Internal pipeline: rapids_integration-scala213-dev-github, build ID: 15 (Scala 2.13 build + CUDA 12 + JDK 17 runtime)

All nightly integration test pipelines are currently failing.

Failure list:

[2023-11-17T08:50:24.100Z] FAILED ../../src/main/python/ast_test.py::test_eq[(String, True)][DATAGEN_SEED=1700205785] - AssertionError: GPU and CPU boolean values are different at [0, '(a = CAST(ÍDøñ\x06A7\x92DèY¥~\x88\x95~DP.Ù\x05×\x15p\x07Çw¨çh AS STRING))']
[2023-11-17T08:50:24.100Z] FAILED ../../src/main/python/ast_test.py::test_ne[(String, True)][DATAGEN_SEED=1700205785] - AssertionError: GPU and CPU boolean values are different at [0, '(NOT (a = CAST(ÍDøñ\x06A7\x92DèY¥~\x88\x95~DP.Ù\x05×\x15p\x07Çw¨çh AS STRING)))']
[2023-11-17T08:50:24.100Z] FAILED ../../src/main/python/ast_test.py::test_lt[(String, True)][DATAGEN_SEED=1700205785] - AssertionError: GPU and CPU boolean values are different at [1, '(a < CAST(ÍDøñ\x06A7\x92DèY¥~\x88\x95~DP.Ù\x05×\x15p\x07Çw¨çh AS STRING))']
[2023-11-17T08:50:24.100Z] FAILED ../../src/main/python/ast_test.py::test_lte[(String, True)][DATAGEN_SEED_OVERRIDE=0] - AssertionError: GPU and CPU boolean values are different at [0, '(a <= CAST(øË\x82âæb÷(´úBH^:\t]/4kª\x94Uãç\x05\x82Â\x08L\xad AS STRING))']
[2023-11-17T08:50:24.100Z] FAILED ../../src/main/python/ast_test.py::test_gt[(String, True)][DATAGEN_SEED=1700205785, INJECT_OOM] - AssertionError: GPU and CPU boolean values are different at [0, '(a > CAST(øË\x82âæb÷(´úBH^:\t]/4kª\x94Uãç\x05\x82Â\x08L\xad AS STRING))']
[2023-11-17T08:50:24.100Z] FAILED ../../src/main/python/ast_test.py::test_gte[(String, True)][DATAGEN_SEED=1700205785] - AssertionError: GPU and CPU boolean values are different at [0, '(CAST(0\x8dÁ|by#Æa/H&\x86s#§q\x1a\x813(v7ëÜÙèz\x16\x14 AS STRING) >= b)']
data_descr = (String, True)

    @pytest.mark.parametrize('data_descr', ast_comparable_descrs, ids=idfn)
    def test_eq(data_descr):
        (s1, s2) = with_cpu_session(lambda spark: gen_scalars(data_descr[0], 2))
>       assert_binary_ast(data_descr,
            lambda df: df.select(
                f.col('a') == s1,
                s2 == f.col('b'),
                f.col('a') == f.col('b')))

../../src/main/python/ast_test.py:237: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../src/main/python/ast_test.py:70: in assert_binary_ast
    assert_gpu_ast(is_supported, lambda spark: func(binary_op_df(spark, data_gen)), conf=conf)
../../src/main/python/ast_test.py:58: in assert_gpu_ast
    assert_cpu_and_gpu_are_equal_collect_with_capture(
../../src/main/python/asserts.py:402: in assert_cpu_and_gpu_are_equal_collect_with_capture
    assert_equal(from_cpu, from_gpu)
../../src/main/python/asserts.py:107: in assert_equal
    _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
../../src/main/python/asserts.py:43: in _assert_equal
    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
../../src/main/python/asserts.py:36: in _assert_equal
    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cpu = True, gpu = False
float_check = <function get_float_check.<locals>.<lambda> at 0x7f30633245e0>
path = [0, '(a = CAST(ÍDøñ\x06A7\x92DèY¥~\x88\x95~DP.Ù\x05×\x15p\x07Çw¨çh AS STRING))']

    def _assert_equal(cpu, gpu, float_check, path):
        t = type(cpu)
        if (t is Row):
            assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
                assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
                for field in cpu.__fields__:
                    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
            else:
                for index in range(len(cpu)):
                    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is list):
            assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            for index in range(len(cpu)):
                _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is tuple):
            assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            for index in range(len(cpu)):
                _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is pytypes.GeneratorType):
            index = 0
            # generator has no zip :( so we have to do this the hard way
            done = False
            while not done:
                sub_cpu = None
                sub_gpu = None
                try:
                    sub_cpu = next(cpu)
                except StopIteration:
                    done = True
    
                try:
                    sub_gpu = next(gpu)
                except StopIteration:
                    done = True
    
                if done:
                    assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
                else:
                    _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
    
                index = index + 1
        elif (t is dict):
            # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
            # so sort the items to do our best with ignoring the order of dicts
            cpu_items = list(cpu.items()).sort(key=_RowCmp)
            gpu_items = list(gpu.items()).sort(key=_RowCmp)
            _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
        elif (t is int):
            assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
        elif (t is float):
            if (math.isnan(cpu)):
                assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
            else:
                assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
        elif isinstance(cpu, str):
            assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)
        elif isinstance(cpu, datetime):
            assert cpu == gpu, "GPU and CPU timestamp values are different at {}".format(path)
        elif isinstance(cpu, date):
            assert cpu == gpu, "GPU and CPU date values are different at {}".format(path)
        elif isinstance(cpu, bool):
>           assert cpu == gpu, "GPU and CPU boolean values are different at {}".format(path)
E           AssertionError: GPU and CPU boolean values are different at [0, '(a = CAST(ÍDøñ\x06A7\x92DèY¥~\x88\x95~DP.Ù\x05×\x15p\x07Çw¨çh AS STRING))']

../../src/main/python/asserts.py:91: AssertionError

pxLi added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Nov 17, 2023
pxLi changed the title from "[BUG] ast_test.py::test_eq[(String, True)][DATAGEN_SEED=1700205785] failed spark 332 + 340" to "[BUG] ast_test.py::test_X[(String, True)][DATAGEN_SEED=1700205785] failed spark 332 + 340" on Nov 17, 2023
pxLi added the "test" (Only impacts tests) label on Nov 17, 2023
pxLi changed the title from "[BUG] ast_test.py::test_X[(String, True)][DATAGEN_SEED=1700205785] failed spark 332 + 340" to "[BUG] ast_test.py::test_X[(String, True)][DATAGEN_SEED=1700205785] failed assertion in spark 332 + 340" on Nov 17, 2023
pxLi (Collaborator, Author) commented Nov 17, 2023

More environments may show the same issue after tonight's runs.

pxLi changed the title from "[BUG] ast_test.py::test_X[(String, True)][DATAGEN_SEED=1700205785] failed assertion in spark 332 + 340" to "[BUG] ast_test.py::test_X[(String, True)][DATAGEN_SEED=1700205785] failed" on Nov 20, 2023
pxLi (Collaborator, Author) commented Nov 20, 2023

Related fix: #9780

jlowe (Member) commented Nov 20, 2023

#9780 is specific to floating point and will not fix this. I xfailed the AST string tests in #9795.

I am unable to reproduce the AST string test failures on my desktop, and it's not failing in premerge (which explains how the change was able to commit).

pxLi (Collaborator, Author) commented Nov 21, 2023

> #9780 is specific to floating point and will not fix this. I xfailed the AST string tests in #9795.
>
> I am unable to reproduce the AST string test failures on my desktop, and it's not failing in premerge (which explains how the change was able to commit).

I just got a stable repro by setting LC_ALL=C when testing locally. The default locale in the CUDA image is not UTF-8 (we only enable UTF-8 for specific cases):
https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/Dockerfile-blossom.integration.ubuntu#L76
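For anyone trying the same repro, here is a minimal sketch of launching a command with LC_ALL forced to C from Python. The helper name `run_with_locale` is hypothetical, and the probe command is only a stand-in; the actual test invocation is not given in this thread, so substitute whatever command you normally use to run the integration tests.

```python
import os
import subprocess
import sys


def run_with_locale(cmd, locale_value="C"):
    """Run cmd with LC_ALL forced to locale_value (hypothetical helper)."""
    # Copy the current environment so the parent process is left untouched.
    env = dict(os.environ)
    env["LC_ALL"] = locale_value
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)


# Stand-in child process that only reports the LC_ALL value it inherited.
# Replace it with your real test command to attempt the repro.
probe = [sys.executable, "-c", "import os; print(os.environ.get('LC_ALL'))"]
result = run_with_locale(probe)
print(result.stdout.strip())
```

Forcing the variable on the child only keeps the rest of the shell session on its usual locale, which makes it easier to compare a C run against a UTF-8 run side by side.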

sameerz removed the "? - Needs Triage" (Need team to review and classify) label on Nov 21, 2023
winningsix (Collaborator) commented

> > #9780 is specific to floating point and will not fix this. I xfailed the AST string tests in #9795.
> > I am unable to reproduce the AST string test failures on my desktop, and it's not failing in premerge (which explains how the change was able to commit).
>
> I just got a stable repro by setting LC_ALL=C for testing locally, for the locale default in cuda image is not UTF8 (we only enable UTF8 for specific case) https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/Dockerfile-blossom.integration.ubuntu#L76

@pxLi I am wondering whether we can set LC_ALL=C in pre-merge as well. This could help expose such problems. Thoughts?

pxLi (Collaborator, Author) commented Nov 22, 2023

> > > #9780 is specific to floating point and will not fix this. I xfailed the AST string tests in #9795.
> > > I am unable to reproduce the AST string test failures on my desktop, and it's not failing in premerge (which explains how the change was able to commit).
> >
> > I just got a stable repro by setting LC_ALL=C for testing locally, for the locale default in cuda image is not UTF8 (we only enable UTF8 for specific case) https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/Dockerfile-blossom.integration.ubuntu#L76
>
> @pxLi I am thinking whether we can change pre-merge LC_ALL=C as well. This can help expose such problems. Thoughts?

Pre-merge should default to C as well; developers add a command-level environment variable when they want to test against UTF-8:
https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/spark-premerge-build.sh#L67
https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/spark-premerge-build.sh#L161

I am not sure whether some tests or environment settings cause side effects for the cases that follow.

If needed, feel free to force the locale to C in the run_pyspark script when no specific environment variable is set.
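The "force C unless something specific is set" idea could look roughly like the sketch below. It is written in Python purely for illustration; `apply_default_locale` is a hypothetical helper, and treating LC_ALL as the only variable that counts as a "specific" setting is an assumption, not something decided in this thread.

```python
def apply_default_locale(env, default="C"):
    """Return a copy of env with LC_ALL set to default unless the caller set it.

    Hypothetical helper: only LC_ALL is treated as an explicit override here.
    """
    updated = dict(env)
    # An empty LC_ALL is treated the same as an unset one.
    if not updated.get("LC_ALL"):
        updated["LC_ALL"] = default
    return updated


# No explicit setting: the default is applied.
print(apply_default_locale({})["LC_ALL"])
# Explicit setting: the caller's value is preserved.
print(apply_default_locale({"LC_ALL": "en_US.UTF-8"})["LC_ALL"])
```

Returning a copy rather than mutating the input keeps the caller's environment intact, so a test that needs UTF-8 can still pass its own value through.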

winningsix (Collaborator) commented

According to https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/spark-premerge-build.sh#L77C13-L77C16, pre-merge runs with the UTF-8 environment variable set. That may be why we see this issue only in nightly.
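To check which case a given pipeline falls into, a small sketch like the following can report whether an environment selects a UTF-8 character set. Both function names are hypothetical. The lookup order (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset) follows the POSIX rules for locale environment variables; it does not inspect how the JVM or Spark actually pick their default charset.

```python
import os


def effective_ctype_locale(env):
    """Return the locale name that governs character handling (hypothetical helper)."""
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    # With nothing set, the implementation default is the C/POSIX locale.
    return "C"


def is_utf8_locale(env):
    """Return True if the effective locale name carries a UTF-8 codeset."""
    # Locale names look like language_TERRITORY.codeset@modifier.
    codeset = effective_ctype_locale(env).partition(".")[2]
    codeset = codeset.partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


print(is_utf8_locale(dict(os.environ)))
```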
