Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] conditionals_test.py::test_conditional_with_side_effects_cast[String] failed with DATAGEN_SEED=1701976979 #9992

Closed
revans2 opened this issue Dec 7, 2023 · 6 comments · Fixed by #10090 or #10112
Assignees
Labels
bug Something isn't working test Only impacts tests

Comments

@revans2
Copy link
Collaborator

revans2 commented Dec 7, 2023

Describe the bug

TEST_PARALLEL=0 DATAGEN_SEED=1701976979 TZ=UTC ./run_pyspark_from_build.sh --test_oom_injection_mode=always -k 'test_conditional_with_side_effects_cast'

Causes this to fail all the time. It looks like it is a test error because this is failing on the CPU with no GPU code running. I think it is related to the RLIKE that runs, but I am not 100% sure. Probably the \z is not escaping things properly.

@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify test Only impacts tests labels Dec 7, 2023
@revans2
Copy link
Collaborator Author

revans2 commented Dec 7, 2023

Yup if I update the test to escape the z more we get an error because we cannot support it on the GPU. But using a $ works. But I don't think that is the right solution because we might have 100\nBAD STUFF which would cause the same kind of problem. I think the ^ is also not correct.

We probably want to update test_conditional_with_side_effects_case_when too that is doing something similar. But we also want to update the generated data to make it actually hit the various parts and verify that the input/output data looks correct.

@gerashegalov
Copy link
Collaborator

This is an instance where the test is not accounting for the need to \\\\ which can be avoided by using raw strings either at Python or SQL levels #8289

Currently the test ends up using the regex ^[0-9]{1,5}z instead of ^[0-9]{1,5}\z.

And the seed produces a string starting with 5z making it a match for RLIKE. Thus both CPU and GPU fail

spark.sql("SELECT a, a RLIKE '^[0-9]{1,5}z' FROM df WHERE a RLIKE '^[0-9]{1,5}z'").show()
+----------+--------------------+
|         a|a RLIKE ^[0-9]{1,5}z|
+----------+--------------------+
|5z.Q��ô�Ù�|                true|
+----------+--------------------+

@abellina
Copy link
Collaborator

@gerashegalov for this test, the @datagen_override is still there, yet the issue is closed (https://github.com/NVIDIA/spark-rapids/blob/branch-24.02/integration_tests/src/main/python/conditionals_test.py#L211).

Is that intended? I am going through various places where we'd like to mark the override permanent, and this one stands out.

@gerashegalov
Copy link
Collaborator

Good catch @abellina. #10090 should have deleted the override

gerashegalov added a commit to gerashegalov/spark-rapids that referenced this issue Dec 28, 2023
@abellina
Copy link
Collaborator

abellina commented Dec 28, 2023

Good catch @abellina. #10090 should have deleted the override

thanks for confirming! I can remove in the test pr I have open.

@gerashegalov
Copy link
Collaborator

Good catch @abellina. #10090 should have deleted the override

thanks for confirming! I can remove in the test pr I have open.

I opened #10112 . Let us keep it out of the bigger scope PR you are working on in case it gets reverted :)

gerashegalov added a commit that referenced this issue Dec 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working test Only impacts tests
Projects
None yet
5 participants