Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip fastparquet timestamp tests when plugin cannot read/write timestamps #9831

Merged
merged 3 commits into from
Dec 1, 2023

Conversation

mythrocks
Copy link
Collaborator

Fixes #9776.

The tests in fastparquet_compatibility_test.py check for compatibility between Apache Spark, the Spark RAPIDS plugin, and fastparquet. In particular:

  1. test_reading_file_written_by_spark_cpu checks if timestamp columns written with Apache Spark are read similarly with fastparquet and the plugin.
  2. test_reading_file_written_with_gpu checks if timestamps written with the plugin are read the same on Apache Spark and fastparquet.

If the timezone is not set to UTC, and the system timezone isn't UTC either, the plugin falls back to CPU for read/write of Parquet timestamp columns. This would cause the above tests not to run: the plugin can neither read nor write timestamps on GPU.

Further, fastparquet seems to interpret timestamps written from Spark as being in UTC, regardless of the timezone settings. So on non-UTC timezones, Apache Spark and fastparquet get different results for the same input.

For the two reasons above, it is best to only run the three-way timestamp comparison tests in setups with UTC timezone.

This commit skips the timestamp tests described above, when a non-UTC timezone is detected.

Fixes NVIDIA#9776.

The tests in `fastparquet_compatibility_test.py` check for compatibility between
Apache Spark, the Spark RAPIDS plugin, and fastparquet. In particular:
1. `test_reading_file_written_by_spark_cpu` checks if timestamp columns written
    with Apache Spark are read similarly with fastparquet and the plugin.
2. `test_reading_file_written_with_gpu` checks if timestamps written with
   the plugin are read the same on Apache Spark and fastparquet.

If the timezone is not set to "UTC", and the system timezone isn't "UTC" either,
the plugin falls back to CPU for read/write of Parquet timestamp columns. This would
cause the above tests not to run: the plugin can neither read nor write timestamps
on GPU.

Further, fastparquet seems to interpret timestamps written from Spark as being
in "UTC", regardless of the timezone settings. So on non-UTC timezones,
Apache Spark and fastparquet get different results for the same input.

For the two reasons above, it is best to only run the three-way timestamp comparison
tests in setups with "UTC" timezone.

This commit skips the timestamp tests described above, when a non-UTC timezone is
detected.

Signed-off-by: MithunR <[email protected]>
@mythrocks mythrocks self-assigned this Nov 22, 2023
@mythrocks mythrocks added the test Only impacts tests label Nov 22, 2023
@mythrocks mythrocks changed the title Skip fastparquet timestamp tests for non-UTC timezones. Skip fastparquet timestamp tests when plugin cannot read/write timestamps Nov 22, 2023
@mythrocks
Copy link
Collaborator Author

Build

@mythrocks
Copy link
Collaborator Author

Build

@pxLi
Copy link
Collaborator

pxLi commented Dec 1, 2023

build

@pxLi
Copy link
Collaborator

pxLi commented Dec 1, 2023

re-triggered previous github payload empty issue

@mythrocks mythrocks merged commit c1a0c61 into NVIDIA:branch-23.12 Dec 1, 2023
37 checks passed
@mythrocks
Copy link
Collaborator Author

I've merged this change.
Thanks to @revans2 for the review, and @pxLi for help getting the change through CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] fastparquet compatibility tests fail with data mismatch if TZ is not set and system timezone is not UTC
3 participants