[BUG] test_delta_part_write_round_trip_unmanaged and test_delta_multi_part_write_round_trip_unmanaged fail with DATAGEN_SEED=1700105176 #9738

Closed
ttnghia opened this issue Nov 16, 2023 · 3 comments · Fixed by #9748 or #9840

ttnghia (Collaborator) commented Nov 16, 2023

These are new test failures on Databricks, with DATAGEN_SEED=1700105176:

FAILED ../../src/main/python/delta_lake_write_test.py::test_delta_part_write_round_trip_unmanaged[Float] - AssertionError: Different line counts in 00000000000000000000.json
FAILED ../../src/main/python/delta_lake_write_test.py::test_delta_part_write_round_trip_unmanaged[Double] - AssertionError: Different line counts in 00000000000000000000.json
FAILED ../../src/main/python/delta_lake_write_test.py::test_delta_multi_part_write_round_trip_unmanaged[Float] - AssertionError: Different line counts in 00000000000000000000.json
FAILED ../../src/main/python/delta_lake_write_test.py::test_delta_multi_part_write_round_trip_unmanaged[Double] - AssertionError: Different line counts in 00000000000000000000.json
@ttnghia ttnghia added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 16, 2023
@abellina abellina self-assigned this Nov 16, 2023
@jlowe jlowe self-assigned this Nov 16, 2023
jlowe (Member) commented Nov 16, 2023

The tests are failing in the delta log check because, for some weird reason, the CPU decides to create lots of very tiny files in the partitions while the GPU creates only one file per partition. In the CPU case each file contains only 1 to 3 records, which seems wrong and is likely a bug. It's pretty unusual to use floating point types as partition keys, so maybe it's not surprising to find odd behavior here.

jlowe (Member) commented Nov 22, 2023

These tests failed on Apache Spark 3.3.3 in a recent nightly build, specifically:

test_delta_part_write_round_trip_unmanaged[Float][DATAGEN_SEED=1700671951]
test_delta_multi_part_write_round_trip_unmanaged[Float][DATAGEN_SEED=1700671951]

jlowe (Member) commented Nov 22, 2023

The failure on Apache Spark 3.3.3 involves NaN as a partition value. It appears this causes the Spark CPU to create lots of tiny files, probably because when it moves to the next record it checks whether it is still in the same partition, and since NaN != NaN, it thinks it is switching partitions and thus needs to create a new file. Since we're writing batches of data rather than individual rows, we end up creating orders of magnitude fewer files.
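
For illustration, here is a minimal, self-contained sketch of the suspected behavior (not the actual Spark or plugin writer code; the object and variable names are made up for the example). A per-row "did the partition value change?" check based on != treats every NaN row as a partition switch, because NaN != NaN evaluates to true:

```scala
// Sketch of a naive row-by-row partition-boundary check misbehaving on NaN.
object NaNPartitionSketch {
  def main(args: Array[String]): Unit = {
    // Partition values for five consecutive rows; only three distinct partitions.
    val partValues = Seq(1.0f, Float.NaN, Float.NaN, Float.NaN, 2.0f)

    var started = false
    var currentPart = 0.0f
    var filesOpened = 0

    for (v <- partValues) {
      // NaN != NaN is true, so every NaN row looks like a new partition
      // and opens another output file.
      if (!started || currentPart != v) {
        filesOpened += 1
        started = true
        currentPart = v
      }
    }

    // Prints 5, even though only 3 files (one per partition) would be expected.
    println(s"files opened: $filesOpened")
  }
}
```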
