
[BUG] mismatching timezone settings on executor and driver can cause ORC read data corruption #3970

Closed
tgravescs opened this issue Oct 29, 2021 · 1 comment · Fixed by #4129
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@tgravescs
Collaborator

Describe the bug
A user reported orc_test failures when running in their environment. The diffs were in the hour fields of the timestamp values.

One of them was:
FAILED src/main/python/orc_test.py::test_basic_read[{'spark.rapids.sql.format.orc.reader.type': 'PERFILE'}-native--read_orc_df-timestamp-date-test.orc]
along with many similar failures.

After some debugging, it turned out they had only set:
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC

and were missing:
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC

The host timezone was: Time zone: America/New_York (EDT, -0400)

So I think the planning on the driver passed the timezone check because the driver was in UTC, but the executors were not in UTC, so the data returned by the GPU wasn't the same as what the CPU generated.

Perhaps we can add more validation on the executor side to make sure the timezone is UTC; at a minimum, throw an exception so the job fails rather than corrupting data.
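To illustrate the kind of fail-fast guard suggested above: a minimal, framework-agnostic sketch in Python (spark-rapids itself is Scala, where the analogous check would compare TimeZone.getDefault().getID() against "UTC" at executor startup). The function name and structure here are hypothetical, not actual spark-rapids code.

```python
def check_executor_timezone(executor_tz: str, required: str = "UTC") -> None:
    """Fail fast if the executor's default timezone does not match the
    required one, instead of silently producing shifted timestamps.

    executor_tz: the executor process's default timezone ID
                 (on a JVM, TimeZone.getDefault().getID()).
    required:    the timezone the plugin was planned against (UTC).
    """
    if executor_tz != required:
        raise RuntimeError(
            f"Executor default timezone is {executor_tz!r} but {required!r} "
            "is required; failing rather than risking corrupt timestamp data. "
            "Set -Duser.timezone=UTC in spark.executor.extraJavaOptions."
        )

# A driver in UTC with an executor in America/New_York would trip the guard:
check_executor_timezone("UTC")  # passes silently
try:
    check_executor_timezone("America/New_York")
except RuntimeError as e:
    print(f"rejected: {e}")
```

The point of raising at startup (rather than at read time) is that the mismatch is a cluster-configuration problem, so every task would produce wrong data; failing early surfaces it before any results are written.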

Note that I haven't tried to reproduce this yet. Once the user set all the timezone settings properly, the tests started to pass.

@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 29, 2021
@tgravescs tgravescs changed the title [BUG] mismatching UTC settings on executor and driver can cause ORC read data corruption [BUG] mismatching timezone settings on executor and driver can cause ORC read data corruption Oct 29, 2021
@Salonijain27 Salonijain27 added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels Nov 2, 2021
@nartal1
Collaborator

nartal1 commented Nov 12, 2021

I was able to reproduce this bug on YARN cluster by passing --conf spark.driver.extraJavaOptions=-Duser.timezone=America/New_York to the tests.

After looking further into the results, the mismatch between the GPU and CPU output is that the CPU reads the timestamps in the timezone provided, i.e. America/New_York (EDT), while the GPU reads them in UTC.

I thought it would have failed with an unsupported-data-type error (since spark-rapids only supports UTC), but that is not the case: the GPU silently reads the timestamps in UTC.
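The observed CPU/GPU divergence can be reproduced outside Spark with a quick sketch: rendering the same instant in the two session timezones involved shifts the wall-clock hours by the EDT offset (UTC-4 on the issue's date), which matches the "diffs in the hours" symptom described above. This uses only the Python standard library (zoneinfo, Python 3.9+, assuming the system tz database is available).

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One fixed instant, around the date this issue was filed.
instant = datetime(2021, 10, 29, 12, 0, 0, tzinfo=timezone.utc)

# The "GPU" side renders in UTC; the "CPU" side renders in the
# session timezone the user actually had configured.
utc_wall = instant.strftime("%Y-%m-%d %H:%M")
edt_wall = instant.astimezone(ZoneInfo("America/New_York")).strftime("%Y-%m-%d %H:%M")

print(utc_wall)  # 2021-10-29 12:00
print(edt_wall)  # 2021-10-29 08:00 (EDT is UTC-4 at this date)
```

The underlying bytes in the ORC file are the same instant; only the rendering timezone differs, which is why the corruption shows up as an hour shift rather than garbage values.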
