Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][C++] failing tests due to unfound timezones on Windows with 18.0.0-rc0 #44455

Closed
h-vetinari opened this issue Oct 17, 2024 · 10 comments
Closed

Comments

@h-vetinari
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

While testing arrow 18.0.0rc0, I'm getting a new batch of failing tests on windows

FAILED pyarrow/tests/test_compute.py::test_strftime - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'CET': oops: bad dow name: il
FAILED pyarrow/tests/test_compute.py::test_extract_datetime_components - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'UTC': oops: bad dow name: il
FAILED pyarrow/tests/test_compute.py::test_assume_timezone - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'UTC': oops: bad dow name: il
FAILED pyarrow/tests/test_compute.py::test_round_temporal[nanosecond] - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'Asia/Kolkata': oops: bad dow name: il
FAILED pyarrow/tests/test_compute.py::test_round_temporal[microsecond] - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'Asia/Kolkata': oops: bad dow name: il
FAILED pyarrow/tests/test_compute.py::test_round_temporal[millisecond] - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'Asia/Kolkata': oops: bad dow name: il
FAILED pyarrow/tests/test_compute.py::test_round_temporal[second] - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'Asia/Kolkata': oops: bad dow name: il
FAILED pyarrow/tests/test_compute.py::test_round_temporal[minute] - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'Asia/Kolkata': oops: bad dow name: il
FAILED pyarrow/tests/test_compute.py::test_round_temporal[hour] - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'Asia/Kolkata': oops: bad dow name: il
FAILED pyarrow/tests/test_compute.py::test_round_temporal[day] - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'Asia/Kolkata': oops: bad dow name: il
FAILED pyarrow/tests/test_convert_builtin.py::test_sequence_timestamp_from_int_with_unit - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'UTC': oops: bad dow name: il
FAILED pyarrow/tests/test_scalars.py::test_timestamp_scalar - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'UTC': oops: bad dow name: il
FAILED pyarrow/tests/test_scalars.py::test_cast_timestamp_to_string - pyarrow.lib.ArrowInvalid: Cannot locate timezone 'UTC': oops: bad dow name: il

The "cannot locate timezone" seems to be coming from

return Status::Invalid("Cannot locate timezone '", timezone, "': ", ex.what());

and hasn't changed recently. In particular, my understanding would be that

doesn't yet refer to C++20's specification that includes the timezones (but rather some vendored variant of https://github.com/HowardHinnant/date, c.f. 641c699, 2bf4421, #29200, etc.)

AFAIU, the change is most likely due to the tests getting a new/updated decorator:

@pytest.mark.timezone_data
which is ultimately determined by
def windows_has_tzdata():
"""
This is the default location where tz.cpp will look for (until we make
this configurable at run-time)
"""
tzdata_bool = False
if "PYARROW_TZDATA_PATH" in os.environ:
tzdata_bool = os.path.exists(os.environ['PYARROW_TZDATA_PATH'])
if not tzdata_bool:
tzdata_path = os.path.expandvars(r"%USERPROFILE%\Downloads\tzdata")
tzdata_bool = os.path.exists(tzdata_path)
return tzdata_bool

It's not exactly described what needs to be under that path, but it does seem to get forwarded to the vendored date here

arrow_vendored::date::set_install(options.timezone_db_path.value());

which will then iterate through the contents.

As such, conda-forge (which ships its own tzdata) should be able to set "PYARROW_TZDATA_PATH=%PREFIX%\share\zoneinfo" and be done with it. I'm mainly opening this issue now to document the chasing down of rabbit holes for this, which might serve someone else (and if I'm wrong, we'll be able to discuss further as well).

Component(s)

C++, Packaging

@h-vetinari
Copy link
Contributor Author

On second thought, it would be nice if windows_has_tzdata could also check for existence of %PREFIX%\share\zoneinfo, because that'll be available on all conda installs of arrow.

@h-vetinari
Copy link
Contributor Author

As such, conda-forge (which ships its own tzdata) should be able to set "PYARROW_TZDATA_PATH=%PREFIX%\share\zoneinfo" and be done with it.

Well, that didn't work unfortunately:

E   OSError: Unable to get Timezone database version from D:\bld\apache-arrow_1729139898363\_test_env\share\zoneinfo\

That's despite the required files definitely being there.

## Package Plan ##

  environment location: D:\bld\apache-arrow_1729139898363\_test_env


The following NEW packages will be INSTALLED:

    [...]
    tzdata:                  2024b-hc8b5060_0               conda-forge

@h-vetinari
Copy link
Contributor Author

I've now also tried setting PYARROW_TZDATA_PATH with forward-slashes only, and it still fails in the same way

@kou
Copy link
Member

kou commented Oct 18, 2024

I think that this is a problem in https://www.iana.org/time-zones data.

https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=1056872&view=logs&j=f4756218-2105-5c4f-5cb9-a0ac2817b29f&t=9335683b-4beb-5b59-1ea3-40427499ebdb&l=44798

---------------------------- Captured stderr call -----------------------------
Rule	Mexico	1931	only	-	April	30	0:00	1:00	D
Mexico         1931    1931    Jan/01                  00:00:00       00:00:00   
============================== warnings summary ===============================

The latest data https://data.iana.org/time-zones/releases/tzdata2024b.tar.gz includes the "April" line in northamerica. Other lines use Oct, Feb and so on (3 characters). tz.cpp assumes 3 characters month:

auto s = parse3(in);

But the upstream supports April: https://github.com/HowardHinnant/date/blob/deec054564d0950b1a70b7f54b9039e4f0fb26cf/src/tz.cpp#L664-L674

Workaround: Use old tzdata.
Solution: Update bundled datetime.

FYI: tz.cpp doesn't use tzdata (.../zoneinfo/...) from conda on Windows. It uses raw https://www.iana.org/time-zones data

@h-vetinari
Copy link
Contributor Author

FYI: tz.cpp doesn't use tzdata (.../zoneinfo/...) from conda on Windows. It uses raw https://www.iana.org/time-zones data

I don't understand the distinction. What's expected to be under PYARROW_TZDATA_PATH? We should be able to put whatever is required there.

For context, it worked for orc (even upstream, not just in conda-forge) to use a conda-packaged tzdata (if available). In conda-forge itself, we're able to point libstc++ to our tzdata as well, and will do the same for libc++ once the C++20 <chrono> implementation becomes non-experimental (because we cannot patch the static libc++experimental.a without creating attack vectors).

Solution: Update bundled datetime.

That sound like a reasonable first step. Medium term, it's also conceivable to move to C++20 facilities.

@kou kou changed the title [18.0.0rc0] BUG: failing tests due to unfound timezones on windows [Python][C++] failing tests due to unfound timezones on windows with 18.0.0-rc0 Oct 21, 2024
@kou kou changed the title [Python][C++] failing tests due to unfound timezones on windows with 18.0.0-rc0 [Python][C++] failing tests due to unfound timezones on Windows with 18.0.0-rc0 Oct 21, 2024
kou added a commit to kou/arrow that referenced this issue Oct 21, 2024
IANA tzdata changed its data format. So we need to update vendored
date to parse it.
kou added a commit to kou/arrow that referenced this issue Oct 21, 2024
IANA tzdata changed its data format. So we need to update vendored
date to parse it.
@kou
Copy link
Member

kou commented Oct 21, 2024

Could you try #44482 ?

@h-vetinari
Copy link
Contributor Author

Is in progress in conda-forge/arrow-cpp-feedstock#1432; I'll try to comment afterwards, but you can also check yourself if the non-CUDA windows builds pass in ~3-4h.

@h-vetinari
Copy link
Contributor Author

Can confirm that #44482 works. 🥳 Thank you!

@kou
Copy link
Member

kou commented Oct 23, 2024

Thanks!
I'll merge #44482.

kou added a commit that referenced this issue Oct 23, 2024
### Rationale for this change

IANA tzdata changed its data format. So we need to update vendored date to parse it.

### What changes are included in this PR?

Update vendored date to 3.0.3.

Update script is added. So all our changes are automated now.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #44455

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 19.0.0 milestone Oct 23, 2024
@kou
Copy link
Member

kou commented Oct 23, 2024

Issue resolved by pull request 44482
#44482

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants