-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combine UnsignedIntegerCoder
and CFMaskCoder
#9274
Combine UnsignedIntegerCoder
and CFMaskCoder
#9274
Conversation
083c6b1
to
cae77aa
Compare
transform = partial(np.asarray, dtype=signed_dtype) | ||
data = lazy_elemwise_func(data, transform, signed_dtype) | ||
if raw_fill_value is not None: | ||
new_fill = signed_dtype.type(raw_fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block didn't use @kmuehlbauer's trick for handling overflow by using .view
. I'm wondering if I need to use that here, but no tests hit it. I've never actually seen _Unsigned == "false"
in the wild.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I change that here then I think I might be able to shrink this function and do things as "old dtype" and "new dtype" rather than signed versus unsigned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block didn't use @kmuehlbauer's trick for handling overflow by using .view. I'm wondering if I need to use that here, but no tests hit it. I've never actually seen _Unsigned == "false" in the wild.
That was requested at some point in time for a specific use case. I'll try to dig it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK looking at the tests, it looks like this case is not tested (but I'm triple checking). This set of if
s says that the data on-disk is unsigned, but _Unsigned
is false which means the user wants signed data in-memory. The tests for _Unsigned="false"
have signed data on-disk and in-memory so no casting/conversion happens.
Edit: Scratch that. test_backends
doesn't test it, but test_coding
does but never uses _FillValue
. Let's see what I can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref #4966
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I thought I was being smart and added tests for the _Unsigned: "false"
case and now things are just all weird. It made me realize I wasn't handling this case in the encoding step, but it also kind of seems like it never was handled or that I have the wrong impression of how that configuration is supposed to be handled. The tests added in that PR @kmuehlbauer don't present the _FillValue
so I've tried adding that to the backend tests but now non-NC4 backends are complaining about converting uint8 to int8. This coercion was added to solve #4014 it seems.
My assumption is that if _Unsigned: "false"
then the data is saved as uint8 and _FillValue
should be uint8. But again, I'm not sure why my new tests are even trying to get to int8. I'll do some more testing later tonight hopefully. I'm not sure if it is better to spend a ton of time getting this small functionality working "as expected" or leave it undefined/untested as it is right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my main concern and all of your handling in the decode pipeline for casting fill values turns out to not be an issue anymore as the raw fill values turn out to be numpy scalars (np.uint8
) when they get loaded from the file. Or at least they are for the NetCDF4 cases. So numpy is perfectly happy casting uint8 to int8 and back if they are already numpy scalars. If my new tests cases make sense then I think this is fine.
I've run out of time to really work on refactoring this more. If people are able to review it in its current state that would be great. I'm not sure if it is in a good enough place to be merged, but I'll let the reviewers decide. |
Thanks @djhoese for tackling this hard part of the CF conventions. This is looking good to me. Let's have another CI run now. |
Test failures are due to #9327. |
Kai, please merge if you're comfortable with this! |
Thanks @djhoese, for going down that rabbit hole. Finger's crossed 😬. |
* main: (214 commits) Adds copy parameter to __array__ for numpy 2.0 (pydata#9393) `numpy 2` compatibility in the `pydap` backend (pydata#9391) pyarrow dependency added to doc environment (pydata#9394) Extend padding functionalities (pydata#9353) refactor GroupBy internals (pydata#9389) Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274) passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377) Fix tests on big-endian systems (pydata#9380) Improve error message on `ds['x', 'y']` (pydata#9375) Improve error message for missing coordinate index (pydata#9370) Add flaky to TestNetCDF4ViaDaskData (pydata#9373) Make chunk manager an option in `set_options` (pydata#9362) Revise (pydata#9371) Remove duplicate word from docs (pydata#9367) Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243) Revise (pydata#9366) Fix rechunking to a frequency with empty bins. (pydata#9364) whats-new entry for dropping python 3.9 (pydata#9359) drop support for `python=3.9` (pydata#8937) Revise (pydata#9357) ...
* main: Adds copy parameter to __array__ for numpy 2.0 (pydata#9393) `numpy 2` compatibility in the `pydap` backend (pydata#9391) pyarrow dependency added to doc environment (pydata#9394) Extend padding functionalities (pydata#9353) refactor GroupBy internals (pydata#9389) Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274) passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377) Fix tests on big-endian systems (pydata#9380) Improve error message on `ds['x', 'y']` (pydata#9375)
See #9266 and the related issues for the detailed discussion. Bottom line is that CF
_Unsigned
handling gets weird when handling_FillValue
. The CF standard says that_FillValue
on disk should always match the array's type on disk. However, when xarray loads the data and masks it, it does this with the in-memory unsigned integer data. Currently inmain
this is handled by a combination of theUnsignedIntegerCoder
class and theCFMaskCoder
class. It unfortunately requires the "decoded" unsigned_FillValue
to be stored in the loaded variable so it can be used by the masking code. But to match CF standards this_FillValue
should remain as the on-disk signed type. In this PR I combine the two classes to avoid storing the temporary unsigned version of_FillValue
and only use it for masking.Other important changes:
_FillValue
they have passed does not match expectations for the CF standard._FillValue
values that are numpy scalar types (ex.np.uint8(255)
) are always converted to native python integers before being encoded for_Unsigned
variables by calling.item()
on them. This ensures the above serialization warning is always issued as numpy will silently cast a uint8 (ex. 255) to a int8 (ex. -1) without warning.At the time of writing this code needs lots of cleanup or at least I hope it can because it is hard to follow in my opinion...but maybe that's just the way CF handling code is.
whats-new.rst
api.rst