-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
correctly encode/decode _FillValues/missing_values/dtypes for packed data #8713
Conversation
Still an issue with some example in the docs. Probably non-conforming NaN _FillValue for packed data in that one example. |
@mankoff if you have time, can you check your workload against this PR, or review it please? |
raw_fill_dict = {} | ||
[ | ||
pop_to(attrs, raw_fill_dict, attr, name=name) | ||
for attr in ("missing_value", "_FillValue") | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pops missing_values
and/or _FillValue
into temporary raw_fill_dict
.
for k in list(raw_fill_dict): | ||
v = raw_fill_dict[k] | ||
kfill = {fv for fv in np.ravel(v) if not pd.isnull(fv)} | ||
if not kfill and np.issubdtype(dtype, np.integer): | ||
warnings.warn( | ||
f"variable {name!r} has non-conforming {k!r} " | ||
f"{v!r} defined, dropping {k!r} entirely.", | ||
SerializationWarning, | ||
stacklevel=3, | ||
) | ||
del raw_fill_dict[k] | ||
else: | ||
encoded_fill_values |= kfill |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterate over the (two) possible keys and extract the provided fill values. If the extracted fill values are empty (due to filtering with pd.isnull
) a warning is issued for integer type data and the according key is deleted from the dict. This prevents from moving the nonconforming fill value into encoding.
if len(encoded_fill_values) > 1: | ||
warnings.warn( | ||
f"variable {name!r} has multiple fill values " | ||
f"{encoded_fill_values} defined, decoding all values to NaN.", | ||
SerializationWarning, | ||
stacklevel=3, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have multiple fill values after the procedure a warning is issued.
if raw_fill_dict: | ||
dims, data, attrs, encoding = unpack_for_decoding(variable) | ||
[ | ||
safe_setitem(encoding, attr, value, name=name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move remaining key, value pairs into encoding.
@@ -348,20 +375,51 @@ def _scale_offset_decoding(data, scale_factor, add_offset, dtype: np.typing.DTyp | |||
return data | |||
|
|||
|
|||
def _choose_float_dtype(dtype: np.dtype, has_offset: bool) -> type[np.floating[Any]]: | |||
def _choose_float_dtype( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function checks for the most appropriate dtype to use when encoding/decoding in CFScaleOffsetCoder
.
It would be great if Ken could have a look here. At least I tried to follow the changes to CF which followed after the discussion in cf-convention/cf-conventions#374. |
On 2024-02-07 at 04:54 +13, Deepak Cherian ***@***.***> wrote...
@mankoff if you have time, can you check your workload against this
PR, or review it please?
I'm reading this via slow satellite in Antarctica. I'm 'offline'(ish) for another week, then have 2 months of emails and work to catch up on. Reviewing this PR will be low priority.
|
Thanks Ken, for letting us know. Greetings to Antarctica, hope time is good there! I'll try to ping some folks from the linked issues to get more input here. |
@kmuehlbauer shall we merge? A number of numpy warnings (and presumably numpy 2 failures) are from this type of dtype maniputation. 🤞🏾 this PR fixes them all! :) |
should silence this warning |
@dcherian I'm good to merge this, we still can iterate later if complaints are coming in. |
@dcherian There are still some warnings which could be fixed/silenced with this PR. All try to get behind it now. Update: so some warnings (cast RuntimeWarnings) are not directly connected to this PR but to some ill-defined test setup (mixing int/uint) with resulting casting issues. Best resolved in separate PR. |
Co-authored-by: Deepak Cherian <[email protected]>
for more information, see https://pre-commit.ci
@@ -63,7 +63,13 @@ def test_decode_cf_with_conflicting_fill_missing_value() -> None: | |||
np.arange(10), | |||
{"units": "foobar", "missing_value": np.nan, "_FillValue": np.nan}, | |||
) | |||
actual = conventions.decode_cf_variable("t", var) | |||
|
|||
# the following code issues two warnings, so we need to check for both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcherian I've tried to make this way shorter. Using for-loop looks ugly, though.
LGTM. Thanks! |
* main: (31 commits) correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713) Expand use of `.oindex` and `.vindex` (pydata#8790) Return a dataclass from Grouper.factorize (pydata#8777) [skip-ci] Fix upstream-dev env (pydata#8839) Add dask-expr for windows envs (pydata#8837) [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835) Add `dask-expr` to environment-3.12.yml (pydata#8827) Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736) Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784) try to get the `upstream-dev` CI to complete again (pydata#8823) Bump the actions group with 1 update (pydata#8818) Update documentation for clarity (pydata#8817) DOC: link to zarr.convenience.consolidate_metadata (pydata#8816) Refactor Grouper objects (pydata#8776) Grouper object design doc (pydata#8510) Bump the actions group with 2 updates (pydata#8804) tokenize() should ignore difference between None and {} attrs (pydata#8797) fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791) Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782) Migrate treenode module. (pydata#8757) ...
* main: (42 commits) correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713) Expand use of `.oindex` and `.vindex` (pydata#8790) Return a dataclass from Grouper.factorize (pydata#8777) [skip-ci] Fix upstream-dev env (pydata#8839) Add dask-expr for windows envs (pydata#8837) [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835) Add `dask-expr` to environment-3.12.yml (pydata#8827) Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736) Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784) try to get the `upstream-dev` CI to complete again (pydata#8823) Bump the actions group with 1 update (pydata#8818) Update documentation for clarity (pydata#8817) DOC: link to zarr.convenience.consolidate_metadata (pydata#8816) Refactor Grouper objects (pydata#8776) Grouper object design doc (pydata#8510) Bump the actions group with 2 updates (pydata#8804) tokenize() should ignore difference between None and {} attrs (pydata#8797) fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791) Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782) Migrate treenode module. (pydata#8757) ...
* upstream/main: (765 commits) increase typing annotations coverage in `xarray/core/indexing.py` (pydata#8857) pandas 3 MultiIndex fixes (pydata#8847) FIX: adapt handling of copy keyword argument in scipy backend for numpy >= 2.0dev (pydata#8851) FIX: do not cast _FillValue/missing_value in CFMaskCoder if _Unsigned is provided (pydata#8852) Implement setitem syntax for `.oindex` and `.vindex` properties (pydata#8845) Support pandas copy-on-write behaviour (pydata#8846) correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713) Expand use of `.oindex` and `.vindex` (pydata#8790) Return a dataclass from Grouper.factorize (pydata#8777) [skip-ci] Fix upstream-dev env (pydata#8839) Add dask-expr for windows envs (pydata#8837) [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835) Add `dask-expr` to environment-3.12.yml (pydata#8827) Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736) Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784) try to get the `upstream-dev` CI to complete again (pydata#8823) Bump the actions group with 1 update (pydata#8818) Update documentation for clarity (pydata#8817) DOC: link to zarr.convenience.consolidate_metadata (pydata#8816) Refactor Grouper objects (pydata#8776) ...
see reported issue pydata/xarray#7691 and pr pydata/xarray#8713 which was included into xarray v2024.03.0
* fix: Skip previous encoding workaround for fixed xarray versions see reported issue pydata/xarray#7691 and pr pydata/xarray#8713 which was included into xarray v2024.03.0 * Replace workaround by a new xarray lower bound --------- Co-authored-by: Jonas Hoersch <[email protected]>
nan
values appearing when saving and loading fromnetCDF
due to encoding #7691whats-new.rst
This resurrects some of #7654. It takes special care of correctly handling dtypes when encoding/decoding packed data