correctly encode/decode _FillValues/missing_values/dtypes for packed data #8713

kmuehlbauer · 2024-02-06T08:51:47Z

Closes nan values appearing when saving and loading from netCDF due to encoding #7691
Closes Decoding netCDF is giving incorrect values for a large file #5597
Closes float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray #2304
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

This resurrects some of #7654. It takes special care to handle dtypes correctly when encoding/decoding packed data.

kmuehlbauer · 2024-02-06T15:34:19Z

There is still an issue with one example in the docs, probably a non-conforming NaN _FillValue for packed data in that example.

dcherian · 2024-02-06T15:48:30Z

@mankoff if you have time, can you check your workload against this PR, or review it please?

kmuehlbauer · 2024-02-07T12:26:02Z

xarray/coding/variables.py

+    raw_fill_dict = {}
+    [
+        pop_to(attrs, raw_fill_dict, attr, name=name)
+        for attr in ("missing_value", "_FillValue")
+    ]


This pops missing_value and/or _FillValue from attrs into a temporary raw_fill_dict.
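For readers without the xarray internals at hand, the effect of that loop can be sketched with plain dicts. This is only an illustration: `pop_fill_attrs` is a hypothetical stand-in, not xarray's `pop_to`, and it leaves out whatever extra checks the real helper performs.

```python
def pop_fill_attrs(attrs):
    """Split fill-value attributes off a copy of ``attrs``.

    Hypothetical stand-in for the ``pop_to`` loop above, not xarray's helper.
    """
    remaining = dict(attrs)  # copy so the caller's dict is left untouched
    raw_fill_dict = {}
    for attr in ("missing_value", "_FillValue"):
        if attr in remaining:
            raw_fill_dict[attr] = remaining.pop(attr)
    return remaining, raw_fill_dict


remaining, raw_fill_dict = pop_fill_attrs(
    {"units": "K", "scale_factor": 0.01, "_FillValue": -9999}
)
print(remaining)      # {'units': 'K', 'scale_factor': 0.01}
print(raw_fill_dict)  # {'_FillValue': -9999}
```

Only the attributes that are actually present end up in the temporary dict, so a variable with neither attribute yields an empty `raw_fill_dict`.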

kmuehlbauer · 2024-02-07T12:29:36Z

xarray/coding/variables.py

+    for k in list(raw_fill_dict):
+        v = raw_fill_dict[k]
+        kfill = {fv for fv in np.ravel(v) if not pd.isnull(fv)}
+        if not kfill and np.issubdtype(dtype, np.integer):
+            warnings.warn(
+                f"variable {name!r} has non-conforming {k!r} "
+                f"{v!r} defined, dropping {k!r} entirely.",
+                SerializationWarning,
+                stacklevel=3,
+            )
+            del raw_fill_dict[k]
+        else:
+            encoded_fill_values |= kfill


Iterate over the (two) possible keys and extract the provided fill values. If no fill values remain after filtering with pd.isnull, a warning is issued for integer-typed data and the corresponding key is deleted from the dict. This prevents the non-conforming fill value from being moved into encoding.
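A standard-library sketch of the same filtering idea, for illustration only. It assumes scalar or flat-list fill values, uses `math.isnan` in place of `pd.isnull`, and emits a plain `UserWarning` instead of `SerializationWarning`; `collect_fill_values` is a hypothetical name.

```python
import math
import warnings


def collect_fill_values(raw_fill_dict, is_integer_data, name="var"):
    """Return (kept_attrs, fill_values) after dropping NaN-only fill attributes.

    Hypothetical helper mirroring the loop above with stdlib tools only.
    """
    kept = {}
    fill_values = set()
    for key, value in raw_fill_dict.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        # keep only values that are not NaN
        non_null = {
            v for v in values if not (isinstance(v, float) and math.isnan(v))
        }
        if not non_null and is_integer_data:
            # NaN cannot be a fill value for integer data: warn and drop the key
            warnings.warn(
                f"variable {name!r} has non-conforming {key!r} "
                f"{value!r} defined, dropping {key!r} entirely.",
                UserWarning,
                stacklevel=2,
            )
            continue
        kept[key] = value
        fill_values |= non_null
    return kept, fill_values


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept, fills = collect_fill_values(
        {"missing_value": float("nan"), "_FillValue": -9999},
        is_integer_data=True,
    )
```

Here the NaN `missing_value` is dropped with a warning, while the integer `_FillValue` survives and is the only collected fill value.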

kmuehlbauer · 2024-02-07T12:31:04Z

xarray/coding/variables.py

+        if len(encoded_fill_values) > 1:
+            warnings.warn(
+                f"variable {name!r} has multiple fill values "
+                f"{encoded_fill_values} defined, decoding all values to NaN.",
+                SerializationWarning,
+                stacklevel=3,
+            )


If more than one fill value remains after this procedure, a warning is issued.
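To make the consequence concrete, here is a small list-based sketch of "decoding all values to NaN" when several fill values are defined. It is a simplified illustration, not the array-based masking in xarray; `mask_fill_values` is a hypothetical name and the warning class is a plain `UserWarning`.

```python
import math
import warnings


def mask_fill_values(data, fill_values, name="var"):
    """Replace every element equal to any fill value with NaN (hypothetical helper)."""
    if len(fill_values) > 1:
        warnings.warn(
            f"variable {name!r} has multiple fill values "
            f"{fill_values} defined, decoding all values to NaN.",
            UserWarning,
            stacklevel=2,
        )
    return [math.nan if x in fill_values else float(x) for x in data]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    decoded = mask_fill_values([1, -9999, 3, -1], {-9999, -1}, name="t")
```

Both `-9999` and `-1` become NaN in `decoded`, and exactly one warning is recorded for the variable.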

kmuehlbauer · 2024-02-07T12:32:12Z

xarray/coding/variables.py

+        if raw_fill_dict:
+            dims, data, attrs, encoding = unpack_for_decoding(variable)
+            [
+                safe_setitem(encoding, attr, value, name=name)


Move the remaining key/value pairs into encoding.
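A rough sketch of that last step with plain dicts. It assumes the setter refuses to overwrite an existing key, which this diff excerpt does not show; `set_without_overwrite` is a hypothetical stand-in and not xarray's `safe_setitem`.

```python
def set_without_overwrite(dest, key, value, name=None):
    """Hypothetical stand-in for safe_setitem: refuse to overwrite an existing key."""
    if key in dest:
        where = f" on variable {name!r}" if name else ""
        raise ValueError(f"refusing to overwrite existing key {key!r}{where}")
    dest[key] = value


encoding = {"dtype": "int16"}
raw_fill_dict = {"_FillValue": -9999}  # whatever survived the filtering step
for attr, value in raw_fill_dict.items():
    set_without_overwrite(encoding, attr, value, name="t")

print(encoding)  # {'dtype': 'int16', '_FillValue': -9999}
```

Because non-conforming keys were already deleted from `raw_fill_dict`, only conforming fill values reach `encoding` and get written back on the next save.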

kmuehlbauer · 2024-02-07T12:36:32Z

xarray/coding/variables.py

@@ -348,20 +375,51 @@ def _scale_offset_decoding(data, scale_factor, add_offset, dtype: np.typing.DTyp
    return data


-def _choose_float_dtype(dtype: np.dtype, has_offset: bool) -> type[np.floating[Any]]:
+def _choose_float_dtype(


This function checks for the most appropriate dtype to use when encoding/decoding in CFScaleOffsetCoder.
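As background on why the float dtype matters for packed data: float32 has a 24-bit significand, so it represents every int16 value exactly but not every int32 value. The sketch below shows this with the standard library only, using `struct` to round through float32. It illustrates the general precision issue and is not the selection logic of `_choose_float_dtype`, which is not shown in this excerpt; `to_float32` is a hypothetical helper.

```python
import struct


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


int16_max = 2**15 - 1   # largest int16 value
big_int32 = 2**24 + 1   # smallest positive integer float32 cannot represent

int16_roundtrip = to_float32(float(int16_max))
int32_roundtrip = to_float32(float(big_int32))

print(int16_roundtrip == int16_max)   # True: exact in float32
print(int32_roundtrip == big_int32)   # False: precision lost in float32
print(float(big_int32) == big_int32)  # True: exact in float64
```

This is the kind of loss that shows up as slightly wrong decoded values when 32-bit integers are unpacked into float32 instead of float64.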

kmuehlbauer · 2024-02-07T21:10:41Z

> @mankoff if you have time, can you check your workload against this PR, or review it please?

It would be great if Ken could have a look here. I tried to follow the changes to CF that came out of the discussion in cf-convention/cf-conventions#374.

mankoff · 2024-02-07T21:19:52Z

On 2024-02-07 at 04:54 +13, Deepak Cherian ***@***.***> wrote...

> @mankoff if you have time, can you check your workload against this PR, or review it please?

I'm reading this via slow satellite in Antarctica. I'm 'offline'(ish) for another week, then have 2 months of emails and work to catch up on. Reviewing this PR will be low priority.

kmuehlbauer · 2024-02-07T21:24:10Z

> I'm reading this via slow satellite in Antarctica. I'm 'offline'(ish) for another week, then have 2 months of emails and work to catch up on. Reviewing this PR will be low priority.

Thanks Ken, for letting us know. Greetings to Antarctica, hope time is good there! I'll try to ping some folks from the linked issues to get more input here.

dcherian · 2024-03-15T04:55:36Z

@kmuehlbauer shall we merge? A number of numpy warnings (and presumably numpy 2 failures) come from this type of dtype manipulation. 🤞🏾 this PR fixes them all! :)

dcherian · 2024-03-15T05:02:45Z

xarray/tests/test_conventions.py::test_decode_cf_with_conflicting_fill_missing_value
  /home/runner/work/xarray/xarray/xarray/conventions.py:286: SerializationWarning: variable 't' has non-conforming '_FillValue' nan defined, dropping '_FillValue' entirely.
    var = coder.decode(var, name=name)

should silence this warning

kmuehlbauer · 2024-03-15T06:07:30Z

@dcherian I'm fine with merging this; we can still iterate later if complaints come in.

kmuehlbauer · 2024-03-15T06:36:07Z

@dcherian There are still some warnings that could be fixed or silenced in this PR. I'll try to get to the bottom of them now.

Update: some warnings (cast RuntimeWarnings) are not directly connected to this PR but to an ill-defined test setup (mixing int/uint) that results in casting issues. These are best resolved in a separate PR.

kmuehlbauer · 2024-03-15T16:13:05Z

xarray/tests/test_conventions.py

@@ -63,7 +63,13 @@ def test_decode_cf_with_conflicting_fill_missing_value() -> None:
        np.arange(10),
        {"units": "foobar", "missing_value": np.nan, "_FillValue": np.nan},
    )
-    actual = conventions.decode_cf_variable("t", var)
+
+    # the following code issues two warnings, so we need to check for both


@dcherian I've tried to make this much shorter. Using a for-loop looks ugly, though.
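For reference, one way to check for both warnings in a single block, sketched with the standard library rather than pytest. Everything here is a stand-in: the local `SerializationWarning` class and `decode_with_nan_fill` are hypothetical and only mimic a decoder that warns once per non-conforming attribute.

```python
import warnings


class SerializationWarning(RuntimeWarning):
    """Local stand-in for xarray's warning class, so the sketch is self-contained."""


def decode_with_nan_fill(name):
    """Hypothetical decoder that warns once per non-conforming attribute."""
    for key in ("missing_value", "_FillValue"):
        warnings.warn(
            f"variable {name!r} has non-conforming {key!r} "
            f"nan defined, dropping {key!r} entirely.",
            SerializationWarning,
            stacklevel=2,
        )


# record every warning instead of letting it reach the test output
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    decode_with_nan_fill("t")

messages = [str(w.message) for w in caught]
```

Inspecting `messages` afterwards avoids a loop inside the `with` block: the first entry mentions `missing_value` and the second `_FillValue`.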

dcherian · 2024-03-15T16:14:12Z

LGTM. Thanks!

see reported issue pydata/xarray#7691 and pr pydata/xarray#8713 which was included into xarray v2024.03.0

correctly encode/decode _FillValues

477cd58

kmuehlbauer mentioned this pull request Feb 6, 2024

nan values appearing when saving and loading from netCDF due to encoding #7691

Closed

4 tasks

kmuehlbauer added 2 commits February 6, 2024 10:05

fix mypy

792e942

fix CFMaskCode test

f743fd1

kmuehlbauer mentioned this pull request Feb 6, 2024

Decoding netCDF is giving incorrect values for a large file #5597

Closed

fix scale/offset

551719a

kmuehlbauer mentioned this pull request Feb 6, 2024

float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray #2304

Closed

kmuehlbauer added 2 commits February 6, 2024 13:49

avert zarr issue

85730e2

add whats-new.rst entry

5fe874f

dcherian reviewed Feb 6, 2024

xarray/coding/variables.py

dcherian reviewed Feb 6, 2024

xarray/tests/test_backends.py (outdated)

xarray/coding/variables.py

kmuehlbauer added 4 commits February 7, 2024 12:23

refactor fillvalue/missing value check to catch non-conforming values, apply review suggestions

f848872

fix typing

672e84e

suppress warning in doc

bff3a5d

Merge branch 'main' into fix-7691

45b5b8c

kmuehlbauer commented Feb 7, 2024

JoerivanEngelen mentioned this pull request Feb 7, 2024

nan fillvalue attributes written by xarray Deltares/xugrid#176

Open

Merge branch 'main' into fix-7691

ed79eb7

Merge branch 'main' into fix-7691

b89ce93

kmuehlbauer added 2 commits March 15, 2024 08:20

FIX: silence SerializationWarnings

f90e7e0

Merge branch 'main' into fix-7691

554a61e

FIX: silence mypy by casting to string early

da2222e

dcherian reviewed Mar 15, 2024

xarray/tests/test_conventions.py (outdated)

dcherian reviewed Mar 15, 2024

xarray/tests/test_conventions.py (outdated)

dcherian reviewed Mar 15, 2024

xarray/tests/test_conventions.py (outdated)

kmuehlbauer and others added 4 commits March 15, 2024 16:50

Update xarray/tests/test_conventions.py

f6fe9cb

Co-authored-by: Deepak Cherian

[pre-commit.ci] auto fixes from pre-commit.com hooks

6b30298

for more information, see https://pre-commit.ci

Shorten test, add comment checking for two warnings

1d1b1a0

make test even shorter

30985fa

kmuehlbauer commented Mar 15, 2024

dcherian enabled auto-merge (squash) March 15, 2024 16:14

dcherian merged commit fbcac76 into pydata:main Mar 15, 2024
27 of 29 checks passed

pont-us mentioned this pull request Apr 2, 2024

Some unit tests failing with xarray 2024.3.0 xcube-dev/xcube#958

Closed

thabbott mentioned this pull request Apr 5, 2024

Update CoCiP-grid data access patterns to reduce duplicate chunk downloads contrailcirrus/pycontrails#171

Merged

2 tasks

coroa pushed a commit to coroa/atlite that referenced this pull request Nov 1, 2024

fix: Skip previous encoding workaround for fixed xarray versions

9bd3014

see reported issue pydata/xarray#7691 and pr pydata/xarray#8713 which was included into xarray v2024.03.0

coroa mentioned this pull request Nov 1, 2024

fix: Skip previous encoding workaround for fixed xarray versions PyPSA/atlite#401

Merged

5 tasks
