Extend padding functionalities #9353

tsanona · 2024-08-13T18:56:45Z

While working on a project I realized that in the dataset padding function one can set different constant_values for each dimension but not for each variable. I've added some simple code to allow this, one can now also define padding differently for each data variable. I also extended the tests to include these examples.

Let me know if I can do any changes to better fit your standards.

Tests added

…values

… tests

welcome · 2024-08-13T18:56:48Z

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

…adding-functionalities

max-sixty

Thanks @tsanona , this is good. I added some comments to try and make it simpler to understand, but it's good functionality!

max-sixty · 2024-08-13T21:44:25Z

xarray/core/dataset.py

+            | Mapping[
+                Any, float | tuple[float, float] | Mapping[Any, tuple[float, float]]
+            ]
+            | None


This is a pretty gnarly type signature. Same with the docstring. But I don't think there's much we can do to simplify it.

We possibly could make float | tuple[float, float] | Mapping[Any, tuple[float, float]] into an alias?

Yeah that's true, I'll do that.

So, I changed it slightly. I created these aliases:

T_PadConstantValues = float | tuple[float, float] T_VarPadConstantValues = T_PadConstantValues | Mapping[Any, T_PadConstantValues] T_DatasetPadConstantValues = T_VarPadConstantValues | Mapping[Any, T_VarPadConstantValues]

This way it is a bit more organized. Also changed the implementation of pad for variables so that it can make use of Mapping[Any, float] to keep consistency with the pad function for datasets.

What do you think?
(Also let me know if the location in types.py and import in dataset.py is correct. I don't fully understand when to import types under the if TYPE_CHECKING: block or above in dataset.py)

Looks good, thanks!

(others may have refinements...)

max-sixty · 2024-08-13T21:44:52Z

xarray/core/dataset.py

-            Used in 'constant'.  The values to set the padded values for each
-            axis.
+        constant_values : scalar, tuple, mapping of hashable to tuple or
+            mapping of hashable to mapping of hashable to tuple, default: 0


Not necessarily on this PR but do you happen to know why the default is listed as 0 here but is None in the func signature?

Hum, what happened there, but it is a good point since if left at None the arrays are padded with np.nan. Should I change the default to nan or None?

Ah you mean changing the default and not the docstring?

Ah sorry, so what's the actual value that goes onto the dataset, by default? I had originally thought it was 0, but does None / NaN actually get placed on the dataset? (I realize it's listed as the default in the signature)

I also wasn't sure when you first asked so I created a test for the default state and it does indeed look like it populates the padding with np.nan when constant_values is left at None. So we it is confusing to have default: 0 in the doc string. We can either change the doc string to None or NaN or change the default in the function signature to 0 rather then None. I'm not sure which would be better, what do you think?

Please change the docstrings, otherwise that's a breaking change!

Ah of course, will change it to default: None in the docstring.
I'm also changing it for end_values which seems to me to have the same problem.
I also just realized it must have been copied from the docstring of np.pad since it looks quite similar and the actual default is 0.

Great, thank you!

xarray/core/dataset.py

…taset pad func. enforce number values to be converted to tuples for all in `_pad_options_dim_to_index`. make variable pad funtion consistent with dataset. extend tests

headtr1ck

Very useful, thanks. I needed that already a few times and ended up writing my own wrapper.

headtr1ck · 2024-08-15T13:11:09Z

xarray/core/dataset.py

@@ -9293,6 +9294,12 @@ def pad(
            if not pad_dims.intersection(xindexes.get_all_dims(k)):
                indexes[k] = idx

+        per_data_var_constant_values = {}
+        if isinstance(constant_values, dict):


The typing claims that any mapping works, but here you are only checking dicts.
I would propose to change to utils.is_dict_like and find an alternative to pop (Mapping does not define pop because it is read-only). The simplest way would be to transform it into a dict first.

That's a fair point, to be honest I just followed the way it is done in the implementation of pad for the variables in variable.py. Maybe I should just change all these type hints to dict rather than Mapping. What do you think?

From a typing perspective dict is always problematic because it is invariant.
I would try to stick with Mapping instead.

Ah okay, so just I'll just change this one isinstance(constant_values, dict) to utils.is_dict_like(constant_values). Thanks for the clarifications!

Yes and no.
For the check it should suffice, but the general Mapping class has no pop defined because it is read-only (non-mutable).
Anyway, I guess the user would find it weird that this input is changed in-place, so remove the pop method and replace it with a get item call (not sure if the Mapping ABC defines a get method).

I wonder if the check needs to be more thorough? It is totally possible to create Dataset objects with a data variable that has the same name as a dimension:

xr.Dataset({"x": ("x", [1, 2])}).drop_indexes("x").reset_coords("x")

so we could potentially break the original use case if we remove all data variables from the dictionary, because we could also remove a dimension name.

In general, I believe we need to be thorough in defining the new feature:

should we also allow padding coordinate variables?

if we use it, should it still be possible to specify blanket dimension padding? I.e. should it be possible to mix the two? If so, how do we figure out which is which? Check if the value is a dict and decide it's the padding for a specific variable?

what should happen if we specify padding values only for a subset of the variables? Dimensions need to have the same size for all variables in a Dataset, so either we need to specify a fallback (by dimension names), or we need to require specifying the padding for all affected variables (and raise otherwise).

Yes and no. For the check it should suffice, but the general Mapping class has no pop defined because it is read-only (non-mutable). Anyway, I guess the user would find it weird that this input is changed in-place, so remove the pop method and replace it with a get item call (not sure if the Mapping ABC defines a get method).

That's a very fair point, always forget that about dicts 🙃 , I'll figure out a different way to do it.

Alrighty, changed, now the input constant_values is not mutated in place.

I wonder if the check needs to be more thorough? It is totally possible to create Dataset objects with a data variable that has the same name as a dimension:

xr.Dataset({"x": ("x", [1, 2])}).drop_indexes("x").reset_coords("x")

so we could potentially break the original use case if we remove all data variables from the dictionary, because we could also remove a dimension name.

In general, I believe we need to be thorough in defining the new feature:

should we also allow padding coordinate variables?

if we use it, should it still be possible to specify blanket dimension padding? I.e. should it be possible to mix the two? If so, how do we figure out which is which? Check if the value is a dict and decide it's the padding for a specific variable?

what should happen if we specify padding values only for a subset of the variables? Dimensions need to have the same size for all variables in a Dataset, so either we need to specify a fallback (by dimension names), or we need to require specifying the padding for all affected variables (and raise otherwise).

It is a fair point, I have rewritten this section so the original constant_variables mapping is not mutated.
Regarding the bullet points:

As I understand it was already possible to pad coordinate variables.

As it is codded now if a data var has a dim with the same name and the user sets a value with this name in the constant_values then the data var will be padded with those values (regardless of the dim being padded) and all other data vars that also contain that dim will be padded with that value, provided that it is the dim being padded (a.k.a. the dim provided in pad_width). In the tests I also show that one can mix data var and dim in the same dict. In that case the value for the data var has priority over the value of the dim.

I realized that it was already implemented that if one didn't set values for all dims that are being padded then they are be padded with 0, so I kept that. For example, if one sets a constant_value for var1 and is padding a dim that is contained in in other vars, var1 will be padded with that value along the dim while all other vars will be padded with 0 along the same dim.

I hope I didn't make it too confusing. Let me know if it answers your questions and if the logic makes sense :D

headtr1ck · 2024-08-15T13:13:22Z

xarray/tests/test_dataset.py

@@ -6689,17 +6689,45 @@ def test_polyfit_warnings(self) -> None:
            ds.var1.polyfit("dim2", 10, full=True)
            assert len(ws) == 1

-    def test_pad(self) -> None:
+    @pytest.mark.parametrize(


Does this work if you want to pad along a dimension coordinate (aka. a variable that is called the same as it's dimension)?

Good question, I'll investigate :D

So, as I understand it most dims in the test dataset are dimension coordinates and they pad correctly, so I think so. In any case I've extended the tests to pad all dimensions just to be sure nothing is behaving incorrectly. Let me know if I missed any case.

… dataset pad function

max-sixty · 2024-08-20T19:46:52Z

Any other changes pending or is this ready to go?

(thanks a lot for the work @tsanona !)

headtr1ck · 2024-08-20T21:55:34Z

xarray/core/dataset.py

+                            k: v for k, v in constant_values.items() if k in var.dims
+                        }
+                    else:
+                        filtered_constant_values = 0


Suggested change

filtered_constant_values = 0

filtered_constant_values = None

If I'm not mistaken

Probably a test should cover this as well?

If we change that 0 to None then there is a discrepancy between what happens if we don't set all dim in constant_variables (this defaults to padding with 0 and is coded in the pad function from variable.py) and if we don't set all var (code I added in pad function from dataset.py which will then pad dimensions with np.nan).
I can either change every where which I think might be a breaking change (?) or just leave everything as is.

I think this would be worth changing but doesn't need to be in this PR (sorry if I slowed things down by introducing this). Assuming it isn't trivial, adding a TODO with a link to this would be very sufficient

No worries, added the TODO 🚀
Lemme know if there's anything else left to do and thanks for all the comments!

…values

max-sixty · 2024-08-21T17:45:00Z

If we can add a whatsnew entry, that would be great!

welcome · 2024-08-21T20:47:37Z

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

max-sixty · 2024-08-21T20:47:45Z

Thank you very much @tsanona !

* main: Extend padding functionalities (pydata#9353)

* main: (214 commits) Adds copy parameter to __array__ for numpy 2.0 (pydata#9393) `numpy 2` compatibility in the `pydap` backend (pydata#9391) pyarrow dependency added to doc environment (pydata#9394) Extend padding functionalities (pydata#9353) refactor GroupBy internals (pydata#9389) Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274) passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377) Fix tests on big-endian systems (pydata#9380) Improve error message on `ds['x', 'y']` (pydata#9375) Improve error message for missing coordinate index (pydata#9370) Add flaky to TestNetCDF4ViaDaskData (pydata#9373) Make chunk manager an option in `set_options` (pydata#9362) Revise (pydata#9371) Remove duplicate word from docs (pydata#9367) Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243) Revise (pydata#9366) Fix rechunking to a frequency with empty bins. (pydata#9364) whats-new entry for dropping python 3.9 (pydata#9359) drop support for `python=3.9` (pydata#8937) Revise (pydata#9357) ...

* main: Adds copy parameter to __array__ for numpy 2.0 (pydata#9393) `numpy 2` compatibility in the `pydap` backend (pydata#9391) pyarrow dependency added to doc environment (pydata#9394) Extend padding functionalities (pydata#9353) refactor GroupBy internals (pydata#9389) Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274) passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377) Fix tests on big-endian systems (pydata#9380) Improve error message on `ds['x', 'y']` (pydata#9375)

tiago added 2 commits May 23, 2024 02:32

add functionality to pad dataset data variables with unique constant …

7998afb

…values

clean up implementation of variable specific padding for dataset. Add…

44e7d26

… tests

Merge commit '562015ce7063a634c2072f9c84a724fecf18aff0' into extend-p…

fd3e304

…adding-functionalities

max-sixty approved these changes Aug 13, 2024

View reviewed changes

tsanona and others added 2 commits August 14, 2024 02:30

Merge branch 'pydata:main' into extend-padding-functionalities

9f66ee2

Merge branch 'main' into extend-padding-functionalities

2f33b40

keewis reviewed Aug 14, 2024

View reviewed changes

xarray/core/dataset.py Outdated Show resolved Hide resolved

tsanona and others added 3 commits August 14, 2024 16:18

Merge branch 'main' into extend-padding-functionalities

b70b25d

more expressive docsting and symplefy type signature with alias in da…

d939030

…taset pad func. enforce number values to be converted to tuples for all in `_pad_options_dim_to_index`. make variable pad funtion consistent with dataset. extend tests

fix typing

48a972f

tsanona requested a review from max-sixty August 15, 2024 08:40

headtr1ck reviewed Aug 15, 2024

View reviewed changes

tsanona and others added 4 commits August 15, 2024 21:07

Merge branch 'main' into extend-padding-functionalities

098706b

add terms to conf.py, make docstrings more accurate, expand tests for…

afda62d

… dataset pad function

filter constant value types without mutating input map

8bdb6f1

Merge branch 'main' into extend-padding-functionalities

992e20e

headtr1ck approved these changes Aug 20, 2024

View reviewed changes

tsanona and others added 3 commits August 21, 2024 01:40

Merge branch 'main' into extend-padding-functionalities

d3f0275

add todo to change default padding for missing variables in constant_…

0a51b7c

…values

Merge branch 'main' into extend-padding-functionalities

3d40800

max-sixty added the plan to merge Final call for comments label Aug 21, 2024

add changes to whats new

a848351

max-sixty merged commit ed5900b into pydata:main Aug 21, 2024
28 checks passed

tsanona deleted the extend-padding-functionalities branch August 21, 2024 21:19

dcherian added a commit to dcherian/xarray that referenced this pull request Aug 21, 2024

Merge branch 'main' into multiple-groupers-2

3dfc019

* main: Extend padding functionalities (pydata#9353)

Extend padding functionalities #9353

Extend padding functionalities #9353

Conversation

tsanona commented Aug 13, 2024

welcome bot commented Aug 13, 2024

max-sixty left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tsanona Aug 14, 2024 • edited Loading

Choose a reason for hiding this comment

tsanona Aug 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tsanona Aug 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tsanona Aug 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

headtr1ck left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tsanona Aug 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

keewis Aug 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

max-sixty commented Aug 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

max-sixty commented Aug 21, 2024

welcome bot commented Aug 21, 2024

max-sixty commented Aug 21, 2024

tsanona Aug 14, 2024 •

edited

Loading

tsanona Aug 14, 2024 •

edited

Loading

tsanona Aug 14, 2024 •

edited

Loading

tsanona Aug 16, 2024 •

edited

Loading

tsanona Aug 16, 2024 •

edited

Loading

keewis Aug 16, 2024 •

edited

Loading