Extend padding functionalities #9353

Merged
merged 16 commits into from
Aug 21, 2024
Changes from 8 commits
23 changes: 16 additions & 7 deletions xarray/core/dataset.py
@@ -163,6 +163,7 @@
ReindexMethodOptions,
SideOptions,
T_ChunkDimFreq,
+T_DatasetPadConstantValues,
T_Xarray,
)
from xarray.core.weighted import DatasetWeighted
@@ -9147,9 +9148,7 @@ def pad(
stat_length: (
int | tuple[int, int] | Mapping[Any, tuple[int, int]] | None
) = None,
-constant_values: (
-float | tuple[float, float] | Mapping[Any, tuple[float, float]] | None
-) = None,
+constant_values: T_DatasetPadConstantValues | None = None,
end_values: int | tuple[int, int] | Mapping[Any, tuple[int, int]] | None = None,
reflect_type: PadReflectOptions = None,
keep_attrs: bool | None = None,
@@ -9205,9 +9204,11 @@ def pad(
(stat_length,) or int is a shortcut for before = after = statistic
length for all axes.
Default is ``None``, to use the entire axis.
-constant_values : scalar, tuple or mapping of hashable to tuple, default: 0
-Used in 'constant'. The values to set the padded values for each
-axis.
+constant_values : scalar, tuple, mapping of dim name to scalar or tuple, or \
+mapping of var name to scalar, tuple or to mapping of dim name to scalar or tuple, default: 0
+Used in 'constant'. The values to set the padded values for each data variable / axis.
+``{var_1: {dim_1: (before_1, after_1), ... dim_N: (before_N, after_N)}, ...
+var_M: (before, after)}`` unique pad constants per data variable.
``{dim_1: (before_1, after_1), ... dim_N: (before_N, after_N)}`` unique
pad constants along each dimension.
``((before, after),)`` yields same before and after constants for each
@@ -9293,6 +9294,12 @@ def pad(
if not pad_dims.intersection(xindexes.get_all_dims(k)):
indexes[k] = idx

+per_data_var_constant_values = {}
+if isinstance(constant_values, dict):
Collaborator

The typing claims that any mapping works, but here you are only checking dicts.
I would propose to change to utils.is_dict_like and find an alternative to pop (Mapping does not define pop because it is read-only). The simplest way would be to transform it into a dict first.

Contributor Author

That's a fair point. To be honest, I just followed the way it is done in the implementation of pad for the variables in variable.py. Maybe I should just change all these type hints to dict rather than Mapping. What do you think?

Collaborator

From a typing perspective dict is always problematic because it is invariant.
I would try to stick with Mapping instead.
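
To illustrate the invariance point, here is a minimal sketch. The function names `total_from_dict` and `total_from_mapping` are hypothetical and not part of xarray. A static checker such as mypy accepts a `dict[str, int]` where `Mapping[str, float]` is expected, because `Mapping` is covariant in its value type, but reports an error where `dict[str, float]` is expected, because `dict` is invariant. At runtime both calls work; the difference only shows up during type checking.

```python
from collections.abc import Mapping


def total_from_dict(values: dict[str, float]) -> float:
    # Annotated with the invariant ``dict``: only dict[str, float] type-checks.
    return sum(values.values())


def total_from_mapping(values: Mapping[str, float]) -> float:
    # Annotated with the read-only ``Mapping``: covariant in the value type.
    return sum(values.values())


ints: dict[str, int] = {"dim2": 42}

# Accepted by a static type checker: Mapping[str, int] is compatible
# with Mapping[str, float].
total = total_from_mapping(ints)

# A static type checker would flag the following call, even though it
# runs fine, because dict[str, int] is not a subtype of dict[str, float]:
# total_from_dict(ints)
```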

Contributor Author

@tsanona, Aug 16, 2024

Ah okay, so I'll just change this one isinstance(constant_values, dict) to utils.is_dict_like(constant_values). Thanks for the clarifications!

Collaborator

Yes and no.
For the check it should suffice, but the general Mapping class has no pop defined because it is read-only (non-mutable).
Anyway, I guess the user would find it weird that this input is changed in-place, so remove the pop method and replace it with a get item call (not sure if the Mapping ABC defines a get method).
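
On the open question about `get`: the `collections.abc.Mapping` ABC does provide `get` as a mixin method, while `pop` only exists on `MutableMapping`. The short check below uses the standard library's `types.MappingProxyType` as a stand-in for a read-only mapping a caller might pass; the example keys are illustrative only.

```python
from collections.abc import Mapping
from types import MappingProxyType

# A read-only view over a dict, standing in for a user-supplied Mapping.
read_only = MappingProxyType({"var1": 42, "dim2": (42, 43)})

is_mapping = isinstance(read_only, Mapping)

# ``get`` is part of the Mapping ABC; ``pop`` is only on MutableMapping.
has_get = hasattr(Mapping, "get")
has_pop = hasattr(Mapping, "pop")

# Lookups with a default work without mutating the input.
value = read_only.get("var1")
missing = read_only.get("var3")
```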

Collaborator

@keewis, Aug 16, 2024

I wonder if the check needs to be more thorough? It is totally possible to create Dataset objects with a data variable that has the same name as a dimension:

xr.Dataset({"x": ("x", [1, 2])}).drop_indexes("x").reset_coords("x")

so we could potentially break the original use case if we remove all data variables from the dictionary, because we could also remove a dimension name.


In general, I believe we need to be thorough in defining the new feature:

  • should we also allow padding coordinate variables?
  • if we use it, should it still be possible to specify blanket dimension padding? I.e. should it be possible to mix the two? If so, how do we figure out which is which? Check if the value is a dict and decide it's the padding for a specific variable?
  • what should happen if we specify padding values only for a subset of the variables? Dimensions need to have the same size for all variables in a Dataset, so either we need to specify a fallback (by dimension names), or we need to require specifying the padding for all affected variables (and raise otherwise).

Contributor Author

> Yes and no. For the check it should suffice, but the general Mapping class has no pop defined because it is read-only (non-mutable). Anyway, I guess the user would find it weird that this input is changed in-place, so remove the pop method and replace it with a get item call (not sure if the Mapping ABC defines a get method).

That's a very fair point, I always forget that about dicts 🙃. I'll figure out a different way to do it.

Contributor Author

Alrighty, changed, now the input constant_values is not mutated in place.

Contributor Author

> I wonder if the check needs to be more thorough? It is totally possible to create Dataset objects with a data variable that has the same name as a dimension:
>
> xr.Dataset({"x": ("x", [1, 2])}).drop_indexes("x").reset_coords("x")
>
> so we could potentially break the original use case if we remove all data variables from the dictionary, because we could also remove a dimension name.
>
> In general, I believe we need to be thorough in defining the new feature:
>
>   • should we also allow padding coordinate variables?
>   • if we use it, should it still be possible to specify blanket dimension padding? I.e. should it be possible to mix the two? If so, how do we figure out which is which? Check if the value is a dict and decide it's the padding for a specific variable?
>   • what should happen if we specify padding values only for a subset of the variables? Dimensions need to have the same size for all variables in a Dataset, so either we need to specify a fallback (by dimension names), or we need to require specifying the padding for all affected variables (and raise otherwise).

It is a fair point. I have rewritten this section so that the original constant_values mapping is not mutated.
Regarding the bullet points:

  • As I understand it, it was already possible to pad coordinate variables.
  • As it is coded now, if a data var has the same name as a dim and the user sets a value under that name in constant_values, then the data var will be padded with that value (regardless of which dim is being padded), and all other data vars that contain that dim will also be padded with that value, provided that it is the dim being padded (i.e. the dim given in pad_width). In the tests I also show that one can mix data var and dim keys in the same dict. In that case the value for the data var has priority over the value for the dim.
  • I realized that it was already implemented that if one didn't set values for all dims being padded, the remaining ones are padded with 0, so I kept that. For example, if one sets a constant_value for var1 and pads a dim that is also contained in other vars, var1 will be padded with that value along the dim while all other vars will be padded with 0 along the same dim.

I hope I didn't make it too confusing. Let me know if it answers your questions and if the logic makes sense :D
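
The precedence described in this reply can be sketched in plain Python. This is only an illustration under stated assumptions, not the code merged in the PR: `split_constant_values` and `constant_values_for` are hypothetical helper names, and the sketch treats every key that matches a data variable name as a per-variable setting, so it does not resolve the name-collision case raised earlier in the thread. It mirrors the `per_data_var_constant_values.get(name, constant_values)` lookup from the diff while leaving the caller's mapping untouched.

```python
from collections.abc import Mapping


def split_constant_values(constant_values, data_var_names):
    """Split into (per-variable values, remaining values) without mutation."""
    if not isinstance(constant_values, Mapping):
        # Scalars and tuples apply to every variable as they are.
        return {}, constant_values
    names = set(data_var_names)
    per_var = {k: v for k, v in constant_values.items() if k in names}
    remainder = {k: v for k, v in constant_values.items() if k not in names}
    return per_var, remainder


def constant_values_for(name, per_var, remainder):
    # A per-variable entry wins; otherwise fall back to the per-dim values.
    return per_var.get(name, remainder)


# The "mixed" case from the parametrized test: one data var key, one dim key.
original = {"var1": 42, "dim2": (42, 43)}
per_var, remainder = split_constant_values(original, ["var1", "var2"])

var1_values = constant_values_for("var1", per_var, remainder)  # 42
var2_values = constant_values_for("var2", per_var, remainder)  # {"dim2": (42, 43)}
```

Building two new dicts instead of calling `pop` keeps the function usable with any read-only `Mapping` and avoids surprising the caller with an altered argument.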

+for k in self.data_vars:
+if v := constant_values.pop(k, None):
+per_data_var_constant_values[k] = v

for name, var in self.variables.items():
var_pad_width = {k: v for k, v in pad_width.items() if k in var.dims}
if not var_pad_width:
@@ -9302,7 +9309,9 @@ def pad(
pad_width=var_pad_width,
mode=mode,
stat_length=stat_length,
-constant_values=constant_values,
+constant_values=per_data_var_constant_values.get(
+name, constant_values
+),
end_values=end_values,
reflect_type=reflect_type,
keep_attrs=keep_attrs,
5 changes: 5 additions & 0 deletions xarray/core/types.py
@@ -243,6 +243,11 @@ def copy(
"symmetric",
"wrap",
]
+T_PadConstantValues = float | tuple[float, float]
+T_VarPadConstantValues = T_PadConstantValues | Mapping[Any, T_PadConstantValues]
+T_DatasetPadConstantValues = (
+T_VarPadConstantValues | Mapping[Any, T_VarPadConstantValues]
+)
PadReflectOptions = Literal["even", "odd", None]

CFCalendar = Literal[
18 changes: 9 additions & 9 deletions xarray/core/variable.py
@@ -65,6 +65,7 @@
Self,
T_Chunks,
T_DuckArray,
+T_VarPadConstantValues,
)
from xarray.namedarray.parallelcompat import ChunkManagerEntrypoint

@@ -1121,9 +1122,14 @@ def shift(self, shifts=None, fill_value=dtypes.NA, **shifts_kwargs):

def _pad_options_dim_to_index(
self,
-pad_option: Mapping[Any, int | tuple[int, int]],
+pad_option: Mapping[Any, int | float | tuple[int, int] | tuple[float, float]],
fill_with_shape=False,
):
+# change number values to a tuple of two of those values
+for k, v in pad_option.items():
+if isinstance(v, numbers.Number):
+pad_option[k] = (v, v)

if fill_with_shape:
return [
(n, n) if d not in pad_option else pad_option[d]
@@ -1138,9 +1144,7 @@ def pad(
stat_length: (
int | tuple[int, int] | Mapping[Any, tuple[int, int]] | None
) = None,
-constant_values: (
-float | tuple[float, float] | Mapping[Any, tuple[float, float]] | None
-) = None,
+constant_values: T_VarPadConstantValues | None = None,
end_values: int | tuple[int, int] | Mapping[Any, tuple[int, int]] | None = None,
reflect_type: PadReflectOptions = None,
keep_attrs: bool | None = None,
@@ -1160,7 +1164,7 @@ def pad(
stat_length : int, tuple or mapping of hashable to tuple
Used in 'maximum', 'mean', 'median', and 'minimum'. Number of
values at edge of each axis used to calculate the statistic value.
-constant_values : scalar, tuple or mapping of hashable to tuple
+constant_values : scalar, tuple or mapping of hashable to scalar or tuple
Used in 'constant'. The values to set the padded values for each
axis.
end_values : scalar, tuple or mapping of hashable to tuple
@@ -1207,10 +1211,6 @@ def pad(
if stat_length is None and mode in ["maximum", "mean", "median", "minimum"]:
stat_length = [(n, n) for n in self.data.shape] # type: ignore[assignment]

-# change integer values to a tuple of two of those values and change pad_width to index
-for k, v in pad_width.items():
-if isinstance(v, numbers.Number):
-pad_width[k] = (v, v)
pad_width_by_index = self._pad_options_dim_to_index(pad_width)

# create pad_options_kwargs, numpy/dask requires only relevant kwargs to be nonempty
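
The diff above moves the scalar-to-tuple normalization into `_pad_options_dim_to_index`, where it still assigns into `pad_option` in place even though the parameter is annotated as a `Mapping`. As a sketch of a non-mutating alternative (the name `normalize_pad_option` is hypothetical and not an xarray function), the normalized values can be collected into a new dict instead:

```python
import numbers
from collections.abc import Mapping


def normalize_pad_option(pad_option: Mapping) -> dict:
    """Return a new dict where bare numbers become (before, after) pairs."""
    return {
        dim: (value, value) if isinstance(value, numbers.Number) else tuple(value)
        for dim, value in pad_option.items()
    }


# Assumed example input: one scalar entry and one (before, after) entry.
options = {"dim2": 42, "dim1": (1, 2)}
normalized = normalize_pad_option(options)
```

Because the result is a fresh dict, this approach also works when the caller passes a read-only mapping, and the caller's object keeps its original values.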
34 changes: 31 additions & 3 deletions xarray/tests/test_dataset.py
@@ -6689,17 +6689,45 @@ def test_polyfit_warnings(self) -> None:
ds.var1.polyfit("dim2", 10, full=True)
assert len(ws) == 1

-def test_pad(self) -> None:
+@pytest.mark.parametrize(
Collaborator

Does this work if you want to pad along a dimension coordinate (i.e. a variable that has the same name as its dimension)?

Contributor Author

Good question, I'll investigate :D

Contributor Author

So, as I understand it, most dims in the test dataset are dimension coordinates and they pad correctly, so I think so. In any case, I've extended the tests to pad all dimensions just to be sure nothing is behaving incorrectly. Let me know if I missed any case.

+["constant_values", "expected"],
+[
+pytest.param(None, {"var1": np.nan}, id="default"),
+pytest.param(42, {"var1": 42, "var2": 42}, id="scalar"),
+pytest.param((42, 43), {"var1": (42, 43), "var2": (42, 43)}, id="tuple"),
+pytest.param({"dim2": 42}, {"var1": 42, "var2": 42}, id="per dim scalar"),
+pytest.param(
+{"dim2": (42, 43)},
+{"var1": (42, 43), "var2": (42, 43)},
+id="per dim tuple",
+),
+pytest.param(
+{"var1": 42, "var2": (42, 43)},
+{"var1": 42, "var2": (42, 43)},
+id="per var",
+),
+pytest.param(
+{"var1": 42, "dim2": (42, 43)},
+{"var1": 42, "var2": (42, 43)},
+id="mixed",
+),
+],
+)
+def test_pad(self, constant_values, expected) -> None:
ds = create_test_data(seed=1)
-padded = ds.pad(dim2=(1, 1), constant_values=42)
+padded = ds.pad(dim2=(1, 1), constant_values=constant_values)

assert padded["dim2"].shape == (11,)
assert padded["var1"].shape == (8, 11)
assert padded["var2"].shape == (8, 11)
assert padded["var3"].shape == (10, 8)
assert dict(padded.sizes) == {"dim1": 8, "dim2": 11, "dim3": 10, "time": 20}

-np.testing.assert_equal(padded["var1"].isel(dim2=[0, -1]).data, 42)
+for var, expected_value in expected.items():
+np.testing.assert_equal(
+np.unique(padded[var].isel(dim2=[0, -1]).data), expected_value
+)
+# np.testing.assert_equal(padded["var1"].isel(dim2=[0, -1]).data, 42)
np.testing.assert_equal(padded["dim2"][[0, -1]].data, np.nan)

@pytest.mark.parametrize(