-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend padding functionalities #9353
Extend padding functionalities #9353
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. |
…adding-functionalities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tsanona , this is good. I added some comments to try and make it simpler to understand, but it's good functionality!
xarray/core/dataset.py
Outdated
| Mapping[ | ||
Any, float | tuple[float, float] | Mapping[Any, tuple[float, float]] | ||
] | ||
| None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty gnarly type signature. Same with the docstring. But I don't think there's much we can do to simplify it.
We possibly could make float | tuple[float, float] | Mapping[Any, tuple[float, float]]
into an alias?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's true, I'll do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I changed it slightly. I created these aliases:
T_PadConstantValues = float | tuple[float, float]
T_VarPadConstantValues = T_PadConstantValues | Mapping[Any, T_PadConstantValues]
T_DatasetPadConstantValues = T_VarPadConstantValues | Mapping[Any, T_VarPadConstantValues]
This way it is a bit more organized. Also changed the implementation of pad
for variables so that it can make use of Mapping[Any, float]
to keep consistency with the pad
function for datasets.
What do you think?
(Also let me know if the location in types.py
and import in dataset.py
is correct. I don't fully understand when to import types under the if TYPE_CHECKING:
block or above in dataset.py
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
(others may have refinements...)
xarray/core/dataset.py
Outdated
Used in 'constant'. The values to set the padded values for each | ||
axis. | ||
constant_values : scalar, tuple, mapping of hashable to tuple or | ||
mapping of hashable to mapping of hashable to tuple, default: 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily on this PR but do you happen to know why the default is listed as 0
here but is None
in the func signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, what happened there, but it is a good point since if left at None the arrays are padded with np.nan
. Should I change the default to nan
or None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or to 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you mean changing the default and not the docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, so what's the actual value that goes onto the dataset, by default? I had originally thought it was 0
, but does None
/ NaN
actually get placed on the dataset? (I realize it's listed as the default in the signature)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wasn't sure when you first asked so I created a test for the default state and it does indeed look like it populates the padding with np.nan
when constant_values
is left at None
. So we it is confusing to have default: 0
in the doc string. We can either change the doc string to None
or NaN
or change the default in the function signature to 0
rather then None
. I'm not sure which would be better, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change the docstrings, otherwise that's a breaking change!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah of course, will change it to default: None
in the docstring.
I'm also changing it for end_values
which seems to me to have the same problem.
I also just realized it must have been copied from the docstring of np.pad
since it looks quite similar and the actual default is 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thank you!
…taset pad func. enforce number values to be converted to tuples for all in `_pad_options_dim_to_index`. make variable pad funtion consistent with dataset. extend tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very useful, thanks. I needed that already a few times and ended up writing my own wrapper.
xarray/core/dataset.py
Outdated
@@ -9293,6 +9294,12 @@ def pad( | |||
if not pad_dims.intersection(xindexes.get_all_dims(k)): | |||
indexes[k] = idx | |||
|
|||
per_data_var_constant_values = {} | |||
if isinstance(constant_values, dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The typing claims that any mapping works, but here you are only checking dicts.
I would propose to change to utils.is_dict_like
and find an alternative to pop (Mapping does not define pop because it is read-only). The simplest way would be to transform it into a dict first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point, to be honest I just followed the way it is done in the implementation of pad
for the variables in variable.py
. Maybe I should just change all these type hints to dict
rather than Mapping
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a typing perspective dict is always problematic because it is invariant.
I would try to stick with Mapping instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, so just I'll just change this one isinstance(constant_values, dict)
to utils.is_dict_like(constant_values)
. Thanks for the clarifications!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and no.
For the check it should suffice, but the general Mapping class has no pop
defined because it is read-only (non-mutable).
Anyway, I guess the user would find it weird that this input is changed in-place, so remove the pop
method and replace it with a get item call (not sure if the Mapping ABC defines a get
method).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the check needs to be more thorough? It is totally possible to create Dataset
objects with a data variable that has the same name as a dimension:
xr.Dataset({"x": ("x", [1, 2])}).drop_indexes("x").reset_coords("x")
so we could potentially break the original use case if we remove all data variables from the dictionary, because we could also remove a dimension name.
In general, I believe we need to be thorough in defining the new feature:
- should we also allow padding coordinate variables?
- if we use it, should it still be possible to specify blanket dimension padding? I.e. should it be possible to mix the two? If so, how do we figure out which is which? Check if the value is a
dict
and decide it's the padding for a specific variable? - what should happen if we specify padding values only for a subset of the variables? Dimensions need to have the same size for all variables in a
Dataset
, so either we need to specify a fallback (by dimension names), or we need to require specifying the padding for all affected variables (and raise otherwise).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and no. For the check it should suffice, but the general Mapping class has no
pop
defined because it is read-only (non-mutable). Anyway, I guess the user would find it weird that this input is changed in-place, so remove thepop
method and replace it with a get item call (not sure if the Mapping ABC defines aget
method).
That's a very fair point, always forget that about dicts 🙃 , I'll figure out a different way to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alrighty, changed, now the input constant_values
is not mutated in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the check needs to be more thorough? It is totally possible to create
Dataset
objects with a data variable that has the same name as a dimension:xr.Dataset({"x": ("x", [1, 2])}).drop_indexes("x").reset_coords("x")so we could potentially break the original use case if we remove all data variables from the dictionary, because we could also remove a dimension name.
In general, I believe we need to be thorough in defining the new feature:
- should we also allow padding coordinate variables?
- if we use it, should it still be possible to specify blanket dimension padding? I.e. should it be possible to mix the two? If so, how do we figure out which is which? Check if the value is a
dict
and decide it's the padding for a specific variable?- what should happen if we specify padding values only for a subset of the variables? Dimensions need to have the same size for all variables in a
Dataset
, so either we need to specify a fallback (by dimension names), or we need to require specifying the padding for all affected variables (and raise otherwise).
It is a fair point, I have rewritten this section so the original constant_variables
mapping is not mutated.
Regarding the bullet points:
- As I understand it was already possible to pad coordinate variables.
- As it is codded now if a
data var
has adim
with the same name and the user sets a value with this name in theconstant_values
then thedata var
will be padded with those values (regardless of the dim being padded) and all otherdata vars
that also contain thatdim
will be padded with that value, provided that it is thedim
being padded (a.k.a. the dim provided inpad_width
). In the tests I also show that one can mixdata var
anddim
in the same dict. In that case the value for thedata var
has priority over the value of thedim
. - I realized that it was already implemented that if one didn't set values for all
dims
that are being padded then they are be padded with 0, so I kept that. For example, if one sets aconstant_value
forvar1
and is padding adim
that is contained in in othervars
,var1
will be padded with that value along thedim
while all othervars
will be padded with 0 along the samedim
.
I hope I didn't make it too confusing. Let me know if it answers your questions and if the logic makes sense :D
@@ -6689,17 +6689,45 @@ def test_polyfit_warnings(self) -> None: | |||
ds.var1.polyfit("dim2", 10, full=True) | |||
assert len(ws) == 1 | |||
|
|||
def test_pad(self) -> None: | |||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work if you want to pad along a dimension coordinate (aka. a variable that is called the same as it's dimension)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I'll investigate :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, as I understand it most dims in the test dataset are dimension coordinates and they pad correctly, so I think so. In any case I've extended the tests to pad all dimensions just to be sure nothing is behaving incorrectly. Let me know if I missed any case.
Any other changes pending or is this ready to go? (thanks a lot for the work @tsanona !) |
xarray/core/dataset.py
Outdated
k: v for k, v in constant_values.items() if k in var.dims | ||
} | ||
else: | ||
filtered_constant_values = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filtered_constant_values = 0 | |
filtered_constant_values = None |
If I'm not mistaken
Probably a test should cover this as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we change that 0 to None
then there is a discrepancy between what happens if we don't set all dim
in constant_variables
(this defaults to padding with 0 and is coded in the pad
function from variable.py
) and if we don't set all var
(code I added in pad
function from dataset.py
which will then pad dimensions with np.nan
).
I can either change every where which I think might be a breaking change (?) or just leave everything as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be worth changing but doesn't need to be in this PR (sorry if I slowed things down by introducing this). Assuming it isn't trivial, adding a TODO with a link to this would be very sufficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, added the TODO 🚀
Lemme know if there's anything else left to do and thanks for all the comments!
If we can add a whatsnew entry, that would be great! |
Thank you very much @tsanona ! |
* main: Extend padding functionalities (pydata#9353)
* main: (214 commits) Adds copy parameter to __array__ for numpy 2.0 (pydata#9393) `numpy 2` compatibility in the `pydap` backend (pydata#9391) pyarrow dependency added to doc environment (pydata#9394) Extend padding functionalities (pydata#9353) refactor GroupBy internals (pydata#9389) Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274) passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377) Fix tests on big-endian systems (pydata#9380) Improve error message on `ds['x', 'y']` (pydata#9375) Improve error message for missing coordinate index (pydata#9370) Add flaky to TestNetCDF4ViaDaskData (pydata#9373) Make chunk manager an option in `set_options` (pydata#9362) Revise (pydata#9371) Remove duplicate word from docs (pydata#9367) Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243) Revise (pydata#9366) Fix rechunking to a frequency with empty bins. (pydata#9364) whats-new entry for dropping python 3.9 (pydata#9359) drop support for `python=3.9` (pydata#8937) Revise (pydata#9357) ...
* main: Adds copy parameter to __array__ for numpy 2.0 (pydata#9393) `numpy 2` compatibility in the `pydap` backend (pydata#9391) pyarrow dependency added to doc environment (pydata#9394) Extend padding functionalities (pydata#9353) refactor GroupBy internals (pydata#9389) Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274) passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377) Fix tests on big-endian systems (pydata#9380) Improve error message on `ds['x', 'y']` (pydata#9375)
While working on a project I realized that in the dataset padding function one can set different
constant_values
for each dimension but not for each variable. I've added some simple code to allow this, one can now also define padding differently for each data variable. I also extended the tests to include these examples.Let me know if I can do any changes to better fit your standards.