Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DatasetGroupBy.quantile #3527

Merged
merged 5 commits into from
Nov 15, 2019
Merged

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Nov 13, 2019

This adds DatasetGroupBy.quantile by moving DataArrayGroupBy.quantile to GroupBy as proposed in #3018 (comment).

The tests are a modified copy of the ones from #2828. What confuses me is that expected_yy in test_ds_groupby_quantile needs the transpose whereas the equivalent in test_da_groupby_quantile doesn't. Does anyone have an idea about why that is?

@keewis
Copy link
Collaborator Author

keewis commented Nov 13, 2019

is Bug Fixes the correct section for this?

@dcherian
Copy link
Contributor

Let's do enhancements!

@keewis
Copy link
Collaborator Author

keewis commented Nov 13, 2019

that would open a new category. Should I use New Features?

@dcherian
Copy link
Contributor

👍

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @keewis !

I left some minor comments. The transpose is surprising. It would be good to track that down to either .map or Groupby._combine

xarray/core/groupby.py Show resolved Hide resolved
xarray/core/groupby.py Outdated Show resolved Hide resolved
@keewis
Copy link
Collaborator Author

keewis commented Nov 14, 2019

the difference is here, where in DataArrayGroupBy the original dimension order is restored. DatasetGroupBy does not do that, so the test requires the transpose

@dcherian
Copy link
Contributor

LGTM @keewis.

Re: docstring. If you jave time, can you add a couple of examples (just pick two from the tests) to illustrate the scalar vs vector behaviour of q?

@keewis
Copy link
Collaborator Author

keewis commented Nov 15, 2019

sure. But as Dataset.quantile and DataArray.quantile don't have examples either, I think it might be worth to do that in a new PR?

Also, what will we do re transpose?

@dcherian
Copy link
Contributor

👍

The transpose is fine, I think. It looks intentional and must be consistent with other Dataset/DataArray operations?

@keewis
Copy link
Collaborator Author

keewis commented Nov 15, 2019

well, it was surprising to me. It also might be problematic that it matters if we first pull out a DataArray and then compute or the other way around:

>>> ds = xr.Dataset(
...     data_vars={
...         "a": (
...             ("x", "y"),
...             [[1, 11, 26], [2, 12, 22], [3, 13, 23],^I[4, 16,^I24], [5, 15, 25]],
...         )
...     },
...     coords={"x": [1, 1, 1, 2, 2], "y": [0, 0, 1]},
... )
>>> a_ds = ds.groupby("y").quantile(0).a
>>> a_da = ds.a.groupby("y").quantile(0)
>>> a_ds.identical(a_da)
False
>>> a_ds.transpose().identical(a_da)
True

@dcherian
Copy link
Contributor

Hmm.. ok calling mean in this example passes the identical check so maybe there's something funny in quantile:

ds = xr.Dataset(
    data_vars={
        "a": (
            ("x", "y"),
            [[1, 11, 26], [2, 12, 22], [3, 13, 23],
             [4, 16,24], [5, 15, 25]],
        )
    },
    coords={"x": [1, 1, 1, 2, 2], "y": [0, 0, 1]},
).expand_dims({"z": [0,1,2]})
a_ds = ds.groupby("y").mean("x").a
a_da = ds.a.groupby("y").mean("x")
a_ds.identical(a_da)

@keewis
Copy link
Collaborator Author

keewis commented Nov 15, 2019

but that's only because mean removes x, doesn't it?

@keewis
Copy link
Collaborator Author

keewis commented Nov 15, 2019

I can also reproduce this with std

Edit: I'm fairly certain this is due to what I mentioned in #3527 (comment)

@dcherian
Copy link
Contributor

I can also reproduce this with std

Which behaviour can you reproduce?

Replacing mean with std in my last example passes the identical check. Actually this works too

a_ds = ds.groupby("y").apply(lambda x: x).a
a_da = ds.a.groupby("y").apply(lambda x: x)
a_ds.identical(a_da)

@max-sixty
Copy link
Collaborator

Dimension reordering has come up a few times before (#1739, and others that I can't immediately find)

I think we should probably have a general solution & standard behavior here. Others should weigh in (CC @shoyer), but I would vote to not worry too much about it here, and prioritize simplicity and consistency with past behavior until we decide on a more general solution.

@shoyer
Copy link
Member

shoyer commented Nov 15, 2019

the difference is here, where in DataArrayGroupBy the original dimension order is restored. DatasetGroupBy does not do that, so the test requires the transpose

It was not entirely obvious to me how to generalize "restoring dimension order" to Dataset. On a Dataset, it is not entirely clear which variable dimension order should be copied from. Maybe each variable in the result should have dimension order copied from the variable with the same name in the original dataset? But then what about new variables?

I agree @max-sixty, let's not worry about this too much for now.

@dcherian
Copy link
Contributor

In it goes. Great work as usual, @keewis. Thanks!

@dcherian dcherian merged commit 52d4845 into pydata:master Nov 15, 2019
@keewis keewis deleted the fix-datasetgroupby-quantile branch November 15, 2019 23:15
@keewis
Copy link
Collaborator Author

keewis commented Nov 15, 2019

Which behaviour can you reproduce?

ds = xr.Dataset(
    data_vars={
        "a": (
            ("x", "y"),
            [[1, 11, 26], [2, 12, 22], [3, 13, 23],
             [4, 16,24], [5, 15, 25]],
        )
    },
    coords={"x": [1, 1, 1, 2, 2], "y": [0, 0, 1]},
).expand_dims({"z": [0, 1, 1, 2, 2]})
a_ds = ds.groupby("y").std().a
a_da = ds.a.groupby("y").std()
a_ds.identical(a_da)

which seems to be the same as

a_ds = ds.groupby("y").mean("y").a
a_da = ds.a.groupby("y").mean("y")
a_ds.identical(a_da)

My impression is that this puts the dimension grouped over to the front:

  • before: ds.a.dims == ("z", "x", "y")
  • afterwards: a_ds.dims == ("y", "z", "x")

But I agree this discussion does not really belong here. So, thanks all!

dcherian added a commit to dcherian/xarray that referenced this pull request Nov 17, 2019
* upstream/master:
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 17, 2019
* upstream/master: (22 commits)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
  format indexing.rst code with black (pydata#3511)
  add missing pint integration tests (pydata#3508)
  DOC: update bottleneck repo url (pydata#3507)
  add drop_sel, drop_vars, map to api.rst (pydata#3506)
  remove syntax warning (pydata#3505)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 22, 2019
* master: (24 commits)
  Tweaks to release instructions (pydata#3555)
  Clarify conda environments for new contributors (pydata#3551)
  Revert to dev version
  0.14.1 whatsnew (pydata#3547)
  sparse option to reindex and unstack (pydata#3542)
  Silence sphinx warnings (pydata#3516)
  Numpy 1.18 support (pydata#3537)
  tweak whats-new. (pydata#3540)
  small simplification of rename from pydata#3532 (pydata#3539)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  ...
@keewis keewis mentioned this pull request Nov 27, 2019
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DatasetGroupBy does not implement quantile
4 participants