improving the API for binned groupby #191
Comments
okay, turns out I seriously misunderstood what
I guess something like this might be more intuitive?
# by as a mapping, to avoid shadowing variable names
flox.xarray.xarray_reduce(arr, by={"time": flox.Bins(...), "depth": flox.Bins(...)}, func="mean", ...)
# by as a *args
flox.xarray.xarray_reduce(
    arr,
    flox.Bins(variable="time", ...),
    flox.Bins(variable="depth", ...),
    func="mean",
    ...
)
Thanks for trying it out Justus! One reason for all this confusion is that I always expected
oops yeah. The issue is that I wanted to enable the really simple line:
So for multiple groupers
So Line 318 in 51fb6e9
And that function does handle pd.Index objects. I think we should update the typing and docstring. This would be a helpful PR! edit ah now i see, we'll need to remove the
This sounds like a bug but I'm surprised. Line 232 in 51fb6e9
and the tests: Lines 111 to 127 in 51fb6e9
A reproducible example would help.
Yeah I've actually been considering the
It also means you can't do
An alternative is to use |
here's the example I tested this with (although I just realized I could have used
In [1]: import xarray as xr
   ...: import cf_xarray
   ...: import numpy as np
   ...: import flox.xarray
   ...:
   ...:
   ...: def add_vertices(ds, bounds_dim="bounds"):
   ...:     new_names = {
   ...:         name: f"{name.removesuffix(bounds_dim).rstrip('_')}_vertices"
   ...:         for name, coord in ds.variables.items()
   ...:         if bounds_dim in coord.dims
   ...:     }
   ...:     new_coords = {
   ...:         new_name: cf_xarray.bounds_to_vertices(ds[name], bounds_dim=bounds_dim)
   ...:         for name, new_name in new_names.items()
   ...:     }
   ...:     return ds.assign_coords(new_coords)
   ...:
   ...:
   ...: categories = list("abcefghi")
   ...:
   ...: coords = (
   ...:     xr.Dataset(coords={"x": np.arange(10), "y": ("y", categories)})
   ...:     .cf.add_bounds(["x"])
   ...:     .pipe(add_vertices)
   ...: )
   ...: coords
Out[1]:
<xarray.Dataset>
Dimensions: (x: 10, y: 8, bounds: 2, x_vertices: 11)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) <U1 'a' 'b' 'c' 'e' 'f' 'g' 'h' 'i'
x_bounds (x, bounds) float64 -0.5 0.5 0.5 1.5 1.5 ... 7.5 7.5 8.5 8.5 9.5
* x_vertices (x_vertices) float64 -0.5 0.5 1.5 2.5 3.5 ... 6.5 7.5 8.5 9.5
Dimensions without coordinates: bounds
Data variables:
*empty*
In [2]: data = xr.Dataset(
   ...:     {"a": ("x", np.arange(200))},
   ...:     coords={
   ...:         "x": np.linspace(-0.5, 9.5, 200),
   ...:         "y": ("x", np.random.choice(categories, size=200)),
   ...:     },
   ...: )
   ...: data
Out[2]:
<xarray.Dataset>
Dimensions: (x: 200)
Coordinates:
* x (x) float64 -0.5 -0.4497 -0.3995 -0.3492 ... 9.349 9.399 9.45 9.5
y (x) <U1 'e' 'g' 'b' 'b' 'h' 'a' 'g' ... 'a' 'c' 'c' 'h' 'c' 'b' 'h'
Data variables:
a (x) int64 0 1 2 3 4 5 6 7 8 ... 191 192 193 194 195 196 197 198 199
In [3]: flox.xarray.xarray_reduce(
   ...:     data["a"],
   ...:     coords["x"],
   ...:     expected_groups=(coords["x_vertices"],),
   ...:     isbin=[True] * 1,
   ...:     func="mean",
   ...: )
Out[3]:
<xarray.DataArray 'a' (x_bins: 10)>
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
Coordinates:
* x_bins (x_bins) object (-0.5, 0.5] (0.5, 1.5] ... (7.5, 8.5] (8.5, 9.5]
In [4]: flox.xarray.xarray_reduce(
   ...:     data["a"],
   ...:     "x",
   ...:     expected_groups=(coords["x_vertices"],),
   ...:     isbin=[True] * 1,
   ...:     func="mean",
   ...: )
Out[4]:
<xarray.DataArray 'a' (x_bins: 10)>
array([ 10. , 29.5, 49.5, 69.5, 89.5, 109.5, 129.5, 149.5, 169.5,
       189.5])
Coordinates:
* x_bins (x_bins) object (-0.5, 0.5] (0.5, 1.5] ... (7.5, 8.5] (8.5, 9.5]
The first call passes the dimension coordinate and does something weird, while the second call succeeds.
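For comparison, the numbers in `Out[4]` can be reproduced without flox. The following is only a plain pandas sketch of the same right-closed binning, not what `xarray_reduce` does internally. The edges are rebuilt with `np.arange` as a stand-in for `coords["x_vertices"]`.

```python
import numpy as np
import pandas as pd

# Same data as In [2]: 200 points on [-0.5, 9.5], values 0..199.
x = np.linspace(-0.5, 9.5, 200)
a = pd.Series(np.arange(200))

# Stand-in for coords["x_vertices"]: 11 edges delimiting 10 bins.
vertices = np.arange(-0.5, 10.0, 1.0)

# pd.cut builds right-closed intervals (lo, hi] by default, so the
# very first point (x == -0.5) falls outside every bin.
bins = pd.cut(x, vertices)
binned_mean = a.groupby(bins, observed=False).mean()
```

Because the intervals are open on the left, the point at `x == -0.5` is dropped, which is why the first bin averages to 10 rather than 9.5.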
Something else I noticed is that when passing a list with a single-element list containing a
We could get around that restriction by renaming
Edit: at the moment, the |
This is interesting but also messy. Presumably with the dict, everything else (
It does what Xarray does at the moment (see output of
Can you clarify? In |
Ah I don't think this is a bug. I think I'll check for exact alignment. |
Check for exact alignment in xarray_reduce xref #191
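One way such a check could look, as a minimal sketch: `xr.align` with `join="exact"` raises a `ValueError` when the indexes of its arguments differ. The helper name `is_exactly_aligned` is hypothetical and this is not the code from the referenced commit.

```python
import numpy as np
import xarray as xr


def is_exactly_aligned(obj, by):
    """Return True if `obj` and `by` share identical indexes (hypothetical helper)."""
    try:
        # join="exact" refuses to reindex and raises instead.
        xr.align(obj, by, join="exact")
    except ValueError:
        return False
    return True


# Mirrors the example above: 200 float labels vs. 10 integer labels on "x".
data = xr.DataArray(
    np.arange(200), dims="x", coords={"x": np.linspace(-0.5, 9.5, 200)}
)
by = xr.DataArray(np.arange(10), dims="x", coords={"x": np.arange(10)})

mismatched = is_exactly_aligned(data, by)
matched = is_exactly_aligned(data, data["x"])
```

With a check like this, the first call in the example would fail loudly instead of returning an array of NaN.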
Yeah, that was a user error. After reading the example from #189, I finally understand that
So with that, I now think the
Which means we're left with the custom object suggestion:
flox.xarray.xarray_reduce(
    data,
    flox.Grouper("x", bins=coords.x_vertices),
    flox.Grouper(data.y, values=["a", "b", "c"]),
)
which basically does the same thing as the combination of
But yes, the example helps quite a bit, so the custom object would not improve the API as much as I had thought.
Edit: but I still would like to have a convenient way to convert between different bounds conventions (bounds, vertices, interval index)
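To make the three conventions concrete, here is a small sketch of converting between them with NumPy and pandas only. The function names are hypothetical (they are not flox, xarray, or cf-xarray API), and the vertices form is assumed to describe contiguous cells.

```python
import numpy as np
import pandas as pd


def vertices_to_bounds(vertices):
    """(n + 1,) cell edges -> (n, 2) array of [lower, upper] pairs."""
    vertices = np.asarray(vertices)
    return np.stack([vertices[:-1], vertices[1:]], axis=-1)


def bounds_to_vertices(bounds):
    """(n, 2) bounds -> (n + 1,) edges; only valid for contiguous cells."""
    bounds = np.asarray(bounds)
    if not np.array_equal(bounds[1:, 0], bounds[:-1, 1]):
        raise ValueError("bounds are not contiguous")
    return np.concatenate([bounds[:, 0], bounds[-1:, 1]])


def bounds_to_intervals(bounds, closed="right"):
    """(n, 2) bounds -> pandas IntervalIndex with n intervals."""
    bounds = np.asarray(bounds)
    return pd.IntervalIndex.from_arrays(bounds[:, 0], bounds[:, 1], closed=closed)


vertices = np.arange(-0.5, 10.0, 1.0)    # 11 edges
bounds = vertices_to_bounds(vertices)    # shape (10, 2)
intervals = bounds_to_intervals(bounds)  # 10 right-closed intervals
```

Only the bounds form can represent gaps or overlaps between cells, so the conversion to vertices has to reject non-contiguous input.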
I think this is up to xarray/cf-xarray since it is a generally useful thing. Once xarray can stick intervalindex in Xarray objects, I think flox should just do that.
Oh great point!
I did some experiments with the grouper object this afternoon. I'd imagine something like this (I'm not particularly fond of the bounds-type detection, but of course we can just require
import attrs
import more_itertools
import pandas as pd

import flox.xarray

# note: `to_intervals` (bins + bounds_dim -> interval labels) is not defined in this snippet


@attrs.define
class Grouper:
    """grouper for use with `flox.xarray.xarray_reduce`

    Parameters
    ----------
    over : hashable or DataArray
        The coordinate to group over. If a hashable, has to be a variable on the
        data. If a `DataArray`, its dimensions have to be a subset of the data's.
    values : list-like, array-like, or DataArray, optional
        The expected group labels. Mutually exclusive with `bins`.
    bins : DataArray or IntervalIndex, optional
        The bins used to group the data. Can either be a `IntervalIndex`, a `DataArray`
        of `n + 1` vertices, or a `DataArray` of `(n, 2)` bounds.
        Mutually exclusive with `values`.
    bounds_dim : hashable, optional
        The bounds dimension if the bins were passed as bounds.
    """

    over = attrs.field()
    bins = attrs.field(kw_only=True, default=None)
    values = attrs.field(kw_only=True, default=None)
    labels = attrs.field(init=False, default=None)
    bounds_dim = attrs.field(kw_only=True, default=None, repr=False)

    def __attrs_post_init__(self):
        if self.bins is not None and self.values is not None:
            raise TypeError("cannot specify both bins and group labels")

        if self.bins is not None:
            self.labels = to_intervals(self.bins, self.bounds_dim)
        elif self.values is not None:
            self.labels = self.values

    @property
    def is_bin(self):
        # None means "no expected groups given", so nothing to decide
        if self.labels is None:
            return None
        return self.bins is not None


def merge_lists(*lists):
    # element-wise: take the first entry that is not None
    def merge_elements(elements):
        filtered = [element for element in elements if element is not None]
        return more_itertools.first(filtered, default=None)

    return [merge_elements(elements) for elements in zip(*lists)]


def groupby(obj, *by, **flox_kwargs):
    orig_expected_groups = flox_kwargs.get("expected_groups", [None] * len(by))
    orig_isbin = flox_kwargs.get("isbin", [None] * len(by))

    # split the groupers into the three parallel lists xarray_reduce expects
    extracted = ((grouper.over, grouper.labels, grouper.is_bin) for grouper in by)
    by_, expected_groups, isbin = (list(_) for _ in more_itertools.unzip(extracted))

    # explicitly passed keyword arguments take precedence over the groupers
    flox_kwargs["expected_groups"] = tuple(
        merge_lists(orig_expected_groups, expected_groups)
    )
    flox_kwargs["isbin"] = merge_lists(orig_isbin, isbin)

    return flox.xarray.xarray_reduce(obj, *by_, **flox_kwargs)


# imagined usage
flox.xarray.xarray_reduce(
    data,
    Grouper(over="x", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),
    Grouper(over=data.y, values=["a", "b", "c"]),
    func="mean",
)
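The precedence rule in `merge_lists` may be easier to see in isolation. This is a standard-library-only sketch of the same element-wise "first non-None wins" merge. The input lists are made up for illustration, and `first_not_none` stands in for `more_itertools.first`.

```python
def first_not_none(elements):
    """Return the first element that is not None, or None if there is none."""
    return next((element for element in elements if element is not None), None)


def merge_lists(*lists):
    """Merge lists position by position; earlier lists take precedence."""
    return [first_not_none(elements) for elements in zip(*lists)]


# Values passed explicitly as keyword arguments (None = not given) ...
explicit = [None, ["a", "b", "c"]]
# ... and values derived from the grouper objects.
from_groupers = [[0.0, 1.0, 2.0], ["x", "y"]]

merged = merge_lists(explicit, from_groupers)
```

Here the first slot falls back to the grouper's value, while the second keeps the explicitly passed labels.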
The nice thing is that we could also support I really like this |
well, I'm not really attached to the names, so that sounds good to me? |
I've been trying to use `flox` for multi-dimensional binning and found the API a bit tricky to understand.

For some context, I have two variables (`depth(time)` and `temperature(time)`), which I'd like to bin into `time_bounds(time, bounds)` and `depth_bounds(time, bounds)`.

I can get this to work using
but in the process of getting this right I frequently hit the `Needs better message` error from
flox/flox/xarray.py
Line 219 in 51fb6e9
which certainly did not help too much. However, ignoring that it was pretty difficult to make sense of the combination of `*by`, `expected_groups`, and `isbin`, and I'm not confident I won't be going through the same cycle of trial and error if I were to retry in a few months.

Instead, I wonder if we could change the call to something like:
(leaving aside the question of which bounds convention(s) this `Bin` object should support)

Another option might be to just use an interval index. Something like:
That would be pretty close to the existing `groupby` interface. And we could even combine both:

xref pydata/xarray#6610, where we probably want to adopt whatever signature we figure out here. Also, do tell me if you'd prefer to have this discussion in that issue instead (but figuring this out here might allow for quicker iteration). And maybe I'm trying to get `xarray_reduce` to do something too similar to `groupby`?