
Enable taking the mean of dask-backed cftime arrays #6940

Merged: 11 commits into pydata:main from spencerkclark:dask-cftime-mean on Sep 9, 2022

Conversation

@spencerkclark (Member) commented on Aug 21, 2022

This was essentially enabled by @dcherian in #6556, but we did not remove the error that prevented computing the mean of a dask-backed cftime array. This PR removes that error, and adds some tests. One minor modification in _timedelta_to_seconds was needed for compatibility with scalar cftime arrays.

This happens to address the second part of #5897, so I added a regression test for that. It seems like we decided to simply document the behavior in the first part (#5898, dcherian@99bfe12), but I'm not sure if we intend to change that behavior eventually or not.
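For context, a minimal usage sketch of what this enables (assuming cftime and dask are installed; the calendar, length, and chunk size below are just illustrative):

    import xarray as xr

    times = xr.cftime_range("2000", periods=4, calendar="noleap")
    da = xr.DataArray(times, dims=["time"]).chunk({"time": 2})

    # Before this PR, reducing a dask-backed cftime array with .mean() raised
    # an error; now the reduction is built lazily and computes to a cftime
    # datetime.
    result = da.mean()
    print(result.compute())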

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

@github-actions github-actions bot added topic-arrays related to flexible array support topic-documentation labels Aug 21, 2022
@spencerkclark (Member Author) commented:
Ah, this still didn't quite work for older dask versions:

====================================================================================================== FAILURES =======================================================================================================
___________________________________________________________________________________________ test_cftime_datetime_mean[True] ___________________________________________________________________________________________

dask = True

    @requires_cftime
    @pytest.mark.parametrize("dask", [False, True])
    def test_cftime_datetime_mean(dask):
        if dask and not has_dask:
            pytest.skip("requires dask")

        times = cftime_range("2000", periods=4)
        da = DataArray(times, dims=["time"])
        da_2d = DataArray(times.values.reshape(2, 2))

        if dask:
            da = da.chunk({"time": 2})
            da_2d = da_2d.chunk({"dim_0": 2})

        expected = da.isel(time=0)
        # one compute needed to check the array contains cftime datetimes
        with raise_if_dask_computes(max_computes=1):
>           result = da.isel(time=0).mean()

/Users/spencer/software/xarray/xarray/tests/test_duck_array_ops.py:342:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Users/spencer/software/xarray/xarray/core/_reductions.py:1478: in mean
    return self.reduce(
/Users/spencer/software/xarray/xarray/core/dataarray.py:2930: in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
/Users/spencer/software/xarray/xarray/core/variable.py:1854: in reduce
    data = func(self.data, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

array = dask.array<getitem, shape=(), dtype=object, chunksize=(), chunktype=numpy.ndarray>, axis = None, skipna = None, kwargs = {}
_contains_cftime_datetimes = <function _contains_cftime_datetimes at 0x7fcdd91ce3a0>, offset = dask.array<where, shape=(), dtype=object, chunksize=(), chunktype=numpy.ndarray>
timedeltas = dask.array<multiply, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>, mean_timedeltas = dask.array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>

    def mean(array, axis=None, skipna=None, **kwargs):
        """inhouse mean that can handle np.datetime64 or cftime.datetime
        dtypes"""
        from .common import _contains_cftime_datetimes

        array = asarray(array)
        if array.dtype.kind in "Mm":
            offset = _datetime_nanmin(array)

            # xarray always uses np.datetime64[ns] for np.datetime64 data
            dtype = "timedelta64[ns]"
            return (
                _mean(
                    datetime_to_numeric(array, offset), axis=axis, skipna=skipna, **kwargs
                ).astype(dtype)
                + offset
            )
        elif _contains_cftime_datetimes(array):
            offset = min(array)
            timedeltas = datetime_to_numeric(array, offset, datetime_unit="us")
            mean_timedeltas = _mean(timedeltas, axis=axis, skipna=skipna, **kwargs)
>           return _to_pytimedelta(mean_timedeltas, unit="us") + offset
E           TypeError: unsupported operand type(s) for +: 'Array' and 'Array'

/Users/spencer/software/xarray/xarray/core/duck_array_ops.py:573: TypeError

This appears to be the commit that fixed this issue in dask, meaning it should be fixed as of dask version 2021.07.0:

git bisect broken
626cc724ffa1d14cd74f48ff4f21ef443d832afa is the first fixed commit
commit 626cc724ffa1d14cd74f48ff4f21ef443d832afa
Author: Julia Signell <[email protected]>
Date:   Tue Jun 29 11:50:36 2021 -0400

    Allow mixing dask and numpy arrays in @guvectorize (#6863)

 dask/array/core.py              |  7 ++++++-
 dask/array/tests/test_gufunc.py | 11 +++++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)
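
Until xarray's minimum supported dask includes that fix, one possibility (purely a sketch; the marker name below is made up for illustration) would be to gate the dask-backed test on the dask version:

    # Hypothetical sketch: skip the dask-backed case when dask predates the fix.
    import pytest
    from packaging.version import Version

    dask = pytest.importorskip("dask")

    requires_recent_dask = pytest.mark.skipif(
        Version(dask.__version__) < Version("2021.07.0"),
        reason="cftime mean on dask arrays requires dask >= 2021.07.0",
    )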

@dcherian (Contributor) commented:
Thanks @spencerkclark

Package        Pinned minimum (release)   Policy minimum (release)
dask-core      2021.4  (2021-04-02)       2021.8  (2021-08-13)  <
distributed    2021.4  (2021-04-02)       2021.8  (2021-08-14)  <

It looks like we can bump to 2021.08.0 instead of dealing with this compatibility code.

@spencerkclark (Member Author) replied:
> It looks like we can bump to 2021.08.0 instead of dealing with this compatibility code.

Thanks for noting this @dcherian. I wasn't sure if this was worth bumping our minimum version for, but since our policy allows for it, I'm happy to make that change.

@github-actions github-actions bot added CI Continuous Integration tools dependencies Pull requests that update a dependency file labels Aug 24, 2022
result = da_2d.mean()
with raise_if_dask_computes(max_computes=1):
    result = da_2d.mean()
assert_dask_array(result, dask)
A reviewer (Contributor) commented on the test snippet above:
nice! Thanks for making extra sure that we didn't compute the whole thing.
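
For reference, the compute-counting guard used above works roughly like the sketch below (a simplified, assumed rendering of the raise_if_dask_computes test helper, not its exact implementation):

    import dask

    class CountingScheduler:
        """Scheduler wrapper that raises once more than `max_computes`
        graph evaluations have happened."""

        def __init__(self, max_computes=0):
            self.total_computes = 0
            self.max_computes = max_computes

        def __call__(self, dsk, keys, **kwargs):
            self.total_computes += 1
            if self.total_computes > self.max_computes:
                raise RuntimeError(
                    f"Too many computes: {self.total_computes} > {self.max_computes}"
                )
            return dask.get(dsk, keys, **kwargs)

    # Any evaluation inside the block is routed through the counting scheduler.
    with dask.config.set(scheduler=CountingScheduler(max_computes=1)):
        ...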

@dcherian (Contributor) left a review:
LGTM!

@dcherian dcherian added plan to merge Final call for comments and removed topic-documentation labels Aug 24, 2022
@mathause (Collaborator) left a review:
Looks good - can you merge master to fix the merge conflict?

@dcherian dcherian enabled auto-merge (squash) September 9, 2022 16:18
@dcherian dcherian merged commit 2553762 into pydata:main Sep 9, 2022
@spencerkclark spencerkclark deleted the dask-cftime-mean branch September 10, 2022 11:51
@spencerkclark (Member Author) commented:
Oops sorry I missed the merge conflict. Thanks @dcherian and @mathause.
