Fix CFTimeIndex-related errors stemming from updates in pandas #3764

spencerkclark · 2020-02-11T13:22:04Z

Closes more upstream-dev cftime failures #3751
Tests added
Passes isort -rc . && black . && mypy . && flake8
Fully documented, including whats-new.rst for all changes and api.rst for new API

This fixes the errors identified when #3751 was created by allowing one to subtract a pd.Index of cftime.datetime objects from a CFTimeIndex.

Some new errors have come up too (not associated with any updates I made here), which I still need to work on identifying the source of:

____________________________ test_indexing_in_series_getitem[365_day] _____________________________

series = 0001-01-01 00:00:00    1
0001-02-01 00:00:00    2
0002-01-01 00:00:00    3
0002-02-01 00:00:00    4
dtype: int64
index = CFTimeIndex([0001-01-01 00:00:00, 0001-02-01 00:00:00, 0002-01-01 00:00:00,
             0002-02-01 00:00:00],
            dtype='object')
scalar_args = [cftime.DatetimeNoLeap(0001-01-01 00:00:00)]
range_args = ['0001', slice('0001-01-01', '0001-12-30', None), slice(None, '0001-12-30', None), slice(cftime.DatetimeNoLeap(0001-01...:00), cftime.DatetimeNoLeap(0001-12-30 00:00:00), None), slice(None, cftime.DatetimeNoLeap(0001-12-30 00:00:00), None)]

    @requires_cftime
    def test_indexing_in_series_getitem(series, index, scalar_args, range_args):
        for arg in scalar_args:
>           assert series[arg] == 1

test_cftimeindex.py:597:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../pandas/pandas/core/series.py:884: in __getitem__
    return self._get_with(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 0001-01-01 00:00:00    1
0001-02-01 00:00:00    2
0002-01-01 00:00:00    3
0002-02-01 00:00:00    4
dtype: int64
key = cftime.DatetimeNoLeap(0001-01-01 00:00:00)

    def _get_with(self, key):
        # other: fancy integer or otherwise
        if isinstance(key, slice):
            # _convert_slice_indexer to determing if this slice is positional
            #  or label based, and if the latter, convert to positional
            slobj = self.index._convert_slice_indexer(key, kind="getitem")
            return self._slice(slobj)
        elif isinstance(key, ABCDataFrame):
            raise TypeError(
                "Indexing a Series with DataFrame is not "
                "supported, use the appropriate DataFrame column"
            )
        elif isinstance(key, tuple):
            try:
                return self._get_values_tuple(key)
            except ValueError:
                # if we don't have a MultiIndex, we may still be able to handle
                #  a 1-tuple.  see test_1tuple_without_multiindex
                if len(key) == 1:
                    key = key[0]
                    if isinstance(key, slice):
                        return self._get_values(key)
                raise

        if not isinstance(key, (list, np.ndarray, ExtensionArray, Series, Index)):
>           key = list(key)
E           TypeError: 'cftime._cftime.DatetimeNoLeap' object is not iterable

../../../pandas/pandas/core/series.py:911: TypeError

…eIndex

spencerkclark · 2020-02-15T14:51:05Z

This is a little bit trickier than I originally anticipated. For indexing with dates very distant from the dates in the index, I'm still running into an issue at this step in pandas.Index._get_nearest_indexer:

left_distances = np.abs(self[left_indexer] - target)

Consider the following the example:

In [1]: import numpy as np; import pandas as pd; import xarray as xr

In [2]: import cftime

In [3]: times = xr.cftime_range('0001', periods=4)

In [4]: da = xr.DataArray(range(4), coords=[('time', times)])

In [5]: da.sel(time=cftime.DatetimeGregorian(2000, 1, 1), method='nearest')

In this example self[left_indexer] is CFTimeIndex([0001-01-04 00:00:00], dtype='object', name='time'), while target is Index([2000-01-01 00:00:00], dtype='object'). The distance between these dates is greater than 292 years, so it cannot be represented in a TimedeltaIndex.

One could argue that we could fall back on using a generic Index of dtype object to store the datetime.timedelta object produced in this case:

In [6]: left_index = xr.CFTimeIndex([cftime.DatetimeGregorian(1, 1, 4)])

In [7]: target = pd.Index([cftime.DatetimeGregorian(2000, 1, 1)])

In [8]: difference = left_index - target

In [9]: difference
Out[9]: Index([-730118 days, 0:00:00], dtype='object')

A problem occurs though when we try to take the absolute value of this index. Pandas (I think reasonably so) tries to detect the datatype of the result and construct a new index. In doing so it tries to create a TimedeltaIndex, but cannot, because the timedelta inside is too large:

Result of np.abs(difference)

In [10]: np.abs(difference)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/Software/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.array_to_timedelta64()

TypeError: Expected unicode, got datetime.timedelta

During handling of the above exception, another exception occurred:

OverflowError                             Traceback (most recent call last)
<ipython-input-17-95776624315c> in <module>
----> 1 np.abs(difference)

~/Software/pandas/pandas/core/indexes/base.py in __array_wrap__(self, result, context)
    628
    629         attrs = self._get_attributes_dict()
--> 630         return Index(result, **attrs)
    631
    632     @cache_readonly

~/Software/pandas/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
    412
    413             if dtype is None:
--> 414                 new_data, new_dtype = _maybe_cast_data_without_dtype(subarr)
    415                 if new_dtype is not None:
    416                     return cls(

~/Software/pandas/pandas/core/indexes/base.py in _maybe_cast_data_without_dtype(subarr)
   5711
   5712         elif inferred.startswith("timedelta"):
-> 5713             data = TimedeltaArray._from_sequence(subarr, copy=False)
   5714             return data, data.dtype
   5715         elif inferred == "period":

~/Software/pandas/pandas/core/arrays/timedeltas.py in _from_sequence(cls, data, dtype, copy, freq, unit)
    212         freq, freq_infer = dtl.maybe_infer_freq(freq)
    213
--> 214         data, inferred_freq = sequence_to_td64ns(data, copy=copy, unit=unit)
    215         freq, freq_infer = dtl.validate_inferred_freq(freq, inferred_freq, freq_infer)
    216

~/Software/pandas/pandas/core/arrays/timedeltas.py in sequence_to_td64ns(data, copy, unit, errors)
    938     if is_object_dtype(data.dtype) or is_string_dtype(data.dtype):
    939         # no need to make a copy, need to convert if string-dtyped
--> 940         data = objects_to_td64ns(data, unit=unit, errors=errors)
    941         copy = False
    942

~/Software/pandas/pandas/core/arrays/timedeltas.py in objects_to_td64ns(data, unit, errors)
   1047     values = np.array(data, dtype=np.object_, copy=False)
   1048
-> 1049     result = array_to_timedelta64(values, unit=unit, errors=errors)
   1050     return result.view("timedelta64[ns]")
   1051

~/Software/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.array_to_timedelta64()

~/Software/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.convert_to_timedelta64()

~/Software/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.delta_to_nanoseconds()

OverflowError: Python int too large to convert to C long

So I'm a little bit back to the drawing board regarding a solution for this. Part of me is tempted to write an overriding version of _get_nearest_indexer for CFTimeIndex that basically restores the old behavior; I'm not sure how dangerous that would be to rely on though.

spencerkclark · 2020-02-22T14:55:00Z

I added some updates to this PR this morning that in principle would solve the indexing with method="nearest" from within xarray. Unfortunately though, due to the issue I described in #3764 (comment), I was not able to come up with a solution that did not require overriding the pandas implementation of _get_nearest_indexer. If this seems unacceptable, maybe we can think harder about how we might address this upstream instead (e.g. I think special-casing the change made in pandas-dev/pandas#31511 to just DatetimeIndexes, and preserving the old behavior for everything else, could be sufficient).

In addition, for the time being I xfailed test_indexing_in_series_getitem, as I think there is agreement that that would be best addressed upstream, #3751 (comment).

Finally in the process of doing this, I cleaned up the implementation of CFTimeIndex.__sub__, and added some more test coverage; hopefully now it's a little clearer what cases it's supposed to work for and what cases it is not.

dcherian · 2020-02-24T18:22:29Z

ping @shoyer for input.

max-sixty · 2020-02-28T03:20:35Z

Re master failing, what do we think about xfail-ing those tests on master and then adding a PR which undoes the xfail and attempts fixes the issue?

I'm cognizant that every PR now fails, which less friendly to contributors. The project home page also states .

max-sixty · 2020-02-28T03:22:01Z

Re the specific issue, that's a tough one. It could be worth floating on pandas and seeing if they have thoughts...

shoyer · 2020-02-28T05:57:21Z

+1 for marking as xfail. There’s no point in keeping the master branch failing if we know what the issue is.

…

On Thu, Feb 27, 2020 at 7:22 PM Maximilian Roos ***@***.***> wrote: Re the specific issue, that's a tough one. It could be worth floating on pandas and seeing if they have thoughts... — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3764?email_source=notifications&email_token=AAJJFVXTH5QBZHOH47UZIQDRFB7NXA5CNFSM4KTAWZ7KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENG2V6Y#issuecomment-592292603>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJJFVQKSGKMKA35V4JDV4DRFB7NXANCNFSM4KTAWZ7A> .

pep8speaks · 2020-03-09T07:36:39Z

Hello @spencerkclark! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-09 07:37:46 UTC

shoyer · 2020-03-13T06:08:34Z

These changes look good to me. It's definitely not ideal to be overriding all these details on CFTimeIndex -- this why xarray needs its own indexing API -- but for now it seems like about the best we can do.

shoyer · 2020-03-13T06:09:01Z

It's also worth noting that this note appears to be causing our doc builds to fail: https://readthedocs.org/projects/xray/builds/10604464/

shoyer · 2020-03-13T06:14:37Z

I'm going to merge this to see if it fixes our doc build...

…e-size * upstream/master: (24 commits) Fix alignment with join="override" when some dims are unindexed (pydata#3839) Fix CFTimeIndex-related errors stemming from updates in pandas (pydata#3764) Doctests fixes (pydata#3846) add xpublish to related projects (pydata#3850) update installation instruction (pydata#3849) Pint support for top-level functions (pydata#3611) un-xfail tests that append to netCDF files with scipy (pydata#3805) remove panel conversion (pydata#3845) Add nxarray to related-projects.rst (pydata#3848) Implement skipna kwarg in xr.quantile (pydata#3844) Allow `where` to receive a callable (pydata#3827) update macos image (pydata#3838) Label "Installed Versions" item in Issue template (pydata#3832) Add note on diff's n differing from pandas (pydata#3822) DOC: Add rioxarray and other external examples (pydata#3757) Use stable RTD image. removed mention that 'dims' are inferred from 'coords'-dict when omit… (pydata#3821) Coarsen keep attrs 3376 (pydata#3801) Turn on html repr by default (pydata#3812) Fix zarr append with groups (pydata#3610) ...

spencerkclark added 3 commits February 11, 2020 07:57

Allow subtraction of a generic Index of cftime.datetimes from a CFTim…

d138a57

…eIndex

black

bd92d06

Test that NotImplemented logic works

9fddded

spencerkclark added 5 commits February 22, 2020 06:49

Vendor _get_nearest_indexer and _filter_indexer_tolerance

4480922

Test OverflowError in __rsub__

08f1c3d

Fix name of pandas method in docstring

1f573a3

Merge branch 'master' into fix-3751

01c46c4

Add what's new entries

c91e625

spencerkclark changed the title ~~WIP: Fix CFTimeIndex-related errors stemming from updates in pandas~~ Fix CFTimeIndex-related errors stemming from updates in pandas Feb 22, 2020

Enable use of tolerance greater than 292 years

1496513

spencerkclark mentioned this pull request Feb 28, 2020

more upstream-dev cftime failures #3751

Closed

Merge branch 'master' into fix-3751

2424de0

newlinw

ec4e19f

dcherian requested a review from shoyer March 9, 2020 09:25

shoyer merged commit 650a981 into pydata:master Mar 13, 2020

spencerkclark deleted the fix-3751 branch March 15, 2020 14:58

spencerkclark mentioned this pull request Mar 22, 2020

Fix to _get_nearest_indexer for pydata/xarray#3751 pandas-dev/pandas#32905

Merged

4 tasks

spencerkclark mentioned this pull request Apr 8, 2020

Add workflow to run model with supplied nudging tendency ai2cm/fv3net#215

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CFTimeIndex-related errors stemming from updates in pandas #3764

Fix CFTimeIndex-related errors stemming from updates in pandas #3764

spencerkclark commented Feb 11, 2020 •

edited

Loading

spencerkclark commented Feb 15, 2020

spencerkclark commented Feb 22, 2020

dcherian commented Feb 24, 2020

max-sixty commented Feb 28, 2020

max-sixty commented Feb 28, 2020

shoyer commented Feb 28, 2020 via email

pep8speaks commented Mar 9, 2020 •

edited

Loading

shoyer commented Mar 13, 2020

shoyer commented Mar 13, 2020

shoyer commented Mar 13, 2020

Fix CFTimeIndex-related errors stemming from updates in pandas #3764

Fix CFTimeIndex-related errors stemming from updates in pandas #3764

Conversation

spencerkclark commented Feb 11, 2020 • edited Loading

spencerkclark commented Feb 15, 2020

spencerkclark commented Feb 22, 2020

dcherian commented Feb 24, 2020

max-sixty commented Feb 28, 2020

max-sixty commented Feb 28, 2020

shoyer commented Feb 28, 2020 via email

pep8speaks commented Mar 9, 2020 • edited Loading

Comment last updated at 2020-03-09 07:37:46 UTC

shoyer commented Mar 13, 2020

shoyer commented Mar 13, 2020

shoyer commented Mar 13, 2020

spencerkclark commented Feb 11, 2020 •

edited

Loading

pep8speaks commented Mar 9, 2020 •

edited

Loading