Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CFTimeIndex-related errors stemming from updates in pandas #3764

Merged
merged 11 commits into from
Mar 13, 2020

Conversation

spencerkclark
Copy link
Member

@spencerkclark spencerkclark commented Feb 11, 2020

This fixes the errors identified when #3751 was created by allowing one to subtract a pd.Index of cftime.datetime objects from a CFTimeIndex.

Some new errors have come up too (not associated with any updates I made here), which I still need to work on identifying the source of:

____________________________ test_indexing_in_series_getitem[365_day] _____________________________

series = 0001-01-01 00:00:00    1
0001-02-01 00:00:00    2
0002-01-01 00:00:00    3
0002-02-01 00:00:00    4
dtype: int64
index = CFTimeIndex([0001-01-01 00:00:00, 0001-02-01 00:00:00, 0002-01-01 00:00:00,
             0002-02-01 00:00:00],
            dtype='object')
scalar_args = [cftime.DatetimeNoLeap(0001-01-01 00:00:00)]
range_args = ['0001', slice('0001-01-01', '0001-12-30', None), slice(None, '0001-12-30', None), slice(cftime.DatetimeNoLeap(0001-01...:00), cftime.DatetimeNoLeap(0001-12-30 00:00:00), None), slice(None, cftime.DatetimeNoLeap(0001-12-30 00:00:00), None)]

    @requires_cftime
    def test_indexing_in_series_getitem(series, index, scalar_args, range_args):
        for arg in scalar_args:
>           assert series[arg] == 1

test_cftimeindex.py:597:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../pandas/pandas/core/series.py:884: in __getitem__
    return self._get_with(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 0001-01-01 00:00:00    1
0001-02-01 00:00:00    2
0002-01-01 00:00:00    3
0002-02-01 00:00:00    4
dtype: int64
key = cftime.DatetimeNoLeap(0001-01-01 00:00:00)

    def _get_with(self, key):
        # other: fancy integer or otherwise
        if isinstance(key, slice):
            # _convert_slice_indexer to determing if this slice is positional
            #  or label based, and if the latter, convert to positional
            slobj = self.index._convert_slice_indexer(key, kind="getitem")
            return self._slice(slobj)
        elif isinstance(key, ABCDataFrame):
            raise TypeError(
                "Indexing a Series with DataFrame is not "
                "supported, use the appropriate DataFrame column"
            )
        elif isinstance(key, tuple):
            try:
                return self._get_values_tuple(key)
            except ValueError:
                # if we don't have a MultiIndex, we may still be able to handle
                #  a 1-tuple.  see test_1tuple_without_multiindex
                if len(key) == 1:
                    key = key[0]
                    if isinstance(key, slice):
                        return self._get_values(key)
                raise

        if not isinstance(key, (list, np.ndarray, ExtensionArray, Series, Index)):
>           key = list(key)
E           TypeError: 'cftime._cftime.DatetimeNoLeap' object is not iterable

../../../pandas/pandas/core/series.py:911: TypeError

@spencerkclark
Copy link
Member Author

This is a little bit trickier than I originally anticipated. For indexing with dates very distant from the dates in the index, I'm still running into an issue at this step in pandas.Index._get_nearest_indexer:

left_distances = np.abs(self[left_indexer] - target)

Consider the following the example:

In [1]: import numpy as np; import pandas as pd; import xarray as xr

In [2]: import cftime

In [3]: times = xr.cftime_range('0001', periods=4)

In [4]: da = xr.DataArray(range(4), coords=[('time', times)])

In [5]: da.sel(time=cftime.DatetimeGregorian(2000, 1, 1), method='nearest')

In this example self[left_indexer] is CFTimeIndex([0001-01-04 00:00:00], dtype='object', name='time'), while target is Index([2000-01-01 00:00:00], dtype='object'). The distance between these dates is greater than 292 years, so it cannot be represented in a TimedeltaIndex.

One could argue that we could fall back on using a generic Index of dtype object to store the datetime.timedelta object produced in this case:

In [6]: left_index = xr.CFTimeIndex([cftime.DatetimeGregorian(1, 1, 4)])

In [7]: target = pd.Index([cftime.DatetimeGregorian(2000, 1, 1)])

In [8]: difference = left_index - target

In [9]: difference
Out[9]: Index([-730118 days, 0:00:00], dtype='object')

A problem occurs though when we try to take the absolute value of this index. Pandas (I think reasonably so) tries to detect the datatype of the result and construct a new index. In doing so it tries to create a TimedeltaIndex, but cannot, because the timedelta inside is too large:

Result of np.abs(difference)

In [10]: np.abs(difference)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/Software/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.array_to_timedelta64()

TypeError: Expected unicode, got datetime.timedelta

During handling of the above exception, another exception occurred:

OverflowError                             Traceback (most recent call last)
<ipython-input-17-95776624315c> in <module>
----> 1 np.abs(difference)

~/Software/pandas/pandas/core/indexes/base.py in __array_wrap__(self, result, context)
    628
    629         attrs = self._get_attributes_dict()
--> 630         return Index(result, **attrs)
    631
    632     @cache_readonly

~/Software/pandas/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
    412
    413             if dtype is None:
--> 414                 new_data, new_dtype = _maybe_cast_data_without_dtype(subarr)
    415                 if new_dtype is not None:
    416                     return cls(

~/Software/pandas/pandas/core/indexes/base.py in _maybe_cast_data_without_dtype(subarr)
   5711
   5712         elif inferred.startswith("timedelta"):
-> 5713             data = TimedeltaArray._from_sequence(subarr, copy=False)
   5714             return data, data.dtype
   5715         elif inferred == "period":

~/Software/pandas/pandas/core/arrays/timedeltas.py in _from_sequence(cls, data, dtype, copy, freq, unit)
    212         freq, freq_infer = dtl.maybe_infer_freq(freq)
    213
--> 214         data, inferred_freq = sequence_to_td64ns(data, copy=copy, unit=unit)
    215         freq, freq_infer = dtl.validate_inferred_freq(freq, inferred_freq, freq_infer)
    216

~/Software/pandas/pandas/core/arrays/timedeltas.py in sequence_to_td64ns(data, copy, unit, errors)
    938     if is_object_dtype(data.dtype) or is_string_dtype(data.dtype):
    939         # no need to make a copy, need to convert if string-dtyped
--> 940         data = objects_to_td64ns(data, unit=unit, errors=errors)
    941         copy = False
    942

~/Software/pandas/pandas/core/arrays/timedeltas.py in objects_to_td64ns(data, unit, errors)
   1047     values = np.array(data, dtype=np.object_, copy=False)
   1048
-> 1049     result = array_to_timedelta64(values, unit=unit, errors=errors)
   1050     return result.view("timedelta64[ns]")
   1051

~/Software/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.array_to_timedelta64()

~/Software/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.convert_to_timedelta64()

~/Software/pandas/pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.delta_to_nanoseconds()

OverflowError: Python int too large to convert to C long

So I'm a little bit back to the drawing board regarding a solution for this. Part of me is tempted to write an overriding version of _get_nearest_indexer for CFTimeIndex that basically restores the old behavior; I'm not sure how dangerous that would be to rely on though.

@spencerkclark spencerkclark changed the title WIP: Fix CFTimeIndex-related errors stemming from updates in pandas Fix CFTimeIndex-related errors stemming from updates in pandas Feb 22, 2020
@spencerkclark
Copy link
Member Author

I added some updates to this PR this morning that in principle would solve the indexing with method="nearest" from within xarray. Unfortunately though, due to the issue I described in #3764 (comment), I was not able to come up with a solution that did not require overriding the pandas implementation of _get_nearest_indexer. If this seems unacceptable, maybe we can think harder about how we might address this upstream instead (e.g. I think special-casing the change made in pandas-dev/pandas#31511 to just DatetimeIndexes, and preserving the old behavior for everything else, could be sufficient).

In addition, for the time being I xfailed test_indexing_in_series_getitem, as I think there is agreement that that would be best addressed upstream, #3751 (comment).

Finally in the process of doing this, I cleaned up the implementation of CFTimeIndex.__sub__, and added some more test coverage; hopefully now it's a little clearer what cases it's supposed to work for and what cases it is not.

@dcherian
Copy link
Contributor

ping @shoyer for input.

@max-sixty
Copy link
Collaborator

Re master failing, what do we think about xfail-ing those tests on master and then adding a PR which undoes the xfail and attempts fixes the issue?

I'm cognizant that every PR now fails, which less friendly to contributors. The project home page also states image.

@max-sixty
Copy link
Collaborator

Re the specific issue, that's a tough one. It could be worth floating on pandas and seeing if they have thoughts...

@shoyer
Copy link
Member

shoyer commented Feb 28, 2020 via email

@pep8speaks
Copy link

pep8speaks commented Mar 9, 2020

Hello @spencerkclark! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-09 07:37:46 UTC

@dcherian dcherian requested a review from shoyer March 9, 2020 09:25
@shoyer
Copy link
Member

shoyer commented Mar 13, 2020

These changes look good to me. It's definitely not ideal to be overriding all these details on CFTimeIndex -- this why xarray needs its own indexing API -- but for now it seems like about the best we can do.

@shoyer
Copy link
Member

shoyer commented Mar 13, 2020

It's also worth noting that this note appears to be causing our doc builds to fail: https://readthedocs.org/projects/xray/builds/10604464/

@shoyer
Copy link
Member

shoyer commented Mar 13, 2020

I'm going to merge this to see if it fixes our doc build...

@shoyer shoyer merged commit 650a981 into pydata:master Mar 13, 2020
dcherian added a commit to ej81/xarray that referenced this pull request Mar 14, 2020
…e-size

* upstream/master: (24 commits)
  Fix alignment with join="override" when some dims are unindexed (pydata#3839)
  Fix CFTimeIndex-related errors stemming from updates in pandas (pydata#3764)
  Doctests fixes (pydata#3846)
  add xpublish to related projects (pydata#3850)
  update installation instruction (pydata#3849)
  Pint support for top-level functions (pydata#3611)
  un-xfail tests that append to netCDF files with scipy (pydata#3805)
  remove panel conversion (pydata#3845)
  Add nxarray to related-projects.rst (pydata#3848)
  Implement skipna kwarg in xr.quantile (pydata#3844)
  Allow `where` to receive a callable (pydata#3827)
  update macos image (pydata#3838)
  Label "Installed Versions" item in Issue template (pydata#3832)
  Add note on diff's n differing from pandas (pydata#3822)
  DOC: Add rioxarray and other external examples (pydata#3757)
  Use stable RTD image.
  removed mention that 'dims' are inferred from 'coords'-dict when omit… (pydata#3821)
  Coarsen keep attrs 3376 (pydata#3801)
  Turn on html repr by default (pydata#3812)
  Fix zarr append with groups (pydata#3610)
  ...
@spencerkclark spencerkclark deleted the fix-3751 branch March 15, 2020 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

more upstream-dev cftime failures
5 participants