Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TYP: interval.pyi #44922

Closed
wants to merge 5 commits into from
Closed

TYP: interval.pyi #44922

wants to merge 5 commits into from

Conversation

twoertwein
Copy link
Member

@twoertwein twoertwein commented Dec 16, 2021

Currently rebased on top of #44339.

This is the second last part of @erictraut's #43744 (offsets.pyi is still missing).

@pep8speaks
Copy link

pep8speaks commented Dec 16, 2021

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 89:89: E501 line too long (89 > 88 characters)

Comment last updated at 2022-02-21 04:04:50 UTC

@twoertwein twoertwein marked this pull request as draft December 16, 2021 03:43
@twoertwein twoertwein changed the title interval.pyi TYP: interval.pyi Dec 16, 2021
@@ -200,6 +201,9 @@ class IntervalArray(IntervalMixin, ExtensionArray):
ndim = 1
can_hold_na = True
_na_value = _fill_value = np.nan
_left: np.ndarray
_right: np.ndarray
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the type of _left and _right, see mypy error for _from_sequence in this file.

@twoertwein twoertwein marked this pull request as ready for review December 16, 2021 16:39
@twoertwein twoertwein added the Typing type annotations, mypy/pyright type checking label Dec 16, 2021
@@ -83,7 +83,7 @@
PythonScalar = Union[str, int, float, bool]
DatetimeLikeScalar = Union["Period", "Timestamp", "Timedelta"]
PandasScalar = Union["Period", "Timestamp", "Timedelta", "Interval"]
Scalar = Union[PythonScalar, PandasScalar]
Scalar = Union[PythonScalar, PandasScalar, np.datetime64, np.timedelta64]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR, Interval resolved to Any. Some functions that are supposed to return a Scalar can return np.datetime64 / np.timedelta64 which was previously unnoticed.

@jreback jreback added this to the 1.4 milestone Dec 19, 2021
pandas/core/indexes/interval.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Dec 19, 2021

I liket he _ScalarT better for a name of the typevar (as compared to _S in timestamps.pyi)

@jreback
Copy link
Contributor

jreback commented Dec 19, 2021

cc @simonjayhawkins

@jreback
Copy link
Contributor

jreback commented Dec 19, 2021

cc @jbrockmendel

pandas/_libs/interval.pyi Outdated Show resolved Hide resolved
@property
def mid(self) -> float: ...
@property
def length(self) -> float: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be timedelta?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is now generic but IntervalMixin[datetime].length is annotated to return a datetime. I don't think this exception can be achieved with overloads (overlapping overloads with different return type).

@@ -482,11 +482,20 @@ def pivot(
if columns is None:
raise TypeError("pivot() missing 1 required argument: 'columns'")

columns_listlike = com.convert_to_list_like(columns)
# error: Argument 1 to "convert_to_list_like" has incompatible type "Hashable";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this suggest that the 'columns' annotation may be too loose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert_to_list_like seems to be too strict: it literally accepts anything.

@twoertwein
Copy link
Member Author

I just realized that typing interval.pyi is much messier than I expected:

  1. A TypeVar cannot be defined by a generic type, but Interval is now a generic type which depends on Scalar (includes Interval) and uses it as a TypeVar
  2. IntervalMixin should also be a Generic (the two float return types should be _ScalarT) but IntervalTree inherits also from IntervalMixin which means that the generic type of IntervalMixin is not only Scalar but also np.ndarray
  3. __sub__ and length should reflect the datetime/timedelta mechanics (datetime - datetime = timedelta)
  4. and __add__ should probably not allow datetime + datetime (but datetime + timedelta)

I think the only way to address the above is to use define a Protocol (needs to support +, -, <, ...) and then use this protocol as a bound for a TypeVar which is then used by IntervalMixin&co. The datetime/timedelta mechanics would then need to be addressed by overloads.

Marking this PR as a draft until I have a working version with a protocol.

@twoertwein twoertwein marked this pull request as draft December 21, 2021 03:17
@jreback jreback removed this from the 1.4 milestone Dec 23, 2021
@twoertwein
Copy link
Member Author

twoertwein commented Jan 8, 2022

Version 2: more aligned with the documentation (any "orderable scalar" is accepted) which also breaks the Scalar-Interval dependency.

@@ -517,7 +516,7 @@ def f(x):


def convert_to_list_like(
values: Scalar | Iterable | AnyArrayLike,
values: Hashable | Iterable | AnyArrayLike,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All scalars should also be hashable.

@twoertwein twoertwein marked this pull request as ready for review January 8, 2022 17:08
@twoertwein twoertwein mentioned this pull request Jan 12, 2022
@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

@twoertwein if you can merge master

@jreback jreback added this to the 1.5 milestone Jan 17, 2022
@jreback
Copy link
Contributor

jreback commented Jan 17, 2022

cc @simonjayhawkins if any comments

@@ -893,7 +893,7 @@ def _clear_buffer(self) -> None:

def _get_index_name(
self, columns: list[Hashable]
) -> tuple[list[Hashable] | None, list[Hashable], list[Hashable]]:
) -> tuple[Sequence[Hashable] | None, list[Hashable], list[Hashable]]:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, we would get:

pandas/io/parsers/python_parser.py:949: error: Incompatible return value type (got "Tuple[List[Union[Union[str, int, float, bool], Union[Period, Timestamp, Timedelta, Interval[Any]]git , datetime64, timedelta64]], List[Hashable], List[Hashable]]", expected "Tuple[Optional[List[Hashable]], List[Hashable], List[Hashable]]") [return-value]

@twoertwein twoertwein mentioned this pull request Feb 20, 2022
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Feb 20, 2022

I think that this is based on stuff I've put in the Microsoft stubs. I recently did a PR there that addresses a few issues. See https://github.com/microsoft/python-type-stubs/pull/167/files

One thing that I had to test was getting the type right for intervals based on int or float compared to Timestamp.

I added the following test in that PR in the MS stubs:

def test_interval_length() -> None:
    i1 = pd.Interval(pd.Timestamp("2000-01-01"), pd.Timestamp("2000-01-02"), closed="both")
    reveal_type(i1.length, expected_string="Timedelta")
    i1.length.total_seconds()

    i2 = pd.Interval(10, 20)
    reveal_type(i2.length, expected_type=int)

    i3 = pd.Interval(13.2, 19.5)
    reveal_type(i3.length, expected_type=float)

mypy doesn't support the expected_type or expected_string arguments to reveal_type, but you ought to make sure that your stubs are figuring out the above in terms of the type of length being dependent on the constructor for Interval

@twoertwein
Copy link
Member Author

Thanks @Dr-Irv It definitely makes sense to add exceptions for Pandas/important stdlib types. I think I found a way to encode this (using your workaround for overload+decorators):

# note: mypy doesn't support overloading properties
# based on github.com/microsoft/python-type-stubs/pull/167
class _LengthDescriptor:
    @overload
    def __get__(self, instance: IntervalMixin[Timestamp], owner: Any) -> Timedelta: ...
    @overload
    def __get__(self, instance: IntervalMixin[datetime], owner: Any) -> timedelta: ...
    @overload
    def __get__(
        self, instance: IntervalMixin[_OrderableT], owner: Any
    ) -> _OrderableT: ...

class IntervalMixin(Generic[_OrderableT]):
    ...
    length: _LengthDescriptor

Unfortunately, there doesn't seem to be a generic way to handle other exceptions (Interval can be used with more classes than just int/float/Timestamp).

Still need to debug this locally - getting some "has-type/misc" mypy errors and pyright doesn't pick up on it (probably an issue with TypeVar'ed Protocol in Generic classes - will create an issue on pyright).

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Feb 20, 2022

Unfortunately, there doesn't seem to be a generic way to handle other exceptions (Interval can be used with more classes than just int/float/Timestamp).

Not sure about that. See below:

pd.Interval(datetime.datetime(year=2022,month=3, day=15), datetime.datetime(year=2022,month=4, day=15))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas\_libs\interval.pyx", line 325, in pandas._libs.interval.Interval.__init__
  File "pandas\_libs\interval.pyx", line 345, in pandas._libs.interval.Interval._validate_endpoint
ValueError: Only numeric, Timestamp and Timedelta endpoints are allowed when constructing an Interval.

So I think we don't need to worry about datetime endpoints.

@twoertwein
Copy link
Member Author

In that case, 1) your approach might be much simpler and 2) the documentation needs to be updated.

def __str__(self) -> str: ...
# TODO: could return Interval with different type
def __add__(
self, y: numbers.Number | np.timedelta64 | timedelta
Copy link
Contributor

@Dr-Irv Dr-Irv Feb 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to make the operators type specific. For Interval[Timestamp], you can only add and subtract Timedelta. For the numeric ones, you can only add/subtract/multiply/divide float or int. See the latest in microsoft/python-type-stubs#167

def __sub__(
self, y: numbers.Number | np.timedelta64 | timedelta
) -> Interval[_OrderableT]: ...
def __mul__(self, y: numbers.Number) -> Interval[_OrderableT]: ...
Copy link
Contributor

@Dr-Irv Dr-Irv Feb 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiply and divide don't apply for the Timestamp intervals

def __truediv__(self, y: numbers.Number) -> Interval[_OrderableT]: ...
def __floordiv__(self, y: numbers.Number) -> Interval[_OrderableT]: ...
def __hash__(self) -> int: ...
def __contains__(self: Interval[_OrderableT], key: _OrderableT) -> bool: ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to be explicit here about the 4 different types. You can't have __contains__() support testing an integer inside a Timestamp interval. See PR microsoft/python-type-stubs#167

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two _OrderableT bind to the same type simultaneously.

i2 = pd.Interval(10, 20)
i2.__contains__(4) # ok
i2.__contains__(4.0) # error: Unsupported operand types for in ("float" and "Interval[int]")  [operator]

i3 = pd.Interval(13.2, 19.5)
i3.__contains__(4) # ok
i3.__contains__(4.0) # ok
i3.__contains__(pd.Timestamp(0)) # error: Unsupported operand types for in ("Timestamp" and "Interval[float]")  [operator]

Thank you for your feedback! I will integrate it over the next days. If you prefer, I can also leave the operators as-is and leave it to you - doesn't feel great copy&pasting your code ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that i2.__contains__(4.0) should be allowed. If you have an interval based on integers, you can test whether a float is inside.

@@ -663,7 +672,9 @@ def _get_indexer(
# homogeneous scalar index: use IntervalTree
# we should always have self._should_partial_index(target) here
target = self._maybe_convert_i8(target)
indexer = self._engine.get_indexer(target.values)
# error: Argument 1 to "get_indexer" of "IntervalTree" has incompatible type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not add typing to _maybe_convert_i8 ?

@simonjayhawkins
Copy link
Member

cc @simonjayhawkins if any comments

yep. lgtm. always happy to have PRs that add more types merged if mypy is green.

@Dr-Irv comments and pre-commit failure outstanding.

@twoertwein
Copy link
Member Author

Closing in favor of #46098

@twoertwein twoertwein closed this Feb 23, 2022
@twoertwein twoertwein deleted the interval branch March 9, 2022 02:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants